Priors on exchangeable directed graphs

Directed graphs occur throughout statistical modeling of networks, and exchangeability is a natural assumption when the ordering of vertices does not matter. There is a deep structural theory for exchangeable undirected graphs, which extends to the directed case via measurable objects known as digraphons. Using digraphons, we first show how to construct models for exchangeable directed graphs, including special cases such as tournaments, linear orderings, directed acyclic graphs, and partial orderings. We then show how to construct priors on digraphons via the infinite relational digraphon model (di-IRM), a new Bayesian nonparametric block model for exchangeable directed graphs, and demonstrate inference on synthetic data.

Many such models assume exchangeability, i.e., that the joint distribution of the edges is invariant under permutations of the vertices. Undirected exchangeable graphs have been extensively studied. The foundational Aldous-Hoover theorem (Aldous, 1981; Hoover, 1979) characterizes undirected exchangeable graphs in terms of certain measurable functions. Our perspective in this paper is closer to the equivalent characterization in terms of graphons due to Lovász and Szegedy (2006). A graphon is a symmetric, measurable function $W : [0, 1]^2 \to [0, 1]$. Given a graphon $W$, there is an associated countably infinite exchangeable graph $G(\mathbb{N}, W)$ with random adjacency matrix $(G_{ij})_{i,j \in \mathbb{N}}$ defined as follows (see Figure 1):
$$U_i \overset{\text{iid}}{\sim} \mathrm{Uniform}[0, 1] \ \text{for } i \in \mathbb{N}, \qquad G_{ij} \mid U_i, U_j \overset{\text{ind}}{\sim} \mathrm{Bernoulli}\big(W(U_i, U_j)\big) \ \text{for } i < j, \tag{1}$$
and set $G_{ji} = G_{ij}$ for $i < j$, and $G_{ii} = 0$. Every exchangeable undirected graph can be written as a mixture of such sampling procedures. For $n \in \mathbb{N}$, we write $G(n, W)$ to denote the finite random undirected graph on underlying set $\{1, \ldots, n\}$ induced by this sampling procedure. For more details on graphons and exchangeable graphs, see the survey by Diaconis and Janson (2008) and the book by Lovász (2012).
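The graphon sampling procedure can be sketched in a few lines of Python. This is a minimal illustration only, assuming the standard two-step scheme (uniform latent variables, then independent Bernoulli edges for i < j); the function names and the example graphon are hypothetical and are not taken from the paper.

```python
import random

def sample_graphon_graph(n, W, rng=None):
    """Sample (U, G), where G is the adjacency matrix of G(n, W) for a symmetric graphon W."""
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]      # U_i ~ Uniform[0, 1], iid
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # G_ij | U_i, U_j ~ Bernoulli(W(U_i, U_j)) for i < j
            G[i][j] = 1 if rng.random() < W(U[i], U[j]) else 0
            G[j][i] = G[i][j]                 # undirected: copy to G_ji; G_ii stays 0
    return U, G

def example_graphon(x, y):
    # Hypothetical symmetric graphon, chosen only for illustration.
    return (x + y) / 2
```

Sorting the rows and columns of G by increasing U_i gives the kind of rearranged "pixel picture" described for Figure 1.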
Most work involving priors on exchangeable graphs has focused on undirected graphs; for various extensions, see the end of Section 5. For directed graphs, much of the work has extended the undirected case by using a single asymmetric measurable function $W_{\mathrm{asym}} : [0, 1]^2 \to [0, 1]$ to model the directed graph (see Orbanz and Roy (2015, §4) for a survey of such models). While such an asymmetric function is appropriate for exchangeable bipartite graphs (Diaconis and Janson, 2008), this representation cannot express all exchangeable directed graph models (see Section 3.1). Exchangeable directed graphs are also characterized by a sampling procedure given by the Aldous-Hoover theorem. As with the undirected case, we will work with an equivalent formulation in terms of measurable objects known as digraphons (Diaconis and Janson, 2008); see also Offner (2009), Aroskar (2012), and Aroskar and Cummings (2014). The Aldous-Hoover theorem implies that exchangeable directed graphs are determined by specifying a distribution on digraphons. Indeed, a digraphon is a more complicated representation for exchangeable directed graphs than a single asymmetric measurable function; a digraphon describes the possible directed edges between each pair of vertices jointly, rather than independently. We define digraphons in Section 2; for related work, see Section 5.
Fig 1: (b-d) Top: samples from the finite random graphs $G(50, W)$, $G(100, W)$, and $G(500, W)$, shown as "pixel pictures" of the adjacency matrix, where black corresponds to 1 and white to 0; bottom: the samples resorted by increasing order of the sampled uniform random variables $U_i$.

Contributions
This paper presents two main contributions. We first show how digraphons can be used to model directed graphs, highlighting special cases that make use of dependence in the edge directions. In particular, we characterize the form of digraphons that produce tournaments, linear orderings, directed acyclic graphs, and partial orderings (Section 3). We briefly discuss how these formulations can be used to produce estimators for directed graph models (Section 3.3).
Next, we give an explicit example of a prior on digraphons: we present the infinite relational digraphon model (di-IRM), a Bayesian nonparametric block model for exchangeable directed graphs, which uses a Dirichlet process stick-breaking prior to partition the unit interval and Dirichlet-distributed weights for each pair of classes in the partition (Section 4). We derive a collapsed Gibbs sampling inference procedure (Section 6), and demonstrate applications of inference on synthetic data (Section 7), showing some limitations of using the infinite relational model with an asymmetric measurable function to model edge directions independently.

Background
We begin by defining notation and providing relevant background on directed exchangeable graphs. Our presentation largely follows Diaconis and Janson (2008).

Notation
Let [n] := {1, . . . , n}. For a directed graph (or digraph) G whose vertex set V is [n] or N, we write (G ij ) i,j∈V for its adjacency matrix, i.e., G ij = 1 if there is an edge from vertex i to vertex j, and 0 otherwise. We will omit mention of the set V when it is clear. In general, for a directed graph, (G ij ) may be asymmetric, and we allow self-loops, which correspond to values G ii = 1 on the diagonal. The adjacency matrix of an undirected graph (without self-loops) is a symmetric array (G ij ) satisfying G ii = 0 for all i.
We write $X \overset{d}{=} Y$ to denote that the random variables $X$ and $Y$ are equal in distribution.

Exchangeability for directed graphs
A random (infinite) directed graph $G$ on $\mathbb{N}$ is exchangeable if its joint distribution is invariant under all permutations $\pi$ of the vertices:
$$(G_{ij})_{i,j \in \mathbb{N}} \overset{d}{=} (G_{\pi(i)\pi(j)})_{i,j \in \mathbb{N}}. \tag{2}$$
By the Kolmogorov extension theorem, it is equivalent to ask for this to hold only for those permutations $\pi$ that move a finite number of elements of $\mathbb{N}$. Such an array $(G_{ij})$ is sometimes called jointly exchangeable. The case where the distribution is preserved under permutation of each index separately, i.e., where $(G_{ij}) \overset{d}{=} (G_{\pi(i)\sigma(j)})$ for arbitrary permutations $\pi$ and $\sigma$, is called separately exchangeable, and arises for adjacency matrices of bipartite graphs.

Digraphons
As described by Diaconis and Janson (2008), using the Aldous-Hoover theorem one may show that every exchangeable countably infinite directed graph is expressible as a mixture of G(N, W) with respect to some distribution on digraphons W.
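We now define digraphons; in Section 2.4 we will describe the sampling procedure that yields $G(\mathbb{N}, W)$. A digraphon is a 5-tuple $W := (W_{00}, W_{01}, W_{10}, W_{11}, w)$, where $W_{ab} : [0, 1]^2 \to [0, 1]$, for $a, b \in \{0, 1\}$, and $w : [0, 1] \to \{0, 1\}$ are measurable functions satisfying, for all $x, y \in [0, 1]$,
$$W_{00}(x, y) = W_{00}(y, x), \qquad W_{01}(x, y) = W_{10}(y, x), \qquad W_{11}(x, y) = W_{11}(y, x), \tag{3}$$
together with $W_{00} + W_{01} + W_{10} + W_{11} = 1$ pointwise. We write $W_4 := (W_{00}, W_{01}, W_{10}, W_{11})$. The functions $W_{ab}$ represent the joint probability of $G_{ij} = a$ and $G_{ji} = b$ for $a, b \in \{0, 1\}$, i.e.,
$$\Pr(G_{ij} = a, \ G_{ji} = b \mid U_i, U_j) = W_{ab}(U_i, U_j), \tag{4}$$
conditioned on $U_i$ and $U_j$. In this way, $W_{00}$ determines the probability of having neither edge direction between vertices $i$ and $j$, $W_{01}$ of having only a single edge from $j$ to $i$ ("right-to-left"), $W_{10}$ of a single edge from $i$ to $j$ ("left-to-right"), and $W_{11}$ of directed edges in both directions between $i$ and $j$. The function $w$ determines the value of $G_{ii}$; because it is $\{0, 1\}$-valued, this merely states whether or not $i$ has a self-loop. (There is an equivalent alternative set of objects that may be used to specify an exchangeable digraph, where $W_{00}$, $W_{01}$, $W_{10}$, $W_{11}$ are as before and $p \in [0, 1]$ gives the marginal probability of a self-loop, which is independent of the other edges; see Diaconis and Janson (2008) for details.)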

Sampling from a digraphon
The adjacency matrix $(G_{ij})_{i,j \in \mathbb{N}}$ of a countably infinite random graph $G(\mathbb{N}, W)$ is determined by the following sampling procedure:
1. Draw $U_i \overset{\text{iid}}{\sim} \mathrm{Uniform}[0, 1]$ for $i \in \mathbb{N}$.
2. For each pair of distinct vertices $i, j$, assign the edge values for $G_{ij}$ and $G_{ji}$ according to an independent $\mathrm{Categorical}(W_4(U_i, U_j))$ such that Equation (4) holds.
3. Assign self-loops $G_{ii} = w(U_i)$ for all $i$.
In other words, in step 2 we assign
$$(G_{ij}, G_{ji}) \mid U_i, U_j \overset{\text{ind}}{\sim} \mathrm{Categorical}\big(W_4(U_i, U_j)\big),$$
where we interpret the categorical random variable as a distribution over the choices (0, 0), (0, 1), (1, 0), (1, 1), in that order. Note that step 2 is well-defined by the symmetry condition in Equation (3). Figure 2 illustrates this sampling procedure via a schematic.
An analogous sampling procedure yields finite random digraphs: Given n ∈ N, in step 1, instead sample only U i for i ∈ [n]. Then determine G ij for i, j ∈ [n] as before. We write G(n, W) to denote the random digraph thereby induced on [n].
Fig 2: Schematic illustrating digraphon sampling procedure for W = (W 00 , W 01 , W 10 , W 11 , w). The x-axis is vertical and y-axis horizontal, with (0, 0) in the upper left, so that the notation W ab (x, y) coheres with the usual (row, column) convention for matrix indexing.
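The three-step sampling procedure for the finite digraph G(n, W) can be sketched as follows. This is a minimal illustration, assuming the digraphon is supplied as a function `W4(x, y)` returning the four weights in the order (0, 0), (0, 1), (1, 0), (1, 1) and a {0, 1}-valued function `w`; these names, and the example digraphon, are hypothetical.

```python
import random

# Order of the joint outcomes (G_ij, G_ji) used by the categorical draw.
PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def sample_digraph(n, W4, w, rng=None):
    """Sample (U, G) with G the adjacency matrix of G(n, W).

    W4(x, y) returns (W00, W01, W10, W11) at (x, y); w(x) returns 0 or 1.
    """
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]          # step 1: U_i ~ Uniform[0, 1]
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        G[i][i] = w(U[i])                         # step 3: self-loops
        for j in range(i + 1, n):                 # step 2: one joint draw per pair
            G[i][j], G[j][i] = rng.choices(PAIRS, weights=W4(U[i], U[j]))[0]
    return U, G

def example_W4(x, y):
    # Hypothetical digraphon for illustration: single edges always point from
    # the smaller latent value to the larger one; W11 is a small constant.
    w11 = 0.1
    w10 = 0.8 if x < y else 0.0
    w01 = 0.8 if y < x else 0.0
    return (1.0 - w01 - w10 - w11, w01, w10, w11)
```

Drawing each unordered pair once is enough because the symmetry condition makes the draw at (U_i, U_j) and the draw at (U_j, U_i) describe the same joint distribution.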

Aldous-Hoover theorem for directed graphs
Diaconis and Janson (2008) derived the following corollary of the Aldous-Hoover theorem for directed graphs.
Theorem 2.2 (Diaconis-Janson). Every exchangeable random countably infinite directed graph is obtained as a mixture of G(N, W); in other words, as G(N, W) for some random digraphon W.
Therefore the problem of specifying the distribution of an infinite exchangeable digraph may be equivalently viewed as the problem of specifying a distribution on digraphons.

Digraphons and statistical modeling
We first motivate the use of digraphons instead of asymmetric measurable functions for modeling exchangeable directed graphs. We then discuss the representations via digraphons for several random structures which are special cases of directed graphs. Finally, we discuss how to estimate digraphons, in the context of both Bayesian and frequentist estimation.

Modeling limitations of asymmetric measurable functions
Asymmetric measurable functions $W_{\mathrm{asym}} : [0, 1]^2 \to [0, 1]$ characterize exchangeable bipartite graphs by the Aldous-Hoover theorem for separately exchangeable arrays; for details see Diaconis and Janson (2008, §8). These functions can also be used to generate and model directed graphs (without self-loops) by considering the edge directions $G_{ij}$ and $G_{ji}$ independently, i.e., $\Pr(G_{ij} = 1) = W_{\mathrm{asym}}(U_i, U_j)$ for all $i \neq j$, conditioned on $U_i$ and $U_j$, according to the following sampling procedure:
$$U_i \overset{\text{iid}}{\sim} \mathrm{Uniform}[0, 1] \ \text{for } i \in \mathbb{N}, \qquad G_{ij} \mid U_i, U_j \overset{\text{ind}}{\sim} \mathrm{Bernoulli}\big(W_{\mathrm{asym}}(U_i, U_j)\big) \ \text{for } i \neq j,$$
and $G_{ii} = 0$ for $i \in \mathbb{N}$. Currently priors on these asymmetric functions are popular in Bayesian modeling of directed graphs, as we note in Section 5.
Asymmetric measurable functions are also equivalent to the following special case of the digraphon representation. Via the above sampling procedure, every asymmetric measurable function $W_{\mathrm{asym}}$ yields the same distribution on directed graphs as the digraphon $W = (W_{00}, W_{01}, W_{10}, W_{11}, w)$ given pointwise by
$$W_{00}(x, y) = (1 - p)(1 - q), \quad W_{01}(x, y) = (1 - p)q, \quad W_{10}(x, y) = p(1 - q), \quad W_{11}(x, y) = pq, \quad w = 0,$$
where $p := W_{\mathrm{asym}}(x, y)$ and $q := W_{\mathrm{asym}}(y, x)$. In particular, conditioned on $x = U_i$ and $y = U_j$, the edge from $i$ to $j$ and the edge from $j$ to $i$ are independent, with marginal probabilities $p(1 - q) + pq = p$ and $(1 - p)q + pq = q$, respectively.
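The correspondence between an asymmetric function and its product-form digraphon can be checked numerically. The sketch below assumes the product form W00 = (1 − p)(1 − q), W01 = (1 − p)q, W10 = p(1 − q), W11 = pq with p = W_asym(x, y) and q = W_asym(y, x); the helper names and the example function are hypothetical.

```python
def digraphon_from_asym(W_asym):
    """Return W4(x, y) = (W00, W01, W10, W11) induced by independent edge directions."""
    def W4(x, y):
        p = W_asym(x, y)   # probability of an edge from i to j
        q = W_asym(y, x)   # probability of an edge from j to i
        return ((1 - p) * (1 - q), (1 - p) * q, p * (1 - q), p * q)
    return W4

def example_W_asym(x, y):
    # Hypothetical asymmetric function with values in [0.05, 0.95].
    return 0.9 * x * (1 - y) + 0.05
```

By construction the joint weight W11 always factors as the product of the two marginals, which is exactly the independence that digraphons for undirected graphs and tournaments violate.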
On the other hand, many common kinds of digraphs are not obtainable from a single asymmetric function. Consider the following two classes: 1. Undirected graphs: between any two vertices i and j, there are either no edges (G ij = G ji = 0), or edges in both directions (G ij = G ji = 1). 2. Tournaments: between any two vertices i and j, there is exactly one directed edge, i.e., G ij = 1 or G ji = 1 but not both.
For digraphs of either of these two sorts, the directions are correlated, and hence not obtainable from the above sampling procedure for an asymmetric measurable function, as this procedure generates G ij and G ji independently. This demonstrates how the use of an asymmetric measurable function is poorly suited for graphs with correlated edge directions. Though constructing a model for general digraphs using the function W asym leads to misspecification, one might hope to perform inference nevertheless; however, as we show in Section 7.2, doing so may fail to discern structure that may be discovered through posterior inference with respect to a prior on digraphons.
In contrast to the use of asymmetric measurable functions, where one considers edge directions independently, with digraphons one considers the edge directions between vertex i and vertex j jointly, as in Equation (4). Thus, digraphons give a more general and flexible representation for modeling digraphs.

Special cases
We discuss several special cases of directed graphs and specify the form of their digraphon representations.

Undirected graphs
Undirected graphs can be viewed as directed graphs with no self-loops, where each pair of distinct vertices either has edges in both directions or in neither. Hence a digraphon that yields an undirected graph is one having no probability in the single edge directions, i.e., such that $W_{01} = W_{10} = 0$ (or equivalently, $W_{00} + W_{11} = 1$) and $w = 0$. Such a digraphon is therefore determined by merely specifying the graphon $W_{11}$, where $W_{00} = 1 - W_{11}$ is implicit.

Tournaments
A tournament is a directed graph without self-loops, where for each pair of distinct vertices, there is an edge in exactly one direction. In other words, a tournament has $G_{ij} = 1$ if and only if $G_{ji} = 0$ for $i \neq j$, and $G_{ii} = 0$. Therefore a digraphon yielding a tournament is one satisfying $w = 0$ and $W_{01} + W_{10} = 1$ (or equivalently, $W_{00} = W_{11} = 0$).
Example 3.2. An example of a tournament digraphon is displayed in Figure 4; it is given by $W_{01} = W_{10} = 1/2$ and $W_{00} = W_{11} = w = 0$. The random tournament induced by sampling from this digraphon is almost surely isomorphic to a countable structure known as the generic tournament.
(For more details on this example, see Chung and Graham (1991) and Diaconis and Janson (2008, Example 9.2).) As discussed in Section 3.2.1, exchangeable undirected graphs can be specified in terms of single functions $W_{11}$ (graphons) and their associated sampling procedure (described in Equation 1). Similarly, tournaments also have a single-function representation and associated sampling procedure. Namely, a tournament digraphon is determined by a measurable function $W_T : [0, 1]^2 \to [0, 1]$ that is anti-symmetric in the sense that $W_T(x, y) = 1 - W_T(y, x)$: draw $U_i \overset{\text{iid}}{\sim} \mathrm{Uniform}[0, 1]$, draw $G_{ij} \mid U_i, U_j \overset{\text{ind}}{\sim} \mathrm{Bernoulli}(W_T(U_i, U_j))$ for $i < j$, and then set $G_{ji} = 1 - G_{ij}$ (and $G_{ii} = 0$). The digraphon in Example 3.2 corresponds to the anti-symmetric, measurable function $W_T(x, y) = 1/2$. Tournament digraphons have recently been studied in detail by Thörnblad (2016), which calls the single function $W_T$ a tournament kernel.
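The single-function sampling procedure for tournaments can be sketched as below. This is only an illustration, assuming an anti-symmetric kernel passed in as a Python function; `sample_tournament` and `generic_kernel` are hypothetical names, with the constant kernel 1/2 taken from Example 3.2.

```python
import random

def sample_tournament(n, W_T, rng=None):
    """Sample (U, G) from a tournament kernel W_T with W_T(x, y) = 1 - W_T(y, x)."""
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # One Bernoulli draw per pair fixes both directions.
            G[i][j] = 1 if rng.random() < W_T(U[i], U[j]) else 0
            G[j][i] = 1 - G[i][j]
    return U, G

def generic_kernel(x, y):
    # Constant kernel from Example 3.2: each direction is equally likely.
    return 0.5
```

Every sample has exactly one edge between each pair of distinct vertices and no self-loops, so it is a tournament regardless of the kernel chosen.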
Statistical models for tournaments appear in the ranking theory literature, often using a variant of the Bradley-Terry model (Bradley and Terry, 1952), first described by Zermelo (1929). For more details, including the relation to graphons, see Chatterjee (2015, §2.7). This literature, and related estimation papers such as Chatterjee and Mukherjee (2016), is also often framed in terms of a single-function representation.

Linearly ordered sets
A digraph is a (strict) linear ordering when the directed edge relation is transitive, and every pair of distinct vertices has an edge in exactly one direction. Consider the digraphon given by $W_{00} = W_{11} = w = 0$ and $W_{01} = 1 - W_{10}$, where
$$W_{10}(x, y) = \begin{cases} 1 & \text{if } x < y, \\ 0 & \text{otherwise.} \end{cases}$$
The countable directed graph induced by sampling from this digraphon is almost surely a linear order. In fact, this is essentially the only such example: by Glasner and Weiss (2002, §8), its distribution is the same as that of every exchangeable linear ordering. (In other words, any digraphon yielding the (unique) exchangeable linear ordering is weakly isomorphic to this one; see Section 7 for details.) Furthermore, the countable linear ordering obtained from sampling this digraphon is almost surely dense and without endpoints, and hence isomorphic to the rationals. A finite sample with $n$ vertices has distribution equal to the uniform measure on all $n!$ ways of linearly ordering $\{1, \ldots, n\}$. This digraphon is displayed in Figure 5 alongside a 20-vertex random sample, rearranged by increasing $U_i$; note that for almost every sample, the corresponding rearranged graph will have all edges strictly above the diagonal.
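As a small check of this construction, the sketch below samples from the linear ordering digraphon, assuming W10(x, y) = 1 exactly when x < y, and verifies that the result is a strict linear order. Because all weights are 0 or 1, the edges are a deterministic function of the latent uniforms; the function names are hypothetical.

```python
import random

def sample_linear_order(n, rng=None):
    """Sample (U, G) from the linear ordering digraphon (W10(x, y) = 1 iff x < y)."""
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]
    # (G_ij, G_ji) = (1, 0) exactly when U_i < U_j, so G_ij = 1 iff U_i < U_j.
    G = [[1 if U[i] < U[j] else 0 for j in range(n)] for i in range(n)]
    return U, G

def is_strict_linear_order(G):
    """Check irreflexivity, exactly one edge per distinct pair, and transitivity."""
    n = len(G)
    irreflexive = all(G[i][i] == 0 for i in range(n))
    total = all(G[i][j] + G[j][i] == 1
                for i in range(n) for j in range(n) if i != j)
    transitive = all(G[i][k]
                     for i in range(n) for j in range(n) for k in range(n)
                     if G[i][j] and G[j][k])
    return irreflexive and total and transitive
```

Ties among the U_i have probability zero, which is why the sampled relation is almost surely total.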

Directed acyclic graphs
A directed acyclic graph (DAG) is a directed graph having no directed path from any vertex to itself. Various work has focused on models for DAGs (e.g.,  (2004)), and especially their use in describing random instances of directed graphical models (also known as Bayesian networks). DAGs also arise naturally as networks describing non-circular dependencies (e.g., among software packages), and in other key data structures.
One can show, using the main result of Hladký et al. (2015) (which we describe in Section 3.2.5), that any exchangeable DAG can be obtained from sampling a digraphon satisfying W 10 (x, y) = 0 for x ≥ y and W 11 = w = 0. Note that this constrains the digraphon to have the same zero-valued regions as those in the canonical presentation of a linear ordering digraphon (as described above and displayed in Figure 5), except that W 00 may be arbitrary. (Equivalently, for x < y, the value W 10 (x, y) may be chosen arbitrarily, so that the remaining terms are given by W 01 (x, y) = W 10 (y, x) and W 00 = 1 − W 01 − W 10 .) A digraphon of this form thereby specifies one way in which the exchangeable DAG can be topologically ordered (i.e., extended to some exchangeable linear ordering).
Specifying a digraphon in this way always yields a DAG upon sampling, as the standard linear ordering on [0, 1] does not admit directed cycles, and one can show that all exchangeable DAGs arise in this way, as mentioned above.
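The claim that such digraphons always yield DAGs can be illustrated with a short sketch. It assumes W11 = w = 0 and takes the values of W10 on the region x < y as a user-supplied function `W10_upper` (a hypothetical name); the acyclicity check uses Kahn's topological sorting algorithm rather than anything specific to the paper.

```python
import random

def sample_dag(n, W10_upper, rng=None):
    """Sample (U, G) from a DAG digraphon with W10(x, y) = W10_upper(x, y) for x < y."""
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each unordered pair is visited once with U_i < U_j; the only
            # possible outcomes are (1, 0), i.e. an edge i -> j, or (0, 0).
            if U[i] < U[j] and rng.random() < W10_upper(U[i], U[j]):
                G[i][j] = 1
    return U, G

def is_acyclic(G):
    """Kahn's algorithm: a digraph is acyclic iff every vertex can be removed in turn."""
    n = len(G)
    indeg = [sum(G[i][j] for i in range(n)) for j in range(n)]
    ready = [v for v in range(n) if indeg[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for u in range(n):
            if G[v][u]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    return removed == n
```

Sorting the vertices by increasing U_i gives a topological order of every sample, matching the statement that the digraphon specifies a linearization.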
Example 3.3. An example of a digraphon that yields exchangeable DAGs is the generic DAG digraphon given by . This example is displayed in Figure 6. We can see that the reordered sample is indeed a DAG, as the edges clearly all lie above the diagonal in the adjacency matrix.

Partially ordered sets
A partially ordered set, or poset, is a set with a binary relation $\leq$ that is reflexive, antisymmetric, and transitive. A poset can be viewed as a digraph having a directed edge from $a$ to $b$ if and only if $a \leq b$. Note that the transitive closure of any DAG is a poset, i.e., if in a DAG, there is a directed path from $a$ to $b$, the transitive closure has an edge from $a$ to $b$, thereby producing a partial ordering.
(One can similarly define the "transitive closure digraphon" of a digraphon that yields DAGs to obtain a digraphon yielding the corresponding transitive closures). Conversely, any poset (with self-loops removed) is already a DAG. Therefore exchangeable posets are obtainable by some digraphon of the form described in Section 3.2.4 (except with w = 1), though not all such digraphons yield posets. Analogously, representing an exchangeable poset via a digraphon of this form amounts to specifying a linearization of the poset. Janson (2011) develops a theory of poset limits (or posetons) and their relation to exchangeable posets. By Hladký et al. (2015), any exchangeable poset is given by some digraphon W for which W 10 (x, y) > 0 implies that x < y, i.e., W is compatible with the standard linear ordering on [0, 1].
Example 3.4. Consider the following example of a digraphon that yields an exchangeable poset, specified by the following blockmodel:
$$W_{10}(x, y) = \begin{cases} 1/2 & \text{if } x < 1/4 \text{ and } 1/4 \leq y < 3/4, \\ 1/2 & \text{if } 1/4 \leq x < 3/4 \text{ and } y \geq 3/4, \\ 1 & \text{if } x < 1/4 \text{ and } y \geq 3/4, \\ 0 & \text{otherwise,} \end{cases}$$
where $W_{11} = 0$, where $W_{01}$ is such that $W_{01}(x, y) = W_{10}(y, x)$, and where $W_{00} = 1 - W_{01} - W_{10}$. This example is displayed in Figure 7. In particular, the block structure of the model is reflected in the rearranged sample on the right. We can see that this is an exchangeable poset: if the loops (the diagonal) are removed from this digraph, it is a DAG (as all the edges in the rearranged sample are above the diagonal), and one can check that it is transitively closed. This is a key example among posets. Work of Kleitman and Rothschild (1975) and Compton (1988), characterizing the combinatorial structure of a typical large finite poset, implies that the sequence of uniform distributions on labeled posets of size $n$ converges (in the sense of poset limits) to this example.
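The transitivity claimed in Example 3.4 can be checked on samples. The sketch below encodes the three-block function W10 from the example and assumes w = 1 (every vertex has a self-loop, as for posets in this section); the function names are hypothetical.

```python
import random

def W10_poset(x, y):
    """Blockmodel W10 of Example 3.4 on the classes [0, 1/4), [1/4, 3/4), [3/4, 1]."""
    if x < 0.25 and 0.25 <= y < 0.75:
        return 0.5
    if 0.25 <= x < 0.75 and y >= 0.75:
        return 0.5
    if x < 0.25 and y >= 0.75:
        return 1.0
    return 0.0

def sample_poset(n, rng=None):
    """Sample (U, G); self-loops are set to 1 (assumption: w = 1)."""
    rng = rng or random.Random()
    U = [rng.random() for _ in range(n)]
    G = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            # W11 = 0 and at most one of W10(x, y), W10(y, x) is positive, so
            # drawing each ordered pair separately matches the joint draw.
            if i != j and rng.random() < W10_poset(U[i], U[j]):
                G[i][j] = 1
    return U, G

def is_partial_order(G):
    """Check reflexivity, antisymmetry, and transitivity of the edge relation."""
    n = len(G)
    reflexive = all(G[i][i] == 1 for i in range(n))
    antisymmetric = all(not (G[i][j] and G[j][i])
                        for i in range(n) for j in range(n) if i != j)
    transitive = all(G[i][k]
                     for i in range(n) for j in range(n) for k in range(n)
                     if G[i][j] and G[j][k])
    return reflexive and antisymmetric and transitive
```

Transitivity holds for every sample because the only two-step chains run from the first class through the middle class to the last class, and edges from the first class to the last class are present with probability 1.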

Digraphon estimation
For undirected graphs, the graphon estimation problem has received considerable attention in recent years. In graphon estimation, one seeks to infer either the From the frequentist perspective, one is interested in producing an estimator for a fixed graphon, and many such algorithms have been developed, including histogram and degree-sorting based methods. To produce a frequentist digraphon estimator, one can extend methods developed for graphons. Just as a single asymmetric measurable function is insufficient for representing correlated edge directions, one must likewise estimate the edge directions jointly. Although a directed graph can be simply represented with a single asymmetric matrix, digraphon estimators must consider the impact on pairs of entries (G ij , G ji ) jointly when partitioning, rearranging, or otherwise manipulating vertices i, j.
Histogram estimators for digraphons. A histogram estimation procedure for graphons partitions the vertices into several classes, and then uses the average edge density across each pair of classes as an estimate of the probability of an edge between two vertices in those classes. This reduces the problem to that of estimating a partition that yields a good estimate of these edge densities. To estimate a digraphon (ignoring loops), we must estimate four edge-direction histograms, where the goal is to estimate a partition of the vertices that simultaneously yields good estimates of the four types of edge densities. After producing a partition of the vertices, one likewise computes the average edge density in each of the four cases, resulting in four histograms. (If considering loops, there is one additional 1-dimensional histogram whose estimates are to be jointly optimized by the partition.) The Frieze-Kannan and Szemerédi regularity lemmas lead to bounds on how well a large graph can be approximated using edge densities across a partition (Lovász, 2012, Chapters 9 and 10); see also Kallenberg (1999). The generalization of the Szemerédi regularity lemma to directed graphs by Alon and Shapira (2004) likewise provides a bound in terms of directed edge densities.
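Given a fixed partition of the vertices, the four edge-direction histograms are simple block averages. The sketch below covers only this averaging step (not the search for a good partition), and assumes class labels `z[i]` in {0, ..., K − 1}; the function name and return layout are hypothetical.

```python
def digraphon_histogram(G, z, K):
    """Estimate the four block densities from adjacency matrix G and labels z.

    Returns a dict mapping each joint outcome (a, b) to a K x K matrix whose
    (r, s) entry is the fraction of ordered pairs (i, j), i != j, with
    z[i] = r and z[j] = s for which (G_ij, G_ji) = (a, b). Loops are ignored.
    """
    outcomes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    counts = {ab: [[0] * K for _ in range(K)] for ab in outcomes}
    totals = [[0] * K for _ in range(K)]
    n = len(G)
    for i in range(n):
        for j in range(n):
            if i != j:
                counts[(G[i][j], G[j][i])][z[i]][z[j]] += 1
                totals[z[i]][z[j]] += 1
    return {
        ab: [[counts[ab][r][s] / totals[r][s] if totals[r][s] else 0.0
              for s in range(K)] for r in range(K)]
        for ab in outcomes
    }
```

Counting the pair (G_ij, G_ji) jointly, rather than the two entries separately, is what makes the estimated (0, 1) histogram the transpose of the (1, 0) histogram, mirroring the symmetry condition on digraphons.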
Degree-sorting estimators for digraphons. Many degree-sorting algorithms have been proposed for graphon estimation. These algorithms often involve "sorting" followed by "smoothing". In the sorting step, the vertices are sorted by their degrees, where the degree of a vertex $i$ is defined to be $\sum_{j=1}^{n} G_{ij}$. In the smoothing step, the $\{0, 1\}$-valued adjacency matrix is used to produce a $[0, 1]$-valued matrix using some smoothing algorithm. For example, Chan and Airoldi (2014) compare a degree-sorting algorithm that uses total variation distance minimization as a smoothing step to one that uses universal singular value thresholding (Chatterjee, 2015) as the smoothing step. Degree-sorting estimators assume that the degree distribution $\int_0^1 W(x, y)\,dy$ is strictly monotonizable in $x$, i.e., in order for sorting to be effective, the degrees of the vertices must vary.
This idea can be similarly applied to digraphon estimation: First sort the degrees of the vertices by the four types of edge directions, to obtain four adjacency matrices, and then smooth these matrices. It would suffice to require, after possibly applying a single measure-preserving transformation to [0, 1], that the map x → In this paper, we do not comment further on digraphon estimators, but many other graphon estimation techniques should generalize similarly. One way of describing the general pattern is to jointly consider the corresponding techniques applied to the four matrices obtained from the adjacency matrix restricted to each joint edge type.
Priors on digraphons. Bayesian approaches may also be used to estimate a digraphon; this is the focus of much of the rest of the paper. One may likewise use similar techniques to those that have been developed for graphons. Analogously to the case of undirected graphs, a Bayesian model for exchangeable directed graphs involves placing a prior on digraphons. This is justified by the characterization in Section 2.5 of exchangeable directed graphs in terms of random digraphons. We discuss such an approach in depth in Section 4, where we present a Bayesian nonparametric model based on random partitions using the Dirichlet process.

Infinite relational digraphon model
We now proceed to describe a prior on digraphons that makes use of block structure. For directed graphs, the infinite relational model (IRM) (Kemp et al., 2006) models edges between vertices using an asymmetric measurable function and is a nonparametric extension of the (asymmetric) stochastic block model. In this section, we present the infinite relational digraphon model (di-IRM), which gives a prior on digraphons. This model can be viewed as a generalization of the symmetric IRM, a graphon model, to the digraphon case. We then show how the di-IRM can be used to model a variety of digraphs, including ones that cannot be modeled using an asymmetric IRM.

Generative model
We present two equivalent representations of the di-IRM model: (1) a digraphon representation and (2) a clustering representation. The digraphon representation uses a stick-breaking Dirichlet process prior to partition the unit interval, while the clustering representation uses a Chinese restaurant process prior to partition the vertices. The difference between the two representations is analogous to that between the representations of the IRM given by Orbanz and Roy (2015, §4.1).
In Section 4.2 we show different types of random digraphons that arise from various settings of β. The partition is drawn from a Dirichlet stick-breaking process: for each $i \in \mathbb{N}$, draw $X_i \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha)$, and for every $k \in \mathbb{N}$, let the $k$-th class of the partition of $[0, 1]$ have length $X_k \prod_{i=1}^{k-1} (1 - X_i)$. The self-loops can be specified using the same partition of $[0, 1]$, either with a deterministic $\{0, 1\}$-valued function $w$ or a single weight $p$, as described in Section 2.3. For our purposes, we assume $w = 0$. This generative process fully specifies a random digraphon $W$, from which random digraphs $G(n, W)$ can then be sampled according to the process given in Section 2.4.
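The stick-breaking partition of the unit interval can be sketched as follows. This assumes the standard construction in which the k-th class has length X_k times the product of (1 − X_i) over i < k, truncated after a fixed number of sticks for illustration; the truncation and the function names are assumptions, not part of the model.

```python
import random

def stick_breaking(alpha, num_sticks, rng=None):
    """Return (lengths, remaining) for a truncated Beta(1, alpha) stick-breaking draw."""
    rng = rng or random.Random()
    lengths, remaining = [], 1.0
    for _ in range(num_sticks):
        x = rng.betavariate(1.0, alpha)   # X_k ~ Beta(1, alpha)
        lengths.append(x * remaining)     # X_k * prod_{i<k} (1 - X_i)
        remaining *= 1.0 - x              # mass left for later classes
    return lengths, remaining

def class_of(u, lengths):
    """Index of the class whose interval contains u; len(lengths) means the truncated tail."""
    edge = 0.0
    for k, length in enumerate(lengths):
        edge += length
        if u < edge:
            return k
    return len(lengths)
```

Applying `class_of` to each latent uniform U_i assigns vertex i to a class, which is the link between this representation and the clustering representation.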

Clustering representation
An alternative representation of the generative process for a partition described above can be formulated directly in terms of clustering: in this generative process, each vertex i has a cluster assignment z i . This yields an equivalent assignment to that given by the digraphon formulation if, after sampling the uniform random variable U i , we assign vertex i to the cluster corresponding to the class of the partition of [0, 1] that U i belongs to.
Thus, in place of the first step of the generative process given in the digraphon representation (Section 4.1.1), we draw a partition of the vertices from a Chinese restaurant process (CRP) (as described in, e.g., Aldous (1985)): $z \sim \mathrm{CRP}(\alpha)$, where each $z_i$ gives the cluster assignment of vertex $i$, and $\alpha > 0$ is a hyperparameter. The weights $\eta$ are drawn in the same manner as in the second step of the digraphon representation of the di-IRM. Finally, edges are drawn analogously to the general digraphon sampling procedure: $(G_{ij}, G_{ji}) \overset{\text{ind}}{\sim} \mathrm{Categorical}(\eta_{z_i, z_j})$, so that Equation (4) holds, where again the Categorical distribution is over the choices (0, 0), (0, 1), (1, 0), (1, 1).
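The clustering representation can be sketched in two steps: a sequential CRP draw of the assignments z, followed by one categorical draw per pair of vertices. In this illustration the weights are supplied as a function `eta(r, s)` rather than drawn from their Dirichlet prior, and the example weights are hypothetical; they must satisfy the digraphon symmetry, i.e., the (0, 1) weight at (r, s) equals the (1, 0) weight at (s, r).

```python
import random

PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def sample_crp(n, alpha, rng):
    """Draw cluster assignments z_1, ..., z_n from a Chinese restaurant process."""
    z, sizes = [], []
    for _ in range(n):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(0)
        sizes[k] += 1
        z.append(k)
    return z

def sample_edges(z, eta, rng):
    """Draw (G_ij, G_ji) ~ Categorical(eta(z_i, z_j)) for each pair i < j; no self-loops."""
    n = len(z)
    G = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            G[i][j], G[j][i] = rng.choices(PAIRS, weights=eta(z[i], z[j]))[0]
    return G

def example_eta(r, s):
    # Hypothetical tournament-style weights: lower-numbered clusters usually "win".
    if r < s:
        return (0.0, 0.1, 0.9, 0.0)
    if r > s:
        return (0.0, 0.9, 0.1, 0.0)
    return (0.0, 0.5, 0.5, 0.0)
```

Working with z directly avoids representing the latent uniforms at all, which is the convenience exploited by collapsed Gibbs sampling.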
This representation is particularly convenient for performing inference, especially when using a collapsed Gibbs sampling procedure, as we show in Section 6.

Special cases obtained from the di-IRM
In Figure 8, we display examples of random di-IRM draws using several settings of the hyperparameter vector β. The parameter settings were specifically chosen to illustrate some of the special cases the di-IRM model can cover.
Undirected. To get a prior on graphons using the di-IRM, we can set $\beta^{(01)} = \beta^{(10)} = 0$. Figure 8a shows a parameter setting that produces undirected graphs and is equivalent to a symmetric IRM when taking $W_{11}$ to be the IRM; we can see from the sample on the right that the graph is indeed undirected.
Tournaments. We can specify a di-IRM tournament prior by setting $\beta^{(00)} = \beta^{(11)} = 0$. Figure 8c shows the parameter setting $\beta = (0, 2, 1, 0)$, which puts all the mass on the middle two functions. The tournament structure is easy to see in the 20-vertex sample; for distinct $i$ and $j$, whenever there is an edge from $i$ to $j$, there is not an edge from $j$ to $i$. A less extreme (non-tournament) variant, with $\beta = (0.9, 2, 1, 0.5)$, still has strong correlations between the edge directions, by virtue of retaining most of the mass on the functions $W_{01}$ and $W_{10}$. Note that the block structure in a sample from this digraphon is more subtle than in the undirected sample, demonstrating the importance of counting all four edge-direction combinations rather than just marginals for the two directions.
Directed acyclic graphs. To obtain a directed acyclic graph from the di-IRM, we set the hyperparameters so that the resulting function $W_{11}$ is empty and $W_{10}$ has nonzero values only on blocks above the diagonal, as in Section 3.2.4. To achieve this, we set the Dirichlet weight parameters $\beta$ such that $\beta^{(01)} = \beta^{(11)} = 0$ for the weights $\eta_{r,s}$ where $r \neq s$, and for each class $r$ let $\beta_r$ refer to the 3-tuple of hyperparameters used for the $\eta_{r,r}$ weights on the diagonal, each set to $\beta_r = (1, 0, 0)$. With these hyperparameter choices, we obtain a directed acyclic di-IRM, as seen in Figure 8g. We can see in both samples that the directed edges in the resorted sample lie above the diagonal. Note that we make use of $\beta_r$ only in this section, to show how to get a DAG prior; in our later inference examples, we use the di-IRM model as introduced in the previous subsection with the single vector of hyperparameters $\beta$.
Near-ordering. Consider the hyperparameter settings $\beta = (0, 0, 1, 0)$ for the weights $\eta_{r,s}$ when $r \neq s$, and $\beta_r = (1, 0, 0)$ for every class $r$. The resulting digraph is "nearly" ordered, in the sense that it is linearizable and any two elements in different classes are comparable, as seen in Figure 8i. Here $\eta^{(10)}_{r,s} = 1$ for any blocks $(r, s)$ above the diagonal, and the resulting partial ordering is apparent in both of the resorted samples, with all directed edges above the diagonal.

Other partitions for the di-IRM
Any block model digraphon can be specified in a similar manner: first define a partition of [0, 1], which then gives a partition of [0, 1] 2 ; next let each block on [0, 1] 2 be piecewise constant such that the symmetry requirements in Equation (3) are satisfied.
In the case where the number of classes and the sizes of the classes are fixed parameters, the directed IRM behaves similarly to some random directed SBM. In addition to the CRP (equivalently, the Dirichlet stick-breaking partition of $[0, 1]$), one can also consider other random partitioning schemes. For instance, if one is interested in power law scaling in the number of clusters (and the sizes of particular clusters), the Pitman-Yor process (Pitman and Yor, 1997) provides a suitable generalization of the Dirichlet process. It has both a stick-breaking and an urn representation, analogous to those for the Dirichlet process.

Related work
The stochastic block model (see Holland et al. (1983) and Wasserman and Faust (1994)) has been well-studied in the case of directed graphs (Holland and Leinhardt, 1981; Wang and Wong, 1987), including from a Bayesian perspective (Gill and Swartz, 2004; Nowicki and Snijders, 2001; Wong, 1987). Although working within a restricted class of models, Holland and Leinhardt (1981) already consider the full joint distribution on edge directions, rather than making independence assumptions.
The directed stochastic blockmodel (di-SBM) can be represented as a digraphon W 4 given by four step-functions that are piecewise constant on a finite number of classes. We display an example of a directed SBM in Figure 9. The di-IRM model presented in this paper can be seen as a nonparametric extension of the di-SBM, just as the undirected IRM (introduced independently by Kemp et al. (2006) and Xu et al. (2007)) is a nonparametric undirected SBM.
Any prior on exchangeable undirected graphs can be described in terms of a corresponding prior on graphons. As alluded to in the introduction, many Bayesian nonparametric models for graphs admit a nice representation in this form (even if not originally described in these terms); see Orbanz and Roy (2015, §4) for additional details and examples from the machine learning literature, including the IRM. Likewise, priors on exchangeable digraphs (which have been less thoroughly explored) can be described in terms of the corresponding priors of digraphons, as we have begun to do here. As noted in Lloyd et al. (2012), when existing models are expressed in these terms, various restrictions (and in particular, unnecessary independence assumptions) become more apparent. As we have seen, the use of the IRM on directed graphs models the edge directions as independent (see Kemp et al. (2004) for examples), a condition that can be straightforwardly relaxed when the model is expressed in the general setting provided by digraphons.
Exchangeable directed graphs have also been considered by Austin (2008), via an application of the Aldous-Hoover theorem, although this work does not describe digraphons explicitly. We conclude this section by describing several extensions of the graphon formalism, some of which can be combined with the directed case. In particular, edges may be more general than {0, 1}-valued. Variants of graphons for weighted and edge-colored graphs have been considered by Lovász (2012, Chapter 17) and Austin (2008). Graphs with edge multiplicity, or multigraphs, can be viewed as integer-valued arrays, a case also covered by the Aldous-Hoover theorem, although the corresponding extension of graphons is more complicated when the edge multiplicities are unbounded; see Kolossváry and Ráth (2011), Lovász (2012, Chapter 17), and Kunszenti-Kovács et al. (2014). Graphs (that are not necessarily symmetric) with real-valued edges are also covered by the Aldous-Hoover theorem through real-valued exchangeable arrays, and have many applications in statistics and machine learning; see Lloyd et al. (2012) and Orbanz and Roy (2015). The Aldous-Hoover theorem also covers real-valued d-dimensional arrays for d > 2, although the corresponding extension of graphons to the case of hypergraphs is considerably more involved; for details, see Lovász (2012, Chapter 23.3), Austin (2008), and Lloyd et al. (2013).

Posterior inference
In this section, we describe collapsed Gibbs sampling for the di-IRM. We use the notation for the clustering representation of the di-IRM, so that Gibbs sampling amounts to repeatedly resampling the cluster assignment of each vertex. Let G be a digraph on [n]; for simplicity we assume that G has no self-edges, and that, as in Section 4.1.1, the di-IRM parameters are chosen so that no self-edges are produced. Consider a partition of [n] into a countably infinite number of clusters, and for i ∈ [n], let $z_i \in \mathbb{N}$ denote the cluster assignment of i. Write z for the vector of all cluster assignments, and η for the 4-tuple of weight matrices. Because of the symmetry requirement on the diagonal, we are able to simplify notation as follows: let $m^*_r := m^{(01)}_{r,r}$, and let $\beta^* := \beta^{(01)} + \beta^{(10)}$. Let $\boldsymbol{\beta}^* := (\beta^{(00)}, \beta^*, \beta^{(11)})$ be the 3-tuple of hyperparameters for the diagonal blocks.
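The likelihood of G being drawn from the di-IRM, given cluster assignments z and weights η, depends on G only through the counts $m^{(ab)}_{r,s}$, where $m^{(ab)}_{r,s}$ denotes the number of directed edges of type (ab) between class r and class s, for a, b ∈ {0, 1} and r, s ∈ N.
Since the weights η have a factorized Dirichlet prior, the normalizing constants are expressed through the multivariate beta function $B(\theta) := \prod_i \Gamma(\theta_i) / \Gamma(\sum_i \theta_i)$. We sample each cluster assignment $z_i$ conditional on all other assignment variables:
$$P(z_i = r \mid z_{-i}, G) \propto P(G \mid z_i = r, z_{-i}) \, P(z_i = r \mid z_{-i}), \tag{5}$$
where $z_{-i}$ denotes the vector of all assignments $z_j$ such that $j \neq i$.
To compute the first term in Equation (5), we can integrate out the parameters $\eta^{(ab)}_{r,s}$, where we simplify calculations on the diagonal using the shorthand $m_r := (m^{(00)}_{r,r}, m^*_r, m^{(11)}_{r,r})$ and $\eta_r := (\eta^{(00)}_{r,r}, \eta^*_r, \eta^{(11)}_{r,r})$. The second term in Equation (5) comes from the CRP distribution on z:
$$P(z_i = r \mid z_{-i}) \propto \begin{cases} c_r & \text{if cluster } r \text{ is already occupied,} \\ \alpha & \text{if } r \text{ is a new cluster,} \end{cases}$$
where $c_r$ denotes the number of elements in cluster r (not counting vertex i), and α > 0 is the concentration hyperparameter.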
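The two ingredients of the collapsed Gibbs update can be sketched numerically as follows. This is a minimal illustration rather than the sampler used in the experiments: the function names are hypothetical, and the block marginal assumes that the likelihood of a block is a product of its weights raised to the corresponding edge-type counts, so that a Dirichlet prior integrates out to a ratio of multivariate beta functions.

```python
from math import lgamma, log


def log_multivariate_beta(theta):
    """log B(theta) = sum_i log Gamma(theta_i) - log Gamma(sum_i theta_i)."""
    return sum(lgamma(t) for t in theta) - lgamma(sum(theta))


def block_log_marginal(counts, beta):
    """Log marginal likelihood of one block's edge-type counts, after
    integrating a Dirichlet(beta) prior over that block's weights:
    log B(beta + counts) - log B(beta)."""
    posterior = [b + m for b, m in zip(beta, counts)]
    return log_multivariate_beta(posterior) - log_multivariate_beta(beta)


def crp_log_weights(cluster_sizes, alpha):
    """Unnormalized CRP log-weights for placing one held-out vertex:
    log c_r for each occupied cluster, then log alpha for a new cluster."""
    return [log(c) for c in cluster_sizes] + [log(alpha)]


# Off-diagonal block with one observed pair of type (00) and a uniform prior:
# the marginal probability is 1/4.
value = block_log_marginal([1, 0, 0, 0], [1.0, 1.0, 1.0, 1.0])
```

For a diagonal block, the same function can be called with the 3-vectors of counts and hyperparameters in place of the 4-vectors. A full update for one vertex would add the CRP log-weights to the summed block log-marginals for each candidate cluster and then normalize.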

Experiments
In this section, we experimentally evaluate the di-IRM model on synthetic data. We present two examples: the first is meant to illustrate the correct behavior of inference on di-IRM parameters, and the second is designed to show the advantage of using a digraphon representation (given by the di-IRM) over using an asymmetric function (given by the IRM). Multiple digraphons may induce the same distribution on exchangeable digraphs, in which case they are said to be weakly isomorphic. This is not just because a digraphon can be perturbed on a measure-zero set without changing the induced distribution on digraphs, but also because measurable rearrangements of the digraphon will also leave the distribution invariant (analogously to how relabeling the vertices of a digraph does not change it up to isomorphism). Hence a digraphon W is not identifiable from the random digraph G(N, W); in general only its weak isomorphism class can be determined. For details (in the analogous setting of graphons), see Diaconis and Janson (2008, §7) and Orbanz and Roy (2015, §3.4).
Therefore, in the following inference problems, we can only expect to estimate a digraphon up to its weak isomorphism class. In a block model, this results in the nonidentifiability of the order of the blocks.
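In practice, this means that an inferred clustering should be compared with the true one only up to a relabeling of the clusters. A minimal sketch of such a comparison, using hypothetical helper names, is:

```python
def canonical_labels(z):
    """Relabel clusters in order of first appearance, so that two assignment
    vectors describing the same partition map to the same list."""
    mapping = {}
    out = []
    for label in z:
        if label not in mapping:
            mapping[label] = len(mapping)
        out.append(mapping[label])
    return out


def same_partition(z1, z2):
    """True if z1 and z2 induce the same partition of the vertices."""
    return canonical_labels(z1) == canonical_labels(z2)


# The same partition under two different block orders.
print(same_partition([2, 2, 0, 1], [0, 0, 1, 2]))  # True
```

The comparison ignores block names and block order, which is the most that can be asked of an estimate that is only determined up to weak isomorphism.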

Random di-IRM from uniform weights
We first drew a random di-IRM W with the weights β = (1, 1, 1, 1), which is displayed in Figure 10a. We then generated a 100-vertex sample from this digraphon (Figure 10c). We ran a collapsed Gibbs sampling procedure for 200 iterations, beginning from a random initial clustering. This inference procedure is able to recover the original weights, up to reordering; the inferred weight matrices are displayed in Figure 10, drawn in proportion to the inferred cluster sizes.
Figure 11. In the schematic, arrows show the probability of connecting in that direction; i.e., any two distinct vertices in the same group have probability 1/2 of an arrow in both directions (and 1/2 of an arrow in neither direction), while for any vertex from group 1 and vertex from group 2, either there is just an arrow from the first to the second, or there is just an arrow from the second to the first, each occurring with probability 1/2. The bottom right shows the random sample from the digraphon and the results of collapsed Gibbs sampling in the di-IRM and the IRM. White indicates no edge, red indicates an edge between vertices from group 1, blue indicates an edge between vertices from group 2, and purple indicates an edge between vertices from different groups. Black lines indicate the inferred partition.
This digraphon is displayed in Figure 11a, and a schematic illustrating the model is in Figure 11b. This example demonstrates the importance of being able to distinguish regions having different correlations between edge directions (but the same marginal left-to-right and right-to-left edge probabilities). We generated a synthetic digraph sampled from G(100, W) and then ran a collapsed Gibbs sampling procedure for the di-IRM. We also ran a similar collapsed Gibbs sampler for the IRM. Both samplers began with a random clustering and ran until the cluster assignments approximately converged. The results are shown in Figure 11c; here the random sample is displayed alongside the sample resorted according to the clusters inferred using the di-IRM model, as well as according to the clusters inferred by the IRM model. In both resorted images, the true clusters are colored: white indicates no edge, red indicates an edge between vertices from group 1, blue indicates an edge between vertices from group 2, and purple indicates an edge between vertices from different groups. Note that the true clusters are correctly inferred using the di-IRM model, as reordering the vertices according to the inferred clusters identifies the true groups, while the IRM model fails to discern the correct structure. The IRM only considers the marginal left-to-right and right-to-left edge probabilities, which do not distinguish the two clusters; in this particular inference run, almost all vertices were put into the first of the two clusters, which is consistent with not being able to distinguish between vertices with similar marginal edge probabilities. This result is what one would expect from an algorithm that has inferred uniform independent edge probabilities, i.e., the edge probabilities of an Erdős-Rényi graph.
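The reason the IRM cannot separate the two groups can be checked directly from the block weights described in the Figure 11 schematic. The sketch below uses a hypothetical dictionary layout in which type (a, b) for a class-r vertex i and a class-s vertex j means that the edge from i to j has value a and the edge from j to i has value b; this convention is an assumption of the illustration.

```python
def marginal_edge_probability(eta, r, s):
    """Probability of an edge from a class-r vertex to a class-s vertex,
    obtained by summing the joint weights over the reverse direction."""
    return eta[(1, 0)][r][s] + eta[(1, 1)][r][s]


def reciprocity(eta, r, s):
    """Probability of the reverse edge given that the forward edge is present."""
    return eta[(1, 1)][r][s] / marginal_edge_probability(eta, r, s)


# Two groups: within a group, both directions or neither (1/2 each);
# between groups, exactly one direction (1/2 each).
eta = {
    (0, 0): [[0.5, 0.0], [0.0, 0.5]],
    (0, 1): [[0.0, 0.5], [0.5, 0.0]],
    (1, 0): [[0.0, 0.5], [0.5, 0.0]],
    (1, 1): [[0.5, 0.0], [0.0, 0.5]],
}

marginals = [[marginal_edge_probability(eta, r, s) for s in range(2)]
             for r in range(2)]
print(marginals)               # [[0.5, 0.5], [0.5, 0.5]]
print(reciprocity(eta, 0, 0))  # 1.0
print(reciprocity(eta, 0, 1))  # 0.0
```

Every marginal edge probability equals 1/2, so a model that sees only the marginals has nothing to distinguish the groups, whereas the joint weights differ completely: edges are always reciprocated within a group and never reciprocated between groups.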

Discussion
We have described how priors on digraphons can be used in the statistical modeling of exchangeable dense digraphs, and have exhibited several key classes of structures that one can model with particular subclasses of these priors. We have also illustrated why merely using asymmetric measurable functions is insufficient, as this produces a misspecified model for any exchangeable digraphs having correlations between the edge directions.
While graphs sampled from models based on digraphons (and graphons) are almost surely dense (or empty), so that such models are not directly suitable for sparse real-world network applications, it is still useful to study models using digraphons (see, e.g., the discussion in Orbanz and Roy (2015, §7.1)). Some recent work (e.g., Borgs et al. (2015, 2016); Cai et al. (2016); Caron and Fox (2014); Crane and Dempsey (2016); Herlau and Schmidt (2016); Veitch and Roy (2015)) has pointed to methods for extending exchangeable graphs to the case of sparse graphs, but many interesting problems remain.