Exchangeable Trait Allocations

Trait allocations are a class of combinatorial structures in which data may belong to multiple groups and may have different levels of belonging in each group. Often the data are also exchangeable, i.e., their joint distribution is invariant to reordering. In clustering (a special case of trait allocation), exchangeability implies the existence of both a de Finetti representation and an exchangeable partition probability function (EPPF), distributional representations useful for computational and theoretical purposes. In this work, we develop the analogous de Finetti representation and exchangeable trait probability function (ETPF) for trait allocations, along with a characterization of all trait allocations with an ETPF. Unlike previous feature allocation characterizations, our proofs fully capture single-occurrence "dust" groups. We further introduce a novel constrained version of the ETPF that we use to establish an intuitive connection between the probability functions for clustering, feature allocations, and trait allocations. As an application of our general theory, we characterize the distribution of all edge-exchangeable graphs, a class of recently developed models that capture realistic sparse graph sequences.


Introduction
Representation theorems for exchangeable random variables are a ubiquitous and powerful tool in Bayesian modeling and inference. In many data analysis problems, we impose an order, or indexing, on our data points. This indexing can arise naturally (if we are truly observing data in a sequence) or can be artificially created to allow their storage in a database. In this context, exchangeability expresses the assumption that this order is arbitrary and should not affect our analysis. For instance, we often assume a sequence of data points is infinitely exchangeable, i.e., that the distribution of any subset of the data is invariant to reordering (Aldous, 1985). Though this assumption may seem weak, de Finetti's theorem (de Finetti, 1931; Hewitt and Savage, 1955) tells us that in this case, we can assume that a latent parameter exists, that our data are independent and identically distributed (i.i.d.) conditional on this parameter, and that the parameter itself has a distribution. Thus, de Finetti's theorem may be seen as a justification for a Bayesian model and prior and, in fact, for the infinite-dimensional priors provided by Bayesian nonparametrics.
De Finetti-style representation theorems have provided many similar insights for modeling and inference within Bayesian analysis. For example, in clustering problems, the inferential goal is to assign data points to mutually exclusive and exhaustive groups called clusters. It is typical to assume that the distribution of the clustering (i.e., the assignment of data points to clusters) is invariant to the ordering of the data points. Pitman (1995) showed that this notion of exchangeability for clustering is equivalent to the existence of an exchangeable partition probability function (EPPF). The EPPF and similar developments have led to algorithms that allow practical inference specifically with the Dirichlet process mixture (Escobar, 1994) and more generally in other clustering models (Pitman and Yor, 1997; Lee et al., 2013). Similarly, Kingman (1978) showed that we can assume that a latent set of probabilities (known as the "Kingman paintbox") exists over the clusters, and the cluster assignments are chosen i.i.d. from these probabilities. Many real clustering problems, such as disambiguating census data or clustering academic papers by originating lab, exhibit sublinear growth in cluster size as a function of the size of the total data; the Kingman paintbox representation has been used to show that exchangeable clustering models enforce linear growth in cluster size and thus are misspecified for these examples (Wallach et al., 2010; Broderick and Steorts, 2014; Miller et al., 2016). As another example, a natural assumption for graph (or network) data is that the distribution over vertices and edges is invariant to reordering of the vertices. In this case, a consequence of the Aldous-Hoover representation theorem (Aldous, 1981; Hoover, 1979) is that any sequence of such vertex-exchangeable graphs must have a number of edges that grows quadratically in the number of vertices (Orbanz and Roy, 2015).
By contrast, in many real network-data applications we observe sub-quadratic growth (Mitzenmacher, 2003; Newman, 2005; Clauset et al., 2009). Here, the representation theorem can be used to evaluate all vertex-exchangeable models (a list of which may be found in Lloyd et al. (2012)) at once, demonstrating their misspecification.
In this work, we develop and characterize a generalization of clustering models that we call trait allocation models. In clustering, each data point must belong to exactly one group. It is often the case, however, that data naturally belong to more than one group: a document might exhibit multiple words in a number of topics, a participant in a social network might send multiple messages to each of her friend groups, and a DNA sequence might exhibit different numbers of genes from different ancestral populations. Feature allocation models (Griffiths and Ghahramani, 2005; Broderick et al., 2013) are one way to capture this broader structure, wherein each data point may belong to multiple groups or no groups. However, they require that the degree of belonging in each group is binary; each data point is either a member of a group, or it is not. Trait allocation models extend feature allocation models: each data point may belong to each group (which we call a trait) by any non-negative integer amount. Authors have recently proposed a number of models for trait allocations (Titsias, 2008; Zhou et al., 2012; Zhou, 2014; James, 2014; Broderick et al., 2015a; Roychowdhury and Kulis, 2015). But as of yet, there is no characterization either of the class of exchangeable trait allocation models or of classes of exchangeable trait allocation models that are particularly amenable to inference. The consequences of the exchangeability assumption in this setting are still unknown. In this work, we provide characterizations of both the full class of exchangeable trait allocations and those with EPPF-like probability distributions. This not only unifies and generalizes past work on partitions and feature allocations, but also provides a natural avenue for the formulation and study of other exchangeable combinatorial structures.
We begin by formally defining trait allocations in Section 2. Representation theorems for infinitely exchangeable structures invariably require infinite sequences: sequences of random variables in the case of de Finetti's theorem; sequences of clustered indices, or partitions, in the case of the Kingman paintbox; or sequences of graphs in the case of the Aldous-Hoover theorem. Therefore, in Section 2, we also develop the machinery to discuss random sequences of trait allocations. In Section 3, we develop an organic ordering on the initially unordered traits so as to make sense of limiting values of the sequence of trait allocations. This setup allows us to establish a trait allocation paintbox representation in Section 4 that is analogous to the Kingman paintbox representation for clustering. Our new representation handles dust, the case where some traits may appear for just a single data point. This work therefore also extends previous work on the special case of exchangeable feature allocations, which was restricted to the dustless case (Broderick et al., 2013), to the fully general case. In Section 5, we develop an EPPF-like function to describe distributions over exchangeable trait allocations and characterize the class of trait allocations to which it applies. We call these exchangeable trait probability functions (ETPFs). Just as in the partition and feature allocation cases, the class of random trait allocations with probability functions represents a class of trait allocations that are particularly amenable to approximate posterior inference in practice, and therefore of particularly pressing interest to characterize. Moreover, though EPPF-like functions exist for feature allocations (Broderick et al., 2013), there is not yet an established connection between EPPFs and these functions; we are able to form this connection with a special form of the ETPF.
In Section 5, we introduce a new concept we call constrained ETPFs, and in Sections 5 and 6, we show how constrained ETPFs unite all of the other exchangeable probability functions within a single framework. In Section 6, we apply both our trait allocation paintbox and the constrained ETPFs to characterize edge-exchangeable graphs, an alternative and recently developed form of exchangeability for graph models that allows sparse sequences of graphs (Cai et al., 2016; Crane and Dempsey, 2016a). A similar paintbox representation generalizing partitions and edge-exchangeable (hyper)graphs has been studied in concurrent work on relational exchangeability (Crane and Dempsey, 2016b), but here we additionally explore the existence of a trait frequency model, the existence of a constrained trait frequency model and its connection to clustering and feature allocations, and the various connections between frequency models and probability functions.
We let $[N] := \{1, 2, \dots, N\}$ for any $N \in \mathbb{N}$. Sequences are denoted with parentheses, with indices suppressed only if they are all in $\mathbb{N}$ and clear from context. For example, $(x_k)$ is the sequence $(x_k)_{k \in \mathbb{N}}$ and $(x_{kj})$ is the sequence $(x_{kj})_{k,j \in \mathbb{N}}$, while $(x_{kj})_{j=1}^{\infty}$ is the sequence $x_{k1}, x_{k2}, \dots$ with $k$ fixed. The notation $A \subset B$ means $A$ is a (not necessarily proper) subset of $B$. The indicator function is denoted $\mathbf{1}(\dots)$; for example, $\mathbf{1}(x \in A)$ is 1 if $x \in A$, and 0 otherwise. For any multiset $x$ of elements in a set $X$ (we say $x$ is a multisubset of $X$), we denote by $x(y)$ the multiplicity of $y$ in $x$ for each $y \in X$. Two multisubsets $x, x'$ of $X$ are said to be equal, denoted $x = x'$, if the multiplicities of all elements $y \in X$ are equal in both $x$ and $x'$, i.e., $\forall y \in X$, $x(y) = x'(y)$. For any finite or infinite sequence, we use subscript $k$ to denote the $k$th element in the sequence. For sequences of (multi)sets, if $k$ is beyond the end of the sequence, the subscript $k$ operation returns the empty set.
Equality in distribution is denoted $\overset{d}{=}$, and convergence almost surely/in probability/in distribution is denoted $\overset{a.s.}{\to}$ / $\overset{p}{\to}$ / $\overset{d}{\to}$. We often use cycle notation for permutations (see Dummit and Foote (2004, p. 29)): for example, $\pi = (12)(34)$ is the permutation $\pi$ with $\pi(1) = 2$, $\pi(2) = 1$, $\pi(3) = 4$, $\pi(4) = 3$, and $\pi(k) = k$ for $k > 4$. We use the notation $X \sim (\theta_j)_{j=0}^{\infty}$ to denote sampling $X$ from the categorical distribution on $\{0\} \cup \mathbb{N}$ with probabilities $P(X = j) = \theta_j$ for $j \in \{0\} \cup \mathbb{N}$. We use $\lambda$ to denote the Lebesgue measure on $(0, 1)$. The symbol $\times_N S_N$ for a sequence of sets $(S_N)$ denotes $S_1 \times S_2 \times \dots$, their infinite product space.

Trait allocations
We begin by formalizing the concepts of a trait and trait allocation. We assume that our sequence of data points is indexed by N. As a running example for intuition, consider the case where each data point is a document, and each trait is a topic. Each document may have multiple words that belong to each topic. The degree of membership of the document in the topic is the number of words in that topic. We wish to capture the assignment of data points to the traits they express but in a way that does not depend on the type of data at hand. Therefore, we focus on the indices to the data points. This leads to the definition of traits as multisets of the data indices, i.e., the natural numbers. E.g., τ = {1, 1, 3} is a trait in which the datum at index 1 has multiplicity 2, and the datum at index 3 has unit multiplicity. In our running example, this trait might represent the topic about sports; the first document has two sports words, and the third document has one sports word.
Definition 2.1. A trait is a finite, nonempty multisubset of N.
Let the set of all traits be denoted $\mathbb{T}$. It will be useful throughout to be able to pick out the multiplicity of a particular index from a trait; for this purpose, we let $\tau(n)$ denote the multiplicity of $n \in \mathbb{N}$ in $\tau$.
A single trait is not sufficient to capture the combinatorial structure underlying the first $N \in \mathbb{N}$ data in the sequence: each datum may be a member of multiple traits (with varying degrees of membership), just as each document may contain words from multiple topics. The traits also have no inherent order, just as the topics "sports", "arts", and "science" have no inherent order. Building from Definition 2.1 and motivated by these desiderata, we define a finite trait allocation as a finite multiset of traits. For example, $t_4 = \{\{1\}, \{3, 4\}, \{3, 3\}, \{3, 3\}, \{1, 1, 4\}\}$ represents a collection of traits expressed by the first 4 data points in a sequence. In this case, index 1 is a member of two traits, index 2 is a member of none, and so on. Throughout, we assume that each datum at index $n \in \mathbb{N}$, $n \le N$, belongs to only finitely many latent traits. Further, for a data set of size $N$, any index $n > N$ should not belong to any trait; the allocation $t_N$ represents traits expressed by only the first $N$ data. These statements are formalized in Definition 2.2. For a trait allocation $t_N$, $t_N(\tau)$ denotes the multiplicity of $\tau \in \mathbb{T}$ in $t_N$.
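The multiset structure described above can be made concrete with a small sketch. It assumes one possible representation that is not taken from the paper: a trait is stored as a sorted tuple of indices with repetition, and a finite trait allocation as a `collections.Counter` over such tuples. The helper names (`make_trait`, `make_allocation`, `multiplicity`, `traits_containing`) are hypothetical.

```python
from collections import Counter

def make_trait(indices):
    """A trait is a finite, nonempty multiset of data indices, stored as a sorted tuple."""
    trait = tuple(sorted(indices))
    if not trait or min(trait) < 1:
        raise ValueError("a trait must be a nonempty multiset of positive integers")
    return trait

def make_allocation(traits, n_data):
    """A trait allocation of [N]: a multiset of traits containing no index above N."""
    allocation = Counter(make_trait(t) for t in traits)
    if any(max(t) > n_data for t in allocation):
        raise ValueError("no trait may contain an index greater than N")
    return allocation

def multiplicity(trait, n):
    """tau(n): the multiplicity of index n in the trait."""
    return trait.count(n)

def traits_containing(allocation, n):
    """Number of traits (counted with multiplicity) in which index n appears."""
    return sum(count for trait, count in allocation.items() if n in trait)

# The example t_4 from the text; the trait {3, 3} appears twice.
t4 = make_allocation([[1], [3, 4], [3, 3], [3, 3], [1, 1, 4]], n_data=4)
```

Here `t4[(3, 3)]` plays the role of $t_4(\tau)$ for $\tau = \{3, 3\}$, and `traits_containing(t4, 1)` counts the traits to which index 1 belongs.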
Note that $\mathbb{T}$ is countable by Proposition A.1, and so the infinite sums in Definition 2.2 are well-defined. Let $\mathcal{T}_N$ be the set of trait allocations of $[N]$, and define $\mathcal{T}$ to be the set of all finite trait allocations, $\mathcal{T} := \bigcup_N \mathcal{T}_N$. The set $\mathcal{T}$ is also countable, by Proposition A.2.
Two notable special cases of finite trait allocations that have featured prominently in past work are feature allocations (Broderick et al., 2013) and partitions (Kingman, 1978; Pitman, 1995).
Up until this point, we have dealt solely with finite sequences of $N$ data. However, in many data analysis problems, it is more natural (or at least an acceptable simplifying approximation) to treat the observed sequence of $N$ data as the beginning of an infinite sequence. As each datum arrives, it adds its own index to the traits it expresses (and in the process introduces any previously uninstantiated traits). For example, if after 3 observations we have $t_3 = \{\{1\}, \{1, 2\}\}$, then observing the next datum might yield $t_4 = \{\{1\}, \{1, 2, 4, 4\}, \{4, 4\}\}$. Note that when an index is introduced, none of the earlier indices' memberships in traits are modified; the sequence of finite trait allocations is consistent. To make this rigorous, we define the restriction of a trait (respectively, trait allocation), which allows us to relate two trait allocations $t_N, t_M \in \mathcal{T}$ of differing $N$ and $M$. The restriction operator $|_M$, provided by Definition 2.5 and acting on either traits or finite trait allocations, removes all indices greater than $M$ from all traits, and does not modify the multiplicity of indices less than or equal to $M$. If any trait becomes empty in this process, it is removed from the allocation.
Definition 2.5. The restriction $|_M$ of a trait $\tau \in \mathbb{T}$ to $M \in \mathbb{N}$ is defined by
$$\tau|_M(m) := \tau(m)\,\mathbf{1}(m \le M), \quad m \in \mathbb{N},$$
and is overloaded for finite trait allocations $|_M : \mathcal{T} \to \mathcal{T}_M$ as
$$t_N|_M(\tau) := \sum_{\omega \in \mathbb{T}} t_N(\omega)\,\mathbf{1}(\omega|_M = \tau), \quad \tau \in \mathbb{T}.$$
Definition 2.6. A pair of trait allocations $t_M$ of $[M]$ and $t_N$ of $[N]$, with $M \le N$, is said to be consistent if $t_N|_M = t_M$.
Summing over all $\omega \in \mathbb{T}$ in Definition 2.5 is not an issue, since $\mathbb{T}$ is countable by Proposition A.1 and only finitely many terms in the sum are nonzero by Definition 2.2. Note that restriction is not injective: distinct trait allocations of $[N]$ may be consistent with the same $t_M$. But consistency is transitive in the following sense: if $t_N$ is consistent with $t_M$, and $t_M$ is consistent with $t_K$, with $N \ge M \ge K$, then $t_N$ is consistent with $t_K$. This transitivity is a direct consequence of the definition of restriction:
$$t_N|_K = (t_N|_M)|_K = t_M|_K = t_K.$$
The consistency of two finite trait allocations allows us to define the notion of a consistent sequence of trait allocations. Such a sequence can be thought of as generated by the sequential process of data arriving; each data point adds its index to its assigned traits without modifying any previous index. For example, $(\{\{1\}, \{1\}\}, \{\{1, 2\}, \{1\}\}, \{\{1, 2\}, \{1, 3\}\}, \dots)$ is a valid beginning to an infinite sequence of trait allocations. The first datum expresses two traits with multiplicity 1, and the second and third each express a single one of those traits with multiplicity 1. As a counterexample, $(\{\{1, 1\}\}, \{\{1, 1\}\}, \{\{1, 3\}\}, \dots)$ is not a valid trait allocation sequence, as the third trait allocation is not consistent with either the first or second. This sequence does not correspond to building up the traits expressed by data in a sequence; when the third datum is observed, the traits expressed by the first are modified.
Definition 2.7. An infinite trait allocation $t_\infty := (t_N)$ is a sequence of trait allocations $t_N$ of $[N]$, $N \in \mathbb{N}$, for which
$$t_{N+1}|_N = t_N \quad \text{for all } N \in \mathbb{N}. \qquad (2.5)$$
Denote the set of all infinite trait allocations $\mathcal{T}_\infty \subset \times_N \mathcal{T}_N$. The earlier notion of transitivity combined with Definition 2.7 implies that all pairs $t_N, t_M$ of elements in the sequence are consistent, as desired. Restriction acts on infinite trait allocations in a straightforward way: given $t_\infty = (t_N)$, restriction to $M \in \mathbb{N}$ is equivalent to the corresponding projection, $t_\infty|_M := t_M$.
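As a minimal sketch of restriction and consistency, the following assumes the same hypothetical representation as before (traits as sorted tuples, allocations as `Counter` objects); the function names are illustrative and not from the paper. It checks the example $t_3$, $t_4$ from the text, the counterexample sequence, and the transitivity of consistency.

```python
from collections import Counter

def restrict_trait(trait, m):
    """Drop every index greater than m; multiplicities of indices <= m are unchanged."""
    return tuple(i for i in trait if i <= m)

def restrict(allocation, m):
    """Restrict every trait to m and discard traits that become empty."""
    restricted = Counter()
    for trait, count in allocation.items():
        kept = restrict_trait(trait, m)
        if kept:
            restricted[kept] += count
    return restricted

def is_consistent(t_m, m, t_n):
    """t_m (an allocation of [m]) is consistent with t_n when t_n restricted to m equals t_m."""
    return restrict(t_n, m) == t_m

# Example from the text: t_3 = {{1}, {1, 2}} and t_4 = {{1}, {1, 2, 4, 4}, {4, 4}}.
t3 = Counter({(1,): 1, (1, 2): 1})
t4 = Counter({(1,): 1, (1, 2, 4, 4): 1, (4, 4): 1})
```

Note that two distinct traits can collapse to the same restricted trait, which is why `restrict` accumulates counts rather than overwriting them.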
Recall that the motivation for developing infinite trait allocations is to capture the latent combinatorial structure underlying a sequence of observed data. Since this sequence is random, its underlying structure may also be, and thus the next task is to develop a corresponding notion of a random infinite trait allocation. However, this is not a trivial exercise; $\mathcal{T}_\infty$ is infinite-dimensional and uncountable by Proposition A.3, and defining a probability space directly on it is somewhat difficult due to its consistency structure. Instead, we use the standard technique of defining a consistent sequence of probability spaces on its finite-dimensional projections, and use an extension theorem to give us the desired construction. Let $(\mathcal{T}_N, 2^{\mathcal{T}_N}, \nu_N)$ be a probability space for each $N \in \mathbb{N}$. The power set σ-algebra is a natural consequence of the requirement that singletons in $\mathcal{T}_N$ are measurable, since $\mathcal{T}_N$ is countable by Proposition A.2 and σ-algebras are closed under countable union. We constrain sequential probability measures $\nu_N, \nu_{N+1}$ to be consistent for all $N \in \mathbb{N}$: the random trait allocation of $[N]$ has the same distribution as the trait allocation of $[N + 1]$ restricted to $N$, i.e.,
$$\nu_N(t_N) = \nu_{N+1}\left(\{t_{N+1} \in \mathcal{T}_{N+1} : t_{N+1}|_N = t_N\}\right) \quad \text{for all } t_N \in \mathcal{T}_N. \qquad (2.6)$$
Given such a sequence of probability spaces, Proposition 2.8 guarantees the existence of a unique random infinite trait allocation $T_\infty \in \mathcal{T}_\infty$ that has finite marginal distributions equal to the $\nu_N$ induced by restriction (a measurable mapping by Proposition A.10). The proof of this result and accompanying technical details, such as the precise probability space on which $T_\infty$ resides, are provided in Appendix A.2, but they are not required for understanding the main text.
Proposition 2.8. Given a sequence of consistent probability measures $\nu_N$ on $\mathcal{T}_N$, $N \in \mathbb{N}$, there exists a unique random infinite trait allocation $T_\infty \in \mathcal{T}_\infty$ such that $T_\infty|_N \sim \nu_N$ for all $N \in \mathbb{N}$.
The properties of the random infinite trait allocation $T_\infty$ are intimately related to those of the observed sequence of data it represents. In many applications, the data sequence has the property that its distribution is invariant to finite permutation of its elements; in some sense, the order in which the data sequence is observed is immaterial. We expect the random infinite trait allocation $T_\infty$ associated with such an exchangeable sequence (Aldous, 1985) to inherit a similar property. As a simple illustration of the correct extension of permutation to infinite trait allocations, suppose we observe the sequence of data $(x_1, x_2, x_3, \dots)$ with associated trait allocations $T_1, T_2, T_3$, and so on. If we swap $x_1$ and $x_2$ in the data sequence, resulting in the new sequence $(x_2, x_1, x_3, \dots)$, the traits expressed by $x_2$ become those containing index 1, the traits for $x_1$ become those containing index 2, and the rest are unchanged. Therefore, the correct permuted infinite trait allocation $(T'_N)$ is obtained as follows: $T'_1$ (resp. $T'_2$) is equal to the restriction to 1 (resp. 2) of $T_2$ with permuted indices, while $T'_N$ for $N \ge 3$ is simply $T_N$ with its indices permuted. This demonstrates a crucial point: if the permutation affects only indices up to $M \in \mathbb{N}$ (there is always such an $M$ for finite permutations), we can arrive at the sequence of trait allocations for the permuted data sequence in two steps. First, we permute the indices in $T_M$ and then restrict to $1, 2, \dots, M$ to get the first $M$ permuted finite trait allocations. Then we simply permute the indices in $T_N$ for each $N > M$.
To make this observation precise, we let $\pi$ be a finite permutation of the natural numbers, i.e.,
$$\pi : \mathbb{N} \to \mathbb{N}, \quad \pi \text{ is a bijection}, \quad \exists M \in \mathbb{N} : \forall m > M,\ \pi(m) = m, \qquad (2.8)$$
and overload its notation to operate on traits and (in)finite trait allocations in Definition 2.9. Note that if $\pi$ is a finite permutation, its inverse $\pi^{-1}$ is also a finite permutation with the same value of $M \in \mathbb{N}$ for which $m > M$ implies $\pi(m) = m$. Intuitively, $\pi$ operates on traits and finite trait allocations by permuting their indices. For example, if $\pi$ has the cycle $(123)$ and fixes all indices greater than 3, then $\pi\{1, 1, 2, 4\} = \{2, 2, 3, 4\}$. As discussed above, the definition for infinite trait allocations ensures that the permuted infinite trait allocation is a consistent sequence that corresponds to rearranging the observed data sequence with the same permutation. Definition 2.9 provides the necessary framework for studying exchangeable infinite trait allocations, defined as random infinite trait allocations whose distributions are invariant to finite permutation.
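The action of a finite permutation can be sketched as follows, again under the assumed tuple/`Counter` representation. A permutation is given as a dictionary listing only the indices it moves (an illustrative convention, not the paper's), and `permute_sequence` implements the two-step procedure described in the text for a finite prefix $T_1, \dots, T_N$ of a consistent sequence. The small sequence used at the end is a made-up example.

```python
from collections import Counter

def permute_trait(perm, trait):
    """Apply perm to each index of the trait; indices missing from perm are fixed."""
    return tuple(sorted(perm.get(i, i) for i in trait))

def permute_allocation(perm, allocation):
    """Permute the indices of every trait in a finite trait allocation."""
    permuted = Counter()
    for trait, count in allocation.items():
        permuted[permute_trait(perm, trait)] += count
    return permuted

def restrict(allocation, m):
    """Remove indices greater than m and drop traits that become empty."""
    restricted = Counter()
    for trait, count in allocation.items():
        kept = tuple(i for i in trait if i <= m)
        if kept:
            restricted[kept] += count
    return restricted

def permute_sequence(perm, allocations):
    """allocations[n - 1] is T_n; returns the permuted prefix of the same length."""
    m = max(perm, default=0)  # perm fixes every index above m
    if m > len(allocations):
        raise ValueError("need at least T_1, ..., T_M to apply the permutation")
    result = []
    for n, t_n in enumerate(allocations, start=1):
        if n <= m:
            # Step 1: permute T_M, then restrict to n.
            result.append(restrict(permute_allocation(perm, allocations[m - 1]), n))
        else:
            # Step 2: permute the indices of T_n directly.
            result.append(permute_allocation(perm, t_n))
    return result

cycle_123 = {1: 2, 2: 3, 3: 1}
swap_12 = {1: 2, 2: 1}
# Hypothetical consistent prefix: T_1 = {{1,1}}, T_2 = {{1,1},{2}}, T_3 = {{1,1},{2,3}}.
prefix = [
    Counter({(1, 1): 1}),
    Counter({(1, 1): 1, (2,): 1}),
    Counter({(1, 1): 1, (2, 3): 1}),
]
permuted_prefix = permute_sequence(swap_12, prefix)
```

After swapping the first two data points, the permuted prefix is again consistent, which is the property the definition for infinite trait allocations is designed to guarantee.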
Definition 2.10. An exchangeable infinite trait allocation, $T_\infty$, is a random infinite trait allocation such that for any finite permutation $\pi : \mathbb{N} \to \mathbb{N}$,
$$\pi T_\infty \overset{d}{=} T_\infty. \qquad (2.12)$$
Proposition A.11 shows that $\pi : \mathcal{T}_\infty \to \mathcal{T}_\infty$ is measurable and thus $\pi T_\infty$ is a well-defined random element of $\mathcal{T}_\infty$. Note that if the random infinite trait allocation is a random infinite partition/feature allocation almost surely, the notion of exchangeability in Definition 2.10 reduces to earlier notions of exchangeability for random infinite partition/feature allocations (Kingman, 1978; Aldous, 1985; Broderick et al., 2013). Exchangeability also has an analogous definition for random finite trait allocations, though this is of less interest in the present work.
As a concrete example (and also to show the existence of exchangeable trait allocations that are neither feature allocations nor partitions), consider the construction in Fig. 1. The blue and red segments (intervals in $(0, 1)$) represent the probability of any index joining one of two traits (blue or red), with the multiplicities labeled beside the segment. The sequence $(V_N) \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(0, 1)$, shown for indices $N \in [4]$, determines the membership of each index in the two traits. Here, $V_1$ falls only in the blue segment labeled "1", indicating that index 1 is a member of one trait with multiplicity 1. $V_4$ falls in both blue and red segments with multiplicity 2, indicating that index 4 is in the same trait as 1 with multiplicity 2, and in another trait also with multiplicity 2. Similar statements can be made about $V_2$ and $V_3$, yielding the trait allocation $T_4 = \{\{1, 3, 3, 4, 4\}, \{3, 3, 3, 4, 4\}\}$. Since $(V_N)$ are i.i.d. random variables, the distribution of $T_4$ is invariant to finite permutation of their order, and hence to permutation of its indices in $[4]$. Therefore, continuing this construction with $V_5, V_6, \dots$ yields an exchangeable infinite trait allocation $T_\infty$.
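A construction of this kind can be simulated in a few lines. The sketch below is only illustrative: the segment endpoints in `PAINTBOX` are hypothetical (the text does not give the coordinates used in Fig. 1) and were chosen so that one particular set of uniform values reproduces the $T_4$ described above.

```python
import random

# Hypothetical paintbox: each trait is a list of (low, high, multiplicity) segments in (0, 1).
PAINTBOX = {
    "blue": [(0.0, 0.3, 1), (0.3, 0.6, 2)],
    "red": [(0.4, 0.5, 2), (0.5, 0.6, 3)],
}

def multiplicity_at(segments, v):
    """Multiplicity assigned by one trait's segments to the uniform value v."""
    for low, high, mult in segments:
        if low <= v < high:
            return mult
    return 0

def allocation_from_uniforms(uniforms):
    """Build the trait allocation of [N] from V_1, ..., V_N as a sorted list of traits."""
    members = {name: [] for name in PAINTBOX}
    for n, v in enumerate(uniforms, start=1):
        for name, segments in PAINTBOX.items():
            members[name].extend([n] * multiplicity_at(segments, v))
    return sorted(tuple(indices) for indices in members.values() if indices)

# One random draw of T_4 with i.i.d. uniforms.
rng = random.Random(0)
sample_t4 = allocation_from_uniforms([rng.random() for _ in range(4)])

# Values chosen by hand to mimic the figure: index 2 joins no trait.
figure_like_t4 = allocation_from_uniforms([0.1, 0.8, 0.55, 0.45])
```

Because each index's memberships depend only on its own uniform draw, reordering the draws simply relabels the indices in the resulting traits, which is the mechanism behind exchangeability in this example.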

Ordered trait allocations and order-of-appearance
We impose no inherent ordering on the traits in a finite trait allocation via the use of (multi)sets; the allocations $\{\{1\}, \{3, 3\}\}$ and $\{\{3, 3\}, \{1\}\}$ are identical. This correctly captures our lack of a preferred trait order in many data analysis problems. However, ordered trait allocations are nonetheless often useful from standpoints both practical (such as when we need to store a finite trait allocation in an array in physical memory) and theoretical (such as in developing the characterization of all exchangeable infinite trait allocations in Section 4).
A primary concern in the development of an ordering scheme is consistency. Intuitively, as we observe more data in the sequence, we want the sequence of finite ordered trait allocations to "grow" but not be "shuffled"; in other words, if two finite trait allocations are consistent, the traits in their ordered counterparts at the same index should each be consistent. For partitions, this task is straightforward: each trait simply receives as a label its lowest index (Aldous, 1985), and the labels are used to order the traits. This is known as the order-of-appearance labeling, as traits are labeled in the order in which they are instantiated by data in the sequence. For example, in the partition $t_4 = \{\{1, 3\}, \{2, 4\}\}$ of $[4]$, $\{1, 3\}$ would receive label 1 and $\{2, 4\}$ would receive label 2, so $\{1, 3\}$ would be before $\{2, 4\}$ in the order. Restricting these traits will never change their order: for instance, $\{1, 3\}|_2 = \{1\}$ and $\{2, 4\}|_2 = \{2\}$, which still receive labels 1 and 2, respectively. If a restriction leaves a trait empty, it is simply removed and does not interfere with any traits of a lower label. The restricted ordering is thus intuitively consistent, as desired. Finite feature allocations, on the other hand, are not as easy to label, since multiple traits may have a common lowest index; e.g., both traits in the feature allocation $t_4 = \{\{1, 3\}, \{1, 2, 4\}\}$ of $[4]$ have the lowest index of 1, and thus some method is required to break ties. In past work (Broderick et al., 2013), this was handled with auxiliary randomness: any tie was broken by associating i.i.d. uniform random labels to features with the same lowest index, and comparing those labels. The main disadvantage of this approach is that there is no way to uniquely identify the assignment of uniform random tie-breakers to traits, so orderings of two consistent finite trait allocations $t_N, t_M$ are generally not consistent. Trait allocations, as a generalization of feature allocations, suffer similar difficulties.
This section makes the notion of consistency precise, and provides the first consistent ordering scheme for trait allocations (and therefore for feature allocations) that requires no auxiliary randomness. We begin by defining ordered finite trait allocations.
Definition 3.1. An ordered trait allocation $\ell_N$ of $[N]$ is a sequence $\ell_N = (\tau_k)_{k=1}^{K}$, $K < \infty$, of traits $\tau_k \in \mathbb{T}$ such that no trait contains an index $n > N$.
Let $\mathcal{L}_N$ be the set of ordered trait allocations of $[N]$, and let $\mathcal{L} = \bigcup_N \mathcal{L}_N$ be the set of all ordered finite trait allocations. We use subscript $k$ on an ordered trait allocation to denote the $k$th trait in its ordering, or the empty set if $k$ is greater than its length. In the notation of Definition 3.1, $\ell_{Nk} := \tau_k$ for $k \le K$, and $\ell_{Nk} := \emptyset$ for $k > K$.
As in the case of unordered trait allocations, the notion of consistency is intimately tied to that of restriction. We again require that restriction to $M \in \mathbb{N}$ removes all indices $m > M$, and removes all traits rendered empty by that process. However, we also require that the order of the remaining traits is preserved. The restriction of an ordered trait allocation therefore restricts each trait in turn and then applies a filter function, where the filter function removes any empty sets from a sequence while preserving the order of the nonempty sets.
Given these definitions, we are now ready to make the earlier intuitive notion of a consistent trait ordering scheme precise. Definition 3.3 states that a function $[\,\cdot\,] : \mathcal{T} \to \mathcal{L}$ must satisfy two conditions to be a valid trait ordering: (1) the ordering is preservative, and (2) the ordering commutes with restriction, i.e., $[t_N|_M] = [t_N]|_M$. The first condition enforces that a trait ordering does not add, remove, or modify the traits in the finite trait allocation $t_N$; this implies that trait orderings are injective. The second condition enforces that applying a trait ordering to a consistent sequence of finite trait allocations yields a consistent sequence of ordered finite trait allocations.
We now develop a particular trait ordering for use in the remainder of the paper. Rather than assign labels directly to traits as has been done for partitions and feature allocations (Aldous, 1985; Broderick et al., 2013), we develop a relation $<$ between traits, show that it is a strict total order, use that to uniquely place the traits in a finite trait allocation in order, and show that this process yields a trait ordering in the sense of Definition 3.3. The proposed relation $<$ between two traits considers the lowest index with differing multiplicity, and orders the one with higher multiplicity first. For example, $\{1, 1, 4\} < \{1, 2\}$ since 1 is the lowest index with differing multiplicity, and the multiplicity of 1 is greater in the first trait than in the second. Similarly, $\{2, 3\} < \{2, 4\}$ since 3 has greater multiplicity in the first trait than the second, and both 1 and 2 have the same multiplicity in both traits.
Proposition 3.5. The relation $<$ is a strict total order on $\mathbb{T}$.
To show that the relation < is a strict total order, it is sufficient to show that it is both trichotomous and transitive. The proof of Proposition 3.5 is thus the combination of Lemmas 3.6 and 3.7 below, the proofs of which may be found in Appendix B. Note that trichotomy is a stronger property than the combination of irreflexivity and antisymmetry, leading to a strict total order rather than a strict order.
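Based on this strict total order, we define a (nonstrict) total order $\le$: for two traits $\tau, \omega \in \mathbb{T}$, we have $\tau \le \omega \iff (\tau < \omega \text{ or } \tau = \omega)$. Therefore, for any trait allocation $t_N$, there is a unique ordered trait allocation that contains exactly the traits of $t_N$ (with their multiplicities) arranged in nondecreasing order under $\le$. We define $[\,\cdot\,] : \mathcal{T} \to \mathcal{L}$ as the mapping from $t_N$ to this unique ordered finite trait allocation. As is the case for all ordered trait allocations, $[\,\cdot\,]_k$ denotes the $k$th element in the ordering, or the empty set if $k$ exceeds its length. The mapping $[\,\cdot\,]$ is a trait ordering, as shown by Theorem 3.9. The proof of the technical result Lemma 3.8 is provided in Appendix B.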
Theorem 3.9. The mapping [ · ] based on the order ≤ is a trait ordering. Proof.
$[\,\cdot\,]$ is trivially preservative, since it only rearranges the traits of $t_N$. For the second condition, note that since the restriction operation $\cdot|_M$ acts identically on individual traits in both ordered and unordered finite trait allocations, and empty traits are removed, both $[t_N|_M]$ and $[t_N]|_M$ have the same multiset of traits (albeit in a potentially different order). The first trait $\tau$ of $[t_N]$ satisfies $\tau \le \omega$ for any $\omega \in \mathbb{T}$ such that $t_N(\omega) > 0$, by definition of $[\,\cdot\,]$. By Lemma 3.8, this implies that $\tau|_M \le \omega|_M$ for all $\omega \in t_N$. Therefore, the first trait in $[t_N]|_M$ is the same as the first trait in $[t_N|_M]$. Applying this logic recursively to $t_N$ with $\tau$ removed, the result follows.
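The ordering relation and the resulting order-of-appearance can be sketched directly from the verbal description of $<$ (lowest index with differing multiplicity, higher multiplicity first). The code assumes traits are sorted tuples and an allocation is passed as a list of traits with repeats; `compare_traits`, `order_of_appearance`, and `restrict_ordered` are illustrative names. On the example used here, ordering and restriction commute, as Theorem 3.9 states in general.

```python
from functools import cmp_to_key

def compare_traits(tau, omega):
    """Return -1 if tau < omega, 1 if omega < tau, and 0 if the traits are equal."""
    for n in sorted(set(tau) | set(omega)):
        mult_tau, mult_omega = tau.count(n), omega.count(n)
        if mult_tau != mult_omega:
            # The trait with the higher multiplicity at the lowest differing index comes first.
            return -1 if mult_tau > mult_omega else 1
    return 0

def order_of_appearance(traits):
    """Arrange the traits of a finite trait allocation in nondecreasing order under <=."""
    return sorted(traits, key=cmp_to_key(compare_traits))

def restrict_ordered(traits, m):
    """Restrict each trait to m, then filter out empty traits while preserving order."""
    restricted = (tuple(i for i in trait if i <= m) for trait in traits)
    return [trait for trait in restricted if trait]

t4 = [(1,), (3, 4), (3, 3), (3, 3), (1, 1, 4)]
ordered_t4 = order_of_appearance(t4)
```

For a partition, every index has multiplicity at most one and appears in exactly one trait, so this comparison reduces to ordering traits by their lowest index.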
The strengths of this approach are threefold. First, it is a generalization of previous techniques for partitions (Aldous, 1985), as applying [ · ] to a partition exactly reproduces this previous ordering. Thus, we adopt the same name and refer to the mapping [ · ] as the order-of-appearance. Second, it does not rely on any auxiliary randomness, as in earlier approaches for feature allocations (Broderick et al., 2013), circumventing questions about whether labelings with auxiliary randomness can be made consistent. Finally, and perhaps most importantly, it forms the basis of an exchangeable sequence of random quantities pertaining to each index n ∈ N, allowing the characterization of all exchangeable infinite trait allocations in Section 4.

The trait allocation paintbox
We now derive a de Finetti-style representation theorem for exchangeable infinite trait allocations (Definition 2.10). We will see that the distribution of any exchangeable infinite trait allocation can be expressed with a paintbox representation, which extends previous results for partitions and feature allocations (Kingman, 1978;Broderick et al., 2013). It turns out that all exchangeable infinite trait allocations have essentially the same form as in the example illustration of Fig. 1, with some additional nuance.
The high-level proof sketch is as follows. We first use the order-of-appearance from Section 3 to associate an i.i.d. sequence of uniform random labels to the traits in the sequence, in the style of Aldous (1985). We collect the multiset of labels for each index into a sequence, called the label multiset sequence; the consistency of the ordering from Theorem 3.9 implies that this construction is well-defined. We show that the label multiset sequence itself is exchangeable in the traditional sequential sense in Lemma 4.3. And we use de Finetti's theorem (Kallenberg, 1997, Theorem 9.16) to uncover its construction from conditionally i.i.d. random quantities. Finally, we relate this construction back to the original set of exchangeable infinite trait allocations to arrive at its paintbox representation in Theorem 4.5. The measure-theoretic details of this construction may be found in Appendix A.4.
As an example construction of the label multiset sequence, suppose we have T 4 = {{1, 2, 2}, {2, 4}} with order-of-appearance [T 4 ] = ({1, 2, 2}, {2, 4}), and a label sequence (φ 1 , φ 2 , . . . ). The first trait in the ordering, {1, 2, 2}, receives the first label in the sequence, φ 1 , and the second trait, {2, 4}, receives the second label, φ 2 . For each index n ∈ [4], we now collect the labels of the traits it belongs to, repeating each label according to the multiplicity of the index in that trait. Index 1 is a member of only the first trait with multiplicity 1, so its label multiset is {φ 1 }. Index 2 is a member of the first trait with multiplicity 2 and the second with multiplicity 1, so its label multiset is {φ 1 , φ 1 , φ 2 }. Similarly, for index 3 it is ∅, and for index 4 it is {φ 2 }. Putting the multisets in order (for index 1, then 2, 3, etc.), the label multiset sequence is therefore ({φ 1 }, {φ 1 , φ 1 , φ 2 }, ∅, {φ 2 }, . . . ), where the ellipsis represents the continuation beyond T 4 to T 5 , T 6 , and so on.
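The worked example above can be reproduced mechanically. The following is a minimal sketch, assuming traits are stored as `collections.Counter` objects already listed in order-of-appearance; the function name `label_multisets` and the string labels standing in for the uniform draws φ 1 , φ 2 are illustrative choices, not part of the original construction.

```python
from collections import Counter


def label_multisets(ordered_traits, labels, num_indices):
    """Return (y_1, ..., y_N), where y_n maps each label to its multiplicity for index n."""
    sequence = []
    for n in range(1, num_indices + 1):
        y_n = Counter()
        for trait, label in zip(ordered_traits, labels):
            multiplicity = trait[n]  # a Counter returns 0 for indices not in the trait
            if multiplicity > 0:
                y_n[label] += multiplicity
        sequence.append(y_n)
    return sequence


# Traits of the example, in order-of-appearance; labels are placeholders for phi_1, phi_2.
ordered_traits = [Counter([1, 2, 2]), Counter([2, 4])]
labels = ["phi_1", "phi_2"]
y = label_multisets(ordered_traits, labels, 4)
```

Here `y[1]` holds the label multiset of index 2, with `phi_1` at multiplicity 2 and `phi_2` at multiplicity 1, matching the text.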
While the φ k may be seen as a mathematical convenience for the proof, an alternative interpretation is that they correspond to trait-specific parameters in a broader generative model. Indeed, our proof would hold for φ k from any nonatomic distribution, not just the uniform. In the document modeling example, each φ k could correspond to a distribution over English words; φ k with high mass on "basketball", "luge", and "curling" could represent a "sports" topic. For this reason, we call the φ k labels.
The general construction is as follows. Let the set of (possibly empty) finite multisets of (0, 1) be denoted Y, with the usual notation y(φ) for the multiplicity of φ ∈ (0, 1) in y ∈ Y. Further, let the set of sequences of distinct elements in (0, 1) be denoted F ∞ .

Definition 4.1. Given t ∞ ∈ T ∞ and φ ∞ ∈ F ∞ , the corresponding label multiset sequence y ∞ := (y N ) of elements y N ∈ Y is defined by

$$y_N(φ) := \sum_{k=1}^{K} τ_k(N)\, \mathbb{1}[φ = φ_k], \quad \text{where } (τ_1, \dots, τ_K) = [t_N]. \tag{4.1}$$

In other words, y N is constructed by selecting the N th component of t ∞ , ordering its traits τ 1 , . . . , τ K , and then adding τ k (N ) copies of φ k to y N for each k ∈ [K]. Again, the φ k can thus be thought of as labels for the traits, and y N is the multiset of labels representing the assignment of the N th datum to its traits (hence the name label multiset sequence). This construction of y ∞ ensures that, intuitively, the "same label applies to the same trait" as N increases: the consistency of the ordering [ · ] introduced in Section 3 immediately implies that (4.2)

Definition 4.1 implicitly creates a mapping, which we denote ϕ : T ∞ × F ∞ → Y ∞ . Because the labels are distinct, the trait allocation can be recovered from its label multiset sequence via

$$t_N(τ) = \mathbb{1}[∀ n > N,\ τ(n) = 0] \cdot \left|\{φ ∈ (0, 1) : ∀ n ≤ N,\ y_n(φ) = τ(n)\}\right|. \tag{4.3}$$

The first term in the product, the indicator function, ensures that t N (τ ) is only nonzero for traits τ ∈ T that do not contain any index n > N . The second term counts the number of points φ ∈ (0, 1) for which the multiplicities in τ match those expressed by the label multiset sequence for n ≤ N . Thus, there exists another mapping φ such that t ∞ = φ(ϕ(t ∞ , φ ∞ )) (Eq. (4.4)). While φ recovers the infinite trait allocation t ∞ , neither ϕ nor φ is injective. As a simple example of why this is true, consider the sequence showing that φ is not injective. Despite this, the existence of the partial inverse φ is a crucial element in the characterization of all distributions on exchangeable infinite trait allocations in Theorem 4.5.

As both ϕ and φ are continuous and hence measurable by Proposition A.13, we can construct a random label multiset sequence Y ∞ ∈ Y ∞ by letting T ∞ be a random infinite trait allocation, Φ ∞ be a sequence of i.i.d.
Unif(0, 1) random variables (hence Φ ∞ ∈ F ∞ almost surely), and Y ∞ = ϕ(T ∞ , Φ ∞ ). In this setting, the existence of the partial inverseφ guarantees that-given the fixed choice of ∼ Unif(0, 1)-the distributions on T ∞ are in bijection with those on Y ∞ , allowing the characterization of those on Y ∞ (a much simpler space) instead. As the primary focus of this work is exchangeable infinite trait allocations, we therefore must deduce the particular family of distributions on Y ∞ that are in bijection with the exchangeable infinite trait allocations on T ∞ . Lemma 4.3 shows that this family is, as one might suspect, the exchangeable (in the classical, sequential sense) label multiset sequences. The main result required for its proof is Lemma 4.2, which states that permutation of t ∞ essentially results in the same permutation of the components of y ∞ , modulo reordering the labels in φ ∞ . In other words, permuting the data sequence represented by t ∞ leads to the same permutation of y ∞ . As an example, consider a setting in which . For a finite permutation π, we define πy ∞ := (y π −1 (N ) ) and πφ ∞ := (φ π −1 (k) ), i.e., permutations act on sequencesincluding both Y ∞ and F ∞ -by reordering elements. If we permute the observed data sequence that t 4 represents by π = (12)(34), this leads to the permutation of the indices in t 4 also by π, resulting in πt This y ∞ is the reordering of y ∞ by π, the same permutation that was used to reorder the observed data; the main result of Lemma 4.2 is that a π always exists to reorder φ ∞ such that this is the case. The proof of Lemma 4.2 may be found in Appendix B.
Lemma 4.2. For each finite permutation π and infinite trait allocation t ∞ , there exists a finite permutation π such that (4.5) Proof. If Y ∞ is exchangeable, then using the definition ofφ in Eq. (4.4), If T ∞ is exchangeable, then by Lemma 4.2 there exists a finite permutation π T∞ such that But since Φ ∞ is a sequence of i.i.d. random variables and hence also exchangeable, We are now ready to characterize all distributions on exchangeable infinite trait allocations in Theorem 4.5 using the paintbox representation provided by Definition 4.4 and depicted in Fig. 2. At a high level, this is a constructive representation involving three steps. First, we generate two (random) sequences of subsets of (0, 1), represented by the colored segments labeled (C kj ) and (C jt ) in Fig. 2, and associate the first (C kj ) with i.i.d. labels from Unif(0, 1), represented by the colors blue, red, and green. Next, we generate for each N ∈ N an auxiliary variable V N ∼ Unif(0, 1). The subsets (C kj ) containing V N determine which traits index N is a member of and its multiplicity: if V N falls in C kj , it is a member of the k th ordered trait with multiplicity j. For example, here V 1 falls in C 11 , C 22 , and C 31 , implying it is a member of two traits (blue and green) with multiplicity 1, and one trait (red) with multiplicity 2. The subsets (C jt ) containing V N determine the membership of index N in dust traits, i.e., traits that are unique to that index. If V N falls in C jt , there are an additional t traits containing only index N with multiplicity j. For example, here V 1 falls in C 13 and C 21 , meaning there are an additional three unique traits containing index 1 with multiplicity 1, and an additional trait containing index 1 with multiplicity 2. Finally, we collect these results together to form an infinite trait allocation T ∞ . Fig. 
2 shows the construction for the first 5 indices: the first step involving the (C kj ) yields the allocation {{1, 2, 2, 3, 4}, {1, 1, 2}, {1, 3, 3, 3}} of regular traits, and the second step involving the (C jt ) yields the corresponding multiset of dust traits. The trait allocation T 5 is simply the concatenation of these two multisets.

Figure 2. If V i falls in C kj , then index i is a member of regular trait k with multiplicity j. If V i falls in C jt , then index i is a member of t unique dust traits of multiplicity j.

The two sequences (C kj ) and (C jt ) allow the representation of random infinite trait allocations containing both traits that are shared by multiple data, known as regular traits, and those that are uniquely expressed by a single datum, known as dust traits. For example, in a sequence of documents generated by latent topics, one author may write a single document with a number of words that are never again used by other authors (e.g. Jabberwocky, by Lewis Carroll); in the present context, these words would be said to arise from a dust topic. Meanwhile, common collections of words expressed by many documents will group together to form regular topics. Each element Y N of the label multiset sequence Y ∞ in general has some contribution R N from the regular traits and some contribution D N from the dust traits (either or both may be empty). We say a random infinite trait allocation is regular if it has no dust traits with probability 1, and irregular otherwise.
The particular details of the paintbox representation in Definition 4.4 are as follows. The first bullet point guarantees that each datum selects a single multiplicity in each regular trait, and has a single number of dust traits at each multiplicity. The second guarantees that each datum selects only a finite number of traits overall. Eqs. (4.9) and (4.10) construct the regular and dust label multiset for each datum, and then these are collected into the label multiset sequence and finally transformed into T ∞ usingφ. Note that the way the regular subsets (C kj ) and dust subsets (C jt ) are used necessarily differs in the construction of Definition 4.4. When the auxiliary variable V N ∼ Unif(0, 1) for the N th datum is generated, V N ∈ C kj implies that index N is a member of the regular trait labeled φ k with multiplicity j, as shown in Eq. (4.9). In contrast, V N ∈ C jt implies that index N is a member of t of its own dust traits of multiplicity j, as shown in Eq. (4.10). The dust traits for each index N are kept separate by having a separate sequence of labels (φ N j ) j, ∈N for that index.
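Definition 4.4. A random infinite trait allocation T ∞ has a paintbox representation if there exist two random sequences (C kj ), (C jt ) of Lebesgue-measurable open subsets of (0, 1) such that
• for all k, j, t, ℓ ∈ N, C kj ∩ C kℓ = ∅ if j ≠ ℓ and C jt ∩ C jℓ = ∅ if t ≠ ℓ, and
• any V ∈ (0, 1) is an element of finitely many sets C kj and C jt ,
where T ∞ has distribution induced by the following construction:
(1) generate i.i.d. Unif(0, 1) labels (φ k ) for the regular traits and (φ N jℓ ) for the dust traits of each index N ;
(2) for each N ∈ N, generate V N ∼ Unif(0, 1) independently, and form the regular label multiset R N (Eq. (4.9)), which contains φ k with multiplicity j whenever V N ∈ C kj , and the dust label multiset D N (Eq. (4.10)), which contains t of the labels (φ N jℓ ) with multiplicity j each whenever V N ∈ C jt ; the label multiset Y N combines R N and D N ; and
(3) assemble the label multiset sequence Y ∞ = (Y N ) and set T ∞ = φ(Y ∞ ).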
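To make the role of the auxiliary variable V N concrete, the following sketch reads off the memberships of one index from a fixed paintbox. It assumes each subset is stored as a list of open intervals; the dictionaries `regular` and `dust`, the helper names, and all numerical values are hypothetical and do not correspond to Fig. 2.

```python
import random


def contains(intervals, v):
    """True if v lies in a union of open intervals given as (low, high) pairs."""
    return any(low < v < high for low, high in intervals)


def memberships(regular, dust, v):
    """Read off the trait memberships of one index from its auxiliary variable v.

    regular[(k, j)] plays the role of C_kj: v in it means regular trait k, multiplicity j.
    dust[(j, t)] plays the role of the dust subset C_jt: v in it means t dust traits
    of multiplicity j that are unique to this index.
    """
    regular_mult = {k: j for (k, j), subset in regular.items() if contains(subset, v)}
    dust_count = {j: t for (j, t), subset in dust.items() if contains(subset, v)}
    return regular_mult, dust_count


# Hypothetical paintbox: subsets for a fixed trait k (or fixed j) are disjoint.
regular = {
    (1, 1): [(0.0, 0.4)],
    (1, 2): [(0.4, 0.5)],
    (2, 1): [(0.3, 0.6)],
}
dust = {(1, 2): [(0.0, 0.1)]}

rng = random.Random(0)
for n in range(1, 6):
    v = rng.random()  # V_n ~ Unif(0, 1), independent across indices
    print(n, memberships(regular, dust, v))
```

Because the V N are drawn i.i.d., relabeling the indices does not change the distribution of the output, which is the easy direction of Theorem 4.5.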
Theorem 4.5 is the main result of this section, which shows that exchangeable infinite trait allocations are precisely those which have a paintbox representation. To obtain a full characterization of exchangeable infinite trait allocations, we must handle both those which are regular and irregular. We will attack the problem by characterizing the distribution of the exchangeable label multiset sequence Y ∞ with de Finetti's theorem; however, the existence of a dust component D N of Y N leads to a diffuse component in the directing measure which is difficult to study. Instead, we introduce a technique whereby we map the dust to a discrete space in the proof of Theorem 4.5, which renders the mixing measure discrete and easier to characterize.
Theorem 4.5. T ∞ is exchangeable iff it has a paintbox representation.
Proof. If T ∞ has a paintbox representation, then it is exchangeable by the fact that the V N are i.i.d. uniform random variables. In the other direction, let T ∞ be an exchangeable infinite trait allocation, and let Φ ∞ be a sequence of i.i.d. Unif(0, 1) random variables. Then there is a random label multiset sequence Y ∞ = ϕ(T ∞ , Φ ∞ ) which is exchangeable by Lemma 4.3. Since we can recover T ∞ from Y ∞ via T ∞ =φ(Y ∞ ), it suffices to characterize the distribution of Y ∞ and then reconstruct T ∞ .
We split Y N into its regular component R N and dust component D N , which represent, respectively, traits that are expressed by multiple data points and those that are expressed only by data point N , defined for φ ∈ (0, 1) by

$$R_N(φ) := Y_N(φ)\, \mathbb{1}[∃ M ≠ N : Y_M(φ) > 0], \qquad D_N(φ) := Y_N(φ) - R_N(φ). \tag{4.13}$$

Next, we collect the multiplicities in D N into a multiset of natural numbers,

$$\overline{D}_N(j) := \left|\{φ ∈ (0, 1) : D_N(φ) = j\}\right|. \tag{4.14}$$

Note that we can recover the distribution of Y ∞ from that of the sequence of pairs $(R_N, \overline{D}_N)$. Since this sequence is exchangeable and satisfies some other technical conditions (see Appendix A.4 and Proposition A.12 in particular), de Finetti's theorem (Kallenberg, 1997, Theorem 9.16) states that there exists a random probability measure α conditioned on which the pairs are i.i.d. with distribution α. We condition on a known value α = µ of this random probability measure. Since each φ ∈ R N appears in some R M for M ≠ N by definition in Eq. (4.13), the probability of φ appearing in R for (R, D) ∼ µ must be strictly greater than 0. Denoting by {φ k } the countable collection of labels across all (R N ), µ must be supported on the product (finite multisets of {φ k })×(finite multisets of N). Since this itself is a countable (discrete) set, the probability of $(R_N, \overline{D}_N)$ under µ must be strictly greater than 0. Therefore, µ can be written as $µ = \sum_i θ_i δ_{(r_i, d_i)}$, where 1 ≥ θ 1 ≥ θ 2 ≥ · · · > 0, $\sum_i θ_i = 1$, and the (r i , d i ) are the unique pairs $(R_N, \overline{D}_N)$ ordered to correspond to the θ i .

Figure 3. Fig. 3a shows the partition paintbox (with regular/dust components shown together for clarity); note that the subsets form a partition of (0, 1), and the only possible multiplicity is 1. Fig. 3b shows the feature paintbox; note that the only allowed multiplicity is 1, but each index may be a member of multiple regular and dust features.
Define the open interval $I_i = \left(\sum_{j=1}^{i-1} θ_j,\ \sum_{j=1}^{i} θ_j\right)$. Next, let $\mathcal{C}_{kj} = \{i : r_i(φ_k) = j\}$ and $C_{kj} = \bigcup_{i ∈ \mathcal{C}_{kj}} I_i$ for k, j ≥ 1, and let $\mathcal{C}_{jt} = \{i : d_i(j) = t\}$ and $C_{jt} = \bigcup_{i ∈ \mathcal{C}_{jt}} I_i$ for t, j ≥ 1, where the first pair refers to the regular subsets and the second to the dust subsets. Note that C kj ∩ C kℓ = ∅ if ℓ ≠ j and C jt ∩ C jℓ = ∅ if t ≠ ℓ by construction. Furthermore, since µ is random, so is the countable collection of subsets C kj and C jt . Any φ ∈ (0, 1) is in finitely many C kj and finitely many C jt by construction. The C kj and C jt are Lebesgue-measurable and open since they are countable unions of open intervals. Constructing R N and D N using the auxiliary variables V N selects (R N , D N ) = (r j , d j ) with probability θ j , as required.
The representation in Theorem 4.5 generalizes both the Kingman partition paintbox (Kingman, 1978) and the feature paintbox (Broderick et al., 2013), given by Corollaries 4.7 and 4.8 and shown in Figs. 3a and 3b. Further, Corollary 4.8 is the first feature paintbox representation that accounts for the possibility of dust features; previous results were limited to regular feature allocations (Broderick et al., 2013). It also makes the distinction between regular and irregular trait allocations straightforward, as shown by Corollary 4.6. Let λ denote the Lebesgue measure on (0, 1).
Corollary 4.6. If T ∞ is regular, C jt = ∅ for all j, t almost surely.
Corollary 4.8. A random infinite feature allocation T ∞ is exchangeable iff it has a paintbox representation such that ∀k, j, t ∈ N : j ≥ 2, C kj = C jt = ∅.
Theorem 4.5 enforces no conditions that rely on the order of the regular subsets in the paintbox, i.e., the order of the sequences (C kj ) j∈N across k ∈ N. Any reordering of the regular subsets in the paintbox for T ∞ yields another equally valid paintbox for T ∞ . However, it is often useful to ensure the paintbox elements are generated in a particular order-for example, it would make sense for the most popular or frequently expressed traits to be ordered before those that are less popular; and those with the same popularity should be ordered by how likely they are expressed with multiplicity 1, then 2, 3, etc. Fig. 4 depicts such an ordering: Fig. 4a shows the unordered regular paintbox from Fig. 2a with its subsets marked by their Lebesgue measure, and Fig. 4b shows the resulting ordered paintbox. Since the red trait has total measure 0.35 which is less than that of the blue and green traits, it is ordered last. The blue and green traits have the same total measure of 0.5, so they are ordered by the measure of their multiplicity 1 subsets; and since the blue trait's measure 0.40 exceeds that of the green trait at 0.20, the final size-ordering puts the traits in the order blue, green, then red.
The probability that a datum expresses the trait labeled φ k in the paintbox is $\sum_j λ(C_{kj})$, the ℓ1-norm of the sequence (λ(C kj )) j∈N , and the probability that a datum expresses it with multiplicity j is λ(C kj ). We thus require an ordering on the sequences of Lebesgue measures of (C kj ) j∈N across k ∈ N, which is provided by Definition 4.9 and Lemma 4.10; given the ordering, we can strengthen the result of Theorem 4.5 to enforce the existence of such a size-ordered paintbox, as shown in Proposition 4.11 and depicted in Fig. 4.

Definition 4.9. For two vectors a, b ∈ (0, 1) ∞ with finite ℓ1-norm, we say a ≺ b if $\|a\|_1 > \|b\|_1$, or if $\|a\|_1 = \|b\|_1$ and there exists j ∈ N such that a j > b j and ∀k < j, a k = b k . This induces a nonstrict relation $a \preceq b \iff a ≺ b \text{ or } a = b$.
Proof. The norm comparison creates a strict partial order trivially; this induces a total order on the equivalence classes of vectors with equal ℓ1-norm. Within each equivalence class, the element-wise comparison creates a strict total order, which can be shown using similar techniques to the proof of Proposition 3.5. The overall relation is therefore a strict total order.
Proposition 4.11. T ∞ is exchangeable iff it has a size-ordered paintbox representation, satisfying (in addition to all the conditions in Theorem 4.5)

$$k ≤ k' \iff (λ(C_{kj}))_{j ∈ \mathbb{N}} \preceq (λ(C_{k'j}))_{j ∈ \mathbb{N}}. \tag{4.16}$$

Proof. If T ∞ has a size-ordered paintbox, then it has a paintbox, and thus is exchangeable by Theorem 4.5. Next assume T ∞ is exchangeable, and therefore has the paintbox (C kj ), (C jt ) from Theorem 4.5. Given any bijection π : N → N, we have that (C π(k)j ), (C jt ) is also a paintbox for T ∞ . Therefore, we seek such a bijection that size-orders the regular subsets in the standard paintbox. Define an equivalence relation ∼ on paintbox elements, where (C kj ) j∈N ∼ (C k′j ) j∈N if (λ(C kj )) j∈N = (λ(C k′j )) j∈N . The relation ⪯ is a total order on the induced equivalence classes. Therefore, let K i be the set of indices k ∈ N such that (C kj ) j∈N is in the i th equivalence class under that order. Thus (K i ) is a partition of N, and |K i | < ∞ for all i ∈ N since otherwise the paintbox violates the constraint that each V ∈ (0, 1) must be in finitely many sets. Therefore, the mapping π : N → N that sends each index to its position when the indices are listed class by class (those in K 1 first, then those in K 2 , and so on) is a bijection such that (C π −1 (k)j ), (C jt ) is a size-ordered paintbox.
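The ordering of Definition 4.9 can be expressed as a sort key, which makes the Fig. 4 example easy to check. This is a sketch under assumptions: each trait's measure sequence is truncated to finitely many entries of equal length, and only the totals and multiplicity-1 measures of the blue and green traits and the total of the red trait are taken from the text; the remaining splits are hypothetical.

```python
from fractions import Fraction


def size_key(measures):
    """Sort key realizing a < b in the sense of Definition 4.9.

    Larger l1-norm comes first; ties are broken by the larger entry
    at the first position where the two sequences differ.
    """
    total = sum(measures)
    return (-total, tuple(-m for m in measures))


# Measures (lambda(C_k1), lambda(C_k2)) per trait, as exact fractions to avoid rounding ties.
paintbox = {
    "red": [Fraction(25, 100), Fraction(10, 100)],    # total 0.35; split is hypothetical
    "green": [Fraction(20, 100), Fraction(30, 100)],  # total 0.50, multiplicity-1 measure 0.20
    "blue": [Fraction(40, 100), Fraction(10, 100)],   # total 0.50, multiplicity-1 measure 0.40
}

order = sorted(paintbox, key=lambda name: size_key(paintbox[name]))
```

Sorting with this key places blue before green (equal totals, larger multiplicity-1 measure) and red last (smaller total), in agreement with the ordering described for Fig. 4b.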

Frequency models and probability functions
The set of exchangeable infinite trait allocations encompasses a very expressive class of random infinite trait allocations: membership in different regular traits at varying multiplicities can be correlated, membership in dust traits can depend on membership in regular traits, etc. While interesting, this generality makes constructing models with efficient posterior inference procedures difficult. A simplifying assumption one can make is that given the measures of the trait paintbox subsets, the membership of an index in a particular trait is independent of its membership in other traits. This assumption is often acceptable in practice, and limits the exchangeable infinite trait allocations to a subset-which we refer to as frequency models-for which efficient inference is often possible. Frequency models, as used in the present context, generalize the notion of a feature frequency model (Broderick et al., 2013) for feature allocations. A pictorial representation of a frequency model is shown in Fig. 5.
At a high level, this constructive representation consists of three steps. First, we generate random sequences of nonnegative reals (θ kj ) and (θ j ) such that $\sum_{k,j} θ_{kj} < ∞$, $\sum_j θ_j < ∞$, and ∀k ∈ N, $\sum_j θ_{kj} ≤ 1$. The quantity θ kj is the probability that an index joins regular trait k with multiplicity j, while θ j is the average number of dust traits of multiplicity j for each index. Next, each index N ∈ N independently samples its multiplicity ξ N k in regular trait k from the discrete distribution $(θ_{kj})_{j=0}^{∞}$, where $θ_{k0} := 1 - \sum_j θ_{kj}$ is the probability that the index is not a member of trait k. In Fig. 5, for example, ξ 11 = 1, ξ 12 = 2, and ξ 13 = 1, so index 1 selects the blue trait with multiplicity 1, the red trait with multiplicity 2, and the green trait with multiplicity 1. For each j ∈ N, each index N ∈ N is also a member of an additional ξ′ N j ∼ Poiss(θ j ) dust traits of multiplicity j, sampled independently across N and j. In Fig. 5, for example, ξ′ 11 = 3, ξ′ 12 = 1, and ∀j > 2, ξ′ 1j = 0, so index 1 is a member of an additional three dust traits with multiplicity 1 and one dust trait with multiplicity 2. Finally, we collect these results together to form an infinite trait allocation T ∞ . Fig. 5 shows the construction for only the first five indices, resulting in the same trait allocation T 5 of [5] as discussed earlier in Section 4 and shown in Fig. 2.
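The three steps above translate directly into a sampler for a single index. The sketch below assumes the parameter sequences are truncated to finitely many regular traits and multiplicities, which the model itself does not require; the function names and parameter values are hypothetical, and the Poisson draw uses Knuth's multiplication method only to keep the example free of external libraries.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_index(theta_regular, theta_dust, rng):
    """Sample the memberships of one index in a truncated frequency model.

    theta_regular[k-1][j-1] is the probability of multiplicity j in regular trait k;
    the leftover mass 1 - sum_j is the probability of not joining trait k.
    theta_dust[j-1] is the Poisson rate of dust traits with multiplicity j.
    """
    regular = {}
    for k, probs in enumerate(theta_regular, start=1):
        u, cumulative = rng.random(), 0.0
        for j, p in enumerate(probs, start=1):
            cumulative += p
            if u < cumulative:
                regular[k] = j  # index joins regular trait k with multiplicity j
                break
    dust = {}
    for j, rate in enumerate(theta_dust, start=1):
        count = sample_poisson(rate, rng)
        if count > 0:
            dust[j] = count  # number of dust traits of multiplicity j for this index
    return regular, dust


rng = random.Random(0)
theta_regular = [[0.4, 0.1], [0.2, 0.3], [0.25, 0.1]]  # hypothetical values
theta_dust = [0.5, 0.1]                                # hypothetical values
allocation = [sample_index(theta_regular, theta_dust, rng) for _ in range(5)]
```

Each entry of `allocation` pairs a dictionary of regular multiplicities with a dictionary of dust-trait counts; independence across traits given the parameters is what distinguishes this model from a general paintbox.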
Although considerably simpler than general exchangeable infinite trait allocations, we still cannot exactly represent or store the frequency model parameters (θ kj ), (θ j ) in practice, as they can have potentially infinitely many nonzero values. This means it is not possible to condition on their values in any exact posterior inference algorithm for T N given a sequence of N data points. We often do, however, have an exact formula for the marginal distribution of T N that we can use instead (Griffiths and Ghahramani, 2005;Thibaux and Jordan, 2007;James, 2014;Broderick et al., 2015b). But since that is indeed the case, it is more natural and easier in practice to place a simplifying assumption directly on the marginal distribution of T N itself, rather than on its paintbox representation.
Figure 5. The trait frequency model. For each index i ∈ N, the variables (ξ ik ) and (ξ′ ij ) determine its membership in the regular and dust traits, respectively. Fig. 5a: for each k ∈ N, ξ ik is the multiplicity of index i in trait k, and is sampled independently across k from the discrete distribution $(θ_{kj})_{j=0}^{∞}$. Fig. 5b (the dust frequency model): for each j ∈ N, ξ′ ij is the number of dust traits of index i with multiplicity j, and is sampled independently across j from Poiss(θ j ).

It remains to decide which such assumption should be made. One choice is to enforce that the distribution of T N depends only on the multiplicities of indices in its traits (rather than the particular indices themselves) and on the number of unique orderings of t N . Similar formulations of this assumption for other exchangeable combinatorial structures, collectively known as exchangeable probability functions, have been introduced in previous work (Pitman, 1995; Broderick et al., 2013), and have been used extensively in the design of posterior inference algorithms (Escobar, 1994; Lee et al., 2013). We let κ(t N ) be the number of unique orderings of traits in t N , and use the multiplicity profile¹ of t N , given by Definition 5.2, to capture the multiplicities of indices in its traits. The multiplicity profile of a trait is defined to be the multiset of multiplicities of its elements, while the multiplicity profile of a finite trait allocation is the multiset of multiplicity profiles of its traits. As an example, the multiplicity profile of a trait {1, 3, 4, 4, 2, 2, 2, 2} is {1, 1, 2, 4}, since there are two elements of multiplicity 1, one element of multiplicity 2, and one of multiplicity 4 in the trait. If we are given the finite trait allocation {{1, 1, 2}, {2}, {3}, {3, 3, 3, 3, 1}}, then its multiplicity profile is {{1, 2}, {1}, {1}, {1, 4}}. Note that a multiplicity profile is itself a trait allocation, though not always of the same indices (here, the trait allocation is of [3], and its multiplicity profile is a trait allocation of [4]).
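The two examples in the preceding paragraph can be checked with a few lines of code. This sketch assumes a trait is given as a plain list of indices with repetition; storing each trait profile as a sorted tuple is purely an implementation convenience so that profiles can serve as dictionary keys.

```python
from collections import Counter


def trait_profile(trait):
    """Multiset of multiplicities of the indices in a trait (multiplicity -> count)."""
    index_counts = Counter(trait)          # index -> multiplicity in the trait
    return Counter(index_counts.values())  # multiplicity -> number of indices with it


def allocation_profile(allocation):
    """Multiset of the trait profiles of a finite trait allocation."""
    return Counter(tuple(sorted(trait_profile(t).elements())) for t in allocation)


profile = sorted(trait_profile([1, 3, 4, 4, 2, 2, 2, 2]).elements())
allocation = [[1, 1, 2], [2], [3], [3, 3, 3, 3, 1]]
profiles = allocation_profile(allocation)
```

Here `profile` is `[1, 1, 2, 4]`, and `profiles` contains `(1, 2)` once, `(1,)` twice, and `(1, 4)` once, matching {{1, 2}, {1}, {1}, {1, 4}}.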
Definition 5.2. The multiplicity profile $\overline{\,\cdot\,}$ : T → T of a trait τ ∈ T is defined as

$$\overline{τ}(n) := \left|\{m ∈ \mathbb{N} : τ(m) = n\}\right|, \tag{5.5}$$

and is overloaded for finite trait allocations $\overline{\,\cdot\,}$ : T → T as

$$\overline{t_N}(τ) := \sum_{τ' :\, \overline{τ'} = τ} t_N(τ'). \tag{5.6}$$

We also extend Definition 5.2 to ordered trait allocations ℓ N , where the multiplicity profile is simply the ordered multiplicity profiles of its traits, i.e., $\overline{ℓ_N}$ is defined such that ∀k ∈ N, $(\overline{ℓ_N})_k := \overline{ℓ_{Nk}}$. Note the similarity between Eq. (5.5) and the earlier notation introduced in Eq. (4.14) in the proof of the paintbox representation; Definition 5.2 provides the natural extension of that definition as applied to traits.
1 A very similar quantity is known in the population genetics literature as the site (or allele) frequency spectrum (Bustamante et al., 2001), though it is typically defined there as an ordered sequence or vector rather than as a multiset.
The precise simplifying assumption on the marginal distribution of T N that we employ in this work is provided in Definition 5.3, which generalizes past work on exchangeable probability functions (Pitman, 1995;Broderick et al., 2013).
Definition 5.3. A random infinite trait allocation T ∞ has an exchangeable trait probability function (ETPF) if there exists a function p : N × T → R + such that for all N ∈ N, (5.7) One of the primary goals of this section is to relate exchangeable infinite trait allocations with frequency models to those with ETPFs. The main result of this section, Theorem 5.4, shows that these two assumptions are actually equivalent: any random infinite trait allocation T ∞ that has a frequency model (including those with random (θ kj ), (θ j ) of arbitrary distribution) has an ETPF, and any random infinite trait allocation with an ETPF has a frequency model. Therefore, we are able to use the simple construction of frequency models in practice via their associated ETPFs.
Theorem 5.4. T ∞ has a frequency model iff it has an ETPF.
The key to the proof of Theorem 5.4 is the uniformly ordered infinite trait allocation, given in Definition 5.7: intuitively, if L ∞ ∈ L ∞ is the uniform ordering of a random infinite trait allocation T ∞ ∈ T ∞ , then for each N ∈ N, L N +1 is constructed by inserting the new traits in T N +1 relative to T N into uniformly random positions among the elements of L N . This guarantees that L N is marginally a uniform random permutation of [T N ] for each N ∈ N, and that L ∞ is a consistent sequence, i.e. L ∞ ∈ L ∞ . There are two advantages to analyzing L ∞ rather than T ∞ itself. First, the ordering removes the combinatorial difficulties associated with analyzing T ∞ . Second, the traits are independent of their ordering, thereby avoiding the statistical coupling of the ordering based solely on [ · ].
The precise definition of the uniform ordering L ∞ in Definition 5.7 is based on associating traits with uniformly distributed distinct values in (0, 1), φ ∞ ∈ F ∞ , and ordering the traits based on the order of those values. To do so, we require a definition of the finite permutation π n that rearranges the first n elements of a sequence φ ∞ ∈ F ∞ to be in order and leaves the rest unchanged, known as the n th order mapping π n of φ ∞ . For example, if φ ∞ = (0.4, 0.1, 0.3, 0.2, 0.5, . . . ), then π 3 is represented in cycle notation as (321), and π 3 φ ∞ = (0.1, 0.3, 0.4, 0.2, 0.5, . . . ). The precise formulation of this notion is provided by Definition 5.5.
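The example above can be verified with a short sketch. It assumes the sequence is given as a finite prefix stored in a list, and represents the permutation as a dictionary from old position to new position (1-indexed); both function names are illustrative.

```python
def apply_order_mapping(phi, n):
    """Return the sequence with its first n entries sorted and the rest unchanged."""
    return sorted(phi[:n]) + list(phi[n:])


def order_mapping(phi, n):
    """Return the n-th order mapping as a dict: old position -> new position (1-indexed)."""
    ranked = sorted(range(n), key=lambda i: phi[i])  # old positions listed by increasing value
    return {old + 1: new + 1 for new, old in enumerate(ranked)}


phi = [0.4, 0.1, 0.3, 0.2, 0.5]
reordered = apply_order_mapping(phi, 3)
pi_3 = order_mapping(phi, 3)
```

The dictionary `pi_3` sends 3 to 2, 2 to 1, and 1 to 3, which is the cycle (321) from the text, and `reordered` begins 0.1, 0.3, 0.4 with the remaining entries untouched.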
(5.8) Definition 5.6 shows how to use the n th order mappings for a sequence φ ∞ ∈ F ∞ to order a trait allocation and develop a label multiset sequence from that ordering: we rearrange the order-of-appearance of T N using the K th N order mapping π K N where K N is the number of traits in T N , and then associate labels π K N φ ∞ with the traits to form the label multiset sequence.
Definition 5.6. Given the sequence φ ∞ ∈ F ∞ and its sequence of order mappings (π n ), the φ ∞ -ordering of t ∞ ∈ T ∞ is denoted ∞ = ( N ) ∈ L ∞ and defined by (5.9) where ρ N := π K N and K N = τ ∈T t N (τ ) is the number of traits in t N . The corresponding φ ∞ -ordered label multiset sequence y ∞ = (y N ) ∈ Y ∞ is defined by (5.10) If the sequence φ ∞ in Definition 5.6 is taken to be randomly generated i.i.d. from Unif(0, 1), then the resulting φ ∞ -ordered trait allocation is said to be uniformly ordered.
Definition 5.7. The uniform ordering L ∞ of an infinite trait allocation T ∞ is the Φ ∞ -ordering of T ∞ , where Φ ∞ is a sequence of i.i.d. Unif(0, 1) random variables.
We also say that L ∞ is a uniformly ordered infinite trait allocation, with the associated uniform label multiset sequence Y ∞ from Definition 5.6. Note that generating Φ ∞ i.i.d. from any purely diffuse distribution on (0, 1) would suffice to make L ∞ uniformly ordered; we specify the uniform distribution only for concreteness. Further, for any φ ∞ and t ∞ , we can recover t ∞ from the φ ∞ -ordered label multiset sequence y ∞ using the same transformation as for the earlier definition of label multiset sequences, i.e. t ∞ =φ(y ∞ ) withφ as defined in Eq. (4.4). Therefore we The proof of Theorem 5.4 relies on Lemma 5.8, a collection of two technical results associated with uniformly ordered infinite trait allocations L ∞ for which the associated unordered infinite trait allocation T ∞ has an ETPF. The first result states that L N and L N +k are conditionally independent given L N for any N, k ∈ N; essentially, if the distribution of L N only depends on its multiplicity profile, knowing the multiplicity profiles of further uniformly ordered trait allocations in the sequence L ∞ does not provide any extra useful information about L N . The second result simply states that the distribution of L N | L N is uniform. The proof of Lemma 5.8 may be found in Appendix B.
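Lemma 5.8. If T ∞ has an ETPF, and L ∞ is the uniform ordering of T ∞ , then for any N, k ∈ N,

$$L_N \perp\!\!\!\perp \overline{L_{N+k}} \mid \overline{L_N}, \tag{5.11}$$

and $L_N \mid \overline{L_N}$ is uniformly random over the ordered trait allocations of [N ] that are consistent with $\overline{L_N}$.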
Proof of Theorem 5.4. Assume T ∞ has a frequency model as in Definition 5.1. If L ∞ is its uniform ordering, then (5.12) where N is any ordering of the finite trait allocation t N . It remains to show that there exists a function p such that The major difficulty in doing so is that there is ambiguity in how L N = N was generated from the frequency model; any trait N k for which N k is a singleton (i.e., N k contains a single unique index) may correspond to either a dust or regular trait. Therefore, we must condition on both the frequency model parameters and the (random) dust/regular assignments of the K traits in N . We let A j ⊂ [K], j ∈ N be the set of components of N corresponding to dust traits of multiplicity j. We further let Q be the set of sequences (A j ) such that k ∈ A j =⇒ N k = {j} for all k, j ∈ N, i.e., those that are possible dust/regular assignments of the traits given N . Note in particular that Q is a function of N but not N . Then by the tower property, (5.14) Expanding the inner conditional probability, and defining A = [K] \ j A j , . (5.15) The first term in the product relates to the dust. Given that we know the positions and multiplicities of dust in L N , the only remaining randomness is in which index expresses each dust trait; and since L N has a uniformly random order, the probability of any index expressing dust at an index is simply 1/N . The indicator expresses the fact that the probability of observing L N = N is 0 if it is inconsistent with the dust assignments (A j ). The second and third terms are simply the sum over the probabilities of all ways the (θ kj ) could have generated the observed regular traits. Note that the expression in Eq. (5.15) is a function of only N and N , and therefore so is P (L N = N ) in Eq. (5.14). 
But since L N is a uniformly ordered trait allocation, P(L N = N ) is invariant to reordering N , so it is invariant to reordering N ; and since N is some ordering of the traits in t N , P(L N = N ) is a function of only t N and N . Therefore, there exists some function p such that (5.16) and T ∞ has an ETPF as required.
Next, assume T ∞ has an ETPF. We will show that the uniform label multiset sequence Y ∞ is generated from a frequency model by studying the random finite sequence (5.17) where ρ N Φ ∞ is the N th ordering of Φ ∞ , and L N is the uniform ordering of T N . We simulate the above quantity by iteratively constructing Y n | (Y m ) is a function of L N and ρ N Φ ∞ , and L N | L N has the uniform distribution by Lemma 5.8, we can generate Y n | (Y m ) n−1 m=1 , ρ N Φ ∞ , L N by setting, independently across the traits k ∈ N, the multiplicity Y nk of Φ ρ −1 N (k) ∈ (0, 1) in Y n to j with probability proportional to L N k (j) minus the number of label multisets in (Y m ) n−1 m=1 that already selected that same trait with multiplicity j. Mathematically, With this iterative construction, we are free to terminate at any M < N yielding the marginal distribution of (Y m ) (5.23) If we define the σ-algebra based on these two quantities is a reverse martingale, since G N +1 ⊂ G N for each N ∈ N, and by the tower property By the guaranteed almost-sure convergence of reverse martingales (Kallenberg, 1997, Theorem 6.23), we have for any M ∈ N and (y m ) M m=1 , Note that Y n ⊥ ⊥ Y 1 , . . . , Y n−1 | G for each n = 1, . . . , M , since → 0 as N → ∞ by the assumption that each index is a member of only finitely many traits. Therefore, (Y n ) M n=1 | G are i.i.d. random multisets of (0, 1) conditioned on G. Since this holds for any M ∈ N, the Kolmogorov extension theorem guarantees that Y ∞ | G is an i.i.d. sequence of random multisets of (0, 1) conditioned on G; it thus suffices to characterize the distribution of Y 1 | G.
To do so, we provide an alternate construction of Y 1 | G N to the one in Eq. (5.18). Define D N j to be the set of component indices in L N corresponding to the "dust-like" traits of multiplicity j, and R N to be the remaining component indices corresponding to nonempty "regular-like" traits, The construction of Y 1 | G N from Eq. (5.18) can be equivalently performed in two steps. First, independently for every k ∈ R N , we set the multiplicity of Φ ρ −1 N (k) in Y 1 to j ∈ N with probability L N k (j)/N , and to 0 with probability 1 − j L N k (j)/N . Then for each j ∈ N, we generate S j ∼ Binom(|D N j | , 1/N ), select a subset of D N j of size S j uniformly at random, and set the multiplicity in Y 1 of Φ ρ −1 N (k) for each k in the subset to j.
Considering the convergence of Y 1 | G N d → Y 1 | G as N → ∞, the first step implies the existence of a countable sequence (Φ k ) in (0, 1) (a rearrangement of some subset of the sequence Φ ∞ ) and sequences of nonnegative reals (θ kj ) ∞ j=0 such that independently across k ∈ N. Using the law of small numbers (Ross, 2011, Theorem 4.6) on the binomial distribution for S j (with shrinking probabilities 1/N as N → ∞), and the fact that Φ ∞ i.i.d.
∼ Unif(0, 1), the second step implies that there exists a sequence of positive reals (θ j ) such that where Y 1 | G additionally has Poiss(θ j ) elements uniformly distributed on (0, 1) with multiplicity j. Therefore Y 1 | G has distribution given by ∼ Unif(0, 1). (5.34) Finally, j θ kj ≤ 1 by the above construction, and both k,j θ kj < ∞ and j θ j < ∞ almost surely, since otherwise the second Borel-Cantelli lemma combined with the i.i.d. nature of Y ∞ | G would imply that each Y n is not finite multiset, which contradicts the assumption that any index is a member of only finitely many traits almost surely. Thus T ∞ =φ(Y ∞ ) has a frequency model. By setting θ kj = θ j = 0 for all k, j ∈ N : j > 1, Theorem 5.4 can be used to recover the correspondence between random infinite feature allocations with an exchangeable feature probability function (EFPF) and those with a feature frequency model, both defined in earlier work by Broderick et al. (2013). In the present context, an EFPF is an ETPF where p(N, t N ) > 0 only for t N that are feature allocations per Definition 2.3. These are exactly the t N for which t N only contains traits τ of the form {1, 1, 1, . . . , 1}, i.e., t N (τ ) > 0 only if ∀n > 1, τ (n) = 0.
Corollary 5.9. A random infinite feature allocation has a feature frequency model iff it has an EFPF.
For exchangeable infinite partitions, the result is stronger: all exchangeable infinite partitions have an exchangeable partition probability function (EPPF) (Pitman, 1995), defined as a summable symmetric function of the partition sizes times $K!$, where $K$ is the number of partition elements. Theorem 5.4 cannot be directly used to recover this result: no choice of $(\theta_{kj})$, $(\theta_j)$ in Definition 5.1 or $p(N, t_N)$ in Definition 5.3 guarantees that the resulting $T_\infty$ is a partition. The key issue is that in trait allocations with frequency models, the membership of each index in the traits is independent across the traits, while in partitions each index is a member of exactly one trait. In the EPPF, this manifests itself as an indicator function that tests whether the traits exhibit a partition structure, whereas no such test exists in the ETPF (or EFPF, by extension). As trait allocations generalize not only partitions, but other combinatorial structures with restrictions on index membership as well (cf. Section 6), it is of interest to find a generalization of the correspondence between frequency models and ETPFs that applies to these constrained structures. We thus require a way of extracting the memberships of a single index in a trait allocation, referred to as its membership profile (Definition 5.10), so that we can check whether it satisfies constraints on the combinatorial structure. For example, if we have the trait allocation $t_4 = \{\{1, 1, 2\}, \{1, 2, 3\}, \{1\}\}$, then the membership profile of index 1 is $\{1, 1, 2\}$, since index 1 is a member of two traits with multiplicity 1, and one trait with multiplicity 2. The membership profile of an index may be empty; for example, here the membership profile of index 4 in $t_4$ is $\emptyset$.
Finally, and crucially, the membership profile for an index does not change as more data are observed: for an infinite trait allocation t ∞ ∈ T ∞ , if τ is the membership profile of index n in t N for n ≤ N , then for all M ≥ N , τ is the membership profile of index n in t M .
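The membership profile example above can be checked mechanically. The following is a minimal sketch, assuming a trait allocation is stored as a list of traits and each trait as a list of indices with repeats; the function names `membership_profile` and `restrict` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def membership_profile(trait_allocation, index):
    """Sorted multiset of the nonzero multiplicities of `index` across traits."""
    profile = []
    for trait in trait_allocation:
        multiplicity = Counter(trait)[index]
        if multiplicity > 0:
            profile.append(multiplicity)
    return sorted(profile)


def restrict(trait_allocation, n_max):
    """Drop all indices greater than n_max, then drop traits that become empty."""
    restricted = [[i for i in trait if i <= n_max] for trait in trait_allocation]
    return [trait for trait in restricted if trait]


# The trait allocation t_4 from the text.
t4 = [[1, 1, 2], [1, 2, 3], [1]]

profile_of_1 = membership_profile(t4, 1)   # two traits with multiplicity 1, one with 2
profile_of_4 = membership_profile(t4, 4)   # index 4 belongs to no trait

# The profile of index 1 is unchanged when fewer data are observed.
profile_of_1_in_t2 = membership_profile(restrict(t4, 2), 1)
```

Restricting `t4` to its first two indices leaves the profile of index 1 unchanged, which illustrates the stability property stated above on this one example only.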
Definition 5.10. The membership profile of index n in a finite trait allocation t N is the multisubset t  Definitions 5.11 and 5.12 provide definitions of a frequency model and exchangeable probability function for combinatorial structures with constraints on the membership profiles that are analogous to the earlier unconstrained versions in Definitions 5.1 and 5.3. The intuitive connection to these earlier definitions is made through rejection sampling. First, we define an acceptable set of membership profiles, known as the constraint set C ⊂ T ∪ {∅}. Then, for trait allocations with a constrained exchangeable trait probability function (CETPF) in Definition 5.12, we generate T N from the associated unconstrained ETPF and check if all indices n ∈ [N ] have membership profiles falling in C. If this check fails, we repeat the process, and otherwise output T N as a sample from the distribution. Likewise, for trait allocations with a constrained frequency model, we generate Y n , n = 1, 2, . . . , N , progressively checking if all the indices in the associated T n , n = 1, 2, . . . , N have membership profiles in C. If any check fails, we repeat the generation of Y n for that index n ∈ N until it passes. We continue this process until we reach N ∈ N, and output T N as a sample from the distribution. To sample T ∞ , we do the same thing but do not terminate the sequential construction at any finite N ∈ N.
Definition 5.11. A random infinite trait allocation T ∞ has a constrained frequency model with constraint set C ⊂ T ∪ {∅} if it has the construction in Definition 5.1 with step 2. replaced by (2) For N = 1, 2, . . . , (a) generate Y N = R N + D N as in step 2. of Definition 5.1, (c) if Y N ∈ C, continue; otherwise, go to step 2a.
Note that in Definition 5.11, Y N is precisely the membership profile of index N . That is to say, if we were to construct Using Y N instead of this construction simplifies the definition considerably.
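The per-index rejection in Definition 5.11 can be sketched as follows. This is an illustration under several assumptions that are not part of the definition: the regular traits are truncated to a finite list, `theta_regular[k][j-1]` plays the role of $\theta_{kj}$, `theta_dust[j-1]` is the Poisson rate for dust of multiplicity $j$, the constraint set is stored as a set of sorted tuples of multiplicities, and all function names are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold, count, product = math.exp(-rate), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_profile(theta_regular, theta_dust, rng):
    """One unconstrained draw of Y_N as a dict {trait label: multiplicity}."""
    draw = {}
    for k, probs in enumerate(theta_regular):
        # probs[j-1] is the chance of multiplicity j; leftover mass means multiplicity 0.
        u, cumulative = rng.random(), 0.0
        for j, p in enumerate(probs, start=1):
            cumulative += p
            if u < cumulative:
                draw[("regular", k)] = j
                break
    for j, rate in enumerate(theta_dust, start=1):
        for _ in range(sample_poisson(rate, rng)):
            draw[("dust", rng.random())] = j  # fresh Unif(0, 1) label for each dust trait
    return draw


def sample_constrained(theta_regular, theta_dust, constraint_set, n_indices, seed=0):
    """Redraw each Y_N until its multiset of multiplicities lies in the constraint set."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_indices):
        while True:
            draw = sample_profile(theta_regular, theta_dust, rng)
            if tuple(sorted(draw.values())) in constraint_set:
                draws.append(draw)
                break
    return draws


# Partition-like constraint: every index belongs to exactly one trait, once.
partition_draws = sample_constrained([[0.5], [0.3], [0.2]], [0.1], {(1,)}, 20, seed=1)
```

The loop only terminates if the constraint set has positive probability under the unconstrained draw, so the parameters should be chosen accordingly.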
Definition 5.12. A random infinite trait allocation $T_\infty$ has a constrained exchangeable trait probability function (CETPF) with constraint set $C \subset T \cup \{\emptyset\}$ if there exists a function $p : \mathbb{N} \times T \to \mathbb{R}_+$ such that for all $N \in \mathbb{N}$,

Theorem 5.13 extends Theorem 5.4 to constrained combinatorial structures: it gives a correspondence between random infinite trait allocations $T_\infty$ with constrained frequency models (Definition 5.11) and those with CETPFs (Definition 5.12).
Theorem 5.13. T ∞ has a constrained frequency model with constraint set C iff it has a CETPF with constraint set C.
Proof. Suppose $T_\infty$ has a constrained frequency model with constraint set $C$. For finite $N \in \mathbb{N}$, generating $T_N$ from the constrained frequency model is equivalent to generating it from the associated unconstrained frequency model (i.e., removing the rejection in step 2c of Definition 5.11), and then rejecting $T_N$ if $\prod_{n=1}^{N} \mathbb{1}\left[T_N^{(n)} \in C\right] = 0$. Generating $T_N$ from an unconstrained frequency model implies it has an ETPF by Theorem 5.4, and this ETPF inherently satisfies the summability condition in Definition 5.3 because it is itself a probability distribution. Since the final rejection step is equivalent to multiplying the distribution of $T_N$ by $\prod_{n=1}^{N} \mathbb{1}\left[T_N^{(n)} \in C\right]$ and renormalizing, $T_\infty$ has a CETPF with constraint set $C$.
Next, suppose $T_\infty$ has a CETPF with constraint set $C$. We can simply reverse the above logic: since the associated ETPF is summable, we can generate $T_N$ by simulating from the (normalized) ETPF and rejecting if $\prod_{n=1}^{N} \mathbb{1}\left[T_N^{(n)} \in C\right] = 0$. The ETPF has an associated frequency model by Theorem 5.4. Because the membership profile of an index does not change as more data are observed, instead of rejecting $T_N$ after generating all $Y_n$, $n = 1, 2, \dots, N$, we can reject after each index $n$ based on progressively constructing $T_n$, $n = 1, 2, \dots, N$, which is the constrained frequency model of Definition 5.11.
We can, of course, recover Theorem 5.4 from Theorem 5.13 by setting $C = T \cup \{\emptyset\}$. Theorem 5.13 also allows us to recover earlier results, using a novel proof technique, about the correspondence of exchangeable infinite partitions and partitions with an EPPF.
Corollary 5.14. A random infinite partition T ∞ is exchangeable iff it has an EPPF.
Proof. Suppose $T_\infty$ has an EPPF. The EPPF is a CETPF with $C = \{\{1\}\}$, and thus $T_\infty$ is exchangeable by inspection of Definition 5.12 (the probability is invariant to finite permutations of the indices). In the other direction, if $T_\infty$ is an exchangeable infinite partition, then it has a paintbox of the form specified in Corollary 4.7; for notational brevity define $w_k := \lambda(C_{k1})$ for $k \in \mathbb{N}$ and $w_0 := \lambda(C_{11})$. Note in particular that $\sum_{k=0}^{\infty} w_k = 1$, and each index $n \in \mathbb{N}$ selects its trait from the distribution $(w_k)_{k=0}^{\infty}$, where selecting 0 implies selecting a dust (or unique) trait. We seek a constrained frequency model equivalent to this paintbox, so we set $\theta_{kj} = \theta_j = 0$ for all $k, j \in \mathbb{N} : j > 1$ and seek $(\theta_{k1})$ and $\theta_1$ such that
$$\forall k \in \mathbb{N}, \quad \theta_{k1} \prod_{m \neq k} \theta_{m0} \propto w_k, \qquad \theta_1 \prod_{m} \theta_{m0} \propto w_0. \tag{5.39}$$
Dividing by $\prod_k \theta_{k0}$, this is equivalent to finding $(\theta_{k1})$ and $\theta_1$ such that
$$\forall k \in \mathbb{N}, \quad \frac{\theta_{k1}}{\theta_{k0}} \propto w_k, \qquad \theta_1 \propto w_0.$$
We have a degree of freedom in the proportionality constant, so set that equal to 1 and solve each equation by noting that $\theta_{k1} + \theta_{k0} = 1$, yielding
$$\theta_{k1} = \frac{w_k}{1 + w_k}, \qquad \theta_1 = w_0.$$
The exchangeable infinite partition $T_\infty$ has a constrained frequency model with constraint set $C = \{\{1\}\}$ based on $(\theta_{kj})$, $(\theta_j)$. By Theorem 5.13 it thus has a CETPF with the same constraint set $C$, which is an EPPF.
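As a numerical sanity check of this argument, the sketch below assumes the solution takes the form $\theta_{k1} = w_k/(1 + w_k)$ with dust rate $\theta_1 = w_0$ (what the stated steps give when the proportionality constant is 1), truncates to finitely many regular traits, and takes the dust count to be Poisson. It then conditions the unconstrained frequency model on the profile $\{1\}$ and compares the result to the paintbox weights. The function name is hypothetical.

```python
import math


def partition_selection_probs(weights):
    """weights[0] is the dust mass w_0; weights[1:] are the regular masses w_k.

    Returns the probability of each single-trait outcome under the frequency
    model, conditioned on the index having membership profile {1}.
    """
    w0, regular = weights[0], weights[1:]
    theta1 = [w / (1.0 + w) for w in regular]   # assumed solution for theta_k1
    theta0 = [1.0 - t for t in theta1]          # theta_k0 = 1 - theta_k1
    none_regular = math.prod(theta0)
    no_dust = math.exp(-w0)                     # Poisson(w0) probability of zero dust

    # Outcome 0: exactly one dust trait and no regular trait.
    unnormalized = [w0 * no_dust * none_regular]
    # Outcome k: regular trait k only and no dust.
    for k, t1 in enumerate(theta1):
        others = math.prod(t0 for m, t0 in enumerate(theta0) if m != k)
        unnormalized.append(no_dust * t1 * others)

    total = sum(unnormalized)
    return [u / total for u in unnormalized]


paintbox_weights = [0.1, 0.5, 0.3, 0.1]   # (w_0, w_1, w_2, w_3), summing to 1
recovered = partition_selection_probs(paintbox_weights)
```

Under these assumptions the conditioned probabilities agree with the paintbox weights up to floating-point error, since the common factor $e^{-\theta_1} \prod_m \theta_{m0}$ cancels on normalization.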

Application: vertex allocations and edge-exchangeable graphs
A natural assumption for graph data arising from online social networks, protein interaction networks, co-authorship networks, email communication networks, etc. (Goldenberg et al., 2010) is that the distribution over vertices and edges is invariant to reordering the vertices, i.e., the graph is vertex-exchangeable. Under this assumption, however, the Aldous-Hoover theorem (Aldous, 1981; Hoover, 1979) guarantees that the resulting graph is either dense or empty almost surely, an inappropriate consequence when modeling the sparse networks that occur in most applications. Standard statistical models, which are traditionally vertex-exchangeable (Lloyd et al., 2012), are therefore misspecified for modeling real-world networks. This model misspecification has motivated the development and study of a number of exchangeable network models that do not preclude sparsity (Caron and Fox, 2014; Veitch and Roy, 2015; Borgs et al., 2016; Crane and Dempsey, 2016a; Cai et al., 2016; Herlau and Schmidt, 2016). One class of such models assumes the edges are exchangeable rather than the vertices; these are aptly named edge-exchangeable models (Crane and Dempsey, 2016a; Cai et al., 2016). In this section we provide an alternate view of edge-exchangeable multigraphs as a subclass of exchangeable infinite trait allocations called vertex allocations, thus guaranteeing a paintbox representation. We also show that the vertex popularity model, a standard example of an edge-exchangeable model, is a constrained frequency model per Definition 5.11, thus guaranteeing the existence of a CETPF which we call the exchangeable vertex probability function (EVPF). We begin by considering multigraphs without loops, i.e., edges can occur with multiplicity and all edges contain exactly two vertices. We then discuss generalizations of this to cases such as multigraphs with edges that can contain one vertex (a loop) or two or (finitely many) more vertices (a hypergraph).
In the graph setting, the traits correspond to vertices, and the data indices in each trait correspond to the edges of the graph. Each data index has multiplicity 1 in exactly two traits, encoding an edge between two separate vertices. To ensure that the vertex allocation correctly encodes a graph, this is the only possibility for each index, as expressed in Definition 6.1. Fig. 6 shows an example encoding of a graph as a vertex allocation.
Definition 6.1. A vertex allocation of [N ] is a trait allocation of [N ] in which each index has membership profile equal to {1, 1}. Definition 6.1 and Theorem 4.5 together immediately yield a paintbox representation for edge-exchangeable graphs, provided by Corollary 6.2. The first constraint limits the potentially nonempty subsets of (0, 1) in the paintbox to C 11 , C 12 , and (C k1 ), thus limiting the multiplicities of indices in the traits to 1. The second and third ensure that each index is a member of at most two traits, guaranteeing that each index represents either an edge between two vertices, a loop on a single vertex, or an edge that connects to no vertices. The final two constraints ensure that each index (edge) is a member of at least two traits (vertices), ruling out loops and edges connecting to no vertices. An example of a graph paintbox is shown in Fig. 7.
Definition 6.1 and Corollary 6.2 can be modified in a number of ways to better suit the particular application at hand. For example, if loops are allowed (useful for capturing, for example, authors citing their own earlier work in a citation network), the membership profile of each index can be either $\{1, 1\}$ or $\{1\}$, and the second-to-last bullet in Corollary 6.2 is removed. This allows indices to be a member of a single trait with multiplicity 1, encoding a loop on a single vertex. If edges between more than two vertices are allowed, that is, if we are concerned with hypergraphs, then we may simply repurpose the definition of a feature allocation in Definition 2.3, with associated paintbox representation in Corollary 4.8, where we view the features as vertices. If $\mathbb{N}$-valued weights are allowed on the multigraph edges, they can be encoded using multiplicities greater than 1. In this case, the index membership profiles must be of the form $\{j, j\}$ for $j \in \mathbb{N}$, which encodes an edge of weight $j$. Weighted loops may be similarly obtained by allowing membership profiles of the form $\{j\}$ for $j \in \mathbb{N}$. This might be used, for example, to capture an author citing the same work multiple times in a single document. Weighted hypergraphs are simply trait allocations without any restrictions.
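The encodings described above can be illustrated with a short sketch. It assumes edges are given as hypothetical `(u, v, weight)` triples, that the $n$th edge in the list becomes data index $n$, and that each vertex's trait is stored as a `Counter` mapping edge indices to multiplicities; the function names are illustrative only.

```python
from collections import Counter, defaultdict


def vertex_allocation(edges):
    """Build one trait per vertex; edge n (1-based) becomes data index n."""
    traits = defaultdict(Counter)
    for n, (u, v, weight) in enumerate(edges, start=1):
        for vertex in {u, v}:          # a loop touches its single vertex once
            traits[vertex][n] += weight
    return dict(traits)


def membership_profile(traits, n):
    """Sorted nonzero multiplicities of edge index n across all vertex traits."""
    return sorted(c[n] for c in traits.values() if c[n] > 0)


def classify(profile):
    """Name the kind of edge that a membership profile encodes."""
    if len(profile) == 2 and profile[0] == profile[1]:
        return "edge" if profile[0] == 1 else "weighted edge"
    if len(profile) == 1:
        return "loop" if profile[0] == 1 else "weighted loop"
    return "not a graph edge"


edges = [("a", "b", 1), ("b", "c", 1), ("a", "b", 1), ("c", "c", 1), ("a", "c", 3)]
traits = vertex_allocation(edges)
kinds = [classify(membership_profile(traits, n)) for n in range(1, len(edges) + 1)]
```

Here edges 1 and 3 are parallel edges between the same two vertices, which the allocation represents as two distinct indices appearing in the same pair of traits.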
The vertex popularity model (Caron and Fox, 2014; Cai et al., 2016; Crane and Dempsey, 2016a; Palla et al., 2016; Herlau and Schmidt, 2016) is a simple yet powerful network model. In the vertex popularity model, all (potentially infinitely many) vertices $k \in \mathbb{N}$ are associated with a weight $w_k \in (0, 1)$ such that $\sum_k w_k < \infty$, and we sample an edge between vertices $k$ and $\ell$ with probability proportional to $w_k w_\ell$. For an edge-exchangeable vertex popularity model, assuming no loops, Theorem 5.13 guarantees that this model has an associated exchangeable vertex probability function (EVPF), given by Definition 6.3.

Definition 6.3. An exchangeable vertex probability function (EVPF) is a CETPF with constraint set $C = \{\{1, 1\}\}$.
Corollary 6.4. A regular exchangeable infinite vertex allocation has a vertex popularity model iff it has an EVPF.
Proof. We use a similar technique to the proof of Corollary 5.14: we seek a constrained frequency model (a sequence $(\theta_{kj})$ and set $C$) that corresponds to the vertex popularity model with weights $(w_i)$, and then use Theorem 5.13 to obtain a correspondence with a CETPF (and in particular, an EVPF). We let $\theta_j = 0$ for all $j \in \mathbb{N}$, let $\theta_{kj} = 0$ for all $j \in \mathbb{N} : j > 1$, and seek $(\theta_{k1})$ such that
$$\forall k, \ell \in \mathbb{N} : k \neq \ell, \quad \theta_{k1} \theta_{\ell 1} \prod_{m \neq k, \ell} \theta_{m0} \propto w_k w_\ell. \tag{6.1}$$
Dividing by $\prod_k \theta_{k0}$, and setting the proportionality constant to 1, this is equivalent to
$$\frac{\theta_{k1}}{\theta_{k0}} \cdot \frac{\theta_{\ell 1}}{\theta_{\ell 0}} = w_k w_\ell. \tag{6.2}$$
This may be solved, noting that $\forall k \in \mathbb{N}$, $\theta_{k0} + \theta_{k1} = 1$, by
$$\theta_{k1} = \frac{w_k}{1 + w_k}. \tag{6.3}$$
Therefore the vertex popularity model with weights $(w_i)$ is equivalent to a constrained frequency model with $\theta_{k1} = w_k / (1 + w_k)$ for $k \in \mathbb{N}$, $\theta_{kj} = 0$ for $j > 1$, $\theta_j = 0$ for all $j \in \mathbb{N}$, and $C = \{\{1, 1\}\}$ as specified above. Theorem 5.13 guarantees that the vertex popularity model has a CETPF with constraint set $C$, and likewise that any CETPF with constraint set $C$ yields a vertex popularity model by inverting the relation in Eq. (6.3).
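The relation $\theta_{k1} = w_k/(1 + w_k)$ can be checked numerically on a small example. The sketch below assumes a truncation to finitely many vertices and no dust; it computes the probability that an index joins exactly the pair of traits $(k, \ell)$ under the frequency model, normalizes over all pairs (the effect of the constraint $C = \{\{1, 1\}\}$), and compares with the vertex popularity probabilities. The function names are hypothetical.

```python
import math
from itertools import combinations


def edge_probs_from_frequency_model(weights):
    """Pair probabilities under theta_k1 = w_k / (1 + w_k), conditioned on profile {1, 1}."""
    theta1 = [w / (1.0 + w) for w in weights]
    theta0 = [1.0 - t for t in theta1]
    unnormalized = {}
    for k, l in combinations(range(len(weights)), 2):
        # Index joins traits k and l once each and no other trait.
        rest = math.prod(t for m, t in enumerate(theta0) if m not in (k, l))
        unnormalized[(k, l)] = theta1[k] * theta1[l] * rest
    total = sum(unnormalized.values())
    return {pair: p / total for pair, p in unnormalized.items()}


def edge_probs_from_popularity(weights):
    """Pair probabilities proportional to w_k * w_l, as in the vertex popularity model."""
    unnormalized = {(k, l): weights[k] * weights[l]
                    for k, l in combinations(range(len(weights)), 2)}
    total = sum(unnormalized.values())
    return {pair: p / total for pair, p in unnormalized.items()}


vertex_weights = [0.9, 0.5, 0.2, 0.05]
from_frequency = edge_probs_from_frequency_model(vertex_weights)
from_popularity = edge_probs_from_popularity(vertex_weights)
```

In this truncated setting the two dictionaries agree up to floating-point error, because each unnormalized frequency-model probability equals $w_k w_\ell \prod_m \theta_{m0}$ and the product is common to all pairs.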

Conclusions
In this work, we formalized the idea of trait allocations, the natural extension of well-known combinatorial structures such as partitions and feature allocations to data expressing latent factors with multiplicity greater than one. We then developed the framework of exchangeable random infinite trait allocations, which represent the latent memberships of an exchangeable sequence of data. The four major contributions in this framework were a novel order-of-appearance scheme for trait allocations without auxiliary randomness, a de Finetti-style paintbox representation theorem for all exchangeable trait allocations, a correspondence theorem between random trait allocations with a frequency model and those with an ETPF, and finally the introduction and study of the constrained ETPF for capturing random trait allocations with constrained index memberships. These contributions apply directly to many other combinatorial structures, such as edge-exchangeable graphs and topic models.

Proof. Suppose $\tau \in T$. Define the size of a trait $|\tau| := \sum_{n=1}^{\infty} \tau(n)$, and define $T_N = \{\tau \in T : |\tau| = N\}$. Define the function $f_N : T_N \to \mathbb{N}^N$ as $f_N(\tau) = (\tau_1, \tau_2, \dots, \tau_N)$ where $\tau_i \le \tau_j \iff i \le j$ and $\sum_{n=1}^{N} \mathbb{1}(\tau_n = k) = \tau(k)$. This map is an injection, and $\mathbb{N}^N$ is countable since it is a finite product set of countable sets, so $T_N$ is countable. Since $T = \bigcup_{N=1}^{\infty} T_N$ is a countable union of countable sets, $T$ is countable.

Proposition A.2. Both $T$ and $T_N$ (for all $N \in \mathbb{N}$) are countable.
Proof. Since T = ∞ N =1 T N , if T N is countable for all N ∈ N, so is T . Define T N M to be the set of trait allocations of [N ] with M ∈ N traits. Since we can define an injection from T N M to T M by picking any order of the traits for each t N ∈ T N M , and T M is countable since it is a finite product set of countable sets by Proposition A.1, we have that T N M is countable. Finally, since Proposition A.3. The set of all consistent trait allocations T ∞ (a strict subset of Proof. Suppose T ∞ is countable. Therefore, we can order its elements t where n i is the total multiplicity of i in t (i) i across all its traits, i.e. n i := τ ∈T t (i) i (τ )· τ (i). Define the sequence t ∞ = (t 1 , t 2 , . . . ) where t i has a single trait with each index j ≤ i having multiplicity n j + 1. This trait allocation is consistent but not contained in t , which is a contradiction to our initial assumption of countability.
A.2. Existence and uniqueness of T ∞ . The main tool used to prove the existence and uniqueness of T ∞ is the Kolmogorov extension theorem, Theorem A.7. The notation has been changed from its original source to more closely match the present paper.
Definition A.4. Two measurable spaces U, V are Borel isomorphic if there exists a measurable bijection f : U → V such that f −1 is measurable. A Borel space is a measurable space that is Borel isomorphic to a Borel subset of [0, 1].
Definition A.6. Given an arbitrary index set U , a projective family of probability measures µ u on a collection of Borel spaces S u , u ∈ U , is one in which µ I (·) = µ J (· × S J\I ) for all finite sets I, J such that I ⊂ J ⊂ U .
Theorem A.7. (Kallenberg, 1997, Theorem 5.16) For any collection of Borel spaces S u indexed by u ∈ U (with U arbitrary), consider a projective family of probability measures µ I on S I for all finite subsets I ⊂ U . Then there exist some random elements ξ u in S u , u ∈ U such that ξ I has distribution µ I for every such subset I ⊂ U .
Remark. Although it is not mentioned explicitly in the statement of Theorem A.7, the measure space for the process (ξ u ) is (S U , S U , µ) where S U is the product σ-algebra generated by the cylinder sets Proof. The essential part of the first claim is that A ∩ S is a σ-algebra, for which we verify the defining conditions: • Universal set: A ∈ A ∩ S because A ∈ S and A = A ∩ A ∈ A ∩ S.
• Closed under complement: If B ∈ A ∩ S, then there exists a C ∈ S such that B = A ∩ C, so B ∈ S and thus B c ∈ S. The complement of B relative to A is B c ∩ A, which is therefore in A ∩ S. • Closed under countable union: If B i ∈ A ∩ S for all i ∈ N, then there exist For the second, note that µ (∅) = µ(∅) = 0, µ ≥ 0, µ (A) = µ(A) = 1, and Lemma A.9. If C is a collection of sets in S and A ∈ σ(C), then A∩σ(C) = σ(A∩C).
Proof. First, note that C ⊂ σ(C), so A ∩ C ⊂ A ∩ σ(C) and therefore σ(A ∩ C) ⊂ A ∩ σ(C). Next, let D be the collection of sets C such that A ∩ C ∈ σ(A ∩ C). Then D is a σ-algebra: . Clearly C ⊂ D, and since D is a σ-algebra, σ(C) ⊂ D. But by the definition of D this means that A ∩ σ(C) ⊂ σ (A ∩ C). Both inclusion directions have been shown and therefore equality holds.
Remark (regarding Proposition 2.8). The details of the probability space are omitted in the statement of Proposition 2.8 to keep the main text accessible for a casual reader. A more precise statement could read: there exists a unique measure ν on the If we define C to be the family of cylinder sets A M × × N =M T N , A M ⊂ T M , the σ-algebra in the proposition may be equivalently stated using the consistent cylinders since Lemma A.9 implies While not required to read the main text, these measure-theoretic details are used extensively in the proofs. One major strength of Proposition 2.8 that is admittedly hidden by the simplified language in the main text is that we obtain a probability space on T ∞ rather than × ∞ N =1 T N , which is easier to work with when defining finite permutations and restrictions (since the domain/range are guaranteed to contain only consistent sequences).
Proof of Proposition 2.8. First note that since $T_N$ is countable by Proposition A.2, under the discrete metric it is complete and separable, and hence $T_N$ is a Polish space. Since the discrete metric endows $T_N$ with the power set topology, by Lemma A.5, $(T_N, 2^{T_N})$ is a Borel space for each $N \in \mathbb{N}$.
We define the family of functionsν n1,...,n K : 2 Tn 1 × 2 Tn 2 × . . . 2 Tn K → R + for all finite collections n 1 < n 2 < · · · < n K ∈ N aŝ ν n1,...,n K (A 1 × · · · × A K ) := It is easy to see thatν ≥ 0,ν(∅) = 0, andν(T n1 × · · · × T n K ) = t∈Tn K ν n K (t) = 1. Furthermore,ν is countably additive: taking a disjoint collection A i1 × · · · × A iK , i ∈ N, and soν is a probability measure for each choice of n 1 < · · · < n K . From the consistency of the individual ν N , it is also possible to show thatν is a projective family. To do this, we examine the effects of setting A j = T nj for j < K and for j = K. When j < K, =ν n1,...,nj−1,nj+1,...,n K (A 1 ×· · ·×A j−1 ×A j+1 ×· · ·×A K ), (A.9) and when j = K, ν n1,...,n K (A 1 × · · · × A K ) (A.10) Since we can arrive at any subset of [K] by removing one element j ∈ [K] at a time, this shows thatν is projective. Theorem A.7 guarantees the existence of a probability measure ν on the measurable space ( × ∞ N =1 T N , Σ) with marginals consistent withν, where the σ-algebra Σ is generated by the cylinder sets, The result that T ∞ | N ∼ ν N in the statement of the proposition is immediate from the definition ofν. Next, we refine the space, σ-algebra, and measure to be on the set of consistent trait allocations T ∞ . To do this, we need to show that T ∞ ∈ Σ. If this is the case, we have that ν(T ∞ ) = 1 since the marginalsν guarantee that T ∞ ∈ T ∞ a.s. By Lemma A.8, (T ∞ , T ∞ ∩ Σ) is a measurable space, and ν(T ∞ ∩ ·) is a probability measure on that space. Lemma A.9 allows us to equate T ∞ ∩ Σ and σ(T ∞ ∩ C). Finally, since Kallenberg (1997, Proposition 2.2) guarantees that any two stochastic processes with the same finite-dimensional marginals have the same distribution, the process developed in this proof is unique.
It remains to show that T ∞ ∈ Σ. We have that Each inner set is a finite intersection of cylinder sets generating Σ, and T N is countable by Proposition A.2, so each infinite union is an element of Σ. Thus T ∞ is a countable intersection of elements in Σ and therefore is itself in Σ.
A.3. Mappings on random trait allocations. To ensure that the individual components of random trait allocations are well-defined random elements in T N , and that exchangeable trait allocations are well-defined, we require Propositions A.10 and A.11.
Proposition A.10. The restriction mapping | N : T ∞ → T N is measurable for all N ∈ N.
Proof. Since we endow T N with the power set σ-algebra for N ∈ N, we need that the inverse map of any A ⊂ T N is an element of σ(T ∞ ∩ C) defined in Eq. (A.3). This is trivially true, as the inverse restriction of A is Proposition A.11. The finite permutation mapping π : T ∞ → T ∞ is measurable.
Proof. It is sufficient to show that the inverse map of every set in T ∞ ∩ C defined in Eq. (A.3) is contained in σ(T ∞ ∩ C) by Kallenberg (1997, Lemma 1.4). Define such a set B = T ∞ ∩ A × × N =M T N with A ⊂ T M . Since π is finite, let K be the index such that π : N → N fixes all indices k > K. Then since π −1 is also a finite permutation mapping, there exist A i ⊂ T i , i = 1, . . . , max(M, K) such that ∈ σ (T ∞ ∩ C) . (A.21) A.4. Properties of the label multiset sequence. Since the main step in constructing the trait paintbox in Theorem 4.5 involves the application of de Finetti's theorem (Kallenberg, 1997, Theorem 9.16) to a sequence of elements in Y × I, where I is the set of multisets of N and Y is the set of multisets of (0, 1), the basic requirement is that (Y × I, Y × I) is a Borel space, in the sense of Definition A.4, given some σ-algebras Y and I. However, I is countable and under the discrete metric is a Polish space. Thus, we need to show that there is a topological space on Y that is Polish, and hence the Borel σ-algebra generated by the product topology is a Borel space by Lemma A.5. Given any y ∈ Y, we can represent the fact that y(φ) = n by the tuple (φ, n), and represent the multiset y by the finite set of such tuples with n > 0; therefore the set Y is in bijection with the set of finite subsets y ⊂ (0, 1) × N such that (φ, n) ∈ y =⇒ (φ, m) / ∈ y for m = n. Given the topology generated by the Hausdorff metric (where · is the standard Euclidean distance on (0, 1) × N ⊂ R 2 ), and the corresponding metric on Y, these sets are homeomorphic. For convenience, we will use the interpretation of Y ⊂ (0, 1) × N in the statement and proof of Proposition A.12.
Proposition A.12. Y is a Borel space given the smallest σ-algebra Y containing the topology generated by the fixed-cardinality Hausdorff balls and let the finite subsets of (0, 1) × N of size less than or equal to K be denoted Y K = {y ⊂ (0, 1) × N : |y| ≤ K} . (A.25) Note that Y K contains elements of cardinality K not contained in Y (in particular, subsets including both (φ, m) and (φ, n) for m = n). In fact, Y K is the completion of Y K within (0, 1) × N under d H . Y K also has a countable dense subset-the subsets of Q ∩ (0, 1) × N of size ≤ K. Therefore Y K is a Polish space under the topology generated by d H . Next, we show Y K is an open subset of Y K . For any element x ∈ Y K , let δ be the minimum distance between the projection of any two points in x onto the (0, 1) axis. Let = min( 1 /2, δ /2). Therefore Y K itself is a Polish space with the intersection topology that is compatible with d H . Finally, Y = ∪ ∞ k=1 Y k is a countable disjoint union of Polish spaces and hence is itself Polish with the disjoint union topology, which is generated by the union of the generators of each Y K (i.e. the fixed-cardinality Hausdorff balls). Finally, using the σ-algebra Y generated by this topology, by Lemma A.5, we have that (Y, Y) is a Borel space.
Proof. We check that inverse maps of open sets are open under each map; it suffices to check a member of the generating class for the product topology on the range. ϕ (resp.φ) is continuous because the inverse map of a cylinder set in Y ∞ (resp. T ∞ × F ∞ ) is a cylinder set in T ∞ × F ∞ (resp. Y ∞ ), which is open as it is in the generating class of the product topology on T ∞ × F ∞ (resp. Y ∞ ).

Appendix B. Proofs of results in the main text
Proof of Lemma 3.6. First, if any one of $\tau = \omega$, $\tau < \omega$, or $\omega < \tau$ holds, then the other two cannot hold by direct application of Definition 3.4. In other words, at most one of the conditions holds. Next, if neither $\tau < \omega$ nor $\omega < \tau$ holds, then $\tau(n) = \omega(n)$ for all $n \in \mathbb{N}$, and hence $\tau = \omega$. If neither $\tau < \omega$ nor $\tau = \omega$ holds, then there is a minimum index $n \in \mathbb{N}$ such that $\tau(n) \neq \omega(n)$. Further, $\omega(n) > \tau(n)$ (otherwise $\tau < \omega$ holds), so $\omega < \tau$. By symmetry, if neither $\omega < \tau$ nor $\omega = \tau$ holds, then $\tau < \omega$ must hold. Therefore, at least one of the conditions holds, and the result follows.
Proof of Lemma 4.2. Suppose π fixes indices greater than N . Then using Definition 2.9, πt N is simply t N with indices permuted. Let K N = τ ∈T t N (τ ) < ∞, the number of traits in t N . Then let π be the unique finite permutation that maps the index of each trait τ in [t N ] to its corresponding trait πτ in [πt N ], while preserving monotonicity for any traits of multiplicity greater than 1. Mathematically, π fixes all k > K N , sets π [t N ] π −1 (k) = [πt N ] k for all k ∈ [K N ], and satisfies π (k + 1) = π (k) + 1 for all k ∈ [K N − 1] such that [t N ] k = [t N ] k+1 . Clearly such a permutation exists because πt N contains the same traits as t N with indices permuted by Definition 2.9, and the permutation is unique because any ambiguity (where t N contains traits with multiplicity greater than 1) is resolved by the monotonicity requirement. The monotonicity requirement also implies that π satisfies the desired ordering condition for all M ≥ N , i.e. which by the ETPF assumption is a constant for any L N with the given multiplicity profile, and 0 otherwise.