Article

Representational Rényi Heterogeneity

1 Department of Psychiatry, Dalhousie University, Halifax, NS B3H 2E2, Canada
2 Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
3 Department of Physics and Atmospheric Sciences, Dalhousie University, Halifax, NS B3H 4R2, Canada
* Authors to whom correspondence should be addressed.
Current address: 5909 Veterans Memorial Lane (8th Floor), Abbie J. Lane Memorial Building, QE II Health Sciences Centre, Halifax, NS B3H 2E2, Canada.
Entropy 2020, 22(4), 417; https://doi.org/10.3390/e22040417
Submission received: 26 March 2020 / Revised: 3 April 2020 / Accepted: 4 April 2020 / Published: 7 April 2020
(This article belongs to the Special Issue Entropy in Data Analysis)

Abstract

A discrete system’s heterogeneity is measured by the Rényi heterogeneity family of indices (also known as Hill numbers or Hannah–Kay indices), whose units are the numbers equivalent. Unfortunately, numbers equivalent heterogeneity measures for non-categorical data require a priori (A) categorical partitioning and (B) pairwise distance measurement on the observable data space, thereby precluding application to problems with ill-defined categories or where semantically relevant features must be learned as abstractions from some data. We thus introduce representational Rényi heterogeneity (RRH), which transforms an observable domain onto a latent space upon which the Rényi heterogeneity is both tractable and semantically relevant. This method requires neither a priori binning nor definition of a distance function on the observable space. We show that RRH can generalize existing biodiversity and economic equality indices. Compared with existing indices on a beta-mixture distribution, we show that RRH responds more appropriately to changes in mixture component separation and weighting. Finally, we demonstrate the measurement of RRH in a set of natural images, with respect to abstract representations learned by a deep neural network. The RRH approach will further enable heterogeneity measurement in disciplines whose data do not easily conform to the assumptions of existing indices.

1. Introduction

Measuring heterogeneity is of broad scientific importance, such as in studies of biodiversity (ecology and microbiology) [1,2], resource concentration (economics) [3], and consistency of clinical trial results (biostatistics) [4], to name a few. In most of these cases, one measures the heterogeneity of a discrete system equipped with a probability mass function.
Discrete systems assume that all observations of a given state are identical (zero distance), and that all pairwise distances between states are permutation invariant. This assumption is violated when relative distances between states are important. For example, an ecosystem is not biodiverse if all species serve the same functional role [5]. Although species are categorical labels, their pairwise differences in terms of ecological functions differ and thus violate the discrete space assumptions. Mathematical ecologists have thus developed heterogeneity measures for non-categorical systems, which they generally call “functional diversity indices” [6,7,8,9,10,11]. These indices typically require a priori discretization and specification of a distance function on the observable space.
The requirement for defining the state space a priori is problematic when the states are incompletely observable: that is, when they may be noisy, unreliable, or invalid. For example, consider sampling a patient from a population of individuals with psychiatric disorders and assigning a categorical state label corresponding to his or her diagnosis according to standard definitions [12]. Given that psychiatric conditions are not defined by objective biomarkers, the individual’s diagnostic state will be uncertain. Indeed, many of these conditions are inconsistently diagnosed across raters [13], and there is no guarantee that they correspond to valid biological processes. Alternatively, it is possible that variation within some categorical diagnostic groups is simply related to diagnostic “noise,” or nuisance variation, but that variation within other diagnostic groups constitutes the presence of sub-strata. Appropriate measurement of heterogeneity in such disciplines requires freedom from the discretization requirement of existing non-categorical heterogeneity indices.
Pre-specified distance functions may fail to capture semantically relevant geometry in the raw feature space. For example, the Euclidean distance between Edmonton and Johannesburg is relatively useless since the straight-line path cannot be traversed. Rather, the appropriate distances between points must account for the data’s underlying manifold of support. Representation learning addresses this problem by learning a latent embedding upon which distances are of greater semantic relevance [14]. Indeed, we have observed superior clustering of natural images embedded on Riemannian manifolds [15] (but also see Shao et al. [16]), and preservation of semantic hierarchies when linguistic data are embedded on a hyperbolic space [17].
Therefore, we seek non-categorical heterogeneity indices without requisite a priori definition of categorical state labels or a distance function. The present study proposes a solution to these problems based on the measurement of heterogeneity on learned latent representations, rather than on raw observable data. Our method, representational Rényi heterogeneity (RRH), involves learning a mapping from the space of observable data to a latent space upon which an existing measure (the Rényi heterogeneity [18], also known as the Hill numbers [19] or Hannah–Kay indices [20]) is meaningful and tractable.
The paper is structured as follows. Section 2 introduces the original categorical formulation of Rényi heterogeneity and various approaches by which it has been generalized for application on non-categorical spaces [8,10,21]. Limitations of these indices are highlighted, thereby motivating Section 3, which introduces the theory of Representational Rényi Heterogeneity (RRH), which generalizes the process for computing many indices of biodiversity and economic equality. Section 4 provides an illustration of how RRH may be measured in various analytical contexts. We provide an exact comparison of RRH to existing non-categorical heterogeneity indices under a tractable mixture of beta distributions. To highlight the generalizability of our approach to complex latent variable models, we also provide an evaluation of RRH applied to the latent representations of a handwritten image dataset [22] learned by a variational autoencoder [23,24]. Finally, in Section 5 we provide a summary of our findings and discuss avenues for future work.

2. Existing Heterogeneity Indices

2.1. Rényi Heterogeneity in Categorical Systems

There are many approaches to derive Rényi heterogeneity [18,19,20]. Here, we loosely follow the presentation of Eliazar and Sokolov [25] by using the metaphor of repeated sampling from a discrete system $X$ with event space $\mathcal{X} = \{1, 2, \ldots, n\}$ and probability distribution $\mathbf{p} = (p_i)_{i=1,2,\ldots,n}$. The probability that $q \in \mathbb{N}_{>1}$ independent and identically distributed (i.i.d.) realizations of $X$, sampled with replacement, will be identical is

$$P_X\left(x_1 = x_2 = \cdots = x_q\right) = \sum_{i=1}^{n} p_i^q. \tag{1}$$

Now let $X^*$ be an idealized reference system with a uniform probability distribution over $n^*$ categorical states, $\mathbf{p}^* = (1/n^*)_{i=1,2,\ldots,n^*}$, and let $x_1^*, x_2^*, \ldots, x_q^*$ be a sample of $q$ i.i.d. realizations of $X^*$ such that

$$P_{X^*}\left(x_1^* = x_2^* = \cdots = x_q^*\right) = P_X\left(x_1 = x_2 = \cdots = x_q\right) = \sum_{i=1}^{n^*} (n^*)^{-q}. \tag{2}$$

We call $X^*$ an "idealized" categorical system because its probability distribution is uniform, and it is a "reference" system for $X$ in that the probability of drawing homogeneous samples of $q$ observations from both systems is identical. Substituting Equation (2) into Equation (1) and solving for $n^*$ yields the Rényi heterogeneity of order $q$,

$$\Pi_q(\mathbf{p}) = \left(\sum_{i=1}^{n} p_i^q\right)^{\frac{1}{1-q}} = n^*, \tag{3}$$
whose units are the numbers equivalent of system $X$ [1,26,27,28], insofar as $n^*$ is the number of states in an "equivalent" (idealized reference) system $X^*$. Thus far, we have restricted the parameter $q$ to take integer values greater than 1 solely to facilitate this intuitive derivation in a concise fashion. However, the elasticity parameter $q$ in Equation (3) can be any real number (but $q \neq 1$), although in the context of heterogeneity measurement only $q \geq 0$ is used [1,25]. Although Equation (3) is undefined at $q = 1$ directly, L'Hôpital's rule can be used to show that the limit as $q \to 1$ exists, wherein it corresponds to the exponential of Shannon's entropy [28,29], known as perplexity [30].
Equation (3) is the exponential of Rényi’s entropy [18], and is alternatively known as the Hill numbers in ecology [1,19], Hannah–Kay indices in economics [20], and generalized inverse participation ratio in physics [25]. Interestingly, it generalizes or can be transformed into several heterogeneity indices that are commonly employed across scientific disciplines (Table 1).
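To make Equation (3) concrete, the following minimal Python sketch (our own illustration, not part of the paper's supplementary code) computes the Rényi heterogeneity of a categorical probability vector, handling the $q \to 1$ limit as the perplexity:

```python
import numpy as np

def renyi_heterogeneity(p, q):
    """Rényi heterogeneity (Hill number) of order q for a categorical
    probability vector p (Equation (3)). Units are numbers equivalent."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # states with zero probability contribute nothing
    if np.isclose(q, 1.0):
        # q -> 1 limit: perplexity, the exponential of Shannon entropy
        return np.exp(-np.sum(p * np.log(p)))
    return np.sum(p ** q) ** (1.0 / (1.0 - q))

# A uniform distribution over n states has heterogeneity n at every order q
print(renyi_heterogeneity([0.25] * 4, q=2))       # 4.0
print(renyi_heterogeneity([0.7, 0.2, 0.1], q=1))  # ~2.19 effective states
```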

2.1.1. Properties of the Rényi Heterogeneity

Equation (3) satisfies several properties that render it a preferable measure of heterogeneity. These have been detailed elsewhere [1,20,25,28,33,38], but we focus on three properties that are of particular relevance for the remainder of this paper.
First, $\Pi_q$ satisfies the principle of transfers [39,40], which states that any equality-increasing transfer of probability between states must increase the heterogeneity. The maximal value of $\Pi_q$ is attained if and only if $p_i = p_j$ for all $(i, j) \in \{1, 2, \ldots, n\}$. This property follows from the Schur-concavity of Equation (3) [20].
Second, $\Pi_q$ satisfies the replication principle [1,38,41], which is equivalent to stating that Equation (3) scales linearly with the number of equally probable states in an idealized categorical system [25]. More formally, consider a set of systems $X_1, X_2, \ldots, X_N$ with probability distributions $\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_N$ over respective discrete event spaces $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_N$. These systems are also assumed to satisfy the following properties:
  • Event spaces are disjoint: $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for all $(i, j) \in \{1, 2, \ldots, N\}$ where $i \neq j$
  • All systems have equal heterogeneity: $\Pi_q(\mathbf{p}_1) = \Pi_q(\mathbf{p}_2) = \cdots = \Pi_q(\mathbf{p}_i) = \cdots = \Pi_q(\mathbf{p}_N)$
The replication principle states that if we combine $X_1, X_2, \ldots, X_N$ into a pooled system $X$ with probability distribution $\bar{\mathbf{p}}$, then

$$\Pi_q(\bar{\mathbf{p}}) = N \, \Pi_q(\mathbf{p}_i) \tag{4}$$
must hold (see Appendix A for proof that Rényi heterogeneity satisfies the replication principle).
The replication principle suggests that Equation (3) satisfies a property known as decomposability, in that the heterogeneity of a pooled system can be decomposed into that arising from variation within and between component subsystems. However, we require that this property be satisfied when either (A) subsystems’ event spaces are overlapping, or (B) subsystems do not have equal heterogeneity. The decomposability property will be particularly important for Section 3, and so we detail it further in Section 2.1.2.

2.1.2. Decomposition of Categorical Rényi Heterogeneity

Consider a system $X$ defined by pooling subsystems $X_1, X_2, \ldots, X_N$ with potentially overlapping event spaces $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_N$, respectively. The event space of the pooled system is defined as

$$\mathcal{X} = \bigcup_{i=1}^{N} \mathcal{X}_i = \{1, 2, \ldots, n\}. \tag{5}$$

Furthermore, we define the matrix $\mathbf{P} = (p_{ij})_{i=1,\ldots,N}^{j=1,\ldots,n}$ whose $i$th row is the probability of system $X_i$ being observed in each state $j \in \{1, 2, \ldots, n\}$.

It may be the case that some subsystems comprise a larger proportion of $X$ than others. For instance, if the probability distribution for subsystem $X_i$ was estimated based on a larger sample size than that of $X_j$, one may want to weight the contribution of $X_i$ more highly. Thus, we define a column vector of weights $\mathbf{w} = (w_i)_{i=1,2,\ldots,N}$ over the $N$ subsystems such that $\sum_{i=1}^{N} w_i = 1$ and $w_i \geq 0$ for all $i$. The probability distribution over states in the pooled system $X$ may thus be computed as $\bar{\mathbf{p}} = \sum_{i=1}^{N} w_i \mathbf{p}_i$, from which the definition of pooled heterogeneity follows:

$$\Pi_q^P(\mathbf{P}, \mathbf{w}) = \left( \sum_{j=1}^{n} \left( \sum_{i=1}^{N} w_i p_{ij} \right)^q \right)^{\frac{1}{1-q}}. \tag{6}$$
One can interpret $\Pi_q^P(\mathbf{P}, \mathbf{w})$ as the effective number of states in the pooled categorical system $X$.
Jost [28] showed that the within-group heterogeneity, which is the effective number of unique states arising from individual component systems, can be defined as

$$\Pi_q^W(\mathbf{P}, \mathbf{w}) = \left( \sum_{i=1}^{N} \frac{w_i^q}{\sum_{k=1}^{N} w_k^q} \sum_{j=1}^{n} p_{ij}^q \right)^{\frac{1}{1-q}}. \tag{7}$$
For example, if all subsystems have disjoint event spaces, each with heterogeneity equal to a constant $\nu$, then each contributes $\nu$ unique states to the pooled system $X$.
Deriving the between-group heterogeneity $\Pi_q^B(\mathbf{P}, \mathbf{w})$ is thus straightforward. If the effective total number of states in the pooled system is $\Pi_q^P(\mathbf{P}, \mathbf{w})$, and the effective number of unique states contributed by distinct subsystems is $\Pi_q^W(\mathbf{P}, \mathbf{w})$, then

$$\Pi_q^B(\mathbf{P}, \mathbf{w}) = \frac{\Pi_q^P(\mathbf{P}, \mathbf{w})}{\Pi_q^W(\mathbf{P}, \mathbf{w})} \tag{8}$$

is the effective number of completely distinct subsystems in the pooled system $X$. A word of caution is warranted. If we require that within-group heterogeneity be a lower bound on pooled heterogeneity [42], then Jost ([28], Proofs 2 and 3) showed that Equation (8) will hold (A) at any value of $q$ when weights are equal (i.e., $w_i = 1/N$ for all $i \in \{1, 2, \ldots, N\}$), or (B) only at $q = 0$ and $q = 1$ if weights are unequal.
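The decomposition in Equations (6)–(8) is straightforward to compute. The sketch below (a minimal NumPy illustration; the function name is our own) returns the pooled, within-group, and between-group heterogeneity for a matrix of subsystem distributions, handling the $q \to 1$ limit, where the within-group term reduces to the exponential of the weighted mean of subsystem Shannon entropies:

```python
import numpy as np

def decompose_renyi(P, w, q):
    """Pooled, within-, and between-group Rényi heterogeneity
    (Equations (6)-(8)). P: (N, n) row-stochastic matrix; w sums to 1."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    p_bar = w @ P  # pooled distribution over the n states
    if np.isclose(q, 1.0):
        log_pbar = np.log(p_bar, where=p_bar > 0, out=np.zeros_like(p_bar))
        pooled = np.exp(-np.sum(p_bar * log_pbar))
        H = -np.sum(P * np.log(P, where=P > 0, out=np.zeros_like(P)), axis=1)
        within = np.exp(w @ H)  # q -> 1 limit of Equation (7)
    else:
        pooled = np.sum(p_bar ** q) ** (1 / (1 - q))
        within = ((w ** q / np.sum(w ** q)) @ np.sum(P ** q, axis=1)) ** (1 / (1 - q))
    return pooled, within, pooled / within  # between = pooled / within
```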

2.1.3. Limitations of Categorical Rényi Heterogeneity

The chief limitation of Rényi heterogeneity (Equation (3)) is its assumption that all states in a system $X$ (with event space $\mathcal{X} = \{1, 2, \ldots, n\}$ and probability distribution $\mathbf{p} = (p_i)_{i=1,2,\ldots,n}$) are categorical. More formally, the dissimilarity between a pair of observations $(x, y) \in \mathcal{X}$ from this system is defined by the discrete metric

$$d^*(x, y) = 1 - \delta_{xy}, \tag{9}$$

where $\delta_{xy}$ is Kronecker's delta, which takes a value of 1 if $x = y$ and 0 otherwise. Since the discrete metric assumption is an idealization, we have continued to use the asterisk to qualify an arbitrary distance function $d^*(\cdot, \cdot)$ as categorical in nature. The resulting expected pairwise distance matrix between states in $\mathcal{X}$ is

$$\mathbf{D}^* = \left( d^*(i, j) \right)_{i=1,\ldots,n}^{j=1,\ldots,n} = \mathbf{1}\mathbf{1}^\top - \mathbb{I}, \tag{10}$$

where $\mathbf{1} = (1)_{i=1,2,\ldots,n}$ is a column vector of ones, and $\mathbb{I} = (\delta_{ij})_{i=1,\ldots,n}^{j=1,\ldots,n}$ is the $n \times n$ identity matrix.
Clearly, many systems of interest in the real world are not categorical. For example, although we may label a sample of organisms according to their respective species, there may be differences between these taxonomic classes that are relevant to the functioning of the ecosystem as a whole [5]. It is also possible that no valid and reliable set of categorical labels is known a priori for a system whose event space is naturally non-categorical.

2.2. Non-Categorical Heterogeneity Indices

Consider a system $X$ with probability distribution $\mathbf{p} = (p_i)_{i=1,2,\ldots,n}$ defined over event space $\mathcal{X} = \{1, 2, \ldots, n\}$ and equipped with dissimilarity function $d_{\mathcal{X}}(\cdot, \cdot)$. We assume that $d_{\mathcal{X}}$ is more general than the discrete metric (Equation (9)), and further that it need not be a true (metric) distance. For such systems, there are three heterogeneity indices whose units are numbers equivalent and that respect the replication principle [6,8,10,11,21]. Much like our derivation of the Rényi heterogeneity in Section 2.1, these indices quantify the heterogeneity of a non-categorical system as the number of states in an idealized reference system, but differ primarily in how the idealized reference is defined. We begin with a discussion of the numbers equivalent quadratic entropy (Section 2.2.1), followed by the functional Hill numbers (Section 2.2.2) and the Leinster–Cobbold index [10] (Section 2.2.3).

2.2.1. Numbers Equivalent Quadratic Entropy

Rao [43] introduced the diversity index commonly known as Rao's quadratic entropy (RQE),

$$Q_1(\mathbf{D}, \mathbf{p}) = \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} p_i p_j, \tag{11}$$

where $\mathbf{D}$ is an $n \times n$ matrix with $D_{ij} = d_{\mathcal{X}}(i, j)$ for states $(i, j) \in \mathcal{X}$.

Ricotta and Szeidl [21] assume that $D_{ij} = 1$ means that states $i$ and $j$ are maximally dissimilar (i.e., categorically different), and that $D_{ij} = 0$ means $i = j$, which occurs when $X$ is a categorical system. An arbitrary dissimilarity matrix $\mathbf{D}$ can be rescaled to respect this assumption by applying the following transformation:

$$\tilde{\mathbf{D}} = \frac{\mathbf{D} - \min_{ij} D_{ij}}{\max_{ij} D_{ij} - \min_{ij} D_{ij}}. \tag{12}$$

Under this transformation, Ricotta and Szeidl [21] search for an idealized categorical reference system $X^*$ with event space $\mathcal{X}^* = \{1, 2, \ldots, n^*\}$, probability distribution $\mathbf{p}^* = (1/n^*)_{i=1,2,\ldots,n^*}$, and RQE equal to that of $X$. For a column vector of ones, $\mathbf{1} = (1)_{i=1,2,\ldots,n^*}$, and the identity matrix $\mathbb{I} = (\delta_{ij})_{i=1,\ldots,n^*}^{j=1,\ldots,n^*}$, this is

$$Q_1(\tilde{\mathbf{D}}, \mathbf{p}) = Q_1(\mathbf{1}\mathbf{1}^\top - \mathbb{I}, \mathbf{p}^*). \tag{13}$$

Expanding the right-hand side, we have

$$Q_1(\tilde{\mathbf{D}}, \mathbf{p}) = \sum_{i=1}^{n^*} \sum_{j=1}^{n^*} (n^*)^{-2} (1 - \delta_{ij}) = 1 - \frac{1}{n^*}. \tag{14}$$
Recalling that $\Pi_q(\mathbf{p}^*) = n^*$ and substituting into Equation (14) yields

$$\Pi_q(\mathbf{p}^*) = \left( 1 - Q_1(\tilde{\mathbf{D}}, \mathbf{p}) \right)^{-1}, \tag{15}$$

which establishes the units of $\left( 1 - Q_1(\tilde{\mathbf{D}}, \mathbf{p}) \right)^{-1}$ as numbers equivalent.

For consistency, we require that $\Pi_q(\mathbf{p}) = \Pi_q(\mathbf{p}^*)$ if $\tilde{\mathbf{D}}$ were categorical. This holds only at $q = 2$:

$$\left( 1 - Q_1(\tilde{\mathbf{D}}^*, \mathbf{p}) \right)^{-1} = \left( 1 - \sum_{i=1}^{n} \sum_{j=1}^{n} p_i p_j (1 - \delta_{ij}) \right)^{-1} = \left( \sum_{i=1}^{n} p_i^2 \right)^{-1} = \Pi_2(\mathbf{p}). \tag{16}$$

Based on this result, Ricotta and Szeidl [21] define the numbers equivalent quadratic entropy $\hat{Q}_e$ as

$$\hat{Q}_e(\tilde{\mathbf{D}}, \mathbf{p}) = \left( 1 - Q_1(\tilde{\mathbf{D}}, \mathbf{p}) \right)^{-1}. \tag{17}$$

This can be interpreted as the inverse Simpson concentration of an idealized categorical reference system whose average pairwise distance between states is equal to $Q_1(\tilde{\mathbf{D}}, \mathbf{p})$.
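A minimal NumPy sketch of Equations (11), (12), and (17) follows (our own illustration; the function name is hypothetical). With the discrete metric as input, it recovers the inverse Simpson index, as Equation (16) predicts:

```python
import numpy as np

def qe_numbers_equivalent(D, p):
    """Numbers equivalent quadratic entropy (Equation (17))."""
    D, p = np.asarray(D, float), np.asarray(p, float)
    # Rescale dissimilarities to [0, 1] as in Equation (12)
    D_tilde = (D - D.min()) / (D.max() - D.min())
    Q = p @ D_tilde @ p  # Rao's quadratic entropy (Equation (11))
    return 1.0 / (1.0 - Q)

# Categorical (discrete-metric) distances recover the inverse Simpson index
D_star = 1.0 - np.eye(3)
p = np.array([0.5, 0.3, 0.2])
print(qe_numbers_equivalent(D_star, p))  # == 1 / sum(p**2) ~ 2.63
```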

2.2.2. Functional Hill Numbers

Chiu and Chao [8] derived the functional Hill numbers, denoted $F_q$, based on a procedure similar to that of Ricotta and Szeidl [21]. However, whereas $\hat{Q}_e$ uses a purely categorical system as the idealized reference, $F_q$ requires only that

$$Q_1(\mathbf{D}, \mathbf{p}) = \sum_{i=1}^{n^*} \sum_{j=1}^{n^*} Q_1(\mathbf{D}, \mathbf{p}) \, p_i^* p_j^* = \sum_{i=1}^{n^*} \sum_{j=1}^{n^*} \frac{Q_1(\mathbf{D}, \mathbf{p})}{(n^*)^2}, \tag{18}$$

which means that the idealized reference system is one for which the between-state distance matrix is set to $Q_1(\mathbf{D}, \mathbf{p})$ everywhere (or to 0 along the leading diagonal and $Q_1(\mathbf{D}, \mathbf{p}) \, n^*/(n^*-1)$ on the off-diagonals).

Chiu and Chao [8] generalized Rao's quadratic entropy to include the elasticity parameter $q \geq 0$,

$$Q_q(\mathbf{D}, \mathbf{p}) = \sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} \left( p_i p_j \right)^q, \tag{19}$$

and sought to find $n^*$ for the idealized reference system satisfying Equation (18) and the following:

$$Q_q(\mathbf{D}, \mathbf{p}) = \sum_{i=1}^{n^*} \sum_{j=1}^{n^*} Q_1(\mathbf{D}, \mathbf{p}) \left( \frac{1}{n^*} \frac{1}{n^*} \right)^q. \tag{20}$$

Solving Equation (20) for $n^*$ yields the functional Hill numbers of order $q$:

$$F_q(\mathbf{D}, \mathbf{p}) = \left( \frac{Q_q(\mathbf{D}, \mathbf{p})}{Q_1(\mathbf{D}, \mathbf{p})} \right)^{\frac{1}{2(1-q)}} = n^*, \tag{21}$$

which is the effective number of states in an idealized categorical reference system whose distance function is scaled by a factor of $Q_1(\mathbf{D}, \mathbf{p}) \, n^*/(n^*-1)$.
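The following sketch (again our own, assuming NumPy) computes Equation (21), using L'Hôpital's rule for the $q \to 1$ limit:

```python
import numpy as np

def functional_hill(D, p, q):
    """Functional Hill numbers of order q (Equation (21))."""
    D, p = np.asarray(D, float), np.asarray(p, float)
    PP = np.outer(p, p)  # matrix of products p_i * p_j
    Q1 = np.sum(D * PP)
    if np.isclose(q, 1.0):
        # q -> 1 limit via L'Hôpital's rule on log(Q_q / Q_1) / (2 (1 - q))
        logPP = np.log(PP, where=PP > 0, out=np.zeros_like(PP))
        return np.exp(-np.sum(D * PP * logPP) / (2.0 * Q1))
    Qq = np.sum(D * PP ** q)
    return (Qq / Q1) ** (1.0 / (2.0 * (1.0 - q)))
```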

2.2.3. Leinster–Cobbold Index

The index derived by Leinster and Cobbold [10], denoted $L_q$, is distinct from $\hat{Q}_e$ and $F_q$ in two ways. First, for a given system $X$, $L_q$ is not derived by finding an idealized reference system $X^*$ whose average between-state dissimilarity is equal to that of $X$. Second, it does not use a dissimilarity matrix; rather, it uses a measure of similarity or affinity.

The Leinster–Cobbold index may be derived by simple extension of Equation (3). Assuming $X$ has state space $\mathcal{X} = \{1, 2, \ldots, n\}$ with probability distribution $\mathbf{p} = (p_i)_{i=1,2,\ldots,n}$, we note that

$$\Pi_q(\mathbf{p}) = \left( \sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}} = \left( \sum_{i=1}^{n} p_i \left( \mathbb{I} \mathbf{p} \right)_i^{q-1} \right)^{\frac{1}{1-q}}. \tag{22}$$

Here, $\mathbb{I}$ is the $n \times n$ identity matrix representing the pairwise similarities between states in $X$. The Leinster–Cobbold index generalizes $\mathbb{I}$ to any $n \times n$ similarity matrix $\mathbf{S}$, yielding the following formula:

$$L_q(\mathbf{S}, \mathbf{p}) = \left( \sum_{i=1}^{n} p_i \left( \sum_{j=1}^{n} S_{ij} p_j \right)^{q-1} \right)^{\frac{1}{1-q}}. \tag{23}$$

The similarity matrix can be obtained from a dissimilarity matrix by the transformation $S_{ij} = e^{-u D_{ij}}$, where $u \geq 0$ is a scaling factor. When $u = 0$, then $\mathbf{S}$ is 1 everywhere. Conversely, as $u \to \infty$, $\mathbf{S}$ approaches $\mathbb{I}$. The Leinster–Cobbold index can thus be interpreted as the effective number of states in an idealized reference system (i.e., one with uniform probabilities over states) whose topology is also governed by the similarity matrix $\mathbf{S}$.
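A corresponding sketch of Equation (23) (our own illustration), including the $q \to 1$ limit, is:

```python
import numpy as np

def leinster_cobbold(S, p, q):
    """Leinster-Cobbold index of order q (Equation (23))."""
    S, p = np.asarray(S, float), np.asarray(p, float)
    Sp = S @ p  # "ordinariness" of each state: sum_j S_ij p_j
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p * np.log(Sp)))  # q -> 1 limit
    return np.sum(p * Sp ** (q - 1.0)) ** (1.0 / (1.0 - q))

# With S = exp(-u * D): u = 0 treats all states as identical (L_q = 1),
# while u -> infinity recovers the categorical Rényi heterogeneity.
```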

2.2.4. Limitations of Existing Non-Categorical Heterogeneity Indices

We illustrate several limitations of the $\hat{Q}_e$, $F_q$, and $L_q$ indices using a simple 3-state system $X$ with event space $\mathcal{X} = \{1, 2, 3\}$ over which we specify a probability distribution

$$\mathbf{p}(\kappa) = \left( \frac{1}{1 + \kappa + \kappa^2}, \; \frac{\kappa}{1 + \kappa + \kappa^2}, \; \frac{\kappa^2}{1 + \kappa + \kappa^2} \right), \tag{24}$$

where $\kappa \geq 0$ is a parameter that smoothly varies the level of inequality: $\mathbf{p}(0) = (1, 0, 0)$, $\mathbf{p}(1) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, and $\mathbf{p}(\kappa) \to (0, 0, 1)$ as $\kappa \to \infty$. When $\kappa = 1$ the distribution is perfectly even (Figure 1A). Treating an undirected graph of the system as a triangle with height $h$ and base $b$, we also specify the following parametric distance matrix,

$$\mathbf{D}(h, b) = \begin{pmatrix} 0 & b & \sqrt{\frac{b^2}{4} + h^2} \\ b & 0 & \sqrt{\frac{b^2}{4} + h^2} \\ \sqrt{\frac{b^2}{4} + h^2} & \sqrt{\frac{b^2}{4} + h^2} & 0 \end{pmatrix}, \tag{25}$$

which allows us to smoothly vary the level of dissimilarity between states in $X$. Importantly, Equation (25) allows us to generate distance matrices that are either metric (when $h < b\sqrt{3}/2$; Definition 1) or ultrametric (when $h \geq b\sqrt{3}/2$; Definition 2). This is illustrated in Figure 1B.
Definition 1 (Metric distance). A function $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ on a set $\mathcal{X}$ is a metric if and only if all of the following conditions are satisfied for all $(x, y, z) \in \mathcal{X}$:
1. Non-negativity: $d(x, y) \geq 0$
2. Identity of indiscernibles: $d(x, y) = 0 \iff x = y$
3. Symmetry: $d(x, y) = d(y, x)$
4. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$

Definition 2 (Ultrametric distance). A function $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ on a set $\mathcal{X}$ is ultrametric if and only if, for all $(x, y, z) \in \mathcal{X}$, criteria 1–3 for a metric are satisfied (Definition 1), in addition to the ultrametric triangle inequality:

$$d(x, z) \leq \max\left\{ d(x, y), d(y, z) \right\} \tag{26}$$
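This comparison setup is easy to reproduce. The sketch below (our own hedged construction; function names are hypothetical) builds Equations (24) and (25) and checks the ultrametric threshold:

```python
import numpy as np

def p_kappa(kappa):
    """Probability vector of Equation (24); kappa varies inequality."""
    raw = np.array([1.0, kappa, kappa ** 2])
    return raw / raw.sum()

def D_triangle(h, b=1.0):
    """Parametric distance matrix of Equation (25): a triangle with
    base b between states 1 and 2, and apex (state 3) at height h."""
    s = np.sqrt(b ** 2 / 4.0 + h ** 2)  # length of the two equal sides
    return np.array([[0.0, b, s],
                     [b, 0.0, s],
                     [s, s, 0.0]])

# Ultrametric iff the base is no longer than the equal sides: h >= b*sqrt(3)/2
h = 1.0
print(h >= np.sqrt(3) / 2)          # True: D(1, 1) is ultrametric
print(D_triangle(h), p_kappa(1.0))  # even distribution over the 3 states
```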
Figure 1C compares the $\hat{Q}_e$, $F_q$, and $L_q$ indices when applied to $X$ across variation in between-state distances (via Equation (25)) and skewness in the probability distribution over states (Equation (24)). With respect to the numbers equivalent quadratic entropy ($\hat{Q}_e$; Section 2.2.1), we note that its behavior differs categorically depending on whether the distance matrix is ultrametric. That is, $\hat{Q}_e$ increases with the triangle height parameter $h$ (Equation (25)) until $h$ passes the ultrametric threshold, after which $\hat{Q}_e$ decreases monotonically with $h$. The behavior of $\hat{Q}_e$ is sensible in the ultrametric range: when the distance matrix is rescaled, as in Equation (12), pulling one of the three states in $X$ further away from the remaining two should function similarly to progressively merging the latter states. Nevertheless, the behavior of $\hat{Q}_e$ is highly sensitive to whether a given distance matrix is ultrametric (which will often not be the case in real-world applications).

With respect to $F_q$, a notable benefit in comparison to $\hat{Q}_e$ is that $F_q$ behaves consistently regardless of whether the distance matrix is ultrametric. However, Figure 1 shows other drawbacks. First, $F_q$ becomes insensitive to $\mathbf{D}(h, 1)$ when $\mathbf{p}(\kappa)$ is perfectly even (shown analytically in Appendix A). Second, $F_q$ can paradoxically estimate a greater number of states than the theoretical maximum allows. That this occurs when the state probability distribution is more unequal violates the principle of transfers [20,33,39,40] (Section 2.1.1). This is made more problematic by the fact that, as Figure 1C shows, it occurs when one state is being pushed closer to the others (i.e., at smaller values of $h$). In summary, the functional Hill numbers estimate more states than are really present, despite the reduction in between-state distances and greater inequality in the probability mass function.

Figure 1C shows that the Leinster–Cobbold index compares favorably to $F_q$ because the former does not lose sensitivity to dissimilarity when $\mathbf{p}(\kappa)$ is perfectly even. However, Figure 1D shows that the Leinster–Cobbold index is particularly sensitive to the form of the similarity transformation. In the present case, the maximal value of $L_q$ gradually approaches 3 as $u$ grows (reaching 3 only as $u \to \infty$), while progressively losing sensitivity to distance. As mentioned by Leinster and Cobbold [10], the choice of $u$ or another similarity transformation depends on the importance assigned to functional differences between states. However, it is not clear how a given similarity transformation (e.g., the choice of $u$), and therefore the idealized reference system of $L_q$, should be validated.

Beyond these idiosyncratic limitations of existing numbers equivalent heterogeneity indices, we must highlight two basic assumptions they all share. First, they assume that some valid and reliable categorical partitioning of $\mathcal{X}$ is known a priori. Second, they assume that a distance function specified a priori describes the semantically relevant geometry of the system in question. These two limitations are not independent, since an unreliable categorical partitioning of the state space will lead to erroneous estimates of the pairwise distances between states. Thus, we seek an approach for measuring heterogeneity that has neither these limitations nor those shown above to be specific to particular numbers equivalent heterogeneity indices for non-categorical systems.

3. Representational Rényi Heterogeneity

In this section, we propose an alternative approach to the indices of Section 2.2 that we call representational Rényi heterogeneity (RRH). It involves transforming $X$ into a representation $Z$, defined on an unobservable or latent event space $\mathcal{Z}$, that satisfies two criteria:
  • The representation $Z$ captures the semantically relevant variation in $X$
  • Rényi heterogeneity can be directly computed on $Z$
Satisfaction of the first criterion can only be ascertained in a domain-specific fashion. Since $Z$ is essentially a model of $X$, investigators must justify that this model is appropriate for the scientific question at hand. For example, an investigator may evaluate the ability of $X$ to be reconstructed from representation $Z$ under cross-validation. The second criterion simply means that the transformation $X \to Z$ must specify a probability distribution on $\mathcal{Z}$ upon which the Rényi heterogeneity can be directly computed.
Figure 2 illustrates the basic idea of RRH. However, the specifics of this framework differ based on the topology of the representation $Z$. Thus, the remainder of this section discusses the following approaches:
A. Application of standard Rényi heterogeneity (Section 2.1) when $Z$ is a categorical representation
B. Derivation of parametric forms for Rényi heterogeneity when $Z$ is a non-categorical representation

3.1. Rényi Heterogeneity on Categorical Representations

Let $X$ be a system defined on an observable space $\mathcal{X}$ that is non-categorical and $n_x$-dimensional. Consider the scenario in which the semantically relevant variation in $X$ is categorical: for instance, images of different object categories stored in raw form as real-valued vectors. An investigator may be interested in measuring the effective number of states in $X$ with respect to this categorical variation. This requires transforming $X$ into a semantically relevant categorical representation $Z$ upon which Equation (3) can be applied.

Assume we have a large random sample of $N$ points $\mathbf{X} = (\mathbf{x}_i)_{i=1,2,\ldots,N}$ from system $X$. We can conceptualize each discrete observation $\mathbf{x}_i$ in this sample as the single point in the event space of a perfectly homogeneous subsystem $X_i$. When pooled, the subsystems $(X_i)_{i=1,2,\ldots,N}$ constitute $X$. The contribution weights of each subsystem to $X$ as a whole are denoted $\mathbf{w} = (w_i)_{i=1,2,\ldots,N}$, where $\sum_{i=1}^{N} w_i = 1$ and $w_i \geq 0$.

We now specify a vector-valued function $\mathbf{f}: \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ such that $\mathbf{x} \mapsto \mathbf{f}(\mathbf{x}) = (f_j(\mathbf{x}))_{j=1,2,\ldots,n_z}$ is a mapping from $n_x$-dimensional coordinates on the observable space, $\mathbf{x} \in \mathcal{X}$, onto an $n_z$-dimensional discrete probability distribution over $\mathcal{Z} = \{1, 2, \ldots, n_z\}$. Thus, $\mathbf{f}(\mathbf{x}_i)$ can be conceptualized as mapping subsystem $X_i$ onto its categorical representation $Z_i$. After defining $\mathbf{f}$, the effective number of states in the latent representation of $X_i$ can be computed as

$$\Pi_q(\mathbf{x}_i) = \left( \sum_{j=1}^{n_z} f_j^q(\mathbf{x}_i) \right)^{\frac{1}{1-q}}. \tag{27}$$

When $\Pi_q(\mathbf{x}_i) = 1$, then $\mathbf{f}$ assigns $\mathbf{x}_i$ to a single category with perfect certainty. Conversely, when $\Pi_q(\mathbf{x}_i) = n_z$, then either $\mathbf{x}_i$ belongs to all categorical states with equal probability, or $\mathbf{f}$ is maximally uncertain about the mapping of point $\mathbf{x}_i$.
Mapping all points $\mathbf{X}$ onto the categorical latent space yields a collection of subsystems $(Z_i)_{i=1,2,\ldots,N}$, which generate $Z$ when pooled. Using Equation (6), we can compute the effective number of total states in $Z$ as the pooled heterogeneity:

$$\Pi_q^P(\mathbf{X}, \mathbf{w}) = \left( \sum_{j=1}^{n_z} \left( \sum_{i=1}^{N} w_i f_j(\mathbf{x}_i) \right)^q \right)^{\frac{1}{1-q}}. \tag{28}$$

Unfortunately, $\Pi_q^P(\mathbf{X}, \mathbf{w})$ counts some heterogeneity that is due to uncertainty in the model (i.e., that quantified by Equation (27)). We therefore compute the effective number of states in $Z$ per point $\mathbf{x} \in \mathcal{X}$ using the within-group heterogeneity formula (Equation (7)):

$$\Pi_q^W(\mathbf{X}, \mathbf{w}) = \left( \sum_{i=1}^{N} \frac{w_i^q}{\sum_{k=1}^{N} w_k^q} \sum_{j=1}^{n_z} f_j^q(\mathbf{x}_i) \right)^{\frac{1}{1-q}}. \tag{29}$$

Finally, the effective number of states (points) in $X$—with respect to the categorical variation modeled by $Z$—can then be computed using the between-group heterogeneity formula (Equation (8)):

$$\Pi_q^B(\mathbf{X}, \mathbf{w}) = \frac{\Pi_q^P(\mathbf{X}, \mathbf{w})}{\Pi_q^W(\mathbf{X}, \mathbf{w})}. \tag{30}$$
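To make this pipeline concrete, here is a small end-to-end sketch (our own; the soft class assignments are hypothetical stand-ins for a learned model $\mathbf{f}$) applying Equations (28)–(30):

```python
import numpy as np

def categorical_rrh(F, w, q):
    """Between-group RRH (Equation (30)). F is an (N, n_z) matrix whose
    row i is the latent class distribution f(x_i); w sums to 1."""
    F, w = np.asarray(F, float), np.asarray(w, float)
    p_bar = w @ F  # pooled latent distribution (inner sum of Eq. (28))
    if np.isclose(q, 1.0):
        log_pbar = np.log(p_bar, where=p_bar > 0, out=np.zeros_like(p_bar))
        pooled = np.exp(-np.sum(p_bar * log_pbar))
        H = -np.sum(F * np.log(F, where=F > 0, out=np.zeros_like(F)), axis=1)
        within = np.exp(w @ H)
    else:
        pooled = np.sum(p_bar ** q) ** (1 / (1 - q))
        within = ((w ** q / np.sum(w ** q)) @ np.sum(F ** q, axis=1)) ** (1 / (1 - q))
    return pooled / within

# Three observations softly assigned to two latent classes
F = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
w = np.full(3, 1 / 3)
print(categorical_rrh(F, w, q=1))  # between 1 and 2 effective latent states
```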
Example 1 demonstrates that current methods of measuring biodiversity and wealth concentration can be viewed as special cases of categorical RRH.
Example 1 (Classical measurement of biodiversity and economic equality as categorical RRH).
Definitions necessary for this example are shown in Table 2. The traditional analysis of species diversity and economic equality can be recovered from an RRH-based formulation when $\mathbf{f}$ is assumed to be deterministic and $\mathbf{w} = (N^{-1})_{i=1,2,\ldots,N}$. In this case, the within-group heterogeneity can be shown to reduce to 1:

$$\Pi_q^W(\mathbf{X}, \mathbf{w}) = \left( \sum_{i=1}^{N} \frac{N^{-q}}{\sum_{k=1}^{N} N^{-q}} \sum_{j=1}^{n_z} f_j^q(\mathbf{x}_i) \right)^{\frac{1}{1-q}} = \left( \sum_{i=1}^{N} N^{-1} \right)^{\frac{1}{1-q}} = 1. \tag{31}$$

Thus, we have

$$\Pi_q^B(\mathbf{X}, \mathbf{w}) = \Pi_q^P(\mathbf{X}, \mathbf{w}) = \left( \sum_{j=1}^{n_z} \left( \sum_{i=1}^{N} N^{-1} f_j(\mathbf{x}_i) \right)^q \right)^{\frac{1}{1-q}} = \left( \sum_{j=1}^{n_z} \left( \frac{N_j}{N} \right)^q \right)^{\frac{1}{1-q}}, \tag{32}$$

which, with $N_j$ denoting the number of observations deterministically assigned to latent state $j$, yields the categorical Rényi heterogeneity (Hill numbers for biodiversity analysis and Hannah–Kay indices in the economic setting [19,20]), and by extension many diversity indices to which it is connected (Table 1). Thus, traditional analyses of species biodiversity and economic equality are special cases of representational Rényi heterogeneity where the representation is specified by a mapping onto degenerate distributions over categorical labels. The only differences lie in the definition of observable and latent spaces, and the representational models.
In the case of biodiversity analysis, the model f in real-world practice may simply be a human expert assigning species labels to a sample of organisms from a field study. In the economic setting, one may speculate that f would essentially reduce to contracts specifying ownership of assets, whose value is deemed by market forces.

3.2. Rényi Heterogeneity on Non-Categorical Representations

In Section 3.1, we dealt with instances in which the semantically relevant variation in $X$ is categorical, such as when object categories are embedded in images stored as real-valued vectors. Here, we consider scenarios in which the semantically relevant information in an observable system $X$ is non-categorical: for instance, where a piece of text contains information about semantic concepts best represented as real-valued "word vectors" [44,45]. Measuring the effective number of distinct states in $X$ with respect to this continuous variation requires transforming $X$ into a semantically relevant continuous representation $Z$, upon which procedures analogous to those of Section 3.1 may be undertaken.

Let $Z$ be defined on an $n_z$-dimensional event space $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ over which there exists a family of parametric probability distributions $\mathcal{P}(\mathcal{Z})$ of a form chosen by the experimenter. Let $f: \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ be a model that performs the mapping $\mathbf{x} \mapsto f(\cdot|\mathbf{x})$ from a point $\mathbf{x} \in \mathcal{X}$ on the observable space to a probability density on $\mathcal{Z}$. For example, if $\mathcal{P}(\mathcal{Z})$ is the family of multivariate Gaussians, then $f(\mathbf{z}|\mathbf{x}_i) = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the Gaussian mean and covariance functions at $\mathbf{x}_i$, respectively. Given a sample $\mathbf{X} = (\mathbf{x}_i)_{i=1,2,\ldots,N}$, as in Section 3.1, we compute the continuous analogue of Equation (27) as follows:

$$\Pi_q(\mathbf{x}_i) = \left( \int_{\mathcal{Z}} f^q(\mathbf{z}|\mathbf{x}_i) \, d\mathbf{z} \right)^{\frac{1}{1-q}}. \tag{33}$$

This formula yields the effective size of the domain of a uniform distribution on $\mathbb{R}^{n_z}$ whose Rényi heterogeneity is equal to $\Pi_q(\mathbf{x}_i)$ (proof given in Appendix A). Thus, it is possible for $\Pi_q(\mathbf{x}_i)$ to be less than 1, though it will remain non-negative.

Similar to the procedure in Section 3.1, we now define a continuous version of the within-observation heterogeneity,

$$\Pi_q^W(\mathbf{X}, \mathbf{w}) = \left( \sum_{i=1}^{N} \frac{w_i^q}{\sum_{j=1}^{N} w_j^q} \int_{\mathcal{Z}} f^q(\mathbf{z}|\mathbf{x}_i) \, d\mathbf{z} \right)^{\frac{1}{1-q}}, \tag{34}$$

which estimates the effective size of the latent space occupied per observable point $\mathbf{x} \in \mathcal{X}$.
In order to compute the pooled heterogeneity $\Pi_q^P(\mathbf{X}, \mathbf{w})$, the experimenter must specify the form of the pooled distribution, here denoted $\bar{f}_{\mathbf{w}}$. The conceptually simplest approach is non-parametric, using a model average,

$$\bar{f}_{\mathbf{w}}(\mathbf{z}|\mathbf{X}) = \sum_{i=1}^{N} w_i f(\mathbf{z}|\mathbf{x}_i), \tag{35}$$

whereby the pooled heterogeneity would be

$$\Pi_q^P(\mathbf{X}, \mathbf{w}) = \left( \int_{\mathcal{Z}} \left( \sum_{i=1}^{N} w_i f(\mathbf{z}|\mathbf{x}_i) \right)^q d\mathbf{z} \right)^{\frac{1}{1-q}}. \tag{36}$$

The integral in Equation (36) may often be analytically intractable and potentially difficult to solve accurately in high dimensions with numerical methods. Furthermore, some areas of $\mathcal{Z}$ may be assigned low probability by $f(\mathbf{z}|\mathbf{x}_i)$ for all $i \in \{1, 2, \ldots, N\}$. This is not a problem as the sample $\mathbf{X}$ becomes infinitely large. However, with finite samples, it may be the case that some representational states in $\mathcal{Z}$ are unlikely simply because we have not sampled from the corresponding regions of $\mathcal{X}$. An alternative to Equation (35) is therefore to specify a parametric pooled distribution

$$\bar{f}_{\mathbf{w}}(\cdot|\mathbf{X}) = \Xi_f(\mathbf{X}, \mathbf{w}), \tag{37}$$

where $\Xi_f$ is a deterministic function that combines $f(\cdot|\mathbf{x}_i)$ for $i \in \{1, 2, \ldots, N\}$ into a valid probability density on $\mathcal{Z}$. In this case, the pooled Rényi heterogeneity is simply

$$\Pi_q^P(\mathbf{X}, \mathbf{w}) = \left( \int_{\mathcal{Z}} \bar{f}_{\mathbf{w}}^q(\mathbf{z}|\mathbf{X}) \, d\mathbf{z} \right)^{\frac{1}{1-q}}. \tag{38}$$
Using either Equation (36) or (38) as the pooled heterogeneity and Equation (34) as the within-group heterogeneity, the effective number of distinct states in X—with respect to the non-categorical representation Z—can then be computed using Equation (30).
Figure 3 demonstrates the difference between the parametric and non-parametric approaches to pooling for non-categorical RRH, and Example 2 demonstrates one approach to parametric pooling for a mixture of multivariate Gaussians.
Example 2 (Parametric pooling of multivariate Gaussian distributions).
Let $\mathbf{X} = (\mathbf{x}_i)_{i=1,2,\ldots,N}$ be a sample of $n_x$-dimensional vectors from a system $X$ with event space $\mathcal{X} \subseteq \mathbb{R}^{n_x}$. Let $Z$ be a latent representation of $X$ with $n_z$-dimensional event space $\mathcal{Z} = \mathbb{R}^{n_z}$. Let

$$f(\mathbf{z}|\mathbf{x}_i) = \mathcal{N}\left( \mathbf{z} \,|\, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \right) \tag{39}$$

be a model that returns a multivariate Gaussian density with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$ given point $\mathbf{x}_i \in \mathcal{X}$. Finally, let $\mathbf{w} = (w_i)_{i=1,2,\ldots,N}$ be weights assigned to each sample in $\mathbf{X}$ such that $w_i \geq 0$ and $\sum_{i=1}^{N} w_i = 1$.

If one assumes that the pooled distribution over $\mathcal{Z}$ given the set of components $\{f(\mathbf{z}|\mathbf{x}_1), f(\mathbf{z}|\mathbf{x}_2), \ldots, f(\mathbf{z}|\mathbf{x}_N)\}$ is itself a multivariate Gaussian,

$$\bar{f}_{\mathbf{w}}(\mathbf{z}|\mathbf{X}) = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \tag{40}$$

with $n_z \times 1$ pooled mean,

$$\boldsymbol{\mu} = \sum_{i=1}^{N} w_i \boldsymbol{\mu}_i, \tag{41}$$

and $n_z \times n_z$ pooled covariance matrix,

$$\boldsymbol{\Sigma} = -\boldsymbol{\mu}\boldsymbol{\mu}^\top + \sum_{i=1}^{N} w_i \left( \boldsymbol{\Sigma}_i + \boldsymbol{\mu}_i \boldsymbol{\mu}_i^\top \right), \tag{42}$$

then the pooled heterogeneity $\Pi_q^P$ is simply the Rényi heterogeneity of a multivariate Gaussian,

$$\Pi_q(\boldsymbol{\Sigma}) = \begin{cases} \text{Undefined} & q = 0 \\ (2\pi e)^{\frac{n_z}{2}} \sqrt{|\boldsymbol{\Sigma}|} & q = 1 \\ (2\pi)^{\frac{n_z}{2}} \sqrt{|\boldsymbol{\Sigma}|} & q = \infty \\ (2\pi)^{\frac{n_z}{2}} q^{\frac{n_z}{2(q-1)}} \sqrt{|\boldsymbol{\Sigma}|} & \text{Otherwise} \end{cases} \tag{43}$$

evaluated at $\boldsymbol{\Sigma}$. The derivation is provided in Appendix A [46]. Equation (43) at $\boldsymbol{\Sigma}$ is interpreted as the effective size of the space $\mathcal{Z}$ occupied by the complete latent representation of $X$ under model $f$.

The within-group heterogeneity can be obtained for the set of components $(f(\mathbf{z}|\mathbf{x}_i))_{i=1,2,\ldots,N}$ by solving Equation (34) for the Gaussian densities, yielding:

$$\Pi_q^W(\boldsymbol{\Sigma}_{1:N}, \mathbf{w}) = \begin{cases} \text{Undefined} & q = 0 \\ \exp\left\{ \frac{1}{2}\left( n_z + \sum_{i=1}^{N} w_i \log\left( (2\pi)^{n_z} |\boldsymbol{\Sigma}_i| \right) \right) \right\} & q = 1 \\ 0 & q = \infty \\ (2\pi)^{\frac{n_z}{2}} \left( \sum_{i=1}^{N} \bar{w}_i^q \, |\boldsymbol{\Sigma}_i|^{\frac{1-q}{2}} q^{-\frac{n_z}{2}} \right)^{\frac{1}{1-q}} & \text{Otherwise} \end{cases} \tag{44}$$

where we denote $\boldsymbol{\Sigma}_{1:N} = (\boldsymbol{\Sigma}_i)_{i=1,2,\ldots,N}$ for parsimony, and $\bar{w}_i = w_i \left( \sum_{j=1}^{N} w_j^q \right)^{-1/q}$. Equation (44) estimates the effective size of the $n_z$-dimensional representational space occupied per state $\mathbf{x} \in \mathcal{X}$.
The effective number of states in $X$ with respect to the continuous representation $Z$ is thus the between-group heterogeneity $\Pi_q^B$, which can be computed as the ratio $\Pi_q(\boldsymbol{\Sigma}) / \Pi_q^W(\boldsymbol{\Sigma}_{1:N}, \mathbf{w})$. The properties of this decomposition—specifically the conditions under which $\Pi_q^B \geq 1$ (Lande's requirement [28,42])—are discussed further elsewhere [46].
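For readers who want to experiment, the following sketch (our own NumPy translation of Equations (41)–(44), covering $q \notin \{0, \infty\}$; function names are hypothetical) computes the between-group heterogeneity of a Gaussian mixture representation under parametric pooling:

```python
import numpy as np

def gauss_renyi(Sigma, q):
    """Rényi heterogeneity of a multivariate Gaussian (Equation (43))."""
    n = Sigma.shape[0]
    root_det = np.sqrt(np.linalg.det(Sigma))
    if np.isclose(q, 1.0):
        return (2 * np.pi * np.e) ** (n / 2) * root_det
    return (2 * np.pi) ** (n / 2) * q ** (n / (2 * (q - 1))) * root_det

def gaussian_rrh(mus, Sigmas, w, q):
    """Between-group RRH under parametric Gaussian pooling (Example 2).
    mus: (N, n) means; Sigmas: (N, n, n) covariances; w sums to 1."""
    mus, Sigmas, w = (np.asarray(a, float) for a in (mus, Sigmas, w))
    n = mus.shape[1]
    mu = w @ mus                                          # Equation (41)
    Sigma = (-np.outer(mu, mu)
             + np.einsum('i,ijk->jk', w, Sigmas)
             + np.einsum('i,ij,ik->jk', w, mus, mus))     # Equation (42)
    pooled = gauss_renyi(Sigma, q)
    root_dets = np.sqrt(np.linalg.det(Sigmas))            # sqrt |Sigma_i|
    if np.isclose(q, 1.0):
        within = (2 * np.pi * np.e) ** (n / 2) * np.exp(w @ np.log(root_dets))
    else:                                                 # Equation (44)
        wbar_q = w ** q / np.sum(w ** q)
        within = ((2 * np.pi) ** (n / 2) * q ** (n / (2 * (q - 1)))
                  * np.sum(wbar_q * root_dets ** (1 - q)) ** (1 / (1 - q)))
    return pooled / within
```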

4. Empirical Applications of Representational Rényi Heterogeneity

In this section, we demonstrate two applications of RRH under assumptions of categorical (Section 4.1) and continuous (Section 4.2) latent spaces. First, Section 4.1 uses a simple closed-form system consisting of a mixture of two beta distributions on the $(0, 1)$ interval to give exact comparisons of the behavior of RRH against that of existing non-categorical heterogeneity indices (Section 2.2). This experiment provides evidence that existing non-categorical heterogeneity indices can demonstrate counterintuitive behavior under various circumstances. Second, Section 4.2 demonstrates that RRH can yield heterogeneity measurements that are sensible and tractably computed, even for highly complex mappings $f: \mathcal{X} \to \mathcal{P}(\mathcal{Z})$. There, we use a deep neural network to compute the effective number of observations in a database of handwritten images with respect to compressed latent representations on a continuous space.

4.1. Comparison of Heterogeneity Indices Under a Mixture of Beta Distributions

Consider a system $X$ with event space $\mathcal{X}$ on the open interval $(0, 1)$, containing an embedded, unobservable, categorical structure represented by the latent system $Z$ with event space $\mathcal{Z} = \{1, 2\}$. The systems' collective behavior is governed by the joint distribution of a beta mixture model (BMM),

$$p(x, z) = \mathbb{1}[z = 1] \, (1 - \theta_1) \, \mathrm{Beta}_{\theta_2, \theta_3}(x) + \mathbb{1}[z = 2] \, \theta_1 \, \mathrm{Beta}_{\theta_3, \theta_2}(x), \tag{45}$$

where $\mathrm{Beta}_{\alpha, \beta}(x)$ is the probability density function for a beta distribution with shape parameters $(\alpha, \beta)$, and $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)$ are parameters. The indicator function $\mathbb{1}[\cdot]$ evaluates to 1 if its argument is true, and to 0 otherwise. The prior distribution is

$$p(z) = \mathbb{1}[z = 1] \, (1 - \theta_1) + \mathbb{1}[z = 2] \, \theta_1, \tag{46}$$

and the marginal probability of observable data is as follows (see Figure 4 for illustrations):

$$p(x) = (1 - \theta_1) \, \mathrm{Beta}_{\theta_2, \theta_3}(x) + \theta_1 \, \mathrm{Beta}_{\theta_3, \theta_2}(x). \tag{47}$$
To facilitate exact comparisons between heterogeneity indices below, let us assume we have a model $f: \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ that maps an observation $x \in \mathcal{X}$ onto a degenerate distribution over $\mathcal{Z}$:

$$f_{\boldsymbol{\theta}}(z|x) = \mathbb{1}[z = 1] \, \mathbb{1}[x \leq \tau(\boldsymbol{\theta})] + \mathbb{1}[z = 2] \, \mathbb{1}[x > \tau(\boldsymbol{\theta})]. \tag{48}$$

The subscripting of $f_{\boldsymbol{\theta}}$ denotes that the model is optimized such that the threshold $0 \leq \tau(\boldsymbol{\theta}) \leq 1$ is the solution to

$$p(z = 1 \,|\, x = \tau(\boldsymbol{\theta})) = p(z = 2 \,|\, x = \tau(\boldsymbol{\theta})), \tag{49}$$

which is

$$\tau(\boldsymbol{\theta}) = \begin{cases} \dfrac{\left( \frac{1 - \theta_1}{\theta_1} \right)^{\frac{1}{\theta_3 - \theta_2}}}{1 + \left( \frac{1 - \theta_1}{\theta_1} \right)^{\frac{1}{\theta_3 - \theta_2}}} & \theta_2 \neq \theta_3 \\ 0 & (\theta_2 = \theta_3) \wedge \left(\theta_1 > \frac{1}{2}\right) \\ 1 & \text{Otherwise} \end{cases} \tag{50}$$
Under this model, the categorical RRH at point $x \in \mathcal{X}$ is

$$\Pi_q(x) = \left( \sum_{i=1}^{2} f_{\boldsymbol{\theta}}^q(z = i \,|\, x) \right)^{\frac{1}{1-q}} = \left( \mathbb{1}^q[x \leq \tau(\boldsymbol{\theta})] + \mathbb{1}^q[x > \tau(\boldsymbol{\theta})] \right)^{\frac{1}{1-q}} = 1. \tag{51}$$

The expected value of $f_{\boldsymbol{\theta}}(z = 2 \,|\, x)$ with respect to the data-generating distribution (Equation (47)) is

$$\bar{f}_{\boldsymbol{\theta}}(z = 2) = \mathbb{E}_{x \sim p(x)}\left[ f_{\boldsymbol{\theta}}(z = 2 \,|\, x) \right] = \int_0^1 p(x) \, \mathbb{1}[x > \tau(\boldsymbol{\theta})] \, dx = \int_{\tau(\boldsymbol{\theta})}^1 p(x) \, dx = (1 - \theta_1) \, I_{\tau(\boldsymbol{\theta})}^1(\theta_2, \theta_3) + \theta_1 \, I_{\tau(\boldsymbol{\theta})}^1(\theta_3, \theta_2), \tag{52}$$

where $I_{x_0}^{x_1}(a, b)$ is the generalized regularized incomplete beta function (the BetaRegularized[$x_0$, $x_1$, $a$, $b$] command in the Wolfram language and betainc($a$, $b$, $x_0$, $x_1$, regularized=True) in Python's mpmath package). Equation (52) implies that $\bar{f}_{\boldsymbol{\theta}}(z = 1) = 1 - \bar{f}_{\boldsymbol{\theta}}(z = 2)$. The pooled heterogeneity is thus expressed as a function of $\boldsymbol{\theta}$ as follows:

$$\Pi_q^P(\boldsymbol{\theta}) = \begin{cases} \sum_{i=1}^{2} \mathbb{1}[\bar{f}_{\boldsymbol{\theta}}(z = i) > 0] & q = 0 \\ \exp\left\{ -\sum_{i=1}^{2} \bar{f}_{\boldsymbol{\theta}}(z = i) \log \bar{f}_{\boldsymbol{\theta}}(z = i) \right\} & q = 1 \\ \left( \max_i \bar{f}_{\boldsymbol{\theta}}(z = i) \right)^{-1} & q = \infty \\ \left( \sum_{i=1}^{2} \bar{f}_{\boldsymbol{\theta}}^q(z = i) \right)^{\frac{1}{1-q}} & \text{Otherwise} \end{cases} \tag{53}$$
As a function of $\boldsymbol{\theta}$, the within-group heterogeneity is

$$\Pi_q^W(\boldsymbol{\theta}) = \left( \int_0^1 \frac{p^q(x)}{\int_0^1 p^q(u) \, du} \sum_{i=1}^{2} f_{\boldsymbol{\theta}}^q(z = i \,|\, x) \, dx \right)^{\frac{1}{1-q}} = \left( \int_0^1 \frac{p^q(x)}{\int_0^1 p^q(u) \, du} \, dx \right)^{\frac{1}{1-q}} = 1, \tag{54}$$

and therefore the between-group heterogeneity is $\Pi_q^B(\boldsymbol{\theta}) = \Pi_q^P(\boldsymbol{\theta})$.
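As a numerical check on this analysis (our own sketch, using SciPy's beta distribution and the threshold we derived from Equation (49); function names are hypothetical), the between-group RRH of the BMM can be computed as:

```python
import numpy as np
from scipy.stats import beta

def bmm_between_rrh(theta1, theta2, theta3, q):
    """Between-group RRH of the two-component beta mixture
    (Equations (50)-(54)); the within-group heterogeneity is 1."""
    if theta2 == theta3:
        tau = 0.0 if theta1 > 0.5 else 1.0  # degenerate threshold (Eq. (50))
    else:
        r = ((1 - theta1) / theta1) ** (1.0 / (theta3 - theta2))
        tau = r / (1.0 + r)
    # Expected latent assignment under the marginal p(x) (Equation (52))
    f2 = ((1 - theta1) * beta.sf(tau, theta2, theta3)
          + theta1 * beta.sf(tau, theta3, theta2))
    p_bar = np.array([1.0 - f2, f2])
    p_bar = p_bar[p_bar > 0]
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p_bar * np.log(p_bar)))
    return np.sum(p_bar ** q) ** (1.0 / (1.0 - q))

print(bmm_between_rrh(0.7, 5, 5, q=2))    # 1.0: overlapping components
print(bmm_between_rrh(0.5, 5, 20, q=2))   # ~2.0: well-separated components
```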
Analytic expressions for the existing non-categorical heterogeneity indices $\hat{Q}_e$ (Equation (17)), $F_q$ (Equation (21)), and $L_q$ (Equation (23)) were computed as "best-case" scenarios, as follows. First, the probability distribution over states for all expressions was the true prior distribution (Equation (46)). Distance matrices—and by extension, the similarity matrix for $L_q$—were computed using the closed-form expectation of the absolute distance between two beta-distributed random variables (see Appendix B and the Supplementary Materials).
Figure 5 compares the categorical RRH against $\hat{Q}_e$, $F_q$, and $L_q$ for BMM distributions of varying degrees of separation, and across different mixture component weights ($0.5 \leq \theta_1 < 1$). Without significant loss of generality, we show only comparisons at $q = 1$ (which excludes the numbers equivalent quadratic entropy) and $q = 2$.
The most salient differences between these indices occur when the BMM mixture components completely overlap (i.e., at $\theta_2 = \theta_3$). The RRH correctly identifies that there is effectively only one component, regardless of mixture weights. Only the Leinster–Cobbold index showed invariance to the mixture weights when $\theta_2 = \theta_3$, but it could not correctly identify that the data were effectively unimodal.
The other stark difference arose when the mixture components were furthest apart (here, when $\theta_2 = 5$ and $\theta_3 = 20$). At this setting, the functional Hill numbers showed a paradoxical increase in the heterogeneity estimate as the prior distribution on components was skewed. The Leinster–Cobbold index was appropriately concave throughout the range of prior weights, but it never reached a value of 2 at its peak (as expected based on the predictions outlined in Section 2.2.3). Conversely, the RRH was always concave and reached a peak of 2 when both mixture components were equally probable.

4.2. Representational Rényi Heterogeneity Is Scalable to Deep Learning Models

In this example, the observable system $X$ is that of images of handwritten digits defined on an event space $\mathcal{X} = [0, 1]^{784}$ of dimension $n_x = 784$ (the black-and-white images are flattened from $28 \times 28$ pixel matrices into 784-dimensional vectors). Our sample $\mathbf{X} = (x_{ij})_{i=1,\ldots,N}^{j=1,\ldots,784}$ from this space is the familiar MNIST training dataset [22] (Figure 6), which consists of $N = 60{,}000$ images roughly evenly distributed across the digit classes $\{0, 1, \ldots, 9\}$ (approximately 10% of all images come from each class). We assume each image carries equal importance, given by a weight vector $\mathbf{w} = (N^{-1})_{i=1,2,\ldots,N}$. We are interested in measuring the heterogeneity of $X$ with respect to a continuous latent representation $Z$ defined on event space $\mathcal{Z} = \mathbb{R}^2$. In the present example, this space is simply the continuous 2-dimensional compression of an image that best facilitates its reconstruction. We choose a dimensionality of $n_z = 2$ for the latent space in order to facilitate a pedagogically useful visualization of the latent feature representation, below. Unlike Section 4.1, in the present case we have no explicit representation of the true marginal distribution over the data, $p(x)$.
Having defined the observable and latent spaces, measuring RRH now requires defining a model $f: \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ that maps a (flattened) image vector $\mathbf{x}_i \in \mathcal{X}$ onto a probability distribution over the latent space. Our chosen model is the encoder module of a pre-trained convolutional variational autoencoder (cVAE) provided by the Smart Geometry Processing Group at University College London (https://colab.research.google.com/github/smartgeometry-ucl/dl4g/blob/master/variational_autoencoder.ipynb) (Figure 7) [23,24]:

$$f_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}_i) = \mathcal{N}\left( \mathbf{z} \,|\, \mathbf{m}(\mathbf{x}_i), \mathbf{C}(\mathbf{x}_i) \right), \tag{55}$$

where $\boldsymbol{\phi}$ are the encoder's parameters, which specify a convolutional neural network (CNN) whose output layer returns a $2 \times 1$ mean vector $\mathbf{m}(\mathbf{x}_i)$ and a $2 \times 1$ log-variance vector $\mathbf{s}(\mathbf{x}_i)$ given $\mathbf{x}_i$. For simplicity, we denote the latter as the $2 \times 2$ diagonal covariance matrix $\mathbf{C}(\mathbf{x}_i) = \left( e^{s_j(\mathbf{x}_i)} \delta_{jk} \right)_{j=1,2}^{k=1,2}$. Further details of the cVAE and its training can be found in Kingma and Welling [23,24]; the specific implementation used here is the pre-trained one noted above. Briefly, the cVAE learns to generate a compressed latent representation (via the encoder $f_{\boldsymbol{\phi}}$, which is an approximate posterior distribution) that contains enough information about the input $\mathbf{x}_i$ to facilitate its reconstruction by a "decoder" module. The objective function is a lower bound on the model evidence $p(\mathbf{x})$, maximization of which is equivalent to minimizing the Kullback–Leibler divergence between the approximate and true (but unknown) posteriors $f_{\boldsymbol{\phi}}$ and $p(\mathbf{z}|\mathbf{x})$, respectively.
The continuous RRH under the model in Equation (55) for a single example $\mathbf{x}_i \in \mathbf{X}$ can be computed by merely evaluating the Rényi heterogeneity of a multivariate Gaussian (Equation (43) in Example 2) at the covariance matrix $\mathbf{C}(\mathbf{x}_i)$. This is interpreted as the effective area of the 2-dimensional latent space consumed by the representation of $\mathbf{x}_i$.

Since the handwritten digit images belong to groups of "Zeros, Ones, Twos, …, Nines," this section will call the quantity $\Pi_q^W$ the within-observation heterogeneity (rather than the "within-group" heterogeneity) in order to avoid its interpretation as measuring the heterogeneity of a group of digits. Rather, it is interpreted as the effective area of latent space consumed by the representation of a single observation $\mathbf{x} \in \mathcal{X}$ on average. It is computed by evaluating Equation (44) at $\mathbf{C}(\mathbf{X}) = (\mathbf{C}(\mathbf{x}_i))_{i=1,2,\ldots,N}$, given uniform weights on samples.
Finally, to compute the pooled heterogeneity $\Pi_q^P$, we use the parametric pooling approach detailed in Example 2, wherein the pooled distribution is a multivariate Gaussian with mean and covariance given by Equations (41) and (42), respectively. The pooled heterogeneity is then merely Equation (43) evaluated at $\mathbf{C}(\mathbf{X})$, and represents the total amount of area in the latent space consumed by the representation of $\mathbf{X}$ under $f_{\boldsymbol{\phi}}$. The effective number of observations in $\mathbf{X}$ with respect to the continuous latent representation $Z$ is, therefore, given by the between-observation heterogeneity:

$$\Pi_q^B(\mathbf{C}(\mathbf{X}), \mathbf{w}) = \frac{\Pi_q^P(\mathbf{C}(\mathbf{X}))}{\Pi_q^W(\mathbf{C}(\mathbf{X}), \mathbf{w})}. \tag{56}$$

Equation (56) gives the effective number of observations in $\mathbf{X}$ because it uses the entire sample $\mathbf{X}$ (assuming, of course, that $\mathbf{X}$ provides adequate coverage of the observable event space). However, one could compute the effective number of observations in a subset of $\mathbf{X}$, if necessary. Let $\mathbf{X}^{(j)} = (\mathbf{x}_k)_{k=1,2,\ldots,N_j}$ be the subset of $N_j$ points in $\mathbf{X}$ found in the observable subspace $\mathcal{X}_j \subseteq \mathcal{X}$ (such as the subspace of MNIST digits corresponding to a given digit class). Given corresponding weights $\mathbf{w}^{(j)} = (N_j^{-1})_{k=1,2,\ldots,N_j}$, Equation (56) is then simply

$$\Pi_q^B\left( \mathbf{C}(\mathbf{X}^{(j)}), \mathbf{w}^{(j)} \right) = \frac{\Pi_q^P\left( \mathbf{C}(\mathbf{X}^{(j)}) \right)}{\Pi_q^W\left( \mathbf{C}(\mathbf{X}^{(j)}), \mathbf{w}^{(j)} \right)}. \tag{57}$$
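In practice, the entire computation reduces to a few array operations on the encoder outputs. The sketch below (our own, not the paper's supplementary code; it assumes an encoder returning means M and log-variances S as NumPy arrays) implements Equation (56) for diagonal posteriors with uniform weights:

```python
import numpy as np

def effective_n_observations(M, S, q=2.0):
    """Between-observation heterogeneity (Equation (56)) for a Gaussian
    encoder with diagonal covariances. M, S: (N, n_z) means/log-variances."""
    M, S = np.asarray(M, float), np.asarray(S, float)
    N, nz = M.shape
    w = np.full(N, 1.0 / N)
    mu = w @ M                                  # pooled mean (Equation (41))
    Sigma = (np.einsum('i,ij,ik->jk', w, M, M)  # pooled covariance (Eq. (42))
             - np.outer(mu, mu) + np.diag(w @ np.exp(S)))
    root_det_pool = np.sqrt(np.linalg.det(Sigma))
    root_dets = np.exp(0.5 * np.sum(S, axis=1))  # sqrt determinants of C(x_i)
    if np.isclose(q, 1.0):
        pooled = (2 * np.pi * np.e) ** (nz / 2) * root_det_pool
        within = (2 * np.pi * np.e) ** (nz / 2) * np.exp(np.mean(np.log(root_dets)))
    else:
        c = (2 * np.pi) ** (nz / 2) * q ** (nz / (2 * (q - 1)))
        pooled = c * root_det_pool
        within = c * np.mean(root_dets ** (1 - q)) ** (1 / (1 - q))
    return pooled / within
```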
Figure 8 shows the effective number of observations in the subsets of MNIST images belonging to each image class, under the continuous representation learned by the cVAE. One can appreciate that the MNIST class of “Ones” (in the training set) has the smallest effective number of observations. Subjective visual inspection of the MNIST samples in Figure 6 may suggest that the Ones are indeed relatively more homogeneous as a group than the other digits (this claim is given further objective support in Appendix C based on deep similarity metric learning [47,48]).
Figure 9 demonstrates the correspondence of between-observation heterogeneity (i.e., the effective number of observations) and the visual diversity of different samples from the latent space of our cVAE model. For each image in the MNIST training dataset, we computed the effective location of its latent representation: $\mathbf{m}(\mathbf{x}_i)$ for $i \in \{1, 2, \ldots, N\}$. For each of these image representations, we defined a "neighborhood" including the 49 other images whose latent coordinates were closest in Euclidean distance (which is sensible on the latent space given the Gaussian prior). For all such neighborhoods, we then reconstructed the corresponding images on $\mathcal{X}$, whose between-observation heterogeneity was then computed using Equation (57). Figure 9b shows the estimated effective number of observations for the latent neighborhoods with the greatest and least heterogeneity. One can appreciate that neighborhoods with $\Pi_q^B$ close to 1 include images with considerably less diversity than neighborhoods with $\Pi_q^B$ closer to the upper limit of 49. These data suggest that the between-observation heterogeneity—which is the effective number of observations in $\mathbf{X}$ with respect to the latent features learned by a cVAE—can indeed correspond to visually appreciable sample diversity.

5. Discussion

This paper introduced representational Rényi heterogeneity, a measurement approach that satisfies the replication principle [1,38,41] and is decomposable [28] while requiring neither a priori (A) categorical partitioning nor (B) specification of a distance function on the input space. Rather, the experimenter is free to define a model that maps observable data onto a semantically relevant domain upon which Rényi heterogeneity may be tractably computed, and where a distance function need not be explicitly manipulated. These properties facilitate heterogeneity measurements for several new applications. Compared to state-of-the-art comparator indices under a beta mixture distribution, RRH more reliably quantified the number of unique mixture components (Section 4.1), and under a deep generative model of image data, RRH was able to measure the effective number of distinct images with respect to latent continuous representations (Section 4.2). In this section, we further synthesize our conclusions, discuss their implications, and highlight open questions for future research.
The main problem we set out to address was that all state-of-the-art numbers equivalent heterogeneity measures (Section 2.2) require a priori specification of a distance function and categorical partitioning on the observable space. To this end, we showed that RRH does not require categorical partitioning of the input space (Section 3). Although our analysis under the two-component BMM assumed that the number of components was known, RRH was the only index able to accurately identify an effectively singular cluster (i.e., where mixture components overlapped; Figure 5). We also showed that the categorical RRH did not violate the principle of transfers [39,40] (i.e., it was strictly concave with respect to mixture component weights), unlike the functional Hill numbers (Figure 5). Future studies should extend this evaluation to mixtures of other distributional forms in order to better characterize the generalizability of our conclusions.
Section 3.1 and Section 3.2 both showed that RRH does not require specification of a distance function on the observable space. Instead, one must specify a model that maps the observable space onto a probability distribution over the latent representation. This is beneficial since input space distances are often irrelevant or misleading. For example, latent representations of image data learned by a convolutional neural network will be robust to translations of the inputs since convolution is translation invariant. However, pairwise distances on the observable space will be exquisitely sensitive to semantically irrelevant translations of input data. Furthermore, semantically relevant information must often be learned from raw data using hierarchical abstraction. Ultimately, when (A) pre-defined distance metrics are sensitive to noisy perturbations of the input space, or (B) the relevant semantic content of some input data is best captured by a latent abstraction, the RRH measure will be particularly useful.
The requirement of specifying a representational model f : X P ( Z ) implies the additional problem of model selection. In Section 3, we noted that the determination of whether a model is appropriate must be made in a domain-specific fashion. For instance, the method by which ecologists assign species labels prior to measurement of species diversity implies the use of a mapping from the observable space of organisms to a degenerate distribution over species labels (Example 1). In Section 4.2, we used the encoder module of a cVAE (a generative model based on a convolutional neural network architecture [23,24]) to represent images as 2-dimensional real-valued vectors in order to demonstrate our ability to capture variation in digits’ written forms (see Figure 7B and Figure 9). Someone concerned with measuring heterogeneity of image batches in terms of the digit-class distribution could choose a categorical latent representation corresponding to the digit classes (this would return the effective number of digit classes per sample). Regardless, the model used to map between observations and the latent space should be validated using either explanatory power (e.g., maximization of a lower bound on the model evidence), generalizability (e.g., out of sample predictive power), or another approach that is justifiable within the investigator’s scientific domain of interest.
In addition to the results of empirical applications of RRH in Section 4, we were also able to show that RRH generalizes the process by which species diversity and indices of economic equality are computed (Example 1). In doing so, we are able to clarify some of the assumptions inherent in those indices. Specifically, that assignment of species or ownership labels (in ecological and economic settings, respectively) corresponds to mapping from an observable space, such as the space of organisms’ identifiable features or the space of economic resources, onto a degenerate distribution over the categorical labels (Table 2). It is possible that altering the form of that mapping may yield new insights about ecological and economic diversity.
In conclusion, we have introduced an approach for measuring heterogeneity that requires neither (A) categorical partitioning nor (B) distance measure on the observable space. Our RRH method enables measurement of heterogeneity in disciplines where categorical entities are unreliably defined, or where relevant semantic content of some data is best captured by a hierarchical abstraction. Furthermore, our approach includes many existing heterogeneity indices as special cases, while facilitating clarification of many of their assumptions. Future work should evaluate the RRH in practice and under a broader array of models.

Supplementary Materials

The following are available online at https://www.mdpi.com/1099-4300/22/4/417/s1, Supplementary materials include code for Section 2, Section 3 and Section 4 and Appendix B (RRH_Supplement_3State_BMM_CVAE.ipynb), and Appendix C (RRH_Supplement_Siamese.ipynb).

Author Contributions

Conceptualization, A.N.; methodology, A.N.; validation, A.N.; formal analysis, A.N.; investigation, A.N.; resources, T.T.; data curation, A.N.; writing—original draft preparation, A.N.; writing—review and editing, A.N., M.A., T.B., T.T.; visualization, A.N.; supervision, M.A., T.B., T.T.; project administration, A.N.; funding acquisition, A.N., M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Genome Canada (A.N., M.A.), the Nova Scotia Health Research Foundation (A.N.), the Killam Trusts (A.N.), and the Ruth Wagner Memorial Fund (A.N.).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Mathematical Appendix

Proposition A1.
Rényi heterogeneity (Equation (3)) obeys the replication principle.
Proof. 
The Rényi heterogeneity for a single distribution p_i = (p_ij)_{j=1,2,…,n_i}, where n_i ∈ ℕ₊ is the size of the state space in system i, is

$$\Pi_q\left(\mathbf{p}_i\right)=\left(\sum_{j=1}^{n_i} p_{ij}^{q}\right)^{\frac{1}{1-q}}, \tag{A1}$$
and for the pooled distribution over the aggregation of N subsystems is

$$\Pi_q\left(\bar{\mathbf{p}}\right)=\left(\sum_{i=1}^{N} \sum_{j=1}^{n_i}\left(\frac{p_{ij}}{N}\right)^{q}\right)^{\frac{1}{1-q}}. \tag{A2}$$
The replication principle asserts that

$$\Pi_q\left(\bar{\mathbf{p}}\right)=N\,\Pi_q\left(\mathbf{p}_i\right). \tag{A3}$$
Let λ_i = Σ_{j=1}^{n_i} p_ij^q, and recall that λ_i = λ_k for all i, k ∈ {1, 2, …, N} (the subsystems are assumed equally heterogeneous). Then,

$$\begin{aligned}
\left(N^{-q} \sum_{i=1}^{N} \sum_{j=1}^{n_i} p_{ij}^{q}\right)^{\frac{1}{1-q}} &= N\left(\sum_{j=1}^{n_i} p_{ij}^{q}\right)^{\frac{1}{1-q}} \\
\left(N^{-q} \sum_{i=1}^{N} \lambda_i\right)^{\frac{1}{1-q}} &= N \lambda_i^{\frac{1}{1-q}} \\
\left(N^{1-q} \lambda_i\right)^{\frac{1}{1-q}} &= N \lambda_i^{\frac{1}{1-q}} \\
N \lambda_i^{\frac{1}{1-q}} &= N \lambda_i^{\frac{1}{1-q}}.
\end{aligned} \tag{A4}$$

Since lim_{q→1} λ_i^{1/(1−q)} exists (it is the perplexity index), the result also holds at q = 1.  □
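As an informal numerical check (a minimal sketch in Python, separate from the published supplementary notebooks), one can verify Equation (A3) directly for a small example:

```python
import numpy as np

def renyi_heterogeneity(p, q):
    """Rényi heterogeneity (Hill number) of order q for a probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p * np.log(p)))  # q -> 1 limit (perplexity)
    return np.sum(p ** q) ** (1.0 / (1.0 - q))

# Pool N identical, equally weighted subsystems with disjoint state spaces:
# the pooled distribution concatenates N copies of p_sub, each scaled by 1/N.
p_sub = np.array([0.5, 0.3, 0.2])
N = 4
p_pooled = np.tile(p_sub / N, N)

for q in [0.5, 1.0, 2.0]:
    assert np.isclose(renyi_heterogeneity(p_pooled, q),
                      N * renyi_heterogeneity(p_sub, q))
```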
Proposition A2.
For a system X with probability mass function represented by the vector p = (p_i)_{i=1,2,…,n} on event space 𝒳 = {1, 2, …, n}, and with distance function d_𝒳 : 𝒳 × 𝒳 → ℝ≥0 represented by the n × n matrix D = [d_𝒳(i,j)]_{i,j=1,…,n}, the functional Hill numbers family of indices

$$F_q\left(\mathbf{D}, \mathbf{p}\right)=\left(\frac{Q_q\left(\mathbf{D}, \mathbf{p}\right)}{Q_1\left(\mathbf{D}, \mathbf{p}\right)}\right)^{\frac{1}{2(1-q)}} \tag{A5}$$

is insensitive to d_𝒳(i,j) for all (i,j) ∈ 𝒳 × 𝒳 when p is uniform.
Proof. 
The proof is direct upon substitution of p = (n^{−1})_{i=1,2,…,n} into Equation (A5):

$$F_q\left(\mathbf{D}, \mathbf{p}\right)=\left(\frac{n^{-2q} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{\mathcal{X}}(i,j)}{n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{\mathcal{X}}(i,j)}\right)^{\frac{1}{2(1-q)}}=n. \tag{A6}$$
 □
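This insensitivity is easy to confirm numerically. In the sketch below (our own illustration, taking Q_q as Σ_ij d_ij (p_i p_j)^q, consistent with Equation (A5)), F_q equals n for a uniform p regardless of the distance matrix:

```python
import numpy as np

def functional_hill(D, p, q):
    """Functional Hill numbers F_q (Equation (A5)), for q != 1."""
    P = np.outer(p, p)
    Q_q = np.sum(D * P ** q)   # generalized quadratic entropy of order q
    Q_1 = np.sum(D * P)        # Rao's quadratic entropy
    return (Q_q / Q_1) ** (1.0 / (2.0 * (1.0 - q)))

rng = np.random.default_rng(0)
n = 4
p_uniform = np.full(n, 1.0 / n)
for _ in range(3):
    D = rng.random((n, n))
    D = (D + D.T) / 2.0        # symmetrize
    np.fill_diagonal(D, 0.0)   # zero self-distance
    print(functional_hill(D, p_uniform, q=2.0))  # always prints 4.0
```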
Proposition A3 (Rényi Heterogeneity of a Continuous System).
The Rényi heterogeneity of a system X with event space 𝒳 ⊆ ℝⁿ and pdf f ∈ P(𝒳) is equal to the volume of an n-cube over which a uniform probability density has the same Rényi heterogeneity as that given by f.
Proof. 
Let the basic integral of X be defined as ∫_𝒳 f^q(x) dx. Furthermore, let X′ be an idealized reference system with a uniform probability density f′ on 𝒳′, with lower bounds 0 = (0)_{i=1,…,n} and upper bounds u = (u)_{i=1,…,n}, where u ≥ 0 is the side length of an n-cube. We require that X′ has the same basic integral as X:

$$\int_{\mathcal{X}'} f'^{\,q}(\mathbf{x})\, \mathrm{d}\mathbf{x}=\int_{\mathcal{X}} f^{q}(\mathbf{x})\, \mathrm{d}\mathbf{x}=\prod_{i=1}^{n} u^{1-q}=u^{n(1-q)}. \tag{A7}$$
Solving Equation (A7) for uⁿ gives the Rényi heterogeneity of order q. At q ≠ 1,

$$u^{n}=\left(\int_{\mathcal{X}} f^{q}(\mathbf{x})\, \mathrm{d}\mathbf{x}\right)^{\frac{1}{1-q}}, \tag{A8}$$

and in the limit of q → 1, Equation (A8) becomes the exponential of the Shannon (differential) entropy. Thus, Π_q is interpreted as the volume of an n-cube of side length u, over which a uniform distribution gives the same heterogeneity as X.  □
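To make the interpretation concrete, the following sketch (ours, assuming SciPy is available) evaluates Equation (A8) by numerical quadrature for a one-dimensional density; a uniform density on [0, 3] yields a heterogeneity of 3, i.e., the size of its support:

```python
from scipy.integrate import quad

def renyi_heterogeneity_1d(pdf, q, lo, hi):
    """Continuous Rényi heterogeneity via the basic integral (Equation (A8)), q != 1."""
    basic_integral, _ = quad(lambda x: pdf(x) ** q, lo, hi)
    return basic_integral ** (1.0 / (1.0 - q))

print(renyi_heterogeneity_1d(lambda x: 1.0 / 3.0, q=2.0, lo=0.0, hi=3.0))  # -> 3.0
```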
Proposition A4 (Rényi heterogeneity of a multivariate Gaussian).
The Rényi heterogeneity of an n-dimensional multivariate Gaussian with probability density function (pdf)

$$f\left(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)=(2\pi)^{-\frac{n}{2}}\left|\boldsymbol{\Sigma}\right|^{-\frac{1}{2}} e^{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)}, \tag{A9}$$

with mean μ = (μ_i)_{i=1,…,n} and covariance matrix Σ = (Σ_ij)_{i,j=1,…,n}, is

$$\Pi_q\left(\boldsymbol{\Sigma}\right)=\begin{cases}\text{Undefined} & q=0 \\ (2\pi e)^{\frac{n}{2}} \sqrt{\left|\boldsymbol{\Sigma}\right|} & q=1 \\ (2\pi)^{\frac{n}{2}} \sqrt{\left|\boldsymbol{\Sigma}\right|} & q=\infty \\ (2\pi)^{\frac{n}{2}}\, q^{\frac{n}{2(q-1)}} \sqrt{\left|\boldsymbol{\Sigma}\right|} & \text{otherwise.}\end{cases} \tag{A10}$$
Proof. 
Let Σ^{−1} = UΛU^{−1} be the eigendecomposition of the inverse covariance matrix into an orthonormal matrix of eigenvectors U and an n × n diagonal matrix Λ with eigenvalues (λ_i)_{i=1,…,n} along the leading diagonal. Furthermore, note that ∂x_i/∂y_j = U_ij, and use the substitution y = U^{−1}(x − μ) to proceed as follows:

$$\begin{aligned}
\Pi_q\left(\boldsymbol{\Sigma}\right) &= \left(\int (2\pi)^{-\frac{qn}{2}}\left|\boldsymbol{\Sigma}\right|^{-\frac{q}{2}} e^{-\frac{q}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)}\, \mathrm{d}\mathbf{x}\right)^{\frac{1}{1-q}} \\
&= \left((2\pi)^{-\frac{qn}{2}}\left|\boldsymbol{\Sigma}\right|^{-\frac{q}{2}} \int e^{-\frac{q}{2} \mathbf{y}^{\top} \boldsymbol{\Lambda} \mathbf{y}}\, \mathrm{d}\mathbf{y}\right)^{\frac{1}{1-q}} \\
&= \left((2\pi)^{-\frac{qn}{2}}\left|\boldsymbol{\Sigma}\right|^{-\frac{q}{2}} \sqrt{\frac{(2\pi)^{n}}{q^{n}}} \prod_{i=1}^{n} \lambda_i^{-\frac{1}{2}}\right)^{\frac{1}{1-q}} \\
&= \left((2\pi)^{-\frac{qn}{2}}\left|\boldsymbol{\Sigma}\right|^{-\frac{q}{2}} \sqrt{\frac{(2\pi)^{n}}{q^{n}}}\left|\boldsymbol{\Lambda}\right|^{-\frac{1}{2}}\right)^{\frac{1}{1-q}} \\
&= q^{\frac{n}{2(q-1)}} (2\pi)^{\frac{n}{2}} \sqrt{\left|\boldsymbol{\Sigma}\right|},
\end{aligned} \tag{A11}$$

which holds only for q ∉ {0, 1, ∞}. At q = 1, we have

$$\lim_{q \to 1} \log \Pi_q\left(\boldsymbol{\Sigma}\right)=\lim_{q \to 1}\left[\frac{n}{2(q-1)} \log q\right]+\frac{n}{2} \log (2\pi)+\frac{1}{2} \log \left|\boldsymbol{\Sigma}\right|=\frac{n}{2}+\frac{n}{2} \log (2\pi)+\frac{1}{2} \log \left|\boldsymbol{\Sigma}\right|, \tag{A12}$$

and therefore,

$$\Pi_1\left(\boldsymbol{\Sigma}\right)=(2\pi e)^{\frac{n}{2}} \sqrt{\left|\boldsymbol{\Sigma}\right|}. \tag{A13}$$

One can then easily show that Π₀(Σ) is undefined, and that as q → ∞,

$$\Pi_{\infty}\left(\boldsymbol{\Sigma}\right)=(2\pi)^{\frac{n}{2}} \sqrt{\left|\boldsymbol{\Sigma}\right|}. \tag{A14}$$
 □
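Equation (A10) is straightforward to implement; the following sketch (an illustration of ours, not the published supplementary code) returns Π_q(Σ) for each case:

```python
import numpy as np

def gaussian_renyi_heterogeneity(Sigma, q):
    """Rényi heterogeneity of a multivariate Gaussian (Equation (A10))."""
    Sigma = np.atleast_2d(Sigma)
    n = Sigma.shape[0]
    root_det = np.sqrt(np.linalg.det(Sigma))
    if q == 0:
        raise ValueError("Pi_0 is undefined for a Gaussian")
    if np.isclose(q, 1.0):
        return (2.0 * np.pi * np.e) ** (n / 2.0) * root_det
    if np.isinf(q):
        return (2.0 * np.pi) ** (n / 2.0) * root_det
    return (2.0 * np.pi) ** (n / 2.0) * q ** (n / (2.0 * (q - 1.0))) * root_det

# Heterogeneity shrinks as the Gaussian concentrates:
print(gaussian_renyi_heterogeneity(np.eye(2), q=1.0))         # (2*pi*e)^1 ~ 17.08
print(gaussian_renyi_heterogeneity(0.01 * np.eye(2), q=1.0))  # ~ 0.17
```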

Appendix B. Expected Distance Between Two Beta-Distributed Random Variables

To compute the numbers equivalent RQE Q̂_e, the functional Hill numbers F_q, and the Leinster–Cobbold index L_q under the beta mixture model, we must derive an analytical expression for the distance matrix. This involves the following integral:

$$d(x, y)=\int_{0}^{1} \int_{0}^{1}|x-y|\, f(x)\, g(y)\, \mathrm{d}x\, \mathrm{d}y, \tag{A15}$$

where f(x) = Beta_{α₁,β₁}(x) and g(y) = Beta_{α₂,β₂}(y). By exploiting the identity

$$|x-y|=x+y-2 \min \{x, y\} \tag{A16}$$

and expanding, the integral simplifies greatly and admits the following closed-form solution:

$$d(x, y)=\langle x\rangle-\langle y\rangle+\eta\left(\Phi_a-\alpha_1 \Phi_b\right), \tag{A17}$$

where

$$\eta=\frac{2\, \Gamma\left(\alpha_1\right) \Gamma\left(\beta_2\right) \Gamma\left(\alpha_1+\alpha_2+1\right)}{B\left(\alpha_1, \beta_1\right) B\left(\alpha_2, \beta_2\right)}, \tag{A18}$$

and where ⟨x⟩ = α₁/(α₁+β₁) and ⟨y⟩ = α₂/(α₂+β₂) are the component means, and the Φ’s are regularized hypergeometric functions:

$$\Phi_a={}_3\tilde{F}_2\left(\alpha_1,\; \alpha_1+\alpha_2+1,\; 1-\beta_1;\; \alpha_1+1,\; \alpha_1+\alpha_2+\beta_2+1;\; 1\right) \tag{A19}$$

$$\Phi_b={}_3\tilde{F}_2\left(\alpha_1+1,\; \alpha_1+\alpha_2+1,\; 1-\beta_1;\; \alpha_1+2,\; \alpha_1+\alpha_2+\beta_2+1;\; 1\right) \tag{A20}$$
Figure A1. Numerical verification of the analytical expression for the expected absolute distance between two Beta-distributed random variables. Solid lines are the theoretical predictions. Ribbons show the bounds between 25th–75th percentiles (the interquartile range, IQR) of the simulated values.
Figure A1 provides numerical verification of this result. One simply uses Equation (A17) to compute the analytic distance matrix

$$\mathbf{D}\left(\alpha_1, \beta_1, \alpha_2, \beta_2\right)=\begin{pmatrix} d(x, x) & d(x, y) \\ d(y, x) & d(y, y) \end{pmatrix}, \tag{A21}$$

which, together with the component probabilities (Equation (46)), can be used to compute Q̂_e, F_q, and L_q using the formulas shown in the main body.
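A direct implementation of Equation (A17) can be checked against simulation, in the spirit of Figure A1. The sketch below is ours; it assumes the mpmath library, whose hyp3f2 returns the unregularized series, so we divide by the gammas of the lower parameters to regularize:

```python
import numpy as np
from mpmath import hyp3f2, gamma, beta

def expected_abs_distance(a1, b1, a2, b2):
    """Analytic E|X - Y| for X ~ Beta(a1, b1), Y ~ Beta(a2, b2) (Equation (A17))."""
    def reg3f2(u1, u2, u3, l1, l2):
        # Regularized 3F2 at unit argument
        return hyp3f2(u1, u2, u3, l1, l2, 1) / (gamma(l1) * gamma(l2))

    phi_a = reg3f2(a1, a1 + a2 + 1, 1 - b1, a1 + 1, a1 + a2 + b2 + 1)
    phi_b = reg3f2(a1 + 1, a1 + a2 + 1, 1 - b1, a1 + 2, a1 + a2 + b2 + 1)
    eta = 2 * gamma(a1) * gamma(b2) * gamma(a1 + a2 + 1) \
        / (beta(a1, b1) * beta(a2, b2))
    mean_x = a1 / (a1 + b1)
    mean_y = a2 / (a2 + b2)
    return float(mean_x - mean_y + eta * (phi_a - a1 * phi_b))

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=200_000)
y = rng.beta(4.0, 3.0, size=200_000)
print(expected_abs_distance(2.0, 5.0, 4.0, 3.0))  # analytic value
print(np.abs(x - y).mean())                       # Monte Carlo estimate; should agree
```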

Appendix C. Evidence Supporting Relative Homogeneity of MNIST “Ones”

In our evaluation of non-categorical RRH using the MNIST data, we asserted that the class of handwritten Ones was relatively more homogeneous than the other digit classes. This initial statement was based on visual inspection of samples from the dataset, wherein the Ones ostensibly demonstrate fewer relevant feature variations than other classes. To test this hypothesis more objectively, we conducted an empirical evaluation using similarity metric learning.
We implemented a deep neural network architecture known as a “siamese network” [47] to learn a latent distance metric on the MNIST classes. Our siamese network architecture is depicted in Figure A2a. Training is conducted by sampling batches of 10,000 image pairs from the MNIST test set, where 5000 pairs are drawn from the same class (e.g., a pair of Fives or a pair of Threes) and 5000 pairs are drawn from different classes (e.g., the pairs [2,3] or [1,7]). The siamese network is then optimized with gradient-based methods over 100 epochs, using the contrastive loss function [48] (Figure A2a). This analysis may be reproduced using the code in the Supplementary Materials.
After training, we sampled same-class pairs (n = 25,000) and different-class pairs (n = 25,000) from the MNIST training set (which contains 60,000 images). Pairwise distances for each sample were computed using the trained siamese network. If the Ones are indeed the most homogeneous class, they should demonstrate generally smaller pairwise distances than the other digit classes. We evaluated this hypothesis by comparing empirical cumulative distribution functions (CDFs) of the class-pair distances (Figure A2b). Our results show that the empirical CDF for “1–1” image pairs dominates those of all other class pairs (i.e., the distance between pairs of Ones is generally lower).
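For readers interested in the form of this model, the sketch below gives a minimal PyTorch version of the twin-encoder-plus-contrastive-loss setup of Figure A2a; the layer sizes and margin are illustrative assumptions rather than the exact configuration used in the supplementary notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """Twin CNN encoder: both images of a pair pass through the same weights."""
    def __init__(self, embed_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, embed_dim),
        )

    def forward(self, x_a, x_b):
        z_a, z_b = self.encoder(x_a), self.encoder(x_b)
        return torch.norm(z_a - z_b, p=2, dim=1)  # D_AB: L2-norm between embeddings

def contrastive_loss(d_ab, same_class, margin=1.0):
    """Contrastive loss of Hadsell et al. [48]: pull same-class pairs together;
    push different-class pairs apart, up to the margin.
    `same_class` is a float tensor of 1s (same class) and 0s (different)."""
    pos = same_class * d_ab.pow(2)
    neg = (1.0 - same_class) * F.relu(margin - d_ab).pow(2)
    return 0.5 * (pos + neg).mean()
```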
Figure A2. Depiction of the siamese network architecture and the empirical cumulative distribution functions for pairwise distances between digit classes. (a) Depiction of the siamese network architecture. At iteration k, each of two samples, X_A(k) and X_B(k), is passed through a convolutional neural network to yield embeddings z_A and z_B, respectively. The class labels for samples A and B are denoted y_A and y_B, respectively. The L2-norm between these embeddings is computed as D_AB. The network is optimized on the contrastive loss [48] L. Here 𝕀[·] is an indicator function. (b) Empirical cumulative distribution functions (CDFs) for pairwise distances between images of the listed classes under the siamese network model. The x-axis plots the L2-norm between embedding vectors produced by the siamese network. The y-axis shows the proportion of samples in the respective group (by line color) whose embedded L2-norms were less than the specified threshold on the x-axis. Class groups are denoted by different line colors. For instance, “0–0” refers to pairs where each image is a “zero.” We combine all disjoint class pairs, for example “0–8” or “3–4,” into a single empirical CDF denoted “A≠B”.

References

1. Jost, L. Entropy and diversity. Oikos 2006, 113, 363–375.
2. Prehn-Kristensen, A.; Zimmermann, A.; Tittmann, L.; Lieb, W.; Schreiber, S.; Baving, L.; Fischer, A. Reduced microbiome alpha diversity in young patients with ADHD. PLoS ONE 2018, 13, e0200728.
3. Cowell, F. Measuring Inequality, 2nd ed.; Oxford University Press: Oxford, UK, 2011.
4. Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring inconsistency in meta-analyses. BMJ Br. Med. J. 2003, 327, 557–560.
5. Hooper, D.U.; Chapin, F.S.; Ewel, J.J.; Hector, A.; Inchausti, P.; Lavorel, S.; Lawton, J.H.; Lodge, D.M.; Loreau, M.; Naeem, S.; et al. Effects of biodiversity on ecosystem functioning: A consensus of current knowledge. Ecol. Monogr. 2005, 75, 3–35.
6. Botta-Dukát, Z. The generalized replication principle and the partitioning of functional diversity into independent alpha and beta components. Ecography 2018, 41, 40–50.
7. Mouchet, M.A.; Villéger, S.; Mason, N.W.; Mouillot, D. Functional diversity measures: An overview of their redundancy and their ability to discriminate community assembly rules. Funct. Ecol. 2010, 24, 867–876.
8. Chiu, C.H.; Chao, A. Distance-based functional diversity measures and their decomposition: A framework based on Hill numbers. PLoS ONE 2014, 9, e113561.
9. Petchey, O.L.; Gaston, K.J. Functional diversity (FD), species richness and community composition. Ecol. Lett. 2002.
10. Leinster, T.; Cobbold, C.A. Measuring diversity: The importance of species similarity. Ecology 2012, 93, 477–489.
11. Chao, A.; Chiu, C.H.; Jost, L. Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers. Annu. Rev. Ecol. Evol. Syst. 2014, 45, 297–324.
12. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Publishing: Washington, DC, USA, 2013.
13. Regier, D.A.; Narrow, W.E.; Clarke, D.E.; Kraemer, H.C.; Kuramoto, S.J.; Kuhl, E.A.; Kupfer, D.J. DSM-5 field trials in the United States and Canada, part II: Test-retest reliability of selected categorical diagnoses. Am. J. Psychiatr. 2013, 170, 59–70.
14. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
15. Arvanitidis, G.; Hansen, L.K.; Hauberg, S. Latent Space Oddity: On the Curvature of Deep Generative Models. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–15.
16. Shao, H.; Kumar, A.; Thomas Fletcher, P. The Riemannian geometry of deep generative models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018.
17. Nickel, M.; Kiela, D. Poincaré embeddings for learning hierarchical representations. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6339–6348.
18. Rényi, A. On measures of information and entropy. Proc. Fourth Berkeley Symp. Math. Stat. Probab. 1961, 114, 547–561.
19. Hill, M.O. Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology 1973, 54, 427–432.
20. Hannah, L.; Kay, J.A. Concentration in Modern Industry: Theory, Measurement and The U.K. Experience; The MacMillan Press, Ltd.: London, UK, 1977.
21. Ricotta, C.; Szeidl, L. Diversity partitioning of Rao’s quadratic entropy. Theor. Popul. Biol. 2009, 76, 299–302.
22. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
23. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the ICLR 2014; arXiv:1312.6114v10.
24. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
25. Eliazar, I.I.; Sokolov, I.M. Measuring statistical evenness: A panoramic overview. Phys. A Stat. Mech. Its Appl. 2012, 391, 1323–1353.
26. Patil, G.P.; Taillie, C. Diversity as a Concept and its Measurement. J. Am. Stat. Assoc. 1982, 77, 548–561.
27. Adelman, M.A. Comment on the “H” Concentration Measure as a Numbers-Equivalent. Rev. Econ. Stat. 1969, 51, 99–101.
28. Jost, L. Partitioning Diversity into Independent Alpha and Beta Components. Ecology 2007, 88, 2427–2439.
29. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
30. Eliazar, I. How random is a random vector? Ann. Phys. 2015, 363, 164–184.
31. Gotelli, N.J.; Chao, A. Measuring and Estimating Species Richness, Species Diversity, and Biotic Similarity from Sampling Data. In Encyclopedia of Biodiversity, 2nd ed.; Levin, S.A., Ed.; Academic Press: Waltham, MA, USA, 2013; pp. 195–211.
32. Berger, W.H.; Parker, F.L. Diversity of planktonic foraminifera in deep-sea sediments. Science 1970, 168, 1345–1347.
33. Daly, A.; Baetens, J.; De Baets, B. Ecological Diversity: Measuring the Unmeasurable. Mathematics 2018, 6, 119.
34. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
35. Simpson, E.H. Measurement of Diversity. Nature 1949, 163, 688.
36. Gini, C. Variabilità e Mutabilità. Contributo allo Studio delle Distribuzioni e delle Relazioni Statistiche; C. Cuppini: Bologna, Italy, 1912.
37. Shorrocks, A.F. The Class of Additively Decomposable Inequality Measures. Econometrica 1980, 48, 613–625.
38. Jost, L. Mismeasuring biological diversity: Response to Hoffmann and Hoffmann (2008). Ecol. Econ. 2009, 68, 925–928.
39. Pigou, A.C. Wealth and Welfare; MacMillan and Co., Ltd.: London, UK, 1912.
40. Dalton, H. The Measurement of the Inequality of Incomes. Econ. J. 1920, 30, 348.
41. MacArthur, R.H. Patterns of species diversity. Biol. Rev. 1965, 40, 510–533.
42. Lande, R. Statistics and partitioning of species diversity and similarity among multiple communities. Oikos 1996, 76, 5–13.
43. Rao, C.R. Diversity and dissimilarity coefficients: A unified approach. Theor. Popul. Biol. 1982, 21, 24–43.
44. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the NIPS 2013, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 1–9.
45. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
46. Nunes, A.; Alda, M.; Trappenberg, T. On the Multiplicative Decomposition of Heterogeneity in Continuous Assemblages. arXiv 2020, arXiv:2002.09734.
47. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Proceedings of the Advances in Neural Information Processing Systems 6, Denver, CO, USA, 29 November–2 December 1993; pp. 737–744.
48. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the CVPR 2006, New York, NY, USA, 17–22 June 2006; pp. 1735–1742.
Figure 1. Illustration of the simple three-state system under which we compare existing non-categorical heterogeneity indices. Panel A depicts a three-state system X as an undirected graph, with node sizes corresponding to state probabilities governed by Equation (24). As κ ≥ 0 diverges further from κ = 1, the probability distribution over states becomes more unequal. Panel B visually represents the parametric pairwise distance matrix D(h, b) of Equation (25) (h is height, b is base length, and D_ij is the distance between states i and j). In the examples shown in Panels B and C, we set b = 1. Specifically, we provide visual illustration of settings for which the distance function on 𝒳 is a metric (Definition 1; when h < b√3/2) or ultrametric (Definition 2; when h ≥ b√3/2). Panel C compares the numbers equivalent quadratic entropy (solid lines marked Q̂_e; Section 2.2.1), functional Hill numbers (at q = 1, dashed lines marked F₁; Section 2.2.2), and the Leinster–Cobbold index (at q = 1, dotted lines marked L₁; Section 2.2.3) for reporting the heterogeneity of X. The y-axis reports the value of the respective indices. The x-axis plots the height parameter for the distance matrix D(h, 1) (Equation (25) and Panel B). The range of h at which D(h, 1) is only a metric is depicted by the gray shaded background. The range of h shown with a white background is that for which D(h, 1) is ultrametric. For each index, we plot values for a probability distribution over states that is perfectly even (κ = 1; dotted markers) or skewed (κ = 10; vertical line markers). Panel D shows the sensitivity of the Leinster–Cobbold index (L₁; y-axis) to the scaling parameter u ≥ 0 (x-axis) used to transform a distance matrix into a similarity matrix (S_ij = e^{−u D_ij}). This is shown for three levels of skewness of the probability distribution over states (no skewness at κ = 1, dotted markers; significant skewness at κ = 10, vertical line markers; extreme skewness at κ = 100, square markers).
Figure 2. Graphical illustration of the two main approaches for computing representational Rényi heterogeneity. In both cases, we map sampled points on an observable space X onto a latent space Z , upon which we apply the Rényi heterogeneity measure. The mapping is illustrated by the curved arrows, and should yield a posterior distribution over the latent space. Panel A shows the case in which the latent space is categorical (for example, discrete components of a mixture distribution on a continuous space). Panel B illustrates the case in which the latent space has non-categorical topology. A special case of the latter mapping may include probabilistic principal components analysis. When the latent space is continuous, we must derive a parametric form for the Rényi heterogeneity.
Figure 3. Illustration of approaches to computing the pooled distribution on a simple representational space 𝒵 = ℝ. In this example, two points on the observable space, (x₁, x₂) ∈ 𝒳, are mapped onto the latent space via the model f(·|x_i) for i ∈ {1, 2}, which indexes univariate Gaussians over 𝒵 (depicted as hatched patterns for x₁ and x₂, respectively). A pooled distribution computed non-parametrically by model averaging (Equation (35)) is depicted as the solid black line. The parametrically pooled distribution (see Example 2) is depicted as the dashed black line. The parametric approach implies the assumption that further samples from 𝒳 would yield latent-space projections in some regions assigned low probability by f(z|x₁) and f(z|x₂).
Figure 4. Demonstration of the data-generating distribution (top row; Equations (45)–(47)) and the relationship between the representational model’s decision threshold (Equations (48) and (50)) and categorical representational Rényi heterogeneity (bottom row). The optimal decision boundary (Equation (50)) is shown as a gray vertical dashed line in all plots. Each column depicts a specific parameterization of the data-generating system (parameters are stated above the top row). Top row: probability density functions of the data-generating distributions. Shaded regions correspond to the two mixture components. Solid black lines denote the marginal distribution (Equation (47)). The x-axis represents the observable domain, which is the (0,1) interval. Bottom row: categorical representational Rényi heterogeneity (RRH) for q ∈ {1, 2, ∞} across different category-assignment thresholds for the beta-mixture models shown in the top row. Varying levels of the decision boundary are plotted on the x-axis. The y-axis shows the resulting between-observation RRH. Black dots highlight the RRH computed at the optimal decision boundary.
Figure 5. Comparison of categorical representational Rényi heterogeneity (Π_q), the functional Hill numbers (F_q), the numbers equivalent quadratic entropy (Q̂_e), and the Leinster–Cobbold index (L_q) within the beta mixture model. Each row of plots corresponds to a given separation between the beta mixture components. Column 1 illustrates the beta mixture distributions upon which the indices were compared. The x-axis plots the domain of the distribution (the open interval between 0 and 1). The y-axis shows the corresponding probability density. The different line styles in Column 1 provide visual examples of the effect of changing the θ₁ parameter over the range [0.5, 1). Column 2 compares Π_q (solid line), F_q (dashed line), and L_q (dotted line), each at elasticity q = 1. The x-axis shows the value of the parameter 0.5 ≤ θ₁ < 1 at which the indices were compared. Index values are plotted along the y-axis. Column 3 compares the indices shown in Column 2, as well as Q̂_e (dot-dashed line).
Figure 6. Sample images from the MNIST dataset [22].
Figure 7. Panel A: illustration of the convolutional variational autoencoder (cVAE) [23]. The computational graph is depicted from top to bottom. An n_x-dimensional input image X_i (white rectangle) is passed through an encoder (in our experiment, a convolutional neural network, CNN), which parameterizes an n_z-dimensional multivariate Gaussian over the coordinates z_i of the image’s embedding on the latent space 𝒵 = ℝ². The latent embedding can then be passed through a decoder (blue rectangle), a neural network employing transposed convolutions (here denoted CNNᵀ), to yield a reconstruction X̂_i of the original input. The objective function for this network is a variational lower bound on the model evidence of the input data (see Kingma and Welling [23] for details). Panel B: depiction of the latent space learned by the cVAE. This was a pre-trained model from the Smart Geometry Processing Group at University College London (https://colab.research.google.com/github/smartgeometry-ucl/dl4g/blob/master/variational_autoencoder.ipynb).
Figure 8. Heterogeneity for the subset of MNIST training data belonging to each digit class respectively projected onto the latent space of the convolutional variational autoencoder (cVAE). The leftmost plot shows the pooled heterogeneity for each digit class (the effective total area of latent space occupied by encoding each digit class). The middle plot shows the within-observation heterogeneity (the effective total area of latent space per encoded observation of each digit class, respectively). The rightmost plot shows the between-observation heterogeneity (the effective number of observations per digit class). Recall that Rényi heterogeneity on a continuous distribution gives the effective size of the domain of an equally heterogeneous uniform distribution on the same space, which explains why the within-observation heterogeneity values here are less than 1.
Figure 9. Visual illustration of MNIST image samples corresponding to different levels of representational Rényi heterogeneity under the convolutional variational autoencoder (cVAE). Panel (a) illustrates the approach to this analysis. Here, the surface Z shows hypothetical contours of a probability distribution over the 2-dimensional latent feature space. The surface X represents the observable space, upon which we have projected an “image” of the latent space Z for illustrative purposes. We first compute the expected latent locations m(x_i) for each image x_i ∈ 𝒳. (A1) We then define the latent neighbourhood of image x_i as the 49 images whose latent locations are closest to m(x_i) in Euclidean distance. (A2) Each coordinate in the neighbourhood of m(x_i) is then projected onto a corresponding patch on the observable space of images. (A3) These images are then projected as a group back onto the latent space, where Equation (57) can be applied, given equal weights over images, to compute the effective number of observations in the neighbourhood of x_i. Panel (b) plots the most and least heterogeneous neighbourhoods so that we may compare the estimated effective number of observations with the visually appreciable sample diversity.
Table 1. Relationships between Rényi heterogeneity and various diversity or inequality indices for a system X with event space X = { 1 , 2 , , n } and probability distribution p = p i i = 1 , 2 , , n . The function 𝟙 [ · ] is an indicator function that evaluates to 1 if its argument is true or to 0 otherwise.
Index | Expression
Observed richness [31] | Π₀(p) = Σ_{i=1}^{n} 𝟙[p_i > 0]
Perplexity [30] | Π₁(p) = exp(−Σ_{i=1}^{n} p_i log p_i)
Inverse Simpson concentration [1] | Π₂(p) = (Σ_{i=1}^{n} p_i²)⁻¹
Berger–Parker diversity index [32,33] | Π_∞(p) = (max_i p_i)⁻¹
Rényi entropy [18] | R_q(p) = log Π_q(p)
Shannon entropy [29] | H(p) = log Π₁(p)
Tsallis entropy [34] | T_q(p) = (q − 1)⁻¹ (1 − Π_q(p)^{1−q})
Simpson concentration [35] | Simpson(p) = Π₂(p)⁻¹
Gini–Simpson index [36] | GSI(p) = 1 − Simpson(p)
Generalized entropy index [3,37] | GEI(p) = [q(q − 1)]⁻¹ [(Π_q(p)/n)^{1−q} − 1]
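Each entry in Table 1 is a simple transform of Π_q; as a brief sketch (ours, with assumed variable names), the conversions are one-liners:

```python
import numpy as np

def pi_q(p, q):
    """Rényi heterogeneity of a probability vector (numbers equivalent)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p * np.log(p)))  # perplexity
    return np.sum(p ** q) ** (1.0 / (1.0 - q))

p, q, n = np.array([0.6, 0.3, 0.1]), 2.0, 3
renyi_entropy   = np.log(pi_q(p, q))
shannon_entropy = np.log(pi_q(p, 1.0))
tsallis_entropy = (1.0 - pi_q(p, q) ** (1.0 - q)) / (q - 1.0)
simpson         = 1.0 / pi_q(p, 2.0)
gini_simpson    = 1.0 - simpson
gen_entropy_idx = ((pi_q(p, q) / n) ** (1.0 - q) - 1.0) / (q * (q - 1.0))
```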
Table 2. Definitions in formulation of classical biodiversity and economic equality analysis as categorical representational Rényi heterogeneity. Superscripted indexing on x = x i i = 1 , , n x denotes that this is a row vector.
Symbol | Biodiversity | Economic Equality
X | Ecosystem, whose observation yields an organism denoted by the row vector x = (x^i)_{i=1,…,n_x} ∈ 𝒳 | A system of resources, whose observation yields an asset denoted by the row vector x = (x^i)_{i=1,…,n_x} ∈ 𝒳
𝒳 ⊆ ℝ^{n_x} | n_x-dimensional feature space of organisms in the ecosystem | n_x-dimensional feature space of assets in the economy, whose topology is such that the economic (monetary) value is equal at each coordinate x ∈ 𝒳
𝒵 = {z ∈ {0,1}^{n_z} : Σ_{i=1}^{n_z} z_i = 1} | n_z-dimensional space of one-hot species labels | n_z-dimensional space of one-hot labels over wealth-owning agents
f : 𝒳 → P(𝒵) | A model that performs the mapping x ↦ f(x) of organisms to discrete probability distributions over 𝒵 | A model that performs the mapping x ↦ f(x) of assets to discrete probability distributions over 𝒵
N_i ∈ ℕ₊ | The number of organisms observed belonging to species i ∈ {1, …, n_z} | The number of equal-valued assets belonging to agent i ∈ {1, …, n_z}
N = Σ_{i=1}^{n_z} N_i | The total number of organisms observed | The total quantity of assets observed
X = (x_i^j)_{i=1,…,N; j=1,…,n_x} | A sample of N organisms | A sample of N assets
w = (w_i)_{i=1,…,N} | Sample weights, such that w_i ≥ 0 and Σ_{i=1}^{N} w_i = 1 (both contexts)
