Diversity in biology: definitions, quantification and models

Diversity indices are useful single-number metrics for characterizing a complex distribution of a set of attributes across a population of interest. The utility of these different metrics or sets of metrics depends on the context and application, and whether a predictive mechanistic model exists. In this topical review, we first summarize the relevant mathematical principles underlying heterogeneity in a large population, before outlining the various definitions of ‘diversity’ and providing examples of scientific topics in which its quantification plays an important role. We then review how diversity has been a ubiquitous concept across multiple fields, including ecology, immunology, cellular barcoding experiments, and socioeconomic studies. Since many of these applications involve sampling of populations, we also review how diversity in small samples is related to the diversity in the entire population. Features that arise in each of these applications are highlighted.


Introduction
Diversity is a frequently used concept across a broad spectrum of scientific disciplines, ranging from biology [1][2][3][4][5] and ecology [6][7][8][9][10][11], to investment and portfolio theory [12][13][14][15][16], to linguistics [17,18] and sociology [19][20][21][22][23][24]. In each of these disciplines, diversity is a measure of the range and distribution of certain features within a given population. It is considered a key attribute that can be dynamically varying, influenced by intra-population interactions, and modified by environmental factors. The concept of diversity, variety, or heterogeneity can be applied to any population. The evolution of the population can also be highly correlated with its diversity. Some examples of biological population dynamics occurring at different scales are shown in figure 1. At first sight, diversity seems to be an intuitively simple concept, but since certain population attributes require a full distribution function to quantify, it can be rather complex and difficult to capture using a single metric [3,4,25,26]. Consider, for example, a community with a total of four species, one of which dominates the total population, and a second community that consists of two equally common species. Which of the two communities exhibits a higher diversity? The first one, because it harbors a larger number of species? Or the second one, because a sample is more likely to contain two species? This example shows that diversity is intrinsically linked to both the total number of extant species (richness) and how the population is distributed among the species (evenness), and thus cannot be captured by a single number [3]. As a result, there are numerous different diversity indices and associated concepts used in different applications [3,4][25][26][27][28][29].
Nonetheless, diversity measures are important for assessing the current condition of ecosystems, quantifying the influence of environmental factors on different species, and planning conservation efforts [2,5,9,10][29][30][31]. In addition, the concept of diversity is important for the quantitative description of wealth distributions and, more generally, to identify mechanisms leading to variations in societies [32][33][34][35][36]. In a broader sense, diversity indices may be helpful for the design of robust energy distribution systems [37] or even to assemble well-performing teams [23]. Thus we see that, despite the ambiguity in the definition of diversity, the concept is very relevant to many different disciplines and applications.
In this topical review, we start by summarizing the basic concepts from information theory which are necessary for a quantitative treatment of diversity. We continue with describing aspects of populations and diversity that are common to many applications in biology. In the next section, we present the common mathematical descriptions of diversity in terms of both number and species counts. Moreover, in most applications, only a small sample of a population is available. Thus, we place particular emphasis on the effects of sampling on diversity measures in section 5. In section 6 and subsections within, we survey a number of biological systems in which concepts of diversity play a key role in understanding the dynamics of the population. These include ecological populations, stem cell barcoding experiments, immunology, cancer, and societal wealth distributions. Each of these systems carries its own unique attributes and thus requires specific diversity measures. Finally, in section 7 we summarize the advantages and disadvantages of some common diversity measures and conclude with a discussion of possible future applications of concepts of diversity.

Entropy, relative entropy, KL divergence, KS statistic, mutual information and all that
We first provide a summary of the fundamental mathematical structures that arise in the analysis of populations in which one naturally seeks to quantitatively compare distributions or frequencies of subpopulations. These mathematical notions invariably involve ideas from information theory such as entropy and mutual information which have a rich history and deep connections to thermodynamics, coding theory, cryptography, inference, and communication [38]. To review the necessary information-theoretic concepts, we consider a discrete random variable $X$ which takes on values from the set $\{x_1, x_2, \ldots, x_N\}$ with probability $P_k = \Pr(X = x_k)$ such that
$$\sum_{k=1}^{N} P_k = 1, \qquad (1)$$
where the sum is taken over all possible values $x_k$. This probability mass function may represent the relative frequency that an attribute $X$ takes on the value $x_k$ within a large population. In the case of species diversity, we may interpret $P_k$ as the relative frequency of species $k$ or the fraction of species with trait $X = x_k$ (see clone counts in section 4). The entropy, or 'Shannon entropy', is defined by
$$H(X) = -\sum_{k=1}^{N} P_k \log P_k \qquad (2)$$
and can be thought of as the expected uncertainty or surprise $-\mathbb{E}[\log P(X)]$.
The continuous limit of Shannon entropy, or differential Shannon entropy, has also been defined, but care must be taken if $X$ carries physical dimensions. If the probability of $X$ taking on values in the interval $[x, x + dx]$ is denoted by $P(x)\,dx$, the differential Shannon entropy is
$$H(X) = -\int P(x) \log P(x)\, dx. \qquad (3)$$
These expressions are synonymous with the 'Shannon index' of species diversity with some freedom in the choice of the base of the logarithm. Without any constraints on the distributions other than being compactly supported, the form of $P_k$ or $P(x)$ that maximizes $H(X)$ is a uniform distribution. With additional constraints there are classes of distributions that maximize the Shannon index. For example, for a fixed mean and variance on an unbounded domain, the Shannon-index-maximizing (entropy-maximizing) distribution is Gaussian. Within Gaussian distributions, the Shannon index increases logarithmically with the variance. In fact, within a specific class of distributions, the Shannon index is larger for flatter distributions [39,40]. As such, the Shannon index has been used as a measure of diversity [41]. One issue with the differential entropy of equation (3) is that $P(x)$ carries dimensions of $X^{-1}$, because the cumulative distribution function $P(X \le x) = \int_{-\infty}^{x} P(x')\, dx'$ has to be dimensionless. Therefore, the argument of the logarithm in equation (3) is not dimensionless as required. To avoid such an issue, one can define a point-density function $P_0(x)$ according to [39]
$$\lim_{N\to\infty} \frac{1}{N}\,\bigl(\text{number of points } x_k \text{ in } a < x < b\bigr) = \int_a^b P_0(x)\, dx. \qquad (4)$$
Given that the limit is well-behaved, we can express the difference between two adjacent points $x_{k+1}$ and $x_k$ in terms of
$$\lim_{N\to\infty} N\,(x_{k+1} - x_k) = \frac{1}{P_0(x_k)}. \qquad (5)$$
We now consider the continuum limit of the discrete Shannon entropy as defined in equation (2), and set
$$P_k = P(x_k)\,(x_{k+1} - x_k) \simeq \frac{P(x_k)}{N P_0(x_k)} \qquad (6)$$
to obtain
$$H_N(X) \equiv \lim_{N\to\infty} \bigl[H(X) - \log N\bigr] = -\int P(x) \log \frac{P(x)}{P_0(x)}\, dx. \qquad (7)$$
In this way, it is possible to derive a continuous Shannon entropy that is invariant under parameter changes and whose logarithm depends on the dimensionless quantity $P(x)/P_0(x)$. We subtracted $\log N$ in equation (7) to obtain a finite $H_N(X)$.
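As a minimal illustration of the discrete Shannon entropy and of the logarithmic growth of the Gaussian differential entropy with variance, the following Python sketch may help. The function names and the two example distributions are hypothetical choices made for illustration only; the closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$ for a Gaussian is the standard textbook result, and its value depends on the units chosen for $\sigma$, which is the dimensional issue discussed above.

```python
import math

def shannon_entropy(p, base=math.e):
    """Discrete Shannon entropy H = -sum_k P_k log P_k (terms with P_k = 0 contribute 0)."""
    if not math.isclose(sum(p), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(pk * math.log(pk) for pk in p if pk > 0) / math.log(base)

def gaussian_differential_entropy(sigma):
    """Differential entropy (in nats) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Hypothetical example distributions over four types.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

h_uniform = shannon_entropy(uniform)   # the uniform case attains the maximum, log(4)
h_skewed = shannon_entropy(skewed)     # strictly smaller than log(4)

# Doubling sigma adds log(2) to the Gaussian differential entropy.
gain = gaussian_differential_entropy(2.0) - gaussian_differential_entropy(1.0)
```

Changing `base` only rescales the result, which reflects the freedom in the choice of logarithm mentioned in the text.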
To characterize the diversity between two communities, we consider two discrete random variables $X$ and $Y$ with the corresponding joint probability mass function $P_{X,Y}(x_k, y_\ell) = \Pr(X = x_k, Y = y_\ell)$. Given the joint distribution $P_{X,Y}(x_k, y_\ell)$, we can compute the marginal distributions $P_X(x_k) = \sum_\ell P_{X,Y}(x_k, y_\ell)$ and $P_Y(y_\ell) = \sum_k P_{X,Y}(x_k, y_\ell)$ by summing over the complementary variable. These definitions enable us to define the joint entropy
$$H(X, Y) = -\sum_{k,\ell} P_{X,Y}(x_k, y_\ell) \log P_{X,Y}(x_k, y_\ell), \qquad (8)$$
which may be also written as $-\mathbb{E}[\log P_{X,Y}]$. Moreover, the conditional entropy
$$H(Y|X) = -\sum_{k,\ell} P_{X,Y}(x_k, y_\ell) \log \frac{P_{X,Y}(x_k, y_\ell)}{P_X(x_k)} = H(X, Y) - H(X) \qquad (9)$$
describes the expected uncertainty in the random variable $Y$ given $X$. It can be also expressed as $-\mathbb{E}[\log P_{Y|X}]$ where $P_{Y|X}$ is the conditional probability mass function. From symmetry, equation (9) also holds when all $X$ and $Y$ are interchanged. For independent random variables $X$ and $Y$, we find that $H(Y|X) = H(Y)$ and $H(X|Y) = H(X)$.
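The relation between joint, marginal, and conditional entropies can be checked numerically. The sketch below assumes the joint distribution is stored as a nested list `joint[k][l]`, a representation chosen here purely for illustration, and uses the chain rule $H(Y|X) = H(X,Y) - H(X)$; the two example tables are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_and_conditional_entropy(joint):
    """joint[k][l] = Pr(X = x_k, Y = y_l). Returns (H(X,Y), H(X), H(Y), H(Y|X))."""
    p_x = [sum(row) for row in joint]            # marginal of X: sum over l
    p_y = [sum(col) for col in zip(*joint)]      # marginal of Y: sum over k
    h_xy = entropy(p for row in joint for p in row)
    h_x = entropy(p_x)
    h_y = entropy(p_y)
    return h_xy, h_x, h_y, h_xy - h_x            # chain rule gives H(Y|X)

# Hypothetical joint tables: independent variables and perfectly correlated variables.
independent = [[0.1, 0.4], [0.1, 0.4]]
correlated = [[0.5, 0.0], [0.0, 0.5]]

ind = joint_and_conditional_entropy(independent)
cor = joint_and_conditional_entropy(correlated)
```

For the independent table the conditional entropy equals the marginal entropy of $Y$, while for the perfectly correlated table it vanishes.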
While the Shannon index is a measure of the absolute entropy of a distribution, the relative entropy or Kullback-Leibler (KL) divergence quantifies how much a probability mass function $P$ differs from a reference $Q$; for discrete distributions it reads $D_{\rm KL}(P\|Q) = \sum_k P_k \log(P_k/Q_k)$. In the case of continuous distributions $P(x)$ and $Q(x)$, we obtain
$$D_{\rm KL}(P\|Q) = \int P(x) \log \frac{P(x)}{Q(x)}\, dx. \qquad (10)$$
The KL divergence is the relative entropy of $P$ with respect to the reference distribution $Q$. Note that the limiting Shannon entropy $H_N(X)$ is simply the negative of the KL divergence between the distribution $P(x)$ and the associated invariant measure $P_0(x)$. Usually, $P$ is an experimental or observed distribution and $Q$ is a model that represents $P$. Furthermore, the KL divergence is nonnegative and equals zero if and only if $P = Q$ [38]. It is not symmetric, $D_{\rm KL}(P\|Q) \neq D_{\rm KL}(Q\|P)$, and is thus not a metric. In addition, a special case of the KL divergence is the 'mutual information'
$$I(X; Y) = D_{\rm KL}(P_{X,Y}\,\|\,P_X P_Y) = \sum_{k,\ell} P_{X,Y}(x_k, y_\ell) \log \frac{P_{X,Y}(x_k, y_\ell)}{P_X(x_k)\, P_Y(y_\ell)}. \qquad (11)$$
Note that $I(X; Y) = I(Y; X)$ is symmetric and quantifies how much knowing one variable reduces the uncertainty in the other. If $X$ and $Y$ are completely independent, $I(X; Y) = 0$. According to equation (11) and the definitions of joint and conditional entropy in equations (8) and (9), the mutual information can be written in terms of marginal, conditional, and joint entropies [38]:
$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y). \qquad (12)$$
A symmetric version of the KL divergence is provided by the Jensen-Shannon divergence [42]
$$D_{\rm JS}(P\|Q) = \tfrac{1}{2} D_{\rm KL}(P\|M) + \tfrac{1}{2} D_{\rm KL}(Q\|M), \qquad (13)$$
where $M = (P + Q)/2$ defines the mean distribution of $P$ and $Q$. These divergences can be extended to include multiple and higher-dimensional distributions. The square root of the Jensen-Shannon divergence is a distance metric between two distributions. Another useful distance metric is the Kolmogorov-Smirnov (KS) distance, which is defined as
$$D_{\rm KS} = \sup_x \left| F(x) - G(x) \right|, \qquad (14)$$
where $F(x)$ is a cumulative reference distribution and $G(x)$ is an empirical distribution function. The empirical distribution $G(x)$ is constructed from samples whose underlying cumulative distribution function may be $F(x)$ or another distribution to be tested against $F(x)$.
The KS metric is the maximum distance between the two cumulative distributions F(x) and G(x). We outline in section 6.6 that the KS metric is related to the Hoover index which is used to quantify diversity, or inequity, in wealth or income distributions relative to a uniform distribution.
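The divergences and distances of this section can be sketched for discrete distributions as follows. This is an illustrative implementation under simple assumptions: distributions are plain Python lists over a common set of outcomes, the discrete KL divergence is taken as $\sum_k P_k \log(P_k/Q_k)$, and the KS distance is evaluated on cumulative sums over an ordered discrete support. All function names and example values are hypothetical.

```python
import math
from itertools import accumulate

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_k P_k log(P_k / Q_k); infinite if Q_k = 0 where P_k > 0."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk > 0:
            if qk == 0:
                return math.inf
            d += pk * math.log(pk / qk)
    return d

def js_divergence(p, q):
    """Jensen-Shannon divergence with the mean distribution M = (P + Q) / 2."""
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def mutual_information(joint):
    """I(X;Y) as the KL divergence between the joint and the product of marginals."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (p_x[k] * p_y[l]))
               for k, row in enumerate(joint)
               for l, pxy in enumerate(row) if pxy > 0)

def ks_distance(p, q):
    """Maximum absolute difference between the two cumulative distributions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

# Hypothetical example distributions.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.70, 0.10, 0.10, 0.10]

kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)   # generally unequal
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)   # equal by construction
mi_corr = mutual_information([[0.5, 0.0], [0.0, 0.5]])    # perfectly correlated pair
ks_pq = ks_distance(p, q)
```

The asymmetry of the KL divergence and the symmetry of the Jensen-Shannon divergence are both visible by swapping the arguments.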

Commonly used measures of diversity
The notions of entropy and information are naturally related to the spread of a distribution $P(x)$, and can be subsumed into a general metric for quantifying diversity. Usually, a population is measured and can be thought of as one realization of an underlying distribution. Consider a realization $\mathbf{n} = \{n_1, n_2, \ldots, n_R\}$ describing the number $n_i$ of entities of a discrete and distinguishable group/species/type ($1 \le i \le R$). The total population is $N = \sum_{i=1}^{R} n_i$. This given realization constitutes a 'distribution' across all possible types. Thus, any realization is completely described by a set of $R$ numbers. Diversity measures are reduced representations of the distribution. An example would be a single parameter which captures the spread of the distribution of realizations $\{n_i\}$. This is no different from, for example, defining a Gaussian distribution by its mean and standard deviation. Realizations $\{n_i\}$, however, usually are not described by specific functions that can be defined by one or two parameters such as Gaussians. Nonetheless, many different diversity indices can be unified into a single formula called 'Hill numbers' of order $q$ [43][44][45]:
$${}^qD = \left( \sum_{i=1}^{R} f_i^{\,q} \right)^{1/(1-q)}, \qquad (15)$$
where $f_i \equiv n_i/N$ is the relative abundance of type $i$. This general formula represents different classes of 'diversity indices' for different values of $q$. It is also useful because one can consistently define an effective proportional abundance that corresponds to an average abundance with increasing weighting towards the larger-population species as $q$ increases [45,46]. Note the similarity of this definition to the standard mathematical $p$-norm, except that the outer exponent is $1/(1-q)$ rather than the $1/p$ of the $p$-norm. Another diversity measure is provided by the Renyi index [47]
$${}^qH = \log {}^qD = \frac{1}{1-q} \log \sum_{i=1}^{R} f_i^{\,q}, \qquad (16)$$
which is a generalization of the Shannon entropy defined in equation (2). The order $q$ describes the sensitivity of ${}^qD$ and ${}^qH$ to common and rare types [48].
Below, we provide an overview of the most commonly used indices which result from the generalized diversity ${}^qD$ for different values of $q$:

Richness
In the limit $q \to 0^+$, each term $f_i^{\,q}$ with $f_i > 0$ approaches unity and ${}^0D$ is simply the total number of types in the population, or the 'richness' $R$. The richness is often used in quantifying the diversity of T cells and species counts in ecology [3] and represents a metric that weights all subpopulations equally.

Shannon index
For $q = 1 - \varepsilon$ in the limit $\varepsilon \to 0^+$, the generalized diversity as defined by equation (15) becomes
$${}^1D = \exp\left( -\sum_{i=1}^{R} f_i \log f_i \right) \equiv e^{\rm Sh}, \qquad (20)$$
which is the exponential of the Shannon index ${\rm Sh} = -\sum_i f_i \log f_i$ that parallels the Shannon entropy defined in equations (2) and (9). This index is also sometimes called the Shannon-Wiener index ($H$) and can be defined using any logarithmic base. Usually measured values are ${\rm Sh} \sim O(1)$. Qualitatively, $e^{\rm Sh}$ can be thought of as a rule of thumb for the number of effective species in a population.

Evenness
Evenness is another class of diversity indices often invoked in ecological and sociological studies. One definition ('Shannon's equitability') is based on simply normalizing the Shannon diversity by the maximum Shannon diversity ${\rm Sh}_{\max} = \log R$ that arises if every species is equally likely [49]:
$$\frac{{\rm Sh}}{{\rm Sh}_{\max}} = \frac{{\rm Sh}}{\log R}. \qquad (21)$$

Simpson's index with replacement
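When $q = 2$, we find
$${}^2D = \frac{1}{\sum_{i=1}^{R} f_i^{\,2}}. \qquad (22)$$
Simpson's diversity index is defined as
$$S_r = \sum_{i=1}^{R} f_i^{\,2} = \frac{1}{{}^2D}, \qquad (23)$$
which carries the interpretation of the probability that, upon drawing two entities from a given population with replacement, the same type is selected twice.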

Simpson's index without replacement
A related index that cannot be directly constructed from ${}^qD$ is Simpson's index without replacement:
$$S = \sum_{i=1}^{R} \frac{n_i (n_i - 1)}{N (N - 1)}. \qquad (24)$$
Here, when an entity is drawn, it is not replaced before the second entity is drawn. The differences between $S_r$ and $S$ are significant only for systems with small numbers of entities $n_i$ for all types $i$.

Berger-Parker diversity index
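In the $q \to \infty$ limit, we find
$${}^\infty D = \frac{1}{\max_i f_i},$$
the inverse of the Berger-Parker index, which is defined as the maximum abundance in the set $\{f_i\}$, i.e. the relative abundance of the most common species. The Berger-Parker index is equivalent to the $\infty$-norm of $\mathbf{f} = \mathbf{n}/N$.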
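The indices above can be collected into one short Python sketch that evaluates the Hill number for any order $q$, including the $q \to 1$ and $q \to \infty$ limits. The function names and the two example communities are hypothetical; the communities mirror the thought experiment from the Introduction (four species with one dominant versus two equally common species), with abundances invented here for illustration.

```python
import math

def hill_number(counts, q):
    """Hill number of order q from abundances n_i (zero counts are ignored)."""
    present = [n for n in counts if n > 0]
    total = sum(present)
    f = [n / total for n in present]
    if math.isinf(q):                  # q -> infinity: inverse of the largest fraction
        return 1.0 / max(f)
    if math.isclose(q, 1.0):           # q -> 1: exponential of the Shannon index
        return math.exp(-sum(fi * math.log(fi) for fi in f))
    return sum(fi ** q for fi in f) ** (1.0 / (1.0 - q))

def simpson_without_replacement(counts):
    """Probability that two entities drawn without replacement share a type."""
    total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

community_a = [97, 1, 1, 1]   # four species, one dominant (hypothetical counts)
community_b = [50, 50]        # two equally common species (hypothetical counts)

orders = (0, 1, 2, math.inf)
profile_a = {q: hill_number(community_a, q) for q in orders}
profile_b = {q: hill_number(community_b, q) for q in orders}
```

In this example the richness ($q = 0$) ranks community A as more diverse, whereas every order $q \ge 1$ ranks community B higher, which is one way to see why no single index settles the question.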

Clone count representation
An alternative way of quantifying a population is through the species abundance distribution or 'clone counts' defined by
$$c_k = \sum_{i} \mathbb{1}(n_i, k),$$
where the discrete indicator function $\mathbb{1}(n, k) = 1$ if $n = k$ and zero otherwise. The sum is usually taken over all species for which $n_i \ge 1$. Clone counts can also be defined over only a certain special subset of species. Clone counts, or species abundance distributions, in the language of computational mathematics, can be thought of as the measure of the level sets [50] of the discrete function $n_i$, or, in the language of condensed matter physics, the density of states if $n_i$ are thought of as energies of states $i$ [51]. The clone counts also satisfy
$$\sum_{k \ge 1} k\, c_k = N, \qquad \sum_{k \ge 1} c_k = R,$$
where $N$ and $R$ are the discrete total population and the total number of species (richness) present. Clone counts are commonly used in the theory of nucleation and self-assembly [52][53][54], where all particles are identical and $c_k$ represents the number of clusters of size $k$. They are equivalent to 'species abundance distributions' or sometimes ambiguously described as 'clone size distributions.' Clone counts have recently been used to quantify populations in barcoding studies [55] described below.
Clone counts do not depend on the specific labeling of the different types $i$ and do not contain any identity information. However, since the common diversity indices are only a summary of the vector $\{n_i\}$ and also do not retain species identity information, ${}^qD$ can be written in terms of $c_k$ rather than $n_i$:
$${}^qD = \left[ \sum_{k \ge 1} c_k \left( \frac{k}{N} \right)^{q} \right]^{1/(1-q)},$$
which leads to corresponding expressions at specific values of $q$, e.g. ${}^0D = \sum_{k \ge 1} c_k = R$ and ${}^2D = N^2 / \sum_{k \ge 1} k^2 c_k$. While ${}^qD$ is well-defined when species are discretely delineated, for more granular or continuous traits, the delineation of different species will affect the values of $n_i$ and $c_k$. Figure 2 shows population counts ordered by a continuous trait $x$. By defining the discrete species $i$ according to different binning windows over $x$, we find different sets of number and clone counts. Thus, measures of diversity can be highly dependent on the resolution and definition of traits and species.
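The passage from species abundances to clone counts, and the evaluation of a Hill number directly from the clone counts, can be sketched as follows. The function names and the example abundance vector are hypothetical, and the dictionary representation `{k: c_k}` is just one convenient choice.

```python
from collections import Counter

def clone_counts(counts):
    """c_k = number of species with exactly k individuals (only k >= 1 is kept)."""
    return dict(Counter(n for n in counts if n >= 1))

def hill_from_clone_counts(c, q):
    """Hill number of order q (q != 1) written in terms of clone counts c_k."""
    total = sum(k * ck for k, ck in c.items())        # N = sum_k k c_k
    return sum(ck * (k / total) ** q for k, ck in c.items()) ** (1.0 / (1.0 - q))

abundances = [5, 2, 2, 2, 1, 1, 0]    # hypothetical n_i; the last type is absent
c = clone_counts(abundances)          # one clone of size 5, three of size 2, two of size 1

population = sum(k * ck for k, ck in c.items())       # recovers N
richness = sum(c.values())                            # recovers R
```

Because the clone counts discard species labels, any permutation of `abundances` yields the same `c` and hence the same diversity indices.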

Sampling
In most applications, including all the ones we will discuss below, the entire population is not accessible for identification and measurement. In an ecosystem, all animals of the population cannot be tracked. In blood samples, only a small fraction of the cell types in the whole organism is drawn for identification/ sequencing. Thus, inferring the diversity in the entire system from the diversity in the sample is a key problem encountered across many fields.
There are numerous ways to randomly sample a population. One approach is to draw one individual, record its attributes, return it to the system, and allow the system to mix well or equilibrate before again randomly drawing the next individual. This process can be repeated $M$ times. To indicate this type of sampling, we use the subscript $1 \times M$ in the corresponding distributions and expectation values. Similar sampling approaches are used in the 'mark-release-recapture' experiments to estimate population size [56], survival, and dispersal of mosquitos [57]. For a given configuration $\{n_i\}$ and total population size $N$ [58], the probability that the configuration $\{m_i\}$ is drawn after $M$ samples is simply the multinomial distribution
$$P_{1 \times M}(\mathbf{m}|\mathbf{n}, M, N) = M! \prod_{j=1}^{R} \frac{f_j^{\,m_j}}{m_j!},$$
where $f_j \equiv n_j/N$ is the relative population of species $j$, $N \equiv \sum_{i=1}^{R} n_i$ is the total population and $M \equiv \sum_{i=1}^{R} m_i$ is the total number of samples. We can now use $P_{1 \times M}$ to compute the statistics of how the system diversity is reflected in the diversity in the samples. For example, the mean population of species $i$ in the sample is $M f_i = M n_i/N$. The lowest moments of the populations in the sample are
$$\mathbb{E}_{1 \times M}[m_i] = M f_i, \qquad \mathbb{E}_{1 \times M}[m_i m_j] = M(M-1)\, f_i f_j + \delta_{ij}\, M f_i.$$
An alternative random sampling protocol is to draw a fraction $\sigma \equiv M/N < 1$ of the entire population once. This type of sampling arises in biopsies such as laboratory blood tests. To be able to distinguish between this sampling protocol and the previous one, we now use the notation $M \times 1$. In this case the combinatorial probability of a specific sample configuration, given $\mathbf{n}$, $N$, and $M$ is
$$P_{M \times 1}(\mathbf{m}|\mathbf{n}, M, N) = \binom{N}{M}^{-1} \left[ \prod_{i=1}^{R} \binom{n_i}{m_i} \right] \mathbb{1}\Bigl(M, \sum_{i=1}^{R} m_i\Bigr),$$
where the discrete indicator function enforces the constraint between $m_i$ and the sampled population $M$. In this single-draw sampling scenario, the lowest moments are
$$\mathbb{E}_{M \times 1}[m_i] = \sigma n_i, \qquad \mathbb{E}_{M \times 1}[m_i m_j] = \frac{M(M-1)}{N(N-1)}\, n_i (n_j - \delta_{ij}) + \delta_{ij}\, \sigma n_i.$$
Results using $P_{1 \times M}$ and $P_{M \times 1}$ rely on perfectly random sampling, where certain clones/species are not more likely sampled or captured than others. The moments $\mathbb{E}[m_i m_j]$ can be directly used to evaluate the expected Simpson's diversities, $S_r$ (with replacement) and $S$ (without replacement) defined by equations (23) and (24), in the corresponding sample.
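In the case of $1 \times M$ sampling, we find
$$\mathbb{E}_{1 \times M}[S_r(\mathbf{m})] = \frac{1}{M} + \left(1 - \frac{1}{M}\right) \sum_{i=1}^{R} f_i^{\,2}$$
and
$$\mathbb{E}_{1 \times M}[S(\mathbf{m})] = \sum_{i=1}^{R} f_i^{\,2},$$
while for $M \times 1$ sampling, we find
$$\mathbb{E}_{M \times 1}[S_r(\mathbf{m})] = \frac{1}{M} + \left(1 - \frac{1}{M}\right) \sum_{i=1}^{R} \frac{n_i (n_i - 1)}{N (N - 1)}$$
and
$$\mathbb{E}_{M \times 1}[S(\mathbf{m})] = \sum_{i=1}^{R} \frac{n_i (n_i - 1)}{N (N - 1)}.$$
Note that for both types of random sampling, the expected Simpson's diversity (without replacement) in the sample equals a Simpson's diversity of the full system: $S_r$ for $1 \times M$ sampling and $S$ for $M \times 1$ sampling. In general, taking expectations does not commute with evaluating a diversity index, and $\mathbb{E}[{}^qD(\mathbf{m})] \neq {}^qD(\mathbb{E}[\mathbf{m}])$. Effects of sampling on clone counts $c_k$ can be similarly calculated by averaging the definition of the sampled clone count, $b_k = \sum_i \mathbb{1}(m_i, k)$, over the sampling probabilities $P_{M \times 1}(\mathbf{m}|\mathbf{n}, M, N)$ or $P_{1 \times M}(\mathbf{m}|\mathbf{n}, M, N)$. For clone counts, the calculations of moments of the sampled quantities $b_k$ are more involved, and explicitly noncommutative. One advantage of working in the $b_k$ representation is that diversity indices such as the expected sampled richness $R_s$ are straightforward to extract, since $\mathbb{E}[R_s] = \sum_{k \ge 1} \mathbb{E}[b_k]$. Some related results are given in [59,60]. The above results provide expected diversities in the sample assuming full knowledge of $\{n_i\}$ in the system. They represent solutions to the forward problem, the so-called 'rarefaction' in ecology. However, the problem of interest is usually the inverse problem, or extrapolation in ecology. In the simplest case, we wish to infer the expected diversity (or $\{n_i\}$ and $c_k$) in the system from a given configuration $\{m_i\}$ or clone count $b_k$. Extrapolation is a much harder problem and is the subject of many research papers [6,61][62][63][64].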
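For very small systems, the expected Simpson's indices in a sample can be checked by brute-force enumeration of all equally likely draws. The sketch below is such a check under stated assumptions: a hypothetical system with abundances `[3, 2, 1]`, a sample size of 3, exact rational arithmetic, ordered draws with replacement standing in for the one-at-a-time protocol, and unordered subsets of individuals standing in for the single-draw protocol. The helper names are invented for this illustration, and enumeration is feasible only for small populations.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product

def simpson_r(counts):
    """Simpson's index with replacement: sum_i (n_i / N)^2."""
    total = sum(counts)
    return Fraction(sum(c * c for c in counts), total * total)

def simpson_s(counts):
    """Simpson's index without replacement: sum_i n_i (n_i - 1) / (N (N - 1))."""
    total = sum(counts)
    return Fraction(sum(c * (c - 1) for c in counts), total * (total - 1))

def expected_index(n, sample_size, index, with_replacement):
    """Exact expectation of `index` over all equally likely samples of individuals."""
    individuals = [i for i, ni in enumerate(n) for _ in range(ni)]
    draws = (product(individuals, repeat=sample_size) if with_replacement
             else combinations(individuals, sample_size))
    values = [index(list(Counter(draw).values())) for draw in draws]
    return sum(values, Fraction(0)) / len(values)

n = [3, 2, 1]          # hypothetical system abundances, N = 6
M = 3                  # sample size (must be at least 2 for simpson_s)

e_s_with = expected_index(n, M, simpson_s, with_replacement=True)
e_s_without = expected_index(n, M, simpson_s, with_replacement=False)
e_r_with = expected_index(n, M, simpson_r, with_replacement=True)
e_r_without = expected_index(n, M, simpson_r, with_replacement=False)
```

In this example the without-replacement index of the sample is unbiased for the corresponding system index under both protocols, while the with-replacement index of the sample carries an additive $1/M$ bias.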
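One may wish to use the observed sample diversity ${}^qD(M)$ to approximate the population diversity ${}^qD(N)$. For any $q$, the underestimation of ${}^qD(N)$ using ${}^qD(M)$ decreases as the sample size $M$ increases. The deviation of ${}^qD(M)$ from ${}^qD(N)$ is smaller for larger $q$, as higher-order Hill numbers are more heavily weighted by large species, which are less sensitive to subsampling. Chao and others have shown that for $q \ge 1$ and in the $N \to \infty$ limit nearly unbiased approximations can be obtained and when $q \ge 2$, these unbiased estimates are very insensitive to sample size $M$ [59,60]. Using clone counts in a sample of population $M$, Chao et al [65] obtained for $q = 1$ (in terms of Shannon's index):
$$\widehat{\rm Sh} = \sum_{1 \le m_i \le M-1} \frac{m_i}{M} \left( \sum_{k=m_i}^{M-1} \frac{1}{k} \right) + \frac{d_1}{M} (1 - A)^{1-M} \left[ -\log A - \sum_{r=1}^{M-1} \frac{(1-A)^r}{r} \right],$$
where $d_1$ and $d_2$ denote the numbers of species sampled exactly once and exactly twice, and $A = 2 d_2 / [(M-1) d_1 + 2 d_2]$ for $d_2 > 0$. For $q \ge 2$, Gotelli and Chao [59] obtained
$${}^q\hat{D} = \left[ \sum_{m_i \ge q} \frac{m_i^{(q)}}{M^{(q)}} \right]^{1/(1-q)},$$
where $x^{(j)} \equiv x(x-1)\cdots(x-j+1)$ denotes the falling factorial (compare equations (22) and (24)).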
The ill-conditioning of the inverse problem is particularly severe for the richness ${}^0D$. The general formula for an estimate of the system richness is
$$\hat{R}(N) = R_s + d_0,$$
where $R_s$ is the richness observed in the sample, and reduces to the unseen species problem of determining $d_0$, the number of species present in the system but absent from the sample [66,67]. Since the sample size $M$ and the richness $R$ in the system are uncorrelated, one must use information contained in the species fractions $f_i$ or the clone counts $c_k$ in the full system [68,69]. However, a popular estimate for the system richness $R(N)$ is the 'Chao1' estimator [59,70]
$$\text{Chao1}: \quad \hat{R} = R_s + \frac{d_1^2}{2 d_2},$$
where $d_1$ and $d_2$ are the numbers of species sampled exactly once and exactly twice. It is actually a lower bound and gives reliable estimates for systems of size only up to approximately double or triple the sample size $M$. The uncertainty of the Chao1 estimator has also been derived via a variance that is also a function of $d_1$ and $d_2$ [71]. The 'Chao2' estimator gives the system richness as a function of measured incidence [59]
$$\text{Chao2}: \quad \hat{R} = R_s + \frac{q_1^2}{2 q_2},$$
where $q_1$ and $q_2$ are the numbers of species found in exactly one or exactly two samples out of many (as in the $1 \times M$ sampling method). Shen et al [72] derived another estimator, which is only reliable if the sample size $M$ is more than half of the system size $N$. Many of these estimators have been coded into analysis software such as the R package iNEXT [73].
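A minimal sketch of the abundance-based Chao1 lower bound is given below. It assumes the classic form, observed richness plus $d_1^2/(2 d_2)$, where $d_1$ and $d_2$ count species seen exactly once and exactly twice in the sample. When no species is seen twice, the sketch falls back on the commonly used bias-corrected variant $d_1(d_1-1)/[2(d_2+1)]$; that fallback is an assumption of this example rather than something stated in the text. The function name and sample vectors are hypothetical.

```python
def chao1(sample_counts):
    """Chao1 lower-bound estimate of system richness from sampled abundances m_i."""
    observed = [m for m in sample_counts if m > 0]
    r_obs = len(observed)                         # richness seen in the sample
    d1 = sum(1 for m in observed if m == 1)       # singletons
    d2 = sum(1 for m in observed if m == 2)       # doubletons
    if d2 > 0:
        return r_obs + d1 * d1 / (2.0 * d2)       # classic form
    # Assumed fallback: bias-corrected form, which stays finite when d2 = 0.
    return r_obs + d1 * (d1 - 1) / (2.0 * (d2 + 1))

sample_a = [10, 5, 2, 2, 1, 1, 1, 1]   # hypothetical sample: 8 species, d1 = 4, d2 = 2
sample_b = [3, 1, 1]                   # hypothetical sample without doubletons

estimate_a = chao1(sample_a)
estimate_b = chao1(sample_b)
```

The estimate exceeds the observed richness only when singletons are present, which reflects the idea that rare observed species carry the information about unseen ones.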
Regardless of the estimator, the major limitation is an insufficient sample size $M \ll N$. Models predicting species abundances as a function of system size can help bridge this gap. For example, a log-normal relationship for the clone count $c_k$ [74] has been used to find agreeable results [75,76]. In general, models can be extremely useful for quantifying the effects of sampling, particularly when a Bayesian prior is desired.
We have outlined the basic mathematical frameworks for quantifying diversity that have utility across applications in different disciplines. The above summary of sampling assumes a well-mixed population, precluding any spatial dependence of the distribution of individual species. Spatially dependent sampling has been proposed for the origin of relationships between the number of species detected and the total area occupied by the population (see below).

Fields in which diversity plays a key role
Below, we summarize a few modern applications in which diversity is important. By no means exhaustive, the following are simply examples of specific systems in modern biology that reflect the authors' intellectual biases.

Ecology, paradox of the plankton
The classic problem in the context of biological diversity is dubbed the paradox of the plankton and was originally discussed in a paper of the same title [77]. It describes diverse populations of plankton in environments with limited resources or nutrients. Sampled populations of plankton exhibit a large number of species even in low nutrient conditions during which one expects strong competition for resources. This observation runs counter to the competitive exclusion principle arising in many settings [78].
Perhaps the most common application of diversity arises in biological population studies, specifically in ecology [6][7][8][9][10][11]. Possible areas of application include the monitoring of ecosystems and the development of efficient species conservation strategies [2,5,9,10][29][30][31]. Multiple overlapping and nebulous definitions of ecological diversity have been advanced [3,4][25][26][27][28][29]. Early work by Fisher [6] introduced a logarithmic series model to mathematically describe empirical species diversity data. Here, the diversity index referred to a free parameter in the corresponding model. In a later study, MacArthur defined species diversity based on the size of the sampled area [79]. In the ecological setting, multiple layers of subpopulations are an important feature of populations. These subpopulations may be delineated by another property of the individual species, such as size, weight, behavioral attributes, etc. Subpopulations can also be distinguished through their spatial distribution or occupation of different habitats. Whittaker [80,81] qualitatively defined four types of diversity (point, alpha, beta, and gamma) conditioned on habitat or spatial distribution of the subpopulations [81]. Fundamentally, these differences arise from different methods of sampling, leading to different Hill numbers ${}^qD$. We summarize a few often-used descriptions below:
• 'Point diversity' refers to samples taken at a single point or 'microhabitat.' This quantity is usually operationally measured by trapping organisms at one or more specific points.
• 'Alpha diversity' is defined as the diversity within an individual location or specific area. In general, one can define a Hill number derived from measurements at a specific location as ${}^qD_\alpha$, while the index $\alpha \equiv {}^0D_\alpha$ is the richness encountered within a defined area or specific location. A few subtle variations in the definition of the index $\alpha$ exist, mostly related to the sampling process [45,46]. For example, in relation to beta diversity (discussed below), alpha diversity is the mean of the specific-location diversities across all locations within a larger landscape.
• 'Gamma diversity' is the diversity index ${}^qD_\gamma$ determined from the entire dataset, the total landscape, or the entire ecosystem. The index $\gamma \equiv {}^0D_\gamma$ usually denotes the total number of different species or clones at the largest scale. Note that the mean or sum of the alpha diversities is in most cases not equal to the gamma diversity. The nonlinearity of the Hill numbers as well as the intersection or exclusion of species amongst the different sites suggests a need for indices that connect alpha and gamma diversities.
• 'Beta diversity' was devised to describe the difference in diversity between two habitats or between two different levels of ecosystems. While the different levels of diversity are designed to capture the spatial aspects of diversity, different habitats overlap, leading to some amount of arbitrariness in determining the $\beta$-diversity. Moreover, beta diversity was initially described in different ways [45,80,81], leading to confusion about its mathematical definition and use [45,46,48]. One possible definition is Whittaker's [80] multiplicative law ${}^qD_\gamma \equiv {}^qD_\alpha\, {}^qD_\beta$, where ${}^qD_\alpha$ is defined as the mean of the diversities across all microhabitats. Whittaker's definition describes beta diversity ${}^qD_\beta = {}^qD_\gamma / {}^qD_\alpha$ as a measure to quantify the diversity in the total population relative to the mean diversity across all microhabitats [45]. In the limit of $q \to 1^-$, we obtain the Shannon diversity relationship ${\rm Sh}_\gamma = {\rm Sh}_\alpha + {\rm Sh}_\beta$ according to equation (20). Another definition of $\beta$ is given by Lande's [82] additive law $\gamma \equiv \alpha + \beta$ according to which diversity indices are measured in the same units. One concept associated with $\beta$ in terms of the additive partitioning is 'species turnover' quantifying the difference in richness between the entire and the local population. As an example, consider two distinguishable or spatially separate habitats $A$ and $B$. If $A$ contains species $\{a, b, c, d, e\}$ and $B$ contains $\{b, c, f, g\}$, we find $\beta_{A,B} = 5$ associated with the set $\{a, d, e, f, g\}$. The laws of Whittaker and Lande sparked debates about how to properly define beta diversity, and led to the distinction between multiplicative and additive diversity measures [45,46,48].
• 'Delta, epsilon, omega diversity' are other hierarchical definitions of diversities proposed by Whittaker [81]. Delta diversity is analogous to beta diversity but defined at the larger among-landscape scale, while epsilon diversity corresponds to gamma diversity, but at the regional scale that contains many landscapes. Omega diversity is measured at the biosphere scale, and thus characterizes the diversity of all ecosystems [83].
• 'Zeta diversity' was introduced by Hui and McGeoch [84], and is defined by a set of $\zeta$ indices that mathematically describe the species numbers between different partitions of a certain habitat. Specifically, $\zeta_i$ is the mean number of species shared by $i$ partitions. In particular, $\zeta_1$ is the mean richness across all sites. For example, between two samples or sets of data $A$ and $B$, the average number of species is $\zeta_1 := (R_A + R_B)/2$, while the size of the intersection is $\zeta_2 := |A \cap B|$. Generalizations to multiple samples can be defined using a series of zeta diversity indices $\zeta_i$.
• Many other indices have been defined for different applications. The Jaccard index [45,80,84,85] is defined as $J(A, B) = |A \cap B| / |A \cup B|$, and is a general measure for quantifying the similarity in richness between two sets of populations $A$ and $B$.
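For the two-habitat example in the list above, the richness-based ($q = 0$) versions of these indices reduce to elementary set operations. The following sketch uses Python sets for the habitats $A = \{a, b, c, d, e\}$ and $B = \{b, c, f, g\}$ from the text; the variable names are hypothetical, and only presence or absence of species is used, so abundance-weighted versions would require the Hill numbers instead.

```python
habitat_a = {"a", "b", "c", "d", "e"}
habitat_b = {"b", "c", "f", "g"}

# Species found in exactly one of the two habitats (the set {a, d, e, f, g}).
turnover_set = habitat_a ^ habitat_b
beta_ab = len(turnover_set)

# Richness-based alpha (mean local richness) and gamma (richness of the union).
alpha = (len(habitat_a) + len(habitat_b)) / 2
gamma = len(habitat_a | habitat_b)

beta_whittaker = gamma / alpha      # multiplicative partitioning
beta_lande = gamma - alpha          # additive partitioning

# Zeta diversities for two sites and the Jaccard similarity.
zeta_1 = alpha                                  # mean richness across sites
zeta_2 = len(habitat_a & habitat_b)             # species shared by both sites
jaccard = len(habitat_a & habitat_b) / len(habitat_a | habitat_b)
```

With these sets the multiplicative and additive beta values differ (roughly 1.56 versus 2.5), which illustrates why the two partitioning schemes are not interchangeable.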
A myriad of different definitions of diversity indices arise from specific cases of the Hill numbers and consideration of different spatial scales of ecosystems. There is potential to further unify these definitions in a more systematic way using mathematical norms and more general mathematical structures of spatial dispersal of particles.

Area-species law and Island biodiversity
A particularly consistent, albeit qualitative feature observed in ecology is the species-area relationship (SAR) which relates the measured number of species (richness) to the relevant area. These areas can represent distinct habitats, such as mountain tops, or islands. For the latter, much work has been done in the subfield of island biodiversity.
The SAR is usually expressed as a power-law relationship between the number of species (or richness) $R$ and the habitat/island area $A$:
$$R = c A^{z}, \qquad (47)$$
where $c$ is a constant prefactor and $z$ is an exponent. On a log-log plot, $\log R = \log c + z \log A$ defines a line with slope $z$. An example of the area-species law for species counts of long-horned beetles in the Florida Keys is shown in figure 3, yielding a slope $z = 0.29$. An alternative species-area relationship is $e^{R} = c A^{z}$ [94], for which $R$ is a straight line as a function of $\log A$ on a semi-log plot. The classic book by MacArthur and Wilson [95] and many subsequent analyses have promoted and extensively analyzed the SAR idea. In MacArthur and Wilson's neutral equilibrium theory, immigration to and death on an island are monotonically decreasing and increasing functions of the number of species already on the island, respectively. Usually, measured values of the exponent fall in the range $z \sim 0.1$-$0.4$. Field work has also found relationships between the parameters $c$ and $z$ and system-specific attributes such as the island distance to the mainland, habitat type, etc [95,96]. Nonetheless, reasonable predictions based on equation (47) are ubiquitous across many ecological examples.
Mechanistic origins of the robustness of the SAR have been proposed [98][99][100]. Different models for species populations $n_i$ or clone counts $c_k$ were surveyed and the corresponding species-area laws were derived by He and Legendre [99]. Spatial clustering of species and the averaging of random measurements were shown to robustly generate a power-law species-area curve [99,100], highlighting the fundamental importance of sampling.

Gut microbiome
Another ecological system that has recently received much attention is the human microbiome, especially in the gut. The gut bacterial ecosystem is important for health and can impact cardiovascular disease, diabetes, neuropsychiatric diseases, inflammatory bowel disease (IBD), and digestive and metabolic function. Its influence is strong enough that fecal transplantation (bacteriotherapy) has become an effective treatment for recurrent C. difficile colitis infections [102]. This type of infection often occurs after antibiotics disrupt the gut microbiome. Transplants have also been shown to be effective in treating slow-transit constipation [103].
Recent efforts to collect and curate gut microbiome data have included NIH's Human Microbiome Project (HMP) [104,105] and the European Metagenomics of the Human Intestinal Tract (MetaHIT) [106][107][108], as well as the integration of the data in [109]. Each dataset contains sequence data from samples from different body regions of hundreds of individuals, both healthy and diseased.
Bacterial species are usually determined by sequencing of the 16S ribosomal RNA (rRNA), a component of prokaryotic ribosomes that contains hypervariable regions that are species-specific. However, closely related taxa can have very similar sequences, making separation imperfect [110]. Nonetheless, with numerous public databases [101,111,112,113], estimates of species abundances in samples are readily available. In the gut, there are usually on the order of $10^3$ bacterial species, with Bacteroidetes and Firmicutes being the dominant phyla [114,115]. Lower gut diversity is seen to be associated with conditions such as Crohn's disease [114]. For example, the frequency distributions of bacterial species in healthy individuals and Crohn's disease patients are shown in figure 4. The quantification of the diversity of the human microbiome is an essential step in ongoing research, and diversity indices, including α-diversity and β-diversity, have been applied to microbiome data from different anatomical regions and different patients. As with island biodiversity, the gut microbiome can be modeled as a birth-death-immigration (BDI) process.

Barcoding experiments
Besides taxonomy of gut bacteria, the accurate identification of animal and plant species from samples is an essential task in ecology. In the early 2000's a DNA barcoding method was developed to read relatively short DNA regions specific to certain species [119,120]. These barcodes are usually found in mitochondrial DNA and often derived from a region in the cytochrome oxidase gene [119]. By sequencing samples and comparing them with a sequence database such as The Barcode of Life Data System [121,122], one can infer the number of species present within a sample. Detecting specific species within samples using DNA barcoding and DNA libraries arises in many applications including identification of birds [120] and flowering plants [123], detection of contaminants [124], and the tracking of plant composition in processed foodstuffs [125].
Recently, a number of barcoding or tagging protocols [126][127][128] have been developed to genetically label a large population of cells to study how they differentiate and proliferate, especially in the context of hematopoiesis [116,117,129,130] and cancer progression [131][132][133].
A novel approach used to investigate hematopoiesis exploits in situ barcodes [129]. Mice were engineered with an enzyme (Sleeping Beauty Transposase) that randomly moves DNA sequences (transposons) to different parts of the genome. The transposase is designed to be controllable by doxycycline, an antibiotic that can be used to switch on or off gene regulation. When the transposase is briefly activated, transposons within cell genomes are randomly rearranged within a brief period of time. Since the genome is much longer than a transposon (genome length $\gg$ transposon length), the new locations of the transposons will be distinct across the founder cells. After switching off the transposase, proliferation of founder cells imparts the same genomic sequence to their daughter cells. These collections of cells constitute a multiclonal population that proliferates and differentiates.
Analysis of the clonal population within differentiated cell pools shows that granulocytes derive from stem cells at particular time points during the life of the mouse [129]. Comparing clonal abundance structure within different cell lineages shows that clones originally predominant in the lymphoid lineages eventually arise in myeloid cells, indicating that multipotent progenitor cells continually produce cells of both lineages.
In another recent series of studies on hematopoiesis, outlined in figure 5, stem cells (HSCs) were extracted from rhesus macaques and infected with a lentiviral vector. The lentivirus integrates its genome randomly in the genome of the HSCs. Since the lentivirus genome is much shorter than that of mammalian cells, nearly every successful infection results in a new viral integration site (VIS) or clone. The infected stem cells are autologously transplanted into the animal and some of them resume differentiation into progenitor cells that transiently proliferate and further differentiate. Descendant cells carry the same genetic sequence, including the lentivirus integration locations, or the viral integration sites (VIS). Another approach is to use libraries of synthesized DNA/RNA as tags. Here, the different sequences, rather than their integration sites, serve as the distinguishing feature. This process avoids the need to determine VISs.
In all of the above approaches, each successive generation of cells will acquire the same tag, VIS or specific DNA barcode sequence as their parent, and ultimately, as the founder HSC. Compared to the Sleeping Beauty Transposon protocol, the VIS or barcoding experiments require an additional viral transfection step. Nonetheless, these VIS and barcoding experiments are equally effective in dissecting the differentiation process and quantifying lineage bias with age. For example, the variation (in time) of the abundances of a clone across different lineages indicates the level of fate switching of a stem cell [116,134].
These experiments also enabled observation of biological mechanisms on a finer scale compared to traditional studies, allowing inference of parameters that are difficult to measure directly such as the initial HSC differentiation rate and the proliferative potential (number of generations) accessible to progenitor cells [55,135].
After sampling, PCR amplification, and sequencing (each process carrying specific errors), the relative species populations and clone counts within defined cell types can be quantified. Figure 6(a) shows frequencies of barcode i as a function of sampling times t j in rhesus macaque. The fraction of each clone is depicted by the vertical distance between two neighboring curves. Here, it is important to note that the 'diversity' is a measure of the distribution of clone ID (barcodes) instead of lineages (cell types). In figure 6(b), we plot three different and rescaled diversity indices associated with the data in (a). The sampled richness is initially low at month 3 when barcoded clones have not fully differentiated and emerged in the peripheral blood. The sampled richness then peaks at month 9 before stabilizing after month 29. Simpson's diversity seems to continue to increase after month 29 which may indicate more unevenness and coarsening (fewer clones dominating the total population). Shannon's index is shown to decrease slightly, suggesting a decrease in the effective number of barcodes.
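The three indices discussed above can be computed from clone abundances with a short Python sketch. It assumes the common conventions that Shannon's index is $-\sum_i p_i \log p_i$ and that Simpson's index is $\sum_i p_i^2$ (so larger values indicate stronger dominance by a few clones); the rescaling used in figure 6(b) is not reproduced, and both abundance vectors are hypothetical.

```python
import math


def diversity_indices(counts):
    """Return (richness, Shannon index, Simpson index) for clone abundances.

    Assumed conventions: Shannon = -sum p log p (natural log),
    Simpson = sum p^2 (larger means more dominance by few clones).
    """
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one clone must have a positive count")
    p = [n / total for n in counts]
    richness = len(p)
    shannon = -sum(pi * math.log(pi) for pi in p)
    simpson = sum(pi * pi for pi in p)
    return richness, shannon, simpson


# Hypothetical barcode abundances: an even and a dominated population
even = [25, 25, 25, 25]
dominated = [97, 1, 1, 1]

r_even, h_even, s_even = diversity_indices(even)
r_dom, h_dom, s_dom = diversity_indices(dominated)
print("even:     ", r_even, round(h_even, 3), round(s_even, 3))
print("dominated:", r_dom, round(h_dom, 3), round(s_dom, 3))
```

Both hypothetical populations have the same richness, yet the Shannon and Simpson indices differ markedly, which is why several indices are usually reported together.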
Sun et al [129] and Kim et al [116] also used simple clustering algorithms that identified similar clones according to their activity patterns across time. They identified distinct groups of clones that are featured by different time points of contribution to hematopoiesis. Koelle et al [134] calculated Shannon diversity to ensure comparability across time between animals and different cell types.
The employment of neutral barcodes to study blood cell populations is statistically insensitive to spatial partitioning (different tissues in the organism).
Nonetheless, small sampling ($M \ll N$) makes inference difficult. Thus, mechanistic simplifications and mathematical models have been used to quantify clonal evolution. Assuming a multispecies birth-death-immigration process (figure 7), Dessalles et al [136] found explicit steady-state distribution functions for $n_i$ (log series) and $c_k$ (Poisson) for constant $r$ and $\mu$, as well as formulae for the expected Shannon's and Simpson's diversities. Goyal et al [55] derived a master equation for the evolution of $E[c_k]$ and then extended the solution to expected clone counts in the progenitor and sampled mature cell pools. By comparing results to the expected clone count in the sample at steady-state, they were able to infer kinetic parameters of the differentiation process. Biasco et al [138] proposed two candidate stochastic models for $n_i$ and used the Bayesian Information Criterion (BIC) to assess the likelihood of each.
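To make the multispecies BDI picture concrete, the following is a minimal stochastic (Gillespie-type) simulation sketch with constant rates. It is not the method of any of the cited works: the parameter values, the function names `simulate_bdi` and `clone_counts`, and the choice $r < \mu$ (so that the population remains bounded) are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_bdi(num_clones, alpha, r, mu, t_end, seed=0):
    """Simulate a neutral birth-death-immigration process.

    Each of num_clones source clones immigrates at rate alpha; every
    differentiated cell divides at rate r and dies at rate mu.
    Returns the list n, where n[i] is the number of cells of clone i.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    n = [0] * num_clones
    t = 0.0
    while True:
        total = sum(n)
        rate_imm = alpha * num_clones
        rate_birth = r * total
        rate_all = rate_imm + rate_birth + mu * total
        t += rng.expovariate(rate_all)      # waiting time to next event
        if t > t_end:
            return n
        u = rng.random() * rate_all
        if total == 0 or u < rate_imm:
            n[rng.randrange(num_clones)] += 1          # immigration
        else:
            # pick a clone with probability proportional to its size
            i = rng.choices(range(num_clones), weights=n)[0]
            n[i] += 1 if u < rate_imm + rate_birth else -1


def clone_counts(n):
    """c_k: number of clones represented by exactly k cells."""
    return Counter(n)


# Hypothetical parameters: 16 source clones as in figure 7, other values arbitrary
n = simulate_bdi(num_clones=16, alpha=0.5, r=1.0, mu=1.5, t_end=20.0, seed=1)
c = clone_counts(n)
print("total cells N =", sum(n))
print("richness R =", sum(1 for x in n if x > 0), "unseen clones c_0 =", c[0])
```

Repeating the simulation over many seeds would give empirical estimates of $E[c_k]$ that could be compared with analytical steady-state results.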

Cells of the adaptive immune system
Another intra-organism system for which diversity is often quantified is the adaptive immune system in vertebrates. The simplest immune subsystem consists of lymphoid cells (e.g. B and T cells) and tissues. B and T cells originate from common lymphoid progenitors (CLPs) that differentiate from HSCs in the bone marrow. B cells develop from CLPs in multiple stages in the bone marrow and spleen while T cells are formed from CLPs in the thymus. During T cell development in the thymus, T cell receptors (TCRs) are generated by random recombination of the associated receptor gene. TCRs are heterodimeric proteins that usually consist of an alpha chain and a beta chain. After a specific genetic sequence (corresponding to a specific amino acid sequence) is selected for, naive T cells are exported from the thymus into peripheral tissue (such as circulating blood and lymph nodes), where they can further proliferate or interact with antigens presented on the surface of antigen-presenting cells (APCs). Naive T cells (those that have not previously strongly interacted with an antigen) can be activated through association of their surface TCRs with antigens presented by major histocompatibility complex (MHC) molecules on the surface of APCs. Similarly, naive B cells are generated in the bone marrow. The B cell receptors (BCRs) are composed of heavy and light chains and an antigen-binding region, which is generated by the same recombination processes as TCRs. B cells are subsequently activated within tissues by binding to an antigen via their BCRs.
The mechanism responsible for creating very diverse repertoires of both BCRs and TCRs is V(D)J recombination [139]. In developing B cells, this mechanism involves the random recombination of diversity (D) and joining (J) gene segments of the heavy chain (DJ recombination). In the following step, a variable (V) gene segment joins the previously formed DJ complex to create a VDJ segment. In light chains, D segments are missing and therefore only VJ segments are generated. During T cell development and TCR generation, gene segments of the alpha chain and beta chain, the VJ and VDJ segments, respectively, also undergo random recombination. In the case of the beta chain, one of two different D regions of thymocytes first recombines with one of six different joining J regions, followed by rearrangement of the variable V region connecting it to the now-combined DJ segment. Due to the missing D segments in alpha chains, only VJ recombination takes place. The recombination and joining processes in B cells and T cells involve many different genetic deletions and insertions that result in many different BCR and TCR protein sequences and a very large theoretical total number of possible clones, with $R$ of order $10^{14}$-$10^{15}$ [140,141].
In the end, each T or B cell expresses only one TCR or BCR type (an 'immunotype' or 'clonotype'). TCR sequences are preserved during proliferation, while BCR sequences can further evolve [142]. Since the space of antigens (the different amino acid sequences, or epitopes, presented by MHCs) is large, a large number of different TCR and BCR sequences should be present in an organism in order to mount an effective response to a wide range of infections. However, before T cell export from the thymus, a complex selection process occurs [143]. Positive selection eliminates T cells that interact too weakly with MHC molecules. Subsequently, negative selection eliminates those T cells and TCRs that bind too strongly to epitopes. Cells that escape negative selection may lead to autoimmune disease as they react to self-proteins. Thus, the total number of distinct immunoclones realized in an organism (the richness) defines its T cell repertoire and is estimated to range from $10^6$ to $10^8$ [144], with the lower range describing mice and the higher range an estimate for humans. B cell richness in humans is estimated to be $10^8$-$10^9$ [145,146]. These values are much lower than the theoretical repertoire size $R$ of order $10^{14}$-$10^{15}$. TCR and BCR diversity is an important factor in health. For example, TCR diversity has been shown to influence the tumor microenvironment and lymphoma patient survival [147].
Figure 5 (caption fragment; beginning missing). [55,116,117]. Here, 'barcodes' are defined by the random integration sites of a lentiviral vector. (b) Xenograft barcode experiments using mice [118] in which a library of barcodes was used to tag leukemia-propagating cells before direct transplantation into mice.
Although specific TCR sequences $i$ can be determined, and their populations $n_i$ measured and estimated, the TCR identities vary significantly across individuals (private sequences), so clone counts are usually studied instead. Figure 8(a) shows T cell clone counts $b_k$ sampled from mice [141] that exhibit a biphasic power-law behavior. Figure 8(b) shows preliminary clone counts for six individuals, three uninfected patients and three HIV-infected patients [150].
Quantifying T cell diversity is confounded by a number of technical limitations. Usually, the complete T cell repertoire in an animal cannot be directly measured. Rather, as in most other applications, small samples of the entire population are usually drawn. When sampling from animals, the fraction of cells drawn and sequenced is perhaps only $\sigma = M/N \sim 10^{-5}$ to $10^{-2}$.
Thus, clones that have small populations may be missed in the sample. Besides sampling, sequencing requires PCR amplification of the sample, leading to PCR bias, especially in the larger-sized clones [149]. Finally, as in many other applications, there are multiple subclasses of the T cell population. Naive T cells that are activated by antigens develop into memory T cells that carry the same TCR and can further proliferate. Thus, it is difficult to separate the clone counts of different subpopulations such as naive or memory T cells [149].
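The effect of a small sampling fraction $\sigma = M/N$ on the observed richness can be illustrated with a short sketch. The population below (a few large clones plus many single-cell clones) and the function name `sample_richness` are hypothetical; PCR bias and sequencing errors are ignored, and cells are drawn uniformly without replacement.

```python
import random


def sample_richness(n, sample_size, seed=0):
    """Draw sample_size cells uniformly without replacement from a
    population with clone sizes n; return the number of distinct
    clones found in the sample."""
    # one entry per cell, labeled by the index of its clone
    cells = [i for i, size in enumerate(n) for _ in range(size)]
    if not 0 < sample_size <= len(cells):
        raise ValueError("sample_size must be between 1 and N")
    rng = random.Random(seed)
    return len(set(rng.sample(cells, sample_size)))


# Hypothetical repertoire: 10 clones of 1000 cells and 1000 clones of 1 cell
n = [1000] * 10 + [1] * 1000
N = sum(n)
true_richness = len(n)

M = N // 100                      # sampling fraction sigma = 0.01
observed = sample_richness(n, M, seed=0)
print(f"N = {N}, M = {M}, true richness = {true_richness}, "
      f"sampled richness = {observed}")
```

In this hypothetical example most single-cell clones are absent from the sample, so the sampled richness strongly underestimates the true richness even though the large clones are almost always detected.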
Many mathematical models for the development and maintenance of the immune systems have been developed [135,136,140,143,151,152]. For the multiclonal naive T cell population, rudimentary insights can also be gleaned from a birth-death-immigration process, much as in the modeling of hematopoiesis. Here, the thymus mediates the immigration of a large number of clones, which undergo homeostatic proliferation and death in the periphery. Immigration rates can be different for different clones, depending on the likelihood of specific recombination patterns which may be inferred from probabilistic models of VDJ recombination [153,154].
Proliferation in the periphery depends on interactions between self-peptides and T cell receptors and is thus clone-dependent. Recently, it has been shown that TCR-dependent thymic output and proliferation rates (a nonneutral BDI model) influence the measured clone count patterns [155]. These processes form and maintain a diverse T cell receptor repertoire, which is usually characterized by its richness. In contrast to the barcode abundances arising during hematopoiesis, the measured TCR clone counts have shapes that neutral BDI processes are not able to capture.
It is also known that T cell residence times depend on interactions between tissues and T cell receptors. Thus, different clones of T cells are expected to be differentially spatially distributed in the body. Hence, diversity metrics should be defined within and between habitats, much like that in ecology. Finally, it is known that T cell richness decreases with age [156][157][158][159]. Qualitatively, a loss of diversity has been predicted within the multispecies BDI process by assuming a decreasing thymic output rate with age. Even when the thymus is abruptly shut down, the diversity of the T cell repertoire slowly decreases as successive clones go extinct and the clone abundance distribution slowly coarsens. In humans, since the overall T cell population is primarily maintained by proliferation rather than thymic immigration [160], the reduction in diversity is fortunately a slow process.
Figure 7. A simple multispecies birth-death-immigration (BDI) process [55,135,136,137]. A constant source (i.e. stem cells with slow dynamics) of 16 cells, each of a different clone, undergoes asymmetric differentiation with rate $\alpha$ to produce differentiated cells that can undergo birth or death with rates $r(N)$ and $\mu(N)$ that may depend on the total population in the differentiated pool. In this example, the differentiated population contains $N = 30$ cells and $R = 9$ different clones (barcodes), thus leaving $c_0 = 7$ unseen species.
Figure 8. Examples of recently published clone count data. (a) Clone counts derived from a small sample ($10^5$ sequences) of T cells [141]. Note the broad distribution described by a biphasic power-law curve. Ignoring the largest clones, power-law fits for each regime yield slopes of −1.13 and −1.76. However, one should be cautious describing sampled TCR (and BCR) clone counts using power laws as they hold typically for far less than two decades. (b) Human TCR clone counts for three HIV-infected (red) and three uninfected (black) individuals show qualitative differences between the distributions (unpublished). Other data from mice and humans, under different conditions and in different cell types, have been recently published [148,149].

Societal applications of diversity: wealth distributions
Metrics associated with diversity have been naturally applied in human social contexts [19][20][21]161], including physical, cultural, educational [24,32], and economic settings. For example, the distribution of wealth is the chief metric in many economic and political studies. As with all applications, data collection, sampling, and delineating differences in attributes are main research challenges.
Wealth and income, unlike species, are essentially continuous and ordered quantities, and can be described by many indices designed by economists to measure different wealth attributes of a population. Distinct from cellular or ecological contexts, socioeconomic diversity is also often discussed in terms of 'inequality,' 'evenness,' or 'polarization.' Diversity or 'inequality' indices in the socioeconomic setting usually invoke a number of additional assumptions:
• Individual identities are irrelevant: this is analogous to barcoding studies of a singular cell type in which the barcode identity is not important.
• Size and total wealth invariance: the diversity is invariant to the total population size. Only proportions of the total population that are associated with a proportion of the total wealth are relevant.
• Dalton principle: any inequality index should increase if any amount of wealth is transferred from an entity to one with higher existing wealth.
Mathematically, one starts by ordering the wealth or income of a population of $N$ entities, $w_1 \le w_2 \le \dots \le w_i \le w_{i+1} \le \dots \le w_N$. For large $N$, the rescaled wealth distribution $w(f) \equiv w_{fN}$ is a function of the relative fraction of the total population $f = n/N \in [0, 1]$. Furthermore, we can define a normalized wealth distribution or density
$$\tilde{w}(f) = \frac{w(f)}{\int_0^1 w(f')\,\mathrm{d}f'}$$
and the corresponding cumulative distribution
$$W(f) = \int_0^f \tilde{w}(f')\,\mathrm{d}f' \tag{49}$$
or, for discrete entities,
$$W_n = \frac{\sum_{i=1}^{n} w_i}{\sum_{j=1}^{N} w_j}.$$
The functions $W(f)$ are known as 'Lorenz-consistent' if they satisfy the above assumptions [33]. Four representative Lorenz-consistent raw wealth distributions are shown in figure 9(a) as functions of the individual index. In figure 9(b), we plot the continuous cumulative rescaled wealth distribution $W(f)$ as a function of the relative population fraction $f$ corresponding to the wealth distributions shown in figure 9(a). From any ordered distribution, we can define a so-called 'Lorenz curve', which is the cumulative wealth of all individuals of a relative index $f = n/N$ and lower, and which illustrates many indices graphically. For example, the Gini index [162,163] for the red distribution (linear wealth) in figure 9(a) is calculated by the area of the red shaded region ($A$) divided by the area under the equality curve ($A + B = 1/2$):
$$\mathrm{Gini} = \frac{A}{A+B} = 2A = 1 - 2\int_0^1 W(f)\,\mathrm{d}f.$$
In a society where every person receives the same income, the Gini index equals zero. However, if the total wealth is concentrated in only one out of $N$ entities, Gini $= 1 - 2/N$. This motivates a separate definition of the Gini index in terms of the discrete cumulative wealth values $W_i$. The 'Hoover' or 'Robin Hood' index [34,164,165],
$$H = f^{*} - W(f^{*}) = \max_{f}\,[f - W(f)],$$
is the Legendre transform of $W$ at $f^{*}$, the fraction of individuals corresponding to $\mathrm{d}W(f)/\mathrm{d}f\,|_{f=f^{*}} = 1$. For the two Lorenz curves in figure 9(b), the Robin Hood index is indicated by the two corresponding arrows.
The Robin Hood index happens to be a specific case of the Kolmogorov-Smirnov statistic as defined in equation (14) for two cumulative distributions. For convex functions $W(f)$ that satisfy $W(0) = 0$, $W(1) = 1$, the index $H$ corresponds to the fraction of the total wealth that needs to be redistributed in order to achieve uniform wealth. This can be seen by considering the normalized wealth fractions $w_i$ (with $\sum_i w_i = 1$) up to an index $n^{*}$ such that $w_i \le N^{-1}$ for all $i \le n^{*}$. The total wealth that needs to be redistributed to obtain equal wealth fractions $N^{-1}$ for every individual is
$$H = \sum_{i=1}^{n^{*}} \left(N^{-1} - w_i\right) = \frac{n^{*}}{N} - \sum_{i=1}^{n^{*}} w_i.$$
Another possibility is to sum over all entities $w_i$ according to
$$H = \frac{1}{2}\sum_{i=1}^{N} \left|w_i - N^{-1}\right|. \tag{54}$$
The specific, local redistribution is not specified but it would be intriguing to cast it in the language of optimal transport and Wasserstein distances [166]. This way, one might also define costs to wealth redistribution.
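The Lorenz curve, Gini index, and Hoover (Robin Hood) index can be computed for a finite population with the following sketch. The trapezoidal rule used for the area under the Lorenz curve is one possible discretization chosen for this example; it gives $1 - 1/N$ for fully concentrated wealth, which differs from the value quoted above for a different discretization. All function names and wealth values are hypothetical.

```python
def lorenz(wealth):
    """Cumulative wealth fractions W_0 = 0, W_1, ..., W_N = 1 of the
    population ordered from poorest to richest."""
    w = sorted(wealth)
    total = sum(w)
    if total <= 0:
        raise ValueError("total wealth must be positive")
    curve, running = [0.0], 0.0
    for x in w:
        running += x
        curve.append(running / total)
    return curve


def gini(wealth):
    """Gini = 1 - 2 * (area under the Lorenz curve), trapezoidal rule."""
    W = lorenz(wealth)
    N = len(W) - 1
    area = sum((W[i] + W[i + 1]) / 2 for i in range(N)) / N
    return 1 - 2 * area


def hoover(wealth):
    """Hoover index: half the summed deviation of wealth shares from 1/N."""
    total, N = sum(wealth), len(wealth)
    return 0.5 * sum(abs(x / total - 1 / N) for x in wealth)


def hoover_from_lorenz(wealth):
    """Hoover index as the largest gap between equality line and Lorenz curve."""
    W = lorenz(wealth)
    N = len(W) - 1
    return max(n / N - W[n] for n in range(N + 1))


# Hypothetical wealth values for four entities
wealth = [1, 2, 3, 4]
print("Gini:", gini(wealth))
print("Hoover (sum form):", hoover(wealth))
print("Hoover (Lorenz form):", hoover_from_lorenz(wealth))
```

The two Hoover computations agree for this example, reflecting the equivalence between the sum over all entities and the maximal vertical distance between the equality line and the Lorenz curve.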
It is also possible to quantify inequity according to the Theil index [167][168][169], which corresponds to a relative entropy as defined in equation (10). In this case, the entropy of the distribution of $w_i$ is measured with respect to the expected value $E[w]$:
$$\mathrm{Theil} = \sum_{i=1}^{N} w_i \log\frac{w_i}{E[w]}.$$
Here, we may interpret $w_i$ as the probability of finding an individual in income class $i$, and $E[w] = N^{-1}$ corresponds to the relative share of equally distributed wealth. Naturally, many other measures for inequality have been defined by numerous authors focusing on specific socioeconomic areas [170].
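A minimal sketch of the Theil index as a relative entropy with respect to the uniform share $N^{-1}$ is given below. It assumes the form $\sum_i w_i \log(N w_i)$ with normalized wealth shares $w_i$ and the convention $0 \log 0 = 0$; the function name `theil` and the wealth values are hypothetical.

```python
import math


def theil(wealth):
    """Theil index sum_i w_i log(N w_i) for normalized wealth shares w_i.

    Entities with zero wealth contribute nothing (0 log 0 = 0)."""
    total, N = sum(wealth), len(wealth)
    if total <= 0:
        raise ValueError("total wealth must be positive")
    index = 0.0
    for x in wealth:
        share = x / total
        if share > 0:
            index += share * math.log(share * N)
    return index


# Hypothetical wealth distributions for four entities
print("equal:       ", theil([1, 1, 1, 1]))
print("graded:      ", theil([1, 2, 3, 4]))
print("concentrated:", theil([0, 0, 0, 1]))
```

Under these assumptions the index vanishes for equal wealth and reaches $\log N$ when all wealth is held by a single entity.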
However, typical inequality indices do not convey any judgment, belief system, or behavioral propensity on measured inequity and thus may not capture typical social concepts. In an effort to better quantify concepts such as inequity or 'polarization' [171], sociologists have proposed a number of polarization indices that are argued to be more directly correlated with social tension and unrest. For example, Esteban and Ray [35,36] developed a measure of polarization to account for clusters within which individuals are more similar in an attribute $x$ (such as wealth) than they are between clusters. While there may be many ways to define polarization, imposing a few reasonable features and constraints can narrow down the allowable forms. First, they assume an 'identity-alienation framework' in which an individual with attribute value $x$ identifies with the population density $f(x)$ at that value. An effective 'antagonism' of an individual with attribute $x$ towards those with attribute $y$ is defined as $T[f(x), d]$, where a simple form for the distance is $d = |x - y|$. The polarization $P$ is then assumed to take the form
$$P[f] = \iint T[f(x), |x - y|]\, f(x) f(y)\,\mathrm{d}x\,\mathrm{d}y.$$
They then impose the axioms that the polarization (i) cannot increase if the distribution is squeezed (compressed towards its peak), (ii) must increase if two nonoverlapping distributions are moved farther apart, and (iii) should be invariant to scalings of the total population. Using these constraints, the polarization can be more explicitly defined as
$$P_{\alpha}[f] = \iint f(x)^{1+\alpha} f(y)\,|x - y|\,\mathrm{d}x\,\mathrm{d}y, \tag{57}$$
where $1/4 \le \alpha \le 1$ [36] (Esteban and Ray [35] and Kawada, Nakamura, and Sunada [172] found $0 \le \alpha < 1.6$ using slightly different assumptions). The parameter $\alpha$ describes the amount of 'polarization sensitivity.' It measures identification of a population with its distribution and distinguishes polarization from other standard inequity measures such as the Gini index (when $\alpha = 0$ [35]) or Simpson's index.
Also, note that when $\alpha = 0$, the form of $P[f]$ resembles the total potential energy of a system of particles that are distributed according to $f(x)$ and exhibit an interaction energy $|x - y|$. The discrete analogue of equation (57) is
$$P_{\alpha} = \sum_{i}\sum_{j} f_i^{1+\alpha} f_j\,|x_i - x_j|,$$
where $f_i$ denotes the fraction of the population with attribute $x_i$, and for which the individuals $i, j$ can be generalized to groups. In empirical studies, the Esteban and Ray polarization measure is given by
$$P_{\mathrm{ER}} = \sum_{i}\sum_{j} \pi_i^{1+\alpha} \pi_j\,|\mu_i - \mu_j|,$$
where $\pi_i$ and $\mu_i$ are the relative frequency and the mean of the wealth in group $i$, respectively [173]. D'Ambrosio and Wolff suggested replacing the difference of mean wealths in equation (57) by the Kolmogorov measure of variation distance [173,174], here denoted $K_{ij}$, between the wealth distributions of groups $i$ and $j$, to obtain
$$P_{\mathrm{DW}} = \sum_{i}\sum_{j} \pi_i^{1+\alpha} \pi_j\,K_{ij}.$$
Additional indices have been proposed, including a class of polarizations by Tsui and Wang [175] that is built from a smooth function $\psi$ of a rescaled distance. Many of these polarization metrics can in fact be expressed in terms of the Gini coefficient. For example, the Foster-Wolfson polarization index $P_{\mathrm{FW}}(x)$ [176], equation (63), is defined in terms of the corresponding mean income $\mu(x)$ and the between-group and within-group Gini coefficients, denoted by the subscript indices B and W. According to the definition of $P_{\mathrm{FW}}(x)$, inequity differs from polarization in the following way: the Gini index, as the sum of $\mathrm{Gini}_B$ and $\mathrm{Gini}_W$, quantifies the unequal distribution of wealth in a society, whereas polarization is measured in terms of the difference of $\mathrm{Gini}_B$ and $\mathrm{Gini}_W$. Thus, an increase in within-group inequality leads to a larger total inequality but a lower polarization. A more refined understanding of socioeconomic diversity will need to consider multiple classes of attributes, including possible geographic or spatial distributions.
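The group-based Esteban and Ray measure can be evaluated with a few lines of Python. The sketch assumes the form $\sum_i \sum_j \pi_i^{1+\alpha} \pi_j |\mu_i - \mu_j|$ with relative group frequencies $\pi_i$ and group means $\mu_i$; the function name `esteban_ray` and the two example populations are hypothetical.

```python
def esteban_ray(group_sizes, group_means, alpha):
    """Group-based polarization sum_ij pi_i^(1+alpha) pi_j |mu_i - mu_j|.

    group_sizes: number of individuals in each group
    group_means: mean attribute (e.g. wealth) of each group
    alpha:       polarization sensitivity
    """
    if len(group_sizes) != len(group_means):
        raise ValueError("sizes and means must have the same length")
    total = sum(group_sizes)
    if total <= 0:
        raise ValueError("total population must be positive")
    pi = [s / total for s in group_sizes]
    return sum(
        pi_i ** (1 + alpha) * pi_j * abs(mu_i - mu_j)
        for pi_i, mu_i in zip(pi, group_means)
        for pi_j, mu_j in zip(pi, group_means)
    )


# Hypothetical populations: two opposed groups versus three evenly spread groups
two_groups = esteban_ray([1, 1], [0.0, 1.0], alpha=1.0)
three_groups = esteban_ray([1, 1, 1], [0.0, 0.5, 1.0], alpha=1.0)
print("two groups:  ", two_groups)
print("three groups:", three_groups)
```

In this example, the population split into two opposed groups is more polarized than the one with an additional middle group, in line with the intuition behind the identity-alienation framework.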
The described polarization measures are relevant not only in the context of wealth distributions, but they are also able to provide important insights into other sociological phenomena associated with the notion of diversity. As one example, quantitative measures of polarization are applicable to examine factors that influence the cohesiveness of groups [23]. In this context, the social entropy theory aims to quantitatively compare diversity across social systems such as societies, organizations, and individual groups [19,20,177].

Summary and discussion
Quantifying the diversity of a given population in terms of a single measure such as richness does not fully describe the underlying distribution of species or other properties. Various diversity measures have been developed and tailored to specific applications in different fields including ecology, biology, and economics. Mathematically, one can describe populations in terms of species numbers $n_i$ (number of entities of type $i$) or clone counts $c_k$ (number of species of size $k$). Hill numbers ${}^{q}D$ provide a framework to unify some common diversity indices that are based on a species-number description. Hill numbers with large values of $q$ put more weight on common species whereas small values of $q$ yield measures that are more sensitive to rarer species. This implies that measures such as richness ($q = 0$) and evenness ($q = 1$) are more prone to sampling effects than Simpson's diversity index ($q = 2$) or Hill numbers with $q > 2$ [179]. In table 1, we summarize some common diversity measures, their applications, and advantages and disadvantages.
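The dependence of Hill numbers on the order $q$ can be illustrated with a brief sketch. It assumes the standard definition ${}^{q}D = (\sum_i p_i^{q})^{1/(1-q)}$, with the $q \to 1$ limit given by the exponential of the Shannon entropy; the function name `hill_number` and the abundance vector are hypothetical.

```python
import math


def hill_number(counts, q):
    """Hill number of order q for a list of species abundances."""
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one species must have a positive count")
    p = [n / total for n in counts]
    if q == 1:
        # limit q -> 1: exponential of the Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


# Hypothetical abundances: two common species and three rare ones
abundances = [50, 30, 10, 5, 5]
hill_values = [hill_number(abundances, q) for q in (0, 1, 2, 3)]
for q, value in zip((0, 1, 2, 3), hill_values):
    print(f"q = {q}: D = {value:.3f}")
```

For $q = 0$ the result is the richness, and the values decrease as $q$ grows because rare species contribute less and less to the effective number of species.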
In conclusion, we have provided an overview of the most relevant measures of diversity and their information-theoretic counterparts. We then summarized common applications of diversity indices in biological and ecological systems. Despite the ambiguity in the definitions and the variety of diversity measures [3,4,25,26,27,28,29], the concept is still of great importance for the monitoring of ecosystems and in the context of conservation planning [2,5,9,10,29,30,31].
We also described the importance of a quantitative treatment of diversity for experiments in the study of the gut microbiome, stem cell barcoding, and the adaptive immune system. Finally, we discussed examples of the application of diversity measures in human social systems including the characterization of wealth distributions in societies and measures of political or cultural polarization. Scientific conclusions in these fields, and in ecology, are particularly sensitive to sampling and measurements. However, accurate measurements [180], meaningful classification, spatial resolution [100], and informative sampling protocols [68,75] remain elusive across almost all fields. Sometimes, as illustrated in figure 6(b), different measures even lead to contradictory conclusions [181]. There is no golden rule in choosing a unique metric for a specific situation, as the sampling effects also depend on the underlying unknown clone-count distribution [179]. It is recommended that one cross-checks different metrics, while bearing in mind how sampling effects may impact diversity measures differently.