Estimating the size of a hidden finite set: large-sample behavior of estimators

A finite set is "hidden" if its elements are not directly enumerable or if its size cannot be ascertained via a deterministic query. In public health, epidemiology, demography, ecology and intelligence analysis, researchers have developed a wide variety of indirect statistical approaches, under different models for sampling and observation, for estimating the size of a hidden set. Some methods make use of random sampling with known or estimable sampling probabilities, and others make structural assumptions about relationships (e.g. ordering or network information) between the elements that comprise the hidden set. In this review, we describe models and methods for learning about the size of a hidden finite set, with special attention to asymptotic properties of estimators. We study the properties of these methods under two asymptotic regimes: "infill", in which the number of fixed-size samples increases but the population size remains constant, and "outfill", in which the sample size and population size grow together. Statistical properties under these two regimes can be dramatically different.

Despite the wide diversity in application domains, most statistical approaches to estimating the size of a hidden set fall into a few general categories. Some approaches are based on traditional notions of random sampling from a finite population [50,51]. Others leverage information about the ordering of units [42,43], or relational information about "network" links between units [5,26,52,53,54,55]. Single- or multi-step sampling procedures that involve record collection or "marking" of sampled units, called capture-recapture experiments, are common when random sampling is possible [23,35,56,57,58,59]. Sometimes exogenous or population-level data can help: when the proportion of units in the hidden set with a particular attribute is known a priori, the proportion with that attribute in a random sample can be used to estimate the total size of the set [18,25,60,61,62,63]. Still other methods use features of a dynamic process, such as the arrival times of events in a queueing process, to estimate the number of units in a hidden set [45,46].
Alongside these practical approaches, corresponding theoretical results provide justification for particular study designs and estimators, based on large-sample (asymptotic) arguments. Guidance for prospective study planning often depends on asymptotic approximation. For example, sample size calculation may be based on asymptotic approximation if the finite-sample distribution of an estimator is not identified or hard to analyze [64,65,66]. In retrospective analysis of data and the comparison of statistical approaches, researchers may choose estimators based on large-sample properties like asymptotic unbiasedness, efficiency and consistency if closed-form expressions for finite-sample biases and variances are hard to derive [67,68]. Claims about the large-sample performance of estimators depend on specification of a suitable asymptotic regime, and it is well known that estimators can perform differently under different asymptotic regimes. Asymptotic theory in spatial statistics provides some perspective on what it means to obtain more data from the same source: informally, an "infill" asymptotic regime assumes a bounded spatial domain, with the distance between data points within this domain going to zero. An "increasing domain" or "outfill" asymptotic regime assumes that the minimum distance between any pair of points is bounded away from zero, while the size of the domain increases as the sample size increases. The latter is usually the default asymptotic setting considered by researchers studying the properties of spatial smoothing estimators [69,70,71]. However, under infill asymptotics, these desirable asymptotic properties of smoothing estimators often do not hold: even when consistency is guaranteed, the rate of convergence may be different [69,72,73,74,75].
When the size of the population from which the sample is drawn is the estimand of interest, intuition about large-sample properties of estimators can break down, but a similar asymptotic perspective is useful in studying the properties of estimators for the size of a hidden set: an infill asymptotic regime takes the total population size to be fixed, while the number of samples from this population increases; the outfill regime permits the sample size and population size to grow to infinity together.
In this paper, we review models and methods for estimating the size of a hidden finite set in a variety of practical settings. First we present a unified characterization of set size estimation problems, formalizing notions of size, sampling, relational structures, and observation. We then introduce the non-asymptotic regime in which sample size tends to the population size, and define the "infill" and "outfill" asymptotic regimes in which the sample size and population size may increase. We investigate a range of problems, query models, and estimators, including the German tank problem, failure time models, the multiplier method, the network scale-up estimator, the Horvitz-Thompson estimator, and capture-recapture methods. We characterize consistency and rates of estimation errors for these estimators under different asymptotic regimes. We conclude with discussion of the role of substantive and theoretical considerations in guiding claims about statistical performance of estimators for the size of a hidden set.

Setting and notation

Hidden sets
Let U be a set consisting of all elements from a specified target population. In general, U can be discrete or continuous. Let µ(·) be a measure defined on U such that µ(U ) < ∞. The size of U is µ(U ). We call U a hidden set if the members of U are not directly enumerable, or if its size µ(U ) cannot be ascertained from a deterministic query. When U is a finite set of discrete elements, µ(U ) = |U | := N is the cardinality of U . If alternatively U is the union of intervals, then µ(·) can be taken as Lebesgue measure.
We seek to learn about the size of U by sampling its elements. Define a probability space (U, F, P), where F is a σ-field, and P is a probability measure on U. The measure P represents a probabilistic query mechanism by which we may draw subsets of the elements of U. For each possible sample s ∈ F, defining P(s) gives a notion of random sampling. Sequential sampling designs can be specified by defining the sequential sampling probabilities $P(S_i = s_i \mid s_1, \ldots, s_{i-1})$. Sequential samples are denoted as $s = (s_1, \ldots, s_k)$, and the sample size is defined as $|s_1| + \cdots + |s_k|$, the sum of the cardinalities of the samples, which can be larger than µ(U) under with-replacement sampling.
Elements of the hidden set U, or of a sample s from U, may have attributes, labels, or relational structures that permit estimation of µ(U) from a subset. An element i ∈ U may be labeled or have attributes $X_i$, which may be continuous, discrete, unordered, or ordered. The elements of U may be connected via a relational structure, such as a graph G = (U, E), where the vertex set is U, and edges {i, j} ∈ E represent relationships between elements. Alternatively, the sampling mechanism may impose a structure on the elements of a sample: if $s_1 \subseteq U$ and $s_2 \subseteq U$ are samples from U, then the intersection $M = s_1 \cap s_2$ is the set of elements in both samples. An observation on the sample s consists of statistics that reflect these attributes, labels or structures of the units in s, such as the values of the attributes $\{X_i\}$, network degrees in a graph, or the size of the intersection of samples |M|.

Asymptotic regimes
We now formalize asymptotic regimes relevant for hidden set size estimation.
Definition 1 (Asymptotic regime). Let $(U_t, F_t, P_t)$ be a probability space defined for each t = 1, 2, . . ., and let $s_t = \{s_1^{(t)}, \ldots, s_{k_t}^{(t)}\}$ be a collection of $k_t$ samples with $s_i^{(t)} \in F_t$. An asymptotic regime is a sequence $\{s_t, U_t, P_t\}$, t = 1, 2, . . .
Figure 1: Illustration of different regimes for discrete sets. Units are indicated by circles. The sample s "expands" to U under the finite-population regime. Infinitely repeated samples of a fixed size are drawn from a fixed population under infill asymptotics. Under outfill, s and U grow simultaneously with the former going to a fixed proportion of the latter.
Definition 2 (Finite-population regime). Let U be a hidden discrete set of fixed size. The finite-population (non-asymptotic) regime is $U_t = U$ for all t and $s_t = U$ for all $t > t_0$, where $t_0 < \infty$ is a positive integer.
Next, we define the "infill" asymptotic regime that arises when sampling repeatedly (with replacement between different samples) from a set of fixed finite size. This regime is an example of a superpopulation model [76,77] which reproduces the original population U t = U for each t.
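Definition 3 (Infill asymptotic regime). Let $(U_t = U, F_t = F, P_t)$ be a sequence of probability spaces, where $P_t$ assigns probability $P(s_i^{(t)} \mid s_1^{(t)}, \ldots, s_{i-1}^{(t)})$ to the ith sample, with $s_1^{(t)}, \ldots, s_{k_t}^{(t)} \in F$ for any t. The infill asymptotic regime is a sequence $\{s_t, U_t, P_t\}$ such that the size of each sample $|s_i^{(t)}|$ and µ(U) are both fixed and bounded, and the number of samples $k_t \to \infty$ as t → ∞.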
Sometimes it can be difficult to conceptualize sampling infinitely many times from U , or the sampling design may be subject to practical constraints, so that sampling only a single or fixed number of samples, or a fixed proportion of the total population, is allowed. It is therefore also reasonable to study the performance of estimators under an asymptotic regime in which a single sample is obtained from the hidden set, where the size of the sample and hidden set may tend to infinity together.
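Definition 4 (Outfill asymptotic regime). Let $(U_t, F_t, P_t)$ be a sequence of probability spaces, where $P_t$ assigns probability $P(s_i^{(t)} \mid s_1^{(t)}, \ldots, s_{i-1}^{(t)})$ to the ith sample, with $s_1^{(t)}, \ldots, s_{k_t}^{(t)} \in F_t$ for any t. The outfill asymptotic regime is a sequence $\{s_t, U_t, P_t\}$ such that $\mu(U_t) \to \infty$ and the sample size $n^{(t)} \to \infty$ in fixed proportion to $\mu(U_t)$, where $\lim_{t\to\infty} k_t$ may be finite or infinite.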
We are primarily interested in the outfill asymptotic regime with k t = 1 for all t. The multiplier and capture-recapture methods, described below, are special cases where k t may be greater than one. Figure 1 illustrates different regimes in general discrete settings.

Statistical properties of estimators
Let $\delta(s_t)$ be an estimator of $\mu(U_t)$, defined for each t. We are interested in the statistical properties of $\delta(s_t)$ under the asymptotic regimes described above. An estimator is called unbiased if $E_t[\delta(s_t)] = \mu(U_t)$ for all t, where $E_t(\cdot)$ denotes expectation with respect to $P_t$. Under an asymptotic regime, an estimator is called asymptotically unbiased if $\lim_{t\to\infty} (E_t[\delta(s_t)] - \mu(U_t)) = 0$. There may be some slightly biased estimators whose variance is smaller than that of every unbiased estimator. A common way to balance the trade-off between the bias and variance is to evaluate the mean squared error (MSE), $MSE(\mu(U_t), \delta(s_t)) = E_t[(\delta(s_t) - \mu(U_t))^2]$, which equals the variance plus the squared bias. The asymptotic MSE under a given regime is defined as $\lim_{t\to\infty} MSE(\mu(U_t), \delta(s_t))$.
An estimator $\delta(s_t)$ that satisfies $\lim_{t\to\infty} P_t(|\delta(s_t) - \mu(U_t)| > \varepsilon) = 0$ for any ε > 0 under a particular asymptotic regime $\{s_t, U_t, P_t\}$ is called consistent for $\mu(U_t)$. An estimator $\delta(s_t)$ is called MSE consistent for $\mu(U_t)$ under a given asymptotic regime if $MSE(\mu(U_t), \delta(s_t)) \to 0$ as t → ∞ under that regime. MSE consistency implies consistency. Under a particular asymptotic regime, we call a sequence of estimates $\delta(s_t)$ asymptotically normal with mean ξ, variance $\sigma^2/t^r$ and rate $t^r$ if the cumulative distribution function (CDF) of $t^r(\delta(s_t) - \xi)$ converges to the CDF of a mean-zero normal distribution.

Ordered sets
Suppose each unit in the hidden set i ∈ U has a distinct label X i ∈ R, so that the labels give a natural ordering of the elements in U : we can define units i < j if X i < X j . One common scenario for discrete U is that the X i 's are consecutive integers. Another common situation when U is equivalent to an interval in R is that ∪ i∈U X i equals that interval. An observation of samples from an ordered set U consists of sampled units s and their labels {x i : i ∈ s}.

Discrete set: the German tank problem
In 1943, the Economic Warfare Division of the American Embassy in London initiated a project to learn about the capacity of the German military using serial numbers found on German equipment, including tanks, trucks, guns, flying bombs, and rockets [42,78]. In a simple conceptualization of the problem, let U = {1, . . . , N} and consider sampling n = |s| units without replacement from U with probability $P(s) = 1/\binom{N}{n}$. With $k_t$ i.i.d. repeated samples, an estimator δ(s) for N is a functional of the observations, including the sample sizes and observed labels $X_{1,1}, \ldots, X_{1,n}, \ldots, X_{k_t,n}$. Let $X_{k(j)}$ be the jth order statistic for the kth sample.
With one sample, the maximum likelihood estimator (MLE) for N is $\hat{N}_{MLE} = X_{(n)}$, which is negatively biased. Goodman [43] proposed an unbiased estimator $\hat{N}_G = \frac{n+1}{n} X_{(n)} - 1$, which is a uniformly minimum-variance unbiased estimator (UMVUE), with $Var(\hat{N}_G) = (N-n)(N+1)/(n(n+2))$. An alternative estimator of N takes into account the gap between $X_{(n)}$ and N, and adjusts for the bias with the average gap between order statistics [43]: $\hat{N}_2 = X_{(n)} + \frac{1}{n-1}\sum_{i=2}^{n}(X_{(i)} - X_{(i-1)} - 1) = X_{(n)} + \frac{X_{(n)} - X_{(1)}}{n-1} - 1$. The estimator is also unbiased, with $Var(\hat{N}_2) = n(N-n)(N+1)/((n-1)(n+1)(n+2))$. The estimator $\hat{N}_2$ can also be modified to estimate N when the labels do not start with 1. In particular, $\hat{N}_3 = \frac{n+1}{n-1}(X_{(n)} - X_{(1)}) - 1$ is the UMVUE of N when the initial label is unknown [43], with $Var(\hat{N}_3) = 2(N-n)(N+1)/((n-1)(n+2))$.
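As a quick numerical check of the single-sample estimators, the following Python sketch draws without-replacement samples from {1, ..., N} and averages each estimator over many replications. It assumes the closed forms (n+1)X_(n)/n − 1, X_(n) + (X_(n) − X_(1))/(n−1) − 1 and (n+1)(X_(n) − X_(1))/(n−1) − 1 for the three estimators; the function names, the seed, and the values N = 100, n = 10 are illustrative choices, not part of the original analysis.

```python
import random
from statistics import mean


def tank_estimators(labels):
    """Return (N_G, N_2, N_3) for one sample of distinct integer labels."""
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two labels")
    x_min, x_max = min(labels), max(labels)
    n_g = (n + 1) * x_max / n - 1                    # labels assumed to start at 1
    n_2 = x_max + (x_max - x_min) / (n - 1) - 1      # maximum plus average gap
    n_3 = (n + 1) * (x_max - x_min) / (n - 1) - 1    # initial label unknown
    return n_g, n_2, n_3


def monte_carlo_means(N=100, n=10, reps=20000, seed=0):
    """Average each estimator over `reps` simple random samples from {1..N}."""
    rng = random.Random(seed)
    draws = [tank_estimators(rng.sample(range(1, N + 1), n)) for _ in range(reps)]
    return tuple(mean(d[j] for d in draws) for j in range(3))
```

Calling `monte_carlo_means()` should return three values close to N, in line with unbiasedness; the spread of the individual draws can be compared against the variance expressions in the same way.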
When there is more than one sample, we take the MLE as the maximizer of the joint sampling probability $P_t(s_1, \ldots, s_{k_t})$, which is $\max_{i \in [k_t]} X_{i(n)}$, the largest observed value across all $k_t$ samples.
For estimators with closed forms like $\hat{N}_G$, $\hat{N}_2$, $\hat{N}_3$, we derive $k_t$ estimates $\delta(s_i^{(t)})$, i = 1, . . . , $k_t$, one from each sample, and take their average as the estimator. In the remaining sections, we average the estimators under infill by default, except for the models where infinite without-replacement sampling is feasible (e.g. Sections 3.2 and 4.1). We consider the infill asymptotic regime where $n_t = n$, $N_t = N$ and $k_t \to \infty$, and the outfill regime where $n_t, N_t \to \infty$, $k_t = 1$ with $n_t/N_t \to c \in (0, 1)$. Figure 2 illustrates different regimes for the German tank problem. We have the following asymptotic results:
Höhle and Held [79] investigated the same problem from a Bayesian perspective. Taking an improper uniform prior, p(N) ∝ 1, the posterior mode is the MLE $X_{(n)}$ and the posterior mean is $(n-1)(X_{(n)} - 1)/(n-2)$ for n > 2. The latter converges in probability to a biased quantity under the infill regime, and has the same MSE rate as $\hat{N}_G$ under outfill asymptotics.
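The infill convention described here (pool the maximum for the MLE, average the per-sample closed-form estimates) can be sketched as follows. This is a minimal illustration assuming the single-sample form (n+1)X_(n)/n − 1 for Goodman's estimator; `infill_estimates` and its arguments are hypothetical names introduced for the example.

```python
import random
from statistics import mean


def goodman_umvue(labels):
    """Single-sample estimate (n + 1) * max / n - 1 for labels drawn from 1..N."""
    n = len(labels)
    return (n + 1) * max(labels) / n - 1


def infill_estimates(N, n, k, seed=0):
    """Draw k independent size-n samples from {1..N} (N and n held fixed).

    Returns (pooled maximum, average of the k per-sample Goodman estimates).
    """
    rng = random.Random(seed)
    samples = [rng.sample(range(1, N + 1), n) for _ in range(k)]
    pooled_max = max(max(s) for s in samples)          # MLE across all samples
    averaged = mean(goodman_umvue(s) for s in samples)  # averaged estimator
    return pooled_max, averaged
```

With N and n fixed and k large, the pooled maximum reaches N once some sample contains the largest label, and the averaged estimate concentrates around N; under outfill with a single sample neither mechanism is available.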

Continuous interval
A continuous version of the German tank problem arises for estimation of the length θ of a finite interval using i.i.d. random samples from the continuous uniform distribution Unif(0, θ). For one sample of size n, the probability density is $P(s) = \theta^{-n} \cdot \mathbb{1}\{X_{(1)} \ge 0 \text{ and } X_{(n)} < \theta\}$. Repeated samples are independently generated under the same mechanism.
For one sample, the MLE is $X_{(n)}$, which is biased. The UMVUE is $\frac{n+1}{n} X_{(n)}$, with variance $\theta^2/(n(n+2))$. Consider the infill regime with $n_t = n$, $\theta_t = \theta$, and $k_t \to \infty$, and the outfill regime where $k_t = 1$, $n_t \to \infty$ and $\theta_t \to \infty$ with $n_t/\theta_t \to c > 0$. When there are $k_t$ samples, the MLE is $\max_{i \in [k_t]} X_{i(n)}$, which is biased, but asymptotically unbiased under the infill regime when $k_t \to \infty$. Since under the outfill regime. We discuss outfill consistency when the density increases at polynomial and exponential rates near θ in the Appendix.
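For reference, the single-sample moments can be checked directly from the density of the sample maximum. This is a standard calculation, sketched here under the i.i.d. Unif(0, θ) model; the symbol $\hat\theta$ is notation introduced only for this sketch.

```latex
\begin{align*}
f_{X_{(n)}}(x) &= \frac{n x^{n-1}}{\theta^{n}}, \qquad 0 \le x \le \theta, \\
E[X_{(n)}] &= \frac{n}{n+1}\,\theta, \qquad
E[X_{(n)}^{2}] = \frac{n}{n+2}\,\theta^{2}, \\
\hat\theta &= \frac{n+1}{n}\,X_{(n)}, \qquad E[\hat\theta] = \theta, \\
\operatorname{Var}(\hat\theta)
  &= \frac{(n+1)^{2}}{n^{2}}
     \left(\frac{n}{n+2} - \frac{n^{2}}{(n+1)^{2}}\right)\theta^{2}
   = \frac{\theta^{2}}{n(n+2)}.
\end{align*}
```

If n grows in proportion to θ, say n/θ → c, the last expression tends to 1/c² rather than to zero, which suggests why a single growing sample need not yield a vanishing error in this model.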

Bernoulli Trials
Consider a discrete hidden set U consisting of N unlabeled, indistinguishable units. A sample s from U arises by associating a binary indicator Y i ∼ Bernoulli(p) to each i ∈ U , for fixed 0 < p < 1, where different realizations of the Y i 's can be generated in different draws. The probability p may be known or unknown. A single sample consists of the subset of units with positive indicators, s = {i ∈ U : Y i = 1}. This is a frequently encountered situation in computer science, ecology, business, epidemiology, and many other fields [33,34,80,81].

Binomial N parameter
We first assume that p is known. A single sample s from U gives an observation $X = |s| = \sum_{i \in U} Y_i$, which follows a Binomial(N, p) distribution. When there are n independent samples, we assume they are generated by the same mechanism, so $P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{j=1}^{n} \binom{N}{x_j} p^{x_j} (1-p)^{N-x_j}$. The finite-population regime arises when n = 1 and p → 1, i.e. when all units are associated with indicator 1 and observed in a single sample. We consider the infill asymptotic regime with $N_t = N$ and n → ∞, where the "sample size" n here represents the number of repeated samples. The outfill regime is $n_t, N_t \to \infty$ with $n_t/N_t \to c > 0$. Figure 3 shows how the sampling mechanism varies under different regimes for the binomial N model. − → 0 for any α > 1/2. The "relative error" of $X_{(n)}$ with α = 1 goes to p − 1 in probability [82,83].
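To make the two regimes concrete for known p, the sketch below uses the method-of-moments estimate, the mean count divided by p. This estimator is an assumption made for illustration, and the function names are hypothetical. Its exact variance is N(1 − p)/(np), which vanishes when n → ∞ with N fixed but stays constant when n and N grow in proportion.

```python
import random
from statistics import mean


def binomial_draw(rng, N, p):
    """One Binomial(N, p) count, built from N Bernoulli(p) indicators."""
    return sum(rng.random() < p for _ in range(N))


def moment_estimate(counts, p):
    """Method-of-moments estimate of N from repeated counts with known p."""
    return mean(counts) / p


def moment_variance(N, n, p):
    """Exact variance of mean(counts) / p for n independent Binomial(N, p) counts."""
    return N * (1 - p) / (n * p)
```

For example, `moment_variance(100, 50, 0.5)` and `moment_variance(1000, 500, 0.5)` coincide, while `moment_variance(100, 5000, 0.5)` is a hundred times smaller than the first.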
When p is unknown, the situation does not improve: negative or unstable estimates may occur, and Bayesian approaches are usually adopted to avoid these issues. Blumenthal and Dahiya [82] adopted a conjugate prior Beta(a, b) for p and an improper uniform prior p(N ) ∝ 1 for N ; the posterior is proper if and only if a > 1 [84]. Blumenthal and Dahiya [82] showed that the posterior mode N m is consistent under infill asymptotics, and satisfies √ n under the outfill regime. In particular, the MSE rate is slower compared to O(1) as in Theorem 4.1 when p is known.

Zero-truncated Poisson
Sampling bias can sometimes be exploited to estimate the size of a hidden set. For example, a registry may record the number of times each unit has been observed, but zero counts are not recorded. Distributional assumptions can be used to estimate the proportion of unobserved zero counts, leading to estimates of the set size. Zero-truncated counting models have been used to estimate size of hard-to-reach populations, including drug users [85,86], undocumented immigrants [87,88], criminal population [89,90], the number of infected households in an epidemic [91], and species richness in ecology [92,93].
To illustrate, let U be a set of N indistinguishable units. To each unit i ∈ U, we associate a realization of the attribute $X_i \sim \text{Poisson}(\lambda)$; the sample is $s = \{i \in U : X_i > 0\}$, and an observation on s is $\{X_i : i \in s\}$, the set of all positive counts. For one sample, the sampling mechanism is given by P( We define the infill asymptotic regime as $N_t = N$ and $n_t \to \infty$, i.e. more and more identically distributed and mutually independent realizations of $\{X_{j,i}\}_{i \in U,\, j = 1, \ldots, n_t}$ are generated, leading to the samples s t such that s . . , n t . The outfill asymptotic regime is defined as When λ is known, estimation of N reduces to the simplest binomial model as in Section 4.1, where $p = 1 - e^{-\lambda}$, and all asymptotic claims follow. When λ is unknown, Stuart et al. [94] suggested using the MME of binomial N where $\hat{p} = 1 - e^{-\hat\lambda}$, leading to $\hat{N} = |s|/(1 - e^{-\hat\lambda})$. This estimate is unbiased if λ is known, and negatively biased by Jensen's inequality if an unbiased estimator $\hat\lambda$ is used for λ.
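A minimal sketch of the plug-in estimate |s|/(1 − e^{−λ̂}) follows. How λ̂ is obtained is not specified above; the sketch assumes it is found by matching the mean of the positive counts to the zero-truncated Poisson mean λ/(1 − e^{−λ}), solved by bisection. All function names are hypothetical.

```python
import math
import random


def solve_lambda(xbar, iterations=200):
    """Solve lam / (1 - exp(-lam)) = xbar by bisection; requires xbar > 1."""
    if xbar <= 1:
        raise ValueError("mean of positive counts must exceed 1")
    lo, hi = 1e-9, xbar  # the truncated mean exceeds lam, so the root lies below xbar
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < xbar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def estimate_size(positive_counts):
    """Plug-in estimate |s| / (1 - exp(-lam_hat)) from the observed positive counts."""
    lam_hat = solve_lambda(sum(positive_counts) / len(positive_counts))
    return len(positive_counts) / -math.expm1(-lam_hat)


def poisson_draw(rng, lam):
    """Poisson(lam) draw by multiplying uniforms; adequate for small lam."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k
```

In a simulation, one would draw N counts with `poisson_draw`, discard the zeros, and pass the remaining counts to `estimate_size`; only the positive counts enter the estimate, mirroring the registry setting in which zeros are never recorded.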

Waiting times
Sometimes the state of a hidden unit may change, thereby making it known to an observer. For example, terrorist plots may change state from "hidden" to "executed", making them observable by intelligence agents [45]. The temporal pattern of such state changes may give insight into the number of hidden units. Properties of waiting times to an event have been exploited to estimate the number of units in studies of terrorism, crime, and estimation of epidemiological risk population sizes [45,95,96,97].
Suppose U is a set of N hidden units in existence at time 0, each of which is at risk of "failure" at some future time. To each i ∈ U, associate a failure time $T_i \sim \text{Exponential}(\lambda)$, and suppose failure times are observed up to some finite observation time T > 0. A sample is the set of units that have failed by the end of study, $s = \{i \in U : T_i < T\}$ with |s| = n, and an observation on s is $\{T_i : i \in s\}$. With repeated sampling, a new observation is independent of all previous observations, taken after all units are set to be "at risk" over again. We consider the finite-population regime in which T → ∞ so that all failures are observed, the infill regime in which T and N are fixed with the number of repeated observations $k_t \to \infty$, and the outfill regime in which $T_t, N_t \to \infty$ with $T_t/N_t \to c > 0$. Figure 4 illustrates each regime under the waiting time model.
Figure 4: Illustration of the waiting time model. The observed event times are subject to right censoring at t = T, that is, events that occur before T are observed. The finite-population regime is that T → ∞ so that all events are observed. Infill asymptotics amounts to generating different realizations of the failure times. Under the outfill regime, T and the total number of units N both increase toward infinity.
Let $\Delta_i := T_i - T_{i-1}$ be the waiting time between the (i − 1)th and ith failure. The sampling mechanism is given by which gives rise to the likelihood $L(t_1, \ldots, t_n; N)$. Alternatively, if we ignore the timing of events, the observed number of events can be characterized by a binomial model $P(n \mid N, \lambda) = \binom{N}{n}(1 - e^{-\lambda T})^{n} e^{-\lambda T(N-n)}$, which yields $L_2(n; N)$. Maximizing L and $L_2$ leads to two estimates of N, $\hat{N}_{MLE}$ and $\tilde{N}_{MLE}$. It is easy to verify that $\partial \log L/\partial N = \partial \log L_2/\partial N$, so $\hat{N}_{MLE}$ and $\tilde{N}_{MLE}$ are identical, and the timing of events does not contain more information about N than the total number of events.
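The equality of the two score functions can be checked directly. The sketch below assumes the standard joint density of the ordered failure times for N i.i.d. Exponential(λ) lifetimes censored at T, which may differ in presentation from the likelihood intended above; $t_1 < \cdots < t_n$ denote the observed failure times.

```latex
\begin{align*}
L(t_1,\ldots,t_n; N)
  &= \frac{N!}{(N-n)!}\,\lambda^{n}
     \exp\Big\{-\lambda\sum_{i=1}^{n} t_i\Big\}\, e^{-\lambda T (N-n)}, \\
L_2(n; N)
  &= \binom{N}{n}\,(1-e^{-\lambda T})^{n}\, e^{-\lambda T (N-n)}, \\
\frac{L(t_1,\ldots,t_n; N)}{L_2(n; N)}
  &= \frac{n!\,\lambda^{n}\exp\{-\lambda\sum_{i} t_i\}}{(1-e^{-\lambda T})^{n}}
  \quad\text{(free of } N\text{)}, \\
\frac{L_2(n; N)}{L_2(n; N-1)}
  &= \frac{N}{N-n}\,e^{-\lambda T} \ge 1
  \iff N \le \frac{n}{1-e^{-\lambda T}}.
\end{align*}
```

Because the ratio of the two likelihoods does not involve N, they share the same maximizer. For known λ, the last line indicates that the integer maximizer is the integer part of $n/(1-e^{-\lambda T})$, which is the binomial-type estimate with $p = 1 - e^{-\lambda T}$.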
The asymptotic behavior of N M LE follows from the discussion in Section 4.1: when λ is known, N M LE is consistent under finite-population and infill regimes. Under the outfill regime, it is unbiased and asymptotically normal with variance O(1).

The multiplier estimator
The multiplier method, also called the method of benchmark multiplier (MBM), can be used to estimate the size of a hidden population if the number of hidden units with a certain trait, and an estimate of the overall prevalence of that trait in the hidden population, are available. Often the prevalence of the trait is estimated through expert opinion, historical data, or from a separate sample [23,98,99].
Let U be a hidden set of units of size N. To each unit i in U we associate a binary trait $Y_i \sim \text{Bernoulli}(p)$. The first sample is $s_1 = \{i \in U : Y_i = 1\}$, and the benchmark is $X = \sum_{i \in U} Y_i = |s_1|$, which follows Binomial(N, p). If the trait prevalence p is known, the results in Section 4.1 apply. Alternatively, suppose p is estimated from another random sample, $s_2 \subseteq U$, which is independent of $s_1$. We assume $s_2$ is a uniformly random draw from U with deterministic size n, among which $m = |s_1 \cap s_2|$ units have a positive trait. An observation on $(s_1, s_2)$ consists of the benchmark X and m. Then the proportion m/n gives the multiplier, which is an estimate of p. The count m follows a hypergeometric distribution, and the mechanism of generating the observations can be defined as $P(m \mid s_1, n) = \binom{x}{m}\binom{N-x}{n-m}/\binom{N}{n}$, where x is the observed value of X. An MME for N is $\hat{N}_{MBM} = xn/m$, often called the multiplier estimator of N. When more than one sample pair $(s_1, s_2)$ is drawn, note that, unlike in the binomial setting, the binary traits (like HIV status or death) of units will not change. Therefore, no new realizations of Y will be generated, and $s_1$ is always fixed under the infill regime. We consider the finite-population regime that $n = |s_2| \to N$. The infill regime is that x, n, N are fixed and $k_t \to \infty$, where $k_t$ is the number of sample pairs, $(s_{1,1}, s_{1,2}), \ldots, (s_{k_t,1}, s_{k_t,2})$. The outfill regime is that $x_t, n_t, N_t \to \infty$ with $x_t/N_t \to c_1 \in (0, 1)$, $n_t/N_t \to c_2 \in (0, 1)$, with only one draw of $(s_1, s_2)$.
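The multiplier estimate xn/m and its behavior when only the second sample is redrawn can be sketched as follows. The simulation holds the benchmark set fixed, as in the infill regime described above; the population size, benchmark count, and function names are illustrative assumptions.

```python
import random
from statistics import mean


def multiplier_estimate(x, n, m):
    """Multiplier (benchmark) estimate x * n / m; undefined when m == 0."""
    if m == 0:
        raise ValueError("no trait-positive units in the second sample")
    return x * n / m


def simulate_multiplier(N=1000, x=300, n=200, reps=4000, seed=0):
    """Hold the benchmark set s1 = {0..x-1} fixed and redraw s2 `reps` times."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        s2 = rng.sample(range(N), n)              # uniform draw without replacement
        m = sum(1 for unit in s2 if unit < x)     # overlap |s1 ∩ s2|
        estimates.append(multiplier_estimate(x, n, m))
    return mean(estimates)
```

Because m appears in the denominator, the average returned by `simulate_multiplier` does not settle exactly at N as the number of replications grows, so averaging over repeated draws of the second sample does not by itself remove the bias.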
Since m follows a hypergeometric distribution, E(m) = n · x/N, and $\hat{N}_{MBM}$ is positively biased by Jensen's inequality. The multiplier estimator has essentially the same properties as the Lincoln-Petersen capture-recapture estimator in Section 5.1.1, where a detailed discussion is provided. We have the following asymptotic results:

The network scale-up method
Estimating the size of a hidden network or graph is an important problem in sociology, epidemiology, computer science, and intelligence applications [5,48,52,54,55,100,101]. A subgraph of a larger graph may contain information about the size of the larger graph [55,102,103]. The network scale-up method (NSUM) [5] provides an estimate for the size of a hidden population by making use of network information from a sub-sample of individuals. We now introduce exchangeable random graph models (EGM) [104] that both scenarios are based on, or related to. Suppose each vertex i ∈ V is associated with some random attribute $Y_i$, i.i.d. across vertices. The probability that units i and j are connected is $P(E_{ij} = 1 \mid Y_i, Y_j) = \omega(Y_i, Y_j)$, where ω is a function from $[0, 1]^2$ to [0, 1]. We then write $G_V \sim EGM(\omega, |V|)$. EGM includes Erdős-Rényi [105] and stochastic block models [106] as special cases.

Sampling from the general population
We consider sampling uniformly at random from the general population V \ U with a fixed sample size |s| = n. The sampling mechanism is $P(s \mid |s| = n) = 1/\binom{M-N}{n}$. We consider a distribution of $G_V$ that is slightly more general than EGMs in that we require the joint distribution of $(Y_i, Y_j)$ to be i.i.d. for each combination of i ∈ V \ U and j ∈ V, instead of assuming i.i.d. $Y_i$'s. This is a generalization of the commonly assumed Erdős-Rényi distribution for NSUM methods. Let π = By canceling out π we have the following MME: In (

Sampling from the hidden population
Consider a random sample s ⊆ U where $G_U \sim EGM(\omega, N)$. We observe the nodes i ∈ s, as well as the network degrees $d_i^s := \sum_{j \in s} \mathbb{1}\{E_{ij} = 1\}$ and $d_i^U := \sum_{j \in U} \mathbb{1}\{E_{ij} = 1\}$, for each individual i ∈ s.
canceling out π yields the MME which is often simplified to Chen et al. [107] investigated the behavior of N with finite-sample as well as with large n, but did not specify the relationship between N and n. In our setting, the finite-population regime is n → N with N fixed. The infill regime is that n, N are fixed and the sampling procedure is infinitely repeated. The outfill asymptotic regime is that n t , N t → ∞ with n t /N t → c ∈ (0, 1).

Estimating a total with unequal sampling probabilities
A generalization of binomial models allows for heterogeneity in the inclusion, or "success", probabilities p, that is, sampling that is not uniformly at random. Horvitz and Thompson [50] proposed unbiased estimators for population means and totals under the setting of sampling without replacement from a finite population, where the selection probabilities can be unequal. The Horvitz-Thompson (HT) estimator for the population total is $\hat{N} = \sum_{i \in s} 1/p_i$, where $p_i = E(\mathbb{1}\{i \in s\})$ is the probability that unit i ∈ U is sampled in s. The estimator $\hat{N}$ is unbiased for the total population size N. This estimator and its variants have been applied to the estimation of animal abundance [108] and other fields. We consider a deterministic sample size n. Then the variance of $\hat{N}$ is [50] $Var(\hat{N}) = \sum_{i \in U}\sum_{j \in U} (p_{ij} - p_i p_j)/(p_i p_j)$, where $p_{ij}$ is the joint probability that units i and j are both in the sampled set s, and $p_{ii} = p_i$ [50]. The finite-population regime amounts to letting $p_i \to 1$ for any i. Under the infill regime, $p_i$, $p_{ij}$, N are fixed and the number of repeated samples $k_t \to \infty$. Under the outfill regime, N and n both increase to infinity such that n/N → c ∈ (0, 1). Figure 6 shows the non-uniform sampling mechanism under each regime.
Specifically, we consider the following setting to illustrate the asymptotic behavior of the HT estimator. Suppose U consists of H clusters, where the hth cluster has N h units. We assume that H is known in advance, while N h is observed only if a unit from cluster h is sampled. In each sample, a total of n units are sampled from U by the following procedure: first a cluster h is drawn uniformly at random each with probability 1/H. Then one unit is drawn from the N h units in that cluster, also uniformly at random, without replacement. An observation on sample s consists of the units in s, their group membership, and the sizes of groups that they belong to.
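The two-stage design just described can be simulated directly. The sketch below assumes every cluster holds more than n units, so a chosen cluster is never exhausted, and it assumes the inclusion probability of a unit in cluster h is $p_i = n/(H N_h)$, which follows from the expected number of draws n/H landing in each cluster. Function names and the example cluster sizes are hypothetical.

```python
import random
from statistics import mean


def draw_sample(rng, cluster_sizes, n):
    """Draw n units: pick a cluster uniformly, then an unsampled unit within it.

    Returns the cluster index of each sampled unit.
    """
    remaining = [list(range(size)) for size in cluster_sizes]
    clusters = []
    for _ in range(n):
        h = rng.randrange(len(cluster_sizes))
        remaining[h].pop(rng.randrange(len(remaining[h])))  # without replacement
        clusters.append(h)
    return clusters


def ht_estimate(sampled_clusters, cluster_sizes):
    """Horvitz-Thompson total sum(1 / p_i), assuming p_i = n / (H * N_h).

    Only the sizes of clusters that appear in the sample are read.
    """
    H, n = len(cluster_sizes), len(sampled_clusters)
    return sum(H * cluster_sizes[h] / n for h in sampled_clusters)
```

Averaging `ht_estimate(draw_sample(rng, sizes, n), sizes)` over many replications with, say, `sizes = [20, 50, 130]` and `n = 10` should give a value close to the total of 200, while individual estimates vary widely because small and large clusters are drawn equally often.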
We assume that $\min_{h \in [H]} N_h > n$. The marginal probability that unit i in cluster h is sampled is and the joint probability that two units i, j are sampled from clusters h and l (h = l) is When there are repeated observations, we assume they follow the same design and are mutually independent. In this setting, the outfill regime is defined such that each cluster in the original population is replicated and appears t times in $U_t$. The cluster sizes are fixed at $N_h^{(t)} = N_h$ and the number of clusters increases as $H_t = tH$. The total $N = \sum_{h=1}^{H} N_h$ is fixed and the estimand is $N_t = tN$. The sample size satisfies $n_t/N_t \to c \in (0, 1)$. We then have:

Capture-recapture experiments
Capture-recapture (CRC) refers to a broad class of methods to estimate the size of hidden populations for which random sampling is possible [35,57,58,109,110,111]. Estimation of the population size is based on the overlap between two or more random samples [8,15,31,32]. While a wide variety of CRC estimators have been developed [109,110,112,113,114], we focus here on the two- and k-sample CRC estimators with homogeneity within a closed population.

Two-sample estimation
We first consider the common case of two-sample CRC. Let U be a hidden finite set of size N, where each unit i ∈ U has binary attributes $(X_i^1, X_i^2)$, which are all (0, 0) in the beginning. We draw a sample $s_1 \subseteq U$ with size $n_1$ from U, and set $X_i^1 = 1$ for all $i \in s_1$. Then a second sample $s_2$ with size $n_2$ is drawn, independent from $s_1$ and uniformly at random, and we set $X_i^2 = 1$ for all $i \in s_2$. We observe $(X_i^1, X_i^2)_{i \in s_1 \cup s_2}$, and let $m = \sum_{i \in U} \mathbb{1}\{(X_i^1, X_i^2) = (1, 1)\}$. Similar to the MBM, m follows a hypergeometric distribution conditional on N, $n_1$ and $n_2$. The MME, $\hat{N}_L = n_1 n_2/m$, is also known as the Lincoln-Petersen estimator [115,116].
We consider the finite-population regime with $n_2 \to N$. The infill regime is that N, $n_1$, $n_2$ are fixed and repeated sample pairs $\{s_1^{(t)}, s_2^{(t)}\}$ are drawn with t → ∞. Note that in contrast to the MBM, the first sample $s_1$ can be generated differently for repeated sampling. The outfill regime is given by $n_1, n_2, N \to \infty$ with $n_1/N \to c_1 \in (0, 1)$ and $n_2/N \to c_2 \in (0, 1)$.
Previous results exist on the bounds or estimates of biases and variances. These were implicitly based on asymptotic approximations: Chapman [56] showed a lower bound for the bias under outfill, and bounded the variance as under asymptotic approximation that was satisfied by the outfill regime. Though these no longer hold in the finite-sample setting, it has been demonstrated through simulation that $\hat{N}_L$ has a considerable bias under a range of settings. A less biased estimator, $\hat{N}_C = (n_1+1)(n_2+1)/(m+1) - 1$, was proposed [56], with bias for any $n_1, n_2, N$, and variance under outfill [56], where ∼ means the difference between two quantities decays to 0. We have the following asymptotic result: Under the finite-population regime, $\hat{N}_L$ and $\hat{N}_C$ are consistent. Under infill asymptotics, $\hat{N}_L$ is positively biased and has MSE O(1) for at least a range of values of $n_1, n_2, N$. $\hat{N}_C$ is negatively biased, but the bias is within 1 if $n_1 + n_2 + 1 < N/2$ and $n_1 n_2/N > \log N$ [56]. Under the outfill regime, $\hat{N}_L$ has bias at least O(1) and variance at least O(N). $\hat{N}_C$ is asymptotically unbiased. Further, Chapman [56] showed that no estimator can be unbiased for all possible values of N, $n_1$ and $n_2$.
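The two estimators can be compared in a small simulation. The sketch below assumes the commonly used Chapman form $(n_1+1)(n_2+1)/(m+1) - 1$ and draws both samples uniformly without replacement; the population and sample sizes, the seed, and the function names are illustrative.

```python
import random
from statistics import mean


def lincoln_petersen(n1, n2, m):
    """Lincoln-Petersen estimate n1 * n2 / m; undefined when m == 0."""
    if m == 0:
        raise ValueError("no recaptures")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's modification, defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def simulate_two_sample(N=500, n1=100, n2=100, reps=4000, seed=0):
    """Return the Monte Carlo means of both estimators over `reps` sample pairs."""
    rng = random.Random(seed)
    lp, ch = [], []
    for _ in range(reps):
        s1 = set(rng.sample(range(N), n1))
        s2 = rng.sample(range(N), n2)
        m = sum(1 for unit in s2 if unit in s1)   # number of recaptures
        lp.append(lincoln_petersen(n1, n2, m))
        ch.append(chapman(n1, n2, m))
    return mean(lp), mean(ch)
```

For any m ≥ 1 the difference between the two estimates equals $(n_1 - m)(n_2 - m)/(m(m+1)) \ge 0$, so the Lincoln-Petersen value is never below the Chapman value on the same data, and the simulated mean of the former sits above that of the latter.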

k-sample estimation
We now consider the generalized setting of k samples. In this scenario, we draw k samples $s_1, \ldots, s_k \subseteq U$ with deterministic sizes $n_1, \ldots, n_k$ respectively. We assume the probability $p_j := n_j/N$ of being observed in the jth sample is the same for each unit for j = 1, . . . , k. In each sample (say $s_j$), we give the observed units a label that is different for different j's, and record the capture history H j,i = (I From the contingency table we have $m_i$, the number of already marked individuals in $s_i$, and $M_i$, the total number of marked individuals in U before $s_i$ is drawn. The sampling scheme then follows a generalized hypergeometric distribution: Maximizing the likelihood (10) gives the MLE of N as the solution of $\prod_{i=1}^{k}(1 - n_i/N) = 1 - r/N$, where r denotes the number of distinct units observed across the k samples; the solution is unique, finite and greater than r if $s_1 \cap \cdots \cap s_k$ is non-empty and $|s_i| < r$ for all i ≤ k [57]. We restrict our interest to this case only. Setting k = 2 recovers the Lincoln-Petersen estimator $\hat{N}_L$. Since the finite-population and infill regimes for the two- and k-sample cases are similar in essence, we mainly discuss outfill asymptotics in this setting: for any finite k, we have $N, n_1, \ldots, n_k \to \infty$ with $n_i/N \to c_i \in (0, 1)$ for i = 1, . . . , k, and $k_t$ may be finite or going to infinity. We assume the $c_i$'s are bounded away from 0 and 1. Under outfill asymptotics with finite k, following from the delta method, the bias of the MLE is approximated by which is O(1), and the asymptotic variance is O(N), approximated by [57] Under outfill asymptotics with infinite sampling repetitions, we assume $\inf_{i \in [k]} p_i > 0$. Then the magnitude of bias is bounded above by N − E[r], and hence by N k . Therefore, as long as k is increasing such that Seber [35] investigated the mean Petersen estimate from k-sample CRC experiments: at stage i of the sampling process, regard $s_i$ as the second capture in the two-sample case, and $\cup_{j<i} s_j$ as the first capture.
The Chapman estimate is then calculated as at each stage, and N is estimated as the average N = k i=2 N i /(k − 1), which is asymptotically unbiased under the outfill regmie with any k. They provided a conservative estimate for the variance, Var N ≈ Var N i /(k − 1) 2 [35].
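Both $k$-sample estimators can be sketched in a few lines. The likelihood equation is not reproduced in the text above, so the sketch assumes the standard form $1 - r/N = \prod_{i=1}^{k} (1 - n_i/N)$, where $r$ is taken to be the number of distinct units observed. For $k = 2$ this reduces to $n_1 n_2 / m$. The stage-wise Chapman estimate is likewise assumed to be $(M_i + 1)(n_i + 1)/(m_i + 1) - 1$. All function names are hypothetical.

```python
import random


def k_sample_mle(sizes, r, tol=1e-9):
    """Solve 1 - r/N = prod_i (1 - n_i/N) for N > r by bisection (assumed equation)."""
    if sum(sizes) <= r:
        raise ValueError("no recaptures: the likelihood equation has no finite root")

    def f(N):
        prod = 1.0
        for n in sizes:
            prod *= 1.0 - n / N
        return (1.0 - r / N) - prod

    lo, hi = float(r), 2.0 * r
    while f(hi) <= 0.0:  # f is negative at N = r and positive for large N
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mean_petersen(samples):
    """Average of stage-wise Chapman estimates, pooling earlier samples as the first capture."""
    marked = set(samples[0])
    estimates = []
    for s in samples[1:]:
        s = set(s)
        M, n, m = len(marked), len(s), len(marked & s)
        estimates.append((M + 1) * (n + 1) / (m + 1) - 1)
        marked |= s
    return sum(estimates) / len(estimates)


def simulate_samples(N, sizes, seed=0):
    """Draw independent simple random samples of the given sizes from {0, ..., N-1}."""
    rng = random.Random(seed)
    return [rng.sample(range(N), n) for n in sizes]


samples = simulate_samples(2000, [400] * 5)
r = len(set().union(*samples))
print(k_sample_mle([400] * 5, r), mean_petersen(samples))
```

Bisection is used here only because the text states that the root is unique; any one-dimensional root finder would serve equally well.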

Discussion
Several features determine researchers' ability to learn about the size of a hidden set. First, the structure of the set -labeled units, ordering of the labels, or relational (network/graph) information -can permit researchers to learn about the number of remaining units when a subset is observed. Second, a feasible probabilistic query mechanism -random sampling, or observation conditional on a unit trait or attribute -must be available. Third, a statistical estimator that enjoys desirable statistical properties must be chosen. Some of these features may be under the control of researchers, while others may be intrinsic to the problem.
How should empirical researchers evaluate the statistical properties of estimators, design a study or choose a sample size? Many of these tasks are based on asymptotic arguments, and statistical claims about the large-sample performance of hidden set size estimators depend on specification of an appropriate asymptotic (or even non-asymptotic) regime. It is crucial to identify how the sample size increases, especially in relation to the target population, when asymptotic approximation or comparison is involved in population size estimation tasks. When designing a study, this may include determining the minimum sample size that achieves a desired standard error [117,118], or selecting an "optimal" sampling strategy (e.g. a single larger sample versus repeated smaller samples). In data analysis, this may include establishing valid approximations to biases and variances or comparing the efficiency of different statistical approaches [117,[119][120][121]. If the vast majority of the target population can be observed in one-step sampling, consistency under the trivial finite-population regime may be a goal when developing estimators. If the total population is fixed, and arbitrarily many repeated i.i.d. samples can be obtained, then consistency under infill may justify the use of a statistical approach. If instead only one-time or finite-time sampling is permitted, in which the sample size is believed to reflect a proportion of the potentially large population, performance of estimators under outfill may be of more interest. We have shown that different asymptotic regimes can lead to dramatically different statistical properties. Some seemingly sensible estimators are inconsistent with different rates of MSE, and asymptotic claims for population size estimators under one regime may be of limited value for analyzing the general situation.
While we have discussed many of the most popular settings and methods for estimating the size of a hidden set, there are several other settings we have not covered. Respondent-driven sampling (RDS), snowball sampling and link-tracing sampling generate samples from hidden networks, and modeling the stochastic process underlying such sampling mechanisms helps in learning about the hidden population [2,97,122,123]. There is a large literature on CRC beyond what we have covered here. For example, there are approaches for CRC with an open population, with immigration, emigration, birth, and death [112,113] or with heterogeneity in capture probabilities [109,110]. CRC is also possible using data from network sampling designs [114]. We have also not discussed species number estimation [124], "count distinct" and streaming estimation problems [125][126][127], and genetic methods for population size estimation [128,129]. In addition, we have not addressed the issue of entity resolution, or record de-duplication [47]. The results presented in this paper suggest that researchers employing methods for estimating the size of a hidden set should evaluate the performance of estimators under deliberately specified asymptotic assumptions.

A Asymptotic normality and consistency
We first introduce a simple lemma that helps to prove consistency or inconsistency based on asymptotic normality.
Proof. Assume $a_t \to \infty$ as $t \to \infty$. Then for any $\varepsilon > 0$, there exists $T > 0$ such that $|E(X_t - \nu_t)| < \varepsilon/2$ for any $t > T$. For such $t$, since $|X_t - \nu_t| \le |X_t - \nu_t - E(X_t - \nu_t)| + |E(X_t - \nu_t)|$, applying the union bound and Chebyshev's inequality bounds $P(|X_t - \nu_t| > \varepsilon)$ by $4 \operatorname{Var}(X_t - \nu_t)/\varepsilon^2$.
If $a_t$ does not go to infinity, then for some $m > 0$ and any $M > 0$, there exists $t > M$ such that $a_t < m$. Pick $\varepsilon > 0$, then there exists $T > 0$ such that for $Y \sim N(0, \sigma^2)$ for all $t > T$. In particular, for any $M > 0$, there exists $t_0 > \max\{T, M\}$ such that $a_{t_0} < m$ holds. Then indicating that $\{X_t - \nu_t\}$ does not converge to 0.

B Non-uniform draws from a continuous interval
We showed in Section 3 that some estimators of $\mu(U)$ are inconsistent under the outfill asymptotic regime when draws from a set $U$ of finite size $\mu(U)$ are uniform. For those estimators, $X_{(n)}$, or a scaled version of it, does not approach the true maximum $\theta$ or $N$ fast enough to ensure consistency. Does the same result hold when the distribution of draws from an interval $U$ is non-uniform? Consider the sequence of densities indexed by $u$, $f_\theta^{(u)}(x) = (p_u + 1)\, x^{p_u} / \theta^{p_u + 1}$, for $p_u > 0$ with $p_u \to \infty$ as $u \to \infty$, which has support on $[0, \theta]$ and increases as $x$ approaches $\theta$. Setting $p = 0$ recovers the uniform distribution. As $u$ increases, $f_\theta^{(u)}(x)$ becomes more peaked at $\theta$, and $\lim_{u \to \infty} f_\theta^{(u)}(x) = \delta(\theta - x)$, a point mass at $\theta$. To simplify notation, we omit the index $u$ hereinafter, keeping in mind that all distributions $f_\theta(x)$ and the parameters $p$ are indexed by $u$.
Additionally, we investigate the consistency of $\hat\theta_{MLE}$ when the density from which samples are drawn increases at an even faster rate as $x$ approaches $\theta$. For such a density, the MLE is $\hat\theta_{MLE} = X_{(n)}$. Its MSE under the outfill asymptotic regime, (12), approaches 0 because the first two terms go to 0 and the third term also vanishes. The preceding argument is summarized in the following theorem.
Theorem. Under outfill asymptotics, an infinite-degree polynomial or exponential rate of increase near $\theta$ leads to a consistent MLE for $\theta$. However, no finite-degree polynomial density leads to a consistent MLE.
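The contrast in the theorem can be illustrated numerically under two assumptions that are not confirmed by the text: that the polynomial density is $f_\theta(x) = (p + 1) x^p / \theta^{p+1}$ on $[0, \theta]$, and that outfill means $\theta = n/c$ for a constant $c$. Under that density, $X_{(n)}/\theta$ follows a Beta$(a, 1)$ distribution with $a = n(p + 1)$, which gives $E[(\theta - X_{(n)})^2] = 2\theta^2 / ((a + 1)(a + 2))$. The function name below is hypothetical.

```python
def mle_mse(n, c, p):
    """MSE of X_(n) as an estimator of theta under the assumed power density.

    Assumes theta = n / c (outfill) and density (p + 1) x^p / theta^(p + 1).
    """
    theta = n / c
    a = n * (p + 1)  # X_(n) / theta ~ Beta(a, 1)
    return 2.0 * theta**2 / ((a + 1) * (a + 2))


for n in (10**2, 10**4, 10**6):
    fixed_degree = mle_mse(n, c=0.5, p=2)    # degree held fixed as n grows
    growing_degree = mle_mse(n, c=0.5, p=n)  # degree grows with n
    print(n, fixed_degree, growing_degree)
```

With $p$ fixed, the MSE settles near $2 / (c^2 (p + 1)^2)$ rather than going to 0. When $p$ grows with $n$, the MSE decays, which is consistent with the theorem as stated.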

C Proof of theorems
Proof of Theorem 3.1. Rates of biases and variances of $\hat N_G$, $\hat N_2$ and $\hat N_3$ follow from the non-asymptotic claims of biases and variances given by Goodman [43], as stated in the main text. Consistency under the finite-population regime follows directly from setting $n = N$ in each of the estimators. Consistency under infill of $\hat N_G$, $\hat N_2$, $\hat N_3$ follows from the unbiasedness of these estimators, while that of $\hat N_{MLE}$ follows from the fact that $\max_{i \in [k_t]} X_{i(n)} \xrightarrow{P} N$ as $k_t \to \infty$.
We show the inconsistency of $\hat N_G$ (when the initial label is 1) and $\hat N_3$ (when the initial label is unknown) under outfill. The results can be derived similarly for $\hat N_{MLE}$ and $\hat N_2$, since they are shifted and scaled versions of $\hat N_G$, and the corresponding proofs of inconsistency also amount to bounding the probability that $X_{(n)}$ equals a specific value (as done below).
For $\hat N_G$, recall that $\hat N_G = X_{(n)}(1 + 1/n) - 1$, where $\hat N_G$, $n$ and $N$ are implicitly indexed by $t$ as defined under outfill asymptotics. However, we omit the subscript for simpler notation. For any $0 < \varepsilon < 1/c - 1$, there exists $T_0 \in \mathbb{N}^+$ such that for any $t > T_0$, where $n, N$ are indexed by $t$. Then when $t > T_0$,
Then we show the inconsistency of $\hat N_3$ with unknown initial number $u$. Let $Y = X - u$, then $Y_{(n)} - Y_{(1)}$ and $X_{(n)} - X_{(1)}$ follow the same distribution. Likewise, for any $\varepsilon > 0$, there exists $T_1$ such that for any $t > T_1$. Then
Since $\hat N_G$ and $\hat N_3$ take discrete values, (14) and (16) imply the inconsistency of $\hat N_G$ and $\hat N_3$.
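The role of the event $X_{(n)} = N$ in this argument can be made concrete with an exact calculation. This is a minimal sketch, assuming Goodman's estimator with initial label 1 takes the form $\hat N_G = X_{(n)}(1 + 1/n) - 1$ and that labels are drawn without replacement from $\{1, \ldots, N\}$, so that $P(X_{(n)} = x) = \binom{x - 1}{n - 1} / \binom{N}{n}$. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_goodman_misses(N, n, eps):
    """Exact P(|N_G - N| > eps), with N_G = X_(n) * (1 + 1/n) - 1 (assumed form)."""
    total = comb(N, n)
    prob = Fraction(0)
    for x in range(n, N + 1):  # possible values of the sample maximum
        estimate = Fraction(x * (n + 1), n) - 1
        if abs(estimate - N) > eps:
            prob += Fraction(comb(x - 1, n - 1), total)
    return prob


# Outfill with c = n / N = 1/2 and eps = 1/2 < 1/c - 1 = 1.
for N in (10, 100, 1000):
    print(N, float(prob_goodman_misses(N, N // 2, Fraction(1, 2))))
```

With $n = N/2$, the event $X_{(n)} = N$ has probability $n/N = 1/2$ and produces an error of exactly $N/n - 1 = 1$. The miss probability therefore stays at or above $1/2$ however large $N$ becomes.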
Proof of Theorem 4.2. The MBM estimator and the Lincoln-Petersen estimator take the same form $n_1 n_2/m$, where $m$ follows a hypergeometric distribution with $n_2$ "draws" and two categories of sizes $n_1$ and $N - n_1$. The result then follows from the proof of the inconsistency of $\hat N_L$ in Theorem 5.1.
Proof of Theorem 4.3. We first show the conditional distribution of d U i given d V i . For any i, since Since we impose no assumption on the distribution of network degrees within U , even when we sample all units in V \ U , we cannot recover N deterministically. (For example, when there exists j ∈ U such that d where d s i /2 ∼ Binomial n 2 , π , and d V \s i ∼ Binomial(n(M − n), π). By the central limit theorem and Slutsky's theorem, Multiply (18)(19) by (n − 1)/ n(n − 1) and (M − n)/n respectively and by Slutsky's theorem we have Since d s i and d V \s i are mutually independent, combining (17), (20) and (21) yields Also, Divide both sides by d V i /nM , and Slutsky's theorem yields which can be rewritten as Learning about the asymptotic behavior of N − N requires characterizing the second term on the left-hand side of (23). Define a sequence of random variables and functions where n, M are indexed by t, and a function g(x) = 1 − π/x. Then since g(π) = 0. The first term in (24) and the second term in (24) satisfies by the delta method. Therefore the quantity in (24) by Slutsky's theorem. Combining (23) and (24), we have where σ 2 is bounded between Therefore, N is asymptotically normal with bias c 1 and variance O(1), and following from Lemma 1, inconsistent under the outfill regime.
Proof of Theorem 4.4. We derive the asymptotic normal distribution of N under the outfill regime that n/N → c ∈ (0, 1). Note that by the central limit theorem. Therefore, by Slutsky's theorem, Multiply both sides by n(N − n)/(n − 1) and Slutsky's theorem yields which can be rewritten as We need to characterize the second term on the left-hand side of (27) in order to derive the asymptotic distribution of N . By the central limit theorem, and therefore Define where n is indexed by t. Also define h(y) = 1 − π/y. Then since h(π) = 0. The first term in (29) is and for the second term in (29), by the delta method. Hence the quantity on the left-hand side of (29) satisfies Combine (27) and (30) 1) Under the infill regime, n t = n, H t = H and N (t) h = N h for any t, so (31) is O(1). The number of samples k t goes to infinity as t increases, and under k t -time sampling, Var N (t) = O( 1 kt ). Therefore N is MSE consistent, and also consistent, under infill asymptotics. where i.e. the variance of N (t) is O(t), which goes to infinity as t increases. It follows from Lemma 1 that the HT estimator is inconsistent under outfill asymptotics.
Proof of Theorem 5.1. Finite-sample claims follow from Chapman [56]. Setting $n_2 = N$ leads to consistency under the finite-population regime. Behavior under infill asymptotics follows from the biases of $\hat N_L$ and $\hat N_C$.
We show the inconsistency of $\hat N_L$ and $\hat N_C$ under the special outfill regime in which $n_1 = c_1 N$ and $n_2 = c_2 N$ with $N$ increasing. Here $n_1$, $n_2$ and $N$ are indexed by $t$, but the subscripts are omitted for simplicity.