Properties of Estimators in Exponential Family Settings with Observation-based Stopping Rules.

Often, the sample size is not fixed by design. A key example is a sequential trial with a stopping rule, where stopping is based on what has been observed at an interim look. While such designs are used for reasons of time and cost efficiency, and the corresponding hypothesis testing theory is well developed, estimation following a sequential trial remains a challenging and still controversial problem. Progress has been made in the literature, predominantly for normal outcomes and/or for deterministic stopping rules. Here, we place these settings in the broader context of outcomes following an exponential family distribution, with a stochastic stopping rule that includes a deterministic rule and a completely random sample size as special cases. It is shown that the estimation problem is usually simpler than often thought. In particular, it is established that the ordinary sample average is a very sensible choice, contrary to commonly encountered statements. We study (1) the so-called incompleteness property of the sufficient statistics, (2) a general class of linear estimators, and (3) joint and conditional likelihood estimation. Apart from the general exponential family setting, normal and binary outcomes are considered as key examples. While our results hold for a general number of looks, for ease of exposition we focus on the simple yet generic setting of two possible sample sizes, N = n or N = 2n.


Introduction
It is commonly known that statistical designs where the sample size is random pose challenges beyond the fixed sample-size case and that many findings are counter-intuitive.
While this has been documented for situations where the sample size depends on the data, such as in sequential trials (Siegmund, 1978; Hughes and Pocock, 1988; Emerson and Fleming, 1990) or incomplete data (Little and Rubin, 2002), it is less widely known that such counterintuitive results apply even when the sample size is completely random (Grambsch, 1983; Barndorff-Nielsen and Cox, 1984), in the sense that both the collected and uncollected data have no relationship to the stochastic mechanism governing the sample size. Liu and Hall (1999) provided a general theory for sequential studies, where the decision to either stop or continue the study at every interim look depends deterministically on the data collected up to that point. Molenberghs et al (2012) generalized their results to the setting where the sample size may depend stochastically rather than deterministically on the observed data, a general setting that contains both sequential trials and completely random sample sizes (CRSS) as special cases. We refer to these three settings together as a stochastic stopping rule. Molenberghs et al (2012) also discussed the related cases of incomplete longitudinal data, censored time-to-event data, joint modeling of survival and longitudinal data, and clustered data with random cluster sizes.
An important finding of Liu and Hall (1999) was that the commonly used sufficient statistics in deterministic stopping designs are incomplete, a property that will be defined in the next section. Molenberghs et al (2012) generalized this to stochastic stopping rules and explored the implications for linear estimators based on the sample sum, as well as for so-called marginal and conditional estimators. They found for stochastic stopping rules that the counterintuitive implications of a random sample size follow from two properties: (a) excluding the CRSS case, the sample size is non-ancillary given the sample sum; (b) the pair consisting of the accumulating sample sum and the sample size is an incomplete minimal sufficient statistic. These properties are defined in Section 2.
The work of Liu and Hall (1999) and Molenberghs et al (2012) was confined to the special case of normally distributed outcomes. Further, Molenberghs et al (2012) illustrated their developments with a random stopping rule of probit form. These specific choices allow for insightful expressions. The latter choice is, however, not necessary for deterministic stopping rules, which can be cast in the form of continuation and stopping regions or, equivalently, the boundaries between them.
Extending the results in Liu and Hall (1999), Liu et al (2006) presented a general deterministic stopping rule theory where the outcome follows a one-parameter exponential family, and also established incompleteness for this case. This implies, in particular, that there are infinitely many unbiased estimators, none with uniformly minimum variance.
Here, we show incompleteness in the exponential family case, for a stochastic stopping rule, and derive explicit results for linear estimators as well as for marginal and conditional likelihood estimators. These general findings are then further illustrated in the normal case, making the connection to Molenberghs et al (2012), and in the case where the outcomes are binary, and hence the sample sum is binomial.
Our findings are essentially as follows. The classical sample average is biased in finite samples, though asymptotically unbiased for a broad class of stopping rules. An unbiased estimator follows from the conditional likelihood, where the conditioning is on the (non-ancillary) sample size. Contrary to intuition, the conditional estimator has larger mean squared error than the ordinary sample average for sufficiently large sample sizes, the latter resulting from the joint likelihood, where 'joint' means a simultaneous model for the outcomes and the sample size. In some cases, the result holds for all sample sizes, large and small. Thus, the sample average is a valid and sensible estimator, contrary to some claims in the sequential-trial literature, for stochastic and deterministic stopping rules alike. The literature on sequential trials is indeed very large, with a relatively early review given by Whitehead (1999). Tsiatis, Rosner, and Mehta (1984) and Rosner and Tsiatis (1988) address precision estimation after group sequential trials. Emerson and Fleming (1990) propose estimators within an ordering paradigm. Much of this work is placed in a unifying framework by Liu and Hall (1999). A review can be found in Molenberghs et al (2012).
The finite-sample bias in the sample average disappears only in the CRSS case. Even then, it is not unique in that a whole class of so-called generalized sample average estimators can be defined, all of which are unbiased. This enables us to show that the ordinary sample average is only asymptotically optimal. Indeed there is no uniformly optimal unbiased estimator in finite samples for most exponential-family members; the exponential distribution is a noteworthy exception.
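To make the finite-sample bias and the CRSS exception concrete, the following small simulation sketch contrasts a probit-type stochastic stopping rule with a CRSS design. This is our own illustration; the specific values n = 10, α = 0, β = 3 are arbitrary and not taken from the developments below.

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def trial_mean(n, mu, alpha, beta, rng):
    """One two-stage trial: observe n N(mu,1) outcomes, stop with
    probability Phi(alpha + beta * K_n / n); otherwise observe n more.
    Returns the ordinary sample average over the realized sample size."""
    ys = [rng.gauss(mu, 1) for _ in range(n)]
    k = sum(ys)
    if rng.random() < phi(alpha + beta * k / n):  # N = n
        return k / n
    ys += [rng.gauss(mu, 1) for _ in range(n)]    # N = 2n
    return sum(ys) / (2 * n)

rng = random.Random(2012)
n, mu, reps = 10, 0.0, 100_000
stopped = sum(trial_mean(n, mu, 0.0, 3.0, rng) for _ in range(reps)) / reps
crss = sum(trial_mean(n, mu, 0.0, 0.0, rng) for _ in range(reps)) / reps
print(round(stopped, 3), round(crss, 3))
```

With a data-dependent rule (β ≠ 0), the Monte Carlo mean drifts above µ = 0; the CRSS run (β = 0, i.e., stopping with constant probability Φ(0) = 1/2 regardless of the data) stays at µ.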
The case of two possible sample sizes, N = n or N = 2n, is simple yet generic, and will be adopted here. All developments can be generalized with ease to the setting with L possible sample sizes and accrual numbers n_1, ..., n_L.
The remainder of this paper is organized as follows. In Section 2, the problem under investigation is formally introduced, along with key concepts. The incompleteness of the sufficient statistics is established in Section 3. Section 4 is devoted to generalized sample averages, while joint and conditional likelihood estimation is the topic of Section 5. In each of Sections 3-5, the general exponential family case is supplemented with the particular case of the normal and Bernoulli distributions.

Notation, Basic Concepts, and Problem Formulation
As stated in the introduction, we consider a simple sequential trial, where n measurements Y_i are observed, after which a stochastic stopping rule is applied and, depending on the outcome, another set of n measurements is or is not observed. Let Y be the (2n × 1) vector of outcomes that could be collected and N the realized sample size, that is, N = n or N = 2n. A joint model for the stochastic outcomes and the sample size can be factorized as
f(y, N|θ, ψ) = f(y|θ) f(N|y, ψ) (1)
= f(y|N, θ, ψ) f(N|θ, ψ). (2)
The sample sum is denoted by K; if necessary, a subscript will indicate over which batch the sum is calculated. Molenberghs et al (2012) noted the similarity with missing-data concepts, where (1) is a selection model factorization and (2) is a pattern-mixture factorization (Little and Rubin, 2002). In all cases, it is assumed that f(N|y, ψ) = f(N|y^o, ψ) depends on observed outcomes only, and hence the sample size is determined by the first batch of observations (Y_1, ..., Y_n). We may then write f(N|K_n, ψ). This corresponds to the frequentist concept of missingness at random (Little and Rubin, 2002).
In the limiting case of a deterministic stopping rule, f(N|y, ψ) is degenerate and f(N = n|y, ψ) equals 1 when K_n ∈ S ⊂ ℝ and 0 over its complement C, with the reverse holding for f(N = 2n|y, ψ). The CRSS case follows by assuming Y and N to be independent, meaning that both factorizations (1) and (2) trivially reduce to f(y|θ) · f(N|ψ).
In the stopping-rule case ψ is not estimable from the data and will be assumed to be specified by design. This is different for the other settings that can also be cast in terms of (1)-(2), such as incomplete longitudinal data, clusters of random size, censored time-to-event data, joint models for longitudinal and time-to-event data, and random measurement times settings, as noted by Molenberghs et al (2012). In these cases, a subject-specific index i needs to be introduced into (1)-(2) and N needs to be replaced by the missing data indicators, censoring indicators, and so on.

Basic Concepts
In line with Molenberghs et al (2012), we will review several fundamental concepts that are essential in what follows.
In line with Rubin (1976), we consider ignorability. For pure likelihood or Bayesian inferences, under missingness at random (MAR), inferences about θ can be made using f(y_i^o|θ) only, without the need for an explicit missing-data mechanism or, in our case, without the need for an explicit sample-size model, provided the regularity condition of separability holds, i.e., that the parameter space of (θ, ψ) is the Cartesian product of the individual parameter spaces. In other words, the sample-size model does not contain information about the outcome-model parameter. It implies that N could then be considered ancillary in the sense of Cox and Hinkley (1974, pp. 32-35). We will see that this is true for CRSS, but not for the other situations. Excluding MNAR, ignorability can be violated in three ways. First, even in the likelihood and Bayesian frameworks and under MAR, ignorability does not apply in a non-separable situation.
Second, frequentist inferences are not necessarily ignorable under MAR. Third, assuming MAR and separability hold and we are in a likelihood or Bayesian framework, ignorability in the selection model decomposition (1) does not translate to the pattern-mixture model (2), as is clear from the presence of both θ and ψ in both factors of (2). The latter statement is symmetric and could be made starting from a pattern-mixture view as well.
The bottom line is that ignorability holds in at most one of the two factorizations, except in the trivial MCAR setting, such as for CRSS.
There is a connection between ignorability and ancillarity. Cox and Hinkley (1974) define an ancillary statistic T as one that complements a minimally sufficient statistic S such that, given S, T does not contain information about the parameter of interest.
Arguably the best known example is the sample size T = n when estimating a mean, provided the sample size is fixed by design or the law governing it does not depend on the mean parameter to be estimated, as with CRSS. Counterexamples are the stochastic and deterministic stopping rules.
The crucial property for Liu and Hall (1999), Liu et al (2006), and Molenberghs et al (2012), as well as for us here, is that of completeness (Casella and Berger, 2001, pp. 285-286). A statistic s(Y) of a random variable Y, with Y belonging to a family P_θ, is complete if, for every measurable function g(·), E[g{s(Y)}] = 0 for all θ implies that P_θ[g{s(Y)} = 0] = 1 for all θ. The relevance of completeness for us surfaces in two ways. First, from the Lehmann-Scheffé theorem (Casella and Berger, 2001), if a statistic is unbiased, complete, and sufficient for some parameter θ, then it is the best mean-unbiased estimator for θ. The lack of this property in the stopping-rule case will manifest itself when studying generalized sample averages in Section 4. Second, completeness and ancillarity are connected through Basu's theorem (Basu, 1955; Casella and Berger, 2001, p. 287): a statistic that is both complete and sufficient is independent of any ancillary statistic.

General Model Formulation
Assume that we collect n i.i.d. observations Y_1, ..., Y_n, with exponential family density
f(y|θ) = h(y) exp{θy − a(θ)}, (3)
where θ is the natural parameter, a(θ) the mean generating function, and h(y) a normalizing function. Assume a stochastic stopping rule
P(N = n|K_n = k_n) = F(k_n), (4)
with K_n = Σ_{i=1}^n Y_i. The form of (4) is left unspecified at this time. The CRSS setting follows as F(k_n) ≡ F, a constant. Likewise, when F(·) is degenerate, a deterministic stopping rule ensues. When the trial is not stopped, a further n observations Y_{n+1}, ..., Y_{2n} are collected, also with density (3). The inferential goal is to estimate θ or a function thereof, such as the population mean µ. From the exponential-family structure, the density of K_n can be expressed as
f_n(k) = h_n(k) exp{θk − n a(θ)}, (5)
with h_n(·) the n-fold convolution of h(·). When no ambiguity can arise, the subscript n may be dropped from K_n. Because the density integrates to 1, it trivially follows that
∫ h_n(k) exp(θk) dk = exp{n a(θ)}. (6)
While expression (6) is well known to be a Laplace transform, it is useful to state it explicitly in preparation for the derivations in Section 3. Because the stopping rule depends on K_n, and because (4) combined with the outcome model is a pattern-mixture factorization (2), N is not ancillary to K.
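As an aside, the normalizing identity (6) can be checked directly in a discrete case. The sketch below does so for Bernoulli outcomes, where the convolution weight is h_n(k) = C(n, k) and a(θ) = log(1 + e^θ); this is our own numerical illustration of the identity, not a derivation from the paper.

```python
import math

# Bernoulli outcomes in natural form: theta = logit(pi), a(theta) = log(1 + exp(theta)),
# and the sample sum K_n carries the combinatorial weight h_n(k) = C(n, k).
def tilted_sum(n, theta):
    """Discrete analogue of the left-hand side of (6): sum_k h_n(k) exp(theta*k)."""
    return sum(math.comb(n, k) * math.exp(theta * k) for k in range(n + 1))

def exp_cumulant(n, theta):
    """Right-hand side of (6): exp{n * a(theta)}."""
    return math.exp(n * math.log(1 + math.exp(theta)))

for n in (3, 7):
    for theta in (-1.2, 0.0, 0.8):
        assert abs(tilted_sum(n, theta) - exp_cumulant(n, theta)) <= 1e-9 * exp_cumulant(n, theta)
print("discrete version of (6) verified for the Bernoulli case")
```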
When, in addition, an exponential family form is chosen for the conditional probability of stopping, e.g.,
F(k) = ∫_{−∞}^{A(k)} f_1(z) dz, (7)
then an appealing form for the marginal stopping probability can be derived. Here, f_1(z) can be seen as an exponential family member underlying the stopping process. When the outcomes Y, and hence K, do not range over the entire real line, the lower integration limit in (7) should be adjusted accordingly, and the function A(k) should be chosen so as to obey the range restrictions. It is convenient to assume that f_1(z) has no free parameters; should there be need for such, they can be absorbed into A(k). Hence, we can write f_1(z) in the generic exponential family form (8). Using (5) and (8), the marginal stopping probability (9) can be derived. In the special case of a CRSS, A(k) ≡ A and (9) reduces to a constant, free of the outcome parameter. In our two special cases, (3) will be chosen as standard normal and Bernoulli, respectively.
In the first of these, in concordance with Molenberghs et al (2012), (4) will be assumed to be of probit form:
F(k) = Φ(α + βk/n). (10)
In the binary case, we will generally leave (4) unspecified, but for some developments it is useful to consider an explicit example, for which we will resort to the beta distribution, i.e.,
f_1(z) = z^{α−1}(1 − z)^{β−1}/B(α, β), 0 < z < 1, (11)
with B(·, ·) the beta function. It is convenient to choose integer values, for illustrative purposes: α = p + 1, β = q + 1, with p and q integers, changing (11) to:
f_1(z) = {(p + q + 1)!/(p! q!)} z^p (1 − z)^q. (12)
Choosing (12) leads to the conditional stopping probability:
F(k) = ∫_0^{A(k)} {(p + q + 1)!/(p! q!)} z^p (1 − z)^q dz. (13)
It is instructive to consider some special cases of this. When p = q = 0, (12) reduces to the uniform distribution on the unit interval, and it immediately follows that F(k) = A(k).
When p = 1 and q = 0, we find F(k) = A(k)². As a third and last instance, when p = 0 and q = 1, we find F(k) = 2A(k) − A(k)². A useful function is A(k) = k/n, implying that stopping is certain when K = n and continuation is certain when K = 0, while for 0 < K < n stopping is probabilistic. The actual probability in these cases depends on the choice of p and q.
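These special cases can be verified numerically. The sketch below integrates the beta density (12) up to A(k) with a simple midpoint rule and compares against the closed forms quoted above; this is our own check, and the value A = 0.3 is arbitrary.

```python
import math

def beta_F(A, p, q, steps=10_000):
    """F = integral from 0 to A of (p+q+1)!/(p! q!) z^p (1-z)^q dz, via midpoint rule."""
    coef = math.factorial(p + q + 1) / (math.factorial(p) * math.factorial(q))
    h = A / steps
    mids = ((i + 0.5) * h for i in range(steps))
    return coef * sum(z**p * (1 - z)**q for z in mids) * h

A = 0.3
assert abs(beta_F(A, 0, 0) - A) < 1e-7              # uniform case: F = A
assert abs(beta_F(A, 1, 0) - A**2) < 1e-7           # p = 1, q = 0: F = A^2
assert abs(beta_F(A, 0, 1) - (2*A - A**2)) < 1e-7   # p = 0, q = 1: F = 2A - A^2
print("beta stopping-rule special cases verified")
```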
These choices are made to illustrate our general developments; our emphasis is not on, say, designing a particular trial. However, the class of beta-based stopping rules potentially leads to rich families of stopping rules and spending functions (Whitehead, 1997; Jennison and Turnbull, 2000).

The General Case
We now consider the role of completeness in this setting, building upon the work of Liu and Hall (1999), Liu et al (2006), and Molenberghs et al (2012). A sufficient statistic for this setting is (K, N). In line with the developments in the above papers, the joint distribution of (K, N) is:
f(k, N = n) = f_n(k) F(k), (14)
f(k, N = 2n) = ∫ f_n(l) {1 − F(l)} f_n(k − l) dl. (15)
When the stopping rule leads to range restrictions in the sense of Lehmann and Stein (1950), it is known that the sufficient statistic is complete. Hence, for the rest of this section, we assume their necessary and sufficient conditions do not hold. It is known that these conditions do not hold for the normal distribution, in contrast to classes of stopping rules for the Poisson and binomial distributions, for example.
Assume now that a function g(K, N) exists such that its expectation is zero for all values of the parameter, and further that the integrands are not zero almost everywhere over their integration ranges. Such a function must satisfy:
E[g(K, N)] = ∫ g(k, n) f(k, N = n) dk + ∫ g(k, 2n) f(k, N = 2n) dk = 0. (16)
Substituting the general exponential form (5) into (16), and using (6), leads to (17). Because the left-hand side of (17) is a convolution, and using the uniqueness of the Laplace transform, we find (18). Hence, when g(k, n) is chosen arbitrarily, (18) prescribes the choice for g(k, 2n) that leads to a counterexample to completeness, hence establishing incompleteness.
For the CRSS case, when F(k) ≡ F, a constant, and also choosing g(k, n) = c, a constant, an explicit counterexample follows. In the limiting case of a deterministic stopping rule, F(z) = 1 over the stopping region S and 0 over its set complement C. It then follows that (14)-(15) reduce to (19)-(20). For the deterministic case, (18) becomes (21). Expression (21) follows from the fact that, in the deterministic case, F(k) = 1 over the stopping region S and 0 elsewhere. The transition from one denominator to the other follows from observing that the convolution of f_n(k) with itself produces f_{2n}(k), and then replacing all of these by their explicit exponential-family form (5). Alternatively, it is easy to show that (21) follows immediately from the definition of a function G(K, N) with zero expectation. The implication of these findings is that, whenever they hold, the Lehmann-Scheffé theorem cannot be applied (see Section 2). It follows that a best mean-unbiased estimator does not necessarily exist for the average. In the next section, it will be shown that this is indeed the case for many, but not all, outcome distributions and stopping rules, given that, for example, the exponential distribution does admit a uniform optimum. It will be shown that no optimum exists for the normal case, in line with Molenberghs et al (2012), and neither for the Bernoulli and Poisson cases, for a wide class of stopping rules.

The Normal Case
Following Molenberghs et al (2012), consider the outcome to be standard normal with mean µ and let stopping be governed by (10). They derived from first principles that the marginal probability of stopping is P(N = n) = Φ(ν), with ν = (α + βµ)/√(1 + β²/n). This expression also follows as a special case of (9) by choosing (10) as the stopping rule, i.e., f_1(z) as the standard normal density and A(k) = α + βk/n, and further f_{n,θ=µ}(k) = φ_{µ,n}(k), where φ_{µ,s}(k) is the normal density with mean µ and variance s. Details of this derivation are provided in Appendix A.
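The closed form Φ(ν) with ν = (α + βµ)/√(1 + β²/n) can be confirmed by simulation. The sketch below is our own check, with arbitrary parameter values:

```python
import math
import random

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def stop_prob_mc(n, mu, alpha, beta, reps, seed=1):
    """Monte Carlo estimate of P(N = n) under the probit rule Phi(alpha + beta*K_n/n);
    for unit-variance normal outcomes, K_n ~ N(n*mu, n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        k = rng.gauss(n * mu, math.sqrt(n))
        hits += rng.random() < phi_cdf(alpha + beta * k / n)
    return hits / reps

n, mu, alpha, beta = 5, 0.4, -0.2, 1.5
closed_form = phi_cdf((alpha + beta * mu) / math.sqrt(1 + beta**2 / n))
mc = stop_prob_mc(n, mu, alpha, beta, 200_000)
print(round(mc, 3), round(closed_form, 3))
```

The two printed values agree to Monte Carlo accuracy.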
In contrast, although the observed data are present in the conditional stopping probability, µ is not, implying separability in the selection model formulation.
In this case, (14)-(15) take the forms (24)-(25), where φ_s(k) is the normal density with mean 0 and variance s. Expression (25) is more explicit than (15), making use of the fact that the outcome densities are normal and the stopping probability is written as a normal cumulative distribution function. The derivation can be found in Molenberghs et al (2012). Integrating the joint densities specified by (23)-(25) over K and summing over N should be equal to one; this leads to the identity (26). In Section 4.1, (26) will be derived in general.
The specific form of condition (18) is given by (27). In the CRSS case, (24)-(25) reduce to (28)-(29), where Φ ≡ Φ(α). Here, as in the general case, (27) simplifies and leads to an explicit solution in a number of cases, especially when g(k, n) is chosen to be a constant.
In addition, for this case, other explicit examples can be constructed, even when β ≠ 0.
We reproduce the two examples of Molenberghs et al (2012).
Such g(k, N) functions lead to entire classes of estimators. To see this, assume that an estimator for µ is available, µ̂ say; for example, µ̂ could be the sample average (32). We can then construct a class of estimators derived therefrom.
Applying this to our example and choosing (30) and (31) for the special case of β = 0 leads to the class of estimators (33). It follows directly from the construction of g(k, N) that E(µ̂) = E(µ̂_λ) and hence, if µ̂ is unbiased, then so is µ̂_λ.
For the variance of (33), we obtain var(µ̂_λ) = var(µ̂) + λ²Φ(1 − Φ), which, within this class, is minimal for λ = 0. Hence, for β = 0, i.e., the CRSS case, the original estimator is more efficient than any member of the new class. This will change when β ≠ 0. We also need to consider the basic estimator itself, e.g., either (32) or (51). Before moving on to this, we first complete the second example.
With g(k, n) as in (30), when β → +∞, (31) becomes (34). To see that the considerations particular to the above example are not unique, we consider a second one: choose g(k, n) and g(k, 2n) as in (37)-(38), with λ a given constant. Choosing (37) and (38) now produces the estimator (39). When now taking the limit β → +∞, (37)-(38) become (40)-(41). The fact that the function g(k, n) in (40) is undefined over the negative real numbers is unproblematic, because the stopping region is confined to the non-negative half line.
The Binary Case

While the binary case follows from the general considerations given in Section 3.1, it is insightful to examine this outcome type in some detail; here, integration is replaced by summation. Let the Bernoulli probability be π. The sample sum K then follows a Bin(N, π) distribution, i.e.,
f_N(k) = C(N, k) π^k (1 − π)^{N−k}. (42)
For now, as in the general case, we leave F(k) unspecified. The joint distribution of (K, N) now takes the form (43)-(44), where the meaning of H(k), defined in (45), is obvious, a ∨ b = max(a, b), and a ∧ b = min(a, b).
When stopping rule (13) is chosen, (43) becomes (46). The marginal stopping probability can be derived by summing (46) over k, but is generally unwieldy. In the particular case that p = q = 0 and A(k) = k/n, we find
P(N = n) = π, (47)
together with the companion expression (48). While the derivation of (47) is obvious, that of (48) is less straightforward and details are given in Appendix B. From (47), we deduce immediately that the marginal probability of stopping is the success probability itself. In other words, this particular choice of conditional stopping rule produces essentially the simplest possible marginal stopping probability that depends on the parameter π that governs the outcomes.
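The identity P(N = n) = π for this stopping rule can be verified exactly by enumeration over the binomial sample sum (our own check):

```python
import math

def stop_prob(n, pi):
    """Exact P(N = n) when F(k) = k/n and K_n ~ Bin(n, pi):
    sum over k of C(n,k) pi^k (1-pi)^(n-k) * (k/n)."""
    return sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k) * k / n
               for k in range(n + 1))

for n, pi in [(4, 0.3), (10, 0.7), (25, 0.5)]:
    assert abs(stop_prob(n, pi) - pi) < 1e-12   # marginal stopping probability is pi itself
print("P(N = n) = pi verified")
```

The check is immediate because E(K_n/n) = π, but it also confirms the enumeration machinery used in later examples.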
The condition for the existence of a non-trivial function g(K, N) with expectation zero for all π is a discrete version of (16) and reads as (49). Using the discrete-data version of (6) and owing to equality of polynomial coefficients, we find (50), the discrete-data version of (18).

The General Case
To underscore the impact of the incompleteness of the statistic (K, N), Molenberghs et al (2012) generalized the sample average (32) to the estimator (51), in which K/N is multiplied by a constant c when N = n and by a constant d when N = 2n. We will refer to it as the generalized sample average (GSA).
The ordinary sample average follows as c = d = 1. In this section, (51) will be considered from a general exponential-family perspective. Sections 4.2 and 4.3 bring out some further specifics for the normal and Bernoulli cases, respectively.
From (5), the mean follows as µ = ∂a(θ)/∂θ. The expectation of the GSA is given by (52). This form can be simplified using two identities that are useful here and in what follows. Because integrating (14)-(15) over K and summing over N should lead to unity, identity (53) follows; this also follows from first principles. Likewise, identity (54) holds, with the functions A_n(µ) and B_n(µ) defined in (55). Using (54), we can rewrite (52) as (56)-(57). While obvious, it is useful to spell out (56)-(57) for the ordinary sample average, yielding (58)-(59). It is very intuitive that the bias in the sample average is a simple function of the difference between the conditional and marginal expectations of K/N on the one hand, and the probability of stopping on the other.
The specific form of (52) will depend on both the exponential family member considered and the form of the stopping rule. In general, the expectation may be a non-linear function of µ, and hence there may be no constants c and d for which the expectation is µ. Hence, in many situations, all linear estimators of the form (51) will be biased. We now turn to the asymptotic behavior of the GSA, i.e., the case where n → +∞. Because K converges to a N(nµ, nσ²) variable, and using a first-order Taylor series expansion, we find from first principles the limits (60) and (61). Using (60) and (61), (56) converges to (62); in particular, for the ordinary sample average, it converges to (63). In Section 4.2, we will see that (62) is finite and, moreover, that (63) equals µ. Sufficient conditions for this to hold in general can be given. Assume that F(·) is a continuously differentiable function that depends on k as a function of k/n; to emphasize this, write F(k) = F̃{η(k/n)}. Then F(nµ) = F̃{η(µ)}, independent of n, and F′(nµ) = η′(µ)F̃′{η(µ)}/n, which depends on n only through the factor 1/n and hence converges to zero. More generally, a stopping rule that satisfies F′(nµ) → 0 as n → +∞ ensures that the sample average is asymptotically unbiased.
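The decay of the bias at rate 1/n can be seen numerically. The sketch below computes E(sample average) under the probit rule by one-dimensional quadrature over K_n, exploiting the fact that the second batch has conditional mean nµ; the parameter values are arbitrary illustrations of our own.

```python
import math

def phi_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def npdf(x, m, v):
    """Normal density with mean m and variance v."""
    return math.exp(-(x - m)**2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def mean_sample_average(n, mu, alpha, beta, steps=20_000):
    """E[sample average] under F(k) = Phi(alpha + beta*k/n), by midpoint quadrature
    over K_n ~ N(n*mu, n); on the continuation branch the conditional mean of the
    full average is (k + n*mu)/(2n)."""
    lo, hi = n * mu - 10 * math.sqrt(n), n * mu + 10 * math.sqrt(n)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        k = lo + (i + 0.5) * h
        F = phi_cdf(alpha + beta * k / n)
        total += npdf(k, n * mu, n) * (F * k / n + (1 - F) * (k + n * mu) / (2 * n)) * h
    return total

mu, alpha, beta = 0.2, 0.0, 2.0
b10 = mean_sample_average(10, mu, alpha, beta) - mu
b40 = mean_sample_average(40, mu, alpha, beta) - mu
print(round(b10, 4), round(b40, 4))   # bias roughly quarters as n goes from 10 to 40
```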
For a GSA to be asymptotically unbiased, (62) should equal µ. Assume that the third term on the right-hand side of (62) vanishes asymptotically. For the GSA to be unbiased in the finite-sample case, (56) needs to equal µ, leading to requirement (65), with µ̃ = E(K/N|N = n). Evidently, this is a function of µ in the non-CRSS case, and hence no uniformly unbiased estimator exists. Further, except in the CRSS case, the ordinary sample average never satisfies (65), because this would imply that µ̃ = µ and hence that the stopping probability would be independent of µ.
In the specific case of a CRSS, the constant F is taken out of the integrals on the right-hand side of (56) and we easily find (66), which is unbiased if and only if
cF + d(1 − F) = 1. (67)
An obvious solution is c = d = 1, the sample average, next to an infinite number of unbiased linear estimators of the type (51). Note that (67) follows from (65) upon observing that, in the CRSS case, µ̃ = µ and P(N = n) = F.
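The whole line of (c, d) pairs satisfying cF + d(1 − F) = 1 indeed yields unbiased estimators under CRSS. The enumeration sketch below checks this for binary outcomes, reading the GSA as c·K/N on stopping and d·K/N on continuation (our own illustration):

```python
import math

def gsa_mean_crss(n, pi, F, c, d):
    """Exact E[GSA] under CRSS by enumeration: c*K/N when N = n (prob. F),
    d*K/N when N = 2n (prob. 1 - F), with stopping independent of the data."""
    e_n = sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k) * k / n
              for k in range(n + 1))
    e_2n = sum(math.comb(2 * n, k) * pi**k * (1 - pi)**(2 * n - k) * k / (2 * n)
               for k in range(2 * n + 1))
    return F * c * e_n + (1 - F) * d * e_2n

F, c = 0.3, 2.0
d = (1 - c * F) / (1 - F)   # any (c, d) with c*F + d*(1 - F) = 1 works
for pi in [0.1, 0.5, 0.8]:
    assert abs(gsa_mean_crss(6, pi, F, c, d) - pi) < 1e-12
print("an entire class of unbiased GSAs under CRSS")
```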
In addition to studying the overall expectation of the GSA, it is of interest to consider the conditional expectations, given in (68)-(69). The ordinary sample average versions follow by setting c = d = 1 in (68)-(69).
For the ordinary sample average, when F′(nµ) converges to zero, the conditional expectations converge to µ. When the limits in (70) and (71) exist and are finite, the same conclusion applies. A natural follow-up question is whether there is a, perhaps uniform, optimal estimator in the CRSS case. From straightforward algebra we find the variance expression (72), which is minimal for the coefficients given in (73). In (72) and (73), σ² is the variance; it follows as either the first derivative of the mean function or, in the slightly more general case where there is an overdispersion parameter, as the first derivative of the mean multiplied by the overdispersion parameter. The resulting optimal coefficients satisfy (67). In all cases, when F = 0, then d = 1 and c is irrelevant, while for F = 1 the reverse is true.
We have seen above that, even for CRSS, the sample average is not optimal, and that there is no uniformly optimal solution, even though the sample average approximately is. The exponential case is an exception to this, as we saw above. However, the sample average is optimal in the restricted class of estimators that are invariant to future decisions. Indeed, if stopping occurs, then any choice of the coefficient c leads to an unbiased estimator, provided the appropriate d is chosen. However, this d will never be used, as it pertains to 'future' observations. This can be avoided only by setting both coefficients equal, from which the conventional sample average emerges.
The asymptotic behavior for a deterministic stopping rule is completely captured by the normal case, described in Section 4.2, because the stopping rule F (k) has the effect of restricting the integrals over the stopping and continuation regions S and C, respectively.
This, together with the fact that f_n(k) approaches a normal density with mean nµ and variance nσ², establishes this fact. As a result, we can restrict considerations regarding the deterministic case to the finite-sample situation, which, in turn, is very straightforward.
Given that the joint distribution (14)-(15) becomes (19)-(20), the functions A_n(µ) and B_n(µ) in (55) take the form (74), and all results, such as marginal and conditional expectations of the GSA, carry over.
The specific case of a CRSS, here corresponding with β = 0, has been considered in Section 4.1.
When β ≠ 0, expression (75) does not in general simplify. It is easy to see that there cannot be a uniformly unbiased estimator, i.e., that there cannot exist c and d such that (75) reduces to µ for all µ, and in particular for µ = 0. For this special case, ν_0 = α/√(1 + β²/n). Given that β ≠ 0, this expression leads to the condition 2c = d. Substituting this back into (75), which should equal µ for every value of µ and not just for µ = 0, and given that Φ(ν) is not constant but rather depends on µ unless β = 0, we see that there can be no uniformly unbiased estimator of the generalized sample average type. In other words, a simple average estimator, which merely uses the observed measurements in a least-squares fashion, can never be unbiased unless β = 0.
In particular, when β → +∞, the limiting expectation is given by (76). There exist other choices that also lead to asymptotically unbiased generalized sample averages. For β ≠ 0 but finite, the expectation becomes (78), which equals µ if and only if (79) holds. While (79) and (67) are similar, there is a crucial difference between them: the latter is independent of µ, while the former is not, except when c = d = 1. In other words, there is no uniformly asymptotically unbiased generalized sample average for finite, non-zero β, except for the ordinary sample average itself.
The above limits also follow from (62) and (63), because now η(k/n) = α + βk/n and the derivative therefore is F′(nµ) = φ(α + βµ) · β/n, which leads to (76). Molenberghs et al (2012) also studied the deterministic stopping rule case, following from β → ∞, because then (78) becomes (80). This provides us with the interesting situation that, for positive µ, c = 1 yields an asymptotically unbiased estimator, regardless of d, with the reverse holding for negative µ. In the special case that µ = 0, both coefficients are immaterial. In addition, we see here as well that the only uniform solution is obtained by requiring that the bias asymptotically vanish for all values of µ, that is, c = d = 1.
The pleasing asymptotic behavior of the sample average is connected to the choice of the stopping rule, in view of the limiting expressions (62) and (63). The essence is that the stopping rule is a cumulative distribution function transformation of a linear predictor in k/n. It is therefore of interest to examine the consequences of switching to a different class of stopping rules. Therefore, we change the stopping rule to Φ(α + βk). Then F′(nµ) = βφ(α + βnµ), which again tends to zero. However, depending on the signs of β and µ, Φ(α + βnµ) tends to either zero or one. Applying de l'Hôpital's rule to the case where F(nµ) tends to zero as well produces −β(α + βnµ), which tends to infinity, and hence the regularity condition (70) appears not to be satisfied. This requires careful qualification, because F(nµ) not only appears in (70), it is also the probability with which N = n, which then equally well tends to zero. Thus, for this case, in the limit, E(µ̂|N = 2n) = E(µ̂) and unbiasedness still applies. Evidently, when 1 − F(nµ) tends to zero rather than F(nµ), we are in the mirror image of the above situation, and the result is the same. This result applies more generally. If F(k) = Φ(α + βk n^m), with m any real number, then F′(nµ) converges to zero whatever m is. Further, F(nµ) converges to Φ(α + βµ) for m = −1, to Φ(α) for m < −1, and to Φ(±∞) (i.e., 0 or 1) for m > −1.
This means that the sample average is asymptotically unbiased in all cases, and even conditionally asymptotically unbiased, based on the same logic as before.
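As an illustration, the following sketch simulates the two-stage design with standard-normal errors and the probit stopping rule F(k) = Φ(α + βk/n): observe n outcomes, stop with probability F(K), otherwise observe n more. All function and variable names here are ours, not from the paper; the quantity reported is the Monte Carlo mean of the sample average K/N.

```python
import numpy as np
from scipy.stats import norm

def simulate_mean(mu, n, alpha, beta, reps=200_000, seed=2012):
    """Monte Carlo estimate of E(K/N) under the stochastic stopping rule
    F(k) = Phi(alpha + beta * k / n)."""
    rng = np.random.default_rng(seed)
    k1 = rng.normal(mu, 1.0, size=(reps, n)).sum(axis=1)       # first-stage sum K
    stop = rng.random(reps) < norm.cdf(alpha + beta * k1 / n)  # stop => N = n
    k2 = k1 + rng.normal(mu, 1.0, size=(reps, n)).sum(axis=1)  # sum after 2n obs
    est = np.where(stop, k1 / n, k2 / (2 * n))                 # sample average K/N
    return est.mean()
```

For small n the sample average carries a small bias (positive when β > 0), which shrinks as n grows, in line with the asymptotic unbiasedness discussed above.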

The Binary Case
An explicit form for the expectation of the generalized sample average in the Bernoulli case follows, with H(k) as in (45).
The CRSS case has been covered in Section 4.1, with the coefficients for optimal estimators listed in Table 1.
As an example, when stopping rule (13) is chosen, with p = q = 0 and A(k) = k/n, we have F(k) = A(k) = k/n, and (81) simplifies accordingly. Clearly, the estimator is unbiased if and only if the corresponding condition on (c, d) holds.
Hence, there is no uniform solution, neither in π nor in n. When n → +∞, the condition reduces to (83). Note that the ordinary sample average, i.e., c = d = 1, is a solution to (83), as it should be.
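Because K is binomial, the expectation of the sample average under the stopping rule F(k) = k/n can be computed exactly by enumeration rather than simulation. A minimal sketch (function names ours): condition on K = k, stop with probability k/n, and otherwise average over the second block of n Bernoulli draws.

```python
from math import comb

def expected_sample_average(pi, n):
    """Exact E(K/N) for Bernoulli(pi) data with stopping rule F(k) = k/n."""
    total = 0.0
    for k in range(n + 1):
        pk = comb(n, k) * pi**k * (1 - pi)**(n - k)  # P(K = k) after n draws
        p_stop = k / n                               # P(N = n | K = k)
        est_stop = k / n                             # sample average if stopped
        est_cont = (k + n * pi) / (2 * n)            # E(K_2n / (2n) | K = k)
        total += pk * (p_stop * est_stop + (1 - p_stop) * est_cont)
    return total
```

The enumeration yields a bias of the form π(1 − π)/(2n): nonzero for every finite n, so no single choice of coefficients removes it uniformly in π and n, yet vanishing as n → ∞, consistent with the asymptotic behavior of the ordinary sample average.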
Turning to the case of a deterministic stopping rule, assume that the stopping region S is defined by k ≤ k₀, i.e., F(k) = 1 if k ≤ k₀ and 0 otherwise. Functions A_n(π) and B_n(π), as in (74), then involve I(k, n, π), the binomial cumulative distribution function, which is defined by (84). Various alternative formulations exist, but none is of direct use here. The expectation of the GSA becomes (86); for the ordinary sample average, it reduces to E(π̂) = π{1 + ½[I(k₀ − 1, n − 1, π) − I(k₀, n, π)]}.
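The closed form above can be checked by direct enumeration. A small sketch (names ours) for the deterministic rule that stops whenever k ≤ k₀:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k0, n, p):
    """I(k0, n, p): P(Bin(n, p) <= k0); empty sum (k0 < 0) gives 0."""
    return sum(binom_pmf(j, n, p) for j in range(0, max(k0, -1) + 1))

def expected_avg_deterministic(pi, n, k0):
    """Exact E(K/N) when stopping (N = n) iff the first-stage sum K <= k0."""
    total = 0.0
    for k in range(n + 1):
        pk = binom_pmf(k, n, pi)
        if k <= k0:
            total += pk * k / n                   # stopped: average of n draws
        else:
            total += pk * (k + n * pi) / (2 * n)  # continued: expected average
    return total

def closed_form(pi, n, k0):
    """pi * (1 + [I(k0-1, n-1, pi) - I(k0, n, pi)] / 2)."""
    return pi * (1 + 0.5 * (binom_cdf(k0 - 1, n - 1, pi) - binom_cdf(k0, n, pi)))
```

The two agree to machine precision, confirming the reduction for the ordinary sample average.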

The General Case
For notational convenience, we introduce the indicator variable Z = I(N = n).
The joint likelihood for the observed data and the stopping occurrence is of a selection model type: in decomposition (87), the factors pertaining to stopping are free of the mean parameter µ. This simplifies the kernel of the log-likelihood ℓ(µ), the score function S(µ), and the Hessian H(µ), as in (88)–(90). The simplicity of this estimator is a direct consequence of ignorability. Based on (14)–(15), the conditional probability for the sample sum K, given the sample size N, can be derived. For the case N = n, the likelihood function is (91), leading to expressions (92)–(94) for the log-likelihood, score, and Hessian. Here, A_n(µ) and B_n(µ) are as defined in (55), and C_n(µ) = ∫ k² f_n(k)F(k) dk.
When N = 2n, the likelihood takes an analogous form, with the corresponding normalizing quantity, and the counterparts to (92)–(94) then follow. From the form of (93) and (97), it is immediately clear that the conditional expectations of the conditional scores are equal to zero, and therefore so is the marginal expectation.
The expectation of the joint-likelihood-based estimator, which is the ordinary sample average, was presented in Section 4.1. Even though there is small-sample bias in most cases other than CRSS, the estimator is asymptotically unbiased for wide classes of stopping rules. The bias expressions in the conditional expectation of the sample average, which of course are also the bias expressions for the joint likelihood estimator, are of the form E(K/N|N) − µ.
These expressions coincide with the correction in conditional score equations (93) and (97) relative to (89), as follows immediately upon rewriting the former so that this correction appears explicitly. Turning to precision and information, first note that for CRSS, H_n(µ) = −nσ² and H_2n(µ) = −2nσ², so that the marginal and conditional information take a simple form in this case. In the general case, the marginal and conditional information are (99)–(100). Using these information expressions, the bias of the marginal likelihood estimator, and the fact that the conditional likelihood estimator is unbiased, the mean squared error expressions (101)–(102) follow. Recall that for CRSS, B_n(µ) = nµA_n(µ), and both MSE expressions coincide. In the asymptotic case, (101)–(102) can be approximated using (60)–(61). Returning to the exact expressions, it is relatively straightforward to show that (101) is smaller than (102) if and only if σ²A_n(µ)[1 − A_n(µ)][2 − A_n(µ)] ≤ 4. Requiring that this inequality be satisfied for all values of A_n(µ) in the unit interval comes down to requiring that σ² ≤ 2.54. Hence, the MSE is smaller in the marginal case whenever the variance is sufficiently small. For binary data, for example, this is always satisfied, given that the variance takes the form π(1 − π). Also, asymptotically, A_n(µ) typically tends to either 0 or 1, and the requirement is then satisfied as well. In case F(nµ) tends to zero as n tends to infinity, both MSE expressions tend to the same limit.
The Normal Case
Consider first the case where N = n. The kernel of the log-likelihood ℓ(µ), the score function S(µ), and the Hessian H(µ) follow directly; when N = 2n, the corresponding expressions are analogous. Next, we consider bias, consistency, precision, and mean squared error of the joint and conditional likelihood estimators.
In the CRSS case, µ vanishes from the joint stopping model, and both estimators coincide with the ordinary sample average, amply studied in Section 4.
Asymptotic unbiasedness of the sample average follows both from direct calculation and from the fact that it is the maximum likelihood estimator from the joint likelihood (106). In terms of the conditional likelihood, the estimator is obtained as the solution to the score equations (111) and (114), which can be reformulated as (116).
The expectation of (116) results from (76), combined with the observation that the probability of stopping is Φ(ν). Finite-sample unbiasedness follows directly from the linearity of the score in the data.
Thus, the difference between the two score equations is bias-correcting. The correction is a non-linear function of µ and admits no closed-form solution, underscoring the point that no simple algebraic function of K and N will lead to the same estimator.
Finally, we note that the conditional likelihood estimator is also conditionally unbiased, i.e., it is unbiased in both situations N = n and N = 2n separately, in agreement with the results of Section 5.1. To see this explicitly, it is convenient to rewrite the expectation of the generalized sample average (75), from which both conditional expectations E(µ̂|N) follow. In conclusion, the sample average is conditionally and marginally biased, with the bias vanishing as n goes to infinity, except in the situations that correspond to vanishing probabilities. In contrast, the conditional estimator is unbiased, whether considered conditionally on the observed sample size or marginalized over it.
Turning to precision, the expected information in the joint approach is (119), where α and β are as above. In the conditional case, it is (120). When n → ∞, both information expressions tend to a common limit, so the difference between the joint and conditional information vanishes as n tends to infinity.
We conclude that the conditional estimator is less precise than the joint one, in contrast to many familiar settings such as contingency table analyses. The important feature here is that conditioning is done on a non-ancillary statistic. In line with the general theory in Section 5.1, we have also seen that the joint approach leads to the ordinary sample average, an estimator that has met with considerable concern in the past in the sequential setting.
Because of the opposing results for bias and precision, it is useful to calculate the mean squared error of both estimators. The expressions, from Molenberghs et al (2012), are (121)–(122). Comparing these, we see that g(ν) = [2 − Φ(ν)]²Φ(ν)[1 − Φ(ν)] < 4, the inequality being strict; in fact, the maximal value of g(ν) equals 0.619. Hence, the joint estimator has the smaller MSE of the two, even though the difference will be very small for moderate to large sample sizes. This holds regardless of the choice of α, β, and n, and of the true value of µ. For finite β, when n → ∞, ν approaches α + βµ, so that Φ(ν) and φ(ν) become constant, and the difference between the two expressions disappears because the second terms on the right-hand sides of (121) and (122) are of order 1/n².
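The bound on g(ν) is easy to verify numerically; a small sketch (grid and names ours):

```python
import numpy as np
from scipy.stats import norm

# g(nu) = [2 - Phi(nu)]^2 * Phi(nu) * [1 - Phi(nu)] is smooth and vanishes in
# both tails, so a fine grid over a wide interval suffices to locate its maximum.
nu = np.linspace(-6.0, 6.0, 1_200_001)
p = norm.cdf(nu)
g = (2.0 - p) ** 2 * p * (1.0 - p)
g_max = float(g.max())
```

The maximum, attained near Φ(ν) ≈ 0.36, is about 0.62, comfortably below 4, so the joint estimator's MSE advantage holds for every ν.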
To conclude this section, we examine the above quantities in the limiting case of a deterministic stopping rule, i.e., β → ±∞, focusing on the positive limit. The marginal outcome model retains its normal-density form, while the other three expressions change. First, the conditional outcome model (109) changes form, from which it follows that E[S(µ)|N] = 0. For the sample average, a little more caution is required. From (117) and (118), it follows that E(µ̂|N) converges to µ at rate n, because ν → α + βµ. The situation is more subtle when β → ∞. To show this, we take the limit of (124) and (125) as n → ∞. When µ < 0, applying de l'Hôpital's rule where needed, the limits are E(µ̂|N = n) → 0 and E(µ̂|N = 2n) → µ. Similarly, when µ > 0, the corresponding limits are E(µ̂|N = n) → µ and E(µ̂|N = 2n) → µ/2.
It follows that, when µ = 0, both limits equal 0. Somewhat surprisingly, this does not translate into harmful bias in the conditional means: when n → ∞, the probability itself that N = n (N = 2n) for negative (positive) µ goes to zero. This implies that, overall, conditional inference based on the ordinary sample average is still acceptable.
For precision, the second term in (120) approaches a limit that is non-zero for finite n but can be shown to approach 0 as n → ∞. This has the interesting consequence that there is no difference in precision between the limiting cases β = 0 and β → ∞, but there is for finite non-zero β.
For the mean squared error, the argument differs from the one used in the stochastic stopping rule case, because now ν = √n µ and β = √n. When n → ∞, the two expressions behave as 1/(2n) if the trial continues and 1/n if it stops, and the difference between them disappears.
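The deterministic-limit behavior of the conditional means can be illustrated by simulation. A sketch under assumptions of ours (unit variance, α = 0, so the rule stops iff the first-stage sum is positive); all names are illustrative:

```python
import numpy as np

def conditional_means(mu, n, reps=400_000, seed=1999):
    """Simulate the beta -> +infinity limit: stop (N = n) iff K > 0.
    Returns MC estimates of E(K/N | N = n), E(K/N | N = 2n), and P(N = n)."""
    rng = np.random.default_rng(seed)
    k1 = rng.normal(mu, 1.0, size=(reps, n)).sum(axis=1)       # first-stage sum
    stop = k1 > 0                                              # deterministic rule
    k2 = k1 + rng.normal(mu, 1.0, size=(reps, n)).sum(axis=1)  # sum after 2n obs
    est = np.where(stop, k1 / n, k2 / (2 * n))                 # sample average
    return est[stop].mean(), est[~stop].mean(), stop.mean()
```

For µ > 0, the stopped branch is nearly unbiased, while the (increasingly rare) continuation branch drifts toward µ/2; since P(N = 2n) → 0, the overall mean still converges to µ.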

The Binary Case
Joint-likelihood expressions for the binary case, in the probability parameter π, are (126)–(129). The expected Hessian for a fixed sample size is well known to be −N/[π(1 − π)]; with our stopping rule F(k) = k/n, however, it takes the form (130). Likewise, given that the solution to S(π) = 0 is the sample average, the bias is (131). We will return to this in what follows.
Turning to the conditional expressions, for N = n, (91)–(94) specialize to the binary case, with A_n(π) = Σ_{ℓ=0}^{n} C(n, ℓ) π^ℓ (1 − π)^{n−ℓ} F(ℓ); for N = 2n, analogous expressions are obtained. The fact that E[S_N(π)|N] = 0 follows from the derivations in Section 5.1, as well as from first principles.
It is clear that the above expressions differ slightly from the general expressions (87)–(90), because π is not the natural parameter. This does not prohibit further derivations but makes them algebraically cumbersome. We therefore switch to the logit form, i.e., α = ln[π/(1 − π)], and restrict attention to the particular stopping rule used in previous sections, F(k) = k/n. Then, (126)–(129) become (138)–(141). The use of π rather than α on the right-hand sides of (140) and (141) is for convenience only. The expected Hessian (142) is straightforward to derive, given that E(N) = n(2 − π). In fact, this calculation is considerably easier than the derivation of (130), even though the two are equivalent; indeed, (130) follows from (142) by applying the delta method.
The forms for (133)–(137), supplemented with the Hessians, follow analogously. Note that the conditional Hessians are in line with what one would expect from conditioning on the sample size: one 'degree of freedom' is removed for mean-parameter estimation. Such an operation, though, is standard only when the sample size is fixed.
The counterintuitive effect on efficiency was seen in general in Section 5.1 and very explicitly for the normal setting in Section 5.2. Straightforward algebra establishes the corresponding relation here: the conditional information is expected to take one subject fewer into account than the marginal information, precisely the opposite of what one would expect in the fixed sample-size case. The bias in the estimators is easy to quantify, given that the estimators are π̂ = k/N in the marginal case, and π̂_c = (k − 1)/(n − 1) when N = n and π̂_c = k/(2n − 1) when N = 2n. The biases are (n − k)/[n(n − 1)] and −k/[2n(2n − 1)], respectively.
This follows from the difference between the marginal and conditional estimators, given that the latter is unbiased. For this stopping rule, E(K|N = n) = nπ + 1 − π and E(K|N = 2n) = π(2n − 1), so that the average bias equals (131), as expected.
Hence, as in the normal case, the joint estimator is more efficient than the conditional one.
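These bias and efficiency claims can be verified exactly by enumerating the joint distribution of (K, N) under F(k) = k/n. A sketch (names ours), using the estimators given above:

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def mean_and_mse(pi, n):
    """Exact mean and MSE of the marginal estimator k/N and the conditional
    estimator ((k-1)/(n-1) if N = n, k/(2n-1) if N = 2n), rule F(k) = k/n."""
    out = {"marginal": [0.0, 0.0], "conditional": [0.0, 0.0]}
    for k in range(n + 1):
        pk = pmf(k, n, pi)
        w = pk * (k / n)                        # P(K = k, N = n)
        if w > 0:
            for name, est in (("marginal", k / n),
                              ("conditional", (k - 1) / (n - 1))):
                out[name][0] += w * est
                out[name][1] += w * (est - pi) ** 2
        for k2 in range(n + 1):                 # continuation: second block sum
            w2 = pk * (1 - k / n) * pmf(k2, n, pi)
            tot = k + k2
            for name, est in (("marginal", tot / (2 * n)),
                              ("conditional", tot / (2 * n - 1))):
                out[name][0] += w2 * est
                out[name][1] += w2 * (est - pi) ** 2
    return out
```

The conditional estimator comes out exactly unbiased, while the marginal one carries bias π(1 − π)/(2n) yet attains the smaller MSE, matching the normal-case conclusion.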

Concluding Remarks
We have considered the consequences for statistical inference of a random sample size.
Our setting is that of univariate random variables from the exponential family, subject to a stopping rule such that the sample size is either N = n or N = 2n, with n specified by design. The stopping rule is stochastic and is allowed to depend on the sample sum K over the first n observations. The rule is generic in the sense that its limiting cases are a deterministic stopping rule, as in a sequential trial, and a completely random sample size, independent of the data. This setting extends those of both Liu et al (2006) and Molenberghs et al (2012): the former restrict attention to a deterministic stopping rule, albeit for an arbitrary number of interim looks, while the latter confine attention to normally distributed outcomes.
We have focused on three important inferential aspects. First, we have shown that the sufficient statistic (K, N) is incomplete. Second, we have examined the consequences of this for the sample average, as well as for linear generalizations thereof. We have shown that there is small-sample bias, except in the CRSS case. Even then, there is no optimal estimator, except for the exponential distribution, for which the optimum differs from the ordinary sample average. Third, we have studied maximum likelihood estimation in both a joint and a conditional framework. The joint likelihood is for the exponential-family parameter and the stopping rule simultaneously; the conditional likelihood starts from the conditional distribution of the outcomes, given the sample size. Here, too, counterintuitive results emerge. The joint likelihood produces the sample average as maximum likelihood estimator, which is biased in finite samples but asymptotically unbiased, provided a regularity condition on the stopping rule holds. The conditional likelihood estimator is unbiased, even in small samples. This notwithstanding, the sample average has smaller MSE than the conditional estimator in many important cases, such as the normal and binary examples considered, and more generally when the variance of the outcomes is sufficiently small. Under regularity conditions, both estimators are asymptotically equivalent, with the difference between them being O(1/n). The regularity condition is not very restrictive; it essentially comes down to requiring that F(nµ) approach zero, where F is the stopping rule. For broad classes of parametric functions, this condition is satisfied. We have shown that the corresponding conditional expectations are unbiased.
Hence, when the regularity conditions are satisfied, the sample average remains an attractive and sensible choice for sequential trials. Thus, while some familiar inferential properties no longer hold, estimation after sequential trials is more straightforward than commonly thought, and there is little need for complicated, modified estimators, given that the ordinary sample average is acceptable for wide classes of stopping rules, whether stochastic or deterministic.
Molenberghs et al (2012) considered several ramifications of their developments. They commented on the situation of an arbitrary number of looks in a sequential trial, and considered in detail the CRSS case for more than two possible sample sizes; all of this was done for normally distributed outcomes. They also commented on the connection between their derivations and longitudinal outcomes subject to dropout of an MAR type, where dropout depends on observed but not further on unobserved outcomes. While similar, there are subtle differences, because the randomness in the sample size now pertains to the number of measurements per subject rather than to the number of subjects; measurements within a subject are not independent. Our results extend to these settings as well for the exponential family. Furthermore, connections can be made with a variety of other settings with random sample sizes, such as clustered data with informative cluster sizes, time-to-event data subject to censoring, jointly observed longitudinal and time-to-event data, and random observation times. These settings are currently being scrutinized further and will be reported on in a separate manuscript.