Conditional Poisson Stochastic Beams

Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process: Conditional Poisson stochastic beam search. Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to Kool et al. (2019)’s stochastic beam search (SBS). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe CPSBS produces lower variance and more efficient estimators than SBS, even showing improvements in high entropy settings.


Introduction
Many NLP tasks require the prediction of structured outputs, such as sentences or parse trees, either during decoding or as part of a training algorithm.For today's neural architectures, beam search (Reddy, 1977) has become the decoding algorithm of choice due to its efficiency and empirical performance (Serban et al., 2017;Edunov et al., 2018;Yang et al., 2019;Meister et al., 2020b).Beam search is a deterministic method, which invites a natural question: What is the proper stochastic generalization of beam search?Several recent papers have investigated this question (Kool et al., 2019(Kool et al., , 2020;;Shi et al., 2020).Here we build Table 1: A comparison of beam-based decoding algorithms for sequence models, by solution set size and objective.The argmax and sample variants are related through annealing: As the annealing parameter of the distribution τ → 0, sampling turns into computing an argmax (see §3). on this line of work and introduce an alternative stochastic beam search that the authors contend is a more faithful stochasticization of the original algorithm in that it recovers standard beam search as a special case.We name our algorithm conditional Poisson stochastic beam search (CPSBS) as we draw on the conditional Poisson sampling scheme (Hájek, 1964) in its construction.The relationship between CPSBS and other common decoding strategies is displayed visually in Table 1.
At every iteration, CPSBS replaces the top-K operator in the beam search algorithm with conditional Poisson sampling, resulting in a decoding strategy that generates samples-withoutreplacement.Importantly, annealing our sampling distribution at each time step turns local sampling into a local top-K computation and thereby recovers beam search.We subsequently show that these samples can be used to construct a statistically consistent estimator for the expected value of an arbitrary function of the output.
In our experiments with neural machine translation models, we observe that CPSBS leads to better estimates of expected BLEU and conditional model entropy than SBS and the sum-and-sample estimator (Kool et al., 2020), distinctly outperforming Monte Carlo sampling for both small sample sizes and low temperatures.Furthermore, we find that CPSBS can be used as a diverse sampling strategy.
We take these results as confirmation that CPSBS is a useful tool in the newfound arsenal of sampling strategies for neural sequence models.

Beam Search
In this section, we overview the necessary background on neural sequence models and beam search in order to motivate our algorithm in §3.
Neural Sequence Models.We consider locally normalized probabilistic models over sequences y: where y is a member of a set of well-formed outputs Y.In the context of language generation models, well-formed outputs are sequences of tokens y = y 1 , y 2 , . . .from a vocabulary V ; all y ∈ Y begin and end with special tokens BOS and EOS, respectively.We use y <t to represent the subsequence y 1 , . . ., y t−1 .In this work, we consider the setting where the maximum sequence length is upper-bounded; we denote this upper bound T > 0.
Without loss of generality, we may condition p on an input x, as is necessary for machine translation and other conditional generation tasks.
Beam Search.Beam search is a commonly used search heuristic for finding an approximate solution to the following optimization problem: Its most straightforward interpretation is as a pruned version of breadth-first search, where the breadth of the search is narrowed to the top-K candidates.However, here we will present beam search in a nonstandard lens (Meister et al., 2020a(Meister et al., , 2021) ) in order to emphasize the connection with our stochastic generalization in §3.Specifically, we present the algorithm as iteratively finding the highest-scoring set under a specific set function.Under this paradigm, the initial beam Y 0 contains only the BOS token.At subsequent steps t = 1, . . ., T , beam search selects the K highestscoring candidates from the set Y t−1 • V that we define below: 2 2 Sequences already ending in EOS are not extended by y ∈ V and are simply added to the set "as is." where • is sequence concatenation.Those candidate sets with collectively higher probability under the model p have higher score.This process continues until all y ∈ Y t end in EOS, or t = T .For notational ease, we define B t def = Y t−1 • V ; throughout this paper, we will assume |B t | = N and identify the elements of B t = {y We can formulate the time-step dependent set function whose argmax beam search finds as where w n is the weight of the n th element of B t .To recover beam search, we set our weights equal to probabilities under a model p, i.e., w n = p y (n) ≤t .Note that we leave the constraint that Y ⊆ B t implicit in Eq. ( 4).As should be clear from notation, this set function only assigns nonzero scores to subsets of Y t−1 • V of size exactly K and the assigned score is proportional to the product of the probability of the candidates under the model p. Putting this all together, beam search may be viewed as the following iterative process: return Y T (7)

Conditional Poisson Stochastic Beams
Our paper capitalizes on a very simple observation: Rather than taking its argmax, we may renormalize Eq. ( 4) into a distribution and sample-withoutreplacement a size K set at each iteration: Note the structural zeros of Q t prevent any incompatible sequence of beams.We provide a theoretical analysis of the scheme in §4 and an empirical analysis in §5.
)-a distribution over subsets of size K of a base set B t -using the CPS design.The normalizing constant for this distribution is defined as Despite there being exponentially many summands, we can sum over all N K subsets in O(N K) time via the following recurrence relation: 5 We give complete pseudocode in App. C. Correctness of this algorithm is shown in Kulesza and Taskar (2012).The normalizing constant can then be efficiently computed as ≤t ∈ Y t , also termed an item's inclusion probability.In this paper, we write π Qt y One strategy is to choose w n at time step t such that π Qt y ≤t ).This choice recovers beam search when we anneal our chosen weights w n → w 1/τ n : as the temperature parameter τ → 0, the CP distribution will assign probability 1 to the set containing the top-K elements. 6inding w n 's that result in pre-specified inclusion probabilities is possible, but it requires solving a numerical optimization problem (Aires, 1999;Grafström, 2009).Further, in CPSBS, we will be sampling from a different distribution at each time step and it would be quite slow to solve the numerical optimization problem each iteration.Luckily, the choice of w n = p(y ) yields a good approximation to the target inclusion probabilities in both theory and practice (Hájek, 1981;Bondesson et al., 2006;Aires, 1999).

Statistical Estimation with Conditional Poisson Stochastic Beam Search
In this section, we discuss statistical estimation with CPSBS samples.To that end, we construct two estimators with different properties.However, only the second estimator provides good performance in practice, which is discussed later in §5.

The Horvitz-Thompson Estimator
We build upon the Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952), which is a common technique for estimation from samplingwithout-replacement (SWOR) schemes.Let f : Y → R d be a function whose expected value under p we seek to approximate: The Monte Carlo estimator of the above quantity is where y (m) i.i.d.∼ p.However, in the special case of sampling from a finite population-which is extremely common in NLP-it can be very wasteful.For example, if a distribution is very peaked, it will sample the same item repeatedly; this could lead to inaccurate approximations for some f .As a consequence, the mean square error (MSE) of the estimator with respect to E y∼p [f (y)] can be quite high for small M .Indeed, we see this empirically in §5.1.
Taking samples without replacement allows us to cover more of the support of p in our estimate of E y∼p [f (y)].However, we must take into account that our samples are no longer independent, i.e., ∼ p.We now define the HT estimator, using notation specifically for the case of CPSBS: As should be clear from notation, we assume Y T ∼ P ; further, we use π P (y) to denote the inclusion probability of y under CPSBS, i.e., the probability of sampling a set Y T ∼ P that contains the element y: In Eq. ( 17), the distribution π P may be viewed as a proposal distribution in the sense of importance sampling (Owen, 2013) and 1/π P (y) as the corresponding importance weight corrections.If we can exactly compute π P , then the HT estimator is unbiased7 (see App. B.1 for proof).However, the summation in Eq. ( 18) is intractable so we resort to estimation.

Estimating Inclusion Probabilities
In this section, we develop statistical estimators of the inclusion probabilities under conditional Poisson stochastic beam search.Note that in order to maintain the unbiasedness of the HT estimator, we must estimate the reciprocal inclusion probabilities.8However, these are not straightforward to estimate.Thus, we attempt to estimate the inclusion probabilities directly and take the reciprocal of this estimator.This strategy leads to a consistent, but biased, estimator.An important caveat: the analysis in this section only applies to the estimators of the inclusion probabilities themselves.Further analysis may be undertaken to analyze the variance of the HT estimators that make use of these estimators.

Naïve Monte Carlo
One obvious way to derive an inclusion probability estimator is via Monte Carlo estimation: where Y (m) ∼ P .
Proposition 4.1.Eq. 19 has the following two properties: i) π MC P is an unbiased estimator of π P and Here V a denotes the asymptotic variance, which is the variance after the number of samples M is large enough such that the central limit theorem has kicked in (Bickel and Docksum, 2015).
Qualitatively, what this result tells us is that if we are asking about the inverse inclusion probability of a candidate with a low inclusion probability, our estimator may have very high variance.Indeed, it is unlikely that we could derive an estimator without this qualitative property due to the presence of the inverse.Moreover, the estimator given in Eq. ( 19) is not of practical use: If we are interested in the inverse inclusion probability of a specific candidate y, then we may have to sample a very large number of beams until we eventually sample one that actually contains y.In practice, what this means is that our estimate of the inclusion probably for a rare y will often be zero, which we cannot invert. 9Instead, we pursue an importance sampling strategy for estimating π P (y), which we outline in the next section.

Importance Sampling
We now turn to an inclusion probability estimator that is based on importance sampling.Recall from Eq. ( 18) that the inclusion probability for y is a massive summation over sequences of possible beams Y 1 , . . ., Y T that could have generated y.
Rather than computing the sum, we will estimate the sum through taking samples.Our procedure starts by generating hindsight samples Y 1 , . . ., Y T from the following proposal distribution that is conditioned on y: In words, Q t is Q t conditioned on its sets Y t containing the prefix y ≤t (thus it is always the case that y ≤t ∈ Y t ). 10 For brevity, we omit an explicit notational dependence of Y t and Q t on y.
Lemma 4.1.The joint proposal distribution ) may be ex- 9 One solution would be to smooth our estimates of the inclusion probabilities, adding a small ε to ensure that we do not divide by zero, but the authors find our next approach to be more methodologically sound. 10This proposal distribution can be realized through a minor modification of our algorithm in §3, where w(y) corresponding to y ≤t is placed at the beginning and added to Yt deterministically.
pressed in terms of P as follows: where we define as the joint probability of the beams Y 1 , . . ., Y T under the original distributions Q t .We omit that both P and P are conditioned on Y 0 .
In terms of computation, Eq. ( 22) makes use of the fact that the per-time-step inclusion probability π Qt (y ≤t ) for a given Q t can be computed efficiently with dynamic programming using the following identity: For completeness, we give pseudocode in App. C. Given samples Y (m) T ∼ P for P defined in Eq. ( 23) with respect to a given y, we propose the following unbiased estimator of inclusion probabilities: where y ≤t is a prefix of y.One simple derivation of Eq. ( 25) is as an importance sampler.We start with the equality given in Eq. ( 18) and perform the standard algebraic manipulations witnessed in importance sampling: where equality (i) above follows from Lemma 4.1.This derivation serves as a simple proof that Eq. ( 25) inherits unbiasedness from Eq. ( 17).
Proposition 4.2.Eq. (25) has the following two properties: i) π IS P is an unbiased estimator of π P ; ii) The estimator of the inverse inclusion probabilities 1/ π IS P (y) is consistent with the following upper bound on the asymptotic variance: where we assume that the following bound: holds for all Y 1 , . . ., Y T .
Proposition 4.2 tells us that we can use Eq. ( 25) to construct a consistent estimator of the inverse inclusion probabilities.Moreover, assuming Pr (y ∈ Y T ) > 0, then we have that the importance sampling yields an estimate π IS P (y) > 0, unlike the Monte Carlo estimator π MC P (y).We further see that, to the extent that T t=1 π Qt (y ≤t | Y t−1 ) approximates π P (y), then we may expect the variance of Eq. ( 25) to be small-specifically in comparison to the naïve Monte Carlo estimator in Eq. ( 19)-which is often the case for estimators built using importance sampling techniques when a proposal distribution is chosen judiciously (Rubinstein and Kroese, 2016).Thus, given our estimator in Eq. ( 25), we can now construct a practically useful estimator for E y∼p [f (y)] using the HT estimator in Eq. ( 17).In the next section, we observe that this estimator is quite efficient in the sequence model setting.

Experiments
We repeat the analyses performed by Kool et al. (2019), running experiments on neural machine translation (NMT) models; for reproducibility, we use the pretrained Transformer model for WMT'14 (Bojar et al., 2014) English-French made available by fairseq11 (Ott et al., 2019).We evaluate on the En-Fr newstest2014 set, containing 3003 sentences.Further details can be found in App.D. Our implementation of CPSBS modifies the beam search algorithm from the fairseq library.We additionally consider the beam search, stochastic beam search, diverse beam search, and ancestral sampling algorithms available in fairseq.

Statistical Estimators for Language Generation Models
Estimators have a large number of applications in machine learning.For example, the REIN-FORCE algorithm (Williams, 1992) constructs an estimator for the value of the score function; minimum-Bayes risk decoding (Kumar and Byrne, 2004) uses an estimate of risk in its optimization problem.In this section, we compare estimators for sentence-level BLEU score and conditional model entropy for NMT models.Notably, NMT models that are trained to minimize cross-entropy with the empirical distribution 12 are not peaky distributions (Ott et al., 2018a;Eikema and Aziz, 2020); thus, standard estimation techniques, e.g., Monte Carlo, should generally provide good results.However, we can vary the annealing parameter of our model in order to observe the behavior of our estimator with both high-and low-entropy distributions, making this a comprehensive case study.Here the annealed model distribution is computed as where we should expect a standard Monte Carlo estimator to provide good results at τ close to 1 when p is naturally high entropy.We test our estimator in this setting so as to give a comparison in a competitive setting.Specifically, we assess the performance of our estimator of E y∼p(y|x) [f (y)] given in Eq. ( 17)-using inclusion probability estimates from Eq. ( 25) with M = 1 and with importance weight normalization-in comparison to three other approaches: Monte Carlo (MC) sampling, the sum-and-sample (SAS) estimator, and stochastic beam search (SBS).
Monte Carlo.Under the Monte Carlo sampling scheme with sample size K, we estimate the expected value of f under our model using Eq. ( 16) with a sample y (1) , . . ., y (K) i.i.d.∼ p.
Sum and Sample.The sum-and-sample estimator (Botev et al., 2017;Liu et al., 2019;Kool et al., 2020) is an unbiased estimator that takes as input a deterministically chosen set Y of size K − 1 and 12 Label-smoothing (Szegedy et al., 2016) is typically also employed, which leads to even higher entropy distributions.Figure 1: BLEU score estimates for three different sentences using estimators for respective decoding methods.τ indicates scaling temperature; τ values and sentences are chosen to mimic (Kool et al., 2019).
samples an additional y from the remaining elements, supp(p) \ Y , where we obtain the set Y using beam search in our experiments.Formally, the SAS estimator can be written as: Stochastic Beam Search.Stochastic beam search (Kool et al., 2019(Kool et al., , 2020) is a SWOR algorithm likewise built on top of beam search.The algorithm makes use of truncated Gumbel random variables at each iteration, resulting in a sampling design equivalent to performing the Gumbeltop-k trick (Vieira, 2014) on the distribution p.
Estimators built using SBS likewise follow the Horvitz-Thompson scheme of Eq. ( 17); we refer the reader to the original work for inclusion probability computations.They suggest normalizing the estimator by the sum of sample inclusion probabilities to help reduce variance; we therefore likewise perform this normalization in our experiments.
To assess the error of our estimator, we compute its root MSE (RMSE) with respect to a baseline result.While computing the exact value of an expectation is typically infeasible in the sequence model setting, we can average our (unbiased) MC estimator in Eq. ( 16) over multiple runs to create a good baseline.Specifically, we compute our MC estimator 50 times for a large sample size (K = 200); variances are reported in App.D.
Probabilistic models for language generation typically have large vocabularies.In this setting, the computation of Eq. ( 12) is inefficient due to the large number of items in the set that are assigned very small probability under the model.We experiment with truncating this distribution such that the set of possible extensions of a sequence consist only of the highest probability tokens within the core n% of probability mass (0.99 in our experiments), similar to the process in nucleus sampling (Holtzman et al., 2020).We compare this approach to the original algorithm design in App.D and see that empirically, results are virtually unchanged; the following results use this method.We also compare the decoding time of different sampling methods in Fig. 7.
BLEU Score Estimation.BLEU (Papineni et al., 2002) is a widely used automatic evaluation metric for the quality of machine-generated translations.Estimates of BLEU score are used in minimum risk training (Shen et al., 2016) and reinforcement learning-based approaches (Ranzato et al., 2016) to machine translation.As such, accurate and lowvariance estimates are critical for the algorithms' performance.Formally, we estimate the expected value of f (y) = BLEU(x, y), whose dependence on x we leave implicit, under our NMT model p for reference translation x.For comparison, we use the same sentences and similar annealing tempera-  tures 13 τ evaluated by Kool et al. (2019).We repeat the sampling 20 times and plot the value and standard deviation (indicated by shaded region) of different estimators in Fig. 1.From Fig. 1, we can see that CPSBS has lower variance than our baseline estimators across all temperatures and data points. 14 Especially in the low temperature setting, our estimator converges rapidly with minor deviation from the exact values even for small sample sizes.Additionally, in Fig. 2a we see that the RMSE is typically quite low except at higher temperatures.In such cases, we observe the effects of biasedness, similar to Kool et al. (2019)'s observations.
Conditional Entropy Estimation.We perform similar experiments for estimates of a model's conditional entropy, i.e., f (y) = − log p(y | x), whose dependence on x we again leave implicit.We show results in Fig. 2b, with plots of the value in App.D since results are quite similar to Fig. 1.We see further confirmation that our estimator built on CPSBS is generally quite efficient.

Diverse Sampling
We show how CPSBS can be used as a diverse set sampling design for language generation models.We generate a sample of translations Y T ∼ P , i.e., according to the CPSBS scheme, where weights are set as w n = p(y (n) ≤t )/(1 − p(y (n) ≤t )) at each time step, as recommended in §3.In Fig. 3, we show the trade-off between minimum, average and maximum sentence-level BLEU score (as a quality 13 Results for τ = 0.05 converged rapidly for all estimators, thus not providing an interesting comparison. 14The sampling distribution at n = 1 is not the same across strategies, hence the difference in variances even at n = 1.measure) and n-gram diversity, where we define ngram diversity D as the average fraction of unique vs. total n-grams for n = 1, 2, 3, 4 in a sentence: Metrics are averaged across the corpus.We follow the experimental setup of Kool et al. (2019)

Conclusion
In this work, we present conditional Poisson stochastic beam search, a sampling-withoutreplacement strategy for sequence models.Through a simple modification to beam search, we turn this mainstay decoding algorithm into a stochastic process.We derive a low-variance, consistent estimator of inclusion probabilities under this scheme; we then present a general framework for using CPSBS to construct statistical estimators for expectations under sequence models.
In our experiments, we observe a reduction in mean square error, and an increase in sample efficiency, when using our estimator in comparison to several baselines, showing the benefits of CPSBS.

D Experimental Setup and Additional Results
We a Transformer-based model trained according to Ott et al. (2018b) on WMT'14 English-French dataset. 16We use the pre-trained model checkpoints made available by fairseq. 17Data preprocessing steps, model hyperparameters and baseline performances can be found in the original work and on the fairseq website.All evaluations are performed on the wmt14.v2.en-fr.newstest2014version of the newstest data set.We show additional results using the setup in §5 in Figs. 4 to 6.We provide an empirical runtime analysis in Fig. 7. Table 2 shows the variance of baseline estimator value for the three sentences used in RMSE experiments.Sentence# 1500 0.00 0.00 0.03 0.08 0.00 0.01 0.07 0.10 Sentence# 2000 0.04 0.04 0.08 0.00 0.00 0.00 0.02 Sentence# 2500 0.07 0.09 0.25 0.84 0.00 0.01 0.03 0.04 the integers {1, . . ., N }.
RMSE of BLEU score estimator for different temperatures.Results are averaged across several sentences.RMSE of conditional model entropy estimator for various temperatures.Results are averaged across several sentences.We see a larger bias under both CPSBS and SBS at higher temperatures in these experiments.

Figure 4 :Figure 5 :
Figure4: Entropy estimates for three different sentences using estimators for respective decoding methods.indicates scaling temperature.Values are chosen to mimic(Kool et al., 2019).

Figure 6 :
Figure6: BLEU score for CPSBS both with and without truncation of the sampling distribution.We see that our estimator with truncation provides virtually the same results.

Figure 7 :
Figure 7: A comparison between decoding time of different sampling methods.The y-axis shows the average decoding time of the three sentences as before.The x-axis shows the number of samples taken for each sentence.All methods are tested on CPU.
Add the n th element of B t to Y t with prob.
. . ., 0.8} as a means of encouraging diversity; for DiverseBS we instead vary the strength parameter in the same range.Interestingly, we see that temperature has virtually no effect on the diversity of the set of results returned by CPSBS.Despite this artifact, for which the authors have not found a theoretical justification, 15 the set returned by CPSBS is still overall more diverse (position on x-axis) than results returned by DiverseBS and reflect better min, max, and average BLEU in comparison to random sampling.SBS provides a better spectrum for the diversity and BLEU tradeoff; we thus recommend SBS when diverse sets are desired.

Table 2 :
Variance of baseline estimator (MC for k = 200 in 50 iterations) for the three sentences.