Importance sampling the union of rare events with an application to power systems analysis

This paper presents a method for estimating the probability $\mu$ of a union of $J$ rare events. The method uses $n$ samples, each of which picks one of the rare events at random, samples conditionally on that rare event happening and counts the total number of rare events that happen. We call it ALORE, for `at least one rare event'. The ALORE estimate is unbiased and has a coefficient of variation no larger than $\sqrt{(J+J^{-1}-2)/(4n)}$. The coefficient of variation is also no larger than $\sqrt{(\bar\mu/\mu-1)/n}$ where $\bar\mu$ is the union bound. Our motivating problem comes from power system reliability, where the phase differences between connected nodes have a joint Gaussian distribution and the $J$ rare events arise from unacceptably large phase differences. In the grid reliability problems, even events defined by $5772$ constraints in $326$ dimensions, with probability below $10^{-22}$, are estimated with a coefficient of variation of about $0.0024$ using only $n=10{,}000$ sample values.


Introduction
In this paper we consider a mixture importance sampling strategy to estimate the probability that one or more of a set of rare events takes place. Our main method is to repeatedly choose a rare event at random, and then sample the system conditionally on that one event taking place. The result is an importance sampler with a sampling standard deviation that is no more than some modest multiple of the rare event probability, even in settings where that probability tends to zero.
Our motivating context is the reliability of the electrical grid when subject to random inputs, such as variable demand by users and variable production, as occurs at wind farms. The rare events describe unacceptably large electrical phase differences at pairs of connected nodes in the grid.
It is common to use a simplified linear direct current (DC) model of the electrical grid, because the equations describing alternating current (AC) are significantly more difficult to work with, and some authors (e.g., Van den Bergh et al. (2014)) find that there is little to be gained from the complexity of an AC model. This DC model is presented in Sauer and Christensen (1984) and Stott et al. (2009). It is also common to model the randomness in the grid as Gaussian, especially over short time horizons. We make both of these simplifications, linearity and Gaussianity, for this paper.
The probability we consider can be written
$$\mu = \Pr\Bigl(\bigcup_{j=1}^J H_j\Bigr), \qquad(1)$$
where $H_1,\dots,H_J$ are the rare events. Section 2 introduces more notation for the event (1) and develops an importance sampling estimator $\hat\mu$ for the generally intractable quantity $\mu$. An upper bound for $\mu$ is $\bar\mu = \sum_{j=1}^J \Pr(H_j)$, the union bound. Theorem 1 proves that the importance sampling estimator has variance at most $\mu(\bar\mu - \mu)/n$ when $n$ IID samples are used. Section 3 discusses some further sampling properties of our estimator that hold without the Gaussian assumption. When there are $J$ events, the variance of $\hat\mu$ is at most $(J + 1/J - 2)\mu^2/(4n)$ when the system is sampled $n$ times. Section 4 compares our importance sampler to a state of the art code (Genz et al., 2017) for estimating the probability that a multivariate Gaussian of up to 1000 variables with arbitrary covariance belongs to a given hyperrectangle. Our algorithm is simpler and extends to higher dimensions. When we studied rare event cases, our algorithm was more accurate. In our examples that are not rare events, the other algorithm was more accurate. Section 5 describes the power system application. Section 6 contains some discussion. The appendix proves Theorem 1 for any set of $J$ events, not just those given by a Gaussian distribution. The theorem applies so long as we can sample conditionally on any one event $H_j$ and then determine which other events $H_\ell$ also occur.
We call our method ALORE, for "at least one rare event", because the sampler only samples from ∪ j H j . One common way for rare event sampling to become inaccurate is that we might fail to obtain any points where the rare event happens. That leads to a severe under-estimation of the rare event probability. In ALORE, the corresponding problem is the failure to sample any points where two or more of the rare constituent events occur. In that case ALORE will return the union bound as the estimated rare event probability instead of zero. That is also a setting where the union bound is likely to be a good approximation. So ALORE is robust against severe underestimates of the rare event probability. The second common problem for rare event sampling is an extreme value of the likelihood ratio weighting applied to the observations. In ALORE, the largest possible weight is only J times as large as the smallest one.
Our sampler is closely related to instanton methods in power systems engineering (Chertkov et al., 2011a,b; Kersulis et al., 2015). Out of all the configurations of random inputs to a system, the most probable one causing the failure is called the instanton. When there are thousands of failure types there are correspondingly thousands of instantons, each one a conditional mode of the distribution of $x$. Our initial thought was to do importance sampling from a mixture of distributions, with each mixture component defined by shifting the Gaussian distribution's mean to an instanton. What we do instead is sample conditionally on the rare event happening. By sampling in a half-space we avoid wasting samples outside the failure region. By conditioning instead of shifting, we get better control over the likelihood ratio in the importance sampler. Each instanton remains the most probable point in its half-space under this sampling.
Our sampler is a form of multiple importance sampling. Multiple importance sampling originated in computer graphics (Lafortune and Willems, 1993; Veach and Guibas, 1994). Owen and Zhou (2000) found a useful way to combine it with control variates defined by the mixture components. Elvira et al. (2015a,b) investigate the computational efficiency of some mixture importance sampling and weighting strategies.
We do not consider self-normalized importance sampling (SNIS) in this paper. SNIS is useful in settings where we can compute an unnormalized version of our target density but cannot sample from it efficiently, if at all. This is common in Bayesian applications (Liu, 2001, Chapter 2). For a recent adaptive version of SNIS, see Cornuet et al. (2012). In estimating rare event probabilities, the optimal sampler for SNIS can be shown to place half of its probability in the rare event and half outside the rare event, while the optimal plain IS estimator places all of its probability on the rare event. The best coefficient of variation for rare event estimation under SNIS is asymptotically $2/\sqrt{n}$. Ordinary importance sampling can attain much smaller variances, and so we focus on it for the rare event problem.

Gaussian case
For concreteness, we present the algorithm first for Gaussian random variables. We let $x \in \mathbb{R}^d$ have the standard Gaussian distribution, $N(0, I)$, deferring general Gaussians to Section 2.1. We are interested in computing the probability that $x$ lies outside a polytope $\mathcal{P}$. In our motivating applications, the interior of the polytope defines a safe operating region and we assume that $x \notin \mathcal{P}$ is a rare event. For $j = 1, \dots, J$, define half-spaces $H_j = \{x \in \mathbb{R}^d \mid \omega_j^Tx \ge \tau_j\}$, where $\omega_j$ is a unit vector and $\tau_j \in \mathbb{R}$, so that $\mathcal{P} = \{x \mid \omega_j^Tx < \tau_j,\ j = 1, \dots, J\}$. The set $\mathcal{P}$ is convex and not necessarily bounded. Ordinarily $\tau_j > 0$, because we are interested in rare events.
The setting is illustrated in Figure 1 for $J = 6$ half-spaces. In that example, two of the half-spaces have their conditional modes inside the union of the other half-spaces. One of those half-spaces is entirely included in the union of the others. Letting $P_j = \Pr(x \in H_j) = \Phi(-\tau_j)$, we know that
$$\max_{1\le j\le J} P_j =: \underline{\mu} \le \mu \le \bar\mu := \sum_{j=1}^J P_j.$$
The right hand side is the union bound. Sometimes that bound is very conservative, and other times it is quite accurate. For a given high dimensional problem we might not know whether the bound is close. Therefore we seek a more accurate estimate of $\mu$. We will need to use some inclusion-exclusion formulas, so some notation for these follows. For any $u \subseteq 1{:}J \equiv \{1, 2, \dots, J\}$, let $H_u = \cup_{j\in u} H_j$, so $H_j = H_{\{j\}}$ and by convention $H_\emptyset = \emptyset$. We identify the set $H_u$ with the indicator function $H_u(x) = 1\{x \in H_u\}$, and let $S(x) = \sum_{j=1}^J H_j(x)$ count the number of rare events that happen. For $s = 0, 1, \dots, J$, let $T_s = \Pr(S = s)$ give the distribution of $S$. We use $|u|$ for the cardinality of $u$.
Our estimand is
$$\mu = \Pr(x \in H_{1:J}) = \sum_{\emptyset \ne u \subseteq 1:J} (-1)^{|u|+1}\,\Pr\Bigl(x \in \bigcap_{j\in u} H_j\Bigr) \qquad(2)$$
by inclusion-exclusion. Our approach to estimating $\mu$ is through mixture importance sampling. For $j = 1, \dots, J$, let $q_j = \mathcal{L}(x \mid \omega_j^Tx \ge \tau_j)$ and take $q_0 = p = N(0, I)$, with $P_0 = 1$ and $H_0(x) = 1$. The probability density functions for $j > 0$ are $q_j(x) = p(x)H_j(x)/P_j$, while $q_0 = p$. Let $\alpha_0, \alpha_1, \dots, \alpha_J$ be nonnegative numbers summing to 1, and $q_\alpha = \sum_{j=0}^J \alpha_j q_j$. The mixture importance sampling estimate of $\mu$ based on $n$ draws $x_i \sim q_\alpha$ is
$$\hat\mu_\alpha = \frac{1}{n}\sum_{i=1}^n \frac{H_{1:J}(x_i)\,p(x_i)}{q_\alpha(x_i)}. \qquad(3)$$
Here the $\alpha_0$ term represents a defensive mixture component (Hesterberg, 1995). Although the inclusion-exclusion formula (2) contains $2^J - 1$ nonzero terms, each summand in the unbiased estimate (3) can be computed at cost $O(J)$. It is interesting here to discard the defensive component, taking $\alpha_0 = 0$. Now with $\alpha_j = \alpha_j^* = P_j/\bar\mu$, we have $q_{\alpha^*}(x) = p(x)S(x)/\bar\mu$ and we get
$$\hat\mu_{\alpha^*} = \frac{\bar\mu}{n}\sum_{i=1}^n \frac{1}{S(x_i)}, \qquad(4)$$
because $H_{1:J}(x) = 1$ always holds for $x \sim q_{\alpha^*}$. The estimate (4) is a multiplicative adjustment to the union bound $\bar\mu$. The terms $S(x_i)^{-1}$ range from 1 down to $1/J$ and so we will never get $\hat\mu_{\alpha^*}$ larger than the union bound or smaller than $\bar\mu/J$. This is convenient because $\bar\mu \ge \mu \ge \bar\mu/J$ always holds.
Theorem 1. Let $\hat\mu_{\alpha^*}$ be given by (4). Then
$$E(\hat\mu_{\alpha^*}) = \mu \qquad(5)$$
and
$$\mathrm{Var}(\hat\mu_{\alpha^*}) = \frac{1}{n}\Bigl(\bar\mu\sum_{s=1}^J \frac{T_s}{s} - \mu^2\Bigr) \le \frac{\mu(\bar\mu - \mu)}{n}. \qquad(6)$$
Proof. See the appendix, where this is proved for a general set of $J$ events, not necessarily from Gaussian half-spaces.
The union bound can be written $\bar\mu = E(S) = \sum_{s=1}^J sT_s$, and so we may write (6) as
$$\mathrm{Var}(\hat\mu_{\alpha^*}) = \frac{1}{n}\biggl(\Bigl(\sum_{s=1}^J sT_s\Bigr)\Bigl(\sum_{s=1}^J \frac{T_s}{s}\Bigr) - \Bigl(\sum_{s=1}^J T_s\Bigr)^2\biggr). \qquad(7)$$

General Gaussians
Now suppose that we are given $y \sim N(\eta, \Sigma)$ and the half-spaces are defined by $\gamma_j^Ty \ge \kappa_j$. We assume that $\Sigma$ is nonsingular. If it is not, then we can reduce $y$ to a subset of components whose variance is nonsingular, and write the other components as linear functions of this reduced set. We also assume that we can afford to take a matrix square root $\Sigma^{1/2}$. Now $x = \Sigma^{-1/2}(y - \eta) \sim N(0, I)$, and $y = \eta + \Sigma^{1/2}x$. Then the half-spaces are given by $\omega_j^Tx \ge \tau_j$ with
$$\omega_j = \frac{\Sigma^{1/2}\gamma_j}{\|\Sigma^{1/2}\gamma_j\|} \quad\text{and}\quad \tau_j = \frac{\kappa_j - \gamma_j^T\eta}{\|\Sigma^{1/2}\gamma_j\|}.$$
For rare events, we will have $\kappa_j > \gamma_j^T\eta$.
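The following is a minimal R sketch of this reduction, not the paper's code. It assumes Gamma holds the $\gamma_j$ as its columns; the function name and arguments are illustrative.

```r
# Reduce half-spaces gamma_j' y >= kappa_j for y ~ N(eta, Sigma)
# to unit-vector form omega_j' x >= tau_j for x ~ N(0, I).
standardize <- function(Gamma, kappa, eta, Sigma) {
  E <- eigen(Sigma, symmetric = TRUE)
  Sroot <- E$vectors %*% (sqrt(E$values) * t(E$vectors))  # Sigma^{1/2}
  W <- Sroot %*% Gamma                     # columns Sigma^{1/2} gamma_j
  len <- sqrt(colSums(W^2))
  list(Omega = sweep(W, 2, len, "/"),      # unit vectors omega_j
       tau = (kappa - drop(t(Gamma) %*% eta)) / len)
}
```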

Sampling algorithms
We want to sample $x \sim N(0, I)$ conditionally on $x^T\omega \ge \tau$ for a unit vector $\omega$ and scalar $\tau$. We can use the following steps:
1. Sample $z \sim N(0, I_d)$.
2. Sample $u \sim U(0, 1)$.
3. Set $y = \Phi^{-1}(\Phi(\tau) + u(1 - \Phi(\tau)))$.
4. Deliver $x = \omega y + (I - \omega\omega^T)z$.
Step 3 generates a truncated Gaussian by a 'complementary inversion' method where ordinary inversion of the conditional CDF would use $1 - u$ in place of $u$. If a random number generator produces values below the machine epsilon then complementary inversion has better resolution of extreme $y$ values than inversion has.
The algorithm above can be problematic numerically when $\Phi(\tau)$ is close to 1, as it will be for very rare events. For instance, in the R language (R Core Team, 2015), $\Phi(10)$ yields 1 and then $\Phi^{-1}(\Phi(10) + u(1 - \Phi(10)))$ yields $\infty$ for any $u$. Some of our electrical grid examples have $\max_j \tau_j > 10^{10}$. That is, some of the potential failure modes are virtually impossible.
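A two-line R illustration of the failure just described:

```r
u <- 1e-12
pnorm(10)                               # 1: the true value rounds to 1
qnorm(pnorm(10) + u * (1 - pnorm(10)))  # Inf, for any u in (0, 1)
```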
Because $\tau > 0$ might be quite large, we get better numerical stability by sampling $x \sim N(0, I)$ conditionally on $x^T\omega \le -\tau$ and then delivering $-x$. The steps are as follows:
1. Sample $z \sim N(0, I_d)$.
2. Sample $u \sim U(0, 1)$.
3. Set $y = \Phi^{-1}(u\,\Phi(-\tau))$.
4. Deliver $x = -(\omega y + (I - \omega\omega^T)z)$.
Even a very small $u = 10^{-12}$ combined with $\tau = 10$ yields $y = \Phi^{-1}(10^{-12}\,\Phi(-10)) \approx -12.4$ without any underflow in the R language (R Core Team, 2015). In cases with some $\tau_j \ge 10^{10}$ we will ordinarily get $P_j = 0$ and then never sample conditionally on the corresponding $H_j$. We compute step 4 via $x = -(\omega y + z - \omega(\omega^Tz))$ to avoid a potentially expensive multiplication $(I - \omega\omega^T)z$.
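The following R function is a minimal sketch, not the paper's code, combining the stable sampler above with the estimator (4). It assumes the half-spaces arrive as a matrix Omega whose $J$ unit columns are the $\omega_j$ and a vector tau of thresholds; these names are illustrative.

```r
alore <- function(Omega, tau, n) {
  J <- ncol(Omega); d <- nrow(Omega)
  P <- pnorm(-tau)                        # P_j = Phi(-tau_j)
  mubar <- sum(P)                         # union bound
  jsamp <- sample.int(J, n, replace = TRUE, prob = P)
  Sinv <- numeric(n)
  for (i in 1:n) {
    w <- Omega[, jsamp[i]]
    y <- qnorm(runif(1) * pnorm(-tau[jsamp[i]]))   # y <= -tau_j, stably
    z <- rnorm(d)
    x <- -(w * y + z - w * sum(w * z))    # x = -(w y + (I - w w')z)
    Sinv[i] <- 1 / sum(colSums(Omega * x) >= tau)  # 1/S(x), S >= 1
  }
  c(est = mubar * mean(Sinv), se = mubar * sd(Sinv) / sqrt(n))
}
```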

Importance sampling properties
As shown in the Appendix, Theorem 1 holds more generally than the Gaussian case. In this more general setting, we have $J$ events, $H_j$, on a common sample space $\mathcal{X}$ where $x \in \mathcal{X}$ has probability density $p$. Event $H_j$ has probability $P_j$. We want $\mu = \Pr(H)$ where $H = \cup_j H_j$. The union bound is $\mu \le \bar\mu = \sum_j P_j$. We assume that $0 < \bar\mu < \infty$. The upper bound only has to be checked if $J = \infty$. If $\bar\mu = 0$, then we know $\mu = 0$ without any sampling. When we sample, we ensure that at least one rare event takes place every time, by first picking an event $H_j$ with probability proportional to $P_j$. Then we sample $x \in \mathcal{X}$ conditionally on $H_j$ and find $S(x) = \sum_{\ell=1}^J H_\ell(x)$, the total number of events that occur. This count includes $H_j$ and so our sample values always have $S(x_i) \ge 1$. The importance sampling estimate $\hat\mu_{\alpha^*}$ averages $\bar\mu/S(x_i)$ over $n$ independent replicates. As in the prior section, we use $T_s = \Pr(S = s)$ for the probability of exactly $s$ events happening. Then the variance of $\hat\mu_{\alpha^*}$ is given by (6).
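In this general setting the algorithm needs only conditional samplers and an event counter. A hedged R sketch, where P (the vector of $P_j$), rcond(j) (a draw from $p$ conditionally on $H_j$) and Scount(x) (the number of events that occur at $x$) are hypothetical user-supplied ingredients:

```r
alore_general <- function(P, rcond, Scount, n) {
  mubar <- sum(P)                                   # union bound
  j <- sample.int(length(P), n, replace = TRUE, prob = P)
  Sinv <- vapply(j, function(jj) 1 / Scount(rcond(jj)), numeric(1))
  c(est = mubar * mean(Sinv), se = mubar * sd(Sinv) / sqrt(n))
}
```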
The optimal importance sampling distribution for estimating $\mu$ is $p$ conditioned on the event $H$, that is, the density $p(x)H(x)/\mu$ supported on $H = \{x \mid H(x) = 1\}$. Sampling from this distribution would yield an estimate with variance zero. Not surprisingly, we are seldom able to do that in applications. Our ALORE sampler takes $x \in \mathcal{X}$ with probability density proportional to $p(x)S(x)$, so it has support set $H$.
We think that many applications will have events H j that rarely co-occur. In that case S(x) is nearly constant at 1 for x ∈ H, and then our sampler is close to the optimal one. Other applications may have a few near duplicated events that co-occur often. One extreme setting has a common cause that triggers all J events at once and those events almost never arise outside of that common situation. In that case S(x) is again nearly constant on H, this time usually equal to J, and our sampler is again nearly optimal.
The variance bound $\mu(\bar\mu - \mu)/n$ from (6) can be conservative. It stems from the inequality $T_s/s \le T_s$ for $s \ge 1$. If $\Pr_{\alpha^*}(S > 1)$ is appreciably large then the variance is less than our bound. Because $\bar\mu \le J\mu$ we also have
$$\mathrm{Var}(\hat\mu_{\alpha^*}) \le \frac{(J-1)\mu^2}{n}. \qquad(8)$$
Equality in (8) only happens if $\bar\mu = J\mu$. Then the $J$ events are almost surely the same and $\mathrm{Var}(\hat\mu_{\alpha^*}) = 0$. Therefore strict inequality always holds in equation (8).
The trivial exception of $\mu = 0$ is not possible because we have $\bar\mu > 0$. We can sharpen the bound (8) by using the following lemma.
Lemma 1. Let $S$ be a random variable supported on $\{1, 2, \dots, J\}$ for $J \in \mathbb{N}$. Then
$$E(S)\,E(S^{-1}) \le \frac{(1+J)^2}{4J}, \qquad(9)$$
with equality if and only if $S \sim U\{1, J\}$.
Proof. See the appendix.
Lemma 1 tells us that for $J \ge 2$, our worst case setting is one where half of the time that one or more events happen, exactly one happens, and half of the time all $J$ of them happen. We might have guessed this, because with $\bar\mu = E(S)$ fixed at some level, the variance of $S^{-1}$ is maximized when $S^{-1}$ takes only the two extreme values 1 and $J^{-1}$ with equal probability. The situation is a bit more complicated because $E(S)$ and $E(S^{-1})$ both depend on the distribution of $S$ and we need to bound the product $E(S)E(S^{-1})$, where both expectations are under $q_{\alpha^*}$.
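As a quick numerical check of Lemma 1 in R (the choice $J = 5$ is arbitrary):

```r
# The two-point distribution with mass 1/2 at 1 and at J attains the
# bound (1 + J)^2 / (4 J), which is 1.8 at J = 5.
J <- 5; S <- c(1, J); prob <- c(0.5, 0.5)
sum(prob * S) * sum(prob / S)     # 1.8, matching (1 + J)^2 / (4 J)
```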
From Theorem 2 and Lemma 1, we get
$$\frac{\mathrm{Var}(\hat\mu_{\alpha^*})}{\mu^2} \le \frac{J + J^{-1} - 2}{4n}, \qquad(10)$$
because $T_s/\mu$ is a probability distribution on $\{1, 2, \dots, J\}$. This bound describes relative error without reference to the union bound $\bar\mu$. A rare event estimator has bounded relative error if $\mathrm{Var}(\hat\mu)/\mu^2$ remains bounded as one takes the limit in a sequence of problems (Asmussen and Glynn, 2007, Chapter VI). The sequence is typically one where the event of interest becomes increasingly rare. Equation (10) shows a bounded relative error property for ALORE in any sequence with $J/n$ uniformly bounded and $\mu > 0$.
When we want an upper bound in terms of $\bar\mu$, the right hand side of (6) is maximized over $\mu$ at $\mu = \bar\mu/2$, yielding
$$\mathrm{Var}(\hat\mu_{\alpha^*}) \le \frac{\bar\mu^2}{4n}. \qquad(11)$$
Using $\mu \le \bar\mu$ in (10) yields
$$\mathrm{Var}(\hat\mu_{\alpha^*}) \le \frac{(J + J^{-1} - 2)\bar\mu^2}{4n}. \qquad(12)$$
This is only an improvement on (11) when $J = 2$. The equality $\mu = \bar\mu$ cannot actually hold under the conditions where equality holds in (10). Now suppose that $f(x)$ is supported on $H$ and we want to estimate
$$\nu = E(f(x)) = \int f(x)p(x)\,\mathrm{d}x. \qquad(13)$$
Then by the same arguments used in the Appendix, $\hat\nu = \frac{\bar\mu}{n}\sum_{i=1}^n f(x_i)/S(x_i)$ is unbiased for $\nu$, with $\mathrm{Var}(\hat\nu) = \bigl(\bar\mu\,E(f(x)^2/S(x)) - \nu^2\bigr)/n$. If $f(x) \in \{0, 1\}$, then $\mathrm{Var}(\hat\nu) \le \nu(\bar\mu - \nu)/n$. That is, when $f$ describes a rare event that can only occur if one or more of the $H_j$ also occur, we can reduce its Monte Carlo variance from $\nu(1 - \nu)/n$ to at most $\nu(\bar\mu - \nu)/n$, in cases where $\bar\mu < 1$. In the Gaussian context, such an $f$ is only nonzero outside the polytope $\mathcal{P}$. Now suppose that $f(x) = g(S(x))$ only depends on $S$ and $g(0) = 0$. For instance, in some applications we might want the probability that $S(x) > k$ where $k \ge 1$ is the largest acceptable number of failure events. Then $\hat\nu = \frac{\bar\mu}{n}\sum_{i=1}^n g(S(x_i))/S(x_i)$ is unbiased for $\nu = \sum_{s=1}^J g(s)T_s$.
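For example, estimating $\Pr(S(x) > k)$ takes $g(s) = 1\{s > k\}$ and reuses the ALORE draws. A one-line R sketch, with the illustrative names Svals (holding the $S(x_i)$ for draws from $q_{\alpha^*}$) and mubar (the union bound):

```r
# Estimate nu = Pr(S(x) > k) for k >= 1 from existing ALORE draws.
nu_hat <- function(Svals, mubar, k) mubar * mean((Svals > k) / Svals)
```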

Comparisons
Here we consider some numerical examples comparing ALORE to pmvnorm from the R package mvtnorm (Genz et al., 2017). This package can make use of special properties of the Gaussian distribution, and it works in high dimensions. We begin by describing mvtnorm, based on Genz and Bretz (2009) and a personal communication from Alan Genz. The program computes $\Pr(a_i \le y_i \le b_i,\ i = 1, \dots, d)$ for $y \sim N(\eta, \Sigma)$, where $-\infty \le a_i \le b_i \le \infty$ and $\Sigma$ can be rank deficient. We can use it to compute $\mu = \Pr(x \notin \mathcal{P}) = 1 - \Pr(\omega_j^Tx \le \tau_j,\ j = 1, \dots, J)$, because the vector of $\omega_j^Tx$ values is multivariate Gaussian with covariance $\Omega^T\Omega$, where $\Omega = (\omega_1, \dots, \omega_J)$. The code can handle dimensions up to 1000. In our context, that means at most $J = 1000$ half-spaces. The dimension $d$ can be higher. The related pmvt function handles multivariate $t$ random variables. The code has three different algorithms in it. One from Genz (2004) handles two and three dimensional semi-infinite regions, one from Miwa et al. (2003) is for dimensions up to 20, and the rest are handled by an algorithm from Genz and Bretz (2009). This latter algorithm uses a number of methods. It uses randomized Korobov lattice rules as described by Cranley and Patterson (1976) for the first 100 dimensions, in conjunction with antithetic sampling. There are usually 8 randomizations. For more than 100 dimensions it applies a method from Niederreiter (1972). There is a series of increasing sample sizes in use, and the method provides an estimated error (3.5 standard errors) based on the randomization. The approach is via sequential conditional sampling, after strategically ordering the variables (e.g., putting unconstrained ones first). The R package calls a FORTRAN program for the computation, so it is very fast. We use the default implementation which uses up to 25,000 quadrature points.
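For reference, a hedged sketch of the kind of call we use in the comparison; Omega and tau are the assumed inputs from the earlier sketches, and the $J \times J$ matrix $\Omega^T\Omega$ may be rank deficient, which pmvnorm supports:

```r
library(mvtnorm)
# Pr(x in P) for the half-space problem, then its complement mu.
mu_pmv <- 1 - pmvnorm(upper = tau, sigma = crossprod(Omega))[1]
```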
The main finding is that importance sampling is more effective when the polytope of interest is the complement of a rare event. This is not meant as a criticism of pmvnorm. That code was not specifically designed to compute the complement of a rare event. The comparison is relevant because we are not aware of alternative code tuned for the high dimensional rare event cases that we need, and pmvnorm is a well regarded and widely available general solution that seemed to us like the best off-the-shelf tool.
Our first example takes $d = 2$ with $J = 360$ half-spaces given by $\omega_j = (\cos(2\pi j/360), \sin(2\pi j/360))^T$ and a common threshold $\tau$, so that the safe polytope $\mathcal{P}(360, \tau)$ circumscribes a circle of radius $\tau$. For $J = 360$ and $\tau = 6$, we have $\mu \le \exp(-18) \doteq 1.52\times10^{-8}$. The lower bound is about 0.9995 times the upper bound, so we treat the upper bound as exact. Figure 2 shows histograms of 100 simulations of $\hat\mu/\mu$ using ALORE and using pmvnorm. In this case ALORE is much more accurate. The mean square relative error $E((\hat\mu/\mu - 1)^2)$ is about 800-fold smaller for ALORE than for pmvnorm. We also see that pmvnorm has high positive skewness and the histogram of estimates has most of its mass well below the mean. Table 1 shows summary results for this problem with different values of $\tau$. We see that pmvnorm is superior when the event is not rare but ALORE is superior for rare events. The large error for pmvnorm with $\tau = 5$ stemmed from a small number of outliers among the 100 trials.

Table 1: Estimates of $\mu$ for $\mathcal{P}(360, \tau)$ at various $\tau$. The true mean $\mu$ is very nearly $\exp(-\tau^2/2)$. Importance sampling is more accurate for large $\tau$ (rare events), while pmvnorm is more accurate for small $\tau$.
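In terms of the alore sketch given earlier, this example can be reproduced along the following lines (illustrative; the replication counts vary in our tables):

```r
ang <- 2 * pi * (1:360) / 360
Omega <- rbind(cos(ang), sin(ang))   # unit normals of the 360 half-spaces
tau <- rep(6, 360)
alore(Omega, tau, n = 1000)          # estimate is near exp(-18) = 1.52e-8
```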
It is possible that this example is artificially easy for importance sampling, due to the symmetry. Whichever half-space $H_j$ we sample, the distribution of overlapping half-spaces $H_k$ for $k \ne j$ is the same. Two half-spaces differ from $H_j$ by a one degree angle, two differ by a two degree angle, and so on. To get a more varied range of overlap patterns, we replaced the angles $2\pi j/360$ by angles $2\pi \times p(j)/360$ where $p(j)$ is the $j$th prime among the integers up to 360. There are 72 of them, of which the largest is 359. With $\tau = 6$ and 100 replications using $n = 1000$ points in importance sampling, we found the variance of $\hat\mu/\exp(-18)$ to be 0.00077. The comparable figure for mvtnorm is 8.5. There were a few outliers there, including one that was more than 6 times the union bound. The gap between the prime angle polygon and the inscribed circle is larger than the one formed by the full polygon. Pooling all the importance sampling runs gives an estimate of about $0.94 \times \exp(-18)$ for $\mu$. In this example, we see importance sampling working quite well without symmetry.

High dimensional half-spaces
The previous example was low dimensional and each of the half-spaces sampled had numerous similar ones, differing in angle by a small number of degrees. Thus $\mu$ was quite a bit smaller than $\bar\mu$. Here we consider a high dimensional setting where the half-spaces have less overlap.
Two uniform random unit vectors $\omega_1$ and $\omega_2$ in $\mathbb{R}^d$ are very likely to be nearly orthogonal for large $d$. Then $x^T\omega_j \ge \tau_j$ are nearly independent events. For independent events, we would have
$$\mu = 1 - \prod_{j=1}^J (1 - P_j).$$
To make $x \notin \mathcal{P}$ a rare event, the $P_j$ must be small and then the probability above will be close to the union bound. Theorem 2 predicts good performance for importance sampling here.
For this test 200 sample problems were constructed. The dimensions were chosen by $d \sim U\{20, 50, 100, 200, 500\}$. Then there were $J \sim U\{d/2, d, 2d\}$ constraints chosen with uniform random unit vectors $\omega_j \in \mathbb{R}^d$. The threshold $\tau$ was chosen so that $\log_{10}$ of the union bound was $U[-8, -4]$, followed by rounding to two significant figures. Then $\mu$ was estimated by importance sampling with $n = 1000$ samples, and by pmvnorm. Figure 3 shows the results as estimates divided by the union bound. The importance sampling value was always very close to the union bound, which in turn is essentially equal to what one would see for independent events. The values from pmvnorm were usually too small. By construction the intersection probabilities are quite rare. In importance sampling, 77.5% of the simulations had no intersections among 1000 trials and the others had only a few intersections. Therefore it is clear that the probabilities should be close to the union bounds here. We also see evidence of strong positive skewness for pmvnorm.

Figure 3: Results of 200 estimates of $\mu$ for varying high dimensional problems with nearly independent events.
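A sketch of how one of these random test problems can be constructed in R (the rounding of the union bound to two significant figures is omitted):

```r
d <- sample(c(20, 50, 100, 200, 500), 1)
J <- sample(c(d %/% 2, d, 2 * d), 1)
Omega <- matrix(rnorm(d * J), d, J)
Omega <- sweep(Omega, 2, sqrt(colSums(Omega^2)), "/")  # random unit vectors
mubar <- 10^runif(1, -8, -4)        # target union bound
tau <- rep(-qnorm(mubar / J), J)    # then sum_j pnorm(-tau_j) = mubar
```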

Model
Our power system models are based on a network of $N$ nodes, called busses. Some busses put power into the network and others consume power. The $M$ edges in the network correspond to power lines between busses. The network is ordinarily sparse, with $M$ a small multiple of $N$. The power production at bus $i$ is $p_i$, with negative values indicating consumption. For some busses, $p_i$ is tightly controlled and deterministic over the relevant time horizon. Other busses have random $p_i$ corresponding, for example, to variable consumption levels, which we treat as independent. Busses corresponding to wind farms have random power production levels with meaningfully large correlations. Our models contain one special bus $S$, called the slack bus, at which the power is $p_S = -\sum_{i \ne S} p_i$. The total power in the system is zero because transmission power losses are ignored in the DC approximation that we use.
The power at all busses can be represented by the vector $p = (p_F^T, p_R^T, p_S)^T$ corresponding to fixed busses, ordinary random busses (including any correlated ones) and the slack bus. There are $N_F$ fixed busses, $N_S = 1$ slack bus and $N_R = N - N_F - N_S$ random busses apart from the slack bus. We will use $\mathbf{1}_R$ to denote a column vector of $N_R$ ones and $I_R$ to denote the identity matrix of size $N_R$, and similarly for $\mathbf{1}_F$ and $I_F$.
The power $p_i$ at bus $i$ must satisfy the constraints $\underline{p}_i \le p_i \le \bar{p}_i$. The vector $p$ has a Gaussian distribution, determined entirely by the random components $p_R \sim N(\eta_R, \Sigma_{RR})$. Therefore in the present context, $p_R$ plays the role of the general Gaussian random variable $y$ from Section 2.1. The fixed components satisfy $p_F = \eta_F$ and then the slack bus satisfies $p_S \sim N(\eta_S, \Sigma_{SS})$ where $\eta_S = -\mathbf{1}_R^T\eta_R - \mathbf{1}_F^T\eta_F$ and $\Sigma_{SS} = \mathbf{1}_R^T\Sigma_{RR}\mathbf{1}_R$. Because all of the randomness comes from $p_R$, we will abbreviate $\Sigma_{RR}$ to $\Sigma$.
The node to node inductances in the network form a Laplacian matrix $B$, where $B_{ij} \ne 0$ only if busses $i$ and $j$ are connected, with $B_{ii} = -\sum_{j\ne i} B_{ij}$ (up to rounding). The Laplacian is symmetric and has one eigenvalue of zero for a connected network. It has a pseudo-inverse $B^+$. We partition the rows and columns of $B$ and $B^+$ into blocks corresponding to the fixed, random and slack busses, and we group $B^+$ into three sets of columns via $B^+ = (B_F^+\ \ B_R^+\ \ B_S^+)$. The phase at bus $i$ is denoted $\theta_i$. In our DC approximation of AC power flow, the phases approximately satisfy $B\theta = p$. Given the power vector $p$, we obtain the phases as $\theta = B^+p = B_F^+p_F + B_R^+p_R + B_S^+p_S$. The phase constraints on the network are $|\theta_i - \theta_j| \le \bar\theta_{ij}$ for connected busses $i$ and $j$. In our examples, all $\bar\theta_{ij} = \bar\theta$ for a single value $\bar\theta$ such as $\pi/6$ or $\pi/4$. Let $D \in \{-1, 0, 1\}^{M\times N}$ be the incidence matrix. Each edge in the network is represented by one row of $D$ with an entry of $+1$ for one of the busses on that edge and $-1$ for the other. The phase constraints are $|D\theta| \le \bar\theta\mathbf{1}$ componentwise. Now $D\theta = DB^+p$, where $p_F = \eta_F$ and $p_S = -\mathbf{1}_R^Tp_R - \mathbf{1}_F^T\eta_F$. The constraint that $D\theta \le \bar\theta\mathbf{1}$ for every edge can be written
$$D(B_R^+ - B_S^+\mathbf{1}_R^T)\,p_R \le \bar\theta\mathbf{1} - D(B_F^+ - B_S^+\mathbf{1}_F^T)\,\eta_F. \qquad(14)$$
We also have constraints $D\theta \ge -\bar\theta\mathbf{1}$, which can be written
$$-D(B_R^+ - B_S^+\mathbf{1}_R^T)\,p_R \le \bar\theta\mathbf{1} + D(B_F^+ - B_S^+\mathbf{1}_F^T)\,\eta_F. \qquad(15)$$
Equations (14) and (15) supply $2M$ constraints on the random vector $p_R$. We also have the two slack bus constraints $p_S \le \bar p_S$ and $-p_S \le -\underline{p}_S$, that is
$$-\mathbf{1}_R^Tp_R \le \bar p_S + \mathbf{1}_F^T\eta_F \quad\text{and}\quad \mathbf{1}_R^Tp_R \le -\underline{p}_S - \mathbf{1}_F^T\eta_F. \qquad(16)$$
Finally, there are individual constraints on the random busses, $\underline{p}_i \le p_i \le \bar{p}_i$ for each random bus $i$.
These are the linear constraints on $p_R$. There are $2M$ phase constraints and there are two constraints for each of the non-fixed busses, including the slack bus. These constraints can be turned into constraints given by unit vectors $\omega_j$ and thresholds $\tau_j$ on an $N(0, I_R)$ vector, as described in Section 2.1.
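To make the construction concrete, here is a hedged R sketch of assembling the $2M$ phase constraints (14) and (15) in the $\gamma_j^Tp_R \ge \kappa_j$ violation form of Section 2.1. The names B, D, Fix, Ran, Slk, thetabar and etaF are our own illustrative assumptions, not the paper's code.

```r
Bplus <- MASS::ginv(B)                  # pseudo-inverse B^+ (via SVD)
BF <- Bplus[, Fix, drop = FALSE]        # columns for fixed busses
BR <- Bplus[, Ran, drop = FALSE]        # columns for random busses
BS <- Bplus[, Slk, drop = FALSE]        # column for the slack bus
oneR <- rep(1, length(Ran)); oneF <- rep(1, length(Fix))
A <- D %*% (BR - BS %*% t(oneR))        # D theta = A p_R + b
b <- drop(D %*% (BF - BS %*% t(oneF)) %*% etaF)
Gamma <- cbind(t(A), -t(A))             # violation directions gamma_j
kappa <- c(thetabar - b, thetabar + b)  # violated when gamma_j' p_R >= kappa_j
```

The columns of Gamma, together with the slack bus and individual bus constraints, can then be standardized against $N(\eta_R, \Sigma)$ as in the earlier standardize sketch.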

Examples
We considered several model electrical grids included in the MATPOWER distribution (Zimmerman et al., 2011). In each case we modeled violations of the phase constraints, and used $n = 10{,}000$ samples. For some cases we found that, under our model, phase constraint violations were not rare events. In some other cases, the rare event probability was dominated by one single phase condition: $\underline\mu = \max_{1\le j\le J} P_j \approx \sum_{j=1}^J P_j = \bar\mu$. For cases like this there is no need for elaborate computation because we know $\mu$ is within the narrow interval $[\underline\mu, \bar\mu]$. The interesting cases were rare events not dominated by a single failure mode. We investigate two of them.
The first is the Polish winter peak grid of 2383 busses. There were $d = 326$ random (uncontrolled) busses and $J = 5772$ phase constraints. We varied $\bar\theta$ as shown in Table 2. For $\bar\theta = \pi/7$, constraint violations are not very rare. At $\bar\theta = \pi/4$ they are quite rare. The estimated coefficient of variation is nearly constant over this range.
The second interesting case is the Pegase 2869 model of Fliscounakis et al. (2013). This has $d = 509$ uncontrolled busses and $J = 7936$ phase constraints. It is described as "power flow for a large part of the European system". The results are shown in Table 3. We include an unrealistically large bound $\bar\theta = \pi/2$ in that table, to test the limits of our approach. For $\bar\theta = \pi/2$, the standard error given is zero. One half-space was sampled 9408 times, another was sampled 592 times, but in no instance were there two or more phase violations. The estimate reverts to the union bound. Getting 0 doubletons ($S = 2$) among $n = 10{,}000$ tries is compatible with the true probability of a doubleton being as high as $3/n$. Even if $T_1 = 0.9997$ and $T_2 = 0.0003$, we would have $\mu = (1 - (3/2)\times10^{-4})\bar\mu$ instead of $\bar\mu$. We return to this issue in the discussion.
We also considered a system which included random and correlated wind power generators, but phase failure was not a rare event in that system. Pegase 13659 was too large for our computation. The Laplacian matrix has 37,250 rows and columns and we use the SVD to compute the generalized inverses we need. Pegase 9241 was large enough to be very slow and it did not have rare failures.

Discussion
We have introduced a version of mixture importance sampling for problems with multiple failure modes. The sample values are constrained to have at least one failure and we obtain bounded relative error.
The ALORE importance sampler is more accurate than a state of the art code for computing high dimensional Gaussian probabilities in our rare event setting, but not otherwise.
We have noticed two areas where ALORE can be improved. First, if we never see $S \ge 2$ concurrent rare events, ALORE will return the union bound, $\hat\mu = \bar\mu$, with an estimated variance of zero. That variance estimate could be undesirable even when $\mu/\bar\mu \approx 1$. Because $1/S(x)$ is supported on $\{1, 1/2, \dots, 1/J\}$, we can get an interval estimate of $\mu$ by putting a multinomial prior on this support set and using the posterior distribution given the sample.
A second and related issue is that while $\bar\mu \ge \hat\mu \ge \bar\mu/J$ always holds, it is possible to get $\hat\mu < \underline\mu = \max_{1\le j\le J} P_j$. We have seen this in cases where $\underline\mu \approx \bar\mu$ because one of the $P_j$ dominates all of the others combined. In such cases $\underline\mu$, $\bar\mu$ and $\mu$ are all very close together and $\hat\mu$ has a small relative standard deviation. Improving these two issues is outside the scope of this paper. They are both things that happen in cases where we already have a very good idea of the magnitude of $\mu$.
In large problems the algebra can potentially be reduced by ignoring the very rarest events and simply adding their probabilities to the estimate. This will provide a mildly conservative bound. There is also the possibility of exploiting many generalized upper and lower bounds on the probability of a union. See for instance the survey by Yang et al. (2014).

Appendix: Proofs
Proof of Theorem 2
Our motivating problem involves probabilities defined by Gaussian content of half-spaces. The approach generalizes to estimating the probability of the union of any finite set of events. We can also consider a countable number of events when the union bound is finite; see the remarks below. We will assume that each event has positive probability, but that condition can also be weakened to a positive union bound, as described at the end of this section.
For definiteness, we define our sets in terms of indicator functions of a random variable x ∈ R d with probability density p. The same formulas work for general sample spaces and the density can be with respect to an arbitrary base measure.
We cast our notation into this more general setting as follows. For $J \ge 1$ and $j = 1, \dots, J$, let the subset $\mathcal{H}_j \subset \mathbb{R}^d$ define both the event $\{x \in \mathcal{H}_j\}$ and the indicator function $H_j(x) = 1\{x \in \mathcal{H}_j\}$. For $u \subseteq \{1, \dots, J\}$ we let $H_u = \cup_{j\in u} H_j$ and $H_u(x) = \max_{j\in u} H_j(x)$, with $H_\emptyset(x) = 0$. As before $P_u = E(H_u(x))$, the number of events is $S(x) = \sum_{j=1}^J H_j(x)$ and $\Pr(S = s) = T_s$. Next we introduce some further notation. We use $-u$ for complements with respect to $1{:}J$, especially within subscripts, and $H_u^c(x)$ for the complementary outcome $1 - H_u(x)$. Then $H_u(x)H_{-u}^c(x)$ describes the event where $x \in \mathcal{H}_j$ if and only if $j \in u$.
If $P_j > 0$, then the distribution $q_j$ of $x$ given $H_j$ is well defined: $q_j(x) = p(x)H_j(x)/P_j$. If $\min_j P_j > 0$ then we can define the mixture distribution
$$q_{\alpha^*} = \sum_{j=1}^J \alpha_j^* q_j, \qquad \alpha_j^* = P_j/\bar\mu, \qquad \bar\mu = \sum_{j=1}^J P_j.$$
For $n \ge 1$, our estimator of $\mu = \Pr(S(x) > 0)$ is
$$\hat\mu_{\alpha^*} = \frac{\bar\mu}{n}\sum_{i=1}^n \frac{1}{S(x_i)}, \qquad x_i \stackrel{\mathrm{iid}}{\sim} q_{\alpha^*}. \qquad(17)$$
In this section, some equations include both randomness due to $x_i \sim q_{\alpha^*}$ and randomness due to $x \sim p$. For this section only, we use $\Pr_*$, $E_*$ and $\mathrm{Var}_*$ when the randomness is from observations $x_i \sim q_{\alpha^*}$, while $E$, $\Pr$ and $\mathrm{Var}$ are with respect to $x \sim p$.
Theorem 2. If $1 \le J < \infty$ and $\min_j P_j > 0$ and $n \ge 1$, then $\hat\mu_{\alpha^*}$ defined by (17) satisfies $\Pr_*(\bar\mu/J \le \hat\mu_{\alpha^*} \le \bar\mu) = 1$, $E_*(\hat\mu_{\alpha^*}) = \mu$, and
$$\mathrm{Var}_*(\hat\mu_{\alpha^*}) = \frac{1}{n}\Bigl(\bar\mu\sum_{s=1}^J \frac{T_s}{s} - \mu^2\Bigr) \le \frac{\mu(\bar\mu - \mu)}{n}.$$
Remark 1. Suppose that one of the $P_j = 0$ but $\bar\mu > 0$. In this case $q_j$ is not well defined. However $q_{\alpha^*}$ places probability 0 on $q_j$, so we may delete the $q_j$ component without changing the algorithm, and then sampling from $q_{\alpha^*}$ is well defined.
Remark 2. Next suppose that there are infinitely many events, one for each $j \in \mathbb{N}$. If $\bar\mu \in (0, \infty)$, then $q_{\alpha^*}$ is well defined. The same proof goes through, only now sums over $1{:}J$ must be replaced by sums over $\mathbb{N}$.

Proof. Under $q_{\alpha^*}$ we have $q_{\alpha^*}(x) = \sum_{j=1}^J (P_j/\bar\mu)\,p(x)H_j(x)/P_j = p(x)S(x)/\bar\mu$, and every $x_i$ has $S(x_i) \in \{1, \dots, J\}$, so $\bar\mu/J \le \hat\mu_{\alpha^*} \le \bar\mu$ holds with probability one. Next,
$$E_*\Bigl(\frac{\bar\mu}{S(x_i)}\Bigr) = \int \frac{\bar\mu}{S(x)}\,\frac{p(x)S(x)}{\bar\mu}\,1\{S(x) \ge 1\}\,\mathrm{d}x = \Pr(S(x) \ge 1) = \mu,$$
so $E_*(\hat\mu_{\alpha^*}) = \mu$. Similarly $E_*(\bar\mu^2/S(x_i)^2) = \bar\mu\,E(1\{S(x) \ge 1\}/S(x)) = \bar\mu\sum_{s=1}^J T_s/s$, which gives the variance formula. The upper bound follows because $T_s/s \le T_s$ for $s \ge 1$, so that $\bar\mu\sum_{s=1}^J T_s/s \le \bar\mu\mu$ and then $\mathrm{Var}_*(\hat\mu_{\alpha^*}) \le \mu(\bar\mu - \mu)/n$.
Proof of Lemma 1
Because $S \in [1, J]$, Popoviciu's inequality for a random variable on $[1, J]$ gives $\mathrm{Var}(S) \le (J-1)^2/4$, with equality if and only if $S$ puts probability $1/2$ on each of 1 and $J$. Likewise $S^{-1} \in [1/J, 1]$ gives $\mathrm{Var}(S^{-1}) \le (1 - 1/J)^2/4$. Then
$$E(S)E(S^{-1}) = E(SS^{-1}) - \mathrm{Cov}(S, S^{-1}) = 1 - \mathrm{Cov}(S, S^{-1}) \le 1 + \sqrt{\mathrm{Var}(S)\,\mathrm{Var}(S^{-1})} \le 1 + \frac{(J-1)^2}{4J} = \frac{(1+J)^2}{4J}.$$
Equality requires $\mathrm{Corr}(S, S^{-1}) = -1$ and both variance bounds to hold with equality. Finally, the unique distribution for which $\mathrm{Var}(S)$, $\mathrm{Var}(S^{-1})$ and $-\mathrm{Corr}(S, S^{-1})$ all attain their maxima is $U\{1, J\}$.