Exploiting Locality in Full Configuration Interaction Quantum Monte Carlo for Fast Excitation Generation

In this paper, we propose an improved excitation generation algorithm for the full configuration interaction quantum Monte Carlo method, which is particularly effective in systems described by localized orbitals. The method is an extension of the precomputed heat-bath strategy of Holmes et al., with more effective sampling of double excitations and a novel approach for nonuniform sampling of single excitations. We demonstrate the effectiveness of the algorithm for a chain of 30 hydrogen atoms with atom-localized orbitals, a stack of benzene molecules, and an Fe(II)-porphyrin model complex, whereby we show an overall efficiency gain by a factor of two to four, as measured by variance reduction per wall-clock time.


INTRODUCTION
The full configuration interaction quantum Monte Carlo (FCIQMC) method is a sparse and highly parallelizable CI eigensolver in quantum chemistry. 1,2Relying on the sparsity of the CI Hamiltonian and its eigensolutions, FCIQMC is in general not limited by the active space sizes, 3−11 in contrast to conventional CI methods, 12,13 which are limited to at most 18 electrons and 18 orbitals. 14Due to the nearly linear parallelization of the algorithm, FCIQMC takes full advantage of modern computer architectures 2 and allows a large number of electrons and orbitals to be explicitly correlated; active spaces containing up to 96 electrons in 159 orbitals have been reported to date. 9The FCIQMC dynamics also allows to stochastically sample reduced density matrices (RDMs) via the replica trick, 15 enabling the calculation of properties and orbital optimization, in the form of the stochastic complete and generalized active space self-consistent field (stochastic-CASSCF and stochastic-GASSCF, respectively) methods. 3,9n FCIQMC, the stochastically sampled wave function is represented by walkers distributed across the electronic configurations that form the CI vector space.The CI coefficient of a given configuration is proportional to the expectation value of the walker occupation on that configuration averaged over time, c i ∝ ⟨N i w ⟩ τ .In the limit of large walker numbers, the approximated c i values become exact.Electronic configurations corresponding to vanishingly small c i coefficients are never or only rarely occupied by walkers.Since only occupied configurations have an impact on the computational costs of a FCIQMC simulation, the method relies on and takes advantage of sparsity.The initiator approximation (i-FCIQMC) 16−18 and its further refinement, the adaptive shift method, 19,20 increase the sparsity of the wave function and greatly improve the convergence with respect to the walker number.
The FCIQMC wave function is propagated along the imaginary time, τ, in discretized steps, Δτ.The iterative procedure of FCIQMC dynamics consists of four main stages: excitation generation, spawn, death, and annihilation.In the excitation generation stage, new configurations |D j ⟩ are suggested as targets.In the spawn step, new walkers are then spawned to |D j ⟩ from |D i ⟩ (i ≠ j), with a probability p spawn ∝ Δτ⟨D i |H ̂|D j ⟩.In the death step, configurations |D i ⟩ are killed with a probability proportional to their diagonal matrix element ∝ Δτ⟨D i |H ̂|D i ⟩.In the annihilation step, walker contributions (arising from different spawning events) residing in the same configuration are summed with their respective signs.If they are of identical amplitude and opposite sign, they annihilate each other, and any information about that configuration is removed from the data tables.
The exact details of the four main stages of the FCIQMC algorithm greatly affect its performance.For example, a low rate of successful spawns makes the algorithm inefficient; multiple spawns from one configuration to another (a "bloom" event) cause the calculation to become unstable.In order to reduce inefficiencies, new configurations |D j ⟩ should not be suggested uniformly in the excitation generation step.Instead, they should be selected with a probability p gen ∝ ⟨D i |H ̂|D j ⟩ such that important connections are sampled more often and the spawn probability p spawn ∝ Δτ excitation generation is at the heart of an efficient FCIQMC algorithm.
So far, we used the term configurations in its full generality since the FCIQMC algorithm is not restricted to a specific basis and can use Slater determinants (SD) as well as spin eigenfunctions.In the following, we will restrict ourselves to SDs because we make use of the Slater−Condon rules in this work.In the SD basis, there exist several algorithms for the sampling of double excitations with different trade-offs between efficiency and memory demand, notable ones include: (a) the Cauchy−Schwarz excitation generator by the Alavi group, 2 (b) the heat-bath Power−Pitzer method by Neufeld and Thom, 21 and (c) the precomputed heat-bath (PCHB) method by Holmes et al. 22 PCHB is the foundation of the work presented in this document.
One important property that these fast double excitation generators rely on is that the magnitude of the matrix element between two determinants, connected by a double excitation, does not depend on the involved determinants (details in Section 2).This property allows the determinant-independent precalculation of probability distributions and their efficient sampling. 22,2,21,23Algorithms for efficient excitation generation are also highly relevant for related methods such as selected-CI 24 or FCI fast randomized iteration 25 (FCI−FRI).
In PCHB, the sampling of double excitations (AB ← IJ) occurs in multiple steps.In the particle selection, a first particle I is chosen according to probability p(I), and then a second particle J is chosen under the conditional probability p(J|I).Next, in the hole selection, holes A and B are chosen with probabilities p(A|IJ) and p(B|IJA), respectively.When suitably localized MOs are used, after particle I has been selected, the second particle has a high probability, p(J|I), to be selected from the vicinity of I, rather than from spatially distant orbitals.Hence, the PCHB algorithm can take advantage of the locality effects.
Furthermore, particle indices, I and J, should preferably be guaranteed to be occupied in the starting determinant, |D i ⟩, while the hole indices A and B should be guaranteed to be empty, which requires additional work.This implies that there is a trade-off between (a) faster sampling, which cannot guarantee occupied I, J and empty A, B, versus (b) slower but high-quality sampling, which can guarantee this.If the underlying probabilities are near-uniform, then the advantage of weighted sampling vanishes, and it might be beneficial to return back to uniform sampling, which is faster and cheaply guarantees occupied I, J and empty A, B.
The Hamiltonian matrix elements for single excitations, on the other hand, depend on the occupied orbitals and hence on the determinant from which the excitation takes place (details in Section 2).This determinant-dependency makes the precalculation of single excitation probabilities harder, and single excitations have usually been sampled uniformly. 2,9,21,22n order to improve upon the uniform single excitation generation, Neufeld and Thom suggested to build precalculated probabilities for single excitations relying on the reference determinant, with some modifications. 21This approach works best for systems featuring single-reference wave functions.In the context of variational Monte Carlo, Sabzevari and Sharma suggested contracting over all orbitals in the two-electron contribution to the single excitation Hamiltonian matrix element and to truncate all single excitations that are below a user-defined threshold. 26This contraction over all orbitals is independent of the determinants and can be precomputed, but the truncation introduces an artificial bias in the excitation generation; moreover, it is still uniform for all of the excitations that are above the threshold.
For delocalized molecular orbital bases, double excitations are usually more important and more numerous than single excitations.Single excitations often feature vanishingly small matrix elements, for they (partially) fulfill the Brillouin theorem in its single-reference or generalized form. 27Therefore, most effort in the community has gone into improving double excitation generation.However, in localized MO bases, single excitations become as (or more) important than double excitations, capturing mean field effects via the CI-expansion.The corresponding single excitation Hamiltonian matrix elements are then generally large and pivotal in wave function optimization.Thus, the development of efficient sampling schemes for single excitations becomes particularly relevant for localized orbital bases.−7,11,28−30 Inspired and motivated by our research carried out on localized orbital bases, we have developed a novel PCHB algorithm that, relying on the locality of the correlation effects, has an increased excitation generation efficiency for both single and double excitations.This algorithm is the focus of this work.
For our new precomputed and weighted single excitation generator, we contract over all orbitals to obtain an approximated estimate for the nonuniform sampling probabilities of single excitations.The contraction is similar to the work of Sabzevari and Sharma, but we do not introduce a bias from truncation, sample the single excitations nonuniformly, and show that the nonuniform probabilities are well approximated in localized molecular orbital bases.
For the double excitations, we systematically investigated the trade-off between speed and guaranteeing occupied or empty orbitals in PCHB and determined which combination of uniform or (constrained) weighted PCHB sampling is the fastest, to date not present in the literature.In this work, we show that the choice matters, considerably improves the excitation generation of doubles, and depends on the system at hand.We also improve upon the current literature by examining how to efficiently draw from constrained distributions, for example p(I)| I∈|Dd i ⟩ from the precomputed p(I).
The new PCHB algorithm for singles and the improved excitation generation for doubles allows FCIQMC dynamics that are two to four times more efficient than the implementations developed by Holmes et al. 22 and by us (within the NECI 2 project).
The novel PCHB algorithm is easily extended to the recently developed stochastic-GAS wave function optimizations 9 and to the spin-purification strategy in SD based FCIQMC. 10he remainder of the article is organized as follows: in Section 2, we give the necessary background for our novel PCHB excitation generator in FCIQMC.This includes the equations for the evaluation of Hamiltonian matrix elements, their decay with increasing spatial distance, and the already existing PCHB algorithm for double excitations; in Section 3, we introduce our improved sampling of double excitations; in Section 4, we introduce our new weighted, precomputed single excitation generation; in Section 5, we demonstrate the application of the novel PCHB on several test case molecular systems and quantify the efficiency improvements: a chain of 30 hydrogen atoms with atom-localized orbitals, whose Hamiltonian is dominated by single excitation couplings, demonstrates the capabilities of our new algorithm for sampling single excitations; stacks of benzene molecules, with fragment-localized MOs serves as an example for a system characterized by strong correlation effects (double excitations) within each fragment, and generally small correlation effects across the fragments; the efficiency of the new PCHB is numerically demonstrated also for highly delocalized MOs, using an Fe-porphyrin model system from previous works, 3,4,8,9 as well as the N 2 dimer at equilibrium geometry with a large full-CI space.In Sections 6 and 7, we discuss how both generalized active spaces and spin purification on an SD basis benefit from the improved PCHB algorithm.Section 8 collects our final remarks and conclusions from the work.

THEORETICAL BACKGROUND
This section describes the key elements that represent the foundation of the novel PCHB algorithm suggested in the present work.
2.1.Matrix Elements.Using the conventions of the "purple book" 27 and assuming orthonormal spin−orbitals, we write the second-quantized, nonrelativistic molecular Hamiltonian operator in the Born−Oppenheimer approximation with atomic units as where P, Q, R, and S are generic spin−orbital indices.The a † and a are the usual unitary, second-quantized creation and annihilation operators.The one-electron integral h AI is given as where ϕ I denotes the I-th spin−orbital, r and x refer to the spatial and the combined spatial and spin electronic coordinate, respectively, and Z X and R X are the charge and spatial coordinate of the X-th nucleus.The two-electron integral g AIBJ is given in chemist's notation as where r 12 = |r 1 − r 2 |.The nuclear repulsion term, E nuc , is the classical repulsion of point charges given by The Hamiltonian operator contains up to two-particle excitation terms and in general, the matrix element ⟨D i |H ̂|D j ⟩ between two determinants is zero if they differ by occupation in more than two spin−orbitals (in virtue of the Slater− Condon rules 31 ).The nonzero terms are given as where indices I < J and A < B refer to occupied and unoccupied spin−orbitals in |D i ⟩, respectively, connecting to |D j ⟩.We write hence, R in eq 6 represents an index running over all the occupied spin−orbitals.The Fermionic prefactor is Γ AI D i = (−1) n , where n is the number of occupied spin−orbitals between A and I in determinant |D i ⟩.Note also that the indices for the single and double excitations have to be pairwise different (I ≠ J, A ≠ B, {I, J}∩{A, B} = ⌀).We will write uppercase I for spin−orbitals, lowercase i for spatial orbitals, and i σ for the i-th spatial orbital with spin σ.
Our new PCHB excitation generator relies on the decay of Hamiltonian matrix elements from vanishing integrals.We assume the MOs to be suitably localized and separated from each other so that a distance R between the MOs can be defined.We have to emphasize that the following discussion does not apply to delocalized orbitals.For large enough distances, the one-electron integrals decay with (S IA ), where S IA = ⟨|ϕ I |||ϕ A |⟩ is the absolute overlap between the two spin− orbitals, which decays as (exp(−R AI2 )) with the distance R AI between the MOs ϕ A and ϕ I . 32An intuitive understanding comes from the observation that with less and less overlap, the two functions ϕ I and ϕ A have more and more disjoint support with respect to the integration variable (see eq 2).
For the decay of the two-electron integrals g AIBJ , we need to define the following quantities: S IA and R IA are defined as before, S JB and R BJ are the absolute overlaps and distances between ϕ J and ϕ B , and R ovl,AI,BJ is the distance between the overlap distributions ϕ A ϕ I and ϕ B ϕ J .With this notation, the two-electron integrals g AIBJ decay for large distances as Here and in the following, we write We first decide with probability p 1 or p 2 whether to perform a single or double excitation, respectively, and then sample from the conditional probabilities p gen (|D j ⟩ ← |D i ⟩|n), n ∈ {1, 2}.Here and in the following, subscripts 1 and 2 refer to single and double excitations, respectively, while superscripts refer to the algorithm, e.g., p 2 PCHB is a PCHB-weighted double excitation probability.These subscripts or superscripts are omitted if the expressions are not restricted to a specific algorithm or level of excitation.
During a calculation, the time step is optimized to be as high as possible while preserving a stable calculation.In the simplest approach, this is achieved by In order to maximize the possible time step Δτ from eq 9, the probabilities p 1 and p 2 are adjusted during the calculation to preserve From eq 10, it becomes apparent that if an excitation generator is of low quality, i.e., there is a pair gen then the corresponding p n become large.Note that there exist more elaborate schemes to update the time step, for example, the histogramming Δτ search, which does not depend only on the worst case unlike eq 9. 34 Even in such schemes, the overall trend is retained; i.e., if single excitations are suboptimally sampled, they will be sampled more often.
Localized MOs tend to have considerably large matrix elements for single excitations, and these matrix elements are nonuniform, because excitations to distant MOs are negligible (vide infra).The generally utilized uniform excitation generation for single excitations, in the case of localized orbital bases, would lead to small p gen for single excitations due to the numerous vanishingly small single connections among distant MOs, leading to exaggeratedly large p 1 values, and overshadow any improvement of PCHB for double excitations.In these cases, the uniform excitation generator for single excitations represents a particularly poor choice and would negatively impact the entire dynamics.The present work overcomes such limitations and provides a more balanced excitation generation.
2.3.Original PCHB Strategy.In this section, we summarize the key elements of the original work on PCHB by Holmes and co-workers. 22The exact weighted sampling of orbitals for double excitations from |D i ⟩ is given by Clearly, whether the element is zero or nonzero depends on occupation (I, J ∈ |D i ⟩ and A, B ∉ |D i ⟩), and hence on the determinant.When nonzero, double excitations have the important property that the magnitude is independent of the starting determinant |D i ⟩ (compare eq 7).If we write for the weight then eq 11 can be factorized without approximations, as The crucial approximation of the original PCHB is the contraction over all indices and hence the determinantindependent precalculation of probabilities, which are given by Considering that the order in which we pick the particles and holes does not matter for the final determinant, and that in general, one cannot assume symmetry, e.g., p(I|J) ≠ p(J|I), the full probability of selecting

Journal of Chemical Theory and Computation
The precomputed probabilities from eq 14 have the huge advantage that they can be calculated once at the beginning of the calculation.In combination with the alias sampling algorithm, 23,35 the precalculated probabilities allow sampling with (1) run-time cost.This is a typical trade-off between memory and time.The memory requirements can in part be reduced by using read-only, node-shared memory for storing the alias tables. 2,9t would be desirable for an excitation generator that particle indices, I, J, and hole indices, A, B, are guaranteed to be occupied and unoccupied in |D i ⟩, respectively.Unfortunately, sampling according to eq 14, with no reference to the current determinant, |D i ⟩, has the important disadvantage that the particle/hole indices suggested might not be occupied/empty in |D i ⟩; this excitation can be discarded.

IMPROVED PCHB FOR DOUBLE EXCITATIONS
Our improvements to the sampling of doubles in PCHB consists of determining when and how to guarantee if orbitals are occupied or empty.The necessary renormalizations for constrained sampling (a sampling where the notion of occupied/empty orbital is available) that guarantees I, J and A, B to be occupied and empty, respectively, in |D i ⟩, read as ) The cost to evaluate these probabilities scales with the number of terms in the sum as (N e ) and (n orb − N e ) for I, J and A, B, respectively.
A clear trade-off emerges from the comparison of eqs 14 and 16, between (a) drawing from precomputed distributions with constant scaling and subsequently discarded excitations, and (b) from ad hoc constrained probability distributions with linear scaling and no discarded excitations.We will refer to unconstrained sampling (p(I), •••) as fast-weighted (eq 14) and to constrained sampling (p(I)| I∈|Dd i ⟩ , ••• ) as f ully weighted sampling (eq 16).The uniform sampling, also used for comparison in this work, allows drawing from constrained subsets with (1) time.This implies that for near-uniform probability distributions, the uniform sampling might outperform weighted sampling after taking the quality of excitation and iteration time into account.The possible sampling schemes are summarized in Table 1 and, in principle, can be freely combined among each other for the selection of first/ second particle or hole.Note that fast-weighted and fully weighted methods have additional, non-negligible memory and one-time initialization cost, that scale with (n orb 1 ), (n orb 2 ), (n orb 3 ), and (n orb 4 ) for the first and second particles and first and second holes, respectively.We will see later that the specific choice of combinations has large implications for the performance of the overall FCIQMC dynamics.In particular, fast-weighted sampling works best for the hole-selection of active spaces with an excess of unoccupied orbitals, i.e., if there are many empty orbitals, then p 2 PCHB (A|IJ) ≈ p 2 PCHB (A| IJ)| A∉|Dd i ⟩ , and performing the full weighting would only result in additional, unnecessary computational effort.Similar conclusions would be reached for active spaces with an excess of occupied orbitals, but those are more rare in practical applications and limited by the Pauli principle and will not be discussed any further.Uniform sampling works best for active spaces with near-uniform excitation probabilities and has no additional costs to guarantee orbitals to be occupied or unoccupied.Fully weighted sampling is the slowest in wallclock time but samples with the highest quality.
To date, no systematic investigation is available on which combination of uniform or (constrained) weighted PCHB sampling is the fastest.The original PCHB paper 22  (AB|IJ)) to be the fastest combination.Note that directly sampling pairs IJ and AB can reduce the memory cost, as described in ref 2, but is tied to a fast-weighted sampling scheme.
Based on the previous discussion, our improvements to the sampling of double excitations in PCHB are two-fold: (a) we suggest optimal combinations of fast, fully weighted, and uniform sampling depending on general features of the systems investigated, and (b) we improve the sampling speed from constrained distributions.We present these improvements separately in the next two subsections.
3.1.Improved Sampling Combinations.The choice of better combinations of sampling schemes greatly depends on the probability distributions from eq 14 which are highly system dependent.Different systems may prefer different excitation generation schemes based on the form of electronic interactions, the orbital basis of choice (delocalized or localized), and whether there is an excess of empty orbitals.We will analyze the shape of probability distributions for some selected systems, and numerically prove our findings in the application Section 5.
The first example we consider is a stack of ten benzene molecules at an intermolecular distance of 3.0 Å, and orbitals that are fragment-localized, i.e., they are natural, Huckel-like orbitals on each benzene.The CAS (60,60) was ten times the minimal active space of six electrons in six π-orbitals for one Journal of Chemical Theory and Computation benzene unit.As we showed in a previous publication, 9 the system as a whole is dominated by excitations inside each fragment.Additionally, incorporating single excitations between the fragments is enough to recover the full CI energy.The probability distribution p 2 PCHB (I) is displayed in Figure 1.The symmetry of repeating fragments is visible in the probabilities, and the fragments at the border have slightly different behavior than the fragments in "bulk".However, overall, it is a near-uniform distribution.For this reason, p 2 uni (I)| I∈|Dd i ⟩ is likely the best choice for the first particle, cheaply guaranteeing I to be occupied.This section is about qualitative trends and gives robust rules of thumb for selecting the sampling methods.It is still necessary to measure timings with the system at hand, as can be seen in the numerical results of Section 5.
The conditional probability of p 2 PCHB (J|I) to sample the second particle is displayed in Figure 2. The left, nonlogarithmic plot (Figure 2a) shows that the correlation is highly local.The second particle has the highest probability of being selected on the same fragment, while quickly decaying for neighboring fragments.Any choice for the second particle further away than 6 Å (second-nearest fragment) is nearly negligible.The logarithmic plot (Figure 2b) reveals the exponential decay with increasing distance.This observation shows that the distribution is highly nonuniform and weighted sampling is preferred.Moreover, as the orbital space considered, CAS (60,60) is far from a case with an excess of occupied orbitals; it is advantageous to guarantee that the second picked orbital index is indeed occupied.By this argument, fully weighted sampling with p 2 PCHB (J|I)| J∈|Dd i ⟩ is recommended.In addition, the weighted sampling automatically prefers geminal, i.e., opposite spin pair, excitations, as can be seen from comparing the blue and orange lines in Figure 2. The subsequent hole-selections are also highly nonuniform with no excess of empty orbitals, which suggests to be the best choice for such localized systems, namely, uniform selection for the first particle and fully weighted selection for the second particle and the two holes.Notably, this sampling scheme is neither the one suggested by Holmes et al. 22 nor the one we have been using in NECI. 2 The second system is full CI on the N 2 dimer at equilibrium distance with a Dunning's cc-pVQZ basis 36 using Hartree− Fock orbitals, resulting in a (14,110) active space.This is a single-reference system with highly delocalized orbitals that fulfills the Brillouin theorem, which makes the double excitations particularly important.
The probability for the first particle, p 2 PCHB (I), is displayed in Figure 3 and has a small variation.For this reason, analogously

Journal of Chemical Theory and Computation
to the benzene stack, p 2 uni (I)| I∈|Dd i ⟩ is considered to be the best choice for the first particle.
The conditional probability p 2 PCHB (J|I) to sample the second particle is displayed in Figure 4.The left, nonlogarithmic plot (Figure 4a) is nonuniform; however, the range of PCHB values is much smaller than for the stack of benzene (Figure 2a), indicating a more homogeneous distribution.The logarithmic plot (Figure 4b) reveals that the probabilities range mostly over "just 1" order of magnitude, unlike the 4 orders of magnitude in Figure 2b.There is no obvious exponential decay for the probability of the second particle due to the compact nature of the system.Overall, the N 2 dimer and similar systems might still benefit from uniform sampling of the second particle p 2 uni (J|I)| J∈|Dd i ⟩ .This suggestion is investigated in greater detail and confirmed in Section 5.
The subsequent hole-selections are again highly nonuniform, which suggests weighted sampling.Due to the excess of empty orbitals, the fast-weighted sampling of hole indices is expected to outperform the fully weighted strategy since a randomly picked index is more likely to be unoccupied.The above considerations indicate the following sampling scheme for N 2 and similar compact systems featuring delocalized orbital bases and active spaces with an excess of empty orbitals.Notably, this sampling scheme is the choice we have been using in NECI 2 and differs from the one suggested by Holmes et al. 22 3.2.Faster Constrained Sampling.This subsection describes how to efficiently draw from constrained distributions, for example, p 2 uni (I)| I∈|Dd i ⟩ from the precomputed p(I).We assume that both the unconstrained probabilities p(I) and the corresponding alias tables for alias sampling are preconstructed, i.e., we can draw in (1) from p(I).
In the literature, it has so far been assumed that constrained distributions are best sampled by rebuilding alias tables for the constrained probabilities p(I)| I∈|Dd i ⟩ . 21,22This approach still uses the precomputed probabilities p(I) to obtain p(I)| I∈|Dd i ⟩ but ignores the already available alias tables for p(I).This is an unnecessary slowdown because the computational effort for reconstructing constrained alias tables scales linearly with the number of particles N ( ) e or the number of empty orbitals n N ( ) orb e for constraints of the type We suggest to use rejection sampling, where p(I) takes the role of the candidate density and the constrained p(I)| I∈|Dd i ⟩ takes the role of the target density. 38This means we redraw from p(I), using the already available alias tables, until the required (un)occupied index is sampled (see alg. 3).If |D i ⟩ is represented as a bit mask, then the test for membership I ∈ |D i ⟩ has (1) cost, and since drawing from p(I) also has (1) cost, the whole operation of redrawing has (1) scaling with respect to system size.Note that the constant scaling with respect to the system size is only one part of the story.If the weight of the subset becomes low, then the probability of misses becomes high and the expected number of samples is given by | p I 1 ( ) . For this reason, it can be advantageous to avoid rejection sampling, if the renormalization constant ∑ I∈|Dd i ⟩ p(I) is small (see algorithm-line 3.10).It also means that rejection sampling of holes works better for systems with a large number of empty orbitals, while rejection sampling of particles works better for systems with a small number of empty orbitals.
In order to obtain the renormalized probabilities (eq 16), we need equations of the type p(I)| I∈|Dd i ⟩ = p(I)/(∑ X∈|Dd i ⟩ p(X)), which would scale with N ( ) e because of the sum at the denominator; for hole constraints p(A|I| A∉|Dd i ⟩ = p(A|I)/ (∑ X∈|Dd i ⟩ p(X|I))the scaling would be n N ( ) orb e .Since there is the following relationship for the complement we can choose to scale with the number of particles or empty orbitals, whichever is cheaper.This is particularly advantageous for active spaces with an excess of empty orbitals.Figure 5 depicts the typical situation of an active space with a similar number of particles and spatial orbitals; hence, the constrained subset, for both particle-and hole-selection, contains half of the elements of the total system size.In the same figure, we can see the linear scaling for both methods and that the redrawing method is roughly 5 times faster than rebuilding the alias sampling tables.For higher weights of the constrained subset, the redrawing method can get up to 10 times faster.Further details of the timings are discussed in the Appendix A.1.
In the context of FCIQMC, the faster sampling from constrained distributions is relevant for both single and double excitations; the singles are described in the next section.

NOVEL SCHEME FOR PRECOMPUTED SINGLE EXCITATIONS
As discussed earlier, an accurate and balanced generation of single and double excitations is of paramount importance for efficient FCIQMC dynamics.According to eq 10, a suboptimal single excitation generator leads to an exceedingly frequent sampling of single excitations, which overshadows any improvement for double excitation generators.
In this section, we introduce an efficient way to precompute the probabilities for single excitations.As for the precomputed double excitations, we start with the exact, determinantdependent, factorized sampling probability (20)   Equation 20 depends in two ways on the starting determinant |D i ⟩.The first, obvious dependence, is the restricted summation over (un)occupied orbitals (the K and C indices).This dependence is also present for double excitations and is approximated in PCHB by contracting over all orbitals (eq 13).The additional dependence for single excitations is that the magnitude of the matrix element highly depends on the starting determinant (see eq 6).For example, according to the Brillouin theorem, single excitations from the Hartree−Fock reference determinant |D HF ⟩ should have a vanishing matrix element In Table 2, we list all possible combinations of probability distributions for single excitations.As in the case of doubles (Table 1), we can draw from constrained subsets and guarantee the sampled indices to be occupied or unoccupied.Sometimes, it might be even beneficial to return back to uniform sampling if the probability distribution is uniform enough.
We have investigated whether eq 23 approximates the exact probabilities well and which of the combinations from Table 2 is the most efficient sampling scheme for single excitations.In Figure 6, the p 1 PCHB (a ↑ |1 ↑ )| a↑∉|Dd i ⟩ values are shown for the stack of 10 benzene molecules with fragment-localized MOs (the probability of drawing the hole a ↑ after having drawn a ↑ -particle from the first spatial orbital).For the exact probability (eq 20), we assume that we start from the Hartree−Fock reference determinant, which is given by doubly occupying all π-orbitals and keeping all π*-orbitals empty (blue line in Figure 6) The orange line shows the corresponding approximated probability p 1 PCHB (a ↑ |1 ↑ )| a↑∉|Dd i ⟩ (eq 23). Figure 6 shows good agreement between the approximated and exact single excitations.It also confirms the exponential decay of matrix elements with the distance and the importance of excluding spatially distant orbitals, which would be unnecessarily sampled in a uniform scheme.
The probability for selecting the particle p 1 PCHB (I)| I∈|Dd i ⟩ is near-uniformly distributed.As for the double excitation case, the particle, I, can safely be sampled uniformly, and is predicted to be the fastest sampling scheme for single excitations.
Apart from the choice of the combination of sampling schemes, we also mention that the singles benefit from the improved sampling from constrained distributions, as discussed in Subsection 3.2.
As a final remark, we would like to emphasize that the new excitation generation for singles and the improved sampling combination of doubles can also be applied to methods such as selected-CI 24 or FCI-FRI. 25When combined with those methods, localized orbital bases mainly benefit from the new weighting scheme for single excitations, and the ratio of particles to empty orbitals determines whether fully weighted or fast-weighted particle or hole-selection is to be preferred.However, some differences between FCIQMC and FCI-FRI are to be expected; for example, the number of empty orbitals where fully weighted hole-selection becomes better than fastweighted for a given system might differ for the two approaches.Furthermore, in FCI-FRI, additional techniques can be applied to obviate the need for renormalization. 25

APPLICATIONS
Four test case applications are discussed in the following to numerically show the improved performance of the novel excitation generators: (a) a chain of 30 hydrogen atoms with atom-localized orbitals, therefore dominated by single excitations, demonstrates the capabilities of our new algorithm for sampling single excitations, (b) a stack of different numbers of benzene molecules with fragment-localized MOs serves as an example for a system of fragments with highly localized correlation effects that is dominated by double excitations and highlights the improved sampling of double excitations, (c) an Fe-porphyrin model complex shows the efficiency of the new single excitation sampling for systems with mostly delocalized MOs, as long as there is some spatial separation, and (d) the N 2 -dimer with Hartree−Fock orbitals shows how a uniform particle selection for the double excitations can improve the performance of PCHB.
5.1.Computational Details.Orbital optimization (e.g., the Hartree−Fock method), their transformations (e.g., localization), integral evaluations, their transformation to the basis of active molecular orbitals, and their storage on the formatted FCIDUMP 39 file were performed using Open-Molcas. 40,41The new excitation generation has been implemented in a locally modified version of NECI 2 and will be made publicly available via our open-source repository 42 upon publication of this work.The different sampling schemes are compared by their variance reduction per wall-clock time •t ( ) 1 .FCIQMC dynamics were entirely run in stochastic mode, i.e., no semistochastic propagation was enabled.For every system and every algorithm, there were three calculations of 24 h; the efficiency was then averaged over these three runs.Since the variance of the projected energy has to be estimated (1)

Journal of Chemical Theory and Computation
under stationary conditions, all calculations started from walker population distributions of already converged wave functions. 43To guarantee the optimal Δτ, p 1 , and p 2 for each algorithm, we optimized these values for 5 × 10 4 iterations according to eqs 9 and 10, ensuring that they kept constant after reaching stationary conditions.This means that every sampling choice was benchmarked with its own optimal set of parameters.The shift was adapted via a second-order scheme 44 to keep the population close to the target population of 10 5 for every calculation.−18 Although we used the initiator approximation in all applications presented here, we would like to emphasize that our new excitation generator applies also to the original (noninitiator) FCIQMC.−48 All efficiencies were normalized to the current PCHB algorithm in NECI.Sampling from constrained subsets uses our improved redrawing algorithm (Subsection 3.2) everywhere; for more details, see Appendix A.1.3 concisely summarizes the efficiency of different combinations of sampling schemes.We refer to the following subsections for a detailed discussion on the results for each test case.

Efficiency Comparison across Excitation Generators. Table
Uniform, fast-weighted, and fully weighted sampling schemes for particles and holes are labeled as unif, full, respectively.The colon symbol (:) separates particle-from hole-selection.For example, the sampling scheme for a double excitation, p 2 uni (I) , would be as uniffull:fast-fast.
In principle, there are 3 6 = 729 different combinations of samplers, but we restrict ourselves to the handful listed in Table 3, which represent the most meaningful.The conditional probabilities for the holes and for the second particle can use  The efficiency is given by •t ( ) 1 and normalized to the currently used algorithm in NECI, which is unif:unif unif-unif:fast-fast.The fastest speedup factor per system is printed bold.The algorithms are lexicographically sorted by sampling quality unif < fast < full.The error for the last digit is given in parentheses.b The choice in NECI. 2 c The choice by Holmes et al. 22 more information as compared to the first particle; hence, the later particle and hole selections should use less uniform sampling.It follows that a full:unif single excitation generator is expected to be less efficient than unif:full, due to the unnecessary weighting of the first particle, which, as shown earlier, in general exhibits uniform distributions.

H 30 -Chain.
The first example is a neutral chain of 30 hydrogen atoms that are equally spaced 4.2 a 0 apart.The basis used was STO-3G, which contains a minimal basis set of 30 basis functions, implying that (30,30) is the full-CI space.A C 1 point-group was utilized.Closed-shell Hartree−Fock-optimized orbitals were transformed by a Pipek−Mezey localization 49 to obtain orthonormal and localized MOs.
Because of the fully atom-localized MOs, the system is dominated by single If for example, all excitations are in a fully weighted manner, then p 1 is adjusted to 98% by eq 10, i.e., mostly single excitations are attempted.This means that the choice of the double excitation generator does not have a large influence on the overall efficiency.In Table 3, we indeed see that the efficiencies are grouped by the excitation generator for the single excitations.The purely uniform generation for single excitations unif:unif performs the worst, since it samples far away holes A with the same probability as close ones.The fully weighted excitation generation full:full performs the best per iteration, but it is outperformed by unif:full when taking wall-clock time into account since the choice of the first particle is near-uniformly distributed (see Section 4).The CAS (30,30) has no excess of empty orbitals, and surprisingly, unif:fast is the fastest combination.The fast-weighted hole-selection unif:fast performs better (by roughly 2.9 2.2 ≈ 30%) than the fully weighted hole-selection unif:full.
This can be explained as follows: due to hermiticity (compare eq 22), we have S I A = S A I .If an excitation A ← I has a high weight, then the reverse excitation I ← A has a high weight, as well.If I gets only occupied via the excitation from A, then A is unoccupied when I is occupied.Since the single excitations are in good approximation the only excitations, there is no other source that would spawn particles to I except the few nearest neighbors apart from A. Hence the probability of A being unoccupied when I is occupied is not independent of the precomputed weights but rather correlated, p(A ∉ |D i ⟩|I ∈ |D i ⟩) ∝ p 1 PCHB (A|I).Numerically, we confirm that if we sample via p 1 PCHB (A|I), the rate of valid excitations is around 75%.If we would draw the hole A uniformly without guaranteeing the orbital to be unoccupied [i.e., p 1 uni (A|I)], the rate of valid excitations would be 14
The precomputed weighted single excitations improve the performance by a factor of nearly three compared to unif:unif.Since the double excitations do not play a large role in this system, the NECI and Holmes et al. approaches, differing exclusively in the double excitations, exhibit a very similar efficiency. 2,225.2.2.Stack of Benzene Molecules.The next example is two stacks of benzene at a distance of 3.0 Å with 5 and 10 molecules, respectively.The stack of 10 molecules is displayed in Figure 7.In both cases, the orbitals were fragment-localized, i.e., natural, Huckel-like orbitals on each benzene.The active space contained all the π and π* orbitals, i.e., a CAS (30,30)  for 5 benzene molecules and a CAS(60,60) for 10 benzene molecules.As we have shown in a previous publication, 9 the system as a whole is dominated by excitations inside each fragment.Additionally incorporating single excitations between the fragments is enough to recover the full CI energy.
In the case of the five molecules, we used the CASSCF (30,30) optimized orbitals from our previous work. 9The MOs for the stack of 10 fragments were generated by repeating the MO coefficients of the five fragments along the diagonal to form a block-diagonal coefficient matrix of two blocks.MOs inside each block were still orthogonal to each other, but there was a small overlap between MOs of different blocks; hence, a Gram−Schmidt orthonormalization was applied.This implies that CAS (30,30) is at the energetic minimum with respect to orbital rotations, while CAS (60,60) is only close to it.Hence, the relevance of interfragment single excitations, recovering orbital rotations, is enhanced for the CAS (60,60).In the first case, we have p 1 = 22%, while in the second case, p 1 = 33%, when sampling everything fully weighted.
In Table 3, the efficiencies for different sampling schemes are listed for both stacks of benzene.As expected from the discussion in Sections 3 and 4, we see that uniform selection of the first particle and subsequent fully weighted sampling, i.e., unif:full unif−full:full−full, works best for both types of systems.It is 3.7 times faster than the current choice in NECI and 60% faster than the choice by Holmes et al.The advantage of not sampling uniformly increases with system size; in fact, the number of direct neighbors stays the same, while the number of distant orbitals increases.We also see that unlike the H 30 chain, it is important to improve the sampling of both single and double excitations.For example, varying the sampling of doubles by going from unif:full full−full:fast−fast to unif:full unif− full:full−full improves the performance from 2.6 to 3.7.
For these active spaces with a similar number of particles and empty orbitals, the fully weighted hole-selection outperforms the fast-weighted hole-selection, but the difference is perhaps less pronounced than expected.Similar to the H 30chain, this can be explained as follows: again, due to hermiticity, we have W IJ AB = W AB IJ (see eq 12).If excitation AB ← IJ has a high weight, then reverse excitation IJ ← AB has a high weight as well.Hence, the probability of A and B being unoccupied when I and J are occupied is not independent from the precomputed weights, but rather correlated, i.e., p(A ∉ | D i ⟩|I,J ∈ |D i ⟩) ∝ p 2 PCHB (A|IJ).We can confirm this correlation numerically.The hypothetical, uncorrelated rate of a valid hole selection for this active space would be roughly 1/2•1/2 = 1/4, since half of the orbitals are valid picks for the first hole and (half − 1) of the orbitals would be valid picks for the second hole.The numerically obtained rate of valid excitations for fastweighted hole selection is roughly 50%.
Contrary to the H 30 chain, the fully weighted hole-selection is still more efficient than the fast-weighted one since there are other effects that lower the aforementioned correlation between PCHB weights and occupation.There are both single and double excitations, and there can be double excitations of the type IJ ← AR.Hence, it is more probable that excitations happen to I or J that do not empty A and B simultaneously.
5.2.3.Fe(II)-Porphyrin.The next system is a Fe(II)porphyrin model complex whose properties were intensely discussed in our previous publications. 3,4,8,9,50,51We use the converged stochastic CASSCF natural orbitals from our previous work 8 with a (32,34) active space that consists of nine doubly occupied π-orbitals, seven empty π*-orbitals, five metal-centered 3d-orbitals, four doubly occupied σ N -orbitals, four empty orbitals of the (4s4p) shell, and five empty d′orbitals (double-shell orbitals).The natural MOs for this system are delocalized but not over the whole molecule.They are still confined to certain regions and can be identified as metal-centered, aromatic, or σ-donating nitrogen orbitals.So there is spatial separation of MOs, but not as strongly as that for the stack of 5 or 10 benzene molecules.
In Table 3, the efficiencies of different sampling schemes are listed for the Fe(II)-porphyrin molecule.As for the stack of benzene molecules, the fastest choice is again unif:full unif−full:full−full, but the maximum speedup by a factor of 1.21 is much smaller than for the localized benzene stack.
5.2.4.N 2 -Dimer.The last example is full CI on the N 2 dimer at an equilibrium distance with a Dunning's cc-pVQZ basis 36 using Hartree−Fock orbitals, resulting in a (14,110) active space.This system is highly delocalized and has an excess of empty orbitals.The excess of empty orbitals means that a randomly picked index is likely unoccupied, e.g., (A|IJ)| ≈ p 2 PCHB (A|IJ)| A∉|Dd i ⟩ , which means that the fast-weighted holeselection outperforms the fully weighted one.On the other hand, it is very important to guarantee orbitals to be occupied in the particle-selection for both single and double excitations.This means that only uniform or fully weighted particleselection are expected to be performant and explains why unif−fast particle selection for double excitations performs so badly for this system (Table 3).
In the case of single excitations, the approximation of contracting over all orbitals when calculating the weights (eq 22) is particularly poor for these Hartree−Fock orbitals, which means that PCHB weighting for single excitations does not have the same benefits as those for the other systems.For this reason, we see that purely uniform sampling (unif:unif) and uniform sampling of particles followed by fast-weighted selection of holes (unif:fast) show a very similar performance.
In the case of double excitation, there is no approximation regarding the matrix elements that would break down because of the use of Hartree−Fock orbitals.This means that the PCHB weights for doubles are close to the correct weights, and from the discussion so far, we conclude that fast-weighted hole-selection is the fastest method, which we can confirm in Table 3.The probabilities for the first particle I are nearuniformly distributed (compare the discussion around Figure 4); hence, the fastest double particle-selection is either unif− unif or unif−full.The first one is faster per wall-clock time, and the latter samples with higher quality.Looking at Table 3, we see that purely uniform particle selection (unif− unif) outperforms unif−full, which makes unif:unif unif−unif:fast−fast the fastest method for this system.

PCHB AND GAS
In the following, we show how, in the stochastic variant of the generalized active space approach (stochastic-GAS), occupation number constraints can be incorporated into the new weighted sampling scheme for singles and how the weighted particle-selection for single and double excitations can improve the GAS performance.In the following, we offer a brief summary of the GAS concept.Details on the stochastic-GAS algorithm can be found in ref 9.
In GAS, the active orbitals are partitioned into a number of active subspaces.Then only some distributions of particles among these subspaces are allowed, for example, by defining a minimum and maximum particle number per GAS space.A given allowed distribution of particles among the subspaces is called a supergroup. 9,12,52For a given supergroup, all possible configurations inside each subspace are allowed, i.e., a full CI wave function inside each subspace is generated.If there is only one supergroup, the GAS spaces are disconnected and no excitations between them are allowed.In Figure 8, a possible GAS wave function is displayed.−56 As we showed in our previous work, 9 the GAS constraints can be included into the PCHB probability distributions with negligible additional run-time cost.The supergroup of a given determinant can be determined by counting the particles per GAS space [a N ( ) e operation].Counting how many particles an excitation transfers between GAS spaces is a trivial operation.Hence, for a given supergroup (and all determinants belonging to it), an excitation is GAS allowed if the resulting particle distribution is still inside the chosen GAS constraints.This implies in particular that the starting determinant |D i ⟩ is not necessary; only its supergroup is required to determine if an excitation is GAS allowed.
For every supergroup, we can then define modified PCHB probability tables that automatically incorporate the GAS constraints.If we define i sg to be a labeling index for the At run time, we have to determine the supergroup index for a given determinant (the details of how to do this efficiently are in our previous work 9 ) and then draw from the corresponding probability distributions that were precomputed from eq 25.
Apart from the user-defined GAS constraints, there is, due to the Pauli-principle, a "natural" maximum particle number per GAS space, namely, the number of spin−orbitals.The weighted particle selection based on the GAS PCHB weights (eq 25) can use that information and improve the sampling performance.
To see this, let us consider the Anderson impurity model, which consists of one impurity orbital and n bath bath orbitals. 57he model allows only single excitations and only between the impurity and the bath orbitals, i.e., there may not be any excitations between bath orbitals.If we start from a determinant, where the impurity is doubly occupied, and select a particle on a bath orbital from which to excite, there will be no valid single excitation possible.Since there is a uniform probability of N This inefficient sampling can be cured with GAS and weighted particle-selection.If we define the impurity orbital to be its own GAS space and the bath orbitals to be in a second GAS space, then there are three allowed supergroups: [2, N e − 2], [1, N e − 1], and [0, N e ], where we just listed the number of particles per GAS space.If we start now from the supergroup [2, N e − 2] and assume that K is a spin−orbital index in the bath-orbitals, then we have for all possible holes A. This implies that p 1 PCHB (K|i sg ) = 0 and only one of the two particles in the impurity is selected to be excited, each one with a probability of 1 2 .

PCHB AND SPIN PURIFICATION
In this section, we want to show how the modified Hamiltonian is automatically efficiently sampled via PCHB.We showed in a previous work how this first-order spin penalty is useful, particularly in FCIQMC, to calculate spin-pure states in a SD basis. 10imilarly to the matrix elements of the Hamiltonian (eqs 5−7), we write the matrix elements for the S 2 operator as where S z the spin-projection in the z-direction, n α OS is the number of open-shell α electrons of a given determinant, and exchange excitations are all excitations of the type i α j β ← i β j α .The diagonal term does not appear in the excitation generation, and the single excitations are not modified at all.Only the double excitations of exchange type are actually modified and their respective PCHB weight (eq 12) becomes |g AIBJ − g AJBI + J|.If we additionally assume spatial orbitals, i.e., g AIBJ = g ad σ ,id τ ,bd μ ,jd ν = g a,i,b,j δ σ,τ δ μ,ν , we finally get |J − g ijji |.
For a given spin-projection σ, we write the primed σ′ for the opposite spin-projection.We conclude that the particle selection p 2 PCHB (j σ |i σ ′) with the modified weight becomes proportional to the spin penalty, i.e., opposite spin pairs are more likely to be picked.The increased probability of exchange interactions depends continuously on J and still incorporates the value of the integrals, e.g., g ijji .The same applies to the subsequent hole-selections, which are more likely to sample exchange excitations.Hence, the modified Hamiltonian is efficiently sampled, with a continuous dependence on J.

CONCLUSIONS
In this work, we have introduced an improved PCHB algorithm in FCIQMC that takes advantage of the locality of electron correlation and that allows FCIQMC dynamics that are two to four times more efficient than currently available algorithms.
A weighted sampling method for single excitations has been presented that uses precalculated probabilities.The weights for single excitations are obtained by contracting over all orbitals in the two-electron term rather than over only the occupied ones.This approximation works particularly well for localized orbitals, where it can take advantage of the exponential decay of integrals with increasing spatial distance.Precisely in the localized basis, single excitations are pivotal for efficient FCIQMC dynamics; hence, right when weighted sampling of single excitations becomes relevant, our approximation works well.
Also, optimal combinations of uniform, fully weighted, or fast-weighted sampling of particles or holes in double excitations have been identified for a variety of chemical systems, active spaces, and forms of the orbital basis (localized or delocalized).We conclude the following qualitative rules: (a) for localized orbitals and active spaces featuring a similar number of particles and spatial orbitals, the unif:full unif-full:full-full is generally the fastest combination of sampling schemes, as shown by the fragment-localized benzene stacks and the Fe-porphyrin examples.The speedup was up to a factor of four compared to the current implementation in NECI.(b) If the system uses delocalized orbitals with an excess of empty ones, such as the N 2 dimer in a large basis, then unif:unif unif-unif:fast-fast is preferred.This combination gives a 50% speedup compared to the choice of Holmes et al.

Journal of Chemical Theory and Computation
We have also suggested an alternative method to draw from constrained subsets of precalculated probabilities that is 5−10 times faster than the sampling methods currently available and scales practically constant when keeping the particle number constant and increasing the basis set size.The details are discussed in the Appendix.This improvement can be used in all implementations of PCHB, for both singles and doubles, and improves the performance of fully weighted sampling.
Our new precomputed weighting scheme for single excitations can already be applied in combination with stochastic-GAS and the spin purification approach in FCIQMC, making both methods more efficient.
As a final remark, we would like to highlight that the full strength of the novel precomputed weighting scheme emerges when combined with the spin-adapted implementation of FCIQMC, 7,34 based on the graphical unitary group approach (GUGA).This will be the subject of a future publication.In a series of recent works, 5−7,11,28−30 we have shown that by simple orbital localizations and reorderings, it is possible to obtain CI Hamiltonian matrices with a unique and unprecedented block diagonal structure.As a consequence, extremely compact wave functions (to the limit of near-singlereference) are obtained for ground and excited states of exchange-coupled polynuclear transition metal clusters.Since localized bases are propaedeutic to the GUGA wave function compression, it is clear that an excitation generator that exploits locality would further facilitate the screening of unnecessary excitation in such sparse Hamiltonians.Our nonuniform drawing scheme in GUGA would promote those single and double excitations among nearest-neighbors and would screen out excitations that are related to decaying integral values, yielding efficiently sampled excitations in extremely sparse Hamiltonians.

A.1. Constrained Sampling
In this section, we will discuss the various ways of how to efficiently sample from constrained subsets of known discrete probability distributions.To do so, we need to recapitulate the two main algorithms for sampling non-uniform, discrete distributions in the FCIQMC community, namely, the cumulative distribution function (CDF) and the alias sampling 23,35 methods.
The CDF sampling is shown in algorithm 1.It has linear scaling for the initialization and logarithmic scaling for the sampling step.
The alias sampling is shown in algorithm 2. The (cumulative) summation over weights w i is the same for both methods, but there is additional work in the initialization step of the alias sampling.This means that the scaling in the initialization is at least linear for both methods, but CDF sampling has a smaller prefactor.In the final step of the alias sampling (algorithm 2), there are two if−else branches.Both branches have constant scaling, but the one that returns K ⌊r⌋ has an additional indirection.To avoid this case, the sum of U i has to be maximized in the initialization, which is an NP-hard problem but can be approximately achieved by pairing large U i with small U i . 58Adding this optional sorting step lets the initialization of the alias tables scale with n n ( log( )), but the sampling becomes faster.
In addition, there is another trade-off between fast construction and fast sampling for both methods.Often the calling code, in particular FCIQMC, does not want to only sample a random value but wants to also know its probability p i .The necessary renormalization of the weights = p i w w i j j does not have to be done at initialization; it could be computed upon the request of the calling code.
To illustrate these effects, we will first compare two implementations of the alias method.The STL std::dis-crete_distribution implementation of GCC 7.5.0 is optimized for construction, while our own implementation is optimized for sampling time.The C++ code that was used to generate the data is available in the Supporting Information.The code ran on one core of an AMD EPYC 7763 CPU with 2.0 GHz.
In Figure 9, we see that there is a trade-off between optimizing for construction or sampling time.For this reason, we now assume that we optimize for sampling when the sampler is initialized once and then often sampled, while we optimize for construction, if the alias sampler should be constructed often.Now we can turn our eyes toward an efficient algorithm for sampling from a constrained subset of an already constructed Journal of Chemical Theory and Computation alias sampler.The main idea (algorithm 3) is to just sample via existing alias tables until the sampled index is contained in the subset.Since the alias sampling is so fast, this is usually faster than rebuilding alias or CDF tables for the constrained subsets.
There are some crucial details to improve the scaling, to correctly deal with floating point arithmetic, and to correctly work with zero probability distributions.If we are assuming that the index i is contained in the constrained subset M, it is drawn with a probability = There are certain situations when all weights and all p i are zero; then the application of eq 28 would wrongly yield ∑ j∈M p j = 1, although it should be zero.This has to be distinguished from the case where it is indeed ∑ j∈M p j = 1.For this reason, we have to directly evaluate ∑ j∈M p j , if the right-hand side of eq 28 is exactly one.There is also another reason that necessitates the direct evaluation.Equation 28 is not exactly true for floating numbers.Since the left-hand side appears in the denominator for renormalization (compare algorithm 3), these small differences can "blow up", if the overall expression is small.For this reason, the direct calculation is preferred, if N renorm ≈ 0. Both cases are caught in line If N renorm is nonzero but small, then algorithm 3 will perform many iterations in the while-loop (line 3.14).In this case, it is advised to switch back to CDF sampling (see line 3.10).
It is advantageous to represent M as both a sorted enumeration of the contained indices and as a bit mask, since both have an advantage for certain operations.The bit mask allows a very fast lookup to determine i ∈ M and the complement is just the bit wise NOT.The sorted enumeration can be faster to perform restricted summations over an index.
In Figure 10, we show the performance of constrained sampling via redrawing (algorithm 3) and compare against sampling via rebuilding the CDF or alias tables.The rebuilt alias tables are optimized for initialization, i.e., there is no sorting step in algorithm 2. We restrict the system size to 500 since we assume that calculations in the near future will not exceed 500 spin-orbitals.The subset size, measured as fraction of the number of elements, is varied with the following values {0.25, 0.5, 0.9}.As theoretically expected, all methods scale linearly with the system size, but there is considerable difference between redrawing and rebuilding.Redrawing is near-independent from the subset fraction since the resampling in the while-loop of algorithm 3 is so fast.Only if the subset weight gets minuscule, it is necessary to switch to a rebuilding algorithm (see line 3.10).The rebuilding algorithms on the other hand depend on the subset fraction; the higher the subset fraction, the slower they become.The difference between CDF and alias sampling is negligible for rebuilding Figure 9. Sampling times of the alias method with and without the optional sorting step in the initialization (compare algorithm 2) against the system size n.The dotted lines are purely the sampling times, while the solid lines include the sampling and construction times.and drawing once.If the subset fraction is 0.5, which is the usual case for balanced active spaces with equal number of spatial orbitals and particles, then redrawing is 5 to 10 times faster than rebuilding.
nonuniform (frequently referred to as weighted) visited determinant pairs |D i ⟩ ∼ |D j ⟩.
and the work by Neufeld and Thom 21 assumed fully weighted particle selection (p 2 PCHB (I)| I∈|Dd i ⟩ and p 2 PCHB (J|I)| J∈|Dd i ⟩ ) and fast-weighted hole selection (p 2 PCHB (A|IJ) and p 2 PCHB (B|IJA)) to be the fastest combination.In our earlier work using PCHB, 2,9 we have assumed uniform particle selection (p 2 uni (IJ)| I,J∈|Dd i ⟩ ) and fastweighted hole selection (p 2 PCHB

Figure 1 .
Figure 1.PCHB probability of the first particle p 2 PCHB (I) for a double excitation in a stack of 10 benzene molecules.The orbitals are ordered by fragment, and the vertical dotted lines separate MOs of different benzene molecules.Since α and β electrons have the same probability, assuming RHF-type orbitals, we could use the spatial orbital index.The horizontal dashed line gives the probability of the corresponding unconstrained uniform distribution p 2 uni (I) as reference.The value is obtained as 1/n orb = 1/120 = 8.3 × 10 −3 .

Figure 2 .
Figure 2. PCHB probability of the second particle after having picked an α electron in the first spatial orbital p 2 PCHB (J|1 ↑ ) for a double excitation in a stack of 10 benzene molecules.The orbitals are sorted by fragment, and the vertical dotted lines separate MOs of different benzene molecules.We again use spatial orbital indices.The blue line represents an opposite spin p 2 PCHB (j ↓ |1 ↑ ), while the orange line represents a parallel spin p 2 PCHB (j ↑ |1 ↑ ) for the second electron.The right figure is a logarithmic version of the left one.The horizontal dashed line gives the probability of the corresponding unconstrained uniform distribution p 2 uni (J|1 ↑ ) as the reference.The value is obtained as 1/n orb = 1/120 = 8.3 × 10 −3 .

Figure 3 .
Figure 3. PCHB probability of the first particle p 2 PCHB (I) for a double excitation in the N 2 dimer.The orbitals are sorted by probability.Since α and β electrons have the same probability, assuming RHFtype orbitals, we could use the spatial orbital index.The horizontal dashed line gives the probability of the corresponding unconstrained uniform distribution p 2 uni (I) as reference.The value is obtained as 1/ n orb = 1/220 = 4.5 × 10 −3 .

Figure 4 .
Figure 4. PCHB probability of the second particle after having picked an α electron in the first spatial orbital p 2 PCHB (J|1 ↑ ) for double excitation in the N 2 dimer.We use again spatial orbital indices, which are sorted by joint spatial probability, i.e., p 2 PCHB (j α |1 ↑ ) + p 2 PCHB (j β |1 ↑ ).The blue line represents an opposite spin p 2 PCHB (j ↓ |1 ↑ ), while the orange line represents a parallel spin p 2 PCHB (j ↑ |1 ↑ ) for the second electron.The right figure is a logarithmic version of the left one.The horizontal dashed line gives the probability of the corresponding unconstrained uniform distribution p 2 uni (J| 1 ↑ ) as reference.The value is obtained as 1/n orb = 1/220 = 4.5 × 10 −3 .

Figure 5 .
Figure5.Sampling time for drawing from a constrained subset that contains half of the elements of the total system size using two different algorithms: redrawing method (blue line, algorithm 3) and rebuilding an alias table for the constrained subset (orange line).The right figure is a logarithmic version of the left one.

Figure 6 .
Figure6.probability of the hole in a single excitation after having picked an α electron in the first spatial orbital, p 1 PCHB (a ↑ |1 ↑ ), on a stack of 10 benzene molecules.The blue line shows the exact probability (eq 20) assuming that the starting determinant |D i ⟩ = |2, 2, 2, 0, 0, 0; 2, •••⟩ is given by doubly occupying all π-orbitals and keeping all π*-orbitals empty for each benzene molecule.The orange line shows the approximated PCHB probabilities for single excitations (eq 23).Only unoccupied α spin−orbitals, i.e., allowed holes, are considered.The orbital indices are sorted by decreasing exact probability (blue line).(b) Logarithmic scale of (a).The horizontal dashed line gives the probability of the corresponding unconstrained uniform distribution p 1 uni (a ↑ |1 ↑ )| a↑∉|Dd i ⟩ as reference.The value is obtained from the inverse of the number of empty α-orbitals 1/(n orb ↑

Figure 7 .
Figure 7. Stack of 10 benzene molecules at a distance of 3.0 Å.

Figure 8 .
Figure 8. Example of a GASCI wave function.
particle from the bath orbitals, a uniform particle selection will produce invalid excitations with a probability of at least N N with the number of elements in M. If M contains many elements, it is advantageous to scale with the complement via

. Excitation Probabilities. The
aim of excitation generators in FCIQMC is to suggest a new determinant, |D j ⟩, from |D i ⟩, with a probability proportional to their matrix element.So the optimal excitation generator would sample with the following probabilities two determinants are connected, and |D j ⟩ ← |D i ⟩ for a specific excitation from |D i ⟩ to |D j ⟩.Moreover, the labels |D i ⟩ 1 |D j ⟩ or |D i ⟩ to indicate whether |D j ⟩ is obtained via single or double excitations from |D i ⟩ to |D j ⟩.

Table 1 .
Different Sampling Methods and Their Run-Time Scaling

Table 2 .
Different Sampling Methods for Single Excitations and Their Scaling

Table 3 .
Efficiency of Different Combinations of Sampling Schemes from Tables1 and 2for All Systems a a