Fast algorithm for generating random bit strings and multispin coding for directed percolation

We present efficient algorithms to generate a bit string in which each bit is set with arbitrary probability. By adopting a hybrid algorithm, i.e., a finite-bit density approximation with correction techniques, we achieve 3.8 times faster random bit generation than the simple algorithm for the 32-bit case and 6.8 times faster for the 64-bit case. Employing the developed algorithm, we apply the multispin coding technique to one-dimensional bond-directed percolation. The simulations are accelerated by up to a factor of 14 compared with an optimized scalar implementation. The random bit string generation algorithm proposed here is applicable to general Monte Carlo methods.


I. INTRODUCTION
A computer performs computation in the unit of words, which consist of a fixed number of bits.Multispin coding (MSC) is a technique to reduce the memory space and execution time in Monte Carlo simulations by packing information of spins into a computer word.In modern CPUs, the width of a computer word is 32 or 64, and bit operations between words are usually executed in one cycle.Therefore, a simulation of a system with discrete degrees of freedom can be markedly accelerated by effectively utilizing MSC and bit operations.The MSC technique was first mentioned by Friedberg and Cameron in 1970 [1], but Rebbi and co-workers proposed MSC in the form widely used today [2,3].Since then, the MSC of spin systems, especially for the Ising model, has been extensively studied, mainly in the 1980s [4][5][6][7][8][9][10].A specialpurpose computer for Ising models was developed on the basis of MSC [11].The MSC technique has also been applied to a cellular automaton [12].
In MSC, a spin configuration is updated by a singlespin flip algorithm of the Metropolis type [13].It is known that a single-spin flip algorithm becomes very inefficient in the vicinity of the critical point owing to critical slowing down.To address the problem of critical slowing down, Swendsen and Wang proposed a cluster-flip algorithm [14].While the Swendsen-Wang algorithm was first applied to a system with discrete degrees of freedom, Wolff proposed an algorithm for applying the cluster-flip technique to continuous spin systems by discretizing the degree of freedom [15].Since the cluster-flip algorithms are highly efficient compared with the single-spin flip algorithm, it is common to adopt a cluster algorithm in systems with Ising-like discrete degrees of freedom or systems whose spins degree of freedom can be discretized by Wolff's method.However, there are systems with Isinglike degrees of freedom for which a cluster-flip algorithm is not known.One of such systems is the directed percolation (DP), which is the directed version of isotropic percolation.Consider a lattice consisting sites and bonds.
Each bond is open with some probability and blocked otherwise.In isotropic percolation, a site becomes active when it connected to another active site with an open bond.In DP, however, a site can be activated only along a preferred direction.The universality class of DP is different that of the isotropic percolation [16].The transition from a laminar flow to turbulence is expected to belong to the DP universality class, which has been confirmed in several experiments [17,18].The sites in DP have two states, active and inactive.Therefore, it is natural to consider applying the MSC technique to DP simulation.However, to implement the MSC of DP, it is necessary to generate a bit string in which each bit is set independently with an arbitrary probability as described later.In the present manuscript, we propose a new algorithm for generating such bit strings efficiently.Employing the algorithm, we implement MSC for onedimensional bond-directed percolation and confirm that the calculations are significantly accelerated.
The rest of the article is organized as follows.The random bit string generation algorithm is described in the next section.The application of the algorithm to 1d-BDP with the MSC technique is described in Sec.III.Section IV is devoted to a summary and discussion.The associated code is available at [19].

II. RANDOM BIT STRING GENERATION ALGORITHM
We consider an algorithm to generate a bit string with length N bit such that each bit is 1 with a given probability p and otherwise 0. A simple algorithm is shown in Algorithm 1.Here, U r (a, b) is a stochastic real variable distributed uniformly from a to b.This simple algorithm involves generating random numbers N bit times regardless of the value of p.However, when p is small, the number of bits to be set is also small.Employing this fact, we can reduce the cost of generating bit strings.In the following, we propose two such algorithms.One algorithm determines the number of bits to set first and then determines their positions, which will be described next.The other algorithm expresses the desired bit string as a logical OR of independent bit strings, which will be described in Sec.II B.

Algorithm 1 Simple Algorithm
if Ur(0, 1) < p then 4: end if 6: end for A. Binomial-Shuffle Algorithm In the simple algorithm, it is necessary to generate N bit random numbers regardless of the probability p.Since the number of set bits is about pN bit , we can reduce the number of random numbers generated by first determining the number of set bits, which follows the binomial distribution, and then shuffling their positions.We refer to this algorithm as the binomial-shuffle algorithm.First, we determine the number of set bits m out of N bit with probability p.This is a random number following the binomial distribution of N bit trials with probability of success p.After determining the number of set bits, we choose their positions.Adopting Floyd's sampling algorithm, this selection process involves random number generation m times.The pseudocode of the binomialshuffle algorithm is shown in Algorithm 2.Here B(n, p) is a stochastic integer variable following the binomial distribution with parameters n and p, where n is the number of trials and p is the probability of success.We can implement the function B(n, p) with O(1) complexity by adopting Walker's method of aliases [20,21].U d (a, b) is a stochastic integer variable distributed uniformly from a to b.Since we need one random number to determine the number of set bits and pN bit random numbers to shuffle the positions of the bits on average, this algorithm involves the generation of pN bit + 1 random numbers.
Algorithm 2 Binomial-Shuffle Algorithm s ← s ∨ (1 i) The binomial-shuffle algorithm reduces the cost of generating a bit string by first determining the number of set bits in O(1) complexity and next determining their position in O(pN bit ) complexity.Todo and Suwa proposed an efficient algorithm to generate a bit string in which each bit is set with some probability p k , where k is the index of the bits [22].They employed the space-time interchange technique together with Walker's method of aliases and achieved the generation of such a bit string in O( k p k ) complexity.This algorithm was applied to spin systems with long-range interactions and achieved O(N ) complexity for the cluster Monte Carlo method without introducing any cutoff [23].We need a bit string in which all bits are set independently with identical probability.Since this is a special case of Todo and Suwa's case, we can apply their algorithm to our problem as follows.Consider an N bit -length bit string with one of the bits is set randomly.We generate such bit strings k times and take the logical OR between them.Then each bit of the resulting bit string is 1 with probabil- We choose the number k following the Poisson distribution with parameter λ so that the probability p becomes p on average.Since the probability that the number of events is k in the Poisson process with parameter λ is Therefore, we have λ = −N bit log(1 − p).Then we obtain the following algorithm: 1) determine an integer k following the Poisson distribution with parameter λ = −N bit log(1 − p), 2) generate k bit strings in which one of the N bit bits is set randomly, and 3) take the logical OR between them.We refer to this algorithm as the Poisson-OR algorithm, whose pseudocode is shown in Algorithm 3.Here Poisson(λ) is a stochastic integer variable following the Poisson distribution with parameter λ.We can generate such an integer with O(1) complexity by adopting Walker's algorithm.We need one random number to determine k and −N bit log(1 − p) random numbers to generate the bit strings for disjunction, and therefore, the Poisson-OR algorithm involves the generation of −N bit log(1 − p) + 1 random numbers.While the number of random numbers generated in the Poisson-OR algorithm is always larger than that in the binomial-shuffle algorithm since pN bit ≤ N bit log(1 − p), they are close to each other when p is small.Additionally, the loop body of the Poisson-OR algorithm is simpler than that of the binomial-shuffle algorithm.Therefore the Poisson-OR algorithm can be faster than the binomial-shuffle algorithm.
Algorithm 3 Poisson-OR Algorithm In the previous algorithms, the bit string in which all bits are zero was adopted as the initial value.We can reduce the number of random numbers generated by preparing appropriate initial bit string.Our central idea is to first generate a bit string such that each bit is set with an approximated probability p , and then correct the difference |p −p| by binomial-shuffle or Poisson-OR algorithms.Consider n random bit strings in which each bit is set with probability 1/2.We form a new variable y n by taking the logical OR or logical AND between {x i }.For example, y 1 = x 1 and y 2 = x 2 ∨ x 1 or y 2 = x 2 ∧ x 1 , etc. Let pn be the probability that each bit of y n is set.Then the possible values of pn are as follows.
When pn is expressed in binary notation, the number of digits after the radix point is n.Suppose the binary notation of pn is given by pn where (• • • ) bin denotes the binary notation.Then the bit string y n such that each bit is set with probability pn can be generated by the following procedure: The above procedure involves the generation of n random bit strings and n − 1 bit operations.We adopt the obtained y n as the initial value and correct the difference of |p n − p|.When pn < p, we correct y n as where z is a bit string such that each bit is set with probability p .Since we have When p < pn , we correct y n as Then we have and therefore, In both cases, the probability of the correction p is O(|p− pn |).Therefore, the number of random numbers that must be generated for the correction becomes small when pn is close to p.
As an example, consider the probability p = 0.6447, which is the critical point of 1d-BDP for the case of N bit = 32.If we adopt the binomial-shuffle algorithm without the correction, pN bit + 1 ∼ 21.6 random numbers must be generated on average.The binary notation of p is p = 0.6447 = (0.101001010 First, consider the one-digit approximation p1 = 1/2 = (0.1) bin .The corresponding bit string is y 1 = x 1 .The probability for the correction is p = (0.6447 − 0.5)/(1 − 0.5) = 0.2894.Then the number of random numbers that must be generated to generate a bit string for the correction z is p N bit + 1 ∼ 10.26.Therefore, the total number of random numbers generated is 11.26, which is much smaller than that for the algorithm without correction.Next, consider the three-digit approximation p3 = 5/8 = (0.101) bin .The corresponding bit string y 3 can be obtained as Since p = (0.6447 − 0.625)/(1 − 0.625) ∼ 0.053, the number of random numbers that must be generated to generate z is p N bit + 1 ∼ 2.68.Since we need three random numbers to generate y 3 , the total number of random numbers generated is 5.68.The next possible step of sixdigit approximation will require at least seven random numbers, and therefore, the three-digits approximation is the most efficient.If the number of digits in the approximation increases, the number of random numbers that must be generated for the correction decreases, whereas that required to generate the initial bit string increases.Therefore, there is an optimal number of digits for approximation.The optimal number of digits for the approximation and the expected number of random numbers generated are shown in Fig. 1.This figure shows the case where 0 ≤ p ≤ 0.5.One can generate a bit string for p > 0.5 by first generating a bit string with probability 1 − p and inverting it.The expected number of random numbers generated to generate a 32-bit string in which each bit is set with arbitrary probability p is at most 7.

D. Benchmark Results
We performed benchmark tests on HPE SGI 8600 system with Intel Xeon Gold 6184 CPU at the Institute for the Solid State Physics of the University of Tokyo.The program was compiled using Intel C++ compiler 18.0.1 with the option -O3 -xHOST and executed as a singlethreaded process on a single CPU core.The generation speed of random bits is shown in Fig. 2. Here, we adopt the unit MBPS (millions of bits per second), which is 1 when one million bits are generated in one second.When k bit strings of N bit width are generated in t [s], the generation speed is kN bit • 10 −6 /t [MBPS].We estimate the generation speed by observing the time required to generate 4000000 bit strings in which each bit is set with probability p = 0.6447.The Poisson-OR algorithm with the correction from the three-digit approximation was the fastest.Compared with the simple algorithm, the generation speed was about 3.8 times faster for the 32bit case and 6.8 times for the 64-bit case.Adopting the 64-bit implementation, the performance of the binomialshuffle algorithm is improved by 24% where that of the Poisson-OR algorithm is improved by 38%.The time required to generate a random bit string is expected to be roughly proportional to the number of random numbers generated.When the probability p is close to the n-digit-approximated probability pn , then the number of random numbers generated for correction becomes small as shown in Fig. 1.Consider the region 0.125 < p < 0.1875.When p is close to 0.125, then threedigit-approximation from 1/8 will exhibit the best performance, while four-digit-approximation from 3/16 will exhibit the best performance when p is close to 0.1875.Therefore, it is expected that the performance inversion will occur between the three-and four-digit approximations in the region 0.125 < p < 0.1875.To demonstrate this, the probability dependence of the time required to generate random bit strings is shown in Fig. 3.The performance of the three-and four-digit approximations is reversed at p = 0.1805, which is close to the expected value of p = 0.181.The time required to generate 32-bit random bit strings 10 7 times is denoted by the dashed lines.We adopt the Poisson-OR algorithm with correction.The color corresponds to the number of digits n used in the approximation.One can see that the performance inversion occurs near the theoretically expected probabilities, i.e., the intersection points of Nr.

III. APPLICATION TO DIRECTED PERCOLATION
Employing the fast random bit-string generation algorithms developed in the previous section, we apply MSC to 1d-BDP.The time evolution of 1d-BDP is shown in Fig. 4 (a).Each bond is open with probability p and blocked otherwise.The state of the ith site at time t is denoted by σ t i .The site is active when σ t i = 1 and inactive when σ t i = 0.If a site at time t + 1 is connected to an active site at time t with an open bond, then that site becomes active.The states of the sites at time t + 1 are determined by the states at time t and the time evolution is performed by iterating this process.The scalar implementation to determine the states of the sites at time t + 1 from the states at time t is shown in Algorithm 4. Here, we omit the processing of the boundary conditions.Algorithm 4 Scalar Implementation of 1d-BDP if Ur(0, 1) < p then 4: if Ur(0, 1) < p then 7: end if 10: end for We pack the information of N bit sites in a bit string, where N bit = 32 or 64.If a bit is set, then the corresponding site is active and inactive otherwise.The kth bit string at time t is denoted by s t k .The bit string s t k contains the information of which each bit is set with probability p and take the logical OR between s t k and x 1 as t 1 = s t k ∧ x 1 .Then each bit of s t k survives to t 1 with probability p.Therefore, t 1 can be considered as the active sites at time t + 1 connected to the active sites at time t with the lower left bond.Similarly, we generate t 2 = s t k ∧ x 2 , which denotes the active sites at time t + 1 connected to the active sites at time t with the lower right bond.Then the site configuration at time t + 1 is obtained by taking the logical OR between t 1 and (t 2 1).Note that the most significant bit of t 2 should be copied to the least significant bit of s t+1 k+1 .A schematic illustration of the implementation is shown in Fig. 4 (b) and its pseudocode is shown in Algorithm 5. RBS(p) denotes a random bit string of length N bit in which each bit is set with probability p.The processing of the boundary conditions is omitted.
We perform the benchmark simulations of 1d-BDP using the MSC implementation.The benchmark conditions are the same as those in Sec.II D. The cluster growth from a single seed at the criticality p = 0.6447 is shown in Fig. 5 (a).The system size is L = 32768, the time evolution is performed for 32768 steps and 10 3 independent samples are averaged.Power-law behavior is ob- x1 ← RBS(p) s t+1 k+1 ← (t2 (N bit − 1)) 9: end for served for the number of active sites n(t) ∼ t Θ , where Θ = 0.313 [24].The time required to perform these simulations is shown in Fig. 5 (b).The fastest algorithm was the Poisson+OR algorithm with the correction, whose speed was 14 times that of the scalar algorithm.
The relaxation process from the all-active state at the criticality p = 0.6447 is shown in Fig. 6 (a).The sys-  tem size is L = 32768 with the periodic boundary condition.The time evolution is performed for 32768 steps and 10 independent samples are averaged.Power-law decay t −α is observed for the density of the active sites ρ(t) = n(t)/L, where α = 0.159 [24].Time required to perform these simulations is shown in Fig. 6 (b).While the fastest algorithm is the Poisson+OR algorithm with the correction, in this case, the increase in speed is only 4.5 times, which is much smaller than that in the cluster growth simulation.

IV. SUMMARY AND DISCUSSION
We have developed two efficient algorithms to generate a bit string in which each bit is set with arbitrary probability: the binomial-shuffle and the other is the Poisson-OR algorithms.The binomial-shuffle algorithm first determines the number of set bits and next determines their position.The Poisson-OR algorithm first determines the number of bit strings in which one bit is set randomly and then takes the logical OR between them.While expected number of generated random numbers is smaller in the binomial-shuffle algorithm than in the Poisson-OR algorithm, the Poisson-OR algorithm is faster owing to the simple loop structure.The number of random numbers generated has been reduced by adopting a probability approximation and a correction.By approximating the probability in binary notation with a finite number of digits, we generate an approximated bit string as an initial input.Then the number of the random numbers that must be generated for the correction markedly decreases.
We developed the MSC technique for 1d-BDP using the random bit string generation algorithms and achieved a marked increase in speed.The MSC was more effective in the simulation of cluster growth from a single seed than in that of the relaxation from the fully active state.This is due to the local density of the active sites.The number of random numbers generated for the scalar algorithm is proportional to the density of active sites, while that for MSC is independent of the density.Therefore, the efficiency of MSC decreases as the density of active sites decreases.Since the density of active sites in the relaxation process decreases monotonically, the efficiency of MSC decreases over time.In the case of cluster growth simulation, we only update the region between the leftmost to rightmost active sites.Then the local density of active sites hardly changes and MSC works effectively for this case.
The random bit string generation algorithm is expected to be applicable to general Monte Carlo simulations on lattice systems.In the present work, we did not consider the use of SIMD instructions.SIMD stands for single instruction multiple data and it allows datalevel parallelism with the SIMD register.For example, a 512-bit register is available in the Intel Advanced Vector Extensions (AVX-512).By using SIMD instructions, the efficiency of MSC can be further improved.In recent years, general-purpose computing on graphics processing units (GPGPU) has attracted the interests of many researchers.The MSC of the Ising model was implemented on GPGPU [25], and Komura and Okabe implemented the Swendsen-Wang algorithm on GPGPU [26].The implementation of the MSC algorithm presented in the manuscript on GPGPU should also be attempted in future.

FIG. 1 :
FIG. 1: (Color online) Expected number of random numbers generated Nr for n-digit approximation in (a) the binomialshuffle algorithm and (b) the Poisson-OR algorithm.The cases of N bit = 32 are shown.

8 FIG. 2 :
FIG.2:(Color online) Generation speed of random bits set with probability p = 0.6447 in the unit of MBPS (millions of bits per second) for (a) 32-bit and (b) 64-bit cases.The results for the simple algorithm (Simple), the binomialshuffle algorithm (BS), the binomial-shuffle algorithm with correction from p3 = 5/8 (BS from 5/8), the Poisson-OR algorithm (PO), and the Poisson-OR algorithm with correction (PO from 5/8) are shown.

FIG. 3 :
FIG.3:(Color online) Time required to generate random bit strings together with the expected number of random numbers generated Nr, which is shown in Fig.1 (b).The time required to generate 32-bit random bit strings 10 7 times is denoted by the dashed lines.We adopt the Poisson-OR algorithm with correction.The color corresponds to the number of digits n used in the approximation.One can see that the performance inversion occurs near the theoretically expected probabilities, i.e., the intersection points of Nr.

FIG. 4 :
FIG. 4: (Color online) (a) Time evolution of 1d-BDP.Each bond is open (solid lines) with probability p and blocked (dashed lines) otherwise.If the site on the lower left or lower right of an active site is connected by an open bond, then that site is activated.(b) Implementation of 1d-BDP by bit operations.Although we performed 32-bit or 64-bit implementation, 8-bit implementation is shown for visibility.
FIG.5: (Color online) (a) Cluster growth of 1d-BDP from a single seed at the criticality (log-log axes).The system size is L = 32768.The time evolution is performed for 32768 steps and 10 3 independent samples are averaged.The solid line denotes t Θ , where Θ = 0.313.The results with the scalar code (Scalar) and bit-operation implementation with the binomialshuffle algorithm (BS) and Poisson-OR algorithm (PO) are shown.We adopt the correction from the three-digit approximation p3 = 5/8 for both the BS and PO algorithms.(b) Time required for the simulations.

FIG. 6 :
FIG. 6: (Color online) (a) Decay of the density of 1d-BDP at the criticality (log-log axes).The time evolutions of the densities of the active sites from the fully active state are shown.The system size is L = 32768.The time evolution is performed for 32768 steps and 10 independent samples are averaged.The solid line denotes t −α where α = 0.159.The results with the scalar code (Scalar) and bit-operation implementation with the binomial-shuffle (BS), the Poisson-OR (PO) algorithms are shown.We adopt the correction from the three-digit approximation p3 = 5/8 for both BS and PO algorithms.(b) Time required for the simulations.