Off-lattice and parallel implementations of the pivot algorithm

The pivot algorithm is the most efficient known method for sampling polymer configurations for self-avoiding walks and related models. Here we introduce two recent improvements to an efficient binary tree implementation of the pivot algorithm: an extension to an off-lattice model, and a parallel implementation.


Introduction
Self-avoiding walks are non-intersecting paths on lattices such as the two-dimensional square lattice or the three-dimensional simple cubic lattice.Due to universality, they exactly capture the essential physics of the excluded-volume effect for polymers in the good-solvent limit, and as such can be used to study features such as the value of the Flory exponent ν which relates the geometric size of a walk to the number of monomers in the chain.
The pivot algorithm is the most efficient known method for sampling self-avoiding walks of fixed length.It is a Markov chain Monte Carlo method, which was invented by Lal [1], but first studied in depth by Madras and Sokal [2], who also invented an efficient hash table implementation.Recent improvements to the implementation of the pivot algorithm [3][4][5] have dramatically improved computational efficiency to the point where it is possible to rapidly sample polymer configurations with up to 1 billion monomers [6].
In this paper, we will describe two recent improvements in algorithms to sample selfavoiding walks, focusing in particular on the pivot algorithm.In Sec. 2 we describe an offlattice implementation of the SAW-tree data structure [5].In Sec. 3 we describe a parallel implementation of the pivot algorithm which improves the sampling rate for very long walks.Finally, we have a brief discussion about prospects for further progress and conclude in Sec. 4.

Off-lattice implementation
The SAW-tree data structure [5] is a binary tree that encodes information about the self-avoiding in an efficient way in nodes in the tree.In particular, the leaves of the tree consist of individual monomers, while the internal nodes store aggregate information about all of the monomers that are below that node within the tree, as well as "symmetry" information which encodes transformations that must be applied to sub-walks before they are concatenated together.The aggregate information that must be stored includes information about the extent of the subwalk in the form of a "bounding volume", which is taken to be a rectangle for square-lattice walks, and a rectangular prism for simple-cubic-lattice walks.For lattice self-avoiding walks, the symmetry elements are rotations and reflections that leave the lattice invariant.See [5] for a full description of the implementation.
Although lattice self-avoiding walks capture the universal behaviour of polymers in the goodsolvent limit, there are strong arguments for why off-lattice models of polymers may have advantages under certain circumstances.Firstly, they provide an opportunity to empirically model more realistic interactions for polymers, and thus to reproduce not only universal features but also make precise experimental predictions.Secondly, under some circumstances it may be the case that the effect of the lattice may have a non-negligible effect, for example when trying to understand the nature of the globule transition it may be the case that the restriction to the lattice significantly influences the nature of the transition.Finally, while lattices have discrete symmetry groups, the symmetry group corresponding to reflections and rotations of R d is the continuous orthogonal group O(d).This continuous group allows for more freedom for performing pivot moves, and it is conceivable that this additional freedom may enhance sampling efficiency under some circumstances.
We implement the SAW-tree for the bead-necklace, or tangent-hard-sphere, model, which consists of a fully flexible chain of hard spheres that just touch.A typical configuration for this model in R 2 is shown in Fig. 1.We will now describe the key features of our implementation, and will present evidence that the off-lattice SAW-tree implementation of the pivot algorithm has O(log N ) performance in line with the performance of the original lattice SAW-tree implementation.The description will not be self-contained, and the interested reader is referred to [5] for relevant details.
The orthogonal group O(2) is used as the symmetry group for R 2 , and similarly O( 3) is used for R 3 .The orthogonal group includes rotations as the subgroups SO(2) and SO(3) respectively, but also includes reflection moves.
Symmetry group elements are sampled uniformly at random so as to preserve the Haar measure [7] on the group.This automatically ensures that the Markov chain satisfies the detailed-balance condition, and so must be sampling configurations with uniform weights.
As for ergodicity, we feel that it is extremely likely that the algorithm is ergodic.For lattice models the pivot algorithm has been proved to be ergodic; this was first done for Z 2 and Z 3 in the seminal paper of Madras and Sokal [2].Interestingly, inclusion of reflections seem to be necessary for ergodicity for lattice models.In the continuum, it is our view that the additional freedom afforded as compared to the lattice should mean that pivot algorithm is ergodic in this case, too.We do not have sufficient insight into the problem to know whether the extra freedom would allow one to have an ergodic algorithm with only rotations (and not reflections).Some theoretical work has been done previously on the ergodicity of pivot moves for continuous models [8], but this is not directly relevant here as the proof relied on double-pivot moves.
The key decision for the SAW-tree implementation for the bead-necklace model is the choice of bounding volume to be used.The bounding volume is a shape which is stored in nodes in the SAW-tree, such that it is guaranteed that the entire sub-chain which is represented by the node is completely contained within the bounding volume.The use of a bounding volume is necessary for the rapid detection of self-intersections when a pivot move is attempted.
The natural choice of the bounding volume for Z 2 is the rectangle, and for Z 3 the natural choice is the rectangular prism.This is because these shapes snugly fit the sub-chains that they contain (in the sense that the sub-chains must touch each boundary or face of the shape), and the shapes are preserved under lattice symmetry operations.
The natural shape for the bounding volume for the bead-necklace model for R 2 would seem to be the circle, and similarly for R 3 the natural choice would be the sphere.This is because these are the only shapes that are invariant under the action of O(2) and O(3) respectively.
One of the operations that must be performed with bounding volumes [5] is the merge operation, which involves combining two bounding volumes (which contain sub-chains) to create a bounding volume that contains both of the original bounding volumes (and hence contains both sub-chains).In contrast to the situation for lattice models, the bounding volumes which result from the merge operation do not necessarily form a snug fit for the polymer sub-chains.This is illustrated in Fig. 2 for an example in R 2 where the snugly fitting bounding circles for two sub-chains are merged together so that they contain the concatenated walk.The concatenated walk does not touch the boundary of the larger circle.A priori, we had no expectation about whether the lack of snug fit for the bounding volumes would prove to be a significant problem.We considered it possible that the error from the fit would grow rapidly as one moved up the SAW-tree, and this would have worsened the performance of the intersection testing algorithm.But, we found that in fact this was not a problem at all.We estimated the mean ratio of the diameter of the bounding volume to the square root of the mean value of the squared end-to-end distance R 2 E 1/2 .We found that as the length of the chains increased the ratio was approaching a constant for both R 2 and R 3 , indicating that the error was becoming saturated.For chain lengths of N = 10 6 this ratio was only 1.45 for R 2 , and 1.71 for R 3 .Thus, in the average case this suggests that the lack of a snug fit only results in a constant factor error in the diameter of the bounding volume for the off-lattice implementation.This means that the behaviour of the lattice and off-lattice implementations should be essentially the same, up to a constant factor.
We evaluated the mean CPU time per pivot move for a range of polymer lengths, for lattice and off-lattice SAW-tree implementations in two and three dimensions on Dell PowerEdge FC630 machines with Intel Xeon E5-2680 CPUs, and plot the results of these computer experiments in Figs 3 and 4.
We found that the time per pivot move attempt was somewhat worse for the off-lattice implementation as compared to the lattice implementation, which was to be expected due to the increased number of operations required for computations involving the symmetry elements and coordinate vectors.But, in absolute terms the performance is still impressive, and for polymers with 10 7 monomers pivot attempts are performed in mean CPU time of less than 6µs for R 2 , and in less than 40µs for R 3 .We clearly observe O(log N ) behaviour in each case, which is strong evidence that the off-lattice implementation behaves in fundamentally the same way as the original lattice implementation of the SAW-tree.

Parallel implementation of the pivot algorithm
The SAW-tree implementation of the pivot algorithm [5] is remarkably efficient, but it suffers from one significant drawback: the intersection testing and SAW-tree update procedures are inherently serial operations.This makes it difficult to take advantage of additional cores to improve the rate at which polymer configurations are sampled.To some extent this issue is obviated by the fact that for number of monomers N up to the order of tens of millions or even 100 million it is possible to run simulations in parallel on multicore machines, and still obtain results in a reasonable clock time.But, in the regime where a large amount of memory is needed for truly large N , of the order of 10 8 − 10 9 , on the Dell PowerEdge FC630 machines with Intel Xeon E5-2680 CPUs on which  computer experiments are being run this prevents all cores being simultaneously used due to memory constraints1 .Under these circumstances most cores must be left idle while data is being collected.
Here we will briefly sketch a method to improve the sampling rate by utilising additional cores in exactly this difficult regime.
The key insight is that as the number of monomers increases, the probability of a pivot move being successful decays as a power law of the form N −p , with p ≈ 0.19 for Z 2 , and p ≈ 0.11 for Z 3 .For N = 10 9 on Z 2 , the probability of a pivot move being successful is 0.019, which means that on average roughly 50 unsuccessful pivot attempts are made for each success.
Given that most proposed pivot moves in this regime fail, and so do not result in any update being made for the self-avoiding walk, it is possible to perform many pivot attempts in parallel without this effort being wasted.
For example, imagine that we are sampling SAWs of 10 9 steps via the pivot algorithm, and we may test for success or failure of up to ten pivot moves simultaneously.Note that a move consists of a proposed monomer location to act as the centre of the pivot move, and a proposed symmetry operation.Suppose for the first batch of ten proposed moves {M 1 , M 2 , • • • , M 10 }, that each of these moves were unsuccessful.Then, we can move on to another batch, and none of the work performed by any of the threads was wasted.Suppose for the second batch {M 11 , M 12 , • • • , M 20 } that the first 6 moves M 11 , • • • , M 16 are unsuccessful, but M 17 is successful.Then we need to perform the update associated with the move M 17 which must happen as a serial operation performed by a single thread.It does not matter whether M 18 , M 19 , M 20 were successful or not: these tests will need to be performed again in case the update has altered the result of the test.The next batch will then consist of ten proposed moves {M 18 , M 19 , • • • , M 27 }.
The tests for success or failure will occur for each thread regardless of the outcome of the tests performed by other threads.But, provided the probability of multiple successful moves occurring in a batch is low, then most of this work will not be wasted.The lower the probability of success, the greater the potential for speed up to occur by exploiting parallelism.
We have implemented this idea in a prototype C program with OpenMP being used for managing the parallel pivot attempts.The SAW-tree is held in shared memory where all threads can access it for performing intersection tests.When a pivot move is found to be successful, then the update is performed by a single thread while all other threads remain idle.
We performed computer experiments to test this implementation on the aforementioned FC630 machines for SAWs of various lengths on the square lattice.We utilised 24 threads, with batches (or chunks) of 48 pivot attempts which meant that each thread made two attempted pivot moves on average.We collated the calendar time per pivot attempt in µs in Table 1.The value t 1 is the mean CPU time for a single thread, while t 24 is the mean CPU time for the 24 threads running in parallel.We see that as N increases the probability of a move being successful decreases, and the relative performance of the parallel implementation to the serial implementation improves.For N = 10 9 there is roughly a four-fold improvement in performance.
Although it is suitable as a proof-of-concept, the implementation developed thus far is only a prototype, and more work remains to be done to improve its performance.In particular, it should be possible to re-use some information from intersection tests even if these moves are scheduled to occur after a move that is found to be successful.For example, if a move is found to cause a self-intersection between monomers labelled l and m along the chain, then if the prior succesful move involved a pivot site outside of the interval l to m then this would not have any effect on the self-intersection.Nonetheless, even in its current state the performance gain is sufficient to make it worthwhile for use in the large N , memory-limited regime.

Discussion and conclusion
Schnabel and Janke [9] have very recently implemented a binary tree data structure which is similar to the SAW-tree for the bead-necklace model, as well as a model for which the Lennard-Jones interaction is implemented.The implementation for the bead-necklace model appears to have roughly the same computational efficiency as the implementation sketched here.The efficient implementation for the Lennard-Jones polymer model is very interesting, and a significant advance on the state of the art.It will be interesting to see if further progress in this direction can be made, for example in the evaluation of Coulomb interactions which would be necessary for efficient simulation of polyelectrolytes.Full details for the off-lattice SAW-tree implementation of the pivot algorithm will be presented elsewhere in future.
More work needs to be done to test and improve the implementation of the parallel version of the pivot algorithm.In future, the parallel implementation of the pivot algorithm will allow for improved simulations of very long SAWs on the square lattice.The method will result in significant speed-ups for SAWs with hundreds of millions or even one billion steps, especially for the square lattice.

Figure 2 .
Figure 2. Merging bounding circles for two sub-chains to form the bounding circle for the concatenated chain in R 2 .

Figure 3 .
Figure 3. CPU time per pivot move attempt for the bead-necklace model in R 2 , in comparison to SAWs in Z 2 , plotted against the number of monomers N .

Figure 4 .
Figure 4. CPU time per pivot move attempt for the bead-necklace model in R 3 , in comparison to SAWs in Z 3 , plotted against the number of monomers N .

Table 1 .
Performance of the prototype parallel implementation of the pivot algorithm for Z 2 .