Parallelized calculation of permutation tests

Abstract Motivation Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests are the relatively few assumptions about the distribution of the test statistic are needed, as they rely on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed broad sample distribution of the test statistic applies. However, in this situation, permutation tests are rarely applied because the running time of naïve implementations is too slow and grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Albeit this significant running time reduction, the exact test has not yet become one of the predominant statistical tests for medium sample size. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive. Results Parallelization of the Green algorithm was found possible by non-trivial rearrangement of the structure of the algorithm. A speed-up—by orders of magnitude—is achievable by executing the parallelized algorithm on a GPU. We demonstrate that the execution time essentially becomes a non-issue for sample sizes, even as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g. the widely used asymptotic Mann-Whitney U-test. Availabilityand implementation In Python 3 code from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Permutation tests are frequently used for non-parametric testing and are incredibly valuable within computational biology, with applications within genome-wide association studies (Browning, 2008;Dudbridge and Gusnanto, 2008;Purcell et al., 2007), Pathway Analysis (Jeuken and Kä ll, 2018;Subramanian et al., 2005) and expression quantitative trait loci studies (Doerge and Churchill, 1996;Sul et al., 2015). Monte Carlo-based sampling techniques (Segal et al., 2018) and exact tests that derive full permutation distributions are roughly the two ways to implement the permutation test.
Exact tests are traditionally seen as unattractive for large sample sizes, as the number of permutation grows super-exponentially with the sample size. Nonetheless, Green's dynamic programming algorithm (Green, 1977), which was made explicit by others (Pagano and Tritchler, 1983;Zimmermann, 1985), partially overcome this computational problem. This algorithm is significantly less computationally demanding than the naïve approach. However, the exact test's popularity for larger sample sizes has not attracted much attention in the last couple of decades. We report here on an extension using a Graphics processing unit (GPU) implementation to compute parallelized exact tests and found it superior to the other tested alternatives in terms of speed and accuracy.

Algorithm
Here, we will describe our parallelization of the Green method to calculate exact tests. First, in Section 2.1, we describe the main objective, perform hypothesis testing with an exact test, then a description of the Green algorithm in Section 2.2, and, finally, a description of how to parallelize the algorithm in Section 2.2.

Hypothesis testing
Consider a two-sample hypothesis testing setting for central tendency, i.e. we want to know if the considered response in one group B is generally larger than in another group A. Let

5392
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.  (Fay and Proschan, 2010).
A way to perform the test is to determine how extreme the x A i is under the null hypothesis. Typically, one assumes a particular parametric family of distributions and computes the P value of the test as the probability of observing s obs or a more extreme value in the alternative's direction under the assumption that the null-hypothesis is true. However, it is rarely possible to analytically compute this probability, and one often has to resort to asymptotic approximations. A permutation test approach to the testing problem is to assume instead that the labels A and B are exchangeable under H 0 : Under the null hypothesis the distribution of the test statistic would remain the same for any permutation of x. However, this property would not hold under the alternative H A (Huang et al., 2006). One can thus calculate how frequently samples with sample sums greater or equal than s obs appears when resampling from x.
We can formulate the P value as Prðs obs Sjx; H 0 Þ, where PrðSjx; H 0 Þ is the probability mass function, and S is a random variable denoting the sum's value in the first sample under the permutation distribution. The computationally expensive part is to obtain PrðSÞ which is estimated by concatenating . . . ; x nþm Þ, and draw all possible subsets of length m from x and count the number of occurrences of all possible sums; when the numbers of occurrences of all possible sums are available, the distribution is accessible.
Assume a random variable x Ã ¼ ðx Ã 1 ; . . . ; x Ã m Þ, that is a randomly sampled subset from x with j elements and its corresponding sum is x Ã i , where S 2 ½0; s max . Define N½s; m to be the number of ways we can sample subsets x Ã with m elements in such a way that their sum S ¼ s. Now PrðS ¼ sÞ can be expressed as the fraction of ways that a subset x Ã can be sampled so that its sum ends up to as S ¼ s to the number of ways it can be sampled with any sum Combinatorics gives us that the denominator of the above Equation 1 can be expressed as, The calculation of the numerator is intricate. A naïve algorithm that would exhaustively calculate the sum for each possible subset of size m and compare it to s, for all s, would need Alternatively, the same framework can be used to calculate the mid P value (Routledge, 1994) as, p mid ðs obs Þ ¼ 1 2 Prðs obs Þ þ X smax s¼s obs þ1 PrðsÞ: As mentioned above, there is no closed formula to calculate N½s; m. However, it is possible to develop a dynamic programming algorithm to obtain N½s; m (Pagano and Tritchler, 1983;Zimmermann, 1985), described thoroughly in the next section.

Efficient calculation of N [s, m]: the Green Algorithm
A dynamic programming algorithm for calculating N½s; m was first presented in Green (1977) and more in detail described by others (Gebhard and Schmitz, 1998;Zimmermann, 1985). Here, we will give a walk-through of the algorithm, mostly to describe its parallelization in Section 2.3. We also provide a simple use case of the algorithm in Supplementary Note S1.
We can find a recursive expression for N½s; m by considering a scenario where x Ã is drawn instead of the full set x from a subset x i ¼ fx h g i h¼1 consisting of only the first i features of x. To do so; we first need some definitions. Define N i ½s; j as the number of ways we can sample j elements so that their sum becomes s from a subset x i . If we know how to calculate N i ½s; j, we also know how to calculate We also define M i ½s; j to be the number of ways we can sample x i so that the sample elements sum to s and including the last element x i in the sample.
We can now form a recursion of N i ½s; j as the number of ways to sample x i has to be equal to the number of ways to sample x iÀ1 with the number of ways to sample x i that include x i . So, We can express M i in terms of N iÀ1 by noting that M i ½s; j ¼ N iÀ1 ½s À x i ; j À 1 when x i s, and otherwise zero. We can hence express the recursion 4 as, Let's now turn to the boundary conditions of this recursion. The empty set 1 trivially reach the sum s ¼ 0, thus We cannot sample j from i elements if j > i. Also, N i ½s; j has to be zero when either i 0 (i.e. checking the empty set of x), when s < 0, or when s max < s (i.e. when s is outside the boundary of possible sums). Hence, By combining the base case 7 and 6 with the generic subrecursion 5, the final recursion is, The pseudo-code of the top-down dynamic programming code of the recursion in Equation 8 is given in Supplementary Algorithm S1. The algorithm needs some explanation. Instead of using one extensive array N½0 . . . s max ; 0 . . . ðm þ nÞ; 0 . . . m, two smaller arrays, N old ½0 . . .s max ; 0 . . .m and N new ½0 . . .s max ; 0 . . .m, are swapped and rewritten in each iteration of i in an oscillatory fashion-to save memory, see line 24. It is the complete rewriting of N new in the next iteration that makes this possible (any of the conditions in the recursion relation will re-calculate each entry N new ).
We save quite some memory by keeping N new and N old instead of the full three-dimensional array. The former two arrays require Oðs max Á m þ s max Á mÞ ¼ Oð2 Á x max m Á mÞ ¼ Oðx max Á m 2 Þ, whereas the latter array requires Oðs max Á ðm þ nÞ Á mÞ ¼ Oðx max Á m 3 Þ. Moreover, this improvement in memory usage is a quintessential difference for the parallelized algorithm (GPU's memory storage can easily be a bottleneck).
A second point, notice the structure of the for-loops; they could easily have been arranged in whatever way and still obtain correct computations. However, the dimensions of the two arrays have to be adjusted appropriately related to the outer-most loop. Nevertheless, this specific order has a purpose; it is parallelizabledescribed in the next section.
One final note on Supplementary Algorithm S1, by plain observation on the nested loops, it easy to see that the running time is

Parallelization over two dimensions: s and j
The meaning of parallelizing over two dimensions is, in this case, to fix one of the variables and check whether the two other variables are parallelizable given the third fixed variable. In practice, the fixed variable is the outer-most for-loop, and for each iteration of this variable, then within this loop, everything is calculated in parallel over the two other variables.
Out of the three variables, it is only necessary to find one such variable to fix, and instead of exhaustively checking all possibilities, one can check recursion 9 to see which variable is parallelizable. By comparing the left side to the right side, the only variable that is not dependent on contemporary action of itself is variable i, i.e. N i ½s; j ¼ f ðs; s À x j ; i À 1; j; j À 1Þ, and, furthermore, the other two variables are not possible to fix. Below is a verification that N i ½s; j is parallelizable given that i is fixed.
Initiation: In the lines 6-9 in Supplementary Algorithm S1, constants are set, and initialization of both arrays, N new ½0 . . .s max ; 0 . . .m and N old ½0 . . .s max ; 0 . . .m, occur. There is no conflict for the parallelization of this step.
Maintenance: 1 i ðm þ nÞ þ 1 Consider iteration i. In lines 13-16, the array N new is only dependent on constants (i.e. the boundary conditions 8 and 7). Thus, the computation of N new is parallelizable. Furthermore, in lines 17-20, N new is only dependent on elements from N old and x, which both are invariant for i ¼ 1 (except for N old , that switch values at the end of the iteration i, however, all computations for i are already done). Therefore, N new is parallelizable here too. Finally, at line 24, N old is switched with N new , and no parallelization occurs here. Hence, the algorithm is parallelizable for iteration i.
When entering the next iteration of i, i.e. i i þ 1, the same arguments above apply.
Termination: i ¼ ðm þ nÞ þ 2 When arriving at line 10 in Supplementary Algorithm S1 and i ¼ ðm þ nÞ þ 2, it will not enter the loop. By the maintenance of Algorithm 2.3, one can be sure that the computation of Nð0 . . .s max ; ðm þ nÞ; mÞ is correct. Since there is no computation after the for-loop-block, hence, there are no more modifications on Nð0 . . .s max ; ðm þ nÞ; mÞ, and it is safe to return this array.

Discretization of real numbered samples into integer valued samples
Our test is defined for integer valued distributions of the scores, x. However, approximately we may use the procedure if we first discretize any real value distributed scores, y A and y B , where y ¼ ðy A 1 ; . . . ; y A m ; y B 1 ; . . . ; y B n Þ ¼ ðy 1 ; y 2 ; . . . ; y nþm Þ. This was achieved by partitioning the samples range into n w discretization windows, each of length, l w ¼ jmaxðy A ;y B ÞÀminðy A ;y B Þj nwÀ1 . Each of these windows covers I i ¼ ½minðy A ; y B Þ þ ði À 1 2 Þl w ; minðy A ; y B Þ þ ði þ 1 2 Þl w , for i ¼ 0; . . . ; n w À 1, which enables us to map any continues sample into discrete scores, as x ¼ x 0 ʯy 2 I x 0 , with values in the interval x 2 ½0; n w À 1. Unfortunately, this comes to the cost of discretization errors, which will be a function of the number of discretization windows, n w . We will investigate the effects of such discretization in Section 4.

Compared methods
A couple of methods were used as comparison to our implementation. We implemented t tests through the function scipy.stats.ttes-t_ind, and Mann-Whitney U tests through scipy.stats.mannwhitneyu, both functions in scipy version 1.4.1. We downloaded the FastPerm method from the master branch of https://github.com/bdsegal/ fastPerm. When executing FastPerm we applied the default parameter-tuning as described in the implementation guide, available at the method's GitHub repository. We implemented the r-package Coin (https://CRAN.R-project.org/package¼coin) version of the shift-method (exact test) as part of our python package and used it as a benchmark.

System
Performance figures were recorded on a 8 core Intel i7-9700K with an NVIDIA GeForce RTX 2070 graphics card. Some of the experiments we compared this configuration's performance to similar computers equipped with NVIDIA GeForce RTX 2060 and one with a NVIDIA Titan X Pascal graphics card.

Implementation
A python 3.6 implementation implementing the below algorithms was made available under an Apache 2.0 license. We implemented our algorithm together with our discretization strategy as a CUDA (Chakrabarti et al., 2012) enabled Python module. The implementation and all results of this paper are available in reproducible form from a GitHub repository https://github.com/statisticalbiotechnol ogy/parallelPermutationTest.

Results
We implemented the algorithm described above, and set out to characterize the algorithm's performance. To establish that the strategy executes in a practically useful time scale, we first tested the running time performance as a function of the number of discretization windows and its dependence on sample size. Subsequently, we tested the accuracy of our discretization strategy to establish that it is asymptotically unbiased and precise compared to other methods. Finally, we applied the method to a relatively large-scale proteomics dataset to establish the methods' usefulness in a practical test scenario.
For some of the tests, we were able to benchmark our GPU implementation of the Green algorithm (named Green Cuda in the following text) against other methods. Here we implemented a singlethread version of the Green algorithm (Green Singlethread) on the CPU and a multi-thread version of Green algorithm (Green Multithread) on the CPU and downloaded a previously described Monte Carlo-based sampling method called the fast permutation method, FastPerm (Segal et al., 2018), the shift algorithm's implementation in the popular r-package Coin (Hothorn et al., 2006;Hothorn et al., 2008) for exact permutation test (here named Shift Coin), and used Python scipy's implementation of the t test and Mann-Whitney U test.

Test of running time requirements
4.1.1 Running time as a function of sample size, n We first tested the running time requirements of the testes methods as a function of sample size. Here, we selected uniformly distributed samples, y A $ Uð0; nÞ and y B $ Uð0; nÞ, and drew samples of size n ¼ jy A j ¼ jy B j 2 ½50;100; 200; 250;300;350;400;450; 500 ¼ n. For each n 2 n, using five samples replicates. The average time to calculate the corresponding P values was recorded and plotted as a function of n (Fig. 1). For the tested sample sizes, Green Cuda scales well with sample size, and for all sample sizes, Green Cuda was found between 15 and 50 times faster than FastPerm, see Figure 1b at n ¼300 and n¼100, respectively. However, they seem to reach a similar running time around n ¼500.
The Green Singlethread should have a running time that expands as Oðn w n 3 Þ ¼ Oðn 3 Þ as n w was held constant in the experiment. This corresponds well with our observations in Figure 1.
Some Monte Carlo-based approaches are, unlike e.g. FastPerm, leaving the choice of the number of sampled permutations to their users. The number of permutations is inversely proportional to the granularity of the P values estimated by the MC-sampler, and hence governs their possible precision, at least when no other techniques such as importance sampling is used. We hence measure the running time for different amounts of permutation samples with the MCsampling functionality of the package Coin, putting it in the context of the running time of Green Cuda (Fig. 1c). The two methods seem to have similar running times for about 10 5 -10 6 permutation samples, tentatively suggesting that the Green Cuda will be the faster method of the two methods, whenever a precision of the estimated P values is desired to be better than 10 À5 -10 À6 .
Furthermore, to see how the running time depends on the hardware, the test was repeated on other GPUs. However, we found a relatively small difference in performance between the tested graphic cards (see Supplementary Fig. S1).

Running time as a function of the number discretization windows, n w
We also wanted to establish that our implementation scales well with the number of discretization windows used for the test. Again, we sampled from y A $ Uð0; 500Þ and y B $ Uð0; 500Þ with sample size n ¼ m ¼ 500, with five replicates. We plotted the running time as a function of n w 2 n w ¼ ½50; 100; 200; 250; 300; 350; 400; 450; 500 in Figure 2.
The single-threaded implementation of the Green algorithm (Green singlethread) should theoretically have a running time complexity Oðn w n 3 Þ. As sample size n is kept constant in the experiment, we expect the running time to expend as Oðn w Þ. Indeed, Figure 2 confirms this for Green singlethread.

Test of memory allocation
We characterized the memory requirements of Green Cuda by increasing sample size n and m, and a growing number of bins n w . The first experiment (Fig. 3a), varied the set sizes jy A j ¼ jy B j ¼ m ¼ n 2 n ¼ ½10; 20; 30; . . . ; 480; 490; 500, for three different bin sizes n w of 64, 128 and 256. In the second experiment (Fig. 3b), the number of bins was the variable n w 2 n w ¼ ½10; 12; 14; . . . ; 396; 398; 400 for three different set sizes jy A j ¼ jy B j ¼ n ¼ m of 125, 250 and 500. In both experiments, the data was sampled from y A $ N ð0; 1Þ and y B $ N ð0; 1Þ and the number replicates was 1000.
The amount of data that can be handled by a GPU at each point in time is limited by the GPU's memory size. Here we used an NVIDIA GeForce RTX 2070, which allows for 7982 MiB memory allocation. One could imagine settings where s max is so large that one cannot calculate N even for one data point. However, we would Fig. 1. Running time requirements of the compared algorithms. We plotted the time required to calculate P values for samples from y A $ Uð0; nÞ and y B $ Uð0; nÞ for different sample sizes n. The mean time and the 95% confidence interval around the mean time to calculate each of 5 replicate samplings was plotted in (a) linear and (b) log scale. It should be noted that the execution times where highly reproducible, and it might be hard to see the confidence interval in the plots. (c) We further investigated the running time as a function of the number of Monte Carlo samples for a MC-based approach. We compared running time for a Monte Carlo sampled to draw samples of jy A i j ¼ jy B i j ¼ m ¼ n 2 ½200; 300; 400, with five replicates. The horizontal lines represent the running time for Green Cuda to compute the same samples Fig. 2. Running time as a function of the number of discretization windows. We plotted the mean of the mean and standard deviation of the required running time, as wall time, as a function of the number of discretization windows, n w . Time was plotted (a) in normal scale, and (b) log-scale. Note that the similarity in execution speed makes it hard to separate the series for Green Multithread and Coin Shift. Also note that the other methods were excluded from the plot, as the discretization step is exclusively present in the Green algorithm instead run into floating-point problems before reaching this type of memory problem for such cases. For large set-sizes, 1000 < jy A j þ jy B j, the sums in the entries in N start go beyond the maximum double-point precision in CUDA i.e. %1:79e 308 . A future improvement of our algorithm could be to reduce the values in each iteration by dividing all elements in N i ½s; j by a normalizing factor in each iteration over i, which would help us reduce the problems of N becoming too large.

Test of accuracy and precision
4.3.1 Test dependency of window size While the Green Algorithm is a non-parametric test for discrete test statistics, the algorithm's performance will be a function of the number of windows, n w , we use for discretizing any continuous data. We hence wanted to characterize the influence of n w on the accuracy of our test. We selected samples from a Normal distribution and compared the computed P values with the ones from a regular t test ( Supplementary Fig. S2). The results suggest that both the accuracy and precision of the test improves when increasing the number of windows. However, the effect seems to saturate for n w > 30.

Comparison to other method's accuracy and precision against comparative methods
We subsequently wanted to test the accuracy of the estimated P values. Again we used Normal distributed samples and used t testcalculated P values as reference. As benchmark comparisons, we again used the FastPerm method and Python's asymptotic Normal distribution implementation of the Mann-Whitney U test, scipy.stats.mannwhitneyu. We plotted the ratio, pÃ pt , where p Ã is the tested P value and p t is given by a t test, as a function of sample size for two different effect sizes (Fig. 4). Overall, the pÃ pt of the Green Cuda were found closer to 1, and less dependent on the sample size than the ones from the compared methods. The reason for the Mann-Whitney U test deviating from the results of the t test, particularly in Figure 4b, is that the test has comparatively low efficiency when testing on normal distributed data. It is also important to note that the Mann-Whitney U test, which depends on ranking statistics, would not be under the same null model as the compared methods in the presence of ties. However, we expect ties to be rare when sampling from continuous distributions.

Calibration test
We also tested the parallelized shift method's P values uniformity under the null hypothesis (Murdoch et al., 2008). This property is of particular importance for studies where we test many variables for the same sample, as it is the base for efficient multiple hypothesis testing (Efron, 2012;Storey and Tibshirani, 2003). Here, we compared Green Cuda against the FastPerm method, Mann-Whitney U test and a regular t test. We picked 10000 samples from the Normal distribution and a log-Normal distribution and plotted each method has estimated P values as a function of their relative rank ( Supplementary Fig. S3). We found that the calibration of the parallel Green method was on par with the test. However, unsurprisingly the calibration seems to be entirely off for the t tests on log-Normal distributed data. We see that the Mann-Whitney U test's calibration appears conservative, while the FastPerm method appears anticonservative for both tested distributions.

Running time requirements for a proteomics dataset
As the last test, we tested the algorithm's performance on a dataset of breast cancer samples from the CPTAC consortium (NCI CPTAC, 2016). For the 8051 proteins for which measurements had been obtained for all samples, we tested differential abundance between 80 non-triple-negative and 26 triple-negative samples. The run-time for Green Cuda method can be found in Table 1. For the other methods: FastPerm took 45 min, 1.5 s for a Mann-Whitney U test and 1.32 s for a t test.   3. Memory allocation as a function of set size and bin size for Green Cuda. We plotted the required memory allocation to calculate P values for samples from y A $ N ð0; 1Þ and y B $ N ð0; 1Þ for different sample sizes n and bin sizes nw. In (a) the n w used were 64, 128 and 512 with 5 replicates, and in (b) n were 125, 250 and 500 with 5 replicates

Discussion
Statistical testing is the base for most scientific activities. Also, in most research areas, the amount of public data is rapidly increasing, and hence there is a need for ever more efficient methods to compute significance. Permutation tests offer an exciting method as they do not assume a particular sampling distribution, but instead, build one by permuting label associations to the observed data. This approach corresponds perfectly with the null hypothesis that there is no difference in case and control outcomes. Here, we have described a parallelized dynamic programming method to perform permutation tests. We have demonstrated that it is faster and more accurate than the sampling-based methods. Previous work by Pagano and Trichler (1983) demonstrates that one can quickly expand exact tests to handle missing values, something that rank-based not easily can handle. We note that several studies are dependent on normal approximations of non-parametric tests such as the Mann-Whitney U test. In practice, the implementation of such tests is approximations as they are asymptotic and not exact. The Green Cuda method offers an exact test that does not appear much slower but more accurate than such tests. However, admittedly, the difference between the outcomes of asymptotic and permutation tests gets smaller with an increased sample size.
Permutation tests have been successfully used in many, if not most areas of bioinformatics as a relatively assumption-free method for assessing statistical significance in inferences. In most applications the procedures involve some flavor of Monte Carlo-based sampling methodology. Here we demonstrated that for at least in the case of statistical hypothesis testing on can instead rely on calculating a full distribution of the sampling space.