Computing Permanents for Boson Sampling on Tianhe-2 Supercomputer

Boson sampling, a specific quantum computation problem, is widely regarded to be one of the most achievable fields in which quantum machine will outperform the most powerful classical computer in the near term, although up to now no upper-bound of how fast the classical computers can compute matrix permanents, core problem of Boson sampling, has been reported. Here we test the computing of the matrix permanent on Tianhe-2, a supercomputer retaining its position as the world's No. 1 system for six times since June 2013. We arrived at the time (about 77.41~112.44 minutes) to compute the permanent of a $50\times 50$ matrix in an acceptable precision. In addition, we have found that Ryser's algorithm will produce an unacceptable error with the increase of problem scale, compared to Balasubramanian-Bax/Franklin-Glynn's algorithm in the same complexity. The precision issue suggests carefully check in future research of Boson sampling, and comprehensive comparison between quantum computer and classical computer.


I. INTRODUCTION
Universal quantum computers promise to substantially outperform classical computers [1, 2], however, they have been experiencing challenges to build in practice, due to stringent requirement of high-fidelity quantum gate and scalability. For example, Shor's algorithm [3], solving integer factorization problem, is one of the most attractive quantum algorithms because of its potential to crack current mainstream RSA cryptosystem. An universal quantum computer requires 1025 qubits to factor a 1024-bit number (∼ 10 307 ) [4, 5], a key size thought to be crackable on classical computers [6]. However, the largest number factored using Shor's algorithm is 21 untill now [4,[7][8][9][10]. This motivates the research of specific quantum computation with quantum speedup and more favorable experimental conditions. Boson sampling [11] is a specific quantum computation thought to be an outstanding candidate for beating the most powerful classical computer in the near term. It samples the distribution of Bosons coming from a complex interference network. Unlike universal quantum computation, quantum Boson sampling seems to be more straightforward, since it only requires identical Bosons, linear evolution and measurement. As for classical computers, the distribution can be obtained by computing permanents of matrices derived from the uni- * Electronic address: junjiewu@nudt.edu.cn † Electronic address: xianmin.jin@sjtu.edu.cn ‡ Electronic address: xjyang@nudt.edu.cn tary transformation matrix of the network [12]. However, computing the permanent has been proved as a #P-hard task with classical computers. The best-known classical algorithm scales exponentially with the size of a matrix. These motivate a lot of advances in building larger quantum Boson sampling machine to outperform classical computers, including principle-of-proof experiments [13][14][15][16], simplified models easy to implement [17][18][19], implementation techniques [20][21][22][23], robustness of Boson sampling [24][25][26][27][28], validation of a large scale Boson sampling [29][30][31], varied models for other quantum states [32-34], etc.
However, what is the upper-bound of computing the permanent on classical computer? Solving Boson sampling problem with N Bosons on classical computers depends on computing permanents of N × N matrices. Therefore, the upper-bound is strongly related to the condition on which a quantum Boson sampling machine will surpass classical computers.
Here we present the test results of computing permanents on Tianhe-2 [35] (FIG. 1), the supercomputer which has retained its position as the world's No.1 in Top500 supercomputer lists since June 2013. The performance of Tianhe-2 is comparable with that of several millions of typical personal computers working simultaneously. We developed, optimized and tested parallel programs of two best-known algorithms, Ryser's algorithm and Balasubramanian-Bax/Franklin-Glynn's (BB/FG's) algorithm. In this paper, we report the speed performance of Ryser's algorithm and BB/FG's algorithm in real tests on Tianhe-2 supercomputer. We then show the predicted performance of computing the permanent of a 50 × 50 matrix on the full system of Tianhe-2 supercom-

A B
FIG. 1: A schematic view of computational task on permanent with Tianhe-2 supercomputer (A) and quantum Boson sampling machine (B). The computational power on calculating permanent of matrix A can be directly tested with different algorithms on Tianhe-2. The output probabilities P(S-T) can be obtained in quantum Boson sampling machine given a input state S, unitary operation matrix U and output state T, which is equivalent to compute the permanent of matrix U. The capacity of computing permanent on Tianhe-2 is employed to benchmark the state of arts and set an upper bound to be beaten by quantum Boson sampling machine.
puter. In addition, we show that Ryser's algorithm has an unacceptable accumulating rounding error from the limited word length of classical computers, compared to BB/FG's algorithm. This implies the result of Ryser's algorithm for large-size matrices may be untrustable.

II. RESULTS
Ryser's algorithm is shown in Equation (1) and BB/FG's algorithm in Equation (2). In the equations, P erm(A) denotes the permanent of the matrix A. a ij is the element of A in the ith row and jth column. N is the matrix size. δ = {δ 1 , δ 2 , ..., δ N }, where δ 1 = 1 and δ i ∈ {−1, 1} for 2 ≤ i ≤ N .
All programs we implement in this paper are for complex matrices. We can use them to test real matrices. The execution time is irrelevant to the matrix elements.

A. Speed performance
To test the upper-bound speed of computing permanents classically, we tried exploiting Tianhe-2's computing resources as many as possible (See Section IV and Appendix A). Tianhe-2 has 16,000 compute nodes, and each comprises two Intel Ivy Bridge Xeon E5 processors (CPUs) and three Xeon Phi coprocessors (denoted as MICs, because they adopt Intel Many Integrated Core architecture). Ryser's algorithm was tested with only CPUs, while BB/FG's algrithm was tested under two types of configurations: with only CPUs or with CPUs and MICs heterogeneously. We use real tests on smallscale subsystems (with ≤ 256 compute nodes) to evaluate and optimize our programs. Note that even a small-scale subsystem with 256 nodes still has a predicted LINPACK performance of 541.8 TeraFlops and a theoretical peak performance of 848.4 TeraFlops that may be ranked in the top 160 in Top500's list in Nov. 2015.
The key of gaining speedup from a parallel computer system is to implement the scalable parallel program, which can get more and more performance improvement with the increase of the computing resources. To evaluate the scalability of our programs, we tested them with different combinations of problem scale and parallel scale. According to the complexity (O(2 N N 2 ), N is the matrix size), the number of operations almost grows double with the increase of N . Execution time grows by 1.95 times in FIG. 2(A), meaning that the additional cost of larger matrix is very small with the increase of problem scale. Meanwhile, execution time decreases with nearly linear speedup (slopes of 0.54 and 0.59 in FIG. 2(B)) when the parallel scale increases. Ideal slope is 0.5, meaning no additional cost. These suggest our programs have execellent scalability with only small and fixed cost from their serial parts.
The good scalability of our programs makes large-scale tests possible. We tested Ryser algorithms with the number of nodes ranging from 2,048 to 13,000. We found it became difficult for the large-scale system to complete execution of programs with long run time, because the system reliability became worse with the increase of processing units [36]. We also found some slow nodes prolonged the total execution time of programs sometimes. Take three tests on 46 × 46 matrices in  Execution with varying matrix sizes and fixed system scale (a 256node subsystem of Tianhe-2). "Ryser@CPU" denotes results of Ryser's algorithm running with only CPUs (MICs are not used); "BB/FG@CPU" denotes results of BB/FG's algorithm running with only CPUs; "BB/FG@Hybrid" denotes results of BB/FG's algorithm running with both CPUs and MICs. The slope of the fit-line for BB/FG@Hybrid means that execution times grow by 10 0.29 = 1.95 times with the increase of matrix sizes. (B) Execution with varying system scale and fixed matrix size (a 36 × 36 matrix). Execution times of Ryser@CPU (similar with BB/FG@CPU) decrease by 10 −0.27 = 0.54 times when number of compute nodes double, and those of BB/FG@Hybrid decrease by 10 −0.23 = 0.59 times.
ample. When the number of compute nodes grows from 4,096 to 8,192, execution time decreases to nearly the half. However, when the number of nodes grows from 8,192 to 13,000, execution time decreases a little, which exposes the existence of slow nodes. The 8,192 compute nodes has accounted for more than half nodes of Tianhe-2, and the 13,000 nodes 81.25%. These results show the

B. Precision Performance
The precision issue comes from accumulated rounding errors introduced by limited word length of classical computers. We implemented Ryser's algorithm and BB/FG's algorithm with double-precision floating-point format (64 bits) of IEEE 754 standard. The total precision of a double-precision floating-point number is 53 bits, approximately 16 decimal digits (53 log 10 2 ≈ 15.96), because the fraction significand has 52 bits with an implicit bit in most cases. If no intermediate results in fitting is by the Curve Fitting Tool from MATLAB. The red points are the real tested data on the 256-node subsystem of Tianhe-2, and the black point is the predicted time used for the full system of Tianhe-2 (16,000 compute nodes: 32,000 CPUs and 48,000 MICs) to compute the permanent of a 50 × 50 matrix. The goodness of fitting is described by the following parameters: SSE = 284.2 (Sum of Squares for Error), which measures the deviation between the curve and the tested data; R square = 0.994, which indicates it to be a good fit when its value is close to 1, so is Adjusted R square, whose value is 0.9938 in our fitting, and RM SE = 2.665, which is similar to SSE.
the computational process are much larger than the final result, the accummulated errors will be marginal. Unfortunately, Ryser's algorithm may produce numbers, extremely larger than the final result. Thus, the errors of Ryser's algorithm may not be omitted sometimes. However, there is a division operation in the last step of BB/FG's algorithm as shown in Equation (2), which narrows the total error effciently. Therefore, the errors of BB/FG's algorithm is small. That means, BB/FG's algorithm is more trustable than Ryser's in terms of precision. The accumulated rounding errors vary with different matrices. Here, we use a special kind of matrices to reveal the issue. Because the modulus of any element (a complex number) in a matrix for Boson sampling problem does not exceed 1, calculations of all-one matrices' (each element is 1) permanents in a special order produce the largest number possible (detailed in Appendix B 2). Namely, the errors for all-one matrix calculations give the upper bound of errors. Meanwhile, the permanent of an N × N all-one matrix is the factorial of N in theory, allowing us to verificate easily.
To evaluate errors of Ryser's algorithm and BB/FG's algorithm, we computed absolute errors and relative error rates of permanents of all-one matrices with different matrix sizes. We found the relative error rates of Ryser's algorithm was beyond 100% with ≥ 32 matrix sizes except small vibrations for 33 × 33 and 34 × 34 matrices as shown in FIG. 4(B). In FIG. 4, both errors of Ryser's algorithm and BB/FG's algorithm grow exponetially with the increase of matrix sizes according to the fitting lines, since the vertical axes used logarithmic scale. In FIG. 4(A), the absolute errors of Ryser's algorithm are ×10 7 ∼ ×10 9 larger than that of BB/FG's algorithm, leading to the unacceptable relative error rates of Ryser's algorithm and the accurate one of BB/FG's algorithm (with the final division in Equation (2)). In FIG. 4(B), the fit-line predicts the relative error rate of BB/FG's algorithm for a 50×50 all-one matrix was about 0.0006%, and error rates does not beyond 10% when the dimensions are below 60 × 60. This shows that the precision issue overburdens Ryser's algorithm with large matrix size, and the smaller intermediate result and the last division saves BB/FG's algorithm. Our results suggest future research should use BB/FG's algorithm as a classical rival of quantum Boson sampling instead of Ryser's algorithm most considered before.
In practice, matrices are random leading to hard evaluation of their theoretical permanents. Fortunately, we can use a double check with both Ryser's algorithm and BB/FG's algorithm to mutually verify computation results. To examine the precision issue in more realistic situations, we generated three sets of random matrices: matrices corresponding to those each input (output) port of an interference network injects (ejects) one and only one Boson, matrices randomly derived from a 100 × 100 matrix, and matrices specially chosen to reveal errors more. In FIG. 5, we found the errors of random unitary matrices and derived matrices randomly chosen were marginal, but that of derived matrix specially chosen were still very large. The growing speed of errors of derived matrices specially chosen is as large as that of all-one matrices. This suggests a double check with both two algorithms is essential for future research to verify results produced by classical computers. Besides, the precision issue implicates quantum computation may outperforms classical computation not just in speed, but also in precision sometimes.

III. DISCUSSION
Tests of computing permanents on the most powerful classical computer is important for understanding the condition on which a quantum Boson sampling machine can surpass classical computation. Here we have shown tests both in speed and in precision on Tianhe-2 supercomputer, the fastest computer in the world now. Our tests show the upper-bound of computing the permanent using BB/FG's algorithm on Tianhe-2 supercomputer, and the precision issue of Ryser's algorithm, little known before. Limitted by testing conditions, the heterogeneous tests of the full system were not conducted. Our results still provide an important reference for experimental Boson sampling, although permanent calculation is not the complete simulation of Boson sampling. Because matrix permanents may approach 0, it is hard for us to give an upper-bound of relative error rates. We only gave the upper bound of absolute error above. Overall, our results provide a reference from which we can know how large quantum Boson sampling machine can surpass current classical computational capacity and implies quantum computation outperforms classical computation not just in speed, but also in precision sometimes.

IV. METHODS
We developed the multi-level parallelism of the program based on the architecture of Tianhe-2 system. We parallelized the program at the level of the outer summation ( S⊆{1,2,...,N } of Ryser's algorithm or δ of BB/FG's algorithm). The workload is distributed to the massive processes using MPI, and further distributed to the multiple threads within each process through OpenMP parallel library. We also implemented the heterogenous parallelism of the program: within one node, the workload was first divided into two parts according to a prefixed ratio, and distributed one to the process on MICs running in offload mode with the rest computed by the threads on CPUs. We optimized the programs through various means. The workload was balanced statically, and the part of program that running on MIC was rewrit- where T represents the execution time, N is the size of the matrix and n is the number of nodes used for the simulation.

V. ACKNOWLEDGEMENTS
We gratefully acknowledge the help from the National Supercomputer Center in Guangzhou. We also appreciate the helpful discussion with Xun Yi, Xuan Zhu, Jiangfang Ding, Hongjuan He, Yingwen Liu, Dongyang Wang and Shichuan Xue. This work was supported by the National Natural Science Foundation of China (NSFC) No. 61221491, and the Open Fund from HPCL No.201401-01.

FIG. 5:
Errors of computing permanents of random matrices. P erm Ryser (P erm BB/FG ) denotes the value of permanent caculated through Ryser's (BB/FG's) algorithm. "Random unitary matrix" denotes results of matrices corresponding to that each input (output) port of an interference network injects (ejects) one and only one Boson. "Random-derived matrix" denotes results of matrices randomly derived from a 100 × 100 matrix, and "Special-derived matrix" denotes results of matrices specially chosen to reveal errors more. Two algorithms give very similar results for two former matrices. However, the special-derived matrices are still overburdened by accumulated rounding errors. The relative difference between two algorithms achieves 13.9% for the 30 × 30 special-derived matrix tested.
(1994).  We paralleled and optimized the programs of Ryser's algorithm and BB/FG's algorithm on Tianhe-2 supercomputer [1]. The architecture of Tianhe-2 is shown in FIG. A1 and FIG. A2. The performance parameters of Tianhe-2 is listed in TABLE AI.

Utilizing Multiple CPUs (and CPU cores)
We implemented the parallelism of these two algorithms in Equation (1) and Equation (2) based on MPI (for utilizing multiple CPUs) and OpenMP (for utilizing multiple CPU cores) libraries. The parallelized Ryser's algorithm is shown in Algorithm 1. The initialization set ID for each process, after which the main process generate the matrix and broadcast the data to other processes through MPI library. The length of the original loop is divided into several segments according to the number of processes. Each process takes different segments as the loop section. The loop would be executed using OpenMP libraries by multi-threads. Processes totally have no data dependency except the final ALLREDU CE operation in which the main process gathers data from all the other process. Except this data gathering, no synchronization is needed. The parallelized BB/FG's algorithm is similar except the iteration body, the total length of the main loop (reduced to 2 N −1 ) and the final division operation, as shown in Algorithm 2.

Load Balance
We optimized the parallelized Ryser's algorithm through load balance. In Algorithm 1, within the calculation of j∈S a ij , the times of addition is related to Broadcast data to all other process; 5: end if 6: length = 2 N −1 # of processes 7: start = rank * length 8: end = start + length 9: for iter = start to end do 10: Convert iter to the set S;

11:
Calculate j∈S aij for each row;

12:
Multiply the sum of each row into a product: |S|. This leads an imbalance loads. Therefore, we designed a optimizing algorithm with load balance, as Algorithm 3 shows. The variable temp is just the N-bit 1's complement of iter. However, BB/FG's algorithm is free from the imbalance problem because each iteration in the loop of Algorithm 2 always performs N times additions/substractions.

Utilizing both CPUs and MICs
We implemented the heterogeneous parallelism of BB/FG's algorithm among CPUs and MICs with two steps. The first step is to rewrite BB/FG's algorithm with SIMD instructions to take advantages of the VPU(Vector Process Unit) in MIC as shown in Algorithm 4, which makes full use of MICs. Data are loaded into vector registers before calculation, the variables with prefix " " represents the vector variables, which would be stored in the vector registers in the VPU. The second step is to implement the heterogeneous parallelism between CPUs and MICs as shown in Algorithm 5. Broadcast data to all other process; 5: end if 6: length = 2 (N −1) −1 # of processes 7: start = rank * length 8: end = start + length 9: for iter = 0 to length do 10: Convert iter to the set δ, δi is 1 when the i th bit in iter is 1, -1 otherwise;

12:
Calculate delta = N j=1 sj; 13: if The number of -1 in δ is odd then j∈S aij corresponding to temp; 9: end for respectively. Some of data in these tables are used in FIG. 2 and FIG. 3.
Some screenshots and a photograph are shown here. FIG. A3 and FIG. A4 are screenshots in which Tianhe-2 was computing the permanent of a 45 × 45 matrix. FIG. A5 shows an error case in which some node failed during the execution. Different order of addition would produce different intermediate results, whose rounding error are also differ- for ity = 1 to N do P erm = P erm 2 N −1 ; 11: end if ent. Thus the order of addition in Ryser's algorithm could affect the precision badly. We tested some different order of addition, and list the errors. The result is shown in TABLE AVIII. Taking N = 3 as the example, the compute length is 8. In original order, the order of summation is: 0 − 1 + 2 − 3 + 4 − 5 + 6 − 7, where the numbers denotes the order of loop represented by variable iter in Algorithm 1. In random order, the summation order is decided randomly; in merge, the order of summation is: (((0 − 1) + (2 − 3)) + ((4 − 5) + (6 − 7))), thus the neighbouring numbers were summed up first, which is much like a tree structure. in the separate operation, the order of summation is (0 + 2 + 4 + 6) − (1 + 3 + 5 + 7) .  TABLE AIX, TABLE AX and TABLE AXI show the  detailed real test errors for

Precision Performance
where N stands for the size of matrix, n is the number of compute nodes, and a together with b are the fitting coefficients. The part of N 2 2 N comes from the complexity of BB/FG's algorithm, O (N 2 2 N ). In the parallelized algorithm, computing tasks are allocated to multiple compute nodes. Therefore, we add the number of compute nodes in the equation. We conducted the curve fitting with MATLAB and the fitting result is shown in TA-BLE AXII.     log", where "yhrun" is the command to submit the task, "-p all" means the program would be executed on all the partition of Tianhe-2, "-n 13000 -N 13000" means we would start 13,000 processes on 13,000 compute nodes, "-x cn[4068...6604]" defines the nodes that we won't use for this computation and "./BosonSampling e 45" shows the input is a 45 × 45-matrix with diagonal elements to be 1 + i and other elements to be 0. The output would be recorded in file "/dev/shm/BSTest n13000 N45 e T1.log". The line after "Result:" gives the computing result, which equals to the theoretical value (1 + i) 45 , and the last line gives the execution time.  A4: Screenshot of the node list of the 13,000 nodes used. The command line is "yhq -n BosonSampling", which output the job with the executive named "BosonSampling" in the job sequence, the partition where this job was allocated and above all, the node list of this job.  We did the precision test by computing the permanent of an N × N all-r real matrix, in which each element has a real value of r ("Value of Elements"). In this case, the ideal permanent is r N · N ! ("Theor. Value"), where N is the size of the matrix ("Size of Matrix"). "Comput. Value" is the result produced by BB/FG's algorithm. The "Abs. Error" is the absolute difference between the computing value and the theoretical value, and "Relative Error Rate" is the relative error rate.  We did the precision test by computing the permanent of an N × N all-r complex matrix, in which each element has a complex value of r ("Value of Elements"). In this case, the ideal permanent is r N · N ! ("Theoretical Value"), where N is the size of the matrix ("Size of Matrix"). "Computing Value" is the result produced by BB/FG's algorithm, which contains two parts: the real part, denoted by "Re. Part" and the imaginary part, denoted by "Im. Part". The "Abs. Error" is calculated following Error Abs. = (Re computing − Re Theoretical ) 2 + (Im computing − Im Theoretical ) 2 , and the relative error rate is 1 + i -9.75E+46 -1.13E+35 -9.75E+46 0.00E+00 5.58E+38 5.72E-09 17 10 + 10i 9.11E+33 9.11E+33 9.11E+33 9.11E+33 2.40E+20 1.86E-14 18