Efficient computation of permanents, with applications to boson sampling and random matrices

In order to find the outcome probabilities of quantum mechanical systems, such as the optical networks underlying Boson sampling, it is necessary to compute the permanents of unitary matrices, a computationally hard task. Here we first discuss how to compute the permanent efficiently on a parallel computer, followed by algorithms which provide an exponential speed-up for sparse matrices and linear run times for matrices of limited bandwidth. The parallel algorithm has been implemented in a freely available software package, also available in an efficient serial version. As part of the timing runs for this package we set a new world record for the matrix order on which a permanent has been computed. Next we perform a simulation study of several conjectures regarding the distribution of the permanent for random matrices. Here we focus on the permanent anti-concentration conjecture, which has been used to find the classical computational complexity of Boson sampling. We find good agreement with the basic versions of these conjectures, and based on our data we propose refined versions of some of them. For small systems we also find noticeable deviations from a proposed strengthening of a bound for the number of photons in a Boson sampling system.


I. INTRODUCTION
One of the major tasks for experimental quantum physics today is to implement and verify the performance of a universal quantum computer. Under common complexity-theoretic assumptions such a machine is expected to be able to solve certain problems far more efficiently than any classical computer. However, the experimental task is formidable, and recently an intermediate step known as Boson sampling [1] has become a focus for both theoretical and experimental work. The output probabilities of a Boson sampling process are given by permanents of submatrices of the unitary matrix describing the system. The permanent of an n × n matrix A = (a_{i,j}) is defined as

per(A) = \sum_{\sigma} \prod_{i=1}^{n} a_{i,\sigma(i)},   (1)

where the sum is taken over all n! permutations \sigma of N = {1, 2, . . . , n}. It differs from the determinant only by ignoring the sign of the permutation:

det(A) = \sum_{\sigma} sgn(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}.   (2)

While the determinant det(A) can be computed in polynomial time, computation of per(A) is in fact #P-hard, even for 01-matrices [2]. So the expectation [1] is that Boson sampling can be efficiently implemented as a physical quantum system, while computing the output probabilities will be hard for a classical algorithm, all without leading to a universal quantum computer.

In order to verify that Boson sampling experiments give results which agree with the theoretical prediction we need to compute output probabilities for as large systems as possible. This can be done either exactly [3] or approximately through simulation [4], where the simulation algorithms still rely on exact computation of some permanents as one step. Our aim has been to implement an efficient, freely available software package, in both parallel and serial versions, for computation of permanents, which can either be used for direct computation of Boson sampling output probabilities or to speed up the existing sampling programs. The resulting software package is freely available [5].

* per.hakan.lundow@math.umu.se
† klas.markstrom@math.umu.se
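As a concrete reference point, the definition of the permanent translates directly into code. The following is a minimal, intentionally brute-force Python sketch (the function name is ours), feasible only for small n:

```python
# Permanent straight from the definition: sum over all n! permutations
# of the product of entries a[i][sigma(i)]. This is O(n * n!) work, so
# it serves only as a reference implementation for small matrices.
from itertools import permutations

def permanent_naive(a):
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += prod
    return total

# Sanity checks: per of a 2x2 matrix is a*d + b*c, and per(J_n) = n!
# for the all-ones matrix J_n.
print(permanent_naive([[1, 2], [3, 4]]))             # 1*4 + 2*3 = 10
print(permanent_naive([[1] * 3 for _ in range(3)]))  # 3! = 6
```

Comparing against such a reference implementation is a useful check on any of the faster algorithms discussed below.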
On a parallel cluster a program based on this package is able to compute the permanent of a 54 × 54 matrix, whereas the previous record from [3] was order 48, and to do so in less core time than the record computation used. Our package can be compiled for both ordinary desktop machines and supercomputers.
The hardness argument in [1] is based on some conjectures regarding the distribution of the permanent for certain random matrices. The behavior of random permanents has also become a focus for some of the scepticism regarding quantum computing [6]. Using our package we have performed a large-scale simulation study of the permanent distribution for several different families of random matrices, and the second part of our paper discusses this in relation to the conjectures from [1], as well as the mathematical results from [7] and [8]. In short, we find support for some of the mentioned conjectures and propose some modifications based on the sampling data. We also find that one conjecture from Ref. [1], which relates the number of Bosons to the number of modes in a Boson sampling system, does not agree well with data for small matrices. This might change for larger sizes, but it nonetheless means that even more care must be taken in the analysis of small Boson sampling experiments.

II. ALGORITHMS, THE PROGRAM LIBRARY AND ITS PERFORMANCE
We will here discuss how to speed up the computation of per(A), both for general matrices and for matrices with some kind of additional structure. Following this we will look at the performance of our implementation of these algorithms, both in terms of speed and precision.

A. A parallel algorithm for permanents of general matrices
As it is formulated in Eq. (1) it would take n · n! operations to compute the permanent. However, it was shown by Ryser [9] that it can be reformulated as

per(A) = (−1)^n \sum_{J \subseteq N} (−1)^{|J|} \prod_{i=1}^{n} \sum_{j \in J} a_{i,j},   (3)

where N = {1, 2, . . . , n}. This reduces the number of operations to about n^2 2^n. A further improvement is to use Gray-code ordering of the sets J in Eq. (3), as noted in Ref. [10]; finding the next set in this order takes on average 2 steps, and only one column enters or leaves J at each step. Ref. [10] also shows how to halve the number of steps by summing only over subsets of {1, 2, . . . , n − 1}. The number of operations is then reduced to the currently best n 2^n. This improvement on Eq.
(3) means that the improved method can compute a permanent for n = 50 slightly faster than the method in Eq. (3) can for n = 45. In some applications, like Boson sampling, matrices may have repeated rows or columns. With this in mind, let us note that repeated columns (or rows) allow us to speed up the calculation of the permanent without changing the basic form of Ryser's formula. Assume that the distinct columns of A are \bar{c}_1, . . . , \bar{c}_R and that A has m_i columns equal to \bar{c}_i. Let Ω denote the set of all integer vectors k = (k_1, . . . , k_R) with 0 ≤ k_i ≤ m_i. The weighted form of Ryser's formula,

per(A) = (−1)^n \sum_{k \in \Omega} (−1)^{k_1 + ··· + k_R} \binom{m_1}{k_1} \cdots \binom{m_R}{k_R} \prod_{i=1}^{n} \sum_{r=1}^{R} k_r \bar{c}_r(i),

where \bar{c}_r(i) denotes the i-th entry of \bar{c}_r, then automatically leads to a speed-up when repeated columns are present. A very rough upper bound on the number of terms in this sum is n^R. So for R small compared to n one gets a significant speed-up compared to general matrices, and for constant R the algorithm runs in polynomial time. Using a slightly more involved expansion this can be extended [11] to an algorithm with a running time of the form O(n^{cr}), where r is the rank of the matrix and c is some constant. In the Appendix we give explicit algorithms for computing the permanent on a parallel computer. Here we will give a short discussion of the performance, with respect to both running time and precision, of our program for computing the permanent function. In Appendix A we give a more detailed description of the algorithm, and instructions on how to download the program package.
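For illustration, here is a compact Python sketch of Ryser's formula with Gray-code updates, i.e., Eq. (3) without the extra halving trick of Ref. [10] (the function name is ours). Each step flips one column in or out of J and updates the n row sums in O(n):

```python
# Ryser's formula with Gray-code ordering: O(n * 2^n) instead of the
# O(n * n!) of the definition. At step k the flipped column is given by
# the lowest set bit of k, so consecutive subsets J differ by a single
# column and the row sums can be updated incrementally.
def permanent_ryser(a):
    n = len(a)
    row_sums = [0.0] * n
    in_J = [False] * n
    size = 0                 # |J|, maintained incrementally
    total = 0.0
    for k in range(1, 1 << n):
        j = (k & -k).bit_length() - 1     # column entering/leaving J
        sign = -1.0 if in_J[j] else 1.0
        in_J[j] = not in_J[j]
        size += 1 if sign > 0 else -1
        for i in range(n):
            row_sums[i] += sign * a[i][j]
        prod = 1.0
        for s in row_sums:
            prod *= s
        total += (-1.0) ** size * prod
    return (-1.0) ** n * total

print(permanent_ryser([[1, 2], [3, 4]]))   # 10.0, matching the definition
```

The inner product over row sums is where almost all of the n 2^n work is spent, which is also why this loop parallelizes so well.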
Our benchmarking, and our later numerical simulations, were performed on the Kebnekaise cluster at HPC2N in Umeå, Sweden. Each node on this cluster has two 14-core Intel Xeon processors running at 3.5 GHz, and 128 GB RAM.

B. Precision
We start by examining issues concerning precision. The implementations of different algorithms for computing the permanent in [3] achieved far worse precision for Ryser's algorithm than for some other variations, leading to errors larger than 100% for n ≥ 32. However, by a judicious choice of precision for single numbers and summation method we find that Ryser's algorithm can be implemented with both good precision and speed.
Note that the formulation in Eq. (3) is essentially a very long sum of products. The terms, of different signs, can vary greatly in size, so some care must be taken to make sure that the resulting sum remains accurate. We have experimented with three approaches: doing all computations in standard double precision; computing the products in double precision (usually safe precision-wise) while using quadruple precision for the sum; or using only double precision together with Kahan summation [12] for the sum.
Using only double precision is of course fastest, but caution is needed precision-wise for n ≳ 30. The double-quadruple precision approach runs about half as fast, but its precision is far better. We see no significant difference in precision between Kahan summation and the double-quadruple precision approach, and Kahan summation runs only about 5% slower than standard summation in double precision. Thus Kahan summation is highly recommended, especially if quadruple precision is not available. Some care must be taken so that the compiler is not given too much freedom to alter the code semantically during optimization; on the compilers we tried, the default optimization setting did, however, not alter the code. Note that partial sums from each core or node should be stored in an array before the final Kahan summation, rather than just performing a Reduce operation. The double-quadruple precision approach, on the other hand, allows the use of the built-in reduction routines when summing up the partial sums, and of full optimization.
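To illustrate why compensated summation helps, here is a minimal Python sketch of the Kahan scheme:

```python
# Kahan (compensated) summation: the variable c tracks the rounding
# error of the previous addition and feeds it back into the next one,
# recovering low-order bits that plain addition would discard.
def kahan_sum(values):
    total = 0.0
    c = 0.0                      # compensation for lost low-order bits
    for v in values:
        y = v - c                # add back what was lost last time
        t = total + y
        c = (t - total) - y      # rounding error of this addition
        total = t
    return total

# One large term followed by a million tiny ones: plain left-to-right
# double-precision summation loses every tiny term, Kahan keeps them.
values = [1.0] + [1e-16] * 1_000_000
print(sum(values))        # 1.0 exactly: the tiny terms vanish
print(kahan_sum(values))  # ~1.0000000001, the correct answer
```

The same pattern applies to the partial sums in Ryser's formula, where terms of opposite sign and widely varying magnitude make naive accumulation risky.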
It is, of course, difficult to say in general what the true precision of the result is after a permanent computation. Only for all-1 matrices, denoted J (recall per(J) = n!), and some 01-matrices corresponding to biadjacency matrices of graphs do we have exactly known permanents. We suspect, however, that these are worst cases precision-wise, and that, e.g., the random matrices used in our later simulations are somewhat safer. Comparing the results on random Gaussian matrices using double precision and double-quadruple precision, respectively, shows much smaller errors than for J-matrices.
The number of agreeing digits is in both cases roughly 17 − 0.15n. For the range of n studied here this never puts the difference above 10^{−11}, so computational error seems not to be an issue in this study.
In Fig. 1 we show the relative error in computing per(J) under different computational scenarios. Using only double precision is sensitive to how the computation is done. The top-most set of points in the figure shows the error from computing with double precision on a single core. The middle set of points shows the error when the computation (still only double precision) is run on a single multi-core node (20 ≤ n ≤ 37) and on several multi-core nodes (38 ≤ n ≤ 54). It thus matters in which order the partial sums are computed and added; this is here left to Reduction (OpenMP) and AllReduce (MPI), which clearly do a good job. The bottom set of points shows the error when the double-quadruple precision approach is used. Here the order of summation matters much less: the errors for single-core (15 ≤ n ≤ 40) and multi-core/multi-node (20 ≤ n ≤ 53) runs are almost indistinguishable, so the Reduction/AllReduce ordering has no significant influence.

C. Running time
In Fig. 2 we show the total core time for computing the permanent of an n × n matrix using the double-quadruple precision approach. The lower set of points (3 ≤ n ≤ 40) is for single-core runs and the upper set (20 ≤ n ≤ 53) for multi-core runs on one or several nodes. For the smaller n we obviously had to repeat the calculation many times to get an estimate of the rather small running times. The almost constant run time for 20 ≤ n ≤ 30 is due to the overhead (ca 5 seconds) for starting the program with MPI. There is of course an overhead for OpenMP multi-core as well, but it is much smaller. Several nodes are only used from n = 38, ranging from 2 up to 400 nodes for n = 51, 52, 53. The black line indicates a core time of roughly 10^{−9.2} n 2^n seconds. The corresponding plot (not shown) for double precision is very similar but runs almost twice as fast; the fluctuations between different n are more pronounced there, though not visibly so in a log-plot. We should also mention some effects of, we believe, cache memory sizes: for single-core runs the run time increases from n − 1 to n by a factor slightly larger than 2, as it should, but at n = 8, 16, 24, 32, 40 this ratio is significantly smaller than 2.

D. Improved algorithms for sparse and structured matrices
The basic algorithms for computing the permanent can be modified to handle matrices with many zero entries in a more efficient way, and we will here give a brief discussion of this. As discussed in Ref. [13] low-depth Boson sampling set-ups lead to sparse matrices, and there it is also proven that under standard complexity theoretic assumptions computing these permanents is exponentially hard even with a constant, but sufficiently large, number of non-zero elements in each row and column. In Ref. [13] the author suggests that algorithms for this kind of matrix would be of interest and here we demonstrate that they allow for an exponential speed-up over the general case.
The first interesting class is general sparse matrices, i.e., matrices where a significant proportion of the entries are zero. For matrices of this type many of the products in Ryser's method (3) will be zero, and thus not necessary to compute. If we interpret the matrix A as the adjacency matrix of an edge-weighted graph G we find that a set J can only lead to a non-zero product if J is a dominating set in G, i.e., every vertex in G has at least one neighbor in J.
If we restrict Eq. (3) to sets J which are dominating sets we still have an exponential-time algorithm, but one running in time poly(n) a^n, where a can be noticeably smaller than 2. Listing the minimal dominating sets of G is itself an exponentially hard problem, which can be solved in time O(1.7159^n) [14]. However, for sufficiently sparse graphs we can instead use simple heuristics which lead us to include all dominating sets plus some non-dominating ones, and still get a significant speed-up compared to the basic version of Ryser's method. In the sparse version of our code we have implemented this by greedily, according to vertex degree, picking a set of vertices I with disjoint neighborhoods and then listing all subsets J which contain at least one neighbor of every vertex in I.
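The dominating-set observation can be illustrated with a simple, unoptimized Python sketch: a subset J contributes a non-zero term to Ryser's sum only if every row has a non-zero sum over the columns in J, so a term can be abandoned as soon as one row sum vanishes. This only illustrates the pruning idea, not the greedy neighbourhood heuristic used in the actual package:

```python
# Ryser's sum with dominating-set style pruning: whenever some row i has
# row sum exactly zero over the chosen columns J, the whole product
# vanishes and the subset can be skipped. For a sparse matrix with
# non-negative entries this happens for most subsets J.
from itertools import combinations

def permanent_pruned(a):
    n = len(a)
    total = 0.0
    for size in range(1, n + 1):
        sign = (-1.0) ** size
        for J in combinations(range(n), size):
            prod = 1.0
            for i in range(n):
                s = sum(a[i][j] for j in J)
                if s == 0.0:     # row i has no support in J: term is zero
                    prod = 0.0
                    break
                prod *= s
            total += sign * prod
    return (-1.0) ** n * total

# Sparse 0/1 example, the adjacency pattern of a directed 3-cycle plus loops:
print(permanent_pruned([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))   # 2.0
```

Note that the early exit is exact for any entries, since a zero inner sum always kills the product; the exponential savings, however, only appear for sparse matrices.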
For matrices leading to a sparse graph G we can also prove that the number of dominating sets is exponentially smaller than 2^n, leading to an exponential speed-up over the basic version of Ryser's formula. We say that a matrix is d-sparse if each row and column contains at most d non-zero entries.
Theorem 1. Let A be a d-sparse n × n matrix. Then the permanent of A can be computed in time poly(n) c_d^n, for some constant c_d < 2 depending only on d. We will return to this theorem and give a proof in Appendix B.
In Fig. 3 we show how the running time increases with n in our Fortran implementation. Clearly this allows for computation of the permanent of surprisingly large matrices. However, with this implementation the improvement quickly vanishes for densities above 5%; a more ambitious algorithm would surely improve considerably on this.
The associated graph G can also be used to describe the second class of matrices for which we have improved algorithms, namely those for which G has bounded tree-width. This direction is a classical topic [15] and it is well known that for matrices with tree-width bounded by some number k the permanent can be computed in time which is exponential in k but polynomial in n. After the first such algorithms a number of improvements have been made [16][17][18][19], but we are not aware of any efficient implementations of the general bounded tree-width methods. However, one interesting subclass of this family is the set of matrices of limited bandwidth k, i.e., matrices A where a_{i,j} = 0 whenever |i − j| > k, for which we can give a practically useful algorithm. This class of matrices is of particular interest in connection with Boson sampling, since it includes the unitary matrices for Boson systems with certain restrictions. For matrices of bandwidth k we can interpret the product in Eq. (1) as a directed walk on the associated graph G where every vertex has out-degree one and in-degree one, and when the bandwidth is limited to k every vertex is connected only to vertices whose indices differ by at most k from its own. This means that the sums over these weighted paths can be done using a transfer matrix, following the general set-up described in Refs. [20][21][22], leading to an algorithm which runs in time O(n^2 2^k) and can in fact be reduced to time linear in n for fixed k. We describe a linear-time version of this method in Appendix C.
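As a concrete special case, bandwidth k = 1 (a tridiagonal matrix) admits a simple linear-time recurrence, since a non-zero permutation can only fix the last index or transpose the last two. This hand-written Python sketch is the k = 1 analogue of the transfer-matrix method described above, not the general algorithm of the package:

```python
# Linear-time permanent of a tridiagonal matrix (bandwidth k = 1):
#   per(A_m) = a[m][m] * per(A_{m-1}) + a[m][m-1] * a[m-1][m] * per(A_{m-2}),
# where A_m denotes the leading principal minor on rows/columns 0..m.
def permanent_tridiagonal(a):
    n = len(a)
    if n == 0:
        return 1.0
    p_prev2, p_prev = 1.0, float(a[0][0])   # per of empty and 1x1 minors
    for m in range(1, n):
        p = a[m][m] * p_prev + a[m][m - 1] * a[m - 1][m] * p_prev2
        p_prev2, p_prev = p_prev, p
    return p_prev

a = [[1, 2, 0],
     [3, 4, 5],
     [0, 6, 7]]
print(permanent_tridiagonal(a))   # 100.0: 1*4*7 + 2*3*7 + 1*5*6
```

For general k the scalar state (the last two minors) is replaced by a vector of partial permanents indexed by the possible boundary configurations, which is what makes the run time exponential in k but linear in n.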
We have implemented the band-limited method in Mathematica and in Fig. 4 we display the run times for n ≤ 100 and k = 1, 2, 3, 4, 5 for random real gaussian matrices. Our Mathematica code is available online and this method will be included in an updated Fortran package as well.
For matrices of logarithmic bandwidth this algorithm retains a polynomial run time, but with a degree which depends on the scaling of the logarithm. From the above we then get the following proposition: for optical networks in which each mode interacts only with its nearest neighbours, or more generally those within some fixed distance r, the related unitary matrix will have limited bandwidth. Here the bandwidth depends linearly on both r and the depth of the optical network. As discussed in, e.g., Refs. [23,24], these systems can be simulated in sub-exponential time.
As long as the input and output states of such a network do not have several photons in a given mode (so-called collision-free states), the relevant matrix will be of limited bandwidth and our linear-time algorithm can be applied. If there are collisions in the output, the matrix will have repeated columns; it will still have limited bandwidth as long as the number of collisions is bounded, but the collisions now add to the bandwidth. Note, however, that if we only have repeated columns our transfer-matrix based algorithm can be modified to handle this generalized version of band-limited matrices, where the "width" will instead correspond to the maximum number of non-zero elements in a column.

Observation 1. Using the algorithm for band-limited matrices as a subroutine, the algorithm from Ref. [4] simulates band-limited Boson sampling systems in polynomial time as long as the number of collisions in the output state is bounded.
For a Haar random unitary the caveat on bounded collisions is not required, due to the Bosonic birthday paradox [25,26], which guarantees that collisions will be unlikely if the number of Bosons is not too large. For low-depth systems this theorem does not apply, but due to the bounded range of the interactions we still do not expect large numbers of collisions in a single output mode for a random input state. Providing an exact quantitative form of this statement, converging to the Bosonic birthday paradox as depth increases, would be of interest.

III. COMPLEX GAUSSIAN MATRICES
Let A = (a_{i,j}) be an n × n matrix of i.i.d. complex Gaussian numbers, i.e., A is a member of the Ginibre ensemble G(n). Each element a_{i,j} can be generated by first producing two independent real Gaussian numbers g_1 and g_2 using the Box-Muller method [27]; then (g_1 + i g_2)/√2 is a random complex Gaussian with mean 0 and variance 1. Clearly, the expected value of the permanent of a matrix with such entries is 0, by symmetry. However, we are mainly interested in the properties of the modulus |per(A)|. It is known [1] that ⟨|per(A)|^2⟩ = n!, and this holds as long as the entries are independent real or complex numbers with variance 1, so it is natural to work with the random variable X = |per(A)|/√(n!), for which ⟨X^2⟩ = 1. Remarkably, it is also shown in Ref. [1] that ⟨X^4⟩ = n + 1. In general, the authors of Ref. [1] can express the even moments exactly in terms of the expected number of decompositions of a k-regular multigraph into disjoint perfect matchings. An asymptotic result, following from the proof of van der Waerden's conjecture on the permanent of doubly-stochastic matrices, is then that ⟨X^{2k}⟩ ∼ (k/e)^n for k ≥ 3. We will denote the distribution function by F(x, n) = Pr(X ≤ x), where X is sampled from n × n matrices, and the density function by f(x, n) = F'(x, n). We also let f(x) = f(x, ∞) and F(x) = F(x, ∞), but use f(x) and F(x) as generic forms when the context is clear. The x-values are chosen in a geometric progression with 16 points per decade (an interval of the form [10^{k−1}, 10^k) for some integer k). The density f(x, n) is then obtained from the difference quotient of F(x, n) between consecutive x-values, and the error in f(x) can then be obtained in the usual fashion.
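The normalization ⟨X^2⟩ = 1 is easy to check in a small Monte-Carlo experiment. Here is a self-contained Python sketch (function names are ours) using the Box-Muller construction described above and a brute-force permanent, so it is only feasible for small n:

```python
# Monte-Carlo check of <X^2> = 1 for X = |per(A)| / sqrt(n!), where A is
# drawn from the Ginibre ensemble via the Box-Muller method.
import math, random
from itertools import permutations

def complex_gaussian(rng):
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0, 1], avoids log(0)
    r = math.sqrt(-2.0 * math.log(u1))
    return complex(r * math.cos(2 * math.pi * u2),
                   r * math.sin(2 * math.pi * u2)) / math.sqrt(2.0)

def permanent(a):
    n = len(a)
    total = 0j
    for sigma in permutations(range(n)):
        prod = 1 + 0j
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += prod
    return total

rng = random.Random(2024)
n, samples = 3, 20000
acc = 0.0
for _ in range(samples):
    a = [[complex_gaussian(rng) for _ in range(n)] for _ in range(n)]
    acc += abs(permanent(a)) ** 2
est = acc / (samples * math.factorial(n))
print(est)   # close to 1
```

Since Var(X^2) = ⟨X^4⟩ − 1 = n, the standard error of the estimate with 20000 samples at n = 3 is about 0.01, so the printed value should sit well within a few percent of 1.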

A. Measuring up the distribution
For each size n we have computed the permanent of 10^6 random complex Gaussian n × n matrices for n = 1, 2, . . . , 30, which we will use in this section, and of 10^5 matrices for n = 31, . . . , 35, to be used later. Let us take a look at the moments ⟨X^k⟩ of the sampled data. In Fig. 5 we show the mean ⟨X⟩ vs 1/n. A fitted line suggests the limit mean value 0.684(1), where the error estimate is obtained from fits on the points n ≥ k for k = 6, . . . , 12. The error bars in the figure were estimated from bootstrap resampling. For the second moment, shown in Fig. 6, we know that it should be 1, and indeed there is only noise-like deviation from the line y = 1.
In Fig. 7 the third moment is shown and the noise is still manageable. Using the line-fitting procedure described for the first moment suggests the limit 3.20(3). In Fig. 8 we show the fourth moment ⟨X^4⟩, which should be n + 1, plotted against n. The line y = 1 + x together with the data points indicates that for larger n the data often favor a smaller moment. However, the error bars occasionally get quite pronounced. This indicates that our data set, for lack of samples, has not yet fully explored the rather heavy tails of the distribution. Finally, in Fig. 9 we show the distribution density for n = 30; the plotted range (x < 2) contains 95% of the samples.

This brings us to an interesting conjecture: the permanent anti-concentration conjecture, see Ref. [1]. It states that there exists a polynomial p such that F(1/p(n, 1/δ), n) < δ for all n and δ > 0. Equivalently, this can be formulated as the existence of constants a, b, c such that

F(ε, n) < a n^b ε^c

for all n and ε > 0. In the next section we will provide numerical support for this conjecture.

B. The distribution in detail
Let us try to discern how F(x) behaves for small x. In Fig. 10 we show F(x, n) for n = 6, . . . , 30 for a few values of x, together with fitted lines. It turns out that F(x, n) depends quite linearly on 1/n for n ≥ 6; when smaller n are included a higher-order correction term becomes necessary. Note that F(x, n) is strictly increasing with n for x ≲ 1.77. For each fixed x we fit a line through the points (1/n, F(x, n)), and its constant term then gives us an estimated asymptotic value, i.e., we make the ansatz F(x, n) = F(x) + C(x)/n so that the slope C(x) depends only on x. By deleting individual points from the line fitting we obtain error estimates for the parameters of each line. At x ≲ 0.005 the probabilities become less than 10^{−4} and the error bar for each estimate of F(x) becomes significant; we have thus only used x ≥ 0.005.
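The extrapolation step, fitting F(x, n) = F(x) + C(x)/n and reading off the intercept, is ordinary least squares in the variable 1/n. A small Python sketch with synthetic data (the coefficients 0.25 and −0.5 are made up for the demonstration):

```python
# Least-squares line fit used for the 1/n extrapolation: given points
# (1/n, F(x, n)), the intercept estimates the limit F(x) = F(x, infinity)
# and the slope estimates the finite-size coefficient C(x).
def fit_line(xs, ys):
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope            # (intercept, slope)

# Synthetic data lying exactly on F(x, n) = 0.25 - 0.5 / n:
ns = range(6, 31)
xs = [1.0 / n for n in ns]
ys = [0.25 - 0.5 / n for n in ns]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)   # -> 0.25 and -0.5, up to rounding
```

Deleting individual points and refitting, as described above, then gives a simple jackknife-style error estimate for both parameters.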
In Fig. 11 we show a log-log plot of the asymptotic F(x). We have fitted a line with slope 2 through the points x < 0.05, which corresponds to the approximation F(x) ∼ 6.0(1)x^2. The lower inset of the figure shows the ratio F(x)/x^2, which plausibly approaches the limit 6 despite significant noise setting in for x ≲ 0.01. The upper inset of Fig. 11 shows a log-log plot of the difference 6x^2 − F(x), approximated by 7(1)x^3 (note the rather wide error bar), as indicated by the red line of slope 3. Together they suggest F(x) ∼ 6x^2 − 7x^3. The error estimates capture how sensitive the coefficients are to deleting data for individual n and to the choice of the x-interval over which lines are fitted.
There is also the matter of finite-size scaling to take into account, i.e., the n-dependence. A similar analysis of the slopes of the lines in Fig. 10, i.e., the parameter C(x) mentioned above, gives C(x) ∼ −12.0(5)x^2. Including the finite-size term we then have the finite-size scaling F(x, n) ∼ (6.0 − 12.0/n) x^2. In the next subsection we will try to obtain a finite-size correction for the x^3-term. In any case, we have so far found that F(x, n) ≲ 6x^2 for all n and x, which supports the anti-concentration conjecture.

[Fig. 11: Gaussian entries. Log-log plot of F(x) with the line (red) corresponding to the approximation 6x^2 (see text). The upper inset shows a log-log plot of 6x^2 − F(x) with a line (red) corresponding to 7x^3; the lower inset shows the ratio F(x)/x^2 with the line (red) at y = 6.]

C. The distribution of the squares
Let us now see how this translates to the distribution of the squares X^2. The distribution function of X^2 is F_2(x, n) = Pr(X^2 ≤ x) = F(√x, n). This gives the density function f_2(x, n) = f(√x, n)/(2√x). Using Eq. (7) and Eq. (8) we obtain f_2(x) ∼ 6 − (21/2)√x. In Fig. 12 we show the density function f_2(x, n) of X^2 for n = 30. It was speculated in Ref. [1] that the density f_2(x) goes to infinity when x → 0, but we claim that it goes to a finite limit. Our approximation for F(x) stays relevant for x ≲ 0.10, which from the perspective of the squares X^2 means that our formula for f_2(x) is relevant only for x < 0.01. We are here in the lower 5% of our data set, so our analysis demands a large number of samples.
Fitting the curve y = C_0 + C_1 √x to the measured f_2(x, n) on the range 0.0005 < x < 0.01, we find how the coefficients C_0 and C_1 depend on n. Note that C_0 should scale as C_0 = 6.0(1) − 12.0(5)/n, as in Eq. (10). This is confirmed in Fig. 13. The inset shows how C_1 depends on n, and we find C_1 = −10.5(9) + 36(6)/n, though the points are here quite scattered, which is reflected in the error bars. Again, the error bars indicate how much the result depends on the choice of fitted points (less) and on the fitting range (more).
Adding these new terms we find the finite-size scaling rules

f(x, n) ∼ (12 − 24/n) x + (−21 + 72/n) x^2,   (12)

with the limit f(x) ∼ 12x − 21x^2, and for the squares

f_2(x, n) ∼ (6 − 12/n) − (21/2 − 36/n) √x,   (14)

with the limit f_2(x) ∼ 6 − (21/2)√x. In Fig. 14 we show the measured f(x, n) and the function in Eq. (12) for n = 10, 20, 30 on x < 0.1. The fit is quite excellent. In Fig. 15 we show the measured f_2(x, n) and Eq. (14) for x < 0.008, again with a good fit, though the error bars for f_2 are here quite noticeable.
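The change of variables behind these rules is f_2(x) = f(√x)/(2√x). A quick Python check that it turns the fitted limit f(x) = 12x − 21x^2 into the constant-plus-√x form quoted above:

```python
# Density of X^2 from the density of X: if X has density f, then X^2 has
# density f2(x) = f(sqrt(x)) / (2 sqrt(x)). With f(x) = 12x - 21x^2 this
# gives f2(x) = 6 - 10.5*sqrt(x), i.e. the coefficients 6 and 21/2.
import math

def f(x):
    return 12.0 * x - 21.0 * x ** 2

def f2(x):
    s = math.sqrt(x)
    return f(s) / (2.0 * s)

for x in (1e-4, 1e-3, 1e-2):
    assert abs(f2(x) - (6.0 - 10.5 * math.sqrt(x))) < 1e-12
print("f2(x) = 6 - 10.5*sqrt(x) on the tested points")
```

The algebra is immediate, f(√x)/(2√x) = (12√x − 21x)/(2√x) = 6 − 10.5√x, but the numeric check is a cheap guard against sign and factor slips in such transformations.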

IV. COMPLEX CIRCULAR DISTRIBUTION
Here we let the entries of the matrix be random complex numbers of modulus 1, i.e., each entry is of the form exp(iθ) where θ is uniformly distributed on the interval [0, 2π). For each size n we have collected data for 10^7 such random matrices, for n = 1, 2, . . . , 30. As before, we study the distribution of X = |per(A)|/√(n!). Our investigation proceeds as in the previous section, though we are here armed with better data.
The mean ⟨X⟩ is shown in Fig. 16, and from fitted 2nd-degree polynomials we estimate the limit 0.7753(2). The second moment is again exactly 1, but obviously we see some small fluctuations; the behaviour is quite similar to that in Fig. 6, though with less noise. The third moment ⟨X^3⟩ in Fig. 17 prefers the limit 2.41(2), as per fitted polynomials as before. The fourth moment ⟨X^4⟩, shown in Fig. 18, has a linear behavior just as in the Gaussian case in Fig. 8. A rough estimate of its behavior, based on n ≤ 13, is ⟨X^4⟩ ∼ 0.72(2) + 0.37(1)n.
The distribution density f(x) looks very similar to that of the Gaussian case. To estimate the limit distribution function F(x) we use the ansatz F(x, n) = C_0(x) + C_1(x)/n + C_2(x)/n^2 on the points n ≥ 5. Note that the linear ansatz used in the Gaussian case is not sufficient here. In Fig. 19 we show a log-log plot of the limit F(x). The red line has slope 2, so again we expect F(x) ∝ x^2 for small x. The inset shows the ratio F(x)/x^2 and, despite the noise setting in at x < 0.01, we estimate F(x)/x^2 ∼ 2.135(10). The error bar includes both errors from excluding points in the F(x, n) ansatz and errors depending on which points x are included (we have used 0.01 ≤ x ≤ 0.08). Clearly, as shown by the inset, the rule F(x) ∼ 2.135x^2 breaks down for x ≳ 0.1. Including the finite-size scaling we find the following rule, useful for x ≤ 0.1 and n ≥ 5:

F(x, n) ∼ (2.135 − 8/n + 15/n^2) x^2.   (19)

Information on higher-order terms is more easily found by studying the distribution of the squares. The distribution in Eq. (19) translates to f_2(x, n) = 2.135 − 8/n + 15/n^2, so that the limit density is just the constant 2.135 for, say, x ≲ 0.01. However, plotting f_2(x, n) reveals that the density functions are very close to linear for x ≲ 0.1 and all n ≥ 5. Using the simple ansatz f_2(x, n) = C_0 + C_1 x, and fitting on the interval 0.005 ≤ x ≤ 0.08, we find that the slope approaches C_1 ∼ −6 for large n; the error bars mainly indicate sensitivity to which points are included. Note that the constant coefficient must scale as C_0 = 2.135 − 8/n + 15/n^2, see Eq. (19). To conclude, after defining the coefficients we obtain finite-size scaling rules for f(x, n) and f_2(x, n), whose respective limits become f(x) ∼ 4.27x − 12x^3 and f_2(x) ∼ 2.135 − 6x. In Figs. 20 and 21 we compare the rules for f(x, n) and f_2(x, n) to their measured counterparts. They fit very well over a surprisingly wide interval. Translating the limit f_2(x) = 2.135 − 6x to F(x) we find

F(x) = 2.135x^2 − 3x^4,   (26)

which adds a correction term to Eq. (19). Note that in Eq. (26) this correction term is of order x^4, whereas it was of order x^3 in the Gaussian case.

V. BERNOULLI-DISTRIBUTED ENTRIES
We will here apply the approach of the previous sections to a discrete class of random matrices, that of Bernoulli-distributed ±1 entries with Pr(+1) = Pr(−1) = 1/2. Matrices with Bernoulli-distributed entries have been studied in the mathematics literature, with an emphasis on the probability of small values of the permanent. In [7] it was proven that with probability tending to 1 the permanent is larger than n^{(1/2−ε)n} for any fixed ε > 0, and likewise smaller than n^{(1/2+ε)n} with probability tending to 1. Those authors also conjectured that the lower bound can be improved to exp(−cn) n^{n/2} for some constant c, and we will comment more on this later.
Our data consist of 10^6 samples for n = 1, 2, . . . , 30. Unfortunately we cannot obtain quite the same level of precision in our scaling analysis as for the complex Gaussian and circular cases: strong finite-size effects and erratic behavior for smaller n call for larger matrices and many more samples.
Starting out with the first moment ⟨X⟩ in Fig. 22, we find that it is asymptotically 0.647(2). The second moment is of course 1, and the third moment is asymptotically 3.75(5) (not shown). The fourth moment, as in the Gaussian case, appears to grow linearly with n, but we only give the very rough estimate ⟨X^4⟩ ≈ 1.25(5)n − 1.0(5) due to its erratic behavior.
In Fig. 23 we show a log-log plot of F(x), where F(x) was obtained using a finite-size scaling ansatz similar to that of the previous cases. However, care must be taken here to include only matrices that are large enough, since the corrections to scaling are considerably larger in this case. We have only used n ≥ 15, which contributes some noise to the estimated F(x) since we fit on fewer points. The red line in Fig. 23 has slope 1 and corresponds to the estimate F(x) ∼ 1.33(5)x. The inset shows the ratio F(x)/x, which clearly approaches a limit around 1.33(5).
Unfortunately our data are not good enough to extract a correction term of higher order, and pinpointing the finite-size scaling would also be an unreliable affair. We will simply translate our distribution function into the distribution of X^2. In conclusion we thus find the asymptotes F(x) ∼ 1.33x for small x, Eq. (31), and, for the density of X^2, f_2(x) ∼ 1.33/(2√x), Eq. (33). Note here that we claim that f_2(x) → ∞ when x → 0, unlike the complex Gaussian and circular cases where f_2(x) approached limits of 6.0(1) and 2.135(10) respectively. In Fig. 24 we show the measured f(x, n) (inset) and f_2(x, n) for n = 30 and compare them to the limits of Eqs. (31) and (33).
Finally we note that our data are compatible with the following strengthening of the conjecture from [7]:

Conjecture 1. X is asymptotically almost surely larger than h(n) exp(−n/2) n^(n/2+1/4), where h(n) is any function tending to 0 as n → ∞.

VI. GAUSSIAN BEHAVIOUR OF MINORS OF UNITARY MATRICES
The computational hardness of Boson sampling as analysed in [1] depends on the fact that certain submatrices of Haar-random unitary matrices asymptotically behave like random matrices with Gaussian entries. Next we will investigate how close the permanent of such a submatrix is to the permanent of a random Gaussian matrix.
Let U(n) be the family of unitary n × n-matrices. We can generate random members of U(n) under the Haar measure in the following way [28]: produce a complex random Gaussian matrix A as above, find its QR-decomposition A = QR with R = (r_{i,j}), and let Λ = (λ_{i,j}) be the diagonal matrix of normalized elements of R, so that λ_{i,j} = δ_{i,j} r_{i,j}/|r_{i,j}| (where δ_{i,j} is the Kronecker delta); then U = QΛ is a random unitary matrix from U(n). Now let S(m, n) be a family of random matrices obtained by first generating a random unitary m × m-matrix U ∈ U(m), next letting U_n be the top-left n × n-submatrix of U, and setting S = √m U_n. Then S is a random matrix from the family S(m, n).
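The generation procedure just described is straightforward to sketch in code (NumPy assumed; the function names are ours):

```python
import numpy as np

def haar_unitary(m, rng=None):
    """Haar-random m x m unitary via QR of a complex Gaussian matrix.

    The diagonal phase fix U = Q @ Lambda, with lambda_j = r_jj / |r_jj|,
    removes the phase ambiguity of the QR decomposition, as in the text."""
    if rng is None:
        rng = np.random.default_rng()
    a = (rng.standard_normal((m, m))
         + 1j * rng.standard_normal((m, m))) / np.sqrt(2.0)
    q, r = np.linalg.qr(a)
    lam = np.diagonal(r) / np.abs(np.diagonal(r))
    return q * lam  # scales column j of Q by lambda_j

def sample_S(m, n, rng=None):
    """A draw from the family S(m, n): sqrt(m) times the top-left
    n x n submatrix of a Haar-random m x m unitary."""
    return np.sqrt(m) * haar_unitary(m, rng)[:n, :n]
```

Feeding such samples to a permanent routine gives the empirical distributions compared below.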
It is known that if S ∈ S(n^6, n) and A ∈ G(n) then the variation distance between them is expected to be small, i.e., they have essentially the same probability distribution. The authors of Ref. [1] prove a slightly stronger result, but they also believe that n^6 can be replaced by something much smaller, say closer to n^(2+ε). We will use distributions of the permanent to see if we can throw some light on this problem.
We will use the Kolmogorov-Smirnov (KS) statistic D = sup_x |F(x) − G(x)| as a measure of the distance between two empirical distribution functions F(x) and G(x). For a two-sample test, we reject the null hypothesis that they are the same (at significance level α) if D > D_α for a critical value D_α. For α = 0.05 and 10^5 samples from both distributions we get D_α = 0.00607. We then generate S ∈ S(n^a, n) and compute |per(S)|/√(n!) for 10^5 different S, and compare this distribution to that of |per(A)|/√(n!) for 10^5 complex Gaussian matrices A. We will see whether the distance D has an increasing or decreasing trend for different values of a. Note that when n^a is not an integer we round to the nearest integer. We use Mathematica's built-in routine for the two-sample Kolmogorov-Smirnov test statistic as the value of D.
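For readers without Mathematica, the same comparison can be sketched with plain NumPy. The statistic below is the standard two-sample KS distance between empirical distribution functions, and d_alpha uses the standard large-sample critical value c(α)·√((n1+n2)/(n1·n2)) with c(0.05) ≈ 1.358, which reproduces the D_α = 0.00607 quoted above:

```python
import numpy as np

def ks_statistic(f_samples, g_samples):
    """Two-sample KS statistic D = sup_x |F(x) - G(x)|, computed by
    evaluating both empirical CDFs at every sample point."""
    f = np.sort(np.asarray(f_samples, dtype=float))
    g = np.sort(np.asarray(g_samples, dtype=float))
    x = np.concatenate([f, g])
    cdf_f = np.searchsorted(f, x, side="right") / len(f)
    cdf_g = np.searchsorted(g, x, side="right") / len(g)
    return float(np.max(np.abs(cdf_f - cdf_g)))

def d_alpha(n1, n2, c_alpha=1.358):
    """Large-sample critical value for the two-sample KS test;
    c_alpha = 1.358 corresponds to significance level 0.05."""
    return c_alpha * np.sqrt((n1 + n2) / (n1 * n2))

print(round(d_alpha(10**5, 10**5), 5))  # 0.00607, as quoted in the text
```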
We have run this test for 5 different a and a wide range of n for each a: 1 ≤ n ≤ 34 for a = 2, 1 ≤ n ≤ 32 for a = 2.25, 1 ≤ n ≤ 30 for a = 2.50, 1 ≤ n ≤ 22 for a = 2.75 and 1 ≤ n ≤ 17 for a = 3. In Fig. 25 we show the KS-statistic D versus n ≥ 3 for the different a.
For a = 2 the values of D are clearly increasing at first, but there is no clear trend beginning at n ≈ 28, with D staying at roughly 0.08. For a = 2.25 there is a very weak increasing trend in D; excluding individual points from the line fit is not enough to produce a decreasing trend, though. For a = 2.5 there is a distinctly decreasing trend, but it would take n ≈ 75 to pass a KS-test at the 5% level. For a = 2.75 the distributions actually pass a KS-test for n = 19, 21, 22, and for a = 3 they pass it for n = 12 and 13. Reading the trends in the KS-statistic D, it would thus appear that matrices from S(n^a, n) are essentially indistinguishable from complex Gaussian n × n-matrices, in terms of their permanents, when a > 2.25, while for a < 2.25 they are not; the case a = 2.25 appears to be a separator between the two regimes which we cannot classify. If this is a correct classification, rather than an effect of slow convergence in n for the lower values of a, then it would contradict the conjecture in Ref. [1] that a ≥ 2 + ε is enough.
It is possible that the curve for a = 2 will begin to decrease for larger n, in accordance with the conjecture from Ref. [1]. Nonetheless, for a < 2.25 we see a behaviour which is distinct from the Gaussian case for the range of n used here. So care must be taken in the analysis of Boson sampling experiments where an effective value of a close to, or equal to, 2 has been chosen, if the number of bosons is small.

VII. CONCLUSIONS
In order to compute output probabilities for Boson sampling experiments [1] one has to compute the permanent of the associated unitary matrices. Here we have presented a software package for doing such computations efficiently, both on serial and parallel machines. Our programs are efficient enough to allow us to beat the previous world record for computation of permanents in a substantial way, despite the fact that the previous record was set on a far larger cluster [3].
Our package also has specialised functions for matrices of limited bandwidth, running in time O(2^k n^2) for matrices of bandwidth k, and in linear time for fixed k. This makes it possible to classically simulate a Boson sampling system of depth O(log n) in polynomial time.

We have used our software package to perform a large-scale simulation study of the anti-concentration conjecture for permanents [1]. Here we find that our data agree well with the conjecture, both for complex Gaussian matrices and other matrix classes. We also investigated how well the permanent of an n × n minor of an n^a × n^a Haar-random unitary matrix can be approximated by the permanent of a random Gaussian matrix. Here we find some possible tension with the most optimistic version of a conjecture from Ref. [1].