Parallelizing quantum circuit synthesis

Quantum circuit synthesis is the process in which an arbitrary unitary operation is decomposed into a sequence of gates from a universal set, typically one which a quantum computer can implement both efficiently and fault-tolerantly. As physical implementations of quantum computers improve, the need is growing for tools which can effectively synthesize components of the circuits and algorithms they will run. Existing algorithms for exact, multi-qubit circuit synthesis scale exponentially in the number of qubits and circuit depth, leaving synthesis intractable for circuits on more than a handful of qubits. Even modest improvements in circuit synthesis procedures may lead to significant advances, pushing forward the boundaries of not only the size of solvable circuit synthesis problems, but also in what can be realized physically as a result of having more efficient circuits. We present a method for quantum circuit synthesis using deterministic walks. Also termed pseudorandom walks, these are walks in which once a starting point is chosen, its path is completely determined. We apply our method to construct a parallel framework for circuit synthesis, and implement one such version performing optimal $T$-count synthesis over the Clifford+$T$ gate set. We use our software to present examples where parallelization offers a significant speedup on the runtime, as well as directly confirm that the 4-qubit 1-bit full adder has optimal $T$-count 7 and $T$-depth 3.


Introduction
Quantum computers, like their classical counterparts, will require a compiler that can translate from a humanreadable input or programming language into operations that can be executed directly on quantum hardware. Circuit synthesis is an integral part of the compilation process. Given an arbitrary quantum circuit C and a universal gate set , one seeks to find a decomposition, where k represents the depth of the circuit. A myriad of algorithms currently exist to find such a decomposition [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. They are generally divided into two classes: those that synthesize approximately (i.e. ) and others that synthesize exactly. Some procedures work for a single qubit, whereas others have been generalized to multiple qubits. Most of these algorithms were designed to work over the Clifford+T universal gate set, though other gate sets such as the V-basis have also been studied [5,6].
Many of the algorithms that perform exact synthesis fall victim to the fact that the time and space used depend exponentially on both the number of qubits and the depth of the circuit in question. Even on a reasonably fast machine, the synthesis of circuits with more than a handful of qubits and layers of depth becomes intractable.
In this work, we propose a method of circuit synthesis based on a heuristic search technique commonly used in cryptanalysis: collision finding based on deterministic, or pseudorandom, walks. These are walks through a search space such that once a starting point is chosen, the path is completely determined. More generally, we show how we can use deterministic walks to traverse the space of possible circuits of a given depth and find solutions to the synthesis problem. A key ingredient in our method is a mapping from the unitary operators constructed from the gate set  to binary strings of a constant length and a suitable mapping back to the set of unitary operators. When such mappings are defined, we can synthesize circuits over any universal gate set on any number of qubits by applying any existing walk method that can search the space.
The structure of this article is as follows. We begin in section 2 with a discussion of deterministic walks and how we can map quantum circuit synthesis to these types of problems. The subsequent sections pertain to our choice of implementation of one such method, namely parallel circuit synthesis. In section 3 we briefly lay out the procedure for parallel synthesis and provide a runtime complexity estimate, detailing the important parameters which affect the scaling of our algorithm. Section 4 introduces our software implementation, pQCS (parallel quantum circuit synthesis), which performs optimal T-count synthesis using parallel search. Section 5 contains the numerical results of large-scale experiments run on a Blue Gene/Q supercomputer. Here, we showcase the significant advantages afforded to us by parallelization. We conclude in section 6 and suggest avenues of future research on this topic.

Walking through circuits
Consider a hash function    h : , typically considered to operate over binary strings. If h is a good hash function, then for an arbitrary input, Î  x , the value = Î ( )  h x y will be in practice indistinguishable from a random output. Suppose there exists another function,    r : , unrelated to h, which maps elements of its range back to the domain (such a function is commonly termed a reduction function). Repeatedly applying • r h to an input will produce a trail of points scattered throughout . However, once the initial input is chosen, the progression of the trail is completely determined, hence we favour the term deterministic rather than pseudorandom walk even though the path of the walk appears random due to the natures of h and r.
Such determinism has led to a set of algorithms with a variety of applications. One well-known variation is rainbow tables [16], which are used for finding pre-images of hash functions (conventionally with the intention of cracking passwords). Collision finding in one hash function or claw finding between two functions has also been accomplished in parallel using deterministic walks [17], and was used to find collisions in double DES [18].
Deterministic walks are advantageous due to their low storage requirement: one need only store the starting point of a walk, its ending point, and the number of intermediate steps, whereas conventional search techniques would store the value of every point computed throughout.
With this in mind, we show how one can map the problem of circuit synthesis to a problem that can be solved using an algorithm based on deterministic walks. We have, as per equation (1), a product constructed from the universal gate set  . It is possible to specify a unique way of encoding the information about of equal length ℓ (where we assume ℓ is sufficiently long as to encompass all the information described in what follows). Suppose  contains a number of single-and twoqubit gates. If we enumerate all the gates in  , then for each U i we might use a few bits to identify all the constituent gates, and maybe a few more to specify if we should use their Hermitian conjugates. We will also need to indicate on which qubit(s) they act. Furthermore, there must be some space to indicate controls and targets where appropriate. Given any gate set, we can find a way of doing this such that every possible U i can be represented by a unique string b i . Then, the concatenation will be a unique string of length ℓ k representing the product of unitaries U U k 1  . We can perform a deterministic walk over unitary matrices as follows (this process is displayed graphically in figure 1). Let us define a function μ which maps a binary string of length ℓ k to a unitary matrix over a specified gate set  . Then, define a mapping ν from the unitaries over  back to binary strings { }* 0, 1 . Finally, choose a good hash function h from { }* 0, 1 to strings of length ℓ k (this may be a simple hash function or a combination of hash and reduction-type functions). Repeatedly applying n m • • h to a randomly chosen binary string of length ℓ k will allow us to traverse products of unitaries in a pseudorandom fashion; we can then use this to search the space of possible solutions to (1).

Parallel circuit synthesis
Once we have mappings as proposed in section 2, we can reformulate the circuit synthesis problem as a problem that can be solved using search algorithms based on deterministic walks. We specifically implemented one which performs parallel claw finding. Let    h : 1 1 and    h : 2 2 be two hash functions. A claw between h 1 and h 2 is a pair of inputs Î This is, in a sense, a collision search between two functions. Our interest in claw finding stems from recent work on circuit synthesis using a meet-in-the-middle (MITM) approach [12]. The motivation for that work is as follows. One can, of course, find a decomposition of (1) by brute force, computing all possible combinations starting from depth 1 up until a solution is found. Let ξ represent the number of unitaries having depth 1. Typically, ξ will depend exponentially on the number of qubits, n. Then, the runtime for brute force synthesis of a circuit with depth k takes time x ( )  k . A MITM approach achieves a roughly square-root speedup over this, accomplished by dividing the synthesis equation in half: Databases of unitaries having the form of each side of (3) are sequentially constructed (starting from depth 1), stored in binary trees, and then searched through until a suitable decomposition is found. This reduces the size of the search space by a square root factor, yielding runtime x , where the log factor is picked up due to the binary search.
To parallelize circuit synthesis, we build on the principles of the MITM algorithm. Rather than searching through static binary trees, we search the space in parallel, adapting a search technique originally developed for cryptanalysis [17]. Though our runtime will retain the exponential dependence on n and⌈ ⌉ k 2 , it scales inversely with the number of processors, allowing us to tackle larger problems that were infeasible using previous methods, as well as speed up the synthesis of some known circuits. We provide a brief description of the algorithm here as it pertains specifically to circuit synthesis. For a more detailed description, the reader is referred to [17] or [19].
Recall equation (3), and for simplicity, let us define as representing the left and right sides of this equation. Define a suitable mapping between unitary matrices and binary strings of length ℓ k as in section 2. Then, let ¢  represent the set of binary strings that are of the form V, and likewise  those of the form W. When k is odd, ¢  and  may differ in size by a factor of ξ. In this case, we partition ¢  into equal-sized chunks ¢ ¼ ¢ x-  , , 0 1 , and consider = ¢   i independently (a search can then be executed with each ¢  i sequentially or in parallel, adding another layer of parallelism to the implementation). When k is even, we simply let = ¢   .
One way these functions might be implemented is by converting the input string into a sequence of unitary matrices (in  for z 1 and  for z 2 ), computing their product, deriving a new binary string with the information about each of the matrix elements, and then running that string through a known hash function so that the outputs of both functions are in the same space and in practice appear to be random. Let us define a 'super' function´{ . Finding a claw between z 1 and z 2 is now equivalent to finding a collision in f with distinct values for b, i.e. we must find two inputs x 1 and Consider m processors all having access to a shared memory. We will denote some fraction θ of points in  as marked or distinguished. Every processor chooses a random starting pair Repeatedly applying f produces a trail through the space of possible circuits, which roughly half of the time will produce a part of equation (3) which is an element of  , and the other half of the time will produce an element of  . The trail continues until the next input, say x d , is a distinguished point. The trail is then terminated.
The collection of found distinguished points is stored in the shared memory. Distinguished points are stored as a triple consisting of the first pair ( ) n b , 0 0 , the last pair ( ) n b , d d , and the value d, which is the number of steps taken to reach the distinguished point. When a processor finishes its trail, it will attempt to add its distinguished point to the shared memory. If it sees that a trail ending at the given point is not present in this shared memory, it will insert it and then begin a new trail. However, if it sees that there is already a triple in storage that ended at the same distinguished point but had a different starting point, it means that somewhere along the way these two trails must have merged. The processor then takes the starting points of these two trails and traces back through them to locate the merge point.
There are a number of possibilities here as depicted in figure 2. First, it could be that one trail started 'before' the other, i.e. the merge point was at the beginning of the shorter trail. Another possibility is that when the trails merged, both had just performed z 1 or both had just performed z 2 . Even if the inputs were different, this case does not provide us with a solution to the problem at hand. The final case is that immediately before they merged, one trail performed z 1 and one performed z 2 ; it is only in this final case that we have found a solution.
With the information about the inputs in the step just before the collision, we can extract the unitary matrices from the binary string and have fully synthesized our circuit.
The runtime complexity of this algorithm can be estimated by applying the parameters of our problem directly to that in [17]. The size of the spaces ¢  and  are Our algorithm then scales as where w is the number of distinguished points that can be held in memory. The parameter τ is the execution time for a single iteration of z 1 or z 2 , the bulk of which will likely be spent performing matrix multiplication. Let us assume in the worst case that we are taking the product of ⎡ ⎢ ⎤ ⎥ 3 . Thus, we obtain our final estimate: As previously mentioned, this time is still exponential in the number of qubits as well as the depth of the circuit. We also note that it is often the case that matrix multiplication can be parallelized, or that some specific properties of the implementation at hand (such as sparsity) can be leveraged so as to improve the scaling. What is key here is that the runtime benefits from being inversely proportional to the number of processors and available memory.

Implementation details 4.1. Optimal T-count synthesis
The synthesis algorithm we chose to apply our approach to is the optimal T-count algorithm presented in [13].
Such an algorithm is relevant as in many state-of-the-art methods for fault-tolerant quantum computation, T gates are considered to be expensive to implement due to the need to distill magic states (see, for example, [20]). Let  n represent the n-qubit Pauli group. We reshuffle and rewrite the decomposition of a circuit C as where t is the T-count, D is a Clifford, Î  P j n , and It thus suffices to find a set of t Paulis and a Clifford which will satisfy equation (10) up to a global phase. The dependence on the global phase can also be removed by using the channel representation of every matrix in the above equation: where the channel representation of some matrix U is the matrix with coefficients The channel representation of an n-qubit unitary has dimension4 4 n n , with each row and column being indexed by a Pauli operator.
Using the optimal T-count algorithm has afforded us a number of advantages. First of all, the T-count formulation allows us to represent each unitary matrix in the sequence as a list of n-qubit Paulis. With binary symplectic representation, we can then represent each Pauli directly as a binary string, which leads to a very simple mapping with which we can perform our deterministic walks. Another strong point of the algorithm is that the channel representations of R(P) for Î  P n are sparse matrices. Thus, we were able to implement a sparse matrix multiplication algorithm that allows us to very quickly compute most matrix products despite the channel representations having dimension4 4 n n . We can apply equations (8) and (9) to the optimal T-count synthesis to obtain a runtime estimate. Each R(P) contributes a single T gate to the circuit, and can be considered as a single layer of depth in this implementation. Thus, we have that x = -4 1 n , as all Paulis save for the identity are valid choices. Our estimate for the runtime is thus

Computer specifications
We implemented the optimal T-count version of the parallel algorithm in C++11. It is called pQCS and is available for download and research use at https://qsoft.iqc.uwaterloo.ca/#software. Parallelization was accomplished using the Boost.MPI compiled library [21]. A scaled-down version of pQCS that uses only OpenMP for parallelization (and can be run on a standard multi-core personal computer) is also available in the above package. pQCS was extensively tested on two large-scale machines. The OpenMP-only version was tested on SHARCNET's Orca using a single node with up to 16 processors at 2.2 GHz speed. The MPI version was tested on Scinet's Blue Gene/Q (BG/Q) supercomputer, which has 65536 processors at 1.6 GHz speed. The largest test we have run to date involved a total of 8192 cores. All results below are from trials on the BG/Q. A flowchart and description of the distribution of work in the MPI version is presented in figure 3.

Determining effective simulation parameters
pQCS has a number of tunable parameters. In what follows, we will synthesize a known circuit, the Toffoli gate, and explore the scaling of our algorithm.
In the original description of the parallel collision finding algorithm [17], each processor was responsible for performing not only the search for a distinguished point, but also storing it and subsequently checking the validity of any possible solutions; it is from this setup that the heuristic runtimes are derived. In pQCS, however, processors are divided into three categories (as per figure 3) that communicate via MPI. Worker processors perform deterministic walks and generate distinguished points. Distinguished points are collected and stored incore on collector processes. Each collector has access to a number of verifier processors to which pairs of walks are sent for verification when the possibility of a claw occurs. The parameters m and w may not necessarily depend, then, on the total number of processors, but rather on the number of processors in one or more of the different classes. For example, w will depend solely on the number of collectors, whereas we expect m to be a function of the number of workers, assuming a sufficient number of collectors and verifiers are in place.
First, we focus on how many collectors and verifiers we should use. We chose two values for the total number of cores, 1024 and 2048. We then varied the fraction of nodes designated as collectors in increments of 1/16, from 1/16 to 1/4 the total (values outside this range clearly yielded inferior results). For each fraction of collectors, we either used the same or double the number of verifiers. The results of these trial runs are shown in figure 4. In all of these trials, we let 1/4 of the points in the space be designated as distinguished (later we will finetune this parameter as well). Each point is the average of 100 independent trials. We find that for both total quantities of processors, the optimal number of collectors is 1/8 the total number, and for verifiers 1/4 the total. When more than 3/8 of the total processes are being used on storage and verification, there are not enough workers to perform the deterministic walks. On the other hand, when there are too many workers, each collector must store and process a larger collection of distinguished points each time. Furthermore, more time will be spent by the workers gathering and sending the increased quantity of distinguished points.
With this knowledge, we then tested the Toffoli with varying number of cores. Again, we let 1/4 of the points be distinguished and take the average of 100 independent trials. The results are shown in figure 5. Here, we clearly see the expected inverse dependence on the number of processors as predicted by (8). We do note that there is significant deviation from the expected trend when we reach 8192 cores. We suspect that for a problem of this size, the parallel overhead and communication costs outweigh the potential benefits of using this many cores.   Finally, we investigate how the runtime varies with the fraction of distinguished points, θ. In the case of the Toffoli, the amount of available memory using the above number of processors on the BG/Q is significantly greater than that required to store even the entire space. Variation of this parameter is thus somewhat contrived for such a (relatively) small problem. In this case, we would expect an inverse dependence on θ (see the appendix for more details). We ran 100 trials on 4096 processors (512 collectors and 1024 verifiers) using fractions of distinguished points {1/2, 1/4, 1/8, 1/16, 1/32 }. The results are displayed in figure 6, where we see the expected inverse dependence. We also report here our best synthesis times for the Toffoli gate, clocking in at roughly 26s on average. To fully explore the effects of this parameter (and more importantly the dependence on the available memory w), we would need to use a much larger circuit.

Benchmarking known circuits
Some of the largest circuits that were directly synthesizable by both the original MITM algorithm and optimal T-count algorithm were those with T-count 7 on 3 qubits [12,13]. There are a number of such circuits shown in figure 7. Using our knowledge from optimization of parameters in the previous section (4096 cores, 1/2 points distinguished, 512 collectors and 1024 verifiers), we obtain the synthesis times reported in table 1. We note that at roughly 25s, these times are a marked improvement over those reported in [19], which were greater than 4 minutes. This highlights the advantage of using many processors and is a promising sign that we will be able to synthesize circuits which are much larger in a reasonable amount of time.

Pushing the boundaries
The largest circuit synthesized to date using pQCS is the 4-qubit 1-bit full adder shown in figure 8. A synthesized version of this adder appeared in [12] with T-count 8, where it was accomplished using peephole optimization techniques. It was suspected that it has T-count 7 [22], which we confirm.
The first successful synthesis of the adder took 12.5 hours using 4096 cores (512 collectors, 1024 verifiers) and 1/2 points distinguished. We note that a circuit as large as the adder would likely benefit from a larger  number of processors, so more testing is in progress. A full version of the circuit is shown in figure 9. The initial output of pQCS is a sequence of Paulis and a unitary corresponding to a Clifford gate as per (10). The Pauli portion of the circuit ( ( ) ( ) R P R P 7 1  ) was generated using the algorithm given in the appendix of [13], and the Clifford component was generated using the algorithm in [23]. The resultant sequence of gates was then optimized for T-depth using T-par [24]. Interestingly, this new synthesis of the adder led to the observation that it requires identical resources as the Toffoli gate, i.e. T-count 7, T-depth 3, and to the question of whether this is a coincidence. In fact, it was subsequently pointed out to us that this adder is affine equivalent to the Toffoli (i.e. unitarily equivalent up to application of CNOTs) [25].

Concluding remarks
We presented a framework for quantum circuit synthesis based on deterministic walks, as well as an algorithm and software for parallel quantum circuit synthesis. We observed a clear advantage over existing techniques using a relatively modest number of processors and were able to directly synthesize a 4-qubit circuit, which would have been intractable using previous methods.
Ongoing and future work on pQCS includes improvements to the application structure and parallelization routines, extensions for synthesis in general over a specified gate set, and the implementation of approximate circuit synthesis. Furthermore, we seek to push the application to its limits in order to fully characterize the scaling, in particular with respect to the available memory once the circuit search spaces become sufficiently large. full with w distinguished points. The number of steps required to find a single collision in this case is q q where θ is the fraction of points which are distinguished. The first term comes from the fact that to fill the memory with w distinguished points, q w elements in the space will be traversed on average, and any given point in a new trail has a N 1 probability of landing on a previously seen point; the second term comes from the need to trace back through both trails to locate the collision, and each trail has length q 1 on average. The assumption is made that there is a single 'golden' collision. In this case, N 2 'bad' collisions will be found on average before the golden one is found. If we parallelize using m processors and assume each step in a trail takes time τ, then we obtain a runtime: The next step taken in [17] is to differentiate and find θ such that (A.1) is optimized, which is what results in the inverse-square-root dependence on w. They then performed computational experiments for a range of w and N to find optimal prefactors. However, the optimal θ is expressed in terms of w N , which when w N  (as is the case when we synthesize the Toffoli on the BG/Q) would not result in a fractional θ. So let us continue a hypothetical analysis of this form without finding the optimal θ. Consider the case where we are optimizing for T-count. In the most general case, the two halves of the MITM equation will be different sizes where α is a constant which reflects the complexity of the matrix multiplication algorithm. Then we have that