Depth optimization of CZ, CNOT, and Clifford circuits

We seek to develop better upper bound guarantees on the depth of quantum CZ gate, CNOT gate, and Clifford circuits than those reported previously. We focus on the number of qubits $n\,{\leq}\,$1,345,000 [1], which represents the most practical use case. Our upper bound on the depth of CZ circuits is $\lfloor n/2 + 0.4993{\cdot}\log^2(n) + 3.0191{\cdot}\log(n) - 10.9139\rfloor$, improving on the best known depth by a factor of roughly 2. We extend the constructions used to prove this upper bound to obtain a depth upper bound of $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 23.4269 \rfloor$ for CNOT gate circuits, offering an improvement by a factor of roughly $4/3$ over the state of the art, and a depth upper bound of $\lfloor 2n + 2.9487{\cdot}\log^2(n) + 8.4909{\cdot}\log(n) - 44.4798\rfloor$ for Clifford circuits, offering an improvement by a factor of roughly $5/3$.


Introduction
Clifford circuits play an important role in quantum computing. Most prominently, they lie at the core of quantum error correction [2], where they are responsible for both state encoding and state/gate distillation [3]. Once error corrected, fault-tolerant computations are often expressed as Clifford+T circuits, directly implying that large chunks of such computations are themselves Clifford circuits. Clifford circuits play a key role in randomized benchmarking of quantum gates [4,5], the study of entanglement [6], and shadow tomography [7] to name a few more areas of importance.
Clifford circuits can be defined as those quantum transformations computable by quantum circuits using the single-qubit Hadamard gate $H := \frac{1}{\sqrt{2}}\left(\begin{smallmatrix}1 & 1\\ 1 & -1\end{smallmatrix}\right)$, the single-qubit Phase gate $P$, and the two-qubit CNOT gate [8]; however, optimal synthesis of Clifford circuits spanning more than 6 qubits appears to be out of reach. Asymptotically optimal circuit constructions of arbitrary $n$-qubit Clifford computations are known: a Clifford operation can be implemented with $\Theta(n^2/\log(n))$ entangling gates [9,10] in depth $\Theta(n/\log(n))$ [1,10,11]. No better guarantees, such as asymptotic tightness (meaning asymptotic equality discarding the lower order additive terms), are known, however.
Due to the 11-stage layered decomposition [9] over the gate library {H, P, CNOT}, asymptotic analysis of the depth of Clifford circuits relies on the bounds for CNOT gate circuits, also known as linear reversible circuits. The CNOT circuit synthesis algorithm offering the asymptotically optimal upper bound comes with a high leading constant of 20; specifically, the depth complexity guarantee [1,10,11] is $20n/\log(n) + O(\sqrt{n}\log(n))$. Given that a depth-$2n$ implementation has been known since 2007 [12], it became clear that the asymptotically optimal implementation does not offer an advantage until $n$ becomes very large. The authors of [1] addressed this by introducing an algorithm offering a depth upper bound of $4n/3 + 8\log(n)$, which outperforms the asymptotically optimal algorithm [10,11] for n < 1,345,000 and outperforms the algorithm of Kutin et al. [12] when n > 75. One of the results that we report here is a synthesis algorithm that outperforms the combination of all of the above for 70 ≤ n ≤ 1,345,000 while offering the upper bound guarantee of $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 23.4269\rfloor$, roughly a 25% reduction over [1]; see Lemma 3.
In this work, we focus on the CZ, CNOT, and Clifford circuits spanning no more than 1,345,000 qubits. The number 1,345,000 itself originates from [1]. We believe this bound on the number of qubits $n$ covers all useful use cases for CZ, CNOT, and Clifford circuits. Indeed, to put this number in perspective, error-correcting codes often span dozens to hundreds of qubits (thousands and tens of thousands are possible, albeit regarded to be on the high side [13]), quantum simulations of condensed matter systems need to rely on only slightly more than 54 qubits before they become classically intractable [14], and known likely classically difficult simulations require as few as between 70 and 185 [15] or between 109 and 111 [16] qubits.
To factor a 1000+ bit integer number using Shor's algorithm, a task widely believed to be intractable classically, only (roughly) 2n to 3n qubits suffice [17,18], where n here denotes the bit length of the integer. This qubit count takes the additional space needed for high-quality circuit optimization into account. This points to the high likelihood that the number of qubits a Clifford circuit spans will remain well under 1,345,000.
Our goal is to minimize the depth of quantum circuits, which corresponds to time to solution and is perhaps the single most important metric from the consumer's point of view (especially once the fidelity is guaranteed). Furthermore, in quantum information processing technologies where the dominating source of errors is decoherence, such as superconducting circuits, small depth circuits naturally improve the fidelity of the computation compared to large depth circuits. We measure the depth of circuits by counting the contribution from the two-qubit gates and discarding that from the single-qubit gates. There are two basic reasons to make this choice. First, both leading quantum information processing technologies, superconducting circuits and trapped ions, offer single-qubit gates at a much higher clock speed and fidelity compared to the two-qubit gates [19,20]. Due to the available control and as motivated by Euler's angle decomposition, the number of single-qubit pulses applied between the entangling gates is never more than a small constant (e.g., 3). Thus, the depth by the two-qubit gates describes the depth of the real-life physical implementation rather closely. Second, the entangling gates we rely on, CNOT and CZ, are single-qubit equivalent to each other, each can be obtained with the minimal number of one entangling pulse in both superconducting circuits and trapped ions technologies, and neither of the two directly corresponds to the physical qubit-to-qubit interaction (such as ZX in superconducting circuits and XX in trapped ions [19,20]). Thus, both CNOT and CZ gates are available simultaneously, and their implementation costs are roughly equal, independently of the underlying technology used to implement the desired circuits.
Our work first focuses on the CZ circuits. CZ circuits are employed in the short layered decomposition of Clifford circuits [21], which allows the depth of Clifford circuits to be upper bounded more efficiently than would otherwise be possible with the reduction of -CZ- layers to -CNOT- and -P- layers. A CZ circuit can be implemented over CZ gates in depth n−1 for even n and depth n for odd n. This can be established directly, or by employing Vizing's theorem [22]. One may also show that the depth cannot be reduced further unless other gates are allowed. In our work, we employ CNOT gates and show how this helps to reduce the depth of CZ circuits roughly by a factor of 2 (Theorem 1). We utilize depth-efficient implementations of CZ circuits to construct depth-optimized CNOT and Clifford circuits.

Circuit depth guarantees

CZ circuits
We first focus on the depth-efficient no-ancilla implementation of the elements of the finite group generated by CZ gates over $n$ qubits. Recall the following well-known properties of CZ gates: $CZ(i, j) = CZ(j, i)$, $CZ(i, j)^2$ equals the identity, and all CZ gates commute. These properties directly imply that any CZ circuit can be represented by a zero-diagonal upper triangular binary matrix $M \in \mathbb{F}_2^{n \times n}$, where $m_{i,j} = 1$ for $i < j$ iff the gate $CZ(i, j)$ is applied (an odd number of times). The task of implementing a transformation described by the matrix $M$ can therefore be solved by applying a set of gates that zero out all of the entries of the matrix $M$.
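The matrix representation described above can be sketched in a few lines of Python. This is only an illustration: the function names and the 0-indexed `(i, j)` gate format are assumptions of this sketch, not notation from the text.

```python
def cz_circuit_to_matrix(n, gates):
    """Return the zero-diagonal upper triangular binary matrix of a CZ circuit.

    `gates` is a list of 0-indexed qubit pairs (i, j), one per CZ gate
    (a hypothetical input format chosen for this sketch).
    """
    M = [[0] * n for _ in range(n)]
    for i, j in gates:
        if i == j:
            raise ValueError("a CZ gate acts on two distinct qubits")
        a, b = min(i, j), max(i, j)  # CZ(i, j) = CZ(j, i)
        M[a][b] ^= 1                 # CZ(i, j)^2 is the identity
    return M


def matrix_to_cz_gates(M):
    """List one CZ gate per nonzero entry above the diagonal."""
    n = len(M)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if M[i][j]]


# Gates may be listed in any order because all CZ gates commute.
example_gates = [(0, 1), (2, 1), (1, 0), (0, 3), (1, 2), (1, 0)]
example_matrix = cz_circuit_to_matrix(4, example_gates)
```

Here `CZ(0, 1)` appears three times and survives, `CZ(1, 2)` appears twice and cancels, so `matrix_to_cz_gates(example_matrix)` lists only the pairs `(0, 1)` and `(0, 3)`.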
We first focus on developing a small-depth circuit implementing a CZ transformation $M_1$ that can be described by a "rectangular" $k \times m$ region (over non-overlapping sets of $k$ and $m$ qubits) with ones in the matrix $M$; the rest of the matrix $M$ elements are zeroes. A straightforward implementation of such a transformation can be accomplished in depth $\max\{k, m\}$ by a circuit with $km$ CZ gates. Our construction described below thus offers an exponential advantage over the naïve implementation. Formally,
Lemma 1. The transformation $M_1$ over non-overlapping sets $A$ and $B$ with $k$ and $m$ qubits, correspondingly, can be implemented as a circuit of depth $2{\cdot}\max\{\lceil\log(k)\rceil, \lceil\log(m)\rceil\}$.
Proof. First, recall that the action of $CZ(x, y)$ is accomplished by the mapping $|x, y\rangle \mapsto (-1)^{xy}|x, y\rangle$, i.e., it can be described as the addition of the phase $(-1)^{xy}$ to $|x, y\rangle$. Thus, the phase transformation performed by $M_1$ on the qubits $a_1, a_2, ..., a_k \in A$ and $b_1, b_2, ..., b_m \in B$ is described as $\prod_{i=1}^{k}\prod_{j=1}^{m}(-1)^{a_i b_j} = (-1)^{\sum_{i,j} a_i b_j} = (-1)^{(a_1 \oplus a_2 \oplus ... \oplus a_k)(b_1 \oplus b_2 \oplus ... \oplus b_m)}$. The latter term can be implemented by a single CZ gate acting on qubits carrying the values $a_1 \oplus a_2 \oplus ... \oplus a_k$ and $b_1 \oplus b_2 \oplus ... \oplus b_m$. Those linear combinations can be implemented in logarithmic depth (to both compute them and uncompute after applying CZ) by a CNOT gate circuit, leading to the overall depth of $2{\cdot}\max\{\lceil\log(k)\rceil, \lceil\log(m)\rceil\} + 1$. We next explain how to reduce the depth by 1, leading to the advertised complexity. To accomplish the reduction, we focus on the three central layers of the constructed circuit. Observe that the middle gate is always a single CZ, and the logarithmic-depth EXOR (exclusive OR, also known as modulo two addition) calculation of qubits in the sets $A$ and $B$ ends with a single CNOT gate. Because of the varying depths of the CNOT parts for sets $A$ and $B$, the three middle stages come in three flavors, each of which can correspondingly be rewritten in depth two. We illustrated the resulting circuit in Figure 1 for k=4 and m=5.
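As a sanity check of the basic construction in the proof above (compute the two EXORs, apply one CZ, uncompute), the following sketch builds the circuit and brute-forces its action on all computational basis states. It covers only the variant before the depth is reduced by 1, so for k=4 and m=5 it has 2·3 + 1 = 7 layers. All function names and the layer format are assumptions of this sketch.

```python
from itertools import product


def xor_tree_layers(qubits):
    """Layers of CNOT (control, target) pairs that EXOR all `qubits` into qubits[0]."""
    layers, active = [], list(qubits)
    while len(active) > 1:
        layers.append([(active[i + 1], active[i])
                       for i in range(0, len(active) - 1, 2)])
        active = active[::2]  # qubits that still hold partial sums
    return layers


def rectangle_circuit(k, m):
    """Basic circuit for the all-ones k x m rectangle, as a list of (kind, layer)."""
    A, B = list(range(k)), list(range(k, k + m))
    la, lb = xor_tree_layers(A), xor_tree_layers(B)
    depth = max(len(la), len(lb))
    # The trees on A and B act on disjoint qubits, so their layers are merged.
    compute = [(la[i] if i < len(la) else []) + (lb[i] if i < len(lb) else [])
               for i in range(depth)]
    circuit = [("CNOT", layer) for layer in compute]
    circuit.append(("CZ", [(A[0], B[0])]))
    circuit += [("CNOT", layer) for layer in reversed(compute)]  # uncompute
    return A, B, circuit


def run(circuit, bits):
    """Track a basis state: return the output bits and the accumulated phase parity."""
    bits, parity = list(bits), 0
    for kind, layer in circuit:
        for x, y in layer:
            if kind == "CNOT":
                bits[y] ^= bits[x]
            else:
                parity ^= bits[x] & bits[y]
    return bits, parity


def check_rectangle(k, m):
    """Compare against applying all k*m CZ gates directly."""
    A, B, circuit = rectangle_circuit(k, m)
    for bits in product((0, 1), repeat=k + m):
        out, parity = run(circuit, bits)
        direct = sum(bits[a] & bits[b] for a in A for b in B) % 2
        if out != list(bits) or parity != direct:
            return False
    return True
```

Calling `check_rectangle(4, 5)` enumerates all 512 basis states of the 9 qubits involved.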
We next focus on a more complex version of the rectangular region $M_1$, defined as the transformation $M_{01}$ computed by a subset of the gates $CZ(a, b)$ (rather than all of them, as in the case of $M_1$), where $a \in A$, $b \in B$, and $A \cap B = \emptyset$. We show that $M_{01}$ can be implemented in depth $\max\{\lfloor k/2\rfloor, \lfloor m/2\rfloor\} + 2{\cdot}\max\{\lceil\log(k)\rceil, \lceil\log(m)\rceil\}$.
Lemma 2. The transformation $M_{01}$ over non-overlapping sets $A$ and $B$ with $k$ and $m$ qubits each can be implemented as a depth $\max\{\lfloor k/2\rfloor, \lfloor m/2\rfloor\} + 2{\cdot}\max\{\lceil\log(k)\rceil, \lceil\log(m)\rceil\}$ circuit.
Figure 1: Circuit implementing $M_1$ for k=4 and m=5; the left side shows the basic circuit construction, and the right side includes the reduction of the depth by 1.
Proof. The transformation $M_{01}$ can be written as a $k \times m$ Boolean matrix whose entries record the presence (1) or the absence (0) of the gate $CZ(a_i, b_j)$. By a slight abuse of language, $M_{01}$ can be interpreted as a rectangle $A{\times}B$ with zeroes and ones. To implement $M_{01}$ as an efficient circuit, we apply a logarithmic depth circuit that reduces $M_{01}$ to a transformation $M'_{01}$ such that the weight (the number of ones) of rows in it is no more than $\lfloor m/2\rfloor$ and the weight of columns is no more than $\lfloor k/2\rfloor$. Since $M'_{01}$ can be interpreted as an adjacency matrix of a bipartite graph, the edge coloring problem can be solved using at most $\max\{\lfloor k/2\rfloor, \lfloor m/2\rfloor\}$ colors [22,23]. Edges of the same color correspond to individual CZ gates implementable in depth 1, and thus the number of colors describes the CZ gate circuit depth. We next show how to reduce $M_{01}$ to $M'_{01}$ by a logarithmic depth circuit. To this end, we first show how to select a set of rows and columns of the matrix $M_{01}$ that, when inverted, simultaneously reduce the row and column weights to no more than a half, and next express this row and column inversion transformation as a logarithmic depth circuit.
To select rows and columns, start with the empty set $S$. Cycle sequentially through all rows and columns in a loop. If inverting an entire given row/column reduces the number of ones in it, perform the inversion and add this row/column to the set $S$, or, if it is already there, remove it. Each row/column addition/removal operation reduces the number of ones in the matrix by at least one; thus this algorithm eventually runs out of rows/columns to invert and can be terminated after no more than $km(k+m)$ steps. When it terminates, $M_{01}$ has been transformed to $M'_{01}$ with row and column weights of no more than a half.
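The selection procedure just described can be sketched as follows. The function name, the list-of-lists matrix format, and the representation of the set S as two index sets (one for rows, one for columns) are assumptions of this sketch.

```python
def halve_weights(M):
    """Greedy row/column inversion: returns (M', rows, cols).

    M is a k x m list of lists over {0, 1}. A row or column is inverted
    whenever that strictly reduces its number of ones, i.e. whenever more
    than half of its entries are ones. `rows` and `cols` together play the
    role of the set S: an index is added on the first inversion and removed
    if the same row/column is inverted again.
    """
    k, m = len(M), len(M[0])
    M = [row[:] for row in M]  # work on a copy
    rows, cols = set(), set()
    changed = True
    while changed:  # each inversion removes at least one 1, so this terminates
        changed = False
        for i in range(k):
            if 2 * sum(M[i]) > m:
                M[i] = [1 - x for x in M[i]]
                rows ^= {i}
                changed = True
        for j in range(m):
            if 2 * sum(M[i][j] for i in range(k)) > k:
                for i in range(k):
                    M[i][j] ^= 1
                cols ^= {j}
                changed = True
    return M, rows, cols


example = [[1, 1, 1, 0],
           [1, 0, 1, 1],
           [0, 1, 1, 1]]
reduced, flipped_rows, flipped_cols = halve_weights(example)
```

On termination every row of `reduced` has at most ⌊m/2⌋ ones and every column at most ⌊k/2⌋ ones, and each entry equals the original entry flipped once for every selected row or column passing through it.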
Denote the sets of rows and columns identified in the previous paragraph as $A'$ and $B'$, correspondingly. To implement this set of row and column flips, observe that rather than implementing the rectangles $A{\times}B'$ (implements all columns) and $A'{\times}B$ (implements all rows) sequentially, one could instead implement the rectangles $(A\setminus A'){\times}B'$ and $A'{\times}(B\setminus B')$ in parallel, since the qubit sets $A'$, $A\setminus A'$, $B'$, and $B\setminus B'$ do not overlap. According to Lemma 1, this can be done in depth at most $2{\cdot}\max\{\lceil\log(k)\rceil, \lceil\log(m)\rceil\}$. Adding the cost of the transformation $M_{01} \mapsto M'_{01}$ to the cost of the implementation of $M'_{01}$ reveals the desired depth figure.
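The reason the two sequential rectangles can be replaced by two disjoint ones can be spelled out as a short calculation. The notation is an assumption of this note: $A' \subseteq A$ and $B' \subseteq B$ denote the selected rows and columns, each rectangle stands for its 0/1 indicator matrix, and $\oplus$ is entrywise addition modulo 2. The overlap $A' \times B'$ is flipped twice and therefore cancels.

```latex
\begin{align*}
(A' \times B) \oplus (A \times B')
 &= \bigl(A' \times (B \setminus B')\bigr) \oplus (A' \times B')
    \oplus \bigl((A \setminus A') \times B'\bigr) \oplus (A' \times B') \\
 &= \bigl(A' \times (B \setminus B')\bigr) \oplus \bigl((A \setminus A') \times B'\bigr).
\end{align*}
```

The two remaining rectangles act on the qubit sets $A' \cup (B \setminus B')$ and $(A \setminus A') \cup B'$, which are disjoint, so the corresponding circuits can run side by side.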
We now have enough tools to prove the main result of this section, Theorem 1, which upper bounds the depth of $n$-qubit CZ circuits by $\lfloor n/2 + 0.4993{\cdot}\log^2(n) + 3.0191{\cdot}\log(n) - 10.9139\rfloor$.
Proof. Let $d(n)$ denote the depth of CZ circuits over $n$ qubits. We start the proof by recalling that an $n$-qubit CZ circuit can be implemented in depth $n - \left(\lceil\frac{n+1}{2}\rceil - \lfloor\frac{n+1}{2}\rfloor\right)$, i.e., n−1 for even n and n for odd n, by the reduction to the graph coloring problem [22], and thus a simple upper bound holds,
$d(n) \leq n - \left(\lceil\frac{n+1}{2}\rceil - \lfloor\frac{n+1}{2}\rfloor\right)$. (1)
For odd n, the maximal number of colors, as given by Vizing's theorem [22], is needed. For even n, a widely known simple geometric construction shows that n−1 colors suffice: take n−1 points on the plane as vertices of a regular polygon, with the last n-th point at its center. Each of the n−1 colors applies to the segment joining the point at the center with a selected vertex of the polygon, and to all segments joining polygon vertices that are perpendicular to it. One may convince themselves that all possible segments joining any two of the n points considered are properly colored, and thus n−1 colors suffice. Furthermore, if one is limited to using the CZ gates to implement CZ circuits, the bound in Eq. (1) is tight. This follows from the counting argument, noting that the largest CZ circuit contains $\frac{(n-1)n}{2}$ CZ gates while a single layer holds at most $\lfloor n/2\rfloor$ of them. Thus, to implement CZ circuits in shorter depth, one must rely on other gates, which is what we do.
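The geometric construction for even n is the classical round-robin ("circle method") edge coloring of the complete graph. A minimal sketch follows, assuming the polygon vertices are labeled 0..n−2 and the center is labeled n−1; the labeling and function name are choices made for this illustration.

```python
def cz_layers_even(n):
    """Schedule all n(n-1)/2 CZ gates on n qubits (n even) into n-1 layers.

    Polygon vertices are 0..n-2 and the center is n-1 (labels assumed here).
    Layer r holds the segment from the center to vertex r together with the
    chords {r+s, r-s} (indices modulo n-1), which are perpendicular to it.
    """
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    layers = []
    for r in range(n - 1):
        layer = [(r, n - 1)]
        for s in range(1, n // 2):
            a, b = (r + s) % (n - 1), (r - s) % (n - 1)
            layer.append((min(a, b), max(a, b)))
        layers.append(layer)
    return layers


layers_8 = cz_layers_even(8)
```

For n = 8 this gives 7 layers of 4 gates each; every layer touches each qubit exactly once, and together the layers contain each of the 28 qubit pairs exactly once.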
We next introduce a recursive construction that is responsible for reducing the above depth figure to almost $n/2$ and analyze it carefully using two methods. In our recursion, at each step the set of qubits is broken into two non-overlapping sets, $A$ with the first $\lceil n/2\rceil$ qubits and $B$ with the last $\lfloor n/2\rfloor$ qubits. The CZ gates straddling $A$ and $B$ form an operation $M_{01}$ that is implemented by Lemma 2, while the CZ gates within $A$ and within $B$ are implemented recursively and in parallel. Combining the above with Eq. (1) yields the following recursion,
$d(n) \leq \min\left\{ n - \left(\lceil\frac{n+1}{2}\rceil - \lfloor\frac{n+1}{2}\rfloor\right),\; d(\lceil n/2\rceil) + \lfloor\lceil n/2\rceil/2\rfloor + 2\lceil\log\lceil n/2\rceil\rceil \right\}$. (2)
The solution to Eq. (2) can be upper bounded by the expression $n/2 + 0.9937{\cdot}\log^2(n) + 1.1882{\cdot}\log(n) - 14.6772$ (for n ∈ [43..1,345,000]). However, the constant in front of $\log^2(n)$ can be improved through a more careful analysis of the recursive decomposition based on Lemma 2. We accomplish this by considering two steps of the recursive decomposition at once. Each recursive step implements the transformation $T: M_{01} \mapsto M'_{01}$, which we further refer to as the T-transformation, and the leftover operation $M'_{01}$. The circuit obtained by two steps of the decomposition can be thought of as a combination of the implementations of two layers of T-transformations (one of which applies two T-transformations to non-overlapping sets) performing the mappings over recursively defined $M_{01}$/$M'_{01}$ and two layers of the implementations of $M'_{01}$ (one of which applies to two non-overlapping qubit sets) via bipartite graph coloring. Recall that all four stages implement certain CZ gate transformations and thus they all commute. We will employ the commutation property to prove a better bound on the depth of the CZ circuit. Specifically, we group the implementations of $M'_{01}$ and all T-transformations into two subcircuits and analyze their depths separately. The depth of the implementations of two recursively defined layers of $M'_{01}$ is described by the formula $\lfloor\lceil n/2\rceil/2\rfloor + \lfloor\lceil\lceil n/2\rceil/2\rceil/2\rfloor$.
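A one-step recursion of this kind is easy to evaluate numerically. The sketch below assumes that the recursion combines the direct bound (n−1 for even n, n for odd n) with one application of Lemma 2 to the halves of sizes ⌈n/2⌉ and ⌊n/2⌋, that log is base 2, and that the roundings are as written in the comments; these are assumptions of this sketch, as are the function names.

```python
from functools import lru_cache
from math import floor, log2


def direct_depth(n):
    """Depth using CZ gates only: n-1 for even n, n for odd n (0 below 2 qubits)."""
    if n < 2:
        return 0
    return n - 1 if n % 2 == 0 else n


@lru_cache(maxsize=None)
def d(n):
    """Assumed recursion: min(direct bound, d(ceil(n/2)) + Lemma 2 cost)."""
    if n < 3:
        return direct_depth(n)
    half = (n + 1) // 2                  # ceil(n/2), the size of set A
    ceil_log = (half - 1).bit_length()   # ceil(log2(half)) for half >= 1
    recursive = d(half) + half // 2 + 2 * ceil_log
    return min(direct_depth(n), recursive)


def closed_form_bound(n):
    """Floor of n/2 + 0.9937 log^2(n) + 1.1882 log(n) - 14.6772."""
    return floor(n / 2 + 0.9937 * log2(n) ** 2 + 1.1882 * log2(n) - 14.6772)
```

Under these assumptions the recursive branch first beats the direct bound at n = 43, where `d(43)` is 42, matching the start of the range quoted for the closed-form bound.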
Based on the above analysis, the final form the recursion takes, Eq. (3), further improves Eq. (2). As illustrated in Figure 2(b), the difference between the exact solution and the upper bound given is visually undetectable, and our result can be seen to improve on the best previously known one roughly by a factor of two (therefore agreeing with the asymptotics).

CNOT circuits
Here we extend the construction of depth-efficient CZ circuits to obtain depth-efficient implementations of linear reversible circuits. A linear reversible function can be implemented exactly or up to the SWAPping of output qubits, also known as qubit reordering. An implementation up to qubit reordering may be preferred since the proper qubit SWAPping may be obtained classically, which allows this task to be outsourced to a classical computer and thus minimizes the expensive quantum resources used. The following Lemma reports an optimized depth figure for linear reversible functions and highlights that a depth reduction by 6 is possible if it suffices to implement the desired linear function up to qubit reordering.
Lemma 3. For n ∈ [70..1,345,000] an $n$-qubit linear reversible transformation $R$ can be implemented in depth no more than $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 29.4269\rfloor$ up to qubit reordering and depth $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 23.4269\rfloor$ exactly as a circuit over {CNOT, CZ, H} gates.
Proof. We start with the LU decomposition $R = LU$, where $L$ is a lower-triangular and $U$ is an upper-triangular invertible Boolean matrix. Recall that the LU decomposition exists subject to proper row and/or column ordering. Such row/column reordering can be implemented as a SWAPping circuit with the SWAP depth of no more than 2, translating to the two-qubit gate depth (by those gates considered in this work as contributing to depth) of 6. Thus, the difference between the depths of implementations up to qubit reordering and the exact one is a constant equal to 6. In the following, we show that each of the $L$ and $U$ stages can be implemented in depth $n/2 + 0.9748{\cdot}\log^2(n) + 1.7538{\cdot}\log(n) - 14.7134$, and thus the total depth of CNOT circuits is upper bounded by the expression $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 23.4269\rfloor$.
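The remark that the LU decomposition exists subject to proper row ordering can be illustrated with textbook Gaussian elimination with row pivoting over the two-element field. This is a generic sketch, not the procedure of the proof; the function names and the convention that row i of the reordered matrix is row `perm[i]` of R are assumptions made here.

```python
def plu_gf2(R):
    """Return (perm, L, U) with row perm[i] of R equal to row i of L*U over GF(2).

    L is unit lower triangular and U is upper triangular. Raises ValueError
    when R is not invertible over GF(2).
    """
    n = len(R)
    U = [row[:] for row in R]
    L = [[int(i == j) for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for c in range(n):
        pivot = next((r for r in range(c, n) if U[r][c]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        if pivot != c:
            U[c], U[pivot] = U[pivot], U[c]
            perm[c], perm[pivot] = perm[pivot], perm[c]
            for j in range(c):  # swap the already computed multipliers
                L[c][j], L[pivot][j] = L[pivot][j], L[c][j]
        for r in range(c + 1, n):
            if U[r][c]:
                L[r][c] = 1
                U[r] = [x ^ y for x, y in zip(U[r], U[c])]  # row addition mod 2
    return perm, L, U


def matmul_gf2(X, Y):
    """Matrix product over GF(2)."""
    return [[sum(x & y for x, y in zip(row, col)) % 2 for col in zip(*Y)]
            for row in X]


R_example = [[0, 1, 1],
             [1, 1, 0],
             [1, 0, 0]]
perm, L_factor, U_factor = plu_gf2(R_example)
```

In this example the first pivot is zero, so a row swap is required before the elimination can proceed, which is the situation the reordering step of the proof accounts for.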
Without loss of generality, focus on $U$. Divide the set of qubits into two, set $A$ with the first $n/2$ qubits and set $B$ with the last $n/2$ qubits (this assumes that the qubits are already ordered so as to accept the LU decomposition). The operation $R$ can be written as a block matrix product, where $R_A$ is the $n/2 \times n/2$ upper triangular matrix obtained by restricting $R$ to the set of qubits $A$, $R_B$ is defined similarly, $W_{01}$ is the $n/2 \times n/2$ top right block of $R$, and $I$ and $0$ are the identity and zero matrices of proper dimensions. Assuming $d(n)$ denotes the depth of the implementation of an $n$-qubit upper triangular matrix, the first two terms in the decomposition [...]
We employ the solution of the recursion Eq. (6) within the LU decomposition to obtain the desired upper bound, $\lfloor n + 1.9496{\cdot}\log^2(n) + 3.5075{\cdot}\log(n) - 23.4269\rfloor$. We start the range with n = 70, because it marks the smallest n for which our solution based on the recursion Eq. (6) beats the best known upper bound of $\min\{2n, 4n/3 + 8\log(n)\}$ [1,12].
Note that the circuit constructed in Lemma 3 relies on the gates from the library {CNOT, CZ, H}. It is convenient to use this gate library for didactic reasons; however, the circuit constructed in Lemma 3 can be rewritten using the same number of entangling gates and the same depth, but relying on the CNOT gates only.
Proposition 1. The circuit constructed in Lemma 3 can be implemented in the same depth and with the same entangling gate count as the original, but using only the CNOT gates.
Proof. Given the division of the set of all qubits into two non-overlapping sets $A$ and $B$, a two-qubit gate is called internal to a given set if both qubits it operates on belong to this set, and straddling if it operates over two qubits belonging to different sets. Clearly, all entangling gates in such a circuit are either internal to one of the sets or straddling.
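Choose the sets $A$ and $B$ from the proof of Lemma 3. Observe that we apply Hadamard gates to all qubits in the set $A$ in two layers. Between those two Hadamard gate layers, all internal gates are CNOT gates and all straddling gates are CZ gates. This means that we can push the left layer of Hadamard gates to the right layer to cancel both, while flipping controls and targets of some CNOT gates and turning CZ gates into CNOT gates using the following rules, where $\mathrm{CNOT}(c, t)$ has control $c$ and target $t$: $(H \otimes H)\,\mathrm{CNOT}(x, y)\,(H \otimes H) = \mathrm{CNOT}(y, x)$ for Hadamard gates applied to both qubits $x$ and $y$, and $(H \otimes I)\,\mathrm{CZ}(x, y)\,(H \otimes I) = \mathrm{CNOT}(y, x)$ for the Hadamard gate applied to the qubit $x$ only.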
Observe that this operation, when applied recursively to the matching pairs of layers of Hadamards, eliminates all Hadamard gates and turns all CZ gates into CNOTs. Thus, the transformed circuit has only the CNOT gates.
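The two standard Hadamard conjugation identities that this kind of rewriting relies on (Hadamards on both qubits swap the control and target of a CNOT; a Hadamard on one qubit turns a CZ into a CNOT targeting that qubit) can be checked with plain integer matrices. To avoid floating point, the sketch uses $\sqrt{2}H$ instead of $H$ and compensates with an integer scale factor; the basis ordering $|q_0 q_1\rangle \to 2q_0 + q_1$ and all names are assumptions of this sketch.

```python
def kron(X, Y):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in rx for y in ry] for rx in X for ry in Y]


def matmul(X, Y):
    """Plain matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def scale(c, X):
    return [[c * x for x in row] for row in X]


H2 = [[1, 1], [1, -1]]   # sqrt(2) * H, kept integer on purpose
I2 = [[1, 0], [0, 1]]

# Two-qubit gates in the basis |q0 q1>, index 2*q0 + q1.
CNOT_01 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control q0, target q1
CNOT_10 = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control q1, target q0
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

HH = kron(H2, H2)   # equals 2 * (H tensor H)
HI = kron(H2, I2)   # equals sqrt(2) * (H tensor I)

# (H tensor H) CNOT(q0, q1) (H tensor H) = CNOT(q1, q0); scale factor 2 * 2 = 4.
flipped = matmul(matmul(HH, CNOT_01), HH)

# (H tensor I) CZ (H tensor I) = CNOT(q1, q0); scale factor sqrt(2) * sqrt(2) = 2.
converted = matmul(matmul(HI, CZ), HI)
```

Both products come out as integer multiples of `CNOT_10`, so `flipped == scale(4, CNOT_10)` and `converted == scale(2, CNOT_10)` hold.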
We illustrate the constructions in Lemma 3 and Proposition 1 with the following Example.
A naïve algorithm focusing on depth optimization may implement this linear transformation in depth 7 by noticing that all off-diagonal ones with matrix indices over non-overlapping sets of qubits can be turned into zeroes by applying the CNOT gates with controls in the second half of the set of qubits and targets in the first half. A tight schedule exists that squeezes all 49 such CNOT gates into depth 49/7 = 7.
A better circuit of depth 6 can be obtained by applying Lemma 3. First, observe that the matrix $L$ is already upper triangular and thus the LU decomposition need not be computed. The set $A = \{a_1, a_2, a_3, a_4, a_5, a_6, a_7\}$ contains the first 7 qubits, to which the Hadamard gates are applied, and the set $B = \{b_1, b_2, b_3, b_4, b_5, b_6, b_7\}$ contains the remaining 7 qubits. The 7×7 matrix $R_A$ found in the first quadrant of $L$ gives rise to the 7×7 all-1 CZ matrix, and thus the circuit for it can be obtained from Lemma 1. Recall that this circuit EXORs qubits in the sets $A$ and $B$, applies CZ gates, and uncomputes the EXORs. All five stages (opening Hadamards, finding EXOR, applying CZ, uncomputing EXOR, closing Hadamards) are clearly visible in the resulting circuit illustrated in Figure 4 on the left side. The circuit on the right side of Figure 4 is obtained from the one on the left side by applying Proposition 1.
We conclude this subsection with the comparison of the depth of CNOT circuits developed in our work to the best previously known in Figure 3. Similarly to the analogous comparison for CZ circuits, small values of n reveal a small difference between the exact solution and the upper bound (see Lemma 3) that is undetectable by eye over the full range (see Figure 3(b)). For values of n in the target range, our result improves on the best previously known by a factor of almost 4/3, as expected from the asymptotics.

Clifford circuits
Recall that a Clifford circuit admits the layered decomposition -X-Z-P-CX-CZ-H-CZ-H-P- [21]. Adding the depths of the implementations of CZ circuits by Theorem 1 (two layers) and CNOT circuits by Lemma 3 (single layer), we obtain the following result: Clifford circuits can be implemented in depth no more than $\lfloor 2n + 2.9487{\cdot}\log^2(n) + 8.4909{\cdot}\log(n) - 44.4798\rfloor$. Note that one of the two -CZ- layers neighbors the -CX- layer, thus allowing to merge the CNOT gates used in the largest T-transformation with the -CX- stage; accounting for this results in the reduction of the depth by either $\log(n/2) - 1$ or $\log(n/2/2)$, with the appropriate integer rounding, depending on the first stage called by the recursion Eq. (3). We illustrate the comparison of the best known depth of Clifford circuits to that offered by our construction, based on the reduced depth of CZ and CNOT circuits (Theorem 1 and Lemma 3, correspondingly), in Figure 5.
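For a quick feel of the three closed-form depth bounds quoted in the abstract, they can be evaluated directly. The sketch assumes log is base 2 and that the formulas are applied only within the ranges of n for which they are stated; the function names are chosen here for illustration.

```python
from math import floor, log2


def cz_depth_bound(n):
    """Floor of n/2 + 0.4993 log^2(n) + 3.0191 log(n) - 10.9139."""
    return floor(n / 2 + 0.4993 * log2(n) ** 2 + 3.0191 * log2(n) - 10.9139)


def cnot_depth_bound(n):
    """Floor of n + 1.9496 log^2(n) + 3.5075 log(n) - 23.4269."""
    return floor(n + 1.9496 * log2(n) ** 2 + 3.5075 * log2(n) - 23.4269)


def clifford_depth_bound(n):
    """Floor of 2n + 2.9487 log^2(n) + 8.4909 log(n) - 44.4798."""
    return floor(2 * n + 2.9487 * log2(n) ** 2 + 8.4909 * log2(n) - 44.4798)


n = 1024  # log2(n) is exactly 10, which keeps the arithmetic easy to follow
bounds_at_1024 = (cz_depth_bound(n), cnot_depth_bound(n), clifford_depth_bound(n))
```

At n = 1024 the three values are 581, 1230, and 2383; for comparison, using CZ gates only requires depth n − 1 = 1023 for the CZ circuit.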