Composability of global phase invariant distance and its application to approximation error management

Many quantum algorithms can be written as a composition of unitaries, some of which can be exactly synthesized by a universal fault-tolerant gate set, while others can only be approximately synthesized. A quantum compiler synthesizes each approximately synthesizable unitary up to some approximation error, such that the error of the overall unitary remains bounded by a given amount. In this paper we consider the case when the errors are measured in the global phase invariant distance. Apart from deriving a relation between this distance and the Frobenius norm, we show that this distance composes. If a unitary is written as a composition (product and tensor product) of other unitaries, we derive bounds on the error of the overall unitary as a function of the errors of the composed unitaries. Our bound is better than the sum-of-error bound (Bernstein, Vazirani, 1997), derived for the operator norm. This indicates that synthesizing a circuit using the global phase invariant distance may require fewer resources. Next we consider the following problem. Suppose we are given a decomposition of a unitary. The task is to distribute the errors among the components such that the T-count is optimized. Specifically, we consider those decompositions where $R_z(\theta)$ gates are the only approximately synthesizable components. We prove analytically that for both the operator norm and the global phase invariant distance, the error should be distributed equally among these components (given some approximations). The optimal number of T-gates obtained by using the global phase invariant distance is smaller. Furthermore, we show that in the case of the approximate Quantum Fourier Transform, the error incurred by pruning rotation gates is smaller when measured in this distance.


Introduction
It was envisioned [1,2] that quantum computers can solve certain problems much more efficiently than their classical counterparts. This notion of quantum supremacy [3] became a possibility with the design of quantum algorithms for challenging problems like factorization [4,5], unstructured search [6] and others, with applications in areas such as cryptography [7], machine learning [8], material science [9] and quantum chemistry [10,11]. At an abstract level, an algorithm usually consists of a number of unitaries, which are composed via tensor product or multiplication. These unitaries are mapped onto some implementation model, in most cases the circuit model. This step is a crucial part of the compilation process, where the unitary is decomposed into a number of gates (each one a unitary) belonging to a universal set that can be implemented by the underlying technology of the hardware. The improvement claimed by the above-mentioned quantum algorithms usually does not take into account the resources, such as gates, ancillae and special states, required to implement them on a particular hardware platform. The cost of implementing these resources can make a significant difference to the practical advantage of these quantum algorithms over their classical counterparts. Thus it is necessary to estimate these resources to assess the practical advantage of quantum algorithms, to determine trade-offs (for example, the problem sizes or parameters at which quantum algorithms become more efficient), to determine the appropriate applications and to have a mutually reinforcing design of hardware and software.
The Clifford+T set is a widely studied and developed universal fault-tolerant gate set, which we consider in this paper. In this set the non-Clifford T gate is the most expensive to implement fault-tolerantly, as it requires large ancilla factories and additional operations like gate teleportation and state distillation [12,13,14], which are less accurate procedures and require additional space and time compared to a single physical gate [15,16]. Moreover, not all unitaries have an exact implementation with the gates of this set. The Solovay-Kitaev algorithm [17,18] guarantees that given an $n$-qubit unitary $U$, we can generate a circuit with a universal gate set like Clifford+T, such that the unitary $U'$ implemented by the circuit is at most a certain distance from $U$. A unitary $U$ is exactly implementable by the Clifford+T gates if there exists a circuit with these gates that implements $U$. Otherwise it is approximately implementable, i.e. $d(U, U') \le \epsilon$ for some $\epsilon > 0$. For a distance $d$, the value $d(U, U')$ is also called the approximation error or error. In most synthesis and re-synthesis algorithms [19,20,21,22,23,24,25] the goal is to implement a circuit for a unitary that uses the minimum amount of a certain resource, like T-count or T-depth, or that at least reduces these resources compared to the best-known results.
Among the many kinds of approximation errors that may occur while implementing a unitary, we focus on the synthesis errors (see [26] for a nice exposition). These accumulate due to the inability to implement some unitaries exactly by a discrete fault-tolerant gate set like Clifford+T. It has been proved [27] that any universal fault-tolerant gate set must be discrete. In this paper our preferred distance measure (for calculating errors) is not the trace distance or operator norm used in [17,18,28,29,24], but the global phase invariant distance used in [23,30,25] (qubit based computing) and [31,32] (topological quantum computing). This is because the trace distance or operator norm does not ignore global phase, and hence leads to unnecessarily long approximating sequences that achieve a specific global phase. In most practical applications, the global phase is not significant. In [23] the authors give an empirical formula relating the T-count of arbitrary single qubit z-rotations to the approximation error $\epsilon$, where the latter is measured in the global phase invariant distance. The single-qubit z-rotation gate is defined as follows.
$$R_z(\theta) = e^{-i\theta Z/2} = \begin{pmatrix} e^{-i\theta/2} & 0 \\ 0 & e^{i\theta/2} \end{pmatrix}$$
In [29,24] the authors give bounds on the T-count of $R_z(\theta)$ as a function of $\epsilon$, measured in the operator norm. This bound is worse than the bound obtained in [23].
Most quantum algorithms like the Quantum Fourier Transform (QFT) and phase estimation, which are also fundamental building blocks of other algorithms like factorization [27], can be written in a modular form. This implies that the overall unitary $V$ can be written as a composition, i.e. multiplication and tensor product, of other unitaries, $V = \prod_{i=m}^{1} \bigotimes_{j=1}^{m_i} V_{ij}$. We call the $V_{ij}$ component unitaries in one decomposition of $V$. There can be more than one way to decompose $V$. To reduce a particular resource of the overall unitary we would want to minimize the amount of this resource in each $V_{ij}$. This can be determined by existing algorithms. For example, if the task is to optimize T-count or T-depth then we can use the T-count and T-depth optimal synthesis algorithms of [19,20] for exactly implementable unitaries and [25] for approximately implementable unitaries, but their complexity scales exponentially with the number of qubits. If we want a more efficient way to obtain the circuit then we can sacrifice optimality and use re-synthesis algorithms [21,22]. We can also synthesize a circuit using algorithms like [17,18] and then apply a re-synthesis algorithm to reduce the T-count or T-depth.
The amount of resources like gate count, depth, T-count and T-depth usually grows as the approximation error shrinks [18,23,24]. Given a decomposition, a compiler has to distribute the errors among the approximately synthesizable component unitaries $V_{ij}$ such that the overall error remains bounded by some quantity. For this we need a composition rule for the distance metric in which these errors are measured. In previous works [34,26] the authors worked with the operator norm and used the Bernstein-Vazirani bound [35] for composing the errors. They used a simulated annealing algorithm to develop an automatic error management framework that distributes the errors such that the total amount of a particular resource (T-count) is reduced. To the best of our knowledge, before our work, no such composition rule existed for the global phase invariant distance.

Our contributions
We derive a relation between the Frobenius (and hence operator) norm and the global phase invariant distance, such that an upper bound on the former implies an upper bound on the latter (Lemma 2.1 in Section 2.4). This can be useful in situations where there exist algorithms in one norm and we want to estimate some quantity in the other norm. For example, at present there exist T-count-optimal and T-depth-optimal synthesis algorithms for multi-qubit unitaries (exactly or approximately implementable) but these use the global phase invariant distance [25]. We can use these relations to get a bound on the T-count or T-depth of any multi-qubit unitary in the operator norm. This is also a humble step towards understanding how the errors measured in different distances are related and towards comparing the different algorithms. For example, we have T-count-optimal synthesis algorithms for the $R_z(\theta)$ gate both in the operator norm [24] and in the global phase invariant distance [23], and their complexities are given as functions of the error in the respective distances. The T-count they report for the same unitary is a function of the error and it differs between the two, being higher for the operator norm. So there should be some consensus on how to compare these algorithms, for example, whether we should scale these errors and then compare the T-count or complexity, or whether from a physical point of view there is no need for this. In order to do this, we need some relationship between the errors measured in various distances.
Apart from this, our contributions in this paper can be broadly divided into two parts.

Composition of global phase invariant distance:
First we show how the global phase invariant distance composes under multiplication and tensor product of unitaries.
We derive bounds on $D_P(U, V)$ and show that they are better than the sum-of-error bound $\sum_{i,j} \epsilon_{ij}$. Since in most cases the amount of resources grows as the error shrinks, this indicates that working in the global phase invariant distance may reduce the resource cost. Apart from that, this result may be of independent interest, since a number of synthesis algorithms in qubit-based computing [23] and topological quantum computing [31,32] work with the global phase invariant distance, but no composition rule was available for this distance.

Distribution of error while decomposing:
Next we deal with the question of how to distribute errors among the component unitaries such that we optimize the number of T-gates, while keeping the overall error bounded by some given quantity. In many popular quantum algorithms like QFT [4] and phase estimation [36] the unitary can be decomposed such that $R_z(\theta)$ gates are the only approximately synthesizable component unitaries. From [23] we know how the error relates to the T-count of $R_z(\theta)$.
In this case, we show analytically how much we can gain by working with the global phase invariant distance, given the same error bound. To do so we work with some approximations derived in Section 3.2. This is in contrast to the approach taken in [34,26], where the same problem is considered but the authors use simulated annealing to reduce the number of T-gates required, while bounding the errors by the Bernstein-Vazirani bound [35]. The bounds derived by us in Section 3 can also be used to develop a simulated annealing framework. These bounds can be used, and the methods (our analytical ones as well as the automated ones in [26]) can be adapted, to reduce other resources in qubit based computing, topological quantum computing or other models where this distance is used.
Apart from better bounds, there is another factor leading to fewer resources in the case of the approximate Quantum Fourier Transform (QFT). We show that the algorithmic error $\epsilon_{QFT}$ accumulated due to the truncation of rotation gates [37] is smaller if the error is measured in the global phase invariant distance. This results in shorter circuits. QFT is an important sub-routine in other important quantum algorithms like the adder, phase estimation, factoring, order finding and the hidden subgroup problem [27]. So a reduction in the resource count of QFT implies a resource reduction for all these algorithms.

Related work
A considerable amount of work has been done to develop programming languages and toolchains for resource estimation, such as Q# [38], Quipper [39], Scaffold/ScaffCC [40], Qiskit [41], staq [42], ProjectQ [43] and QuRE [44]. In the theoretical framework of [45] the authors characterize the (Q, λ)-diamond norm distance between an ideal quantum program and one which is executed on a noisy quantum hardware. This involves solving a semidefinite program [46] whose complexity scales exponentially with the number of qubits, making it computationally intractable for large systems. Resource estimations have also been done for specific problems such as in [47,48], but usually these involve a significant amount of manual work utilizing domain knowledge.
The first designs of an automatic framework for managing approximation errors were due to Häner, Roetteler and Svore [34] and Meuli, Soeken, Roetteler and Häner [26]. The latter improves on the former by using a fast symbolic method and incorporating two more kinds of approximation errors. Both of them use a simulated annealing program to solve an optimization problem and they work with the operator norm, one main reason being the availability of composition rules for the operator norm [35].

Organization
We give some preliminary definitions and results in Section 2. The composition rules for the global phase invariant distance are derived in Section 3. The optimization programs are given in Section 4. Bounds on the number of T-gates for QFT and other algorithms are shown in Section 5. Finally, we conclude in Section 6.

Preliminaries
We denote the set of $N \times N$ $n$-qubit ($N = 2^n$) unitaries by $\mathcal{U}_n$. The $(i, j)^{th}$ entry of any matrix $M$ is denoted by $M_{ij}$ or $M[i, j]$. We denote the $n \times n$ identity matrix by $I_n$, or $I$ if the dimension is clear from the context. $[n] = \{0, 1, \ldots, n-1\}$.

Clifford+T gate set
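The single qubit Pauli matrices are as follows:
$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \qquad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$
The $n$-qubit Pauli operators are:
$$\mathcal{P}_n = \{Q_1 \otimes Q_2 \otimes \cdots \otimes Q_n : Q_i \in \{I, X, Y, Z\}\}$$
By convention, $\sigma_0 = I_2$.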
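The single-qubit Clifford group $\mathcal{C}_1$ is generated by the Hadamard and phase gates, i.e. $\mathcal{C}_1 = \langle H, S \rangle$, where
$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad S = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}.$$
When $n > 1$ the $n$-qubit Clifford group $\mathcal{C}_n$ is generated by these two gates (acting on any of the $n$ qubits) along with the two-qubit $\text{CNOT} = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes X$ gate (acting on any pair of qubits). The Clifford+T group $\mathcal{J}_n$ is generated by the $n$-qubit Clifford group along with the T gate.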
A unitary $U$ is exactly implementable if there exists a Clifford+T circuit that implements it (up to some global phase), else it is approximately implementable. Specifically, we say $V$ is $\epsilon$-approximately implementable if there exists an exactly implementable unitary $U$ such that $d(U, V) \le \epsilon$. We denote the set of exactly implementable unitaries by $\mathcal{J}_n$. In this paper we use the following distance measure.
Definition 2.1 (Global phase invariant distance). Given two unitaries $U, V \in \mathcal{U}_n$, we define the global phase invariant distance between them as follows.
$$D_P(U, V) = \sqrt{1 - \frac{\left|\text{Tr}\left(U^{\dagger} V\right)\right|}{N}}$$
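As a minimal numerical sketch (not taken from the paper), the distance can be evaluated directly from matrix entries. It assumes the usual form $D_P(U,V) = \sqrt{1 - |\text{Tr}(U^{\dagger}V)|/N}$ and the convention $R_z(\theta) = \text{diag}(e^{-i\theta/2}, e^{i\theta/2})$; the helper names `trace_inner`, `dist_phase_invariant`, `rz` and `scale` are illustrative only.

```python
import cmath
import math

def trace_inner(U, V):
    """Tr(U^dagger V), computed entrywise as sum_ij conj(U_ij) * V_ij."""
    return sum(complex(u).conjugate() * v
               for row_u, row_v in zip(U, V)
               for u, v in zip(row_u, row_v))

def dist_phase_invariant(U, V):
    """Assumed definition: sqrt(1 - |Tr(U^dagger V)| / N)."""
    n_dim = len(U)
    value = 1.0 - abs(trace_inner(U, V)) / n_dim
    return math.sqrt(max(value, 0.0))  # clamp tiny negative rounding errors

def rz(theta):
    """Single-qubit z-rotation, assumed convention diag(e^{-i theta/2}, e^{i theta/2})."""
    return [[cmath.exp(-0.5j * theta), 0.0],
            [0.0, cmath.exp(0.5j * theta)]]

def scale(U, phase):
    """Multiply every entry of U by the scalar e^{i phase}."""
    factor = cmath.exp(1j * phase)
    return [[factor * u for u in row] for row in U]

U = rz(0.3)
# A global phase does not change the distance (zero up to rounding).
same_up_to_phase = dist_phase_invariant(U, scale(U, 1.234))
# Distance of a small rotation from the identity.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_to_identity = dist_phase_invariant(rz(0.3), identity)
print(same_up_to_phase, d_to_identity)
```

Under the assumed convention, $\text{Tr}(R_z(\theta)) = 2\cos(\theta/2)$, so the second value equals $\sqrt{1 - \cos(\theta/2)}$ for small $\theta$.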

T-count and T-depth of circuits
The T-count of a circuit is the number of T-gates in it. Suppose the unitary $U$ implemented by a circuit is written as a product $U = U_m U_{m-1} \cdots U_1$ such that each $U_i$ can be implemented by a circuit in which all the gates can act in parallel or simultaneously. We say $U_i$ has depth 1 and $m$ is the depth of the circuit. The T-depth of a circuit is the number of unitaries $U_i$ where the $T/T^{\dagger}$ gate is the only non-Clifford gate and all the $T/T^{\dagger}$ gates can act in parallel.
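The two circuit quantities above can be sketched in a few lines for a fixed layering of a Clifford+T circuit. The representation used here (a list of depth-1 layers, each a list of `(gate_name, qubits)` pairs) and the gate labels `"T"` and `"Tdg"` are hypothetical choices for illustration, not notation from the paper.

```python
# Hypothetical circuit representation: a list of depth-1 layers,
# each layer a list of (gate_name, qubits) pairs acting on disjoint qubits.
T_GATES = {"T", "Tdg"}

def t_count(layers):
    """Number of T or T-dagger gates in the whole circuit."""
    return sum(1 for layer in layers for gate, _ in layer if gate in T_GATES)

def t_depth(layers):
    """Number of depth-1 layers that contain at least one T or T-dagger gate."""
    return sum(1 for layer in layers
               if any(gate in T_GATES for gate, _ in layer))

circuit = [
    [("H", (0,)), ("T", (1,))],
    [("CNOT", (0, 1))],
    [("T", (0,)), ("Tdg", (1,))],
    [("S", (0,))],
]
print(t_count(circuit), t_depth(circuit))
```

This only counts gates for the layering that is passed in; finding a layering or circuit that minimizes either quantity is the synthesis problem addressed by the cited algorithms.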

T-count and T-depth of exactly implementable unitaries
The T-count of an exactly implementable unitary $U \in \mathcal{J}_n$, denoted by $\mathcal{T}(U)$, is the minimum number of T-gates required to implement it (up to a global phase). A decomposition of $U$ with the minimum number of T-gates is called a T-count-optimal decomposition of $U$.
The min-T-depth of an exactly synthesizable unitary $U \in \mathcal{J}_n$, denoted by $\mathcal{T}_d(U)$, is the minimum T-depth of a Clifford+T circuit that implements it (up to a global phase). We often simply say "T-count" instead of "T-count of a unitary" and "T-depth" instead of "T-depth or min-T-depth of a unitary". It should be clear from the context. Similarly, the $\epsilon$-T-depth of $V$, denoted by $\mathcal{T}_d^{\epsilon}(V)$, is equal to $\mathcal{T}_d(U)$, the T-depth of an exactly implementable unitary $U \in \mathcal{J}_n$ such that $d(U, V) \le \epsilon$ and $\mathcal{T}_d(U) \le \mathcal{T}_d(U')$ for any $U' \in \mathcal{J}_n$ with $d(U', V) \le \epsilon$.
We call a T-count-optimal (or T-depth-optimal) circuit for any such $U$ an $\epsilon$-T-count-optimal (or $\epsilon$-T-depth-optimal, respectively) circuit for $V$.
It is not hard to see that the above definitions are very general and can be applied to any unitary $V \in \mathcal{U}_n$, exactly or approximately implementable. If a unitary is exactly implementable then $\epsilon = 0$.

Matrix norms
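Let $\mathbb{F}^{m \times n}$ be the vector space of all matrices of size $m \times n$ with entries in the field $\mathbb{F}$, which is the field of real or complex numbers.
If $\|\cdot\|$ is any norm on $\mathbb{F}^n$, then the operator norm induced by $\|\cdot\|$, of an $m \times n$ matrix $A$ on the space $\mathbb{F}^{m \times n}$ is defined as follows.
$$\|A\| = \sup\{\|Ax\| : x \in \mathbb{F}^n, \|x\| = 1\}$$
In particular, the p-norm is defined as follows.
$$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$$
For $p = 2$ we have
$$\|A\|_2 = \sqrt{\lambda_{\max}\left(A^{\dagger}A\right)} = \sigma_{\max}(A)$$
where the latter is the largest singular value of $A$. This is also called the spectral norm. An equivalent definition of $\|A\|_2$ is as follows.
$$\|A\|_2 = \sup\{|x^{\dagger} A y| : x \in \mathbb{F}^m, y \in \mathbb{F}^n, \|x\|_2 = \|y\|_2 = 1\}$$
The spectral norm is an induced or operator norm and in this paper, whenever we mention operator norm, we refer to this norm. The Frobenius norm or Hilbert-Schmidt norm or Schur norm can be defined as follows.
$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2} = \sqrt{\text{Tr}\left(A^{\dagger}A\right)}$$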

Relation between Frobenius distance and global phase invariant distance
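Let $U$ and $V$ be two unitaries such that $\text{Tr}(V^{\dagger} U) = t$, so $\text{Tr}(U^{\dagger} V) = \bar{t}$. The global phase invariant distance between the unitaries is:
$$D_P(U, V) = \sqrt{1 - \frac{|t|}{N}}$$
The Frobenius distance between these unitaries is:
$$D_F(U, V) = \|U - V\|_F = \sqrt{\text{Tr}\left((U - V)^{\dagger}(U - V)\right)} = \sqrt{2N - t - \bar{t}}$$
Proof. For simplicity of expressions, we write $D_P$ and $D_F$ instead of $D_P(U, V)$ and $D_F(U, V)$ respectively. From the expressions above,
$$D_F^2 = 2N - (t + \bar{t}) = 2N - 2\,\text{Re}(t)$$
and $N D_P^2 = N - |t|$. Since $\text{Re}(t) \le |t|$, it follows that $N D_P^2 \le D_F^2 / 2$.
The following result was proved in [35]. If $U_1, \ldots, U_m$ and $V_1, \ldots, V_m$ are unitaries such that $\|U_i - V_i\| \le \epsilon_i$ for each $i$, then $\|U_m \cdots U_1 - V_m \cdots V_1\| \le \sum_{i=1}^{m} \epsilon_i$.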
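Returning to the relation between the two distances in this section, the following sketch checks numerically that $N D_P^2 \le D_F^2 / 2$, which follows from $\text{Re}(t) \le |t|$. It assumes $D_P(U,V) = \sqrt{1 - |\text{Tr}(U^{\dagger}V)|/N}$ and $D_F(U,V) = \|U - V\|_F$; the inequality tested is one consequence of these definitions and is not claimed to be the exact statement of Lemma 2.1. The random single-qubit unitaries are built as $e^{i\phi} R_z(a) H R_z(b) H R_z(c)$, an illustrative parametrization.

```python
import cmath
import math
import random

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rz(theta):
    return [[cmath.exp(-0.5j * theta), 0.0], [0.0, cmath.exp(0.5j * theta)]]

S2 = 1 / math.sqrt(2)
H = [[S2, S2], [S2, -S2]]

def random_unitary(rng):
    """e^{i phi} Rz(a) H Rz(b) H Rz(c) with random angles (a 2x2 unitary)."""
    a, b, c, phi = (rng.uniform(0, 2 * math.pi) for _ in range(4))
    M = matmul(rz(a), matmul(H, matmul(rz(b), matmul(H, rz(c)))))
    return [[cmath.exp(1j * phi) * x for x in row] for row in M]

def dist_phase_invariant(U, V):
    """Assumed definition: sqrt(1 - |Tr(U^dagger V)| / N)."""
    t = sum(complex(u).conjugate() * v
            for ru, rv in zip(U, V) for u, v in zip(ru, rv))
    return math.sqrt(max(1.0 - abs(t) / len(U), 0.0))

def dist_frobenius(U, V):
    """Frobenius norm of U - V."""
    return math.sqrt(sum(abs(u - v) ** 2
                         for ru, rv in zip(U, V) for u, v in zip(ru, rv)))

rng = random.Random(7)
pairs = [(random_unitary(rng), random_unitary(rng)) for _ in range(200)]
# Smallest value of D_F^2 / 2 - N * D_P^2 over the sample; should be >= 0.
worst_gap = min(dist_frobenius(U, V) ** 2 / 2
                - len(U) * dist_phase_invariant(U, V) ** 2
                for U, V in pairs)
print(worst_gap)
```

The random global phase $e^{i\phi}$ matters here: it changes $D_F$ but leaves $D_P$ untouched, which is why the relation is only one-directional.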

Composition of global phase invariant distance
In this section we show how the global phase invariant distance composes for tensor product and multiplication of unitaries. We prove that the bound we obtain is better than the sum-of-error bound. Ideally, it is convenient to have some compact expression that works for composition of any number of unitaries. So in Section 3.2 we derive some approximations for multiplication of unitaries.

Tensor product of unitaries
In the next lemma we show that the bound we obtain is better than the sum of error bound.
Since each $\epsilon_i < 1$, we have $\prod_{i=b}^{m} \epsilon_i^2 < 1$ for any $b \ge 1$. So each of the differences in the brackets is positive and we have $\Delta > 0$. This proves the lemma.
A graphical comparison has been shown in Figure 5 of Appendix B.
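A small numerical sketch of the tensor product case follows. Because the trace is multiplicative under tensor products, the assumed definition $D_P = \sqrt{1 - |\text{Tr}(U^{\dagger}V)|/N}$ gives $D_P(U_1 \otimes U_2, V_1 \otimes V_2) = \sqrt{1 - (1 - d_1^2)(1 - d_2^2)}$ with $d_i = D_P(U_i, V_i)$. This closed form is derived here from the definition and is not claimed to be the exact statement of Lemma 3.1; the angles are arbitrary illustrative values.

```python
import cmath
import math

def rz(theta):
    return [[cmath.exp(-0.5j * theta), 0.0], [0.0, cmath.exp(0.5j * theta)]]

def kron(A, B):
    """Kronecker (tensor) product of two square matrices as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def dist_phase_invariant(U, V):
    """Assumed definition: sqrt(1 - |Tr(U^dagger V)| / N)."""
    t = sum(complex(u).conjugate() * v
            for ru, rv in zip(U, V) for u, v in zip(ru, rv))
    return math.sqrt(max(1.0 - abs(t) / len(U), 0.0))

# Ideal rotations and slightly perturbed versions (illustrative angles).
U1, V1 = rz(0.4), rz(0.42)
U2, V2 = rz(1.1), rz(1.07)

d1 = dist_phase_invariant(U1, V1)
d2 = dist_phase_invariant(U2, V2)
d_tensor = dist_phase_invariant(kron(U1, U2), kron(V1, V2))
predicted = math.sqrt(1.0 - (1.0 - d1 ** 2) * (1.0 - d2 ** 2))
print(d_tensor, predicted, d1 + d2)
```

The composed error matches the closed form and stays below the sum of the component errors, in line with the comparison against the sum-of-error bound made in this section.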

Multiplication of unitaries
Let $V_2 = U_2 E_2$ and $V_1 = E_1 U_1$. Then using Inequality 1 we have the following.
Both $E_1$ and $E_2$ are unitaries, so we can expand them in the Pauli basis. For simplicity of notation, we write $a \in \mathcal{P}_n$ instead of $\sigma_a \in \mathcal{P}_n$. From Inequality 2 we can write where the second equality follows from the invariance of trace under cyclic permutations, and the first inequality of the lemma follows, considering the fact that the global phase invariant distance cannot be more than 1.
To prove the second inequality consider the following difference.
Approximation and composition
We want to derive a bound on $D_P(U, V)$. We can use the bound derived in Lemma 3.3 in an iterative way as follows. First, derive a bound $D_P(U_2 U_1, V_2 V_1) \le \epsilon_2'$ (say). If $U_2' = U_2 U_1$ and $V_2' = V_2 V_1$, then we use Lemma 3.3 and derive a bound on $D_P(U_3 U_2 U_1, V_3 V_2 V_1) = D_P(U_3 U_2', V_3 V_2')$ as a function of $\epsilon_2'$ and $\epsilon_3$. Plugging in the expression for $\epsilon_2'$, we get a function of $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$. More illustration can be found in Appendix A. Clearly, it can be hard to get a compact expression for general $m$. So we use some approximations.
Approximation-I: From Lemma 3.3 we get the following (assuming the bound is less than 1).
Now we use this approximate $\epsilon_2'$ in Lemma 3.3 to calculate $D_P(U_3 U_2', V_3 V_2')$. If $\epsilon_3'$ is the bound obtained then we can show the following.
Thus if we keep iterating then we can give an approximate upper bound as follows.
In Appendix A.1 (Figure 3) we show that this bound closely follows the bound derived from Lemma 3.3 and hence can be considered as a good approximation.

Approximation-II:
If $\epsilon_i < 0.01$ and $m$ is not large then in Lemma 3.3 we can ignore the summation term within the square root and write the following.
In Appendix A.2 (Figure 4) we have compared this bound with the bound derived from Lemma 3.3 when $c = 7.5$ and $m \le 110$. For applications like resource estimation we can consider this a good enough approximation. This bound is more compact and hence more convenient for analytical treatment. But it can exceed the bound derived from Lemma 3.3 when $m$ is less than $c^2 \approx 56$. We can change the value of $c$ and get other approximations that work well for other ranges of $m$.
Remark 3.1. From Figures 3, 4 and 5 in Appendices A and B, we see that the error grows much more slowly in the case of a tensor product of unitaries. So whenever possible, it is better to analyse a circuit with more components in tensor product, i.e. in parallel. For example, instead of treating $U = (I \otimes U_2) \cdot (U_1 \otimes I)$, if we think of $U = U_1 \otimes U_2$, then the overall error bound is significantly smaller and so the estimate of the resource requirement is also much lower. This is another motivation for depth-optimal synthesis.
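The behaviour of the error under multiplication can be explored numerically with the following sketch. It builds a single-qubit circuit $R_z(a_5) H R_z(a_4) H \cdots H R_z(a_1)$, perturbs each rotation angle by a small hypothetical synthesis error, and compares the overall global phase invariant error with the sum of the component errors. It assumes $D_P = \sqrt{1 - |\text{Tr}(U^{\dagger}V)|/N}$ and does not evaluate Lemma 3.3 or the two approximations, whose explicit forms are not reproduced here.

```python
import cmath
import math

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rz(theta):
    return [[cmath.exp(-0.5j * theta), 0.0], [0.0, cmath.exp(0.5j * theta)]]

S2 = 1 / math.sqrt(2)
H = [[S2, S2], [S2, -S2]]

def dist_phase_invariant(U, V):
    """Assumed definition: sqrt(1 - |Tr(U^dagger V)| / N)."""
    t = sum(complex(u).conjugate() * v
            for ru, rv in zip(U, V) for u, v in zip(ru, rv))
    return math.sqrt(max(1.0 - abs(t) / len(U), 0.0))

def circuit(angles):
    """Rotations interleaved with exact H gates; angles[0] is applied first."""
    U = rz(angles[0])
    for a in angles[1:]:
        U = matmul(rz(a), matmul(H, U))
    return U

angles = [0.3, 1.1, 2.0, 0.7, 1.9]
offsets = [0.02, -0.01, 0.015, 0.03, -0.025]  # hypothetical angle errors
perturbed = [a + d for a, d in zip(angles, offsets)]

component_errors = [dist_phase_invariant(rz(a), rz(b))
                    for a, b in zip(angles, perturbed)]
overall_error = dist_phase_invariant(circuit(angles), circuit(perturbed))
sum_of_errors = sum(component_errors)
print(overall_error, sum_of_errors)
```

The H gates are treated as exact, so only the rotations contribute to the error budget, mirroring the setting studied later where $R_z(\theta)$ gates are the only approximately synthesized components.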

Composition of unitaries as arbitrary tensor product and multiplication
Let $V = \prod_{i=m}^{1} V_i$ be an approximately implementable $n$-qubit unitary which has been decomposed as a product of component unitaries. Each such component unitary $V_i \in \mathcal{U}_n$ is further decomposed as $V_i = \bigotimes_{j=1}^{m_i} V_{ij}$. Let $U_{ij}$ be unitaries (of proper dimension) such that $D_P(V_{ij}, U_{ij}) \le \epsilon_{ij}$, for each $i, j$. Let $U_i = \bigotimes_{j=1}^{m_i} U_{ij}$ and $U = \prod_{i=m}^{1} U_i$. We want to bound $D_P(V, U)$ as a function of the $\epsilon_{ij}$.
From Lemma 3.1 we obtain a bound on each $D_P(U_i, V_i)$ in terms of the $\epsilon_{ij}$. Now we can use Lemma 3.3 iteratively to derive a bound on $D_P(U, V)$. Alternatively, we can use the approximations, either Equation 3 (approximation-I) or Equation 4 (approximation-II). Similarly, if we have decompositions that are tensor products of products of unitaries,
then we can use the results in Sections 3.1 and 3.2 in order to bound $D_P(U, V)$. This is, again, less than the sum-of-error bound $\sum_{i,j} \epsilon_{ij}$.
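The bookkeeping for such a two-level decomposition can be sketched as follows. Each row of `eps` holds the errors $\epsilon_{ij}$ of one tensor-product stage $V_i$. The per-stage bound $\sqrt{1 - \prod_j (1 - \epsilon_{ij}^2)}$ is the tensor-product expression that follows from multiplicativity of the trace (an assumption about the form of Lemma 3.1), and the stages are then combined by a plain sum, which is a deliberately loose stand-in for Lemma 3.3 or its approximations. All numbers are illustrative.

```python
import math

def tensor_stage_bound(eps_row):
    """Bound for one stage V_i = tensor_j V_ij, assuming the form
    sqrt(1 - prod_j (1 - eps_ij^2))."""
    prod = 1.0
    for e in eps_row:
        prod *= 1.0 - e * e
    return math.sqrt(1.0 - prod)

# eps[i][j] is the error allotted to component V_ij (illustrative values).
eps = [
    [1.0e-3, 2.0e-3],
    [1.5e-3, 0.5e-3],
    [1.0e-3, 1.0e-3],
]

stage_bounds = [tensor_stage_bound(row) for row in eps]
# Loose combination across the product: add the per-stage bounds.
sum_of_stage_bounds = sum(stage_bounds)
# Plain sum-of-error bound over every component.
sum_of_errors = sum(sum(row) for row in eps)
print(sum_of_stage_bounds, sum_of_errors)
```

Even with the crude sum across stages, grouping components that act in parallel already gives a smaller total than adding every $\epsilon_{ij}$ individually.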

Optimization problems
The discussion in the previous section is enough to show that, given the same error bound, it is advantageous to work with the global phase invariant distance, in the sense that we can allot more error per component unitary and hence get a lower resource estimate, since the resource count usually grows as the error shrinks. To illustrate further, in this section we formulate some optimization problems to find the distribution of approximation errors such that the total resource requirement, in our case the number of T-gates, is minimized. We solve these problems analytically for some specific scenarios and compare the resource estimates in the two distances. In the case of a product of unitaries we use approximation-II (Equation 4) and thus we assume conditions under which it is close to the exact bound derived.
Let $C(V, \epsilon)$ be a cost function that captures the quantity we want to minimize, like the number of T-gates, T-depth, circuit depth or total number of gates in the circuit, for a given bound $\epsilon$ on the approximation error (in the global phase invariant distance) of the complete circuit. This means that if $U$ is the unitary implemented by the circuit and $V$ is a given unitary, then $D_P(U, V) \le \epsilon$. For each component $V_{ij}$ of a decomposition of $V$, let $U_{ij}$ be an exactly implementable unitary such that $D_P(U_{ij}, V_{ij}) \le \epsilon_{ij}$ and which requires the minimum amount of resources among all unitaries that $\epsilon_{ij}$-approximate $V_{ij}$. So $U = \prod_{i=m}^{1} \bigotimes_{j=1}^{m_i} U_{ij}$ is the unitary implemented by the circuit. $C(U_{ij})$ is the minimum amount of resources required to implement $U_{ij}$, i.e. $C(V_{ij}, \epsilon_{ij}) = C(U_{ij})$.
So our optimization program to find the minimum count of any resource is: and the program to find the minimum depth of any resource is: We have reduced the upper bound on the overall error in order to account for the approximation. We have used bold fonts for variables. Since only the resource count of approximately implementable unitaries is a function of the approximation error, in the optimization programs 5 and 6 we assume, without loss of generality, that each $V_{ij}$ is approximately synthesizable. One problem with such a formulation is the number of parameters $\epsilon_{ij}$, which may grow infeasibly large. In fact, the decomposition of $V$ may depend on $\epsilon$ and so the number of $V_{ij}$ (and hence of $\epsilon_{ij}$) may vary. A software solution was given in [34,26], where the authors used simulated annealing to solve the optimization problem.
In this section we consider the setting where $R_z(\theta)$ gates are the only approximately synthesizable unitaries, as done in [34,26]. We consider the problem of minimizing the number of T gates. Thus we use the empirical relation in [23], by which
$$C(V_{ij}, \epsilon_{ij}) = k \log\frac{1}{\epsilon_{ij}} + k_2 \quad \text{where } k = 3.067 \text{ and } k_2 = -4.322 \qquad (7)$$
where the error is measured in the global phase invariant distance.

Optimal cost of our optimization program
We can use the Karush-Kuhn-Tucker (KKT) conditions [49,50,51] to solve the above optimization problem with an inequality constraint. For simplicity and without much loss of generality we use equality in the constraint and follow the Lagrangian method of optimization. Let us consider the scenario when the unitary is written as a product of $N_R$ $R_z(\theta)$ gates. Our optimization problem is as follows.
Let $\lambda$ be a Lagrange multiplier. The Lagrangian formulation is: To optimize, the following derivatives must be zero.
where $k = 3.067$ and $k_2 = -4.322$, we have And the optimal number of T-gates is
Tensor product: If a unitary is written as a tensor product of $R_z(\theta)$ gates then the above analysis holds with $c = 1$ and $\delta = 0$.
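The claim that the error should be split equally can be checked numerically with the sketch below. It uses the empirical cost of Equation 7 with a base-2 logarithm (an assumption about the base), and it assumes that the constraint from approximation-II has the form $c\sqrt{\sum_i \epsilon_i^2} = \epsilon(1 - \delta)$. That form is inferred from the threshold $N_R \gtrsim c^2$ and the tensor-product case $c = 1$, $\delta = 0$ mentioned in the text, and is not spelled out in this section. The function names are illustrative.

```python
import math
import random

K, K2 = 3.067, -4.322  # empirical constants quoted from [23]

def t_count_rz(eps):
    """Empirical T-count of one R_z gate; base-2 logarithm is an assumption."""
    return K * math.log2(1.0 / eps) + K2

def total_cost(eps_list):
    return sum(t_count_rz(e) for e in eps_list)

def equal_split(n_rot, eps_total, c=7.5, delta=0.0):
    """Equal errors saturating the assumed constraint
    c * sqrt(sum eps_i^2) = eps_total * (1 - delta)."""
    return [eps_total * (1 - delta) / (c * math.sqrt(n_rot))] * n_rot

def random_feasible_split(n_rot, eps_total, rng, c=7.5, delta=0.0):
    """Random positive errors rescaled to saturate the same constraint."""
    raw = [rng.uniform(0.1, 1.0) for _ in range(n_rot)]
    norm = math.sqrt(sum(r * r for r in raw))
    return [eps_total * (1 - delta) * r / (c * norm) for r in raw]

rng = random.Random(1)
n_rot, eps_total = 100, 1e-2
best = total_cost(equal_split(n_rot, eps_total))
others = [total_cost(random_feasible_split(n_rot, eps_total, rng))
          for _ in range(50)]
print(best, min(others))
```

Every unequal feasible split costs at least as much as the equal one, which is what the stationarity conditions of the Lagrangian predict under this assumed constraint.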
Optimal cost in [34,26]
In [34,26] the authors have worked with the operator norm and used the Bernstein-Vazirani bound [35]. They have considered the unitary as a product of component unitaries. So the constraint function is as follows.
Strictly speaking, while working with the operator norm the available results [24] show that the T-count of $R_z(\theta)$ varies as follows.
There is another bound, due to Selinger [29], that gives the following relation.
Clearly, both these bounds are much worse than the one in Equation 7. Our result in Lemma 2.1 shows that an upper bound on the operator norm implies an upper bound on the global phase invariant distance. We do not know about the relation in the other direction. So the assumption that either of the above relations holds for the global phase invariant distance is without rigorous mathematical proof.
In the following we want to argue that even if we use the same bound for both distances (the best one, Equation 7), working with the global phase invariant distance gives us fewer T-gates.
For the operator norm the optimization program is as follows.
The Lagrangian formulation is as follows.
To optimize, the following partial derivatives must be zero.
And the optimal number of T gates is

Comparison
Now we calculate the difference between the optimal costs (Equations 9 and 13).
For all practical purposes $1 - \delta \approx 1$. Thus $\Delta > 0$, and this proves that the optimal cost, i.e. the number of T-gates, obtained with the global phase invariant distance is smaller if $N_R$ is roughly greater than $c^2$. We emphasize that this condition arises only because we use approximation-II. If we use the exact bound (derived from Lemma 3.3) or approximation-I then we will always need fewer resources, as implied by the graphs in Appendix A.
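The threshold $N_R \gtrsim c^2$ can be illustrated with a short calculation. The sketch assumes equal error distribution in both cases, the cost of Equation 7 with a base-2 logarithm, the sum-of-error constraint $\sum_i \epsilon_i = \epsilon$ for the operator norm, and the form $c\sqrt{\sum_i \epsilon_i^2} = \epsilon(1 - \delta)$ for approximation-II. The last form is an inference from the surrounding text, not a quoted equation, and the function names are illustrative.

```python
import math

K, K2 = 3.067, -4.322  # empirical constants quoted from [23]

def cost_operator_norm(n_rot, eps_total):
    """Equal split under the sum-of-error constraint: eps_i = eps_total / n_rot."""
    return n_rot * (K * math.log2(n_rot / eps_total) + K2)

def cost_phase_invariant(n_rot, eps_total, c=7.5, delta=0.0):
    """Equal split under the assumed constraint
    c * sqrt(sum eps_i^2) = eps_total * (1 - delta)."""
    eps_i = eps_total * (1 - delta) / (c * math.sqrt(n_rot))
    return n_rot * (K * math.log2(1.0 / eps_i) + K2)

def saving(n_rot, eps_total, c=7.5, delta=0.0):
    """T-gates saved by working in the global phase invariant distance."""
    return (cost_operator_norm(n_rot, eps_total)
            - cost_phase_invariant(n_rot, eps_total, c, delta))

for n_rot in (25, 56, 57, 100, 1000):
    print(n_rot, saving(n_rot, 1e-2))
```

Under these assumptions the saving equals $N_R\, k \log_2\!\big(\sqrt{N_R}(1-\delta)/c\big)$, so it is independent of the total error budget and changes sign near $N_R = c^2 \approx 56$ for $c = 7.5$.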
If we do the same analysis with Equation 11 for the operator norm then we can show that error has to be distributed equally and the difference in optimal cost is as follows.
which is greater than 0 for all practical purposes.

Number of T-gates for QFT and other quantum programs
In this section we consider the Quantum Fourier Transform (QFT) in order to illustrate the advantage of working with the global phase invariant distance. QFT can be used for important tasks like the adder, phase estimation (QPE) and solving the hidden subgroup problem. QPE helps us to approximate the eigenvalues of a unitary operator under certain circumstances. This in turn allows us to solve other interesting problems like factoring [52], the order finding problem and counting solutions to a search problem. Thus a bound on the amount of resources for QFT is an important factor in the resource estimate of all these important quantum programs. We often work with the approximate QFT [37], in which the number of rotations is reduced by pruning the rotation gates with small angles.
From the results in the previous sections it is clear that working with the global phase invariant distance will give a lower resource count, simply because the error bound is lower than the sum-of-error bound. For the approximate QFT, there is another reason for getting a lower resource estimate. We incur some error while pruning the rotation gates. We show that this error is lower if measured in the global phase invariant distance.
We consider the QFT circuit in [27] that requires $N_R = \frac{n(n-1)}{2}$ controlled rotation gates $cR_k$. The subscript $k$ varies from $2, \ldots, i$, when the distance between control and target qubit is $i - 1$, i.e. the control is at qubit 1 and $R_k$ (target) is at qubit $i$. Further decomposition into $R_z(\theta)$ gates can be obtained from two facts. First, any single qubit unitary can be decomposed in terms of two H gates and $R_z(\theta)$ gates [33]. Second, controlled $R_z(\theta)$ gates can be implemented by reducing them to Fredkin and $R_z(\theta)$ gates. Fredkin gates are exactly implementable and their T-count is 7. We focus on reducing the number of T-gates for compositions of $R_z(\theta)$ gates.
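In approximate QFT, pruning the rotation gates implies replacing the corresponding controlled $R_k$ gates ($cR_k$) with identity ($I$). Then the global phase invariant distance is calculated as follows. Let $\theta_k = 2\pi/2^k$.
$$D_P(cR_k, I) = \sqrt{1 - \frac{\left|3 + e^{i\theta_k}\right|}{4}} = \sqrt{1 - \frac{\sqrt{10 + 6\cos\theta_k}}{4}}$$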
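The pruning error in the two distances can be compared numerically with the sketch below. It assumes the two-qubit controlled rotation $cR_k = \text{diag}(1, 1, 1, e^{i\theta_k})$, the definition $D_P = \sqrt{1 - |\text{Tr}(U^{\dagger}V)|/N}$, and uses the fact that the operator norm of a diagonal matrix is its largest entry in absolute value. The function names are illustrative.

```python
import cmath
import math

def controlled_rk_diagonal(k):
    """Diagonal of the two-qubit controlled-R_k gate, theta_k = 2*pi / 2**k."""
    theta = 2 * math.pi / 2 ** k
    return [1.0, 1.0, 1.0, cmath.exp(1j * theta)]

def pruning_error_phase_invariant(k):
    """D_P(cR_k, I) under the assumed definition sqrt(1 - |Tr(U^dagger V)| / N)."""
    diag = controlled_rk_diagonal(k)
    trace = sum(diag)  # Tr(I^dagger cR_k) for a diagonal gate
    return math.sqrt(max(1.0 - abs(trace) / len(diag), 0.0))

def pruning_error_operator_norm(k):
    """||cR_k - I||: for a diagonal matrix this is the largest |entry - 1|."""
    return max(abs(d - 1.0) for d in controlled_rk_diagonal(k))

for k in range(2, 9):
    print(k, pruning_error_phase_invariant(k), pruning_error_operator_norm(k))
```

For every $k$ in this range the error of replacing $cR_k$ by the identity is smaller in the global phase invariant distance, which is the first of the two effects discussed for the approximate QFT.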
In the operator norm we have $\|cR_k - I\| = |e^{i\theta_k} - 1| = 2\sin(\theta_k/2)$, which is larger than $D_P(cR_k, I)$. The inequality also follows from Lemma 2.1. By the Bernstein-Vazirani bound [35], the algorithmic error due to truncation of rotation gates is here $\epsilon'_{QFT} \le \sum_{k \in S_0} N_k \epsilon'_k$. From Inequality 16 and our results in Section 3 we can say that $\epsilon'_{QFT} > \epsilon_{QFT}$. If we have a total error budget, then we have to allot the remaining error (i.e. excluding the approximation error due to pruning of rotation gates) among the components. Since $\epsilon'_{QFT} > \epsilon_{QFT}$, the error allotted per component in the case of the operator norm is smaller. Thus in the case of approximate QFT, or any unitary where such pruning occurs, there are two factors that lead to a lower resource count in the global phase invariant distance: (i) the algorithmic error due to pruning is smaller, and (ii) the error accumulation during composition is smaller. We do not give the exact expressions, since these can either be derived analytically with proper approximations, as done in Section 4, or obtained using simulated annealing, as done in [34,26], with the composition rules for the global phase invariant distance.

Quantum phase estimation (QPE)
We have considered one of several implementations of the QPE, shown in Figure 2. The measurement outcomes of the top $t$ qubits yield a $t$-bit approximation to the phase. $t$ also determines the probability $p$ of a successful measurement. If $n$ is the desired accuracy in number of bits, then we have the following relation [27].
Suppose QP E is the accuracy or phase-approximation error i.e. the absolute difference between the correct phase and its t-bit approximation. Since this has nothing to do with the difference in the unitaries implemented, so we can assume that this error value is same for both the operator norm as well as the global phase invariant distance. In this case we can assume that we want the overall error to be bounded by − QP E .
Consider QPE on $U = R_z(\alpha)$. Using the results of the previous sections, we can conclude that fewer T gates are needed to implement the component unitaries if the error is measured in the global phase invariant distance than if the operator norm is used.
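The dependence of $t$ on $n$ and $p$ mentioned for the QPE circuit can be sketched numerically. The snippet assumes the standard textbook form $t = n + \lceil \log_2(2 + \frac{1}{2(1-p)}) \rceil$, with $p$ read as a lower bound on the success probability; whether this matches the relation cited from [27] is an assumption, and the function name `qpe_register_size` is hypothetical.

```python
import math

def qpe_register_size(n, p):
    """Top-register size t for n bits of accuracy with success probability at least p.

    Assumes the textbook relation t = n + ceil(log2(2 + 1 / (2 * (1 - p)))).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return n + math.ceil(math.log2(2 + 1 / (2 * (1 - p))))

for p in (0.5, 0.75, 0.9, 0.99):
    print(f"n=8, p={p}: t={qpe_register_size(8, p)}")
```

Under this assumed relation, $t$ grows only logarithmically as $p$ approaches 1, so the register size is dominated by the desired accuracy $n$.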

Conclusion
In this paper we studied the composability of the global phase invariant distance, which has been used for unitary synthesis in qubit-based computing, topological quantum computing, etc. One application of these composability rules is to analyse the propagation of approximation error in circuit synthesis. This in turn helps in distributing approximation errors during compilation such that the resource cost is reduced. If a unitary is written as a composition (tensor product and multiplication) of component unitaries, and each component is approximated by another unitary, then we show that the overall error bound is less than the sum-of-error bound, which holds for the operator norm. We give approximate bounds that are more compact and more convenient to analyse. We show how one of our approximations can be used to derive error distributions rigorously. In the special case of approximate QFT, we show that a further advantage is obtained by analysing the error due to pruning of rotation gates in the global phase invariant distance.
Such an analysis of the resource requirements can also be done for other unitaries, for example time-evolution operators, which are widely used in simulating chemistry [47]. In many cases, the time-evolution operator is approximated by a Trotter-Suzuki decomposition, where the number of Trotter steps is proportional to $1/\sqrt{\epsilon_{TE}}$. Here $\epsilon_{TE}$ is the accuracy error, usually measured in the operator norm. It will be interesting to probe whether the global phase invariant distance has some operational meaning here and, if so, whether working in this distance gives any advantage, for example in terms of gate count. Further applications of the results derived in this paper are left for future work.
A Comparison of bounds derived in Section 3.2

Let us consider unitaries $U_1, U_2, \ldots$ and $V_1, V_2, \ldots$ such that $D_P(U_i, V_i) \leq \epsilon_i$ for each $i$.
Our aim is to show how the error propagates under the bounds given in Section 3.2, with and without the approximations, and how close the approximations are. By convention, we assume the circuit is executed from right to left, i.e. first $U_1$, then $U_2$ and so on. We consider the distance between the unitaries after the implementation of each $U_i$, i.e. $D_P(U_1, V_1)$, $D_P(U_2 U_1, V_2 V_1)$, $D_P(U_3 U_2 U_1, V_3 V_2 V_1)$ and so on. Equivalently, we calculate $D_P(\mathcal{U}_m, \mathcal{V}_m)$, where $\mathcal{U}_m = \prod_{i=m}^{1} U_i$, $\mathcal{V}_m = \prod_{i=m}^{1} V_i$ and $m = 1, 2, 3, \ldots$. In our experiment, we take $\epsilon_i = \epsilon_j$ for each $i, j$.
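To make the notion of error accumulation concrete, the sketch below tracks the exact distance for one simple family where it can be computed in closed form: $U_i = R_z(a_i)$ and $V_i = R_z(a_i + \delta)$. It assumes the definition $D_P(U,V) = \sqrt{1 - |\mathrm{Tr}(U^\dagger V)|/N}$ and compares the exact accumulated distance with the sum-of-error bound only. It does not evaluate Lemma 3.3, Equation 3 or Equation 4, and the names `d_p_rz` and `accumulated` are hypothetical.

```python
import math

def d_p_rz(delta):
    """D_P between R_z(a) and R_z(a + delta); it depends only on delta."""
    # Tr(R_z(a)^dagger R_z(a + delta)) = 2 * cos(delta / 2), and N = 2.
    return math.sqrt(max(0.0, 1 - abs(math.cos(delta / 2))))

def accumulated(delta, m):
    """Exact D_P after m factors, and the sum-of-error bound m * eps."""
    eps = d_p_rz(delta)
    # R_z gates commute, so the m angle offsets simply add up to m * delta.
    return d_p_rz(m * delta), m * eps

delta = 0.02
for m in (1, 2, 5, 10, 50):
    exact, sum_bound = accumulated(delta, m)
    print(f"m={m}: exact D_P={exact:.6f}  sum-of-error bound={sum_bound:.6f}")
```

In this family all angle errors point in the same direction, so the errors accumulate coherently, yet the exact distance still never exceeds the sum-of-error bound.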

A.1 Comparison with Approximation-I
First, we compare the bounds on $D_P(\mathcal{U}_m, \mathcal{V}_m)$ obtained from Lemma 3.3, Equation 3 and the sum-of-error bound. It is straightforward to use Equation 3 for different values of $m$. We show briefly how to use Lemma 3.3 to calculate the distance without approximation. Suppose we take $\epsilon_i = 0.01$ for each $i$. Taking $m = 1$ we have $D_P(U_1, V_1) \leq 0.01$. When $m = 2$, Lemma 3.3 bounds $D_P(U_2 U_1, V_2 V_1)$ in terms of $\epsilon_1$ and $\epsilon_2$, and from Equation 3 with $\epsilon_1 = \epsilon_2 = 0.01$ we have $D_P(U_2 U_1, V_2 V_1) \leq 0.019992$.
We denote the distance calculated using approximation-I (Equation 3) by $\widetilde{D}_P(\cdot, \cdot)$. When $m = 3$, Lemma 3.3 is applied again to bound $D_P(U_3 U_2 U_1, V_3 V_2 V_1)$. We can keep calculating the distance for increasing values of $m$. This shows the error propagation, or error accumulation, in the overall unitary, i.e. the distance between $\prod_i U_i$ and $\prod_i V_i$. In Figure 3 we show the error propagation for different values of $\epsilon$ and $m$. Here $\epsilon$ is the upper bound on the error of each unitary, i.e. $D_P(U_i, V_i) \leq \epsilon$ for each $i$. The black line shows the error derived from Lemma 3.3, the green line shows the error from Equation 3 and the red line shows the sum-of-error bound. We see that the first two bounds are below the sum-of-error bound and that the approximate bound derived in Equation 3 closely follows the bound derived from Lemma 3.3. We show the graphs for $\epsilon = 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}$. Since the difference is more prominent for higher values of $\epsilon$, we do not give the graphs for $\epsilon < 10^{-4}$.

[Figure 4 caption: Here $\mathcal{U}_m = \prod_{i=m}^{1} U_i$ and $\mathcal{V}_m = \prod_{i=m}^{1} V_i$. The approximate distance shown is the one obtained using approximation-II.]

A.2 Comparison with Approximation-II
In Figure 4 we compare the bound on $D_P(\mathcal{U}_m, \mathcal{V}_m)$ derived from Lemma 3.3 (black line) with the one derived in Equation 4 (red line). The green line shows the difference. We take $c = 7.5$ and $m \leq 100$. For even lower values of $\epsilon$, i.e. $\epsilon < 10^{-8}$, the difference is even smaller.

B Error accumulation for tensor product of unitaries
In Figure 5 we show the error growth when the unitaries are in tensor product. That is, we use the bound in Lemma 3.1 (Section 3.1) and compare with the sum-of-error bound.
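The growth under tensor products can be sketched with a short calculation. Assuming the definition $D_P(U,V) = \sqrt{1 - |\mathrm{Tr}(U^\dagger V)|/N}$, the normalized trace is multiplicative under tensor products, so component errors $\epsilon_i$ compose to at most $\sqrt{1 - \prod_i (1 - \epsilon_i^2)}$. The snippet evaluates this expression against the sum-of-error bound. It is derived here from the assumed definition and is not quoted from Lemma 3.1; the function name `tensor_bound` is hypothetical.

```python
import math

def tensor_bound(eps_list):
    """sqrt(1 - prod(1 - eps_i^2)) for component errors eps_i.

    Follows from Tr((U1 x U2)^dagger (V1 x V2)) = Tr(U1^dagger V1) * Tr(U2^dagger V2)
    under the assumed definition of D_P.
    """
    prod = 1.0
    for eps in eps_list:
        prod *= 1 - eps ** 2
    return math.sqrt(max(0.0, 1 - prod))

eps = 1e-2
for m in (1, 2, 10, 100):
    bound = tensor_bound([eps] * m)
    print(f"m={m}: tensor bound={bound:.6f}  sum-of-error bound={m * eps:.6f}")
```

For equal component errors the composed error grows roughly like $\sqrt{m}\,\epsilon$ while $m\epsilon^2$ is small, compared with $m\epsilon$ for the sum-of-error bound.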