Approximate encoding of quantum states using shallow circuits

A common requirement of quantum simulations and algorithms is the preparation of complex states through sequences of 2-qubit gates. For a generic quantum state, the number of gates grows exponentially with the number of qubits, becoming infeasible on near-term quantum devices. Here, we aim at creating an approximate encoding of the target state using a limited number of gates. As a first step, we consider a quantum state that is efficiently represented classically, such as a one-dimensional matrix product state. Using tensor network techniques, we develop an optimization algorithm that approaches the optimal implementation for a fixed number of gates. Our algorithm runs efficiently on classical computers and requires only a polynomial number of iterations. We demonstrate the feasibility of our approach by comparing optimal and suboptimal circuits on real devices. Next, we consider the implementation of the proposed optimization algorithm directly on a quantum computer and overcome inherent barren plateaus by employing a local cost function rather than a global one. By simulating realistic shot noise, we verify that the number of required measurements scales polynomially with the number of qubits. Our work offers a universal method to prepare target states using local gates and represents a significant improvement over known strategies.


I. INTRODUCTION
Quantum state preparation is the task of generating a circuit that prepares a user-defined target state. This task is the first step of several quantum algorithms, such as linear solvers [1], quantum machine learning [2,3], and quantum recommendation systems [4], where classical information is encoded in the initial state of the circuit. Similarly, in the quantum simulation of nonequilibrium systems, the initial state is often a complex state, for example, corresponding to the ground state of a many-body Hamiltonian [5][6][7][8]. Because a generic state of N qubits is defined uniquely by 2^N independent amplitudes, encoding such a state requires an exponentially large number of gates [9][10][11][12][13][14][15][16][17][18][19].
In real devices, the maximal number of gates per circuit is heavily limited by noise and decoherence. Quantum computing practitioners often assume that the maximal achievable circuit depth scales linearly with the number of qubits (see, e.g., the definition of quantum volume by IBM [20]). To match this requirement, it is necessary to heavily reduce the total number of gates used in state encoding. One possible strategy is to consider segments of the circuit, searching for a smaller circuit that performs the same operation, for example by scanning lists of known circuit identities [21][22][23]. This method deterministically reduces the number of gates, but is limited to previously studied problems and does not apply to generic multi-qubit states. Another common method is to perform a blind search in a variational space of quantum circuits [24][25][26][27][28][29][30][31], trying to incrementally approach the target state. The problem with this approach is that due to the large number of variational parameters, the optimization algorithm is, again, limited to a small number of qubits only.
In this paper, we develop an efficient method to obtain an approximate encoder, given a fixed number of gates. We, first, consider quantum states that can be efficiently represented classically, such as matrix product states (MPS) [32][33][34][35][36]. In this case, the optimization algorithm can be realized efficiently on a classical computer, using resources that scale polynomially with the number of gates. We exemplify our protocol for both MPSs with low entanglement and random MPSs and demonstrate a significant improvement with respect to known techniques. Next, we consider the realization of our algorithm on a quantum computer, where the optimization protocol is affected by inherent shot noise. This error source leads to barren plateaus that limit the applicability of optimization algorithms. We overcome this limitation by introducing an improved version of the algorithm, based on local cost functions, whose resource requirements scale polynomially with the number of gates.

II. MATRIX PRODUCT STATES
Our protocol is based on the representation of the target state as a matrix product state (MPS). In this representation, a one-dimensional state of N qubits is expressed in terms of 2N matrices, {A_i^(0)}_{i=1}^N and {A_i^(1)}_{i=1}^N, as

|ψ⟩ = Σ_{j_1,...,j_N} A_1^(j_1) A_2^(j_2) · · · A_N^(j_N) |j_1 ... j_N⟩. (1)

Here, |j_1 ... j_N⟩ is the computational basis and we assumed open boundary conditions, such that the first and last matrices are, respectively, row- and column-vectors. The maximal dimension of A_i is denoted by χ and is known as the bond dimension. Here, we will consider realistic situations where the bond dimension is in the intermediate regime 1 ≪ χ ≪ 2^(N/2). As we will see below, χ determines the resources required to realize the MPS on a quantum computer in an exact manner. By fixing χ, one obtains a variational family of states with maximal entanglement entropy S_max = log_2 χ. For this reason, MPSs offer a faithful description of the ground state of one-dimensional gapped Hamiltonians, whose entanglement entropy follows an area law and does not grow with N [37][38][39][40][41]. Similarly, MPSs are useful in quantum machine learning tasks, such as ansatz structures [42] and ansatz pretraining [43].
From Eq. (1) it is possible to directly derive an (inefficient) quantum circuit that encodes the MPS [44,45]. To achieve this goal, one needs to consider each A_i as a rank-3 tensor, acting on vectors spanned by the left bond-index space and transforming them into higher-dimensional vectors spanned by the right bond index and the physical index j_i [46]. This tensor can be transformed into a unitary gate by adding an extra input index, while preserving its isometric property [47]. By combining the resulting N gates in sequence, one obtains an exact circuit that creates the desired MPS. The problem is that each gate acts on one physical qubit and log_2(χ) ancilla qubits associated with the bond index. Expressing this gate in terms of native 2-qubit gates requires a complex circuit with depth exponential in the number of ancilla qubits [9][48][49][50][51], which would be too noisy for near-term devices (see Section III for more details).
A natural question is whether it is possible to obtain an approximate description of the state using a smaller number of two-qubit gates. A solution to this problem was proposed by Ran [45], following a layer-by-layer approach: (i) The first layer is obtained by simply truncating the bond dimension of the MPS to χ = 2 [52]. As explained above, the resulting state can be easily translated into a sequence of N 2-qubit gates. (ii) The next layer is generated by applying the inverse of the first layer to the target state and truncating the resulting MPS to χ = 2, and so on and so forth. This protocol generates a staircase circuit of 2-qubit gates acting on neighboring pairs of qubits only, which we denote by {G_i^l}, with i = 1,...,N and l = 1,...,L, where L is the number of layers; see Fig. 1(a) [53]. This circuit is suitable for all near-term quantum computers, including processors with limited connectivity such as superconducting circuits, and will be used throughout this work.
The key advantage of the layer-by-layer approach is that each step of the calculation can be performed while keeping the state in an MPS form and using tensor contractions. In this way, the complexity of computing each layer grows linearly with the number of qubits, in contrast to a naive calculation whose complexity grows exponentially with the number of qubits. Note that the classical computational complexity scales quadratically with the bond dimension, which grows by a factor of 2 for each additional layer, i.e., exponentially with the number of layers. Hence, the maximal number of layers is bounded by the classical resources available and is typically set to 20 or less. To achieve a larger number of layers, it is necessary to perform a truncation of the MPS representation of the resulting state to a fixed bond dimension χ_max, at the probable cost of deteriorating the overlap with the target state. At present, however, this is not an important limitation, because near-term devices are anyway limited to a small number of layers.
A more severe problem is that the layer-by-layer approach is not effective: after the first few layers, any additional layer gives a negligible improvement. To highlight this point, we probe the infidelity I = 1 − F, where the fidelity F = |f|^2 and f is the overlap of the encoded state with the target state |ψ_target⟩. Fig. 2 shows in blue the infidelity of the layer-by-layer approach as a function of the number of layers, for two target states with N = 12 qubits and bond dimension χ = 64 [54]: (a) the ground state of an Ising model [55]; (b) a random MPS. In both cases, the layer-by-layer approach shows a very slow convergence to the target state as a function of the number of layers L. In fact, it is not even clear that the layer-by-layer algorithm will eventually converge to I → 0 as L → ∞. Moreover, we observe that the layer-by-layer approach is not capable of reproducing the correct circuit, even in cases where the target state is created using a small number of layers only. This is due to the random kernel used in the algorithm and to its inability to "plan ahead" and create cooperation between different layers. In each step, the algorithm maximizes only the immediate overlap of the state, without taking into account other amplitudes, which may potentially facilitate the performance of future (deeper) layers.

III. OPTIMIZATION ALGORITHM FOR ENCODING MPS STATES
To improve the layer-by-layer approach, we introduce an optimization protocol that aims at finding the best circuit for a fixed number of layers and a fixed circuit topology. Developing a deterministic optimization algorithm requires us to evaluate the gradient of the cost function with respect to the variational parameters, G_i^l. Fortunately, the gradient of the overlap f is simply given by the original tensor network with one missing tensor, as illustrated in Fig. 1(b). The resulting circuit can, therefore, be efficiently contracted into a single rank-4 tensor with 2^4 = 16 elements, using tensor-network methods [56]. We used this gradient to perform an element-by-element optimization scheme. This method, which was first used for the optimization of sequentially generated entangled multi-qubit states and MPOs in Refs. [57,58], is also known as the Evenbly-Vidal method for tensor network renormalization [59]. It has since been used in the context of variational quantum algorithms [60], quantum simulations [61], and tensor network optimization [62]. The method is analogous to the coordinate-descent method in classical machine learning [63] and to the iterative update of tensors in the DMRG algorithm [33], in the effort to optimize the objective function by parts. Unlike steepest-descent methods, the element-by-element optimization does not rely on a convergence rate that needs continuous adjustment using heuristics, but rather proceeds through well-defined, discrete steps. At each step, we select a single gate, G_i^l, and find the unitary matrix that maximizes Tr[(∂f/∂G_i^l) G_i^l]. This matrix can be obtained by computing the tensor ∂f/∂G_i^l and projecting it onto the closest unitary matrix. This last step simply involves the calculation of the singular value decomposition (SVD) of the gradient and the substitution of all singular values by 1.
We, then, substitute the gate with the new optimum and proceed to the next gate. When all gates have been updated, the algorithm moves to the next iteration and runs the protocol over all gates again. The algorithm is repeated until a specific convergence criterion is reached.
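As a concrete illustration, the single-gate update described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; we assume the environment tensor ∂f/∂G has already been contracted into a 4×4 matrix, and the function names are ours:

```python
import numpy as np

def closest_unitary(M):
    """Project a square matrix onto the closest unitary matrix (in
    Frobenius norm): substitute all singular values of M = U S V^dag
    by 1, which gives U V^dag."""
    U, _, Vh = np.linalg.svd(M)
    return U @ Vh

def update_gate(env):
    """Given the environment matrix env = df/dG (the contracted network
    with the gate G removed, reshaped to 4x4), return the unitary G
    that maximizes Re Tr[env @ G]."""
    # The maximizer of Re Tr[env G] over unitaries is the polar unitary
    # of env^dag, i.e. the unitary closest to env^dag.
    return closest_unitary(env.conj().T)
```

Sweeping `update_gate` over all gates and iterating until the overlap stops improving reproduces the element-by-element scheme.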
As shown in Fig. 2, the proposed optimization protocol leads to a significant improvement over the layer-by-layer approach: (a) For the ground state of the Ising model, we achieve an improvement of up to one order of magnitude with just 10 iterations. We observe that the improvement is roughly inversely proportional to the number of iterations (the curves are equally spaced on a logarithmic scale). Hence, our protocol is efficient, in the sense that it can deliver infidelities arbitrarily close to the optimal value in polynomial time. (b) For random states, the variational circuit is not expressive enough to achieve low values of the infidelity. Nevertheless, we observe that our optimization algorithm achieves a significant improvement over the layer-by-layer scheme and converges after a small number of iterations.
To check the efficiency of our method, we compare it to a more direct approach that consists of, first, reducing the size of the MPS tensors to a truncated bond dimension χ_0 and, then, encoding the resulting state using multi-qubit gates. As mentioned above, an MPS of bond dimension χ_0 can be directly converted into a sequence of multi-qubit gates acting on k = log_2 χ_0 + 1 qubits. Constructing a general k-qubit gate requires many 2-qubit gates, whose number can be bounded from below by ⌈(4^k − 3k − 1)/9⌉, which is derived by counting the number of degrees of freedom encoded in a single k-qubit gate. Combining the lower bound and the relation between k and χ_0, we obtain that the direct approach with truncated bond dimension χ_0 requires the same number of 2-qubit gates as the optimized circuit with L = (4/9)χ_0^2 − (1/3)log_2(χ_0) − 4/9 layers [64]. In particular, the case of χ_0 = 2 is equivalent to a single layer of 2-qubit gates, L = 1, while χ_0 = 4 corresponds to 3-qubit gates and requires L = 6 layers. Although this formula is originally valid for full k-qubit gates, which are equivalent to χ_0 = 2^(k−1), it can also be viewed as a lower bound for any value of the bond dimension: for an integer χ_0 the number of degrees of freedom is proportional to χ_0^2, and the number of 2-qubit gates required to encode them can be counted similarly to the k-qubit case.
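The gate counting above can be made concrete in a short script. We assume the parameter-counting lower bound ⌈(4^k − 3k − 1)/9⌉ on the number of 2-qubit gates contained in a generic k-qubit unitary; the function names are illustrative:

```python
import math

def min_two_qubit_gates(k):
    """Parameter-counting lower bound on the number of 2-qubit gates
    in a generic k-qubit unitary (assumed bound: ceil((4^k - 3k - 1)/9))."""
    return math.ceil((4 ** k - 3 * k - 1) / 9)

def equivalent_layers(chi0):
    """Number of staircase layers L with the same 2-qubit-gate budget
    as the direct encoding with truncated bond dimension chi0."""
    return (4 / 9) * chi0 ** 2 - (1 / 3) * math.log2(chi0) - 4 / 9
```

For χ_0 = 2 (k = 2) both expressions give 1 layer, and for χ_0 = 4 (k = 3) they give 6 layers, matching the values quoted in the text.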
In Fig. 2 we compare the infidelities obtained with the direct method (dashed line) with those obtained by our optimization method (continuous lines). For the Ising model ground state, we find that at low depth the optimized infidelity is significantly lower than the infidelity of the direct encoding, while at higher depths the direct encoding is nearly optimal. In contrast, for the random MPS, the direct truncation method performs poorly in comparison to the optimized circuits. This comparison highlights that the optimized circuits take into account higher singular values of the MPS, which the truncation omits; when the tail of singular values is significant, they perform much better than the direct encoding approach.
Our optimization algorithm works under the assumption of ideal 2-qubit gates, and an important question is what happens in a real noisy device. Due to the limited capabilities of available quantum processors, we considered the minimal instance of a target state where the proposed optimized circuit is expected to achieve a significant advantage over the layer-by-layer method, namely a random-MPS state with N = 6 qubits, encoded with L = 3 layers; see the Methods section for details. Fig. 3 compares the ideal fidelity with the results obtained in a noisy simulator and on quantum computers by IonQ and IBM. The circuit obtained by our optimization method outperformed the layer-by-layer encoding, except in the case of the IBM Lagos quantum processor, whose faulty result is comparable to the fidelity of a random state, F = 1/2^N. As discussed earlier, the difference between the two encoding strategies is expected to grow with the accessible circuit depth.

IV. STATE ENCODING ON A QUANTUM COMPUTER
Up to this point, we have assumed that the target state is known classically and all matrix elements can be computed using efficient tensor networks. We now move to a situation where the target state is not known classically, but can be prepared on a quantum computer. Encoding a state directly on a quantum computer is a desired technique for several reasons: For example, this scenario is common in the study of nonequilibrium quantum effects, such as quantum quenches, where one aims at studying the time evolution of an initial state [5][6][7][8][61]. In addition, this approach is useful for the compilation of the initial layers of quantum machine learning algorithms [2,3,65,66]. Finally, it enables one to store the output of complex quantum calculations by obtaining a faithful shallow-circuit representation [67,68].
In all these applications, the goal is to encode the target state using a smaller number of gates [69]. In a real quantum processor, the tensor derivative of the circuit cannot be computed analytically, but needs to be evaluated using tensor tomography. This process, which we refer to as gradient tomography, can be performed by measuring all possible products of Pauli matrices acting on the four sites marked in red in Fig. 1(b) [70]. Due to the finite number of shots used to perform the tomography, the result of this procedure will include random fluctuations. For a typical number of shots of the order of a few thousand, the resulting uncertainties are of the order of a few percent. The optimization algorithm is expected to work only as long as the gradients are large enough for the first steps of the optimization process.
Under generic conditions, the overlap between the initial configuration of the variational circuit and the target state decreases exponentially with the number of qubits. As a consequence, the amplitude of the gradient is exponentially suppressed and may not be detected experimentally. This situation, often referred to as a barren plateau [71][72][73], is exemplified in Fig. 4(a) for the case of random MPS target states with bond dimension χ = 64, encoded using L = 5 layers. The figure displays the magnitude of the unitary gradient after the initialization of the circuit ansatz, at the beginning of the optimization algorithm. In the case of a random initialization, the gradient amplitude vanishes exponentially with the number of qubits and follows the same slope as the average fidelity between random states, F ∼ 1/2^N.
If the state is known in its MPS representation, one can try to beat the barren plateau by using a better initialization of the unitary gates. A smart initialization may bring the system to the proximity of the global minimum of the cost function, where the gradients are more pronounced. In our case, we can obtain a better initialization using the layer-by-layer algorithm. Fig. 4(a) shows the unitary gradient obtained by applying this algorithm to a single layer (blue) or to all 5 layers (green). Due to the initialization, the gradient is greatly increased, but still vanishes exponentially, ∼ 0.75^N. For N > 15 the gradient already reaches the threshold value of 10^(−2) (dashed line), beyond which shot noise becomes dominant and the optimization protocol is stuck in a barren plateau.
A common technique to address barren plateaus is to substitute the global cost function, I, with a local one. This approach was shown mathematically to solve the barren plateau problem for shallow quantum circuits, i.e., for circuits whose depth does not grow with the number of qubits [72]. Note that the mathematical theorems assumed a brick-wall gate structure, which is different from the present staircase gate connectivity. According to the definition of Ref. [72], the present circuit has depth L + N and, hence, lies beyond the regime of validity of the rigorous theorem. As we now show, our numerical calculations give evidence that the local cost function approach is valid in this case as well. We implement this strategy by introducing the local infidelity I_local,n = 1 − F_local,n, defined through the projection onto the 0 state of the n-th qubit. The fidelity F_local,n represents the probability to measure the n-th qubit in the 0 state, irrespective of the state of the other qubits, and is maximal (equal to 1) when the circuit prepares the target MPS exactly. As shown in Fig. 4(b), the usage of a local cost function solves the barren plateau for our problem: the gradient is no longer an exponential function of the number of qubits and does not cross the shot-noise threshold.
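In a classical simulation, the local fidelity can be evaluated directly from the statevector obtained by applying the inverse of the variational circuit to the target state. A minimal NumPy sketch (our naming and index convention, with qubit 0 as the leftmost tensor index):

```python
import numpy as np

def local_fidelity(state, n, num_qubits):
    """F_local,n: probability to measure qubit n in |0>, irrespective of
    the other qubits. `state` is the 2^N statevector obtained by applying
    the inverse of the variational circuit to the target state."""
    psi = np.reshape(state, [2] * num_qubits)
    # keep only the amplitudes with qubit n fixed to 0
    psi0 = np.take(psi, 0, axis=n)
    return float(np.sum(np.abs(psi0) ** 2))
```

If the circuit prepares the target exactly, the inverted circuit maps it back to |0...0⟩ and F_local,n = 1 for every n.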
A potential drawback of the local cost function is that one needs to compute the derivative of the quadratic functions F_local,n, rather than of the linear one, f; see Eqs. (2) and (3). As a consequence, finding the optimal gate requires the measurement of a larger tensor, potentially inducing more errors in the calculation. To estimate this effect, we test our optimization protocol on a phenomenological model that takes into account the finite number of shots per measurement, which we denote by M; see the Methods section for details. Fig. 5 compares the fidelity obtained using the global and local cost functions, for random-MPS target states. We find that, in the case of a small number of qubits, N ≤ 8, the local optimization is penalized by the larger number of measurements and leads to a larger infidelity. In contrast, for a large number of qubits, the local method performs better, as the polynomial scaling of the local gradient wins against the exponentially decreasing global gradient. The color coding (bright for small infidelities and dark for large infidelities) enables a clear separation between successful and unsuccessful instances: for the global cost function one needs an exponential number of shots, while for the local one, a polynomial number is sufficient.

V. CONCLUSION AND DISCUSSION
In this work, we addressed the problem of quantum state encoding, namely how to encode a target state using 1-qubit and 2-qubit gates only. We opened our study with MPSs, an efficient representation of quantum states with low levels of entanglement. Realizing these states on a quantum computer requires a circuit depth that grows at least linearly with the bond dimension χ. Here, we showed how to create a state that approximates the MPS using a smaller number of gates. Our approach focuses on quantum circuits with a fixed depth and aims at finding the gate configuration that offers the best approximation to the MPS. We described an optimization technique that iteratively improves each gate, until the optimal solution is found. Each step involves a singular value decomposition of a small tensor, whose elements can be computed efficiently using tensor network techniques and require limited resources only. By considering two extreme types of MPSs (low-entanglement states and random states), we demonstrated that our algorithm converges to a solution with a good fidelity. Empirically, we observed that the algorithm is not trapped in local minima, unlike other variational approaches where local minima are abundant [27].
Next, we considered the realization of the proposed element-by-element optimization algorithm directly on a quantum computer. This task is relevant to situations where the MPS is not known classically, such as in quantum dynamics simulations or when encoding the outputs of complex quantum calculations. In a quantum computer, the tensors can be estimated only up to a finite resolution, due to the finite number of measurements. In this case, the optimization algorithm encounters barren plateaus. We explained how to solve this problem by moving to a local cost function, and demonstrated its performance by considering the effects of realistic shot noise.
The main outcome of our work is a generic algorithm that allows one to encode a quantum state using a small number of gates. Unlike gradient-based optimization algorithms, this algorithm is monotonic, in the sense that each step necessarily brings us closer to the final goal. Its key ingredient is the direct estimation of the tensor derivative of the circuit and its translation into an improved set of gates. An important question that deserves future investigation is how to choose the circuit depth and the gate connectivity that will deliver the largest fidelity. On the one hand, by considering a larger number of layers and a more complex connectivity, one expects to obtain a larger fidelity. On the other hand, in the real world, adding more gates leads to more decoherence and noise and, hence, to a smaller fidelity. The optimal number will generically depend on the entanglement of the target state and on the gate faultiness of the device. Studying this dependence theoretically and experimentally is a fundamental step towards demonstrating the applicability of our approach.

VI. METHODS
Implementation on a real device. To test our method for MPS encoding on a real quantum processor, we chose a quantum state for which the difference between the layer-by-layer encoder and the optimized encoder is significant. Due to the limited capability of state-of-the-art devices, we chose a small random state with N = 6 qubits and encoded it using L = 3 layers [74]. We sampled 1000 random states and calculated the fidelity of the layer-by-layer encoder for each one. We then chose the random state that exhibited the lowest fidelity and applied our optimization algorithm to obtain a better encoder. The encoder circuits were transpiled into a sequence of native gates of single-qubit rotations and CNOT gates using Qiskit libraries, which implement an exact 2-qubit gate decomposition by a sequence of 3 CNOT gates interleaved with 1-qubit rotation gates, as detailed in [75]. The original (optimized) transpiled circuit contained 142 (163) 1-qubit rotations and 23 (24) CNOT gates. Next, we performed full quantum state tomography of the output state, using 3^6 = 729 circuits, with at least 200 measurement shots each. From these measurements, we estimated the density matrix of the final state, ρ_exp, and computed the fidelity to the target state as F = ⟨ψ_MPS|ρ_exp|ψ_MPS⟩.
The noisy simulation was performed using Qiskit Aer with noise parameters extracted from the ibm_lagos quantum processor. As shown in Fig. 3, the actual device performed much worse, and we attribute this discrepancy to the limited connectivity of the device: our algorithm assumes a linear connectivity between the 6 qubits, which is unavailable on the device. The Qiskit compiler solves this problem by adding SWAP gates between qubits that are not directly connected, but these extra gates lead to a much worse performance.
Realistic model of shot noise. Our optimization algorithm is based on the measurement of local gradients, obtained by disconnecting a single gate from the circuit. To estimate the effects of shot noise, we expanded the gradient tensors in sums of Pauli strings and then replaced the coefficient of each Pauli string by a random variable sampled from a binomial distribution with M trials. This substitution left the average value of the tensor unchanged and mimicked the result of a quantum measurement with M shots. Our model disregards the effects of phase noise, which may lead to a further suppression of the gradient, but is sufficient to obtain a qualitative estimate of the effects of shot noise on the gradients.
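A minimal sketch of this noise model for a single two-qubit tensor. We assume, for illustration, that the measured tensor is Hermitian with Pauli coefficients in [−1, 1]; the function names are ours:

```python
import numpy as np

# single-qubit Pauli matrices I, X, Y, Z
PAULIS = [np.eye(2), np.array([[0., 1.], [1., 0.]]),
          np.array([[0., -1j], [1j, 0.]]), np.array([[1., 0.], [0., -1.]])]

def noisy_tensor(T, M, rng):
    """Mimic an M-shot estimate of a 2-qubit tensor T: expand T in Pauli
    strings and replace each coefficient c by a binomial sample with the
    same mean, as if estimated from M +/-1-valued measurement outcomes."""
    T_noisy = np.zeros((4, 4), dtype=complex)
    for P in [np.kron(a, b) for a in PAULIS for b in PAULIS]:
        c = np.trace(P @ T).real / 4          # exact Pauli coefficient
        p = min(max((1 + c) / 2, 0.0), 1.0)   # probability of outcome +1
        c_noisy = 2 * rng.binomial(M, p) / M - 1
        T_noisy += c_noisy * P
    return T_noisy
```

As in the text, the estimate is unbiased and its fluctuations shrink as 1/sqrt(M).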

VII. DATA AVAILABILITY
All data and figures that support the findings of this paper are available from the corresponding author upon request. Please contact Matan Ben-Dov at matan.bendov@biu.ac.il.
[46] The A_i matrices are defined up to a gauge transformation, which can be fixed by selecting a center of orthogonality and demanding that all tensors on its left and right sides satisfy, respectively, Σ_j A_i^(j)† A_i^(j) = 1 and Σ_j A_i^(j) A_i^(j)† = 1. The gauge fixing procedure ensures that the tensors are injective and preserve orthogonality.
[47] This is analogous to adding columns to an orthogonal rectangular matrix until it becomes a square unitary matrix.
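This completion step can be sketched numerically: given a matrix with orthonormal columns (an isometry), one appends orthonormal columns spanning the orthogonal complement. A minimal NumPy sketch with illustrative names, where the appended columns play the role of the added input channels of the gate:

```python
import numpy as np

def complete_to_unitary(V):
    """Extend an isometry V (n x k, with V^dag V = 1) to an n x n unitary
    by appending orthonormal columns that span the orthogonal complement
    of the column space of V."""
    n, k = V.shape
    rng = np.random.default_rng(0)
    R = rng.normal(size=(n, n - k)) + 1j * rng.normal(size=(n, n - k))
    R = R - V @ (V.conj().T @ R)      # remove components along col(V)
    Q, _ = np.linalg.qr(R)            # orthonormalize the complement
    return np.hstack([V, Q])
```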
[52] The truncation of the bond dimension is a common procedure, obtained by contracting and re-separating each neighbouring pair of tensors through a singular value decomposition (SVD) and keeping the two largest singular values.
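A minimal NumPy sketch of this truncation step for a single pair of neighbouring tensors, using our shape convention (left bond, physical, right bond):

```python
import numpy as np

def truncate_bond(A, B, chi_max=2):
    """Truncate the bond shared by neighbouring MPS tensors A (chi_l, 2, chi)
    and B (chi, 2, chi_r): contract the pair, perform an SVD, and keep only
    the chi_max largest singular values."""
    chi_l = A.shape[0]
    chi_r = B.shape[2]
    theta = np.tensordot(A, B, axes=(2, 0)).reshape(chi_l * 2, 2 * chi_r)
    U, S, Vh = np.linalg.svd(theta, full_matrices=False)
    k = min(chi_max, len(S))
    A_new = U[:, :k].reshape(chi_l, 2, k)
    B_new = (S[:k, None] * Vh[:k]).reshape(k, 2, chi_r)
    return A_new, B_new
```

When chi_max is at least the rank of the contracted pair, the truncation is lossless.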
[53] From a mathematical perspective, the gates {G_i^l} can be described either as unitary operators or as tensors, where each approach has its own advantages and disadvantages. In particular, the tensor approach is convenient from a computational point of view, but misses the definitions of input and output channels and the associated unitarity.
[54] See also SI-4 for additional results using other system sizes, showing consistent results.
[55] The Hamiltonian of the Ising model is H = Σ_n (J_z S_n^z S_{n+1}^z − h_x S_n^x). This Hamiltonian is gapped for all h_x ≠ ±J_z. Its ground state is an example of a nontrivial state with low entanglement, which is faithfully described by an MPS. Here, we consider a ferromagnetic state with h_x/J_z = 0.6.
[56] The gradient described above is the unconstrained gradient, which does not take into account the unitarity requirement of the gate G_i^l. Gradient descent optimization algorithms follow only the part of the gradient which preserves unitarity, as detailed in SI-1; see also further reviews in Refs. [76][77][78].

Supplementary Information for "Approximate encoding of quantum states using shallow circuits"

SI-1. GRADIENT WITH RESPECT TO A UNITARY MATRIX
In the main text, we described an optimization algorithm using unitary matrices as variables, without the need for an explicit parameterization of the different gates. This approach gives us insight into the connection between gate connectivity, quantum information, tensor networks, and the performance of the circuit as a quantum state encoder. In this section, we provide more information on the definition of the derivative with respect to a general complex matrix and how it compares to the constrained derivative, i.e., the derivative with respect to a matrix that satisfies a unitarity constraint; see also Refs. [76][77][78].
First, we define the directional derivative of a function f({G_n}) with respect to the coordinate G_i. The directional derivative measures the change in the function when G_i is infinitesimally modified in a specific direction. The directional derivative evaluated at the point G_i in the direction Δ is

D_Δ f = lim_{ε→0} [f(..., G_i + εΔ, ...) − f(..., G_i, ...)]/ε. (S1)

This is a partial derivative, as we varied only the G_i coordinate.
To define the gradient, we need to consider how the directional derivative along an arbitrary direction can be derived as a projection of a single object onto the direction of the derivative. This link is expressed through an inner product,

D_Δ f = ⟨∇f, Δ⟩. (S2)

In our case, we consider real functions over the domain of complex matrices, whose inner product is the real part of the Hilbert-Schmidt inner product, ⟨A, B⟩ = Re[Tr(A†B)]. Taking the real part of the trace will be essential for computing the gradient of the different functions, as we will see below. To calculate the gradient, one needs to compute the directional derivative for a general direction, compare it with the inner product, and extract the gradient matrix from Eq. (S2).
We now describe a few examples relevant to our case. The first example is a function over the real matrices: f_overlap(A) = Tr(E^T A) is a simple linear function with respect to a real matrix A. Its directional derivative is D_Δ f_overlap = Tr(E^T Δ) = ⟨E, Δ⟩, and comparing the result to Eq. (S2) (now using the real version of the inner product), we conclude that in this case ∇_A f_overlap = E, as expected from a linear function.
For a function over the complex matrices, perfect linearity is not possible, but it is instructive to examine a similar function that takes the absolute value of the previous one, f_|overlap|(G) = |Tr(T†G)|, where now G and T are both complex matrices. In this case, D_Δ f_|overlap| = Re[e^{−iφ} Tr(T†Δ)], where φ = arg Tr(T†G), and the extracted gradient is ∇_G f_|overlap| = e^{iφ} T, which is the same as for the linear function shown in the previous example, up to a complex phase.
Another important example is the quadratic function f_quad(G) = −|Tr(T†G)|^2. Its directional derivative is D_Δ f_quad = −2 Re[Tr(G†T) Tr(T†Δ)], and comparing the derivative to the gradient definition we get ∇_G f_quad = −2 Tr(T†G) T.

Next, we consider the constrained case of unitary matrices. In this case, the above-mentioned gradient will not deliver the direction of the steepest slope, as stepping in the direction of the free gradient does not conserve the unitarity of the matrix. The unconstrained gradient should be modified to conserve the unitarity property. This can be obtained by projecting the gradient onto the tangent space of the unitary manifold U(n) from which the matrices are taken,

grad f = proj_G(∇f). (S3)

The projection can be viewed as filtering the gradient to keep only the part that conserves the unitarity condition: for a step G → G + εΔ, unitarity requires

(G + εΔ)†(G + εΔ) = 1. (S4)

For the condition to hold up to first order in ε, the resulting requirement is

G†Δ + Δ†G = 0, (S5)

and the formula of the constrained gradient for unitary matrices is

grad f = ∇f − G(∇f)†G. (S6)
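The gradient formula for the absolute-value function can be checked numerically against a finite-difference directional derivative, using the Re-trace inner product defined above. A sketch with arbitrary test matrices (our naming):

```python
import numpy as np

def directional_derivative(f, G, Delta, eps=1e-6):
    """Central-difference directional derivative of a real function f."""
    return (f(G + eps * Delta) - f(G - eps * Delta)) / (2 * eps)

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
G = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
Delta = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))

# f(G) = |Tr(T^dag G)|; claimed gradient e^{i phi} T, phi = arg Tr(T^dag G)
f = lambda X: abs(np.trace(T.conj().T @ X))
phi = np.angle(np.trace(T.conj().T @ G))
grad = np.exp(1j * phi) * T

numeric = directional_derivative(f, G, Delta)
analytic = np.trace(grad.conj().T @ Delta).real  # <grad, Delta>
```

The two values agree to the accuracy of the finite-difference step.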

SI-2. THE STEEPEST DESCENT AND ELEMENT BY ELEMENT OPTIMIZATION ALGORITHMS
In the main text we mentioned two optimization methods for our general 2-qubit gate ansatz: steepest descent and element-by-element optimization. We now expand on the two methods and their implementation for the optimization of the global and local cost functions.
In the gradient descent algorithm, the full gradient is calculated with respect to the different unitary gates, after which the gates are updated by a small step against the gradient. Usually the gradient is subtracted from the coordinate vector according to

G_i → G_i − η ∇_i F,

where η is the step size, which often varies adaptively along the optimization process, and ∇_i F is the unitary gradient along G_i as defined in Eq. (S6).
For small step sizes η ≪ 1, the descent in the direction of the gradient conserves unitarity up to first order in η, and the algorithm can be performed by infinitesimal steps according to

dG_i/dτ = −∇_i F,

which results in a "gradient flow" optimization, similar to the method described in [29].
For large step sizes, it is necessary to enforce unitarity after each optimization step by projecting the updated gate back onto the unitary manifold U(n). The projection brings a matrix T to its closest unitary counterpart according to

P(T) = U V†, with T = U Σ V†. (S10)

Here, we used the singular value decomposition (SVD) of the matrix and replaced all singular values by 1, in accordance with the unitarity requirement. One can verify this formula by checking that it solves the minimization of the trace distance between the matrix T and any unitary matrix W (as seen in the examples in SI-1), i.e., by checking that the unitary gradient vanishes for this particular choice. Another possibility is to perform the optimization by moving a step along a geodesic in the direction of the gradient, which is outside the scope of this paper.
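A minimal numpy sketch of the SVD-based projection described above (illustrative code, not the paper's implementation): replacing the singular values of T by 1 yields the unitary closest to T in Frobenius (trace) distance, which we check against randomly sampled unitaries.

```python
import numpy as np

def closest_unitary(T):
    # SVD projection onto U(n): T = U S V^dag  ->  P(T) = U V^dag
    u, _, vh = np.linalg.svd(T)
    return u @ vh

rng = np.random.default_rng(1)
T = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
W = closest_unitary(T)

# W is unitary, and no randomly sampled unitary is closer to T
assert np.allclose(W.conj().T @ W, np.eye(4))
best = np.linalg.norm(T - W)
for _ in range(100):
    q, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
    assert best <= np.linalg.norm(T - q) + 1e-12
```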
There are several differences between gradient descent performed directly on the unitary gates and gradient descent on an angle parameterization of the gates. Firstly, using the projection of a linear step onto the unitary manifold can limit the reachability of a single optimization step, and thus the number of steps needed to reach the proximity of the minimum can be larger if the initial guess is far away. On the other hand, using a gate parameterization induces directional biases in the optimization process. For further details the reader may refer to [28,29].
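Putting the pieces together, the following toy sketch (our own illustration, with an illustrative cost function, not the paper's implementation) runs projected gradient descent on a single unitary: compute the unconstrained gradient, project it onto the tangent space, step against it, and retract onto U(n) with the SVD projection.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 4, 0.05

def closest_unitary(M):
    # SVD retraction onto U(n)
    u, _, vh = np.linalg.svd(M)
    return u @ vh

# Target and initial unitaries from QR of random complex matrices
T, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
G, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))

for step in range(2000):
    # Unconstrained gradient of f(G) = -|Tr(T^dag G)|^2
    grad = -2 * np.trace(T.conj().T @ G) * T
    # Unitary (tangent-space) gradient at G
    X = G.conj().T @ grad
    grad_u = G @ (X - X.conj().T) / 2
    # Step against the gradient, then project back onto U(n)
    G = closest_unitary(G - eta * grad_u)

final_overlap = abs(np.trace(T.conj().T @ G))
print(final_overlap)  # should approach the maximum n = 4
```

Minimizing f drives |Tr(T†G)| toward its maximum value n, attained when G equals T up to a global phase; without the retraction, repeated linear steps would drift off the unitary manifold.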
The element-by-element optimization is based on breaking the larger optimization problem into small tasks that can be solved analytically. By optimizing a single gate G_i^l at a time while keeping the others fixed, we can find the single gate that yields the lowest cost with respect to the selected cost function. This sub-optimization can be carried out analytically when the cost function is linear or quadratic in G_i^l, which speeds up the optimization process considerably.
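The sweep can be illustrated with a self-contained toy example (our own sketch with hypothetical helper names, not the paper's code): a one-layer staircase of two 2-qubit gates encoding a random 3-qubit target state. The overlap is linear in each gate, so the environment tensor is probed with matrix units and each gate is replaced by the unitary closest to (the conjugate of) its environment.

```python
import numpy as np

rng = np.random.default_rng(2)

def closest_unitary(M):
    # SVD projection onto the closest unitary
    u, _, vh = np.linalg.svd(M)
    return u @ vh

def apply_gate(state, gate, qubits, n=3):
    # Apply a 4x4 gate to a pair of qubits of an n-qubit state vector
    psi = np.moveaxis(state.reshape([2] * n), qubits, (0, 1))
    psi = (gate @ psi.reshape(4, -1)).reshape([2] * n)
    return np.moveaxis(psi, (0, 1), qubits).reshape(-1)

def overlap(g1, g2, target):
    # <target| G2 G1 |000> for the staircase circuit
    psi0 = np.zeros(8, complex); psi0[0] = 1.0
    psi = apply_gate(psi0, g1, (0, 1))
    psi = apply_gate(psi, g2, (1, 2))
    return np.vdot(target, psi)

def environment(which, g1, g2, target):
    # The overlap is linear in the chosen gate; probe it with matrix units
    env = np.zeros((4, 4), complex)
    for i in range(4):
        for j in range(4):
            unit = np.zeros((4, 4), complex); unit[i, j] = 1.0
            env[i, j] = overlap(unit, g2, target) if which == 0 else overlap(g1, unit, target)
    return env

target = rng.normal(size=8) + 1j * rng.normal(size=8)
target /= np.linalg.norm(target)

g1 = np.eye(4, dtype=complex)
g2 = np.eye(4, dtype=complex)
f_init = overlap(g1, g2, target)

for sweep in range(20):
    for which in (0, 1):
        env = environment(which, g1, g2, target)
        # The unitary maximizing |Tr(env^T G)| is the one closest to conj(env)
        new = closest_unitary(env.conj())
        g1, g2 = (new, g2) if which == 0 else (g1, new)

fidelity = abs(overlap(g1, g2, target)) ** 2
print(f"fidelity = {fidelity:.3f}")
```

Since each update maximizes the absolute overlap over one gate with the others fixed, the fidelity is non-decreasing across the sweep, mirroring the monotone convergence of the element-by-element scheme.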
The gradient descent algorithm is agnostic to the type of cost function and can be applied to both the global and the local cost function described in the paper. In contrast, the element-by-element optimization algorithm has to be adjusted according to the choice of cost function. For the global cost function, we take the cost function to be the infidelity between the encoded state and the target state, F_global = 1 − |⟨ψ_target|ψ_encoded⟩|². In this case we can choose to optimize the overlap function Tr(G_i^l ∇_{i,l}F) and find its maximum, which coincides with the minimum of the cost function. Looking at the overlap is useful here, as up to a complex phase the overlap contains only terms linear in G_i^l, which simplifies the optimization algorithm. The overlap function can be expressed in terms of tensor calculus as the contraction between the gate G_i^l and the environment tensor ∇_{i,l}f, as described in Fig. 1. To optimize F_global({G_i^l}), we go over all the gates in the circuit, replacing each G_i^l with its optimal counterpart, the gate which maximizes the fidelity function. The process is repeated over many iterations until the cost function converges to a minimal value. As before, the optimal gate used at each step is the unitary closest to the matrix ∇_{i,l}f in trace distance, which can be found by the SVD algorithm in Eq. (S10). For the local cost function, the element-by-element optimization cannot be reduced to the optimization of a function linear in G_i^l, and so the optimization is performed on terms quadratic in G_i^l.

FIG. 1. (a) Tensor diagram of the overlap between the target state |ψ_target⟩ and a staircase circuit with N = 4 qubits and L = 3 layers. (b) Gradient of the overlap used in the global optimization protocol. (c) Gradient of the fidelity used to optimize the local cost function. The red dots denote the indices of the resulting tensor.

FIG. 2. Infidelity I as a function of the number of layers L for two target states with N = 12 qubits and bond dimension χ = 64: (a) low-entanglement ground state of the Ising model; (b) random MPS state. The infidelity obtained for different numbers of iterations is compared to the layer-by-layer approach [45] and to the direct bound given in the main text.

FIG. 4. Magnitude of the unitary part of the initial gradient grad_G F for L = 5 layers, as a function of the number of qubits N, for different initializations: (a) global cost function and (b) local cost function. For the global cost function (a), the gradient decreases exponentially with N for all initializations, while for the local cost function it decreases only polynomially. The dashed line represents the threshold for resilience to realistic shot noise (see text).
FIG. 5. Infidelity I obtained by 20 iterations of the global and local optimization algorithms, as a function of the number of qubits and shots, for a random-MPS target state with L = 5 layers and χ = 64.

Figs. S1 and S2 show the infidelity of the optimized encoding obtained for random MPS target states with N = 18 and N = 24 qubits, respectively. These results are consistent with Fig. 2 of the main text, the only difference being that for a random MPS target state the encoding delivers larger values of infidelity.