Riemannian geometry and automatic differentiation for optimization problems of quantum physics and quantum technologies

Optimization with constraints is a typical problem in quantum physics and quantum information science that becomes especially challenging for high-dimensional systems and complex architectures like tensor networks. Here we use ideas of Riemannian geometry to perform optimization on manifolds of unitary and isometric matrices as well as the cone of positive-definite matrices. Combining this approach with the up-to-date computational methods of automatic differentiation, we demonstrate the efficacy of the Riemannian optimization in the study of the low-energy spectrum and eigenstates of multipartite Hamiltonians, variational search of a tensor network in the form of the multiscale entanglement-renormalization ansatz, preparation of arbitrary states (including highly entangled ones) in the circuit implementation of quantum computation, decomposition of quantum gates, and tomography of quantum states. Universality of the developed approach together with the provided open source software enable one to apply the Riemannian optimization to complex quantum architectures well beyond the listed problems, for instance, to the optimal control of noisy quantum systems.

In quantum physics problems, however, the elements to be optimized are usually restricted to some class. For instance, in the circuit implementation of quantum computation, the elementary operations are to be unitary. Similarly, tensor network representations of multipartite quantum states impose restrictions on the constituent tensors [44]. Finally, a general quantum state is given by a density matrix, i.e., a positive-semidefinite matrix with unit trace. Any valid optimization over such computation elements must respect the corresponding restrictions, which poses an important problem of how to perform optimization on specific manifolds (e.g., a manifold of unitary or isometric matrices). This is the subject of Riemannian optimization [45-49], which adapts the gradient-based methods in such a way that the incremented element still satisfies the desired property (unitarity, isometry, positive semidefiniteness, etc.). For instance, the Riemannian optimization with the orthogonality constraint K^T K = K K^T = I on the real kernel K of a recurrent neural network has been used to avoid the gradient explosion or vanishing in Refs. [50,51].
In the present work, we develop the first-order Riemannian optimization methods over corresponding manifolds and combine them with automatic differentiation to solve various problems of quantum physics. Working with the manifold of unitary matrices, we demonstrate how to prepare an arbitrary multiqubit quantum state (including a highly entangled one) or a gate via circuit quantum computation. Working with the manifold of isometric matrices (complex Stiefel manifold), we compare the first-order Riemannian optimization methods with an algorithm that is standard for optimization of isometric tensor networks such as multiscale entanglement-renormalization ansatz (MERA) [52] and unitary hierarchical tensor networks [53]. In the case of a local Hamiltonian for a multiqubit system, we also perform optimization over a tensor network in the form of MERA and find the low-lying spectrum of a Hamiltonian with no need to explicitly calculate the reduced states and environments thanks to the automatic differentiation. Working with the cone of positive definite matrices, we derive the Riemannian gradient for two types of metric. We compare the different first-order Riemannian optimization methods by solving the illustrative problem of the ground state search and apply these methods to tomography of quantum states via the maximum likelihood estimation. Our findings demonstrate that the Riemannian optimization and the automatic differentiation form a new powerful numerical tool for solving timely problems of quantum technologies. We also provide an open-source library [54] that allows performing optimization with many natural "quantum" constraints including the constraints of positive-definiteness and isometry considered in this paper.
The paper is organized as follows. In Section II, we overview some basic machinery of Riemannian geometry needed in solving optimization problems. In Section III, we discuss some examples of the Riemannian manifolds emerging in the fields of quantum physics and quantum information science. In Section IV, we highlight the concept of the Riemannian gradient and describe how to restrict first-order optimization methods to a Riemannian manifold. In Section V, we overview the automatic differentiation algorithm. In Section VI, we overview known methods for the Riemannian optimization on the complex Stiefel manifolds of unitary and isometric matrices. In Section VI A, we compare those methods with the conventional algorithm for optimization of isometric tensor networks. In Sections VI B, VI C, and VI D, we combine those methods with automatic differentiation to solve some problems of quantum control (aimed at creating a desired multiqubit entangled state or decomposing a quantum gate). In Section VII, we combine the Riemannian optimization methods on the complex Stiefel manifolds with the automatic differentiation to search for a ground state of local Hamiltonians via MERA tensor network. In Section VIII, we consider the Riemannian geometry for a cone of positive definite matrices and derive the Riemannian gradients for two types of metric, which enables us to perform the Riemannian optimization over general quantum states and solve different optimization problems with positivity constraints, e.g., the tomography of quantum states. Finally, we make conclusions in Section IX.

II. RIEMANNIAN GEOMETRY
In this section we informally introduce some necessary preliminaries and notation of Riemannian geometry. For an in-depth introduction we recommend Refs. [45,55,56]. A manifold M of dimension n is a set that can be locally approximated by the Euclidean n-dimensional space. A manifold can be seen as a high-dimensional generalization of a smooth surface. For example, the sphere {x ∈ R^{n+1} | ‖x‖ = 1} is an n-dimensional manifold embedded in R^{n+1}. Each point x ∈ M is equipped with the tangent space T_x M, which is an n-dimensional vector space. For example, a point x from the two-dimensional unit sphere {x ∈ R^3 | ‖x‖ = 1} has the tangent space T_x M that is merely a tangent plane to the sphere at point x, see Fig. 1. One can define a tangent bundle T M that is a set consisting of all pairs (x, T_x M). Each tangent space T_x M is equipped with an inner product g_x : T_x M × T_x M → R. A manifold M with the inner product g_x is called a Riemannian manifold. The inner product g_x defines a distance in the neighborhood of a point x. A curve on the manifold M is a smooth mapping γ : [t_i, t_f] → M. The length of a curve is defined through the inner product as follows: L(γ) = ∫_{t_i}^{t_f} √( g_{γ(t)}(γ̇(t), γ̇(t)) ) dt. If γ_0 is a local minimum of L(γ) (with the endpoints fixed), then γ_0 is called a geodesic. One can prove that for every v_x ∈ T_x M there exists a unique geodesic γ_0^{v_x} such that

γ_0^{v_x}(0) = x,  γ̇_0^{v_x}(0) = v_x. (1)

Using the fact above one can introduce the exponential map [45].

Definition 1 The mapping

Exp_x : T_x M → M,  Exp_x(v_x) = γ_0^{v_x}(1), (2)

is called the exponential map at a point x.
Basic operations in the Euclidean space are the movements of a point (from one position to another) and the vector transport. Definition 1 of the exponential map is nothing else but a generalization of the point movement to the case of the Riemannian manifold [45][46][47]50]. However, the exponential map is often computationally inefficient in practice due to a general complexity of geodesics. A less natural but more computationally efficient way to generalize the Euclidean point movement to a Riemannian manifold is to use a retraction defined below [45].
FIG. 1. A curve γ(t) on M, with the points γ(t_i) and γ(t_f) being the initial point and the final point of the curve, respectively. The geodesic mapping γ_0 corresponds to a curve [solid line γ_0(t)] that has the minimal length among curves (in some neighborhood) that start at a fixed point γ(t_i) and end at a fixed point γ(t_f). The point x ∈ M is equipped with the tangent space T_x M (blue plane), which contains two vectors v_x and w_x. The solid line between the points x and Exp_x(v_x) ∈ M is a geodesic curve [denoted γ_0^{v_x}(t)] such that conditions (1) are satisfied. The dotted curve between x and R_x(v_x) is not a geodesic curve in general; however, it has the same tangent vector at the point x as the geodesic curve γ_0^{v_x}(t). The retraction R is not unique, but it defines an alternative point movement x → R_x(v_x) and the vector transport w_x → τ_{x,v_x}(w_x) from the tangent space T_x M to the tangent space T_{R_x(v_x)} M (violet plane).
Definition 2 A retraction on a manifold M is a smooth map R : T M → M such that for every x ∈ M the map R_x := R|_{T_x M} satisfies (i) R_x(0_x) = x, where 0_x is the zero element of T_x M, and (ii) dR_x(t v_x)/dt|_{t=0} = v_x for all v_x ∈ T_x M (the local rigidity condition). One can see that a set of points {R_x(t v_x)}_{t∈R} is a smooth curve generalizing the concept of the point movement beyond the exponential map. If one needs to move a point x along a curve, whose direction is aligned with the vector v_x, then R_x(v_x) can be considered as a result of that movement, see Fig. 1. The map R_x is not uniquely defined in contrast to the exponential map Exp_x; however, the former is usually a computationally cheap alternative to the latter. In fact, the exponential map is metric-dependent, so no analytical expression is known in general (an example is the manifold of isometric matrices with a non-canonical metric g_x [48]). In contrast, the retraction is metric-independent (see Definition 2); by the local rigidity condition, R_x(t v_x) is a first-order approximation to Exp_x(t v_x) with respect to a small parameter t, see Fig. 1.
The Euclidean vector transport (displacement of a vector along another vector) can be generalized to a Riemannian manifold in the following way [45-47,50], see also Fig. 1.

Definition 3 A vector transport τ is a smooth map defined on top of a retraction R of a manifold M that maps a triple (x ∈ M, v_x ∈ T_x M, w_x ∈ T_x M) to a vector τ_{x,v_x}(w_x) ∈ T_{R_x(v_x)} M satisfying the following properties for all x ∈ M: (i) τ_{x,0_x}(w_x) = w_x for all w_x ∈ T_x M; (ii) the map τ_{x,v_x} is linear in w_x.

Similarly to the retraction, the vector transport is not unique. However, the vector transport is usually a computationally cheap alternative to the unique parallel transport, which we do not consider here for the sake of brevity.
Let us consider a basic example of the retraction and the vector transport. Let M be a submanifold of R^n; then for a differentiable projection π : R^n → M, π ∘ π = π, the map

R_x(v_x) = π(x + v_x) (3)

is a retraction [50]. Let P_x : R^n → T_x M, P_x^2 = P_x, be a linear projector onto the tangent space T_x M. Then the map

τ_{x,v_x}(w_x) = P_{R_x(v_x)}(w_x) (4)

is a vector transport on top of the retraction (3). One can see that the maps (3) and (4) satisfy Definitions 2 and 3, respectively.
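The projection-based constructions (3) and (4) can be tried out on the unit sphere, the running example of this section. The choices π(a) = a/‖a‖ and P_x = I − x x^T below are the standard ones for the sphere; this is an illustrative sketch, not code from the accompanying library.

```python
import numpy as np

def pi(a):
    """Projection onto the unit sphere: pi(a) = a / ||a||, so pi(pi(a)) = pi(a)."""
    return a / np.linalg.norm(a)

def tangent_projector(x):
    """Linear projector P_x = I - x x^T onto the tangent plane T_x M."""
    return np.eye(len(x)) - np.outer(x, x)

def retract(x, v):
    """Retraction (3): R_x(v) = pi(x + v)."""
    return pi(x + v)

def transport(x, v, w):
    """Vector transport (4): tau_{x,v}(w) = P_{R_x(v)} w."""
    return tangent_projector(retract(x, v)) @ w

x = pi(np.array([1.0, 2.0, 2.0]))
v = tangent_projector(x) @ np.array([0.3, -0.1, 0.2])  # a tangent vector at x
y = retract(x, v)
w = transport(x, v, v)

on_manifold = abs(np.linalg.norm(y) - 1.0) < 1e-12   # R_x(v) stays on the sphere
w_is_tangent = abs(np.dot(w, y)) < 1e-12             # transported vector lies in T_y M
zero_ok = np.allclose(retract(x, 0 * v), x)          # R_x(0_x) = x (Definition 2)
```

The three flags at the end check exactly the defining properties quoted in Definitions 2 and 3.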

III. RIEMANNIAN MANIFOLDS IN QUANTUM PHYSICS
The complex Stiefel manifold is a Riemannian manifold V_{n,p} consisting of all n × p isometric matrices, n ≥ p [48]. Formally, V_{n,p} = {Z ∈ C^{n×p} | Z†Z = I}. If n = p, then V_{n,n} is the manifold of unitary n × n matrices. Unitary matrices describe the time evolution of a closed quantum system as the evolution operator U = exp(−iHt) is unitary provided the Hamiltonian H is Hermitian. Therefore, the complex Stiefel manifold V_{n,n} is of great use in the circuit quantum computation utilizing unitary gates. Coherent quantum control also operates with unitary transformations [57,58], so optimization algorithms on the Stiefel manifold of unitary matrices are of great need to efficiently manipulate quantum systems and design quantum algorithms [59].
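The two facts used above — that a QR-type construction yields a point of V_{n,p}, and that a unitary evolution operator maps the manifold to itself — are easy to verify numerically; a minimal sketch (the eigendecomposition route to exp(−iHt) is our choice, nothing paper-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3

# A random point of V_{n,p}: QR decomposition of a complex Gaussian matrix
# yields an isometry Z with Z^dag Z = I_p.
A = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))
Z, _ = np.linalg.qr(A)

# A random Hermitian "Hamiltonian" and the evolution operator U = exp(-iHt),
# built from the eigendecomposition H = V diag(lam) V^dag.
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (B + B.conj().T) / 2
lam, V = np.linalg.eigh(H)
t = 0.7
U = V @ np.diag(np.exp(-1j * lam * t)) @ V.conj().T

isometry_ok = np.allclose(Z.conj().T @ Z, np.eye(p))
unitary_ok = np.allclose(U.conj().T @ U, np.eye(n))
stays_on_manifold = np.allclose((U @ Z).conj().T @ (U @ Z), np.eye(p))
```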
General n × p isometric matrices provide a useful parameterization of quantum channels (completely positive and trace preserving maps) via the Stinespring dilation theorem [60,61]. For a given quantum channel, the size of the corresponding isometric matrix depends on the dimension d of the quantum system and the Kraus rank of the channel that does not exceed d 2 [60]. Thanks to the Stinespring dilation, optimization methods on the Stiefel manifold of isometric matrices provide new efficient ways to perform optimization on the set of quantum channels, for instance, to perform the maximum likelihood estimation of quantum channels (process tomography) and reconstruct both Markovian and non-Markovian open system dynamics [22]. In fact, many algorithms for learning non-Markovian quantum dynamics are based on optimization techniques with restrictions [22,[62][63][64][65].
Another wide area for application of isometric and unitary matrices is quantum tensor networks [44]. Some tensor networks such as MERA [52,[66][67][68] consist of isometric and unitary tensors. As MERA is able to reproduce a wide range of exponentially and polynomially decreasing correlations in multipartite quantum systems, developing an optimization algorithm on the complex Stiefel manifold would allow us to significantly simplify and improve the simulation of many-body quantum systems including the search of low-energy spectrum and eigenstates for local Hamiltonians. In this paper, we accomplish this research programme in Section VII. Similar ideas for the use of optimization algorithms on the Stiefel manifold for MERA were independently proposed in Ref. [69], preprint of which was released a couple of days after the preprint release of the present work [70].
Another frequently used manifold is the set of density matrices, i.e., Hermitian positive-semidefinite operators with unit trace. Density operators are natural elements of dynamical optimization problems as initial states (say, in the problem of entanglement robustness against a specific noise [71]). The cone of positive-semidefinite matrices comprises both the unnormalized quantum states and the effects of positive operator-valued measures (POVMs), so optimization problems involving density operators and effects can be reduced to the optimization problems on the cones of positive-semidefinite operators. Duality between operators and maps on quantum states makes the cone of positive-semidefinite matrices important in the study of general completely positive maps too [72]. In the present paper, we restrict our analysis to the cone of positive-definite matrices S_n^{++} = {ϱ ∈ C^{n×n} | ϱ = ϱ†, ϱ > 0} and show that such a restriction to positive-definite matrices instead of positive-semidefinite ones does not prohibit approaching boundary points of the cone. We stick to the ground state search by minimizing the variational energy [73,74] and the experiment-friendly tomography of quantum states by maximizing the likelihood function [75,76]. Some other problems where the cone of positive-definite matrices can be of help are listed in Section IX.
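For illustration, one common way (our choice here, not necessarily the parameterization used later in the paper) to produce a point of S_n^{++}, and from it a density matrix, is the factorization ϱ ∝ A A† with a small ridge added for strict positivity:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# rho = A A^dag / tr(A A^dag): Hermitian, positive definite (with the ridge),
# and unit trace -- i.e., a valid density matrix in the interior of the cone.
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
rho = A @ A.conj().T + 1e-12 * np.eye(n)
rho = rho / np.trace(rho).real

eigs = np.linalg.eigvalsh(rho)
is_hermitian = np.allclose(rho, rho.conj().T)
is_positive = bool(eigs.min() > 0)
unit_trace = abs(np.trace(rho).real - 1.0) < 1e-12
```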

IV. RIEMANNIAN OPTIMIZATION
Suppose one wants to find a local minimum of some differentiable function f(x), where x ∈ R^n. A solution of this optimization problem can be attained by the following iterative scheme called the gradient descent (GD) with momentum [77]:

m_{t+1} = β m_t + ∇f(x_t), (5)
x_{t+1} = x_t − η m_{t+1}, (6)

where x_t is a current approximation of the minimum point, m_t is a vector called the momentum, η is an optimization step size, ∇f(x_t) is the gradient of the function f at the point x_t, and β is a constant that ranges from 0 to 1 (a typical value is 0.9). All components of the momentum vector are initially equal to zero. The starting point x_0 is an arbitrary point in R^n. However, if there is a restriction that x ∈ M, then the algorithm (5)-(6) fails as it cannot guarantee x_{t+1} ∈ M provided x_t ∈ M. The algorithm (6) fails even if m_{t+1} ∈ T_{x_t} M, so the GD approach has to be modified. In what follows, we consider a generalization of the first-order optimization method (5)-(6) to Riemannian geometry. To do that we use the concept of the Riemannian gradient at a point x [45-47,50] that belongs to the tangent plane T_x M, see Fig. 2.
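A toy run of the unconstrained scheme (5)-(6) on an ill-conditioned quadratic (the test function and hyperparameters below are hand-picked for illustration):

```python
import numpy as np

# Minimize f(x) = 0.5 x^T D x with condition number 100; grad f(x) = D x.
D = np.diag([1.0, 10.0, 100.0])
grad = lambda x: D @ x

x = np.array([1.0, 1.0, 1.0])
m = np.zeros(3)
eta, beta = 0.01, 0.9
for _ in range(500):
    m = beta * m + grad(x)   # Eq. (5): momentum update
    x = x - eta * m          # Eq. (6): point update

converged = np.linalg.norm(x) < 1e-6   # the minimum is x = 0
```

With β = 0 (plain GD) the same step size would converge much more slowly on the stiff direction; the momentum term is what makes the iteration contract at a rate set by √β for all curvature scales here.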
To introduce the Riemannian gradient, in addition to the inner product g_x we also need some reference inner product ⟨·, ·⟩ that is independent of x. If we dealt with real n × p matrices A and B, then we would use the Frobenius inner product (equivalently, the Hilbert-Schmidt inner product), i.e., ⟨A, B⟩ = tr(A^T B) = tr(A B^T) [48]. For complex matrices A and B we still want the inner product to be a real number, so we follow the lines of Ref. [78] and use the isomorphic real matrices φ(A) and φ(B), respectively, such that

φ(A) = ( (Re A, −Im A), (Im A, Re A) ),  tr[φ(A)^T φ(B)] = 2 Re tr(A†B).

In view of this, we define the inner product for complex matrices A and B by ⟨A, B⟩ = 2 Re tr(A†B). As the mapping A → φ(A) embeds our manifold of complex matrices in a Euclidean space, we refer to the introduced inner product as the Euclidean inner product. The authors of Ref. [69] use a similar definition of the Euclidean inner product (without the factor 2).
The Riemannian gradient ∇_R f(x) of a function f at a point x ∈ M is the vector of the tangent space T_x M satisfying

g_x(∇_R f(x), v_x) = ⟨∇f(x), v_x⟩ for all v_x ∈ T_x M, (7)

where ⟨·, ·⟩ is the Euclidean inner product and g_x(·, ·) is the inner product in T_x M.
We now proceed to the generalization of Eqs. (5)-(6) to Riemannian geometry. The first step is to replace ∇f(x_t) by the Riemannian gradient ∇_R f(x_t) ∈ T_{x_t} M in the momentum update:

m_{t+1} = β m_t + ∇_R f(x_t). (8)

The second step is to make a retraction of the modified increment −η m_{t+1} at the point x_t, which results in the new update for the point:

x_{t+1} = R_{x_t}(−η m_{t+1}). (9)

If we consider the Riemannian version of an optimization algorithm with the momentum, then one needs to perform the third step and transport the momentum vector to the new point x_{t+1}:

m_{t+1} → τ_{x_t, −η m_{t+1}}(m_{t+1}). (10)

All other first-order optimization methods are generalized to work on the Riemannian manifolds in a similar way. For example, the adaptive moment estimation algorithm (Adam) [9,46,47] and the adaptive magnitude stochastic gradient algorithm (AMSGrad) [46,79] use a modification of Eqs. (8) and (9), where three hyperparameters β_1, β_2, and ε are used instead of the single hyperparameter β. In the present work, we consider generalized versions of the conventional GD, the GD with momentum, Adam, and AMSGrad.
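The three steps above (momentum update with the Riemannian gradient, retraction of the increment, momentum transport) can be sketched on the unit sphere, where the projected gradient, the normalization retraction, and the projector transport are the standard choices. Minimizing f(x) = x^T C x over ‖x‖ = 1 should recover the smallest eigenvalue of C; the hyperparameters are hand-picked:

```python
import numpy as np

C = np.diag([3.0, 1.0, 2.0])                 # smallest eigenvalue 1 at x = ±e_2
pi = lambda a: a / np.linalg.norm(a)         # retraction via projection
P = lambda x: np.eye(3) - np.outer(x, x)     # projector onto T_x M

x = pi(np.array([1.0, 1.0, 1.0]))
m = np.zeros(3)
eta, beta = 0.05, 0.5
for _ in range(1000):
    g = P(x) @ (2 * C @ x)   # Riemannian gradient = projected Euclidean gradient
    m = beta * m + g         # step 1: momentum update in T_x M
    x_new = pi(x - eta * m)  # step 2: retraction of the increment
    m = P(x_new) @ m         # step 3: transport the momentum to the new point
    x = x_new

fmin = x @ C @ x
fmin_ok = abs(fmin - 1.0) < 1e-6             # converged to the smallest eigenvalue
on_sphere = abs(np.linalg.norm(x) - 1.0) < 1e-9
```

Note that without step 3 the stale momentum would no longer lie in the tangent space of the new point, which is exactly the failure mode of the Euclidean scheme described above.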

V. AUTOMATIC DIFFERENTIATION
One of the most practical ways of using the Riemannian optimization is to combine it with the automatic differentiation algorithm [43]. The automatic differentiation redirects the whole procedure of the gradient calculation to a computer. It dramatically simplifies the implementation of various numerical algorithms. The automatic differentiation is essentially an algorithmic way of thinking about the chain rule for derivatives. Assume one needs to evaluate the derivative of a composite function L(x, y) = g(f(x), p(y)) at a point (x_0, y_0). Here, f : R^n → R^s, p : R^m → R^t, and g : R^s × R^t → R^k are differentiable maps, and the algorithm internally operates with Jacobians such as the s × n matrix ∂f(x)/∂x|_{x=x_0} and the t × m matrix ∂p(y)/∂y|_{y=y_0}, returning the derivatives of L with the machine precision. The automatic differentiation algorithm consists of two steps. At the first step (called the forward propagation) we sequentially evaluate the values of all functions at the given point (x_0, y_0): a = f(x_0), b = p(y_0), L = g(a, b). At the second step we sequentially evaluate the derivatives of the function L with respect to all intermediate variables. For the sake of simplicity we denote ∂L/∂x = x̄. In this notation, L̄ = ∂L/∂L = I, where I is the k × k identity matrix. Following the chain rule, we sequentially get ā = L̄ (∂g/∂a), b̄ = L̄ (∂g/∂b), x̄ = ā (∂f/∂x), ȳ = b̄ (∂p/∂y). This step is called the back propagation because it progresses in the opposite direction in comparison with the forward propagation. Combining the forward and backward propagations, we evaluate the desired derivatives of L with the machine precision. This procedure is automatically implemented for arbitrary composite functions, e.g., in TensorFlow [80], and is known as the automatic differentiation. The automatic differentiation is used for the gradient-based training of various machine learning models including neural networks and tensor networks [40]. In the following sections we show that the automatic differentiation is applicable to the calculation of gradients of objective functions during the Riemannian optimization. It makes the application of the Riemannian optimization easy and straightforward for any problem, e.g., the variational energy minimization for tensor networks such as MERA.
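The forward/backward passes can be unrolled by hand for a tiny scalar example; the toy functions f(u) = u² and p(a, y) = sin(a)·y below are our own illustration, with the result cross-checked against a central finite difference:

```python
import math

x0, y0 = 0.7, 1.3

# Forward propagation: evaluate intermediates at (x0, y0).
a = x0 * x0              # a = f(x0)
L = math.sin(a) * y0     # L = p(a, y0)

# Back propagation: chain rule in reverse, starting from dL/dL = 1.
L_bar = 1.0
a_bar = L_bar * math.cos(a) * y0   # dL/da
x_bar = a_bar * 2 * x0             # dL/dx = (dL/da) * (da/dx)
y_bar = L_bar * math.sin(a)        # dL/dy

# Central finite differences for comparison.
eps = 1e-6
fd_x = (math.sin((x0 + eps) ** 2) * y0 - math.sin((x0 - eps) ** 2) * y0) / (2 * eps)
fd_y = (math.sin(x0 ** 2) * (y0 + eps) - math.sin(x0 ** 2) * (y0 - eps)) / (2 * eps)

x_ok = abs(x_bar - fd_x) < 1e-6
y_ok = abs(y_bar - fd_y) < 1e-6
```

Frameworks like TensorFlow perform exactly this bookkeeping automatically for arbitrary computation graphs, which is what the following sections rely on.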

VI. RIEMANNIAN OPTIMIZATION ON THE STIEFEL MANIFOLD
Consider the Stiefel manifold V_{n,p} of n × p isometric matrices with complex entries, n ≥ p. Suppose X ∈ V_{n,p}; then exp(−iHt)X ∈ V_{n,p} if t ∈ R and H is an n × n Hermitian matrix. The tangent space T_X V_{n,p} consists of all n × p matrices of the form −iHX, H = H†. By the definition of the Riemannian manifold, we must introduce the inner product g_X for elements of the tangent space T_X V_{n,p}. If g_X(A, B) = 2 Re tr(A†B), then this inner product is independent of X and defines the Euclidean metric [78]. The so-called canonical metric on the Stiefel manifold is a bit different and reads g_X(A, B) = 2 Re tr[A†(I − ½XX†)B] [48]. Consider a differentiable function f : V_{n,p} → R and its conventional gradient ∇f(X), which we treat as an n × p matrix with elements ∂f/∂X_{ij}, i = 1, ..., n, j = 1, ..., p.
In what follows, we derive the Riemannian gradient ∇ R f (X) for two types of metric.
1. The Euclidean metric g_X(A, B) = 2 Re tr(A†B). Eq. (7) reduces to

2 Re tr{[∇_R f(X)]†(−iHX)} = 2 Re tr{[∇f(X)]†(−iHX)},

which is to be valid for all Hermitian matrices H (i.e., for all tangent vectors −iHX). This implies the equality of the anti-Hermitian combinations

X[∇_R f(X)]† − ∇_R f(X)X† = X[∇f(X)]† − ∇f(X)X†. (11)

Since ∇_R f(X) belongs to the tangent space T_X V_{n,p} by the definition of the Riemannian gradient, we have ∇_R f(X) = −iH̃X for some Hermitian matrix H̃. Substituting this expression in Eq. (11), multiplying both sides of the obtained equation by X, and taking into account the relation X†X = I (the p×p identity matrix), we get (I + XX†)H̃X = i[∇f(X)X† − X∇f(X)†]X and, therefore,

∇_R f(X) = (I + XX†)^{−1}[∇f(X)X† − X∇f(X)†]X, (12)

where I is the n×n identity matrix. Note that XX† is a rank-p projector, so the operator I + XX† is readily inverted.

2. The canonical metric g_X(A, B) = 2 Re tr[A†(I − ½XX†)B]. A completely analogous calculation yields

∇_R f(X) = ∇f(X) − X[∇f(X)]†X, (14)

which belongs to the tangent space T_X V_{n,p} and satisfies Eq. (7) for all tangent vectors −iHX.

The next step in the Riemannian optimization is to move from one point X to another point R_X(W) along a curve, whose direction at the first point (i.e., the vector W) is defined by the Riemannian gradient. As there exist various retractions R_X(W) for the Stiefel manifold V_{n,p}, we focus on two of them.
1. The Cayley retraction reads [47,81]

R_X^{Cayley}(W) = (I − ½A_W)^{−1}(I + ½A_W)X,  A_W = (I − ½XX†)WX† − XW†(I − ½XX†),

where the anti-Hermitian generator A_W satisfies A_W X = W for every tangent vector W. A simple vector transport based on the Cayley retraction is given by Eq. (4), where

P_X(Y) = Y − ½X(X†Y + Y†X) (16)

is a projector because P_X P_X(Y) = P_X(Y) in view of X†X = I.

2. The singular value decomposition (SVD) A = UΣV† of an n × p matrix A enables one to construct a projection π onto the manifold of n × p isometric matrices through the relation π(A) = UV†. The induced SVD retraction reads [82]

R_X^{SVD}(W) = π(X + W),

and the vector transport is given by Eq. (4), with the projector on the tangent space being (16).

A Riemannian optimization algorithm for the minimization of a differentiable function f(X) on the Stiefel manifold (X ∈ V_{n,p}) with a specified metric is constructed by combining the Riemannian gradient ∇_R f(X), the retraction R_X(W), and the vector transport τ_{X,W}(Z). As an illustrative example we consider the Stiefel manifold with the Euclidean metric and provide a step-by-step algorithm for the gradient descent (GD) with momentum. Depending on the chosen retraction type (the Cayley retraction or the SVD retraction), there appears a difference in one step only. The algorithm reads as follows:

1. Choose an initial isometric matrix X_0 at random.
2. Set the step size η and the value of the hyperparameter β.
3. Compute the gradient of the objective function ∇f(X_t) at a current point X_t of the manifold (t = 0 in the beginning).

4. Compute the Riemannian gradient in the Euclidean metric via formula (12).

5. Update the momentum: M_{t+1} = βM_t + ∇_R f(X_t), where M_0 = 0.

6. Update the point by retracting the increment: X_{t+1} = R_{X_t}(−ηM_{t+1}), using either the Cayley retraction or the SVD retraction.

7. Transport the momentum to the new point: M_{t+1} → τ_{X_t, −ηM_{t+1}}(M_{t+1}), with the vector transport given by Eq. (4). Increase t by 1 and go to step 3 until convergence.

If the canonical metric is used, the expression for the Riemannian gradient in step 4 is to be modified accordingly, see Eq. (14). It is also instructive to discuss the complexity of the presented algorithm. The complexity of step 3 entirely depends on the structure of f. Since we use the automatic differentiation to evaluate the gradient of f, the complexity of this step is the same as the complexity of the function evaluation. Therefore, the complexity of step 3 is task-specific and we do not take it into account in what follows; however, as the gradient is an n × p matrix too, we conclude that the complexity of the gradient calculation is greater than or equal to O(np). The complexity of step 4 is O(np²), see Ref. [83].

Step 5 is elementary and requires O(np) summation and multiplication operations. The complexity of step 6 is O(n³) in the case of the Cayley retraction and O(np²) in the case of the SVD retraction, see Ref. [83]. The complexity of step 7 is O(np²), see Ref. [83]. As a result, the overall complexity is O(n³) in the case of the Cayley retraction and O(np²) in the case of the SVD retraction.
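The building blocks of the algorithm admit direct numerical sanity checks. The formulas below are our reconstruction of standard results: the Euclidean-metric Riemannian gradient (I + XX†)⁻¹(GX† − XG†)X, the canonical-metric gradient G − XG†X, the tangent projector P_X(Y) = Y − X(X†Y + Y†X)/2, a Cayley retraction with a Wen-Yin-type generator, and the SVD retraction π(X + W) = UV†:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
dag = lambda M: M.conj().T

X, _ = np.linalg.qr(rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p)))
G = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))  # stand-in for grad f

B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (B + dag(B)) / 2
W = -1j * H @ X                                     # a tangent vector at X

inner = lambda A, C: 2 * np.trace(dag(A) @ C).real  # Euclidean inner product

# Riemannian gradients for the two metrics; both must satisfy the
# defining property g_X(grad_R f, W) = <grad f, W> for tangent W.
Ge = np.linalg.solve(np.eye(n) + X @ dag(X), G @ dag(X) - X @ dag(G)) @ X
Gc = G - X @ dag(G) @ X
g_can = 2 * np.trace(dag(Gc) @ (np.eye(n) - X @ dag(X) / 2) @ W).real
euclid_ok = abs(inner(Ge, W) - inner(G, W)) < 1e-9
canon_ok = abs(g_can - inner(G, W)) < 1e-9

def retract_cayley(X, W):
    """Cayley retraction with a generator reproducing W to first order."""
    I = np.eye(X.shape[0])
    P = I - X @ dag(X) / 2
    A = P @ W @ dag(X) - X @ dag(W) @ P            # anti-Hermitian generator
    return np.linalg.solve(I - A / 2, (I + A / 2) @ X)

def retract_svd(X, W):
    U, _, Vh = np.linalg.svd(X + W, full_matrices=False)
    return U @ Vh

t = 1e-6
Yc, Ys = retract_cayley(X, t * W), retract_svd(X, t * W)
cayley_isometric = np.allclose(dag(Yc) @ Yc, np.eye(p))
svd_isometric = np.allclose(dag(Ys) @ Ys, np.eye(p))
first_order_agree = np.linalg.norm(Yc - Ys) < 1e-8  # both are X + tW + O(t^2)
```

Both retractions return exact isometries (the Cayley transform of an anti-Hermitian matrix is unitary; UV† is isometric by construction), and they differ only at second order in the step, consistent with Definition 2.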
In the following subsections we implement the Riemannian optimization on the Stiefel manifold to solve some particular problems of quantum physics and analyze its performance. In Section VI A, we consider the Hamiltonian renormalization to study the low-energy properties. In subsequent Sections VI B, VI C, and VI D, we consider certain problems in the very diverse field of optimal quantum control [57,84,85]. For each problem, we calculate the cost function gradient by using the automatic differentiation.
A. Low-energy spectrum and eigenstates

We begin with a problem of the Hamiltonian renormalization, which aims at replacing a high-dimensional Hamiltonian H by a simpler low-dimensional one with the identical low-energy spectrum. We consider this problem in order to compare the performances of the conventional MERA optimizer [67] and the first-order Riemannian optimizers. We show that the first-order Riemannian optimizers with properly chosen hyperparameters outperform the conventional MERA optimizer. Suppose H is an n×n matrix; then the renormalization transform has the form H → H_r = V_0† H V_0, where V_0 is an n × p isometric matrix such that V_0† maps the p eigenstates of H with the lowest energy to the eigenstates of H_r, p < n. As n is assumed to be large, we do not have access to the spectrum and eigenstates of H. Instead, the linear functional L(V) := tr(V†HV) is readily computable, so we solve the optimization problem

minimize L(V) subject to V ∈ V_{n,p}, for which min L(V) = Σ_{k=1}^{p} λ_k,

where λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n are the eigenvalues of the Hamiltonian H in the ascending order. The low-energy eigenvectors |ψ_k⟩ of H are expressed through the eigenvectors |ϕ_k⟩ of H_r via |ψ_k⟩ = V_0|ϕ_k⟩, k = 1, ..., p. The exact minimal value of L (if known) can be used to evaluate the performance of the optimization algorithm.
Results of the Riemannian optimization with the Euclidean metric are presented in Fig. 3 for various algorithms exploiting either the Cayley retraction or the SVD retraction: the Riemannian GD, the Riemannian GD with momentum, the Riemannian Adam, the Riemannian AMSGrad, and the conventional MERA optimizer. Here, we put n = 100 and p = 30 to monitor the optimization performance. To make the optimization challenging, we generate a random ill-conditioned Hamiltonian matrix H as follows: (i) we generate samples {s_i}_{i=1}^{n} from the uniform distribution on the segment [−4, 0]; (ii) then we exponentiate the samples and subtract the maximum value from each exponent: λ_i = exp(s_i) − max_i exp(s_i); (iii) we generate a random unitary matrix U using the QR decomposition and build the Hamiltonian H = U diag(λ_1, ..., λ_n) U†. The ill-conditioned nature of the spectrum makes the optimization problem difficult to solve by the naive Riemannian GD because it forbids picking a reasonably large step size. Negativity of the spectrum allows us to compare the performance of the Riemannian optimization technique with the conventional MERA optimization algorithm that requires the Hamiltonian to be negative semidefinite [67] (see Section VII D for details on the conventional MERA optimizer). Fig. 3 demonstrates that the conventional MERA optimizer performs worse than the Riemannian Adam, the Riemannian AMSGrad, and the Riemannian GD with momentum.
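A scaled-down, self-contained sketch of this experiment: minimize L(V) = tr(V†HV) over isometries V with Riemannian GD with momentum (Euclidean-metric gradient, SVD retraction, projector transport). The sizes, seed, hyperparameters, and iteration count are hand-picked and far smaller than the n = 100, p = 30 instance of the text; the gradient is coded analytically here rather than by automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 3
dag = lambda M: M.conj().T

B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (B + dag(B)) / 2                               # a random Hermitian "Hamiltonian"

def rgrad(X, G):
    """Euclidean-metric Riemannian gradient (as reconstructed in the text)."""
    return np.linalg.solve(np.eye(n) + X @ dag(X), G @ dag(X) - X @ dag(G)) @ X

def proj(X, Y):
    """Projector onto the tangent space at X."""
    return Y - X @ (dag(X) @ Y + dag(Y) @ X) / 2

def retract_svd(X, W):
    U, _, Vh = np.linalg.svd(X + W, full_matrices=False)
    return U @ Vh

V, _ = np.linalg.qr(rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p)))
M = np.zeros_like(V)
eta, beta = 0.02, 0.9
for _ in range(10000):
    G = H @ V               # gradient of tr(V^dag H V) w.r.t. <A,B> = 2 Re tr(A^dag B)
    M = beta * M + rgrad(V, G)
    V_new = retract_svd(V, -eta * M)
    M = proj(V_new, M)      # transport the momentum to the new point
    V = V_new

energy = np.trace(dag(V) @ H @ V).real
target = np.sort(np.linalg.eigvalsh(H))[:p].sum()  # sum of the p lowest eigenvalues
still_isometric = np.allclose(dag(V) @ V, np.eye(p))
energy_ok = abs(energy - target) < 1e-3
```

On this small instance the iteration drives L(V) to the sum of the p lowest eigenvalues of H while the SVD retraction keeps V exactly isometric throughout.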

B. Preparation of highly entangled states
Suppose one can experimentally implement any two-qubit unitary transformation on adjacent qubits arranged in a line (Fig. 4). The problem is to find such unitary transformations that the final state exhibits the maximum entanglement with respect to a given cut provided all the qubits are initially disentangled, |ψ_in⟩ = |e⟩ ⊗ |e⟩ ⊗ ⋯ ⊗ |e⟩. We consider a cut separating N_L qubits on the left from N_R qubits on the right. Combining commuting two-qubit unitary gates into multiqubit layer operators U_1, ..., U_M (Fig. 4), the final state is |ψ_out⟩ = U_M ⋯ U_2 U_1 |ψ_in⟩ for some M determining the network depth. As the pure state entanglement can be quantified via the spectrum of the reduced density operator (ϱ_L^out or ϱ_R^out), we end up with the following optimization problem:

maximize S_2(ϱ_L^out) over the constituent two-qubit unitary gates. (19)

Here, S_2(ϱ) = −log tr(ϱ²) is the second-order Rényi entropy. The theoretical maximum in Eq. (19) equals S_2^max = log min(2^{N_L}, 2^{N_R}) and corresponds to the maximally entangled state, which can be realized in a sufficiently deep network.
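The objective S_2 is cheap to evaluate: reshape the state vector into a matrix across the cut, form the reduced state, and take −log of its purity. A minimal function (our own helper, with a base-2 logarithm so that the GHZ state gives exactly S_2 = 1 across any cut):

```python
import numpy as np

def renyi2(psi, n_left, n_qubits):
    """S_2 = -log2 tr(rho_L^2) for the reduced state of the left n_left qubits."""
    mat = psi.reshape(2 ** n_left, 2 ** (n_qubits - n_left))
    rho_L = mat @ mat.conj().T
    return -np.log2(np.trace(rho_L @ rho_L).real)

n = 4
psi = np.zeros(2 ** n, dtype=complex)
psi[0] = psi[-1] = 1 / np.sqrt(2)      # GHZ state (|0000> + |1111>)/sqrt(2)
S2_ghz = renyi2(psi, 2, n)

prod = np.zeros(2 ** n, dtype=complex)
prod[0] = 1.0                           # product state |0000>
S2_prod = renyi2(prod, 2, n)

ghz_ok = abs(S2_ghz - 1.0) < 1e-12      # rho_L has purity 1/2
prod_ok = abs(S2_prod) < 1e-12          # disentangled state: S_2 = 0
```

During the Riemannian optimization this function is exactly what automatic differentiation backpropagates through after the circuit layers are applied.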
We solve the optimization problem (19) for N L = 5 and N R = 6 on the Stiefel manifold of 4 × 4 unitary matrices by using the Riemannian AMSGrad algorithm with the step size η = 0.1. We show that M ≤ 5 layers in Fig. 4 are not enough to approach the theoretical maximum and prepare the maximally entangled state. M ≥ 6 layers suffice to achieve the desired maximally entangled state and the Riemannian AMSGrad successfully solves the problem. To address the optimization process performance with M = 6 layers we plot the deviation ∆S = S max 2 − S 2 as a function of the number of iterations in the Riemannian AMSGrad (Fig. 4).

C. Optimal control for state preparation
The problem is to prepare a desired quantum state of several qubits by using a quantum circuit, whose architecture allows using CNOT gates and local (one-qubit) unitary gates as it takes place, e.g., in IBM quantum processors [86]. The initial state of the qubits is factorized, i.e., |ψ_in⟩ = |e⟩^{⊗N}, where N is the number of qubits. Suppose the architecture of CNOT gates is fixed but one can apply arbitrary local unitary gates; then the output state is |ψ_out⟩ = ∏_{j=1}^{M} [U_j ⊗_{k=1}^{N} u_{jk}] |ψ_in⟩, where {u_{jk}}_{j=1,...,M; k=1,...,N} is a set of one-qubit unitary gates, M is the number of unitary layers alternating with layers of CNOT gates, U_j = ∏_i CNOT^{(j)}_{p_i q_i} if j < M, U_M = I, and CNOT^{(j)}_{p_i q_i} is a CNOT gate with the controlling qubit p_i and the controlled qubit q_i within the j-th layer, see Fig. 5. The goal is to make the output state |ψ_out⟩ as close to a desirable state |ψ_true⟩ as possible. This leads to the following optimization problem:

maximize |⟨ψ_true|ψ_out⟩| subject to u_{jk} ∈ V_{2,2} for all j, k. (20)

We solve this problem by using the Riemannian optimization on the Stiefel manifold of unitary matrices and the automatic differentiation. As an illustrative example we consider N = 4 and M = 4 for the circuit depicted in Fig. 5 and a randomly chosen state |ψ_true⟩. The right-hand side of Fig. 5 demonstrates that the Riemannian Adam optimizer with the step size η = 0.05 rapidly finds such unitary gates that the overlap |⟨ψ_true|ψ_out⟩| equals 1 up to the machine precision.

D. Design and decomposition of quantum gates
Decomposition of a desired multiqubit unitary gate U_true into simpler constituents is of vital importance for the circuit implementation of quantum computing [86]. Elementary gates usually include the entangling CNOT gate and one-qubit gates. The general aim is to find such a decomposition of a multiqubit unitary gate that involves the least possible number of elementary gates (or CNOT gates as they are usually much noisier than the one-qubit ones). This problem is addressed by minimizing the square of the Frobenius distance between the desired gate U_true and its current decomposition U_learned = ∏_{j=1}^{M} [U_j ⊗_{k=1}^{N} u_{jk}] as in Section VI C. If the distance vanishes, then the decomposition is valid. However, in general there is a need for discrete optimization over architectures of CNOT gates. Here we consider a simpler problem wherein the architecture of CNOT gates is fixed but the one-qubit unitary gates u_{jk} are optimized. This leads to the following optimization problem:

minimize ‖U_true − U_learned‖²_F subject to u_{jk} ∈ V_{2,2} for all j, k. (21)
To provide a proof-of-principle solution of problem (21) with the use of the Riemannian optimization on the Stiefel manifold of unitary matrices, we consider a randomly generated two-qubit gate U true and decompose it into the tensor network with three CNOT gates depicted in Fig. 6. The minimum in (21) is known to be zero in this case [87]. Bottom part of Fig. 6 clearly shows that the Riemannian AMSGrad optimization with the step size η = 0.2 successfully solves the posed problem and finds the optimal gates {u jk }, for which the distance (21) becomes negligible.
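Assembling the objective of problem (21) is a matter of Kronecker products. The sketch below builds a two-qubit U_learned from alternating single-qubit layers and CNOTs (the gate placement is illustrative, not the exact Fig. 6 template) and evaluates the squared Frobenius distance; the distance vanishes for an exact decomposition and is generically of order one otherwise:

```python
import numpy as np

rng = np.random.default_rng(4)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def random_u2(rng):
    """Random 2x2 unitary via QR with phase-fixed diagonal."""
    q, r = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
    d = np.diagonal(r)
    return q * (d / np.abs(d))

def u_learned(us):
    """us: list of M layers, each a pair (u_top, u_bottom) of 2x2 unitaries;
    a CNOT is inserted between consecutive single-qubit layers (U_M = I)."""
    U = np.eye(4, dtype=complex)
    for j, (a, b) in enumerate(us):
        U = np.kron(a, b) @ U
        if j < len(us) - 1:
            U = CNOT @ U
    return U

def cost(U_true, us):
    """Squared Frobenius distance of problem (21)."""
    return np.linalg.norm(U_true - u_learned(us)) ** 2

us = [(random_u2(rng), random_u2(rng)) for _ in range(4)]
U_true = u_learned(us)                   # target with a known exact decomposition
zero_cost = cost(U_true, us)             # exact decomposition: distance vanishes
other = [(random_u2(rng), random_u2(rng)) for _ in range(4)]
nonzero_cost = cost(U_true, other)       # mismatched gates: distance > 0

exact_ok = zero_cost < 1e-12
distinct_ok = nonzero_cost > 1e-6
```

In the Riemannian optimization loop, this cost is the function fed to automatic differentiation, and each u_jk is updated on V_{2,2}.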

A. Tensor network describing the multiscale entanglement-renormalization ansatz
Tensor networks have proven their efficiency in solving various challenging problems of quantum physics. They are useful not only for the analysis of many-body quantum systems [44,88,89] but also for the analysis of non-Markovian quantum dynamics [90,91] and for benchmarking quantum circuits [92]. Some tensor networks impose specific constraints on their subtensors. The MERA construction requires the subtensors to be isometric [52,66-68], which enables it to describe critical many-body quantum systems. To introduce the notion of the entanglement renormalization, let us consider a linear isometric mapping $Z: \mathcal{H}_1^{\otimes N} \to \mathcal{H}_2^{\otimes L}$ (22), where $N$ and $L$ are the numbers of particles forming the input and output Hilbert spaces, respectively, and $\mathcal{H}_1$ and $\mathcal{H}_2$ are single-particle Hilbert spaces with $\dim(\mathcal{H}_i) = \chi_i$, $i = 1, 2$. Such a linear isometric mapping effectively reduces the dimensionality of quantum systems if $\chi_1^N > \chi_2^L$. $Z$ provides a natural representation of the real-space renormalization group. For fixed orthonormal bases of the input and output, the operator (22) is given by a matrix $Z$ such that $ZZ^\dagger$ is the identity matrix and $Z^\dagger Z$ is a projector. Since we consider an isometric mapping (22) with $N$ input particles and $L$ output particles, we endow the matrix $Z$ with composed indices, $Z^{j_1 \ldots j_L}_{i_1 \ldots i_N}$ (23), where each index $i_k$ takes $\chi_1$ different values and each index $j_k$ takes $\chi_2$ different values. The total number of elements in the tensor (23) is $\chi_1^N \chi_2^L$. To be applicable to the cases when $N$ and $L$ scale proportionally as $N = 3^k L$, $k \in \mathbb{N}$, the tensor (23) is composed of two types of simpler submatrices (building blocks) from the Stiefel manifolds, namely, the disentangler $u \in V_{\chi_1^2, \chi_1^2}$ and the elementary isometry $z \in V_{\chi_1^3, \chi_2}$. The disentangler satisfies $u^\dagger u = u u^\dagger = I$, and the elementary isometry satisfies $z^\dagger z = I$ and $z z^\dagger = P$, where $P$ is a projector. The tensor diagram notation [88] for $u$, $z$, and one entanglement renormalization layer with $L = N/3$ is given in Fig. 7.
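The defining properties of a Stiefel-manifold building block are easy to check numerically. The sketch below (illustrative dimensions, not those of a real MERA layer) draws a random element of $V_{8,4}$ via QR and verifies that $z^\dagger z = I$ while $z z^\dagger$ is a projector:

```python
import numpy as np

# Random element of the complex Stiefel manifold V_{8,4}: an 8x4 matrix
# with orthonormal columns, obtained from the QR decomposition.
rng = np.random.default_rng(2)
a = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
z, _ = np.linalg.qr(a)          # z has orthonormal columns

p = z @ z.conj().T              # candidate projector onto range(z)
```

Any optimizer on the Stiefel manifold must keep both properties intact after every update.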
Merging entanglement renormalization layers into a single tensor network, we get the ternary MERA network [67], see Fig. 8. The role of base 3 in our description is not crucial; there exist, e.g., the binary MERA and modified binary MERA networks [93].

B. Renormalization theory for a local Hamiltonian
The ideas of the Hamiltonian renormalization in Section VI A are directly applicable to MERA. Here we consider the class of local many-body Hamiltonians and its particular form for a spin chain with periodic boundary conditions, $H^{(0)} = \sum_{i=1}^{N} h^{(0)}_{i,i+1}$, where $N$ is the number of spins and $h^{(0)}_{i,i+1}$ is a local term acting on two neighboring sites with numbers $i$ and $i+1$. The superscript $(n)$ marks the order of renormalization (the number of entanglement renormalization layers used), so the superscript $(0)$ corresponds to the initial Hamiltonian and $H^{(n+1)} = Z H^{(n)} Z^\dagger$, where the numbers of input and output particles are appropriate. The diagrams for a local Hamiltonian $H^{(0)}$ and its first renormalization $H^{(1)}$ are depicted in Fig. 9. The remarkable property of the isometry $Z$ is that the renormalized Hamiltonian $H^{(1)}$ is local too, see Fig. 10. In general we have $H^{(n)} = \sum_i h^{(n)}_{i,i+1}$, where the recurrence relation for $h^{(n)}_{i,i+1}$ is expressed through the ascending superoperator $\mathcal{A}^{(n)}$ as $h^{(n+1)}_{i,i+1} = \mathcal{A}^{(n)}\big[h^{(n)}_{i,i+1}\big]$. The tensor diagram illustrating the action of the ascending superoperator is shown in Fig. 10. For a chain of $N = 2 \times 3^n$ spins, the $n$-th order renormalization results in a Hamiltonian $H^{(n)}$ acting on two renormalized sites. The ground state energy of the spin chain per site equals $E = \min_{\{u,z\},\,|\phi\rangle} \frac{1}{N}\langle\phi| H^{(n)} |\phi\rangle$ (29), where $|\phi\rangle$ is a trial state in the $n$-th order renormalized Hilbert space of small dimension and $\{u, z\}$ is the set of all disentanglers and elementary isometries. As all elements of the set $\{u, z\}$ and the trial state $|\phi\rangle$ are elements of complex Stiefel manifolds, we use the Riemannian optimization algorithms on these manifolds to solve the problem (29).

C. Results
To benchmark the Riemannian optimization approach, we consider a one-dimensional antiferromagnetic Ising model for spin-$\frac{1}{2}$ particles ($\chi_1 = 2$) in the transverse magnetic field (TFI model) with periodic boundary conditions. The system Hamiltonian reads $H = \sum_{i=1}^{N} \big(\sigma^z_i \sigma^z_{i+1} + h_x \sigma^x_i\big)$ (30), where $h_x$ is an external magnetic field directed along the $x$-axis, and $\sigma^z_i$ and $\sigma^x_i$ are Pauli operators acting on the particle at site $i$. The model exhibits a quantum phase transition at $h_x = 1$ in the thermodynamic limit ($N \to \infty$). The search for the ground state in the vicinity of the quantum phase transition is the most challenging regime for a numerical simulation; however, it is known that the conventional MERA optimization overcomes this difficulty [94]. Here we demonstrate that the Riemannian optimization approach successfully overcomes it too.
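For intuition about the model itself, the TFI chain can be built by exact diagonalization at a small size and the gap compared on the two sides of the transition. This is a hedged illustration only (N = 8 here, far from the 486 spins treated with MERA; the Hamiltonian form follows the expression above):

```python
import numpy as np

sz = np.diag([1.0, -1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
id2 = np.eye(2)

def op_at(op, site, n):
    # Embed a single-site operator at position `site` of an n-site chain.
    mats = [op if k == site else id2 for k in range(n)]
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def tfi_hamiltonian(n, hx):
    # H = sum_i sigma^z_i sigma^z_{i+1} + h_x sum_i sigma^x_i, periodic.
    h = np.zeros((2**n, 2**n))
    for i in range(n):
        h += op_at(sz, i, n) @ op_at(sz, (i + 1) % n, n)
        h += hx * op_at(sx, i, n)
    return h

def gap(n, hx):
    e = np.linalg.eigvalsh(tfi_hamiltonian(n, hx))
    return e[1] - e[0]

gap_below = gap(8, 0.5)   # ordered phase: nearly degenerate ground doublet
gap_above = gap(8, 1.5)   # paramagnetic phase: finite gap
```

Even at this tiny size the gap is orders of magnitude smaller below the transition than above it, in line with the behavior discussed around Fig. 11.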
We consider the TFI model with $N = 2 \times 3^5 = 486$ spins and $h_x = 1$. We apply $n = 5$ entanglement renormalization layers to the Hamiltonian (30). We bound the dimension $\chi_2$ of the local output Hilbert space $\mathcal{H}_2$ from above by $\chi_{\max} = 4$. In general, the greater $\chi_{\max}$, the higher the achieved accuracy. We use the TensorNetwork library [95,96] to design the ascending superoperators and perform the automatic differentiation described in Section V. Then we optimize the variational energy $E = \frac{1}{N}\langle\phi| H^{(n)} |\phi\rangle$ by using the Riemannian Adam optimizer on the Stiefel manifolds. We perform 10 000 optimization steps with the step size exponentially decaying from 0.5 to 0.03. The code is available at [54].
Since the exact value of the ground state energy per site is known for the TFI model with the external field $h_x = 1$ [97], $E_g = -2/[N\sin(\pi/2N)]$, we compare it with the output of the Riemannian Adam optimizer. The difference $\Delta E = E - E_g$ is shown in Fig. 11. The achieved relative error for the ground state energy per site is of the order of $10^{-5}$.
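As a quick sanity check of this reference value, the exact expression can be evaluated numerically; it tends to $-4/\pi$ in the thermodynamic limit (helper name `e_ground_per_site` is ours, not from [54]):

```python
import numpy as np

def e_ground_per_site(n):
    # Exact critical TFI ground state energy per site, E_g = -2 / (N sin(pi/2N)).
    return -2.0 / (n * np.sin(np.pi / (2 * n)))

e_486 = e_ground_per_site(486)   # value used as the benchmark in the text
```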
Following the lines of Section VI A, to find the low-energy spectrum of the original Hamiltonian (30) we minimize the cost function ${\rm tr}\,H^{(n)}$ with respect to the disentanglers and elementary isometries by exploiting the Riemannian Adam optimizer. Diagonalization of the obtained renormalized Hamiltonian $H^{(n)}$ yields the low-energy spectrum. As an illustrative example we calculate the energy gap $E_1 - E_g$ between the first excited state and the ground state of the TFI Hamiltonian (30) for different values of the transverse magnetic field $h_x$, see Fig. 11. Our optimizer predicts that the energy gap vanishes if $h_x < 1$ and grows linearly with $h_x$ if $h_x \geq 1$, which entirely agrees with the exact solution. The transition from a zero to a non-zero energy gap at $h_x = 1$ is clear evidence of the quantum phase transition.
We emphasize two main advantages of the Riemannian optimization approach to the entanglement renormalization. First, this approach is easy to implement in comparison with the conventional MERA optimization (based on iterative singular value decompositions) and sometimes achieves better accuracy [69]. Second, the Riemannian optimization has a wide area of applications not limited to the low-energy physics of local Hamiltonians. The generality of the Riemannian optimization allows one to use it in different settings, e.g., in neural networks comprising MERA as a part. The library [54] provides all the necessary tools for accomplishing the Riemannian optimization on complex quantum architectures.

D. Remarks on the conventional MERA optimization algorithm
In Fig. 3, we have already compared the performance of different Riemannian optimization algorithms on the Stiefel manifold with that of the conventional MERA optimization algorithm. Here we discuss the latter algorithm in more detail; it has been the main approach to the entanglement renormalization since 2009 [67,68]. The main procedure within this algorithm solves the problem: minimize $f(\omega) := 2{\rm Re}\,{\rm tr}(\Gamma\omega)$ subject to $\omega \in V_{n,m}$ (31), where $\omega$ is an isometric $n \times m$ matrix from the Stiefel manifold $V_{n,m}$, $m \leq n$, and $\Gamma$ is an arbitrary $m \times n$ matrix.
The problem (31) has the exact solution $\omega_{\rm opt} = -W U^\dagger$ (32), where $U$ and $W$ are the matrices in the singular value decomposition $\Gamma = U\Lambda W^\dagger$. Note that $f(\omega_{\rm opt}) = -2\sum_i \lambda_i$, where $\{\lambda_i\}$ is the set of singular values of $\Gamma$. The generalized algorithm for optimization of an arbitrary real-valued function $F(\omega)$ iterates the following three steps until convergence: $\Gamma_t = \partial F/\partial\omega \big|_{\omega_t}$ (compute the gradient), $\Gamma_t = U_{t+1}\Lambda_{t+1}W_{t+1}^\dagger$ (compute the singular value decomposition of the gradient), $\omega_{t+1} = -W_{t+1}U_{t+1}^\dagger$ (compute a new isometric matrix for the next iteration).
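The exact minimizer of (31) is a one-line SVD computation, which can be checked directly (dimensions here are illustrative):

```python
import numpy as np

def conventional_update(gamma):
    # Exact minimizer of f(w) = 2 Re tr(gamma @ w) over isometries w in V_{n,m}:
    # w_opt = -W U^dagger for gamma = U Lambda W^dagger, with f_min = -2 sum(lambda_i).
    u, s, vh = np.linalg.svd(gamma, full_matrices=False)
    w_opt = -vh.conj().T @ u.conj().T
    return w_opt, -2.0 * s.sum()

m, n = 3, 5
rng = np.random.default_rng(1)
gamma = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
w_opt, f_min = conventional_update(gamma)
```

The returned matrix is isometric and attains the value $-2\sum_i \lambda_i$, which is a lower bound for $2{\rm Re}\,{\rm tr}(\Gamma\omega)$ over all isometries.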
This algorithm can be viewed as a first-order optimization algorithm because it uses the gradient. However, it converges to the optimal point only for a very restricted set of problems. Namely, the algorithm converges to the minimum of a negative-definite quadratic form, which is artificially guaranteed in the entanglement renormalization procedure by adding a sufficiently negative part to the Hamiltonian.
To illustrate the limits of this algorithm, we explore its performance while optimizing a quadratic form $F(\omega) = \omega^\dagger H \omega$ in the simple case of orthogonal matrices $\omega$ with $n = 2$ and $m = 1$ and a real symmetric $2\times2$ matrix $H$. As $\omega$ is merely a two-component unit vector, the Stiefel manifold is the unit circle in this case. Fig. 12 illustrates that the algorithm outputs a point $\omega^*$ that is an eigenvector of $H$ corresponding to the maximum absolute eigenvalue. If the quadratic form $\omega^\dagger H\omega$ is negative-definite, then the algorithm converges to an eigenvector corresponding to the minimal eigenvalue of $H$. However, the algorithm fails to find the minimum of a general quadratic form $\omega^\dagger H\omega$, as Figs. 12(b) and 12(c) suggest. In contrast, the Riemannian optimization for the entanglement renormalization ansatz is free of such drawbacks.
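This failure mode is easy to reproduce in a few lines. For $n = 2$, $m = 1$ the SVD update of a $1\times2$ gradient row reduces to $\omega_{t+1} = -H\omega_t/\|H\omega_t\|$ (the matrix $H$ below is our own example with mixed-sign eigenvalues, not one from the paper):

```python
import numpy as np

# Indefinite quadratic form: eigenvalues +2 (at e_1) and -1 (at e_2).
H = np.array([[2.0, 0.0], [0.0, -1.0]])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)    # initial point on the unit circle

for _ in range(60):
    g = 2.0 * H @ w                        # Euclidean gradient of w^T H w
    w = -g / np.linalg.norm(g)             # SVD update for a single-column w
```

The iteration converges (up to an overall sign) to the eigenvector $e_1$ of the maximum absolute eigenvalue, where $F = +2$, rather than to the true minimum $F = -1$ at $e_2$.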

VIII. RIEMANNIAN OPTIMIZATION ON THE CONE OF POSITIVE-DEFINITE MATRICES
Any positive-definite matrix $S$ from the cone $S^n_{++}$ defines the density operator $\rho = S/{\rm tr}[S]$. This fact enables optimization over the set of full-rank density matrices provided one can perform optimization on the cone $S^n_{++}$ of positive-definite matrices. The cone $S^n_{++}$ can be parameterized in many different ways (see, e.g., [98]). Here we consider two convenient parameterizations, namely, the exponential representation $S = e^H$, where $H$ is a Hermitian matrix, and the Cholesky decomposition $S = LL^\dagger$, where $L$ is an $n \times n$ lower triangular matrix with real and positive diagonal entries. Both representations allow us to derive exact formulas for geodesics and parallel transport within the corresponding metrics because the relation between $S$ on one side and $H$ or $L$ on the other side is a diffeomorphism. This means that there exist a smooth function $F: \mathcal{M} \to S^n_{++}$ and its smooth inverse $F^{-1}: S^n_{++} \to \mathcal{M}$ that maps $S^n_{++}$ to a differentiable manifold $\mathcal{M}$, see Fig. 13. In our cases, $\mathcal{M}$ is composed of either Hermitian matrices $H$ or lower triangular matrices $L$ with real and positive diagonal entries. In the latter case, the function $F$ is unique. Similarly, there exists an isomorphism between the tangent bundles $TS^n_{++}$ and $T\mathcal{M}$. To unify and simplify the notation, we refer to all objects associated with $\mathcal{M}$ and $T\mathcal{M}$ by adding a tilde to the corresponding objects in $S^n_{++}$ and $TS^n_{++}$, respectively, see Fig. 13. For instance, for a given positive-definite matrix $S \in S^n_{++}$ we denote by $\tilde{S} = F^{-1}(S)$ its pre-image in $\mathcal{M}$. Analogously, for a tangent vector $W \in T_S S^n_{++}$ we denote by $\tilde{W}$ the corresponding tangent vector in $T_{\tilde{S}}\mathcal{M}$, which is connected with $W$ via the differential $D_{\tilde{S}}F: T_{\tilde{S}}\mathcal{M} \to T_S S^n_{++}$. We use the introduced tilde notation for other objects too, e.g., for exponential maps.
A Riemannian metric $\tilde{g}$ on $\mathcal{M}$ induces a Riemannian metric $g$ on $S^n_{++}$ together with the corresponding definitions of the exponential map and the parallel transport. The general scheme is as follows: • For a given manifold $\mathcal{M}$ and a diffeomorphism $F: \mathcal{M} \to S^n_{++}$, calculate the differentials $D_{\tilde{S}}F: T_{\tilde{S}}\mathcal{M} \to T_S S^n_{++}$ and $D_S F^{-1}: T_S S^n_{++} \to T_{\tilde{S}}\mathcal{M}$.
• Define a standard Riemannian metric $\tilde{g}_{\tilde{S}}(\tilde{W}, \tilde{V})$ on $\mathcal{M}$ and establish the exponential map $\widetilde{\rm Exp}_{\tilde{S}}(\tilde{W})$ and the parallel transport $\tilde{\tau}_{\tilde{S},\tilde{W}}(\tilde{Q})$ of a vector $\tilde{Q}$ along the geodesic starting at the point $\tilde{S}$ with the direction vector $\tilde{W}$.
• Push all the introduced maps forward to $S^n_{++}$ by using $F$. This means that the induced Riemannian metric $g_S$ on $S^n_{++}$ is $g_S(W, V) = \tilde{g}_{\tilde{S}}\big(D_S F^{-1}(W), D_S F^{-1}(V)\big)$ (34), the exponential map is ${\rm Exp}_S(W) = F\big(\widetilde{\rm Exp}_{\tilde{S}}(\tilde{W})\big)$ (35), and the parallel transport is $\tau_{S,W}(Q) = D_{\tilde{S}'}F\big(\tilde{\tau}_{\tilde{S},\tilde{W}}(\tilde{Q})\big)$ with $\tilde{S}' = \widetilde{\rm Exp}_{\tilde{S}}(\tilde{W})$ (36). In Sections VIII A and VIII B we specify the above formulas in two cases: (i) $\mathcal{M}$ is the manifold of Hermitian matrices with the standard Euclidean metric $(V, W) = 2{\rm Re}\,{\rm tr}(V^\dagger W)$ and $F(H) = e^H$, which induces the so-called log-Euclidean Riemannian geometry on $S^n_{++}$, Refs. [99,100]; (ii) $\mathcal{M}$ is the manifold of lower triangular matrices with strictly positive diagonal elements equipped with a particular non-Euclidean metric and $F(L) = LL^\dagger$, which induces the so-called log-Cholesky Riemannian geometry on $S^n_{++}$, Ref. [101].

A. Log-Euclidean Riemannian geometry
The physical meaning of the exponential representation is that a density matrix $\rho = S/{\rm tr}[S]$ with $S = e^H$ can be thought of as an equilibrium thermal state with the effective Hermitian Hamiltonian $-kTH$, where $k$ is the Boltzmann constant and $T$ is a non-zero temperature. A one-parameter curve $S(t)$ in the cone $S^n_{++}$ of positive-definite matrices has a corresponding one-parameter preimage $H(t)$ in the space of Hermitian matrices $\mathcal{M} = \{M \in \mathbb{C}^{n\times n} \,|\, M = M^\dagger\}$, i.e., in the space of Hamiltonians (up to an irrelevant factor $-kT$). Our goal is to find the exponential map ${\rm Exp}_S(W)$, the vector transport $\tau_{S,W}$, and the Riemannian gradient $\nabla_R f(S)$ for a function $f: S^n_{++} \to \mathbb{R}$ by using simpler expressions on $\mathcal{M}$ and $T_H\mathcal{M}$.
To begin with, we follow the lines of Refs. [99,100] and derive the differentials $D_{\tilde{S}}F$ and $D_S F^{-1}$, where $\tilde{S} \equiv H$. The tangent vector reads $\frac{dS(t)}{dt} = \frac{d}{dt}e^{H(t)} = \int_0^1 e^{\tau H(t)}\,\frac{dH(t)}{dt}\,e^{(1-\tau)H(t)}\,d\tau$. The spectral decomposition $H(t) = \sum_{i=1}^n \lambda_i(t)\,|\psi_i(t)\rangle\langle\psi_i(t)|$ enables us to calculate the obtained integral explicitly, namely, $\frac{dS(t)}{dt} = \sum_{i,j=1}^n G_{ij}(t)\,\langle\psi_i(t)|\frac{dH(t)}{dt}|\psi_j(t)\rangle\,|\psi_i(t)\rangle\langle\psi_j(t)|$, where $G_{ij} = \frac{e^{\lambda_i}-e^{\lambda_j}}{\lambda_i-\lambda_j}$ if $\lambda_i \neq \lambda_j$ and $G_{ij} = e^{\lambda_i}$ if $\lambda_i = \lambda_j$ (41). The derived relation between the operators $\frac{dS(t)}{dt}$ and $\frac{dH(t)}{dt}$ takes a simpler form in the index-free matrix representation in some fixed orthonormal basis $\{|i\rangle\}_{i=1}^n$, namely, $\frac{dS(t)}{dt} = U(t)\left[G(t)\circ\left(U^\dagger(t)\,\frac{dH(t)}{dt}\,U(t)\right)\right]U^\dagger(t)$ (42), where $U(t)$ is the unitary matrix diagonalizing $H(t)$, $G(t)$ is the symmetric matrix with elements $G_{ij}(t)$, and $\circ$ denotes the Hadamard (element-wise) product. Recalling the notation $\tilde{S} \equiv H$, we observe that Eq. (42) explicitly defines the sought differentials $D_{\tilde{S}}F: T_{\tilde{S}}\mathcal{M} \to T_S S^n_{++}$ and $D_S F^{-1}: T_S S^n_{++} \to T_{\tilde{S}}\mathcal{M}$, namely, $D_{\tilde{S}}F(\tilde{W}) = U_S\big[G_S \circ (U_S^\dagger \tilde{W} U_S)\big]U_S^\dagger$ and $D_S F^{-1}(W) = U_S\big[G_S^{\circ-1} \circ (U_S^\dagger W U_S)\big]U_S^\dagger$, where $U_S$ is the unitary matrix diagonalizing both $S$ and $\tilde{S} \equiv H$, $G_S$ is the matrix of elements (41) with $\{\lambda_i\}_{i=1}^n$ being the spectrum of $\tilde{S}$, and $G_S^{\circ-1}$ is the Hadamard (element-wise) inverse of $G_S$, which is well defined because all the elements (41) are non-zero.
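The spectral formula (42) for the differential of the matrix exponential can be verified against a finite difference. The sketch below (our own check, not code from [54]) builds the Hadamard matrix $G$ of Eq. (41) and compares both sides along a random Hermitian direction:

```python
import numpy as np

def expm_h(X):
    # Matrix exponential of a Hermitian matrix via its eigendecomposition.
    lam, u = np.linalg.eigh(X)
    return (u * np.exp(lam)) @ u.conj().T

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (A + A.conj().T) / 2                   # base point in the Hermitian space
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
V = (B + B.conj().T) / 2                   # Hermitian direction dH/dt

lam, U = np.linalg.eigh(H)
num = np.subtract.outer(np.exp(lam), np.exp(lam))
den = np.subtract.outer(lam, lam)
G = np.where(np.abs(den) > 1e-12,
             num / np.where(den == 0, 1.0, den),
             np.exp(lam)[:, None])         # G_ii = e^{lambda_i} on the diagonal

dS_formula = U @ (G * (U.conj().T @ V @ U)) @ U.conj().T

eps = 1e-6
dS_fd = (expm_h(H + eps * V) - expm_h(H - eps * V)) / (2 * eps)
```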
Then we equip the manifold $\mathcal{M}$ with the standard Euclidean metric $\tilde{g}_{\tilde{S}}(\tilde{W}, \tilde{V}) = 2{\rm Re}\,{\rm tr}(\tilde{W}^\dagger \tilde{V})$, which makes the exponential map $\widetilde{\rm Exp}$ and the parallel transport $\tilde{\tau}$ trivial, namely, $\widetilde{\rm Exp}_{\tilde{S}}(\tilde{W}) = \tilde{S} + \tilde{W}$ and $\tilde{\tau}_{\tilde{S},\tilde{W}}(\tilde{Q}) = \tilde{Q}$. We push the obtained formulas for the exponential map and the parallel transport on $\mathcal{M}$ forward to those on $S^n_{++}$ by using Eqs. (35) and (36). The induced Riemannian metric $g_S(W, V) = \tilde{g}_{\tilde{S}}(\tilde{W}, \tilde{V})$ on $S^n_{++}$ is called log-Euclidean because $\tilde{S} = \log(S)$.
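Because the exponential map is trivial in the Hermitian space, a log-Euclidean geodesic step reduces to $S' = \exp(\log S + \tilde{W})$, which preserves positivity by construction. A minimal sketch (the helper `f_h` is a hypothetical name for applying a function to a Hermitian matrix through its spectrum):

```python
import numpy as np

def f_h(X, fn):
    # Apply a scalar function fn to a Hermitian matrix through its spectrum.
    lam, u = np.linalg.eigh(X)
    return (u * fn(lam)) @ u.conj().T

rng = np.random.default_rng(4)
n = 4
a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
S = a @ a.conj().T + np.eye(n)             # positive-definite starting point
b = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W_t = (b + b.conj().T) / 2                 # tangent step in the Hermitian space

# Log-Euclidean exponential map: S' = exp(log S + W~)
S_new = f_h(f_h(S, np.log) + W_t, np.exp)
```

No matter how large the tangent step, the result stays inside the cone, which is precisely why this parameterization is convenient for unconstrained-looking updates.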

B. Log-Cholesky Riemannian geometry
In this section, we use the results of the recent paper [101] to derive the Riemannian gradient in another metric. We consider the manifold $\mathcal{M}$ of lower triangular matrices with real and positive diagonal elements, so that $F(\tilde{S}) = \tilde{S}\tilde{S}^\dagger \in S^n_{++}$ and $\tilde{S} = F^{-1}(S)$ is the unique lower Cholesky factor of a matrix $S \in S^n_{++}$. For the convenience of the reader, in what follows we use the following notation. By $(X)_{\frac{1}{2}}$ we denote the lower-triangular part of a matrix $X$ with the diagonal elements halved, so that $X = (X)_{\frac{1}{2}} + \big((X^T)_{\frac{1}{2}}\big)^T$ for an arbitrary matrix $X$. We write $\lfloor X \rfloor$ for the strictly lower-triangular part of a matrix $X$ and $\mathbb{D}(X)$ for its diagonal part, so that $X = \lfloor X \rfloor + \mathbb{D}(X) + \lfloor X^T \rfloor^T$ and $(X)_{\frac{1}{2}} = \lfloor X \rfloor + \frac{1}{2}\mathbb{D}(X)$ for any matrix $X$. To find the Riemannian gradient, we first calculate the differentials $D_{\tilde{S}}F(\tilde{W}) = \tilde{W}\tilde{S}^\dagger + \tilde{S}\tilde{W}^\dagger$ and $D_S F^{-1}(W) = \tilde{S}\big(\tilde{S}^{-1} W (\tilde{S}^{-1})^\dagger\big)_{\frac{1}{2}}$. Then we follow Ref. [101] and equip $\mathcal{M}$ with the Riemannian metric $\tilde{g}_{\tilde{S}}(\tilde{W}, \tilde{V}) = {\rm Re}\,{\rm tr}\big(\lfloor\tilde{W}\rfloor^\dagger \lfloor\tilde{V}\rfloor\big) + {\rm tr}\big(\mathbb{D}(\tilde{W})\mathbb{D}(\tilde{V})\mathbb{D}(\tilde{S})^{-2}\big)$, which induces the Riemannian metric $g_S(W, V) = \tilde{g}_{\tilde{S}}(\tilde{W}, \tilde{V})$ on $S^n_{++}$, called the log-Cholesky metric [101] in analogy with the log-Euclidean metric, even though no logarithm appears in the expression for the lower Cholesky factor $\tilde{S} = F^{-1}(S)$.
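The notation above and the Cholesky diffeomorphism are straightforward to verify numerically (the helper `half` is our hypothetical name for the $(X)_{\frac{1}{2}}$ operator; the matrices are random examples):

```python
import numpy as np

def half(X):
    # (X)_{1/2}: strictly lower-triangular part plus half of the diagonal.
    return np.tril(X, -1) + np.diag(np.diag(X)) / 2.0

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 4))
decomp = half(X) + half(X.T).T            # should reassemble X exactly

a = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
S = a @ a.conj().T + np.eye(4)            # positive definite
L = np.linalg.cholesky(S)                 # unique lower factor, S = L L^dagger
```

`np.linalg.cholesky` returns exactly the element of the manifold $\mathcal{M}$ described above: lower triangular with real positive diagonal.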
The exponential map and the parallel transport on $\mathcal{M}$ have the following form [101]: $\widetilde{\rm Exp}_{\tilde{S}}(\tilde{W}) = \lfloor\tilde{S}\rfloor + \lfloor\tilde{W}\rfloor + \mathbb{D}(\tilde{S})\exp\big(\mathbb{D}(\tilde{W})\mathbb{D}(\tilde{S})^{-1}\big)$ and $\tilde{\tau}_{\tilde{S},\tilde{W}}(\tilde{Q}) = \lfloor\tilde{Q}\rfloor + \mathbb{D}(\tilde{S}')\mathbb{D}(\tilde{S})^{-1}\mathbb{D}(\tilde{Q})$ with $\tilde{S}' = \widetilde{\rm Exp}_{\tilde{S}}(\tilde{W})$ (48). We finish by pushing the relations (48) forward to $S^n_{++}$ by means of Eqs. (35) and (36). Using Definition 4, we finally derive the Riemannian gradient $\nabla_R f(S)$, where we use the notation $\tilde{S} = F^{-1}(S)$ for the lower Cholesky factor of $S$ and take into account the fact that the preimage of $\nabla_R f(S)$ must be a lower triangular matrix with real diagonal elements.

C. Ground state search and other problems
To assess the performance of the Riemannian optimization on the cone of positive-definite matrices $S^n_{++}$, we consider a quantum system in a Hilbert space of dimension $n = 100$ and a random Hamiltonian $H$, whose ground state and ground energy are of interest. We benchmark the results of several Riemannian optimization algorithms against the actual ground energy of $H$, which is known at the stage of Hamiltonian generation but not accessible to the optimization algorithms. Technically, the spectrum of $H$ is a sample from a uniform distribution of eigenvalues within the segment $[0, 1]$, and the eigenbasis of $H$ is obtained via the QR decomposition of a random matrix. The optimization problem is: minimize $f(S) := {\rm tr}(HS)/{\rm tr}(S)$ subject to $S \in S^n_{++}$, where $\rho = S/{\rm tr}S$ is a full-rank density matrix. The actual ground state $\rho_g$ is a rank-1 operator, which is a boundary point of the cone of positive-definite matrices. For this reason, $f(S)$ can only approach but never reach the infimum $\inf_{S \in S^n_{++}} f(S) = E_g$ during an optimization process with a finite number of iterations. Thus, the optimization problem is challenging for Riemannian optimization algorithms on $S^n_{++}$. We have tested several Riemannian optimizers, including the Riemannian GD, the Riemannian GD with momentum, the Riemannian Adam, and the Riemannian AMSGrad, on the cone $S^n_{++}$ with both the log-Euclidean and the log-Cholesky metrics. The performance of a given algorithm is evaluated by the difference between the energy ${\rm tr}(HS)/{\rm tr}(S)$ at the current iteration and the actual ground state energy $E_g$. A comparison of the performance of the various Riemannian optimization algorithms is presented in Fig. 14. The efficacy of the Riemannian Adam and AMSGrad algorithms with the log-Euclidean metric allows one to find the ground state and its energy with precision $\sim 10^{-6}$ despite the fact that $\rho_g$ does not belong to the cone of positive-definite matrices. The code implementing the algorithms is available at [54].
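The ground state search above can be sketched in a few lines with plain gradient descent in the log-Euclidean parameterization $S = e^{H_t}$, pulling the Euclidean gradient in $S$ back to the Hermitian space with the Hadamard matrix $G$ of Eq. (41). This is a hedged stand-in for the paper's Riemannian optimizers [54], with an illustrative dimension ($n = 16$ instead of 100), step size, and iteration count:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 16
spec = rng.uniform(0.0, 1.0, size=n)                      # spectrum in [0, 1]
q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
H_sys = (q * spec) @ q.conj().T                           # random Hamiltonian
E_g = spec.min()                                          # known ground energy

Ht = np.zeros((n, n), dtype=complex)                      # S = exp(Ht)
eta = 2.0                                                 # illustrative step
for _ in range(3000):
    lam, U = np.linalg.eigh(Ht)
    S = (U * np.exp(lam)) @ U.conj().T
    trS = S.trace().real
    f = (H_sys @ S).trace().real / trS                    # current energy
    grad_S = H_sys / trS - (f / trS) * np.eye(n)          # Euclidean gradient
    num = np.subtract.outer(np.exp(lam), np.exp(lam))
    den = np.subtract.outer(lam, lam)
    G = np.where(np.abs(den) > 1e-12,
                 num / np.where(den == 0, 1.0, den),
                 np.exp(lam)[:, None])
    grad_H = U @ (G * (U.conj().T @ grad_S @ U)) @ U.conj().T
    Ht = Ht - eta * grad_H                                # unconstrained step
```

Positivity of $S$ never has to be checked: it holds identically in this parameterization, and the variational energy `f` decreases toward (but, as noted above, never exactly reaches) `E_g`.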
Clearly, the considered problem is merely an illustration of algorithm performance, since the ground state could be found more easily with the technique of Sec. VI A. However, there exist important problems of quantum information science where the optimal density operator is mixed (not a pure state). For instance, the maximum of the coherent information $I_c(\Phi) = S(\Phi[\rho]) - S(\widetilde{\Phi}[\rho])$ gives the single-letter quantum capacity of the quantum channel $\Phi$, and this quantity is strictly greater than 0 only if $\rho$ is a mixed density operator ($\widetilde{\Phi}$ denotes the complementary channel), see Ref. [60]. It was shown in recent research papers [102-104] that the two-letter quantum capacity can exceed the single-letter quantum capacity, which makes this optimization problem even more relevant. We therefore believe that the optimization on the cone of positive-definite matrices has a number of possible applications.

D. Tomography of quantum states
Quantum state tomography is an important statistical problem aimed at reconstructing an unknown density operator by processing the results of a finite number $K$ of measurements performed on $K$ copies of the state [75,76]. The higher the dimension of the quantum system, the more challenging the problem. Linear reconstruction formulas suffer from possible non-positivity of the reconstructed operator, especially if the number of available copies $K$ is limited. In maximum likelihood reconstruction schemes, the optimization of the likelihood function is performed over the restricted set of positive-semidefinite matrices with unit trace. Here we show that the Riemannian optimization of the likelihood function over the cone of positive-definite operators successfully solves the tomography problem.
The maximum in (56) is attained at some density operator $\rho_{\max}$, which is the best estimate for $\rho_*$. Note that the logarithmic likelihood $\log\prod_{i=1}^{K} P(\alpha_1^i, \ldots, \alpha_N^i|\rho)$ attains its maximum value at the same state $\rho_{\max}$. Using the parameterization $\rho = S/{\rm tr}(S)$, we end up with the following optimization problem on the cone of positive-definite matrices: maximize $f(S) := \log\prod_{i=1}^{K} P\big(\alpha_1^i, \ldots, \alpha_N^i \,\big|\, S/{\rm tr}(S)\big)$ subject to $S \in S^{2^N}_{++}$. Clearly, $\sup_{S \in S^{2^N}_{++}} f(S) = f(\rho_{\max})$. Here we test the performance of the Riemannian Adam optimizer on the manifold $S^{2^N}_{++}$ with the log-Cholesky metric. The figure of merit is the trace distance $\frac{1}{2}\|\Delta\rho\|_1 = \frac{1}{2}\|\rho_* - S/{\rm tr}(S)\|_1$ between the actual state $\rho_*$ and the state $S/{\rm tr}(S)$ updated after each iteration of the optimization algorithm. Fig. 15 depicts the results of a typical numerical experiment on $N = 1, 2, 3, 4$ qubits in a state described by a randomly chosen $2^N \times 2^N$ density operator $\rho_*$. If the total number of measurement outcomes is $K = 5 \times 10^5$, then the reconstruction error varies from approximately 0.2% to 7%, depending on the number of qubits. The greater $K$, the better the state reconstruction quality.
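A stripped-down single-qubit version of this likelihood maximization can be sketched as follows. It is a hedged stand-in for the paper's scheme: instead of the Riemannian log-Cholesky optimizer we use plain Euclidean gradient ascent on a full complex factor $L$ with $S = LL^\dagger$ (positivity still holds by construction), and for simplicity we use exact measurement probabilities rather than finite-sample counts, so the maximizer is the true state itself:

```python
import numpy as np

rng = np.random.default_rng(7)

# Random mixed true state rho* = (I + r . sigma) / 2 with |r| < 1.
r = rng.uniform(-0.5, 0.5, size=3)
sig = [np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]]),
       np.array([[1, 0], [0, -1]], dtype=complex)]
rho_true = 0.5 * (np.eye(2) + sum(ri * si for ri, si in zip(r, sig)))

# Projectors onto the +/- eigenstates of sigma_x, sigma_y, sigma_z and
# their exact outcome probabilities (infinite-data limit).
projs, freqs = [], []
for si in sig:
    _, vecs = np.linalg.eigh(si)
    for k in range(2):
        P = np.outer(vecs[:, k], vecs[:, k].conj())
        projs.append(P)
        freqs.append(np.real(np.trace(P @ rho_true)))
freqs = np.array(freqs) / 3.0          # three measurement settings

L = np.eye(2, dtype=complex)
eta = 0.1                              # illustrative step size
for _ in range(5000):
    S = L @ L.conj().T
    trS = np.trace(S).real
    # Gradient of the normalized log-likelihood with respect to S.
    M = sum(fk * (P / np.real(np.trace(P @ S)) - np.eye(2) / trS)
            for fk, P in zip(freqs, projs))
    L = L + eta * (M @ L)              # ascent step in the factor L

rho_est = (L @ L.conj().T) / np.trace(L @ L.conj().T).real
```

The estimate `rho_est` is positive semidefinite and unit trace at every iteration, and in this infinite-data limit it converges to `rho_true`; with finite counts it would converge to the maximum likelihood state $\rho_{\max}$ instead.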

IX. CONCLUSIONS
Optimization over a specific manifold is a typical problem in quantum information science, from fundamental questions like compatibility of observables [106], operational restrictions in general probabilistic theories [107], and entanglement robustness [71,108] to practical questions like the evaluation of quantum channel capacities [109,110] and learning an unknown quantum noise [22,63,64]. The problem becomes especially involved for high-dimensional systems and for complex tensor networks composed of many elements that must be optimized simultaneously.
In this paper, we advocated the Riemannian optimization and the automatic differentiation as efficient methods to solve optimization problems on the manifolds of unitary and isometric matrices. One of the main advantages of our approach was its universality that enabled one to optimize any functional, for instance, any tensor network composed of unitary and isometric tensors without resorting to ad hoc methods valid for specific architectures only (like the conventional MERA optimization). We supported this claim by solving some problems of optimal control in circuit implementation of quantum computation aimed at preparing a desired (entangled) state or decomposing a given quantum gate into simpler gates. We also demonstrated that the Riemannian optimization over the manifold of isometric matrices enables one to effectively study the low-energy physics of high-dimensional (multipartite) systems with no regard to the Hamiltonian complexity (structure). By comparing our results with the previously known ones in Fig. 3, we observed that the Riemannian optimization on the complex Stiefel manifold of isometric matrices outperformed the conventional MERA optimization algorithm in accuracy of reproducing the low-energy spectrum.
As quantum channels are naturally parameterized by isometric matrices via the Stinespring dilation [60,61], we believe that the presented analysis can find further applications in optimization over the set of quantum channels. The need for such optimization was pointed out in Refs. [22,63,64], where an unknown quantum noise is learned via maximum likelihood estimation from the results of a sequence of projective measurements. Another application of the Riemannian optimization on the manifold of isometric matrices is the search for an optimal code to transmit quantum information (quantum states) through noisy communication lines, i.e., the evaluation of the quantum capacity of quantum channels. Last but not least is the problem of optimal coherent control of a noisy quantum system [57,58], where the methods of Riemannian optimization would be of great help.
We considered the cone of positive-definite matrices and used the exponential parameterization and the Cholesky decomposition to define the Riemannian metric on the cone and derive the corresponding Riemannian gradients.
With the results obtained we demonstrated the efficacy of the Riemannian optimization on the cone to find the ground state of a high-dimensional Hamiltonian and to reconstruct an unknown multiqubit state via the maximum likelihood estimation. Possible applications of optimization over a cone of positive-definite matrices are numerous (study of coexistence of POVM effects and compatibility of POVMs [106], quantification of quantum correlations [111,