Quantum geometric machine learning for quantum circuits and control

The application of machine learning techniques to solve problems in quantum control together with established geometric methods for solving optimisation problems leads naturally to an exploration of how machine learning approaches can be used to enhance geometric approaches for solving problems in quantum information processing. In this work, we review and extend the application of deep learning to quantum geometric control problems. Specifically, we demonstrate enhancements in time-optimal control in the context of quantum circuit synthesis problems by applying novel deep learning algorithms in order to approximate geodesics (and thus minimal circuits) along Lie group manifolds relevant to low-dimensional multi-qubit systems, such as SU(2), SU(4) and SU(8). We demonstrate the superior performance of greybox models, which combine traditional blackbox algorithms with whitebox models (which encode prior domain knowledge of quantum mechanics), as a means of learning underlying quantum circuit distributions of interest. Our results demonstrate how geometric control techniques can be used to both (a) verify the extent to which geometrically synthesised quantum circuits lie along geodesic, and thus time-optimal, routes and (b) synthesise those circuits. Our results are of interest to researchers in quantum control and quantum information theory seeking to combine machine learning and geometric techniques for time-optimal control problems.


A. Overview
Machine learning-based approaches to solving theoretical and applied problems in quantum control have gained considerable traction over recent years as researchers leverage access to enhanced computational resources in order to solve numerical optimisation problems. Concurrently, geometric control techniques, in which the tools of differential geometry and topology are applied to problems in quantum information processing, have been applied in a variety of quantum control programmes [1][2][3][4]. The synthesis of geometry and quantum information has also recently attracted the interest of researchers in complexity geometry [5,6]. It is natural, therefore, that the intersection of geometric and machine learning techniques in quantum control should emerge as a cross-disciplinary research direction. Understanding such synergies between geometric control, quantum information processing and machine learning offers promising techniques for theoretical and applied quantum computational research, with potential application across other research domains.
In this work, we extend previous research seeking to combine techniques from geometric control, quantum information processing and machine learning in order to synthesise time-optimal quantum circuits for multi-qubit quantum systems. The development of techniques for improving the time-optimality of quantum circuit synthesis is of interest to researchers across the spectrum of theoretical [7] and applied quantum information science, given the difficulties and challenges of synthesising quantum circuits for desired computations, let alone time-optimal ones. We approach this ubiquitous problem by extending geometric methods for generating approximate normal subRiemannian geodesic (and thus time-optimal) paths along certain Lie group manifolds of interest to quantum information processing (such as SU(2^n)) with tailored deep learning-based machine learning techniques.
Our results consist of: (1) an evaluation of certain existing approaches for approximating geodesics along Lie manifolds via discrete sequences of unitary propagators; (2) determination of the optimal set of controls for generating discrete approximations to geodesic sequences of unitaries in SU(2^n) for application in multi-qubit systems; and (3) demonstration of the utility of adopting so-called 'greybox' machine learning architectures [8], which combine 'whitebox' architectures encoding prior information (such as known laws of quantum mechanics) with 'blackbox' architectures, such as various neural network architectures, for synthesising quantum circuits.

B. Problem description
The focus of this work is on the development of novel machine learning architectures that leverage results from subRiemannian geometry in a quantum control setting. Such techniques are of relevance to practitioners within quantum control for a variety of reasons. First, as we discuss below in our explication of subRiemannian geometry in quantum control settings, subRiemannian control problems are a generalisation of standard Riemannian control problems in that they represent a more general form of Riemannian geometry.
Second, subRiemannian quantum control problems arise where only a subset of the full Lie algebra (of generators) is itself directly accessible. This is of direct relevance to the majority of quantum control cases which may be envisioned for quantum computing devices, in which one does not have access to the full set of underlying generators, say for arbitrary multi-qubit (qudit) systems with a limited gate set. Most quantum control problems are in fact, when characterised geometrically, subRiemannian quantum control problems.
Third is the result that synthesis of quantum circuits (i.e. sequences of unitary propagators) in a time-optimal fashion using geometric techniques (in which time-optimality is equated with generating discretised approximations to minimal-distance geodesics on underlying Lie group manifolds) may in fact call for subRiemannian rather than Riemannian geometric techniques. The reason for this is that in order to generate such geodesic approximations, it is often beneficial (and in some cases necessary) to restrict the underlying control subalgebra of generators to a subset of the full Lie algebra. For many multi-qubit systems, quantum circuits are more likely to approximate geodesics (and thus be characterised as time-optimal) where the generating Lie algebra is restricted to what are known as one- and two-body Pauli operators (tensor products of at most two standard Pauli operators), rather than the full Lie algebra. These three issues - the prevalence of subRiemannian geometric features in quantum control problems, the restricted availability of generators when undertaking control, and the need to synthesise circuits in a time-optimal fashion - motivate the use of the geometric techniques applied in this work.

C. New contributions
In this work, we report a number of experimental results based upon simulations of machine learning models for quantum circuit synthesis.
First, we report improved machine learning architectures for quantum circuit synthesis. We demonstrate in-sample improvements, by standard metrics including MSE and average operator fidelity across training, validation and generalisation, of several orders of magnitude compared with relevant state-of-the-art methods. We demonstrate that customised deep learning architectures which utilise a combination of standard and bespoke neural network layers, together with customised objective functions (such as fidelity measures) of relevance to quantum information processing, achieve superior results. This approach, denoted 'greybox' machine learning, is characterised by models that combine known prior assumptions about quantum information processing with machine learning architectures. We demonstrate enhanced performance of greybox over blackbox models.
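The fidelity objectives referred to above can be made concrete. A common choice in the literature (stated here as an illustrative assumption, not necessarily the precise metric used in our experiments) is the operator fidelity F(U, V) = |Tr(U†V)|²/d², which equals 1 exactly when U and V agree up to a global phase. A minimal numpy sketch:

```python
import numpy as np

def operator_fidelity(U, V):
    """Fidelity |Tr(U^dagger V)|^2 / d^2 between two d x d unitaries.

    Equals 1 iff U and V agree up to a global phase; insensitive to basis choice.
    """
    d = U.shape[0]
    return np.abs(np.trace(U.conj().T @ V)) ** 2 / d ** 2

# Example: a single-qubit Z-rotation compared with itself and with the identity.
theta = 0.3
Rz = np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])
print(operator_fidelity(Rz, Rz))         # 1.0 (identical)
print(operator_fidelity(Rz, np.eye(2)))  # cos^2(theta/2)
```

A loss of the form 1 − F is differentiable in the entries of U, which is what makes such fidelity measures usable as custom objective functions in a greybox model.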
Second, we report an improvement on previous work combining subRiemannian geometric techniques with training data and deep learning [9] to synthesise quantum circuits. We show that optimal sets of controls may be obtained using feed-forward fully-connected, Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN) and custom geometric machine learning models. However, we also report on difficulties in usefully adapting such approaches for generalisation. Third, we demonstrate that machine learning protocols seeking to learn discretised geodesic approximations in SU(2^n) are particularly sensitive to hyperparameter tuning, including the time for application of generators and the coverage of training geodesics over manifolds of interest. We show that selection of small time-steps for discretised unitary evolution will result in geodesic approximations highly proximal to the identity in SU(2^n) (as was the case in [9]), resulting in a deterioration in the ability to learn (in-sample and out-of-sample) geodesic approximations to target unitaries further away (by whatever relevant distance metric or norm is adopted) along the manifolds. Improving model performance to generalise beyond the proximity of the identity is shown to require small evolutionary timescales but also an increased number of segments in the geodesic approximation, though achieving the correct balance of timescale and segmentation remains a non-trivial tuning problem.

D. Structure
The structure of this work is as follows. Part II provides an overview of key quantum control concepts and literature relevant to our experiments. It examines the formulation of quantum control problems geometrically in terms of Lie groups and differential geometry. It also explores seminal expositions from Nielsen et al. in which time-optimal quantum circuit synthesis problems are framed in terms of generating approximate geodesics along relevant group manifolds. Part III details the application of subRiemannian geometric theory to quantum circuit synthesis. Part IV lays out the design principles behind the series of experiments undertaken to develop improved machine learning architectures for quantum circuit synthesis via approximate discretised geodesics. Readers interested only in the technical details of the architectures should skip to this section. Part V details the results of the various experiments, with discussion set out in Part VI. Future work and directions emerging from this research are then discussed in Part VII. Code for the experiments may be found at GitHub [10].
The necessity of quantum control for various quantum information and computation programmes globally has seen the emergent application of classical geometric control techniques, among others, in an effort to solve threshold problems such as how to synthesise time-optimal circuits. Nearly two decades ago, developments in applied quantum control [1,11,12] spurred the use of geometric tools to assist in solving optimisation problems in quantum information processing contexts such as applied NMR [3]. Related work also explored the use of Lie theoretic, geometric and analytic techniques for controllability of spin particles [13]. Since that time, connections between geometry and quantum control/information processing have been developed across cross-disciplinary fields via the explication of transformations that translate problems in one field, in this case quantum control optimisation objectives (such as minimising controls for synthesis or reachable targets), into another, namely the language of differential geometry. Of particular note, Nielsen et al. [14] demonstrated that calculating quantum gate complexity could be framed in terms of a distance-minimisation problem in the context of Riemannian manifolds. In that work, upper and lower bounds on quantum gate complexity were derived, relating to the optimal control cost in synthesising an arbitrary unitary U_T ∈ SU(2^n) and demonstrating the equivalence of this problem to the geometric challenge of finding minimal distances on certain Riemannian, subRiemannian and Finslerian manifolds. Subsequently, geometric techniques were utilised [15,16] to find a lower bound on the minimal number of unitary gates required to exactly synthesise U_T, thereby specifying a lower bound on the number of gates required to implement a target unitary.
Research across a range of quantum control [17,18] and geometric circuit synthesis [16,19,20] programmes has built upon results regarding the use of geometric techniques in quantum control settings. Of interest to researchers at the intersection of geometric and machine learning approaches for quantum circuit synthesis, and the focus of this work, is a technique developed in [9,21] that combines subRiemannian geometric techniques with deep learning in order to approximate normal subRiemannian geodesics for synthesis of time-optimal or nearly time-optimal quantum circuits. Our results present improved machine learning architectures tailored to learning such approximate geodesics.

Control formulations
The affinity between quantum control methods and geometric methods (control-theoretic and otherwise) arises from many sources within the literature. One fundamental reason is the intimate connection between Lie algebraic formulations of control problems, in classical and quantum settings, on the one hand, and differential-geometric formulations of Lie theory on the other. In typical Lie theoretic approaches to quantum control problems [22], such as synthesis of quantum circuits, the quantum unitary of interest U is drawn from a Lie group G. A feature of Lie groups is that they are mathematical structures that are at once groups but also differentiable manifolds: topological structures equipped with sufficient geometric and analytical structure to enable analytic machinery, such as the tools of differential geometry, to be applied to their study [23].
A typical formulation of control problems in such Lie theoretic terms takes a target unitary U_T to be an element of a Lie group, such as SU(2^n), represented as a manifold. Associated with the underlying Lie group G is a Lie algebra g, say su(2^n), comprising the generators of the underlying Lie group of interest. Quantum control objectives can then be characterised as attempts to synthesise a target unitary propagator [11] belonging to such a Lie group G via application of generators belonging to g in a controlled manner. In the simplest (noise-free) non-relativistic settings, computation is effected via evolution from U(0) = I to U_T according to the time-dependent Schrödinger equation:

dU(t)/dt = -iH(t)U(t),  U(0) = I.  (1)

The above formulation may also be expressed in terms (discussed in more detail below) of a time-dependent drift Hamiltonian H_d(t) and control Hamiltonians H_k(t):

dU(t)/dt = -i(H_d(t) + Σ_k v_k(t)H_k)U(t).  (2)

The drift part of the Hamiltonian represents the (directly) 'uncontrollable' aspect of evolution (and is discussed in more detail below), while the control Hamiltonians represent evolution generated by those elements (generators) of the quantum system which are controllable, namely the generators of a Lie algebra of interest, such as, in the case of qubit systems, generalised Pauli operators. The terms v_k = v_k(t) represent the 'control' functions (discussed below) applied to specific generators A_k ∈ g. Sometimes the H_k represent distinct generators, hence their summation, while at other times they represent different (usually linear) combinations of generators (in which case v_k represents a vector of control functions), though this is mainly a stylistic choice. The time-dependence of the Hamiltonians is encoded in these time-dependent control functions, as the generators themselves are not time-dependent. While often linear, the functional time dependence can and does often assume non-linear and complicated functional forms, especially in the presence of noise.
Analytically solving for the form of the control function is difficult and usually intractable for higher-order qudit systems, with numerical methods usually adopted instead [3]. One of the motivations for the use of machine learning in quantum control problems is precisely their potential utility in learning a sufficient approximation of the control functions needed to achieve quantum control objectives.
It is common (as discussed below), in appropriate circumstances, to replace the typical time-dependent Schrödinger equation with its time-independent discretised approximation, in which evolution towards a target unitary propagator U_T is approximated via a sequence of successive unitaries generated by time-independent Hamiltonians H(t_j) applied at time t_j for duration Δt_j:

U(t_N) = U_N U_{N-1} ··· U_1,  U_j = exp(-iH(t_j)Δt_j),

where Δt_j = Δt = T/N. That is, the unitary propagator at time t_j (from the identity) is the cumulative reverse product (forward-solved cumulant) of the sequence of U_j, so that U(t_j) = U_j U_{j-1} ··· U_1. This approximation is considered appropriate where Δt is small by comparison to the total evolution time T (or equivalently total energy) and is the approximation adopted in our experiments detailed below. Adopting this approximation allows (2) to be expressed as:

U_j = exp(-i(H_{d,j} + H_j)Δt).

Here, H_{d,j} = H_d(t_j) designates the drift (or internal) part of the Hamiltonian at time t_j and H_j represents the control Hamiltonian at time t_j:

H_j = Σ_k v_k(t_j) τ_k.

The control Hamiltonian H_j, parametrised by the discretised control functions v_k = v_k(t_j), is composed from (usually linear) functions of the generators τ_k ∈ g (where dim g = m and k indexes the generators) belonging to the corresponding Lie algebra, such as generalised (tensor products of) Pauli operators for SU(2^n). The control functions v_k(t) encompass the amplitude (energy) and the duration Δt over which the control Hamiltonian H_j is to be applied, and typically correspond, for example, to the application of certain voltages or magnetic fields for a certain period of time. The functional form of the controls v_k(t) can vary, with common (idealised) representations including Gaussian or 'square' pulses. The objective of time-optimal control is then to select the set of controls v_k(t) to be applied (when using a discretised approximation) at time t_j for time Δt_j in order to synthesise U_T in the shortest amount of total time. Such geometric approaches involve reparametrisation of quantum circuits, which are discrete, as approximations to geodesics on Lie group manifolds of interest to quantum information processing [14][15][16]. It is this optimisation problem that motivates the geometric recharacterisation of problems in quantum information, such as determining and solving geodesic equations of motion.

FIG. 1: Sketch of geodesic path. The evolution of quantum states is represented by the evolution according to Schrödinger's equation of unitary propagators U as curves (black line) on a manifold U ∈ G generated by generators (tangent vectors) (blue) in the time-dependent case (2). For the time-independent case, the geodesic is approximated by evolution of discrete unitaries for time Δt, represented by red curves (shown as linear for ease of comprehension). Here U_{t_i} represents the evolved unitary at time t_i.
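The discretised, piecewise-constant evolution described above can be sketched numerically. The following is an illustrative single-qubit example (the generator set, control amplitudes and drive pattern are hypothetical choices for demonstration, not those used in our experiments):

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def propagate(controls, generators, dt):
    """Forward-solved cumulative product U = U_N ... U_1 of piecewise-constant steps.

    controls: (N, m) array of amplitudes v_k(t_j); generators: list of m Hamiltonians.
    Each segment applies U_j = exp(-i H_j dt); later unitaries left-multiply earlier ones.
    """
    U = np.eye(generators[0].shape[0], dtype=complex)
    for v in controls:
        H = sum(vk * tk for vk, tk in zip(v, generators))
        U = expm(-1j * H * dt) @ U
    return U

# A constant unit-amplitude X drive for total time pi/2 gives exp(-i X pi/2) = -iX.
N, T = 100, np.pi / 2
U_T = propagate(np.tile([1.0, 0.0], (N, 1)), [X, Y], T / N)
```

Because the segment Hamiltonians here all commute, the discretised product reproduces the continuous evolution exactly up to numerical precision; for non-commuting segment Hamiltonians the product is only an approximation that improves as Δt shrinks.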

Path-length and Lie groups
The adaptation of geometric and variational methods for solving optimisation problems in quantum information processing is characterised in terms of minimising the distance of curves along Lie group manifolds G. Doing so requires selection of a metric (or cost functional) that intuitively measures the distance between elements in the associated Lie algebra which, in geometric terms, are represented by tangent vectors belonging to the associated tangent space TG. Cost functionals are essentially analogous to variational functional equations such that:

C = ∫ √(g_αβ (dx^α/dt)(dx^β/dt)) dt,

where g_αβ represents the (not necessarily constant) metric tensor and dx/dt represents the derivative of the curve of Lie group elements x ∈ G with respect to the familiar unique single parametrisation (i.e. time). Solving the optimisation problem of interest, such as synthesising a circuit in minimal time or with minimal energy, becomes a question of minimising the cost functional. Variational methods in this approach set δC = 0 and consequently use standard techniques from variational calculus to derive respective equations of motion: differential equations whose solutions (usually) take the form of exponentiated Lie algebraic elements, i.e. unitary propagators, which ultimately minimise the cost functional and solve the underlying optimisation problem. It is worth explicating the form of cost functionals for quantum information practitioners who may be less familiar with geometric methods. In the continuous case we have:

C = ∫_0^T f(H(t)) dt,

where f represents the cost function (a functional of the controls) applied to the Hamiltonian H(t); in the discretised case, we essentially replace the integral with a sum over the various Hamiltonians such that:

C ≈ Σ_j f(H_j) Δt.

By selecting the appropriate parametrisation of curves on the manifold (such as a typical parametrisation by arc-length), distance along a curve (representing evolution from one unitary, such as the identity, to another) can be equated to the minimal time required to evolve (and synthesise) a target unitary U_T of interest. In cases where there are multiple curves between two points, one must select the minimal path over all such paths [24]. Because minimising the cost functional depends itself upon solutions (unitaries) which are themselves generated by Lie algebraic elements subject to control functions, the optimisation problem of quantum control thus becomes a problem of identifying the optimal set (sequence) of control functions to be applied over time in order to minimise the cost functional.
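As an illustrative sketch of a discretised cost functional (assuming, purely for illustration, a diagonal metric on the control amplitudes, which is not necessarily the metric used in this work):

```python
import numpy as np

def path_cost(controls, dt, weights=None):
    """Discretised path cost C ≈ sum_j sqrt(sum_k w_k v_k(t_j)^2) * dt.

    controls: (N, m) array of control amplitudes; weights implements a diagonal
    (constant) metric on the Lie algebra, defaulting to the Euclidean one.
    """
    controls = np.asarray(controls, dtype=float)
    if weights is None:
        weights = np.ones(controls.shape[1])
    segment_speeds = np.sqrt((weights * controls ** 2).sum(axis=1))
    return float(segment_speeds.sum() * dt)

# A unit-speed path over total time T = 2 has Euclidean cost 2.
N, T = 50, 2.0
cost = path_cost(np.tile([1.0, 0.0], (N, 1)), T / N)
```

Under an arc-length (unit-speed) parametrisation the cost equals the evolution time, which is the sense in which minimal-length curves correspond to time-optimal synthesis. Non-uniform weights implement penalty metrics of the kind discussed later.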
Applying standard techniques from the calculus of variations (e.g. the Pontryagin Maximum Principle [25]) with respect to the cost functional results in the geodesic equation of motion [15,24], which specifies the path that minimises the action and which typically (for constant-metric Riemannian manifolds) is given by:

d²x^j/dt² + Γ^j_{kl} (dx^k/dt)(dx^l/dt) = 0,  (12)

where x = x(t) ∈ G are the unitary group elements while dx/dt ∈ g represents the differential operators (tangent vectors/generators) of the associated Lie algebra. Also, in (12) it is implied that the form of the applicable metric g_αβ is itself identical across the manifold (which may not always be the case). The Γ^j_{kl} represent Christoffel terms obtained by variation with respect to the metric. Given a small arc along a geodesic on a Riemannian manifold, the remainder of the geodesic path is completely determined by the geodesic equation. Solutions to the geodesic equation are, in the continuous case, curves and, in the discrete case, approximations to curves on the manifold of interest. In the discrete case, such discretised curves are interpretable as quantum circuits which are time-optimal when such geodesics also represent the minimal distance curve linking two unitary group elements on a manifold. In this way, variational methods leveraging geometric techniques and characterisation may be utilised for synthesising quantum circuits.
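To illustrate how the geodesic equation of motion determines a path from initial data, consider the unit 2-sphere (a toy manifold, not a quantum control manifold) with metric ds² = dθ² + sin²θ dφ². Its non-zero Christoffel symbols are Γ^θ_{φφ} = -sinθ cosθ and Γ^φ_{θφ} = cotθ, so the geodesic equations reduce to θ'' = sinθ cosθ (φ')² and φ'' = -2 cotθ θ'φ', and simple forward integration recovers great circles:

```python
import numpy as np

def sphere_geodesic(theta0, phi0, dtheta0, dphi0, dt=1e-3, steps=2000):
    """Euler-integrate the geodesic equation on the unit 2-sphere.

    d²θ/dt² = sinθ cosθ (dφ/dt)²,  d²φ/dt² = -2 cotθ (dθ/dt)(dφ/dt).
    Returns the final (θ, φ) after `steps` steps of size `dt`.
    """
    th, ph, dth, dph = theta0, phi0, dtheta0, dphi0
    for _ in range(steps):
        ddth = np.sin(th) * np.cos(th) * dph ** 2
        ddph = -2.0 / np.tan(th) * dth * dph
        dth += ddth * dt
        dph += ddph * dt
        th += dth * dt
        ph += dph * dt
    return th, ph

# Starting on the equator moving purely in phi: the geodesic stays on the equator
# (a great circle), as the geodesic equation predicts.
th_f, ph_f = sphere_geodesic(np.pi / 2, 0.0, 0.0, 1.0)
```

The same determinism applies on Lie group manifolds: a small initial arc fixes the rest of the geodesic, which is what makes learning initial control settings sufficient for reconstructing entire geodesic circuits.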

Accessible controls and drift Hamiltonians
Minimising cost functionals in the way described above involves understanding what classical control theory describes as the set of accessible controls: those unitaries which may be synthesised via application of the controls are termed reachable targets [26]. Designing appropriate machine learning algorithms, using geometric methods or otherwise, thus requires information on the form of the control functions and generators that are available to reach a desired target, such as a target unitary or quantum state.
For a given Lie group G, access to the entire set of generators g renders any element U ∈ G reachable. In quantum control settings, access to the full Lie algebraic array of generators occasionally renders the problem of unitary synthesis, i.e. the sequence of generators and control pulses, analytically or trivially obtainable using geometric means, such as Euler decompositions where G = SU(2^n) [27]. In many circumstances (such as those explored below), we are constrained to, or seek to, synthesise target unitaries U_T using only a subset of the relevant Lie algebra, a subset named the control algebra (or control subalgebra) k ⊂ g. In such cases, the full set of generators is not directly accessible. However, one may still be able to reach the target unitary of interest if the elements of k may be combined (by operation of the Lie bracket or Lie derivative, as discussed below) in order to generate the remaining generators belonging to g, thus providing access to g in its entirety. We distinguish such cases by denoting the first case as one of directly accessible controls, while the second case represents indirectly accessible controls.
Returning to the quantum control paradigm (7), the drift Hamiltonian H_d represents the evolution of a quantum system which cannot be directly controlled. It may represent a noise term or the interaction of a system with an environment in open quantum systems formulations. Where a control subalgebra k ⊂ g represents only a subset of the relevant Lie algebra, we can think of the complement p = k⊥ (where g = p ⊕ k) as the set of generators from which the drift term H_d, or at least elements of it, is composed, i.e. that H_d ∈ p (noting that, for example, in open quantum systems or non-unitary evolutions, generators are not necessarily Lie algebraic in character).
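The splitting g = p ⊕ k can be computed explicitly using the Hilbert-Schmidt inner product ⟨A, B⟩ = Tr(A†B). Below is a toy single-qubit sketch, with the assumed (illustrative) choice k = span{X, Y} and p = span{Z}:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def project(H, basis):
    """Project H onto the span of a mutually orthogonal operator basis
    using the Hilbert-Schmidt inner product Tr(B^dagger H)."""
    out = np.zeros_like(H)
    for B in basis:
        coeff = np.trace(B.conj().T @ H) / np.trace(B.conj().T @ B)
        out = out + coeff * B
    return out

# Split a Hamiltonian into its controllable (k) and drift (p) components.
H = 0.5 * X + 0.2 * Y + 0.9 * Z
H_k = project(H, [X, Y])   # directly accessible part
H_p = project(H, [Z])      # drift part, H_d ∈ p
```

The two projections recompose to H exactly, mirroring the orthogonal decomposition g = p ⊕ k; the same projection works for any n-qubit Pauli basis.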
The interaction between the drift Hamiltonian H_d (named due to its origins in fluid dynamics) and the control Hamiltonians H_j depends on the set of accessible controls available to solve the quantum control problem of interest. The application of control Hamiltonians in this case represents, in effect, an attempt to 'steer' a system evolving according to H_d towards a desired target via the adjoint action of Lie group elements generated by k (see [11] for a discussion).
Understanding the nature of relevant control algebras and the composition of drift Hamiltonians is an important consideration when designing and implementing machine learning architectures for geometric quantum control, including recent novel approaches applying machine learning to modelling and control of a reconfigurable photonic circuit [28] and to learning characteristics of H_d via quantum feature engineering [8]. One of the motivations of the present work is to demonstrate the utility of being able to encode prior information about the relevant control subalgebra into machine learning protocols whose objective is the output of a time-optimal sequence of control pulses, a design choice that requires information about precisely which generators are accessible.

Geometric optimisation
Selecting the specific control subalgebra and set of control amplitudes in order to generate time-optimal quantum circuits is a difficult task. Solving this optimisation problem using geometric techniques follows two broad directions in the quantum control and quantum circuit literature. One approach uses the symmetric space formalism and Cartan decompositions [1,11,29] to decompose the Lie algebra g associated with a given Lie group G into symmetric and antisymmetric subalgebras such that g = k ⊕ p. Here k is the control subalgebra (containing accessible generators) and p is the subalgebra generating the non-directly-controllable evolution of the system. If a suitable partition can be found satisfying certain Levi commutation relations (see [18,30]), then the Lie group admits a Cartan decomposition G = KAK. By doing so, the problem of selecting the appropriate set of generators τ ∈ k and control amplitudes is simplified (see Appendix (B 1) for a discussion and [1,18] in particular). A drawback of such methods as currently applied to problems in quantum control is their limited scope of application: they apply only to the limited class of symmetric space manifolds for which the methods were developed. Furthermore, the particular methods in [1] used to determine the appropriate generators are limited in their generality.
An alternative but related method, explored by Nielsen et al. in a range of papers [14][15][16][24], approaches the problem of finding optimal generators and controls via modifying the metrics applicable to cost functionals. In [24], geometric techniques are applied to determine the minimal-size circuit to exactly implement a specific n-qubit unitary operation, combining variational and geometric techniques from Riemannian geometry and detailing a method for determining the lower bound of circuit complexity and circuit size by reference to the length of the local minimal geodesic between U_T and I (where length is determined via a Finsler metric on SU(2^n)). In later work [14][15][16], particular metrics with penalty terms are chosen that assign higher weights to higher-order Pauli operators in order to steer the generating set towards one- and two-body operators, which are assessed as being optimal for geodesic synthesis (see Appendix (B 1 c) for a discussion). It is shown that in limiting cases, applying the variational techniques and penalty metric of Nielsen et al., the optimal set of generators are one- and two-body terms [31].
Such variational and penalty metric-based approaches have their drawbacks, however: there are limited convergence guarantees due to, for example, the existence of exponentially long Pauli geodesics (many unitaries have minimal Pauli geodesics of exponential length (see [32])), reliance upon complicated boundary conditions, and the difficulty of discovering homeomorphic maps with which to deform known geodesics into other geodesic paths [16,21,31]. The approach in the work of Nielsen et al. is also less general in that it assumes the entire distribution is the Lie algebra su(2^n).
A common characteristic of both approaches in the case of SU(2^n) is a preference for control algebras comprising only one- and two-body Pauli operators (operators that are tensor products of at most one or two Pauli operators) [14]. In [1,29], the rationale is that higher-order (more than two-body) generators introduce coupling terms which increase evolution time. In [15,16,24], this rationale manifests in the imposition of penalty metrics upon higher-order terms in cost functionals. This approach penalises higher-order generators by assigning to them a higher weighting in the metric, thereby penalising higher-order terms in the cost function which seeks to minimise the metric of interest (in the case of [15], often Finslerian metrics F).
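The penalty-metric idea can be sketched concretely. The code below (an illustration only; the penalty value q = 10 is an arbitrary assumption, and the exact penalised norm differs across [15,16,24]) decomposes an n-qubit Hamiltonian in the Pauli basis and weights any term acting on more than two qubits by q:

```python
import numpy as np
from itertools import product

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_coeffs(H, n):
    """Coefficients of a Hermitian H in the n-qubit Pauli basis (Hilbert-Schmidt)."""
    coeffs = {}
    for labels in product("IXYZ", repeat=n):
        P = PAULIS[labels[0]]
        for l in labels[1:]:
            P = np.kron(P, PAULIS[l])
        coeffs["".join(labels)] = np.trace(P @ H).real / 2 ** n
    return coeffs

def penalised_norm(H, n, q=10.0):
    """Penalised metric norm: coefficients of terms acting on more than two
    qubits ('body count' > 2) are weighted by q, steering optimisation towards
    one- and two-body generators."""
    total = 0.0
    for label, c in pauli_coeffs(H, n).items():
        body = sum(1 for l in label if l != "I")
        weight = q ** 2 if body > 2 else 1.0
        total += weight * c ** 2
    return np.sqrt(total)
```

For example, on three qubits the two-body term X⊗X⊗I has penalised norm 1, while the three-body term X⊗X⊗X has penalised norm q, so geodesics under such a metric avoid three-body directions where possible.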
Thus there are strong motivations for preferring one- and two-body generator control subalgebras when devising strategies for quantum circuit synthesis. It can be shown that for higher-order SU(2^n) systems, three-or-more-body generators can themselves be composed via one- and two-body generators [21] when these form a bracket-generating set [33]. These combined results motivate the selection of a (minimal) set of one- and two-body generators that can generate the entire Lie algebra su(2^n). This is a characteristic of the bracket-generating set (or distribution) ∆ adopted in [9], where instead of imposing metrics or relying on decompositions to obtain optimal control subalgebras, the control subalgebras are selected at the outset to comprise only one- and two-body terms. Such reasoning does not guarantee the utility of one- and two-body terms per se (see [21] for technical examples) but provides a basis for potentially preferring such generators when designing optimisation protocols, such as via machine learning, to approximate geodesics.
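The bracket-generating property can be checked directly: the Lie bracket of two two-body Pauli generators can produce a three-body generator. For instance, since XY = iZ and YX = -iZ, one finds [X⊗X⊗I, I⊗Y⊗X] = 2i X⊗Z⊗X, which the sketch below verifies numerically:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron(*ops):
    """Kronecker product of a sequence of operators, left to right."""
    out = ops[0]
    for op in ops[1:]:
        out = np.kron(out, op)
    return out

def comm(A, B):
    """Lie bracket [A, B] = AB - BA."""
    return A @ B - B @ A

# Two two-body generators on three qubits...
XXI = kron(X, X, I2)
IYX = kron(I2, Y, X)
# ...whose bracket is the three-body operator 2i * X⊗Z⊗X:
bracket = comm(XXI, IYX)
assert np.allclose(bracket, 2j * kron(X, Z, X))
```

Iterating such brackets from a suitable set of one- and two-body terms reaches the full algebra su(2^n), which is what makes a restricted control subalgebra sufficient for indirect accessibility.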

A. Overview
The difficulties of synthesising geodesics are well known throughout the geometric and control literature [34,35]. The geodesically-driven control methods articulated above face considerable challenges in terms of the complexity of the relevant boundary-value problem when adopting certain 'penalty' metrics designed to enforce the geodesic constraints on Finslerian manifolds. Though analytic or numerical (including machine learning) architectures are unlikely to provide means of systematically synthesising approximate geodesics and time-optimal unitary syntheses for arbitrary propagators or higher-dimensional Lie groups, they have potential utility for lower-order qudit systems.
In [9,21], an approach leveraging subRiemannian, rather than Riemannian, geometry is adopted in order to overcome some of these barriers to quantum circuit synthesis using geodesic approximations. SubRiemannian geometry [14,33,36] is a generalised form of Riemannian geometry that is well-developed in classical control contexts. In its simplest description, it covers typical geometries where only a subset of the full Lie algebra g is directly accessible.
For the purposes of quantum control, it is helpful to characterise subRiemannian manifolds in simplified Lie-theoretic terms (see [33] for a more formal treatment). For a given manifold G, the Lie algebra g comprises generators which also form a basis of the tangent space TG. Curves γ(t) along G are those generated by generators τ ∈ g, such that the generators may be thought of as tangent vectors tangent to the curves they generate. The curves which may be generated on a manifold in many ways characterise the manifold. A distinguishing feature of Riemannian and subRiemannian manifolds is the set of accessible generators. Riemannian manifolds are characterised by full direct access to g, that is, all generators in g may generate curves on G. In more formal language, the directions a curve may evolve along (or the subalgebra of its generators) is characterised by certain subsets ∆ of the tangent bundle TG for a manifold G. The distribution is also denoted the horizontal tangent space, which intuitively refers to tangent vectors being 'tangent' to and along the manifold, but more formally refers to the fact that the covariant derivative of those (generating) tangent vectors X along the curve is zero, that is ∇_γ̇(t) X = 0, which is characteristic of parallel transport. By contrast, it may be the case that only a subalgebra k ⊂ g, where g = k ⊕ p, is accessible for generation of curves on G. In this case, evolution of curves tangent to certain directions of tangent vectors in p is not directly possible. This set of directly inaccessible generators is orthogonal to the set k of horizontal tangent vectors and so can be thought of as in some sense vertical. More formally, the vertical subspace of TG comprises vectors X whose evolution along the curve γ(t) is such that ∇_γ̇(t) X ≠ 0 (having some component not tangent to the manifold). In this second case, the manifold is characterisable as subRiemannian rather than Riemannian. Elements of the vertical subspace p may still affect the evolution of curves, but
only indirectly to the extent the generators in p are able to be generated by the application of the Lie bracket (see (16) below), i.e. if the distribution is bracket-generating. A number of theorems of subRiemannian geometry [33] then guarantee the existence of certain normal subRiemannian geodesics on G which are both unique and minimal in length.
Thus, for generating circuits on G = SU(2^n), by constructing a distribution ∆ that is bracket-generating and comprises only one- and two-body generators, it can be shown [21] that normal subRiemannian geodesics may be generated which are minimal and unique, thus approximating the minimal circuits between I and U_T. In the next section, we detail the approach in [9] that leverages such subRiemannian geometric insights. We do so in order to provide background for the subRiemannian machine learning detailed in Parts III and IV below.

B. SubRiemannian Normal Geodesics
The motivation behind the approach in [9] is to solve the problem of finding time-optimal sequences of gates by approximating subRiemannian normal geodesics on Lie group manifolds [15,37-40] in order to synthesise target unitary propagators U_T ∈ SU(2^n). The basis of the approach is to firstly adopt the time-independent approximation (6) and express U_T as an approximate product of exponentials, where the U_j are referred to herein as (right-multiplicative or right-acting) subunitaries for convenience, again justifiable in the large-m, small-h limit where h = ∆t_j, the evolution time of each U_j. The terms c_j^k represent the amplitudes of the (square) control pulses applied to the k generators at time interval t_j for duration ∆t_j = h to generate unitary U_j (i.e. j indexes the segment, k indexes the control amplitude c^k paired with the generator τ_k). The method in [9] in effect becomes a 'bang-bang control' problem [41] in which the time-dependent Schrödinger equation is approximated by a sequence of time-independent solutions U_j, where control Hamiltonians H_j are applied via a constant amplitude c_j^k for discrete time interval h = 1/N (with N the number of segments). The term E(c) represents an embedding function in which the τ_i form a basis for the bracket-generating subset ∆ ⊂ su(2^n) of dimension m. By comparison with the conventional control setting described above (7), the coefficients c^k would correspond to v_j.
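As a concrete sketch, the embedding E(c) can be implemented directly: each segment Hamiltonian H_j = Σ_k c_j^k τ_k is exponentiated and the resulting subunitaries multiplied together. The generator choice {X, Y}, segment count and amplitudes below are illustrative assumptions, not values from [9]:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def exp_iH(H, h):
    """exp(-i h H) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * h * w)) @ V.conj().T

def embed(controls, generators, h):
    """E(c): map control amplitudes c_j^k to the ordered product of subunitaries."""
    d = generators[0].shape[0]
    U = np.eye(d, dtype=complex)
    for c_j in controls:  # j indexes the time segment
        H_j = sum(c * tau for c, tau in zip(c_j, generators))
        U = exp_iH(H_j, h) @ U  # apply the segment's subunitary U_j
    return U

rng = np.random.default_rng(0)
controls = rng.uniform(-1, 1, size=(10, 2))  # N = 10 segments, two generators in ∆
U_T = embed(controls, [X, Y], h=0.1)
print(np.allclose(U_T @ U_T.conj().T, np.eye(2)))  # True: the product is unitary
```

Each segment is exactly unitary by construction, so the embedded product remains on the group manifold regardless of the control values — a property the greybox layers below exploit.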
Because ∆ constitutes a generating set for the entire Lie algebra su(2^n), which in turn generates its associated Lie group SU(2^n), an arbitrary unitary U ∈ SU(2^n) can be obtained to arbitrary precision with sufficiently many products of exponentials. This results from the application of the Baker-Campbell-Hausdorff (BCH) theorem (see [24] for a generalised explication). The approach in [9,21] is to constrain application to cases where U may be synthesised as a product of polynomially many terms, meaning the number of exponentials (subunitaries) required to synthesise U is at most a polynomial function of n. We discuss the effect of increasing n on machine learning outcomes, such as fidelity measures, below.
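The arbitrary-precision claim can be illustrated numerically via a first-order Trotter split, a standard consequence of the BCH expansion (our own illustration, not code from [9]): the error between exp(-i(A+B)) and the m-fold product (exp(-iA/m) exp(-iB/m))^m shrinks as m grows.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_iH(H):
    """exp(-i H) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * w)) @ V.conj().T

A, B = 0.7 * X, 0.3 * Z          # two non-commuting Hermitian generators
target = exp_iH(A + B)
errs = []
for m in (1, 10, 100):
    step = exp_iH(A / m) @ exp_iH(B / m)
    approx = np.linalg.matrix_power(step, m)   # m-fold product of exponentials
    errs.append(np.linalg.norm(approx - target))
print(errs[0] > errs[1] > errs[2])  # True: error decreases as m grows
```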
In the control setting discussed above (in which each U_j is decomposed into its BCH product with coefficients c^k), each c_j^k to be found constitutes some optimal application of the generator τ_k. This is consistent with the result in [1] (see Appendix (B 1)), in which the minimum time for synthesising the target unitary propagator is given by the smallest summation of the coefficients (controls) of the generators, Σ_{i=1}^n |α_i|, which, in our notation, would be Σ_{k=1}^m |c^k|.
It is worth noting that the assumption in [1], and even [15] and other analytic results in control, is that in effect the controls can be applied 'instantaneously', such that the minimum time for evolution of a unitary (via the adjoint action of control generators on drift Hamiltonians) is lower-bounded by the evolution driven by the drift Hamiltonian H_d. That is, many such control regimes assume that control amplitudes can be applied without energy constraints, which is equivalent to being applicable within infinitesimal time. Often this assumption is justified by the fact that a control voltage may be many orders of magnitude greater than the energy scales of the quantum systems to be controlled. In cases where control amplitudes (for example, voltages) are, in any significant sense, upper-bounded, say by energy constraints, the time for optimal synthesis of circuits will of course increase, as the assumption of instantaneity will not hold. For our purposes, in a bang-bang control scenario and assuming evolution according to any drift Hamiltonian sets a lower bound on evolution time, we consider the control amplitudes c^k as applied for time h rather than instantaneously.

One-and two-body terms
As discussed above, the approach in [9,21] is in essence to circumvent the need for elaborate penalty terms in bespoke metrics to penalise higher-order generalised Pauli geodesic generators by instead simply constraining the control subalgebra, the distribution ∆, to comprise Kronecker products of one- and two-body Pauli operators, where σ_ι^j indicates the n-fold Kronecker (tensor) product with Pauli operators at position j ∈ {1, ..., m} and the two-dimensional identity operator at the other indices.
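The embedding of one- and two-body Pauli terms into the n-fold Kronecker product can be sketched as follows. For illustration we enumerate all one-body and all two-body terms; [9] selects a particular minimal bracket-generating subset, so the set below is an assumption for demonstration only.

```python
import numpy as np
from itertools import combinations, product

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(ops_at, n):
    """n-fold Kronecker product with Paulis at the given qubit positions,
    the 2-dimensional identity at all other indices."""
    out = np.array([[1.0 + 0j]])
    for q in range(n):
        out = np.kron(out, ops_at.get(q, I2))
    return out

def one_two_body_generators(n):
    """All one-body and all two-body Pauli terms on n qubits."""
    gens = [embed({q: P}, n) for q in range(n) for P in (X, Y, Z)]
    gens += [embed({q1: P1, q2: P2}, n)
             for q1, q2 in combinations(range(n), 2)
             for P1, P2 in product((X, Y, Z), repeat=2)]
    return gens

gens = one_two_body_generators(3)
print(len(gens))  # 3n one-body + 9·C(n,2) two-body = 9 + 27 = 36 terms for n = 3
```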
The underlying approach of the geodesic approximation method in [9,21] is to seek to learn the inverse map E^{-1} and thus, by doing so, learn the appropriate sequence of control pulses necessary to generate time-optimal evolution of unitaries and, consequently, time-optimal quantum circuits.
The method involves generating training data in the form of normal subRiemannian geodesics on SU(2^n) from I to U_T. The exponential product (6) represents a path along the SU(2^n) manifold; however, there may be an infinity of paths between I and U_T, such that the map E is not injective (or unique, thus minimal), meaning E^{-1} is not well-defined.

C. Generating geodesics
To solve this uniqueness problem, [9,21] propose to synthesise paths that approximate the minimal normal subRiemannian geodesics described above. To generate normal subRiemannian geodesics in SU(2^n), [9] limit the norm of the boundary conditions (a computational efficiency choice) and apply a generalised form of the Pontryagin Maximum Principle [41]. They follow well-established variational approaches in [42] where subRiemannian geodesics may be found by minimising the energy (cost) functional in which the norm ||·|| is the restriction of the bi-invariant norm (induced by the inner product on the tangent bundle) to ∆ ⊂ su(2^n). Here the curve γ(t) (path) varies over t ∈ [0, 1] with tangent vectors to the curve (i.e. along the vector field) given by γ̇(t). This approach uses variational methods to minimise the path length. To contextualise this formulation in Lie-theoretic terms, γ(t) represents unitaries U(t) ∈ SU(2^n) and γ̇(t) the corresponding tangent (Lie-algebraic) vectors. Distance along a path γ(t) generated by the tangent vectors (generators) γ̇(t) is measured in effect by metrics applied to the tangent space. The other key assumption behind this method is that the applicable metric g_αβ is constant.
The normal subRiemannian geodesic equations arising from minimising the energy functional above can be written in differential form [42]. It is worth unpacking each of these terms in order to connect the equations to the control and geometric formalism above, and because they are integrated into the subRiemannian machine learning model detailed below. The u term represents an element of the Lie algebra; as such, it represents a generator of evolutions on the underlying manifold SU(2^n). For each value of t, the curve γ(t) represents an element of the Lie group SU(2^n), again parametrised by t ∈ [0, 1]. The Λ terms also belong to the Lie algebra su(2^n). They differ from u in that, while the u are direct elements of the distribution ∆, the Λ(t) are elements of the overall Lie algebra su(2^n) that are generated by the Lie bracket between other Λ and u. The time derivative Λ̇ refers to how the Lie bracket (commutator) indicates the change in a vector field along the path γ(t). In a control setting, the Lie derivative tells us how much the generator/tangent vector Λ changes as it is evolved along curves γ(t) generated by elements u of the control subalgebra. For parallel transport along geodesics, as mentioned above, we require this change to be such that the covariant derivative of Λ_0 as it is parallel transported along the curve is zero. The last term (22) indicates that u resides in the distribution ∆ by virtue of the projection of Λ onto ∆. This projection function is important in that it ensures that the generators of the U_j remain within ∆, facilitating the parallel transport of Λ_0, and that the U_j are therefore able to be synthesised from the control subalgebra in our machine learning protocols. Here γ(t) ∼ U(t) and γ̇(t) ∼ Λ(t). The geodesic curves γ(t) depend on the initial condition Λ(0) (the 'momentum' term), with the initial 'position' in the manifold being the identity unitary. In the geometric control setting over Lie group manifolds,
such as unitary groups, one selects an initial generalised coordinate (akin to 'position' in the manifold) and generalised momentum, which in turn amounts to selecting an initial 'starting' unitary from the Lie group for the evolution at t = 0, usually the identity U(0) ∈ SU(2^n), along with a starting momentum Λ(0) drawn from the associated Lie algebra su(2^n). Given these initial operators, the geodesic equations then allow determination of tuples of unitaries and generators (positions in the Lie group manifold, momenta in the Lie algebra) for any particular time value t ∈ [0, 1]. That is, they provide a formula for determining U(t) and Λ(t). The distribution (control subalgebra) determines the types of geodesics that may be evolved along. Because the distribution is bracket-generating, in principle any curve along SU(2^n) may be synthesised in this way (though not necessarily directly). As noted in [21], the above set of equations can be written as a first-order differential equation (25). A first-order integrator (see (29)) is used to solve for γ(t) = U(t). It is worth analysing (25) in light of the discussion above on conjugacy maps and their relation to time-optimal geodesic paths. The γ(t) terms in the conjugacy map represent the forward-solved geodesic equations [9,25].
Given the initial condition Λ_0, γ(t) here is the cumulative evolved operator in SU(2^n); that is, it is evaluated at each time-step t_j as the ordered product of the subunitaries up to that step. In this respect, conjugating Λ_0 by γ(t_j) is equivalent to adopting a co-rotating basis, or so-called moving frame, for the Lie algebra (not dissimilar to how conjugation acts in a standard Euler decomposition such as in [27]). Projecting the conjugated Λ_0 back onto the horizontal space (i.e. ∆) then defines Λ_j as Λ_0 parallel transported along the approximate geodesic. This algorithmic approximation thus achieves (a) a way to parallel transport Λ_0 and (b) a decomposition method for generating the U_j. The continuous curve γ(t) is discretised by partitioning the parametrisation interval into N segments. A first-order integrator is then utilised to solve the differential equation. In continuous form, the integration equation for the unitary propagator applied over interval ∆t takes the time-dependent form (1) (note that in the accompanying code, the imaginary unit is incorporated into ∆). In the discrete case, the curve γ(t) is partitioned into N such segments of equal parameter interval h = ∆t, indexed by γ_j where j = 1, ..., N. The first-order integration resolves to (29), where the U_j are unitaries that forward-solve the geodesic equations, represented in terms of the Euler discretisation [21] (30), where, again to reiterate, γ_{j+1} represents the cumulative unitary propagator at time t_{j+1} and U_j represents the respective unitary that propagates γ_j → γ_{j+1}. The Hamiltonian H_j for segment U_j is given by the projection of the conjugated Λ_0 onto ∆, and is applied for time h (though see Appendix (C) below for nuances regarding the interpretation of h and time given the imposition of ||proj_∆(Λ_0)|| = ||u_0|| = 1). A consequence of these formal solutions is that each H_j is constrained to be generated from ∆. This does not mean that only unitaries directly generated by ∆ are reachable, as the adjoint action of unitaries (see (16)) gives rise to generation of generators outside
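The forward-solving scheme described above can be sketched in a few lines. The conjugation direction, sign conventions and normalisation below are our own assumptions (a sketch of the scheme, not the code of [9]); here we work with Hermitian generators and a Euclidean trace inner product for the projection onto ∆ = {X, Y}.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_iH(H, h):
    """exp(-i h H) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * h * w)) @ V.conj().T

def proj(M, basis):
    """Orthogonal projection onto span(basis) under the trace inner product."""
    d = M.shape[0]
    return sum(t * (np.trace(t @ M).real / d) for t in basis)

def forward_geodesic(Lam0, basis, N=200):
    """Forward-solve the normal geodesic equations: at each step, conjugate the
    momentum Λ0 into the moving frame, project onto ∆ to obtain H_j, then take
    a first-order Euler step γ_{j+1} = γ_j U_j."""
    h = 1.0 / N
    gamma = np.eye(Lam0.shape[0], dtype=complex)
    for _ in range(N):
        H_j = proj(gamma.conj().T @ Lam0 @ gamma, basis)  # transported momentum in ∆
        gamma = gamma @ exp_iH(H_j, h)
    return gamma

Lam0 = 0.8 * X + 0.6 * Z  # initial momentum with a component outside ∆
U_T = forward_geodesic(Lam0, [X, Y])
print(np.allclose(U_T @ U_T.conj().T, np.eye(2)))  # True: the path stays in SU(2)
```

Because each Euler step is exactly unitary, the discretised curve never leaves the group manifold even for coarse N, although the approximation to the true geodesic degrades as N shrinks.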
∆. It is, however, of relevance to the construction of machine learning algorithms seeking to learn and reverse-engineer geodesic approximations from target unitaries U_T. The consequence of this requirement is that the control functions for machine learning algorithms need only model controls for generators in ∆.

IV. Experimental Design
In this section, we detail our experimental design and implementation of various machine learning models that build upon and extend the work in [9] applying deep learning to the problem of approximate geodesic quantum circuit synthesis. The overall objective of our experiments was to compare the performance of a variety of different machine learning architectures in simulated environments in terms of generating time-optimal quantum circuits by being trained on approximate normal subRiemannian geodesics in SU(2^n) Lie groups. While other methods, such as the 'shooting' method [31], provide alternative means of generating geodesic data, it was shown in [9] that such methods, particularly for higher-order SU(8) cases, led to considerable increases in runtime compared with neural network approaches. In any case, as our primary focus in this work was on investigating the utility of greybox approaches to geometric machine learning architectures, such alternative methods of generating geodesics or approximations thereto (for example, implementing the methods of [15,16]) were not canvassed.

A. Experimental objectives
Synthesis of geodesics for use as training data in the various machine learning protocols utilised an adapted subRiemannian approach from [9] and [27]. Our overall objectives required the ability to decompose a target unitary U_T in order to generate the sequence (U_j) from U_T and, in turn, render each U_j synthesisable from a set of control amplitudes applied to generators from ∆. There are a variety of classical deep learning approaches that can be adopted to solve this type of supervised learning problem, including:

1. Standard neural network models: such models adopt variations on simply connected or other architectures that seek to learn an optimal configuration of hidden representations in order to output (and thus generate) the desired sequence. On their own, such models tend to be blackbox models, in which algorithms are trained to learn a mapping from inputs (training data) to outputs (labels) without any necessary interpretability or clarity about the nature of the mapping or intermediate features being generated by the network;

2. Generative models: generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), seek to learn the underlying distribution of ground-truth data, then use that learnt distribution to generate new similar data; and

3. Greybox models: greybox models, as discussed further on, seek to combine domain knowledge (such as laws of physics), also known as whitebox models, together with blackbox models into a hybrid learning protocol.
The actual engineering, target inputs and outputs of the various machine learning models differ depending upon the metrics of success and the use case. For a typical quantum control problem, the sought output of the architecture is the sequence of control pulses (c_j) = (c_j^1, ..., c_j^m) (where j indexes the relevant subunitary and k indexes the generator used to synthesise it at segment j) to be implemented in order to synthesise the target unitary (i.e. apply a gate in a quantum circuit). The target unitary U_T is typically one of one or more inputs into the model architecture.
The approach in [9] is blackbox in nature. In that case, the input to their model was (for their global decomposition algorithm) U_T, with label data the sequence (U_j). The aim of their algorithm, a multi-layered Gated Recurrent Unit (GRU) RNN, was to learn a protocol for decomposing arbitrary U_T ∈ SU(2^n) into an estimated sequence (Û_j) (sequences are indicated by parentheses). The individual Û_j are then fed into a subsequent simple feed-forward fully-connected neural network whose output is an estimated sequence of controls (ĉ_j) (where c_j is used as a shorthand for each control amplitude c^k applied to generators τ_k for segment j and parentheses indicate a sequence) for generating each Û_j using τ_k ∈ ∆. While Û_j need not itself be (and is unlikely to be) exactly unitary, so long as the controls (ĉ_j) are sufficient to then input into (13) to generate unitary propagators, the objective of learning the inverse mapping (18) has been achieved. No guarantees of unitarity from the learnt model are provided in [9]; instead there is a reliance upon simply finding (18) in order to provide (ĉ_j). As we articulate below, while this approach is feasible in theory, in practice, where unitarity is required within the network itself (as per our greybox method driven by batch fidelity objective functions), a more detailed engineering framework for the networks is required.

Geodesic deep learning architectures
Three primary deep learning architectures were applied to the problem of learning approximations to geodesics in SU(2^n):

1. a simple multi-layer feed-forward fully-connected (FC) network model implementing an adaptation of (13) that learns controls (c_j) trained against (U_j) (the FC Greybox model);

2. a greybox RNN model using GRU cells [43] in which controls (ĉ_j) for estimated Hamiltonians Ĥ_j are learnt without being trained against label controls (the GRU RNN Greybox model); and

3. a fully-connected subRiemannian greybox model (the SubRiemannian model) which generates controls (ĉ_j) by concurrently implementing (24) and learning the control pulses c_Λ0 for the initial generator Λ_0 (that is, a model that replicates the subRiemannian normal geodesic equations while learning initial conditions for respective geodesics).
Each model, described in more detail below, took as initial inputs the target unitary U_T together with the unitary sequences (U_j). Each model uses various neural network architectures to generate controls (ĉ_j) for generators τ_k ∈ ∆ (where Ĥ_j = Σ_k ĉ_j^k τ_k), which are in turn evolved via customised layers implementing (6) in order to generate estimates (Û_j). These estimates (Û_j) were then assessed via an MSE loss on an operator fidelity metric against a vector of ones (as perfect fidelity will result in unity). A second metric of average operator fidelity was also adopted to provide a measure of how well, on training and validation data, the networks were able to synthesise U_j with respect to the estimated Û_j.
Unlike the segmented neural networks for learning control pulses to generate specific U_j, the variable weights (and units) of the neural network were constructed with greater flexibility. The models tested are set out below, indicating the inputs, outputs and measures (here MSE((U_j), (Û_j)) refers to the batch fidelity MSE described below).
As detailed below, the penultimate layer of each model outputs an estimated sequence of subunitaries (Û_j). This estimated sequence was then compared to the true sequence (U_j) using operator fidelity (see (34) below). This estimate of fidelity F((U_j), (Û_j)) was then compared against a vector of ones (i.e. ideal fidelity), which formed the label for the models. As described below, the customised nature of the models meant intermediate outputs, including estimated control amplitude sequences (ĉ_j), Hamiltonian estimate sequences (Ĥ_j) and (Û_j), were all accessible.

Methods, training and testing procedures
Generation of training data for each of the models tested was achieved by implementing the first-order subRiemannian geodesic equations in Python, adapting Mathematica code from [9]. A number of adaptations and modifications to the original format of the code were undertaken: (a) whereas in [9] unitaries were parameterised only via their real components (to effect dimensionality reduction), relying upon an analytic means of recovering imaginary components [21], in our approach the entire unitary was realised such that U = X + iY. This was adopted to improve direct generation of target unitaries of interest and to facilitate fidelity calculations, such that our unitaries became expressed in a realised form with dim Û = dim SU(2^{n+1}); and (b) in certain iterations of the code for Λ_0 : [0, 1] → SU(2^n), the coefficients of the generators were derived using tanh activation functions, which allowed elements of unitaries to be more accurately generated and also allowed testing (see Appendix C) of whether the first-order integrative approach did indeed generate equivalent time-optimal holonomic paths (as in [27]). Our greybox machine learning architecture utilised tanh functions (with a range of −1 to 1) rather than the range [0, 1]. The reason for this is that, by doing so, we were able to better approximate the relevant time-optimal control functions which give rise to the generator coefficients (for example, to reproduce the holonomic paths of [27], one needs the coefficients to emulate the range of the sine and cosine control functions which characterise the time-optimal evolution in that case). Furthermore, (c) one observation from [9] was that the training data generated unitaries relatively proximal to the identity, i.e. curves that did not evolve far from their origin. This is a consequence of the time interval ∆t for each generator, i.e.
∆t = h = 1/n_seg, where n_seg is the number of segments. The consequence of this for our results was that training and validation performance was very high for U_T close to the identity (that is, similar to training sets), but declined for U_T further away (in terms of metric distance) from the origin. This is consistent with [9], but also consistent with the lack of generalisation performance in their model. As such, in some iterations of the experiments we scaled up h by a factor in order to obtain U_T which were more spread out across the manifold. Other experiments sought to increase the extent to which training data covered manifolds by increasing the number of segments U_j of the approximate geodesic while keeping h fixed (between 0 and 1). We report on the scale and segment-number dependence of model performance below.
In addition to these modifications, in certain experiments we also supplemented the [9] generative code with subRiemannian training data from a Python implementation of Boozer [27]. In this case, given the difficulty of numerically solving for arbitrary unitaries using Boozer's approach (whose solutions in the paper rely upon analytic techniques), we generated rotations about the z-axis by arbitrary angles θ (denoted η in [27]), then rotated the entire sequence of unitaries U_j by a random rotation matrix. This has the effect of generating subRiemannian geodesics with arbitrary initial boundary conditions and rotations about arbitrary axes, which in turn provided a richer dataset for training the various neural networks and machine learning algorithms.
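The augmentation step relies on the fact that conjugating every segment by a fixed unitary V rotates the whole path: the conjugated segments compose to the conjugated endpoint. A minimal sketch under our own conventions (segment count, seed and the z-axis rotation convention exp(-iθZ/2) are illustrative assumptions):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_iH(H, h=1.0):
    """exp(-i h H) for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * h * w)) @ V.conj().T

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi)
N = 20
seq = [exp_iH(Z / 2, theta / N) for _ in range(N)]  # N segments of a z-axis rotation

# Conjugate every segment by a random SU(2) element to rotate the axis
V_rand = exp_iH(sum(c * P for c, P in zip(rng.normal(size=3), (X, Y, Z))))
rot_seq = [V_rand @ U @ V_rand.conj().T for U in seq]

prod = np.eye(2, dtype=complex)
for U in rot_seq:
    prod = U @ prod
target = V_rand @ exp_iH(Z / 2, theta) @ V_rand.conj().T
print(np.allclose(prod, target))  # True: the rotated sequence composes to the rotated target
```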
For SU(2), the bracket-generating set ∆ can be any two of the three Pauli operators. Different combinations for ∆ were explored as part of our experimental process. Our experiments focused on setting the control subalgebra ∆ = {X, Y}, as this allowed ease of comparison with the analytic results of [27] and enabled assessment of how each machine learning model performed in cases where control subalgebras were limited, which was viewed as being more realistic in experimental contexts.
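That any two Paulis bracket-generate su(2) follows from one commutator; for instance, for ∆ = {X, Y}:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

comm = X @ Y - Y @ X
print(np.allclose(comm, 2j * Z))  # True: [X, Y] = 2iZ, so Z is reachable via the bracket
```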
Test datasets for generalisation, where the trained machine learning models are tested against out-of-sample data, were generated using the same subRiemannian generative code above. We also sought to test, for each of SU(2), SU(4) and SU(8), the efficacy of the models in generating sequences (Û_j) that accurately evolved to randomly generated unitaries from each of those groups. The testing methodology for geodesic approximation models comprised input of the target U_T of interest into the trained model with the aim of generating control pulses (ĉ_j) from which (Û_j) (and thus Û_T) could be generated.
Depending on the model architecture, neural network layers (either feed-forward fully-connected, RNN or customised models) then generated the variable-weight controls (ĉ_j). These control amplitudes are then fed into a customised Hamiltonian estimation layer which applies (ĉ_j) to the respective generators in ∆. The output of this Hamiltonian estimation layer is a sequence of control Hamiltonians (Ĥ_j), which are input into a second customised layer implementing quantum evolution (i.e. equation (6)) in order to output (Û_j). A subsequent custom layer takes (Û_j) and the true (U_j) as inputs and calculates their fidelity; that is, it takes as inputs batches of estimates (Û_j) and the ground-truth sequence (U_j) and calculates the operator fidelity [44] of each Û_j and U_j via (34), where d = dim U_j. It should be noted that in this case the unitaries are ultimately complex-valued (rather than in realised form) prior to fidelity calculations. The outputs of the fidelity layer are the ultimate output (labels) of the model (that is, the output is a batch-size-length vector of fidelities). These outputs are compared to a label batch-size-length vector of ones (equivalent to an objective function targeting unit fidelity). The applicable cost function used was standard MSE, but applied to the difference between ideal fidelity (unity) and actual fidelity, where n represents the chosen batch size for the models, which in most cases was 10 or a multiple thereof. It should also be noted that this approach, which we name 'batch fidelity', contributed significantly to improvements in performance: previous iterations of our experiments had engineered fidelity itself as a direct loss function using TensorFlow's low-level API, which was cumbersome, lacking in versatility and resulted in limited improvement by comparison with batch fidelity approaches. A standard Adam optimiser [45] (with α = 10^{-3}) was used for all models.
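The batch fidelity loss described above can be sketched as follows. We use the common normalisation F(U, Û) = |Tr(U†Û)|²/d² as an assumption; the paper's exact formula (34) may differ by normalisation.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def operator_fidelity(U, U_hat):
    """Gate fidelity |Tr(U† Û)|² / d² — unity iff Û matches U up to global phase."""
    d = U.shape[-1]
    return np.abs(np.trace(U.conj().T @ U_hat)) ** 2 / d ** 2

def batch_fidelity_mse(batch_true, batch_est):
    """MSE between the vector of per-pair fidelities and the all-ones label."""
    fids = np.array([operator_fidelity(U, V)
                     for U, V in zip(batch_true, batch_est)])
    return np.mean((1.0 - fids) ** 2)

print(batch_fidelity_mse([I2, I2], [I2, I2]))  # 0.0 at perfect fidelity
print(batch_fidelity_mse([I2], [X]))           # 1.0: Tr(X) = 0, so fidelity is 0
```

Framing the label as a vector of ones keeps the loss a standard Keras-style MSE over batch outputs, rather than a bespoke low-level loss function.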

Geodesic architectures: Fully-connected Greybox
To benchmark the performance of the greybox models, a blackbox model was constructed that sought to input U_T and output (ĉ_j), using a simple deep feed-forward fully-connected layer stack taking as input U_T and outputting a sequence of estimated control amplitudes (ĉ_j). Subsequent customised layers construct estimates of Hamiltonians by applying (ĉ_j) to the generators in ∆, which are in turn used to generate subunitaries Û_j.
The stack comprised an initial fully-connected feed-forward network with standard clipped ReLU activation functions (with dropout ∼ 0.2) that was fed U_T. This stack fed into a subsequent dense layer, utilising tanh activation functions, that output (ĉ_j). Standard MSE loss against the label data (c_j) was used (akin to the basic GRU in [9]). The sequence (U_j) was then reconstructed using (ĉ_j) external to the model, and fidelity was assessed separately. A schema of the model is shown in Figure (2). In this variation of the feed-forward fully-connected model, a basic greybox approach instantiating the approximation (6) was adopted.
Greybox approaches [28] represent a synthesis of 'blackbox' approaches to machine learning (in which the only known data are the inputs and outputs of a typical machine learning algorithm whose internal dynamics remain unknown or uninterpretable) and 'whitebox' approaches, where prior knowledge of systems, such as knowledge of applicable physical laws, is engineered into algorithms. As outlined below, practically this means customising layers of the neural network architecture to impose specific physical constraints and laws of quantum evolution in order to output estimated Hamiltonians and unitaries. The motivation for this approach is that it is more efficient to engineer known processes, such as the laws of quantum mechanics, into the neural network architecture than to devote computational resources to requiring the network to learn what is already known to be true (and necessary for it to function effectively), such as Schrödinger's equation.
The greybox architecture used to estimate the control pulses necessary to synthesise each U_j is set out below. This is achieved by using τ_i ∈ ∆ to construct estimates of Hamiltonians Ĥ and unitaries Û. The inputs (training data) to the network are twofold. Firstly, unitaries Û are generated by a Hamiltonian composed of generators in ∆ with uniform randomly chosen coefficients c^k ∈ [−1, 1], where the negative values represent, intuitively, tangent vectors pointing in the opposite direction along a Lie group manifold (recalling that i is absorbed into τ_k for convenience). The coefficients ĉ_j are constructed via successive feed-forward fully-connected dense layers before being applied to the generators: they are the optimal controls being sought and represent updatable weights in the network. Secondly, a tensor of training U_j (generated extrinsically from ∆) is separately input into the network. Because TensorFlow layers require real-float inputs and outputs, U_j is separated into real Re(U_j) and imaginary Im(U_j) parts which are recombined in a customised layer. The specific controls being learnt by the network were accessible using standard TensorFlow techniques that allow access to intermediate output layers. This approach also provides a way to access the other intermediate outputs, such as Ĥ_j and Û_j.
For training and validation of the model, we have the following inputs and outputs:
• Inputs: U_T (the target unitary) and the training sequence (U_j); and
• Outputs: fidelities F(Û_j, U_j) ∈ [0, 1], comparing the estimated sequence (Û_j) against the training data.
In the model, U_T is fed into the feed-forward fully-connected stack, followed by a dense flattened layer, to produce a coefficient ĉ_k for each generator τ_k ∈ Δ in (13). Thus for a model with n_seg segments and dim Δ = d, a total of n_seg × d coefficients ĉ_k are generated.
These are then applied to the generators τ_k in a customised Hamiltonian estimation layer. The output of this layer is input into a unitary layer that generates each subunitary, producing the estimated sequence of unitaries (Û_j).
A subsequent custom layer calculates F(Û_j, U_j). The output of this layer is a (batched) vector of fidelities, which is compared against a label vector of ones using a standard MSE loss function and the Adam optimiser. This model is the simplest of the greybox models adopted in our experiments. Pseudocode for the Fully-connected Greybox model is set out in Appendix (A 1).
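A minimal sketch of the batch fidelity and loss computation, assuming the common gate fidelity F = |Tr(U†Û)|²/d² (the paper's precise fidelity definition may differ):

```python
import numpy as np

def gate_fidelity(U_est, U):
    """Operator fidelity F = |Tr(U† U_est)|^2 / d^2; equals 1 iff
    U_est matches U up to a global phase (assumed convention)."""
    d = U.shape[0]
    return np.abs(np.trace(U.conj().T @ U_est)) ** 2 / d ** 2

def batch_fidelity_mse(U_est_seq, U_seq):
    """Training loss: MSE of the fidelity vector against a label of ones."""
    F = np.array([gate_fidelity(Ue, U) for Ue, U in zip(U_est_seq, U_seq)])
    return np.mean((F - 1.0) ** 2)

I2 = np.eye(2, dtype=complex)
Z = np.diag([1.0 + 0j, -1.0 + 0j])
perfect = batch_fidelity_mse([I2, Z], [I2, Z])  # exact estimates give zero loss
```

Minimising this loss drives every fidelity in the batch towards 1, which is why the label is simply a vector of ones.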

Geodesic architectures: GRU RNN Greybox
The second category of deep learning architectures explored in our experiments comprises RNN algorithms [46, 47]. RNNs are a popular deep learning tool for modelling sequential datasets, such as time series or other data in which successive data points depend upon preceding ones. The interested reader is directed to a number of standard texts [47] covering RNN architectures for an overview. In short, RNNs are modular neural networks comprising 'cells': self-enclosed neural networks taking training data as input, producing outputs, and receiving a secondary input from preceding cells. For sequential or time-series data, a sequence of such cells is connected, one for each entry or time step j in the series. The intuition behind gated RNNs, such as Long Short-Term Memory (LSTM) networks, is that inputs from previous time-step cells, or 'memories', can be carried forward throughout the network, enabling it to more accurately learn patterns in sequential non-Markovian datasets. The original application of GRU RNNs to the geodesic synthesis problem was the focus of [9]. That work utilised a relatively simple network of GRU layers [48], popular due to the efficiencies they can provide in training.
In the present case, the aim of the GRU RNN is to generate a model that can decompose a target unitary U_T into a sequence (U_j) reachable from I ∈ SU(2^n). That is, the GRU RNN seeks to reverse-engineer the geodesically approximate sequence of subunitaries through a learning protocol that is itself sequential. In this model, the index j of the sequence (U_j) is akin to a 'time slice'. At each slice j, the unitary U_j is input into the corresponding GRU cell G_j (one for each segment). The cell activation functions were set to the tanh function, as its range of [−1, 1] accords with the range of elements of the desired subunitaries (the function tanh over operators (matrices) is understood in the usual way; see Appendix D for background). The output of the GRU cell G_j then becomes, with a certain probability, an input into the successor GRU cell G_{j+1}, which also takes as input the successor subunitary U_{j+1}.
The output of the GRU RNN is a sequence of control amplitudes (ĉ_j), which were then applied to generators in Δ in a custom Hamiltonian estimation layer in TensorFlow in order to construct Hamiltonian estimates, followed by quantum evolution layers to generate the estimated subunitaries Û_j. As with the other models above, the sequence (Û_j) was then input into a customised batch fidelity layer for comparison against the corresponding (U_j). Our variations of the basic GRU RNN differed in that, rather than simply concatenating and flattening all (U_j) into a single long vector for input into a single GRU cell, each U_j was associated with its time slice j, the objective being a discretised output of (Û_j).
Our main adaptation to the standard GRU RNN model was to include customised layers, as described above, such that the outputs (Û_j) were themselves generated by inputting learnt coefficients (ĉ_j) into custom Hamiltonian estimation layers (containing generators from Δ), followed by quantum evolution layers (exponentiation) to generate the estimates. In this respect we followed the novel approaches developed in [8, 28], particularly around sequential Hamiltonian estimation (though we restricted ourselves throughout to square pulse forms for (c_j), rather than also trialling Gaussian pulses).
Here the aim of the GRU is to replicate the algorithmic approach in [9], for example learning how Λ_0 is conjugated by U_j in the generation of U_{j+1}. Again, this represents in effect a form of 'whitebox' engineering in which assured knowledge, namely how unitaries approximately evolve under the cumulative action of subunitaries, is encoded into customised layers within the network (rather than having the network 'deduce' this process). Pseudocode for the GRU RNN Greybox model is set out in Appendix (A 2). A schema of the model is shown in Figure (3).

Geodesic architectures: SubRiemannian model
The third model architecture developed in our experiments (the SubRiemannian model) expanded upon principles of greybox network design and subRiemannian geometry in order to generate approximations to subRiemannian geodesics. The choice of architecture was motivated by insights from the variational means of generating subRiemannian geodesics [35, 42], namely that a machine learning model that effectively leverages known or assumed knowledge regarding the evolution of unitaries and their generation should perform better than more blackbox-oriented approaches. In essence, the model algorithmically implemented the recursive method of generating approximate subRiemannian geodesics, which relies only upon Λ_0 and Δ. The network needed only to learn the choice of initial condition Λ_0, rather than how to construct Hamiltonians or evolve according to the laws of quantum mechanics (which were instead dealt with via customised layers).
The method in [9] assumes certain prior knowledge or information in order to generate output, including: (a) the distribution Δ, i.e. the control subalgebra in an experiment of interest; (b) the form of the variational equations giving rise to normal subRiemannian geodesics; and (c) hyperparameters, such as the number of segments in each approximation and the time step h. The form of (24) provides (via the trace operation) the control amplitudes ĉ_j for each generator of the Hamiltonian Ĥ_j. Once the initial generator Λ_0 is selected, given these prior assumptions, the output geodesic approximation is predetermined. This characterisation was then used to design the network architecture: the inputs to the network were target unitaries U_T, together with the associated sequence (U_j) and the control subalgebra Δ.
The aim of the network was, given the input U_T, to learn the control amplitudes for generating the correct Λ_0 which, when input into the subRiemannian normal geodesic equations, generated the sequence (Û_j) from which U_T could be obtained (thus resulting in a global decomposition of U_T into subunitaries evolved from the identity). Recall that Λ_0 is composed from su(2^n) or Δ depending on use case (the original paper [9] selects Λ_0 ∈ su(2^n)). This generated Λ_0 was then input into a recursive customised layer performing the projection operation (24) to output estimated Hamiltonians, followed by a quantum layer that ultimately generated the sequence (Û_j). The sequence (Û_j) was then input into a batch fidelity layer for comparison against the true (U_j). Once trained, the network could be used to predict Λ_0, (U_j), the sequence of amplitudes (c_j) and (Û_j), each being accessible via the creation of sub-models that access the respective intermediate custom layer used to generate such output. Pseudocode for the SubRiemannian model is set out in Appendix (A 3). A schema of the model is shown in Figure (4).
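The recursive generative procedure that the SubRiemannian model's custom layers implement can be sketched as follows. This is a NumPy illustration of the project-and-evolve loop as described above; the exact form of the projection (24), the inner-product normalisation, and the SU(2) distribution used here are assumptions for illustration.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0 + 0j])
Delta = [X, Y]      # illustrative control distribution
h, n_seg = 0.1, 10  # time step and number of segments

def evolve(H, t):
    """U = exp(-i H t) via eigendecomposition of Hermitian H."""
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

def geodesic_sequence(Lam0, Delta, h, n_seg):
    """Recursive approximate normal subRiemannian geodesic: at each step,
    conjugate Lam0 by the accumulated unitary, project onto Delta via the
    Hilbert-Schmidt (trace) inner product to obtain the Hamiltonian, then
    evolve one step of length h."""
    U = np.eye(Lam0.shape[0], dtype=complex)
    seq = []
    for _ in range(n_seg):
        Lam = U @ Lam0 @ U.conj().T
        c = [np.real(np.trace(g.conj().T @ Lam)) /
             np.real(np.trace(g.conj().T @ g)) for g in Delta]
        H = sum(ck * g for ck, g in zip(c, Delta))
        U = evolve(H, h) @ U
        seq.append(U)   # cumulative propagator at segment j
    return seq

Lam0 = 0.3 * X + 0.7 * Z   # initial condition drawn from su(2)
seq = geodesic_sequence(Lam0, Delta, h, n_seg)
```

Once Λ_0 is fixed, everything else in the loop is deterministic, which is precisely why the network need only learn the coefficients of Λ_0.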
As we discuss in our results section, this architecture provided among the highest-fidelity performance, which is unsurprising given that it effectively reproduces the subRiemannian generative method in its entirety. One point to note is that, while this architecture produced the best performance in terms of fidelity, in terms of the actual learning protocol (i.e. the extent to which the network learns, as measured by declines in loss), it was less adaptive than other architectures. That is, while having lower overall MSE, it was initialised with a lower MSE, which then declined less. This is not unexpected given that, in some sense, the neural network architecture combined with the whitebox subRiemannian generative procedure overdetermines the task of learning the coefficients of the single generator Λ_0 used as an initial condition. The other point to note is that in [9], Λ_0 ∈ su(2^n), i.e. it is drawn from the full Lie algebra rather than just Δ (intuitively because it provides a random direction in the tangent space from which to commence evolution). From a control perspective, however, if one only has access to Δ, one cannot necessarily synthesise Λ_0, so a second set of experiments with Λ_0 ∈ Δ was undertaken. The applicability of the SubRiemannian model as a means of solving the control problem relates more directly to this second case.

Geodesic architectures: GRU & Fully-connected (original) model
In order to benchmark the performance of the greybox models described above, we recreated the original global and local machine learning models utilised in [9]. In that paper, the global model utilised a simple shallow-depth GRU RNN taking U_T as input and outputting sequence estimates (Û_j) (being trained on the true (U_j)). In this global decomposition, each element of each U_j is in effect a trainable weight, with the GRU RNN returning the full (Û_j) rather than only the control amplitudes as intermediate layers, as in our GRU RNN Greybox model. The local model took U_j as input and output the control amplitude estimates (ĉ_j), from which the sequence (Û_j) could be reconstructed using Δ. In [9], in order to reduce the parameter size of the model, the original global model was trained only on the real part of (U_j), on the basis that the imaginary part could be recovered via application of the unitarity constraint (see [21] for details).
To learn the individual U_j segments of the approximate geodesic unitary path, we adapted, while substantially modifying, the approach in [9]. In that paper, U_j segments were learnt by feeding the real part of a vectorised (i.e. flattened) unitary U_j into a simple three-layer feed-forward fully-connected neural network. The labels for the network were the control pulse amplitudes c_j.
In recreating these models, we found that using only the real part of the unitaries was insufficient for model performance overall. We therefore included both real and imaginary parts, both for model performance and because it is unclear whether training on the real parts alone affects the way in which the networks integrate information about the imaginary parts. Furthermore, the approach in [9] did not use measures, such as fidelity, of more utility to quantum information practitioners; our model therefore extended the original models by reconstructing the unitaries from the estimated controls (ĉ_j).

A. Overview
The motivation behind the architectures above is to develop protocols by which time-optimal quantum circuits may be implemented via sequences of control pulses applied to quantum computational systems. In this respect, the objective is for the architectures to receive a target unitary U_T as input and output a sequence of control pulses (c_j) necessary to synthesise the estimate Û_T that optimises the fidelity F(Û_T, U_T). Our experimental method sought to enable comparison of blackbox and greybox methods across the synthesis of unitary propagators (gates) in SU(2^n) for n = 1, 2, 3, and in principle higher-order groups, in order to achieve this objective. We also sought to gain insight into the hyperparameters of model architecture by shedding light on, for example, the optimal number of segments and the size and composition of the training data.
Throughout our experiments, we observed that the selection of hyperparameters for both the training data and the models had a significant impact on performance. For example, as we discuss below, different values of h = Δt exhibited a noticeable impact on performance in terms of training/validation batch fidelity MSE and generalisation to test sets. For this reason, we extended our experiments to include progressively increasing values of h, from h = 1/n_seg to around h ≈ 1.
Generalisation of the models was tested by assessing the fidelities of the (Û_j) output by the trained models and by independently reconstructing (Û_j) from the estimates of the control coefficients (ĉ_j). In this respect, our architecture benefited from the customised layering, in that intermediate outputs of the models, such as control coefficients, sequences of estimated Hamiltonians (Ĥ_j), the unitary sequences (U_j) and fidelities, could all easily be extracted from the models using TensorFlow's standard Keras functional API. As discussed in [8, 28], one of the benefits of this type of architecture is that it allows practitioners to 'open' the machine learning box, as it were, to validate at intermediate steps that the whitebox outputs of the model match expectations, which in turn is useful for model tuning and engineering.

B. Tables and charts
Experimental results are set out in the tables and figures below. In Table (I), each of the four models was trained and evaluated against training data from SU(2), SU(4) and SU(8). For the greybox models, batch fidelity MSE was chosen as the loss function. For the GRU & FC Blackbox model, which replicated (subject to the inclusion of imaginary parts of unitaries in training) the original machine learning models in [9], standard MSE comparing the real unitary sequences (U_j) and estimates (Û_j) was used. Average gate fidelities for training and validation data sets were also recorded, with the order of magnitude of the standard error provided in parentheses. Bold entries indicate the best MSE and fidelity metrics for models trained on SU(2), SU(4) and SU(8) training data respectively.

A. Geodesic approximation performance
As can be seen from Table (I), the in-sample (training/validation) performance of the models varied considerably between blackbox and greybox approaches. From a use-case and training data perspective, while the SubRiemannian and GRU RNN Greybox models outperformed the existing benchmark in [9] in terms of in-sample batch fidelity MSE loss, we see a decline in the quality of estimations of U_j ∈ SU(2^n) for higher n. MSE overall increases with dimension n, which is not unexpected.

B. Greybox improvements
As can be seen from Table I, the greybox models in general significantly outperformed (with fidelities around the 0.99 mark) the generic blackbox models (with fidelities in the order of 0.70) for in-sample training and validation experiments, for all greybox models and all training data sets (Λ_0 ∈ su(2^n) and Λ_0 ∈ Δ). By comparison with the existing approaches in [9] and blackbox models that seek to directly synthesise control sequences (c_j) or unitary sequences (U_j), the SubRiemannian and GRU RNN Greybox models outperformed (in batch fidelity MSE) the FC Greybox model by several orders of magnitude. This is most apparent in Figure (5), where the SubRiemannian model performs the best of the models, though it exhibits overfitting at around the 80-epoch mark. These improvements were also accompanied by functional benefits, such as guarantees of the unitarity of Û_j.
Figures (5), (6) and (8) show training and validation loss for the three models for the cases of SU(2), SU(4) and SU(8) respectively, for 1000 training examples, 10 segments, h = 0.1 and 500 epochs. All models exhibit a noticeable flatlining of the MSE loss after a relatively short number of epochs, indicative of the models saturating (reaching capacity for learning), a phenomenon accompanied by predictable overfitting beyond such saturation points. For small h ≈ 0.1, the batch fidelity MSE is already at very low levels, of the order ∼10^−5. Again, we see these persistently low MSEs as indicative of a highly determined model in which the task of learning Λ_0 (at least for smaller-dimensional SU(2^n)) is overdetermined, given large hidden layers (with 640 neuron units each) together with a prescriptive subRiemannian method. From one perspective, such highly determined architectures as the SubRiemannian model have less applicability beyond the particular use case of learning the subRiemannian geodesic approximations specified by the method in [9]. By comparison, the FC Greybox, which is a more generalisable architecture (not restricted to whitebox engineering of the subRiemannian algorithm), exhibits relatively high performance measures of low MSE and high fidelity.

FIG. 7: As can be seen, larger h leads to deterioration in performance (higher MSE). However, smaller h can lead to insufficiently long geodesics, leading to a deterioration in generalisation. Setting h = 0.1 (red curves) exhibits the best overall performance. Even a jump to h = 0.5 (blue curves) exhibits an increase in MSE and a decrease in performance by several orders of magnitude (and similarly for h = 1).
While the SubRiemannian model performed best in the case of Λ_0 ∈ su(2^n), as is evident from Tables (I) and (II), the GRU RNN Greybox model performed almost as well for SU(2) and moderately outperformed the SubRiemannian model for SU(4) and SU(8) in most cases. The GRU RNN Greybox model was noticeably faster to train than the FC Greybox model (by several hours) and slightly quicker to train than the SubRiemannian model, but it also flatlines (saturates) relatively early, as evident in Figure (9). This is notable considering that the GRU RNN Greybox model has more parameters than the SubRiemannian model (which ostensibly needs only to learn control amplitudes for Λ_0 generation), and is consistent with the demonstrated utility of GRU neural networks for certain quantum control problems [28]. One possible reason for the differences between the GRU RNN Greybox and SubRiemannian models may lie in the sensitivity of Λ_0: the SubRiemannian model's only variable degrees of freedom, once initiated, are the relatively few weights c_k learnt in order to synthesise Λ_0. As the dimension of SU(2^n) grows, the coefficients of Λ_0 become increasingly sensitive; that is, small variations in c_k have considerable consequences for shaping the evolution in higher-dimensional spaces. In a sense, Λ_0 bears the entire burden of training and so becomes hypersensitive, requiring ever finer-grained tuning. This is in contrast to the GRU, for example, where the availability of more coefficients c_k means each individual coefficient need not be as sensitive (can vary more) in order to learn the appropriate sequence.

FIG. 8: (Main plot, first 100 epochs.) The GRU RNN Greybox (blue line) performs best in terms of batch fidelity MSE on training and validation sets. As shown in the inset, the GRU RNN Greybox levels out (saturates) after about 100 epochs; overall it performed the best of the models, rendering average operator fidelities of around 0.998. The SubRiemannian model (red) performed less well than the GRU RNN, still recording high average operator fidelity but exhibiting overfitting, as can be seen by the divergence of the validation (dashed) curve from the training (solid) curve. The FC Greybox rapidly saturates for large n_train and exhibits little in the way of learning. All models render high average operator fidelity (> 0.99) and saturate after around 150 epochs (see inset).

C. Segment and scale dependence
The experiments run across the various training sets indicated model dependence on the number of segments and the scale h. As can be seen from Figure (10), we find that, not unexpectedly, the performance of the models depends upon the training data. In particular, model performance measures such as MSE and fidelity clearly depend upon the time interval h = Δt: where h is small, i.e. the closer the sequence (U_j) is to approximating the geodesic section, the lower the MSE and the higher the fidelity. The effect on model performance is particularly evident in Figure (7), where increasing h from 0.1 to 0.5 leads to a deterioration in loss of several orders of magnitude (particularly for h > 0.3). As the step size h increases, the resultant curve less closely approximates a geodesic. Furthermore, for larger step sizes, the conditions required for the assumption of time independence in unitary evolution (6) are less valid.

FIG. 9: Training and validation loss (MSE): GRU RNN Greybox. G = SU(2), n_train = 1000, n_seg = 100, h = 0.1, epochs = 500. This plot shows the MSE loss (for training and validation sets) for the GRU RNN Greybox model where the number of segments was increased from 10 to 100. As can be seen, the model saturates rapidly once segments are increased to 100 and exhibits no significant learning. Similar results were found for the SubRiemannian model. This result suggests that simply changing the number of segments is insufficient for model improvement. One solution to this problem may be to introduce variable or adaptive hyperparameter tuning into the model such that the number of segments varies dynamically.

D. Generalisation
In order to test the generalisation of each model, a number of tests were run. In the first case, a set of random target unitaries Ũ_T from the relevant SU(2^n) group of interest was generated. These targets Ũ_T were then input into the SubRiemannian and GRU RNN Greybox models, which output the estimated approximate geodesic sequences (Û_j) propagating from the identity to Ũ_T. An estimated endpoint target Û_T for the approximate geodesic was generated by accumulating (Û_j), i.e. Û_T = ∏_j Û_j. This estimate Û_T was then compared against Ũ_T to obtain a generalised gate fidelity F(Ũ_T, Û_T) for each test target unitary. Second, because the fidelities of test unitary targets varied considerably, a second fidelity calculation was undertaken in order to test the extent to which higher fidelities may be related to similarity to the underlying training set of target unitaries {U_T}_train on which the models were trained: the average gate fidelity F(Ũ_T, {U_T}_train) of Ũ_T with respect to {U_T}_train was calculated.
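A sketch of the endpoint-accumulation test, assuming each Û_j is a per-segment subunitary (so the endpoint is their ordered product) and the gate fidelity convention F = |Tr(U†Û)|²/d² used earlier; the z-rotation example is illustrative:

```python
import numpy as np

def endpoint_from_sequence(U_seq):
    """Accumulate a sequence of per-segment subunitaries into the
    endpoint estimate U_T = U_N ... U_2 U_1 (later segments act last)."""
    U = np.eye(U_seq[0].shape[0], dtype=complex)
    for Uj in U_seq:
        U = Uj @ U
    return U

def gate_fidelity(U_est, U):
    d = U.shape[0]
    return np.abs(np.trace(U.conj().T @ U_est)) ** 2 / d ** 2

def Rz(theta):
    """Single-qubit z-rotation exp(-i theta sigma_z / 2)."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

# Worked check: two z-rotations compose to one rotation by the summed angle,
# so the accumulated endpoint matches the target with fidelity 1.
U_hat = endpoint_from_sequence([Rz(0.3), Rz(0.5)])
F = gate_fidelity(U_hat, Rz(0.8))
```

If a model instead outputs cumulative propagators (each Û_j already evolved from the identity), the endpoint is simply the final element rather than the product.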
Correlations between the two fidelities were then assessed.
In the third case, for SU(2) models trained on training data where Λ_0 ∈ Δ, the random test unitaries were replaced by Ũ_T comprising z-rotations by random angles θ ∈ [−2π, π]. The rationale was to test the extent to which a model based upon restricted control subalgebra training and architecture could replicate, with high fidelity, unitaries generated only from Δ, for the single-qubit case of SU(2) where analytic solutions to the time-optimal synthesis of subRiemannian geodesics are known [27].
Generalisation of both the GRU RNN Greybox and SubRiemannian models trained on SU(2) met with mixed success. Figure (12) plots the relationship between generalised gate fidelity and similarity to training set data for the SubRiemannian model trained on data generated where Λ_0 ∈ Δ. In this case, the test unitaries Ũ_T were rotations by a random angle θ about the z-axis. No particular relationship between Ũ_T and the training set is apparent. Figure (14) plots the same generalised gate fidelities against θ. Once again, there is no immediately discernible pattern between the angle of the z-rotation and the fidelity of the estimate of Ũ_T. We do see that high (above 0.99) fidelities are distributed across the range of θ, and that there is some hollowing out of fidelities between the extremes of 0 and 1.
The out-of-sample performance of both the SubRiemannian and GRU RNN Greybox models (in both cases limited to generators from Δ) for random unitaries drawn from SU(4) and SU(8) was significantly worse than for SU(2). Average generalised gate fidelities were below 0.5 for each of the models tested. This is not unexpected, given that the number of parameters the models must deploy in order to learn the underlying geodesic structure increases considerably as the Lie group dimension expands. A larger training set may have some benefits; however, the saturation of the models suggests that, at least for the models deployed in the experiments described above, expanding training sets is unlikely to systematically improve generalisation. Devising strategies to address both model saturation and the ways in which expanded training data could be leveraged to improve generalisation remains a direction for future work.

VII. Conclusion and future directions
This work presents a comparative analysis of greybox models for learning and generating candidate quantum circuits from time-optimally generated training data. The results from the experiments above present a clear case for the benefits of greybox machine learning architectures for specific applications in quantum control. The increase in performance over blackbox models, as evidenced by training and validation average operator fidelities for synthesised quantum circuits of over 0.99 in each case, demonstrates that machine learning based methods of quantum circuit synthesis can benefit from customised architecture that engineers known or necessary information into learning protocols. This is especially the case for quantum machine learning architectures: to the extent that learning protocol resources need not be devoted to rediscovering known information or relationships, such protocols can better leverage the power of blackbox neural networks on those parts of problems which are unknown.
While the models outperformed current benchmarks on training and validation sets, they faced considerable challenges in generalising well. Future work that may improve upon performance could include exploring hyperparameter learning, such as dynamically learning optimal numbers of segments, the time scale h (including variable time scales), or other metrics (such as Finslerian metrics) for use within the subRiemannian variational algorithm itself. The cross-disciplinary intersection of geometry, machine learning and quantum information processing provides a rich seam of emergent research directions for which the geometric quantum control and greybox machine learning architectures explored in this work are potentially useful. It is important to note that the methods developed in this work, particularly the SubRiemannian model and the GRU RNN Greybox, were both specifically engineered for particular objectives. While the models and experiments developed in this work were tailored to the particular problem of learning subRiemannian normal geodesics for quantum circuit synthesis, the overall architectural framework, in which geometric knowledge is encoded into machine learning protocols, has potential for useful application in quantum and classical information processing tasks. Future work building upon the greybox machine learning results in this work could include an exploration of ways to combine the extensive utility of the symmetric space formalism, the methods of Cartan and other geometric techniques with machine learning.

Fully-connected Greybox model
Pseudocode for the Fully-connected Greybox model is set out below. Note that TensorFlow inputs required (U_j) to be separated into real Re(U_j) and imaginary Im(U_j) parts and then recombined for input into fidelity calculations. Note that the cost function C(F, 1) below is implicitly a function of the c_k (the sequence of which is (c_j)), which are the variable weights in the model. Here γ is the learning rate for the gradient update and θ the trainable weights of the model.
GRU RNN Greybox model
Pseudocode for the GRU RNN Greybox model is set out below. Note that TensorFlow inputs required (U_j) to be separated into real Re(U_j) and imaginary Im(U_j) parts and then recombined for input into fidelity calculations. Note that the cost function C(F, 1) below is implicitly a function of the c_k (the sequence of which is (c_j)), which are the variable weights in the model. Here γ is the learning rate for the gradient update and θ the trainable weights of the model.

SubRiemannian model
Pseudocode for the SubRiemannian model is set out below. Note that TensorFlow inputs required (U_j) to be separated into real Re(U_j) and imaginary Im(U_j) parts and then recombined for input into fidelity calculations. Note that the cost function C(F, 1) below is implicitly a function of the c_k (the sequence of which is (c_j)), which are the variable weights in the model. Here γ is the learning rate for the gradient update and θ the trainable weights of the model.

Simulation Design
Simulation of training datasets for use in the machine learning models was undertaken in Python. The simulation was adapted from the Mathematica code accompanying [9], with a number of adaptations. The code is constructed as a class with the following hyperparameters: (i) n, the number of qubits, selecting the Lie group of interest SU(2^n); (ii) n_seg, the number of segments (indexed by j) in the global decomposition into subunitaries (U_j); (iii) n_train, the number of training examples; (iv) a parameter for whether Λ_0 ∈ su(2^n) or Λ_0 ∈ Δ; (v) a set of parameters for selecting (for SU(2)) which Pauli operators constitute Δ; (vi) a parameter for selecting whether unitaries are to be generated in accordance with the example formulation in [27]; and (vii) a parameter for selecting h (which defaults to 1/n_seg in the event of a null entry). Upon selection of parameters, the class generates an extensive selection of training data in various forms (see the relevant code repository for code with commentary), including complex and real iterations of U_T, (U_j), (c_j), Δ and other key inputs into the models. Training data were generated using a combination of standard Python numerical and scientific packages together with the quantum simulation software QuTiP [49]. Our experimental results and methods focus on synthesising quantum circuits for multi-qubit systems where unitary operators are drawn from SU(2^n). For such multi-qubit systems, unitary operators U belong to Lie groups G = SU(2^n) which describe the evolution of n interacting spin-1/2 particles. These groups are equipped with a corresponding Lie algebra su(2^n) of dimension (2^n)^2 − 1 = 4^n − 1, represented via traceless 2^n × 2^n skew-Hermitian (A† = −A) matrices. Solving time-optimal problems in such contexts often relies upon appropriate selection of a subset of generators from the Lie algebra as the control subalgebra from which to synthesise a quantum circuit. This is especially the case when selecting a
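A minimal sketch of the training-data generation loop described above (NumPy only, with an illustrative SU(2) distribution and a fixed seed; the actual simulation class exposes many more options and uses QuTiP):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Delta = [X, Y]        # illustrative control distribution
n_seg, h = 10, 0.1    # segments and time step (h defaults to 1/n_seg)

def evolve(H, t):
    """U = exp(-i H t) via eigendecomposition of Hermitian H."""
    evals, evecs = np.linalg.eigh(H)
    return evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T

def training_example(Delta, n_seg, h):
    """One training example: uniform random coefficients c_k in [-1, 1]
    per segment, plus the resulting cumulative sequence (U_j); the last
    element of the sequence serves as the target U_T."""
    cs = rng.uniform(-1, 1, size=(n_seg, len(Delta)))
    U = np.eye(Delta[0].shape[0], dtype=complex)
    U_seq = []
    for c in cs:
        H = sum(ck * g for ck, g in zip(c, Delta))
        U = evolve(H, h) @ U
        U_seq.append(U)
    return cs, U_seq

cs, U_seq = training_example(Delta, n_seg, h)
```

Repeating this n_train times yields the tensors of controls (c_j), sequences (U_j) and targets U_T that the models consume (after splitting into real and imaginary parts for TensorFlow).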
control subalgebra that renders targets U_T reachable in a way that approximates geodesic curves on the relevant manifold, as the choice of one set of generators over another can affect evolution time (and whether generated geodesics are indeed minimal, in cases where multiple geodesics are available, such as great circles on a 2-sphere). Of importance in selecting control subalgebras for time-optimal synthesis of geodesics in multi-qubit systems [1, 14, 16, 18, 31] is the so-called product operator basis, i.e. a basis for the Lie algebra of generalised Pauli matrices, being tensor (Kronecker) products of elementary Pauli operators. The basis is formed from the Pauli spin matrices {I_x, I_y, I_z} = (1/2){σ_x, σ_y, σ_z}, i.e. the generators of rotations in two-dimensional Hilbert space (and a Lie algebra basis), with the usual commutation relations. A basis for su(2^n) comprises many-body tensor products of these Pauli operators: for an n-qubit operator, between 1 and n Pauli operators are tensor-producted together, with identities at the remaining indices. An orthogonal basis {iB} (frame) for su(2^n) is then given in closed form [1] via

B = 2^(q−1) ∏_{k=1}^{n} (I_{kα})^{a_k}, a_k ∈ {0, 1},

where α = x, y, z, I_{kα} denotes the Pauli operator I_α appearing at the kth position (with the identity elsewhere), and q = Σ_k a_k. The parameter q tells us how many Pauli operators are tensor-producted together: q = 1 means the basis element has only one Pauli operator and the rest identities; q = 2 means a basis formed by tensor products of two Pauli operators and identities, and so on.
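The product operator basis can be constructed programmatically. The sketch below enumerates all non-identity Pauli strings for n qubits, using the I_α = σ_α/2 convention and the 2^(q−1) scaling described above (normalisation conventions vary between references):

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
paulis = {
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.diag([1.0 + 0j, -1.0 + 0j]),
}

def product_operator_basis(n):
    """Product operator basis for su(2^n): all n-fold Kronecker products
    of {I, I_x, I_y, I_z} except the all-identity string, scaled by
    2^(q-1) where q is the number of Pauli factors in the string."""
    basis = []
    for labels in itertools.product(["i", "x", "y", "z"], repeat=n):
        q = sum(1 for l in labels if l != "i")
        if q == 0:
            continue  # skip the identity string: su(2^n) is traceless
        op = np.array([[1.0 + 0j]])
        for l in labels:
            op = np.kron(op, I2 if l == "i" else 0.5 * paulis[l])
        basis.append((q, 2.0 ** (q - 1) * op))
    return basis

basis = product_operator_basis(2)  # yields 4^2 - 1 = 15 basis elements
```

Counting the non-identity strings recovers the dimension 4^n − 1 of su(2^n), and every element is traceless since each contains at least one (traceless) Pauli factor.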

b. One- and two-body operators
Geometric control techniques for multi-qubit systems often focus on selecting one- and two-body Pauli product operator frames (bases) for relevant control subalgebras [15,21]. If the control subalgebra contains only one- and two-body elements of the Lie algebra g, then curves generated in the corresponding Lie group G are more likely (with a number of important caveats) to be approximations to (and in the limit, as the number of gates n → ∞, representations of) geodesic curves, and thus to time-optimal synthesis of target unitary propagators. This approach can be seen across a number of key results in the literature [1,15,16,31] and forms the basis for the relevant distribution used in subRiemannian variational methods in [9,21], which the protocols developed in this work expand upon. The preference for one- and two-body Pauli operator frames arises in different contexts.
For example, it is demonstrated in [1], in the case where G = SU(4) and K = SU(2) ⊗ SU(2), that by finding an appropriate Cartan decomposition G = KAK (with associated Lie algebra decomposition g = k ⊕ p) and maximally abelian subalgebra h = i span{I_xS_x, I_yS_y, I_zS_z} ⊂ p (where I_α represent one-body terms and S_β two-body terms), we can write A = exp(−ih̃), h̃ ∈ h, in the KAK decomposition as the exponential of a linear combination of the generators in h, namely:

A = exp(−i(α_1 I_xS_x + α_2 I_yS_y + α_3 I_zS_z)).  (2)

In this case, any Hamiltonian from h ⊂ p can be generated using the controls in K (essentially by showing they can generate the two-body terms via the action of the single-body operators I on S) and is time-optimal. Because synthesis depends on the evolution of the drift Hamiltonian according to generators in p (as acted on via the adjoint action of K), and because this depends on the coefficients α_i of the generators, the minimal time is given by the coefficients of the generators in k used to steer H_d: hence the optimisation problem becomes relatively straightforward. One rationale for preferring one- and two-body generators [1] is that higher-order (i.e.
more than one-body) generators are shown to have coefficients which include a scalar coupling strength J between the relevant spins, such that each one-body term H_j has a coefficient 2π while the two-body (I_iS_i) terms have coefficient 2πJ. Thus the time-optimal problem becomes the simpler optimisation problem of finding the minimal sum Σ_i α_i satisfying

U_T = Q_1 exp(−i(α_1 I_xS_x + α_2 I_yS_y + α_3 I_zS_z)) Q_2,  Q_1, Q_2 ∈ K.

The proof essentially relies on the fact that, because synthesis of Q_1, Q_2 takes negligible time, synthesis time is determined by the time to synthesise A in the KAK decomposition, which is in turn determined by the parameters α_i; hence minimal time amounts to minimising the sum of the α_i. Synthesis time is thus minimal to the extent that the 'fewest-body' Pauli generators are utilised in the control subalgebra. Ideally, then, to generate minimal (and thus time-optimal) paths in G to reach arbitrary target unitaries U_T, one should choose the control subalgebra with as few many-body terms as necessary in order to render U_T reachable in a control sense.

c. Nielsen's approach
Nielsen et al. also focus on adopting one- and two-body terms in their metric-based approach to characterising and generating time-optimal quantum circuits. For example, the preference for one- and two-body generators is justified in [24] by imposing a Hamming weight term wt(σ) applied to the Pauli generators σ, together with a penalty function p(·) that penalises the control functional whenever Pauli terms of high Hamming weight are part of the control Hamiltonian. The idea is that Pauli n-tuples (tensor products) comprising anything more than one- or two-body Hamiltonians will be penalised via a higher Hamming weight, as they have many more non-identity elements, whereas one- and two-body operators have lower Hamming weight. Nielsen et al. demonstrate that selection of one- and two-body generators is optimal for calculating a lower bound on the complexity measure m_G(U) using Finsler metrics, i.e. (up to constant factors)

m_G(U) ≥ d(I, U),

where G is a universal gate set in SU(2^n) and d(·,·) is the relevant Finsler distance.
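The Hamming-weight penalty described above is straightforward to make concrete. The following is a minimal sketch (our own illustrative functions, with an assumed penalty value p): wt(σ) counts non-identity factors in a Pauli word, and the penalty is cheap for one- and two-body terms and expensive otherwise:

```python
def hamming_weight(pauli_string):
    """Hamming weight wt(sigma): the number of non-identity factors in
    a Pauli tensor product written as a string, e.g. wt('XIZ') = 2."""
    return sum(1 for s in pauli_string if s != "I")

def penalty(pauli_string, p=10.0):
    """Illustrative penalty p(.): unit cost for one- and two-body Pauli
    words, penalised cost p otherwise (the value of p is an assumption,
    standing in for Nielsen et al.'s penalty parameter)."""
    return 1.0 if hamming_weight(pauli_string) <= 2 else p
```

A control functional built from such weights then disfavours three- or more-body directions, which is the mechanism behind the lower-bound argument above.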
The significance of restricting control subalgebras, together with bespoke metrics, when utilising geometric optimisation techniques is evident in later work [14]. For quantum control optimisation architectures, this demonstrates the utility of Finsler metrics as a more general norm-based measure of distance (and thus optimality), together with a justification of the selection of one- and two-body generators on the basis of Hamming weights. The 'penalty metric' approach is explored in further work [16,31]; however, as noted in [21], such approaches can be convoluted without providing guarantees that optimal generators will be selected.
In [14], Nielsen et al. expand certain elements of the initial program, combining techniques from differential geometry and variational methods for quantum circuit synthesis and quantum control. This second paper considered the difficulty of implementing a unitary operation U generated by a time-dependent Hamiltonian evolving to the desired U_T. They show that the problem of finding minimal circuits is equivalent to analogous problems in geometric control theory, i.e. this paper has more of a focus on quantum control utilising geometric means. They select a cost function on H(t) such that finding optimal control functions for synthesis of U_T (evolving according to the Schrödinger equation) involves finding minimal geodesics on a Riemannian manifold.
In this case, H(t) is written in terms of a Pauli operator expansion

H(t) = Σ'_σ h_σ(t) σ + Σ''_σ h_σ(t) σ,

where the first summation is over one- and two-body terms and the second over all other tensor products. A cost function is constructed with a penalty term p^2 imposed that penalises the higher-order terms, with the total cost to be minimised given by

F(H) = √( Σ'_σ h_σ^2 + p^2 Σ''_σ h_σ^2 ),

a penalty weighting which reappears as a relatively trivial but important feature of the machine learning code in the model architectures explored below.
Later work [15] of Nielsen and Dowling provides a more directly applicable example of how to develop analytic solutions for geodesic synthesis of unitary operations. As with the discussion above, it is worth exploring the key results from this work in order to understand characteristics of relevance to any attempt to utilise geometric methods for synthesis of unitary propagators for multi-qubit systems. In the paper, they develop a method of deforming (homotopically) simple and well-understood geodesics into geodesics of metrics of interest. Intuitively, the idea is to start with a known geodesic curve between I and U_T and, subject to certain constraints, 'bend' it homotopically (that is, via mappings which preserve topological properties) into a minimal-length curve. As demonstrated in [15], a similar preference for one- and two-body terms is manifest in the applicable lifted Hamilton-Jacobi equation. This paper is also important for anyone interested in geometric quantum control given its discussion of the significant (and potentially intractable) complexity constraints presented by the quantum extension of the Razborov-Rudich theorem, and its extension of geometric quantum computing to include ancilla qubits.
In subsequent work utilising Nielsen et al.'s approach [31], the application of penalty metrics is extended in order to show its utility in synthesising time-optimal geodesics. In that paper, it is shown that a bound on the norm of the Hamiltonian H(t) is equivalent to a bound on the speed of evolution; that is, such a bound implies that minimal-time paths are minimal-distance paths when the norm function is used as the distance measure. Given ||H(t)|| = E, they demonstrate that for any curve connecting I and U_T, the length of time-optimal curves is given by

L = ∫_0^T ||H(t)|| dt = ET,

where minimising evolution time T thereby minimises distance L. Hamiltonians of interest are confined to a control subalgebra A that disjunctively partitions the Lie algebra M (i.e. equivalent to generators being drawn from k or p above) and to cases where ||H(t)|| ≤ E (where the Hamiltonian can be rescaled, i.e. reparametrised, so that the norm equals E at all points on the path, which in effect keeps the path identical but the evolution time shorter). They introduce a slight modification, that Tr(H^2(t)) = E^2, in order to introduce the quantum brachistochrone problem (see also [50]), a quantum analogue of the brachistochrone (meaning 'shortest time') problem from classical variational mechanics [51]. Their method in essence adopts the penalty-metric approach of [15] such that, in the limit, the lowest-energy solution tends towards minimal time by reason of the increased cost associated with higher-order (more than one- and two-body) generalised Pauli generators. The approach in [31] is precisely to use the penalty-metric approach of Nielsen et al. to generate a subRiemannian geodesic equation in order to confine the generators of the curve on the manifold to the control subalgebra A.
This is achieved by adopting a norm-based cost function (pseudometric) of the form

⟨H, H⟩ = Tr(H_P^2) + q Tr(H_Q^2)

(up to normalisation), in which higher-order generator terms are weighted with penalty q, so that minimisation will by extension favour the one- and two-body generators. By doing so, a sufficiently proximal initial seed for the 'shooting method' (see [31,52]) is generated. This method is a generic numerical technique for solving differential equations with two-point boundary-value problems (where our two points are I and U_T on G) and thus generating approximate geodesics.
Another motivation for restricting control subalgebras (as in [9,21] and our experiments below) to one- and two-body terms is found in geodesic approximations via the decomposition of the Lie algebra into projective subspace operators [15,37,38]. In [31], this was achieved by setting P(H) = H_P for one- and two-body Pauli terms and Q(H) = H_Q for three- or more-body Pauli terms, such that

H = P(H) + Q(H) = H_P + H_Q.

The idea is that the higher-order (three- or more-body) terms in H_Q carry a penalty parameter (weight) which is designed, when curve length is obtained via minimising the action, to penalise higher-order terms such that the solution to the variational problem is more likely to contain only one- and two-body terms. Thus, instead of restricting the subalgebra of controls k to only one- and two-body terms (as is undertaken in Swaddle et al. [9]), they instead (as per Nielsen's original paper) begin with full access to the entire su(2^n) Lie algebra (i.e. fully controllable) and then proceed to impose constraints in order to refine this down to geodesics comprising only (or mostly) one- and two-body terms. The distinction from the subRiemannian approach adopted in [9] is that in the latter case, generators for U are by design constrained to be drawn from P = ∆ via the projection function (24), circumventing the imposition of Finslerian penalty metrics.
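The decomposition H = H_P + H_Q can be computed directly via Hilbert-Schmidt projections onto the Pauli word basis. The following is an illustrative sketch (the function names are ours; unnormalised Pauli words σ ∈ {I, X, Y, Z}^⊗n are used, for which Tr(B†B) = 2^n):

```python
import itertools
import numpy as np

SIGMA = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_term(labels):
    """Kronecker product of single-qubit Paulis, e.g. ('X','I','Z')."""
    op = np.array([[1.0 + 0j]])
    for s in labels:
        op = np.kron(op, SIGMA[s])
    return op

def split_hamiltonian(H, n):
    """Split H into H_P (one- and two-body part) and H_Q (the rest)
    using Hilbert-Schmidt projections onto Pauli words."""
    d = 2 ** n
    H_P = np.zeros_like(H)
    H_Q = np.zeros_like(H)
    for labels in itertools.product("IXYZ", repeat=n):
        q = sum(1 for s in labels if s != "I")
        if q == 0:
            continue  # identity component vanishes for traceless H
        B = pauli_term(labels)
        coeff = np.trace(B.conj().T @ H) / d
        if q <= 2:
            H_P += coeff * B
        else:
            H_Q += coeff * B
    return H_P, H_Q
```

For example, for H = X⊗I⊗I + 0.5 X⊗Y⊗Z, the split returns the one-body term as H_P and the three-body term as H_Q, with H_P + H_Q = H.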

C. Comparing geodesic approximations
Generation of geodesics in a QML context relies upon the availability of ways to compare whether outputs of machine learning models do in fact closely approximate geodesic curves. Thus the availability of reliable analytic and numerical methods for the generation of geodesics for use as training, testing and validation datasets is important. In our work, we sought to adapt the novel algorithmic approach to geodesic synthesis from [9] to include performance metrics of relevance to quantum information processing, such as fidelity measures. By comparison, [27] sets out an algorithm for determining time-optimal subRiemannian geodesics in SU(2) which can be used to benchmark the performance of different machine learning approaches to synthesising approximate geodesics. While the derivation of time-optimal parameters in [27] relies upon a complicated sequence of coordinate transformations which is not easily scalable, it does provide a useful basis for comparison with the methods in [9].
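One such fidelity measure can be sketched as follows. This is the standard operator (trace) fidelity; the precise normalisation used in the paper's models (e.g. whether the quantity is squared) is an assumption here:

```python
import numpy as np

def operator_fidelity(U, V):
    """Operator (trace) fidelity |Tr(U^dag V)| / d between two d x d
    unitaries; equals 1 iff U = V up to a global phase."""
    d = U.shape[0]
    return abs(np.trace(U.conj().T @ V)) / d
```

Its insensitivity to global phase is what makes it suitable for comparing synthesised unitaries with targets.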
In [27], it is shown that target unitaries constituting rotations by angle θ about the z-axis, with generators being Pauli X and Y operators, can be synthesised in time-optimal fashion by following 'circular' or holonomic paths along which they are parallel-transported. On the Bloch sphere, this is represented as 'circular' paths emanating from the north pole whose diameter increases with increasing θ ∈ [0, 2π]. Intuitively, the greater the angle of rotation, the greater the diameter of the holonomic path.
To this end, one of the objectives of our experiments was to ascertain the reliability of a few different methods of generating geodesics drawn from geometric control sources. In order to do so, we compared the variational geodesic generation approach [42] to a known method for analytically determining subRiemannian geodesics in SU(2) in [27]. By demonstrating the existence of a homeomorphism between the two methods, one can be confident that the variational method appropriately approximates geodesics.
The challenge posed in comparing geodesic methods lies in the differing assumptions of each method: [9] constrains the norms ||proj_∆(u_0)|| = ||u_0|| as a means of more efficiently generating subRiemannian geodesic approximations [21], which in effect fixes the time scale (or energy scale) of their method. Conversely, [27] works at different scales. In practice this means the generators for unitary evolution via each method differ by a scaling related to the norm of the generators. Such different parameterisations can be compared as follows: for some desired tolerance (difference) ε, the two approximations are treated as identical if the cumulative norm discrepancy satisfies D(H^(S), H^(B)) = Σ_j ||H_j^(S) − H_j^(B)|| < ε.
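The cumulative discrepancy D between two Hamiltonian sequences can be sketched as follows (the function name and the choice of Frobenius norm are our assumptions; any consistent matrix norm serves the same purpose):

```python
import numpy as np

def cumulative_distance(H_seq_a, H_seq_b):
    """Cumulative Frobenius-norm discrepancy between two Hamiltonian
    sequences; the sequences agree to tolerance eps when this sum is
    below eps."""
    return sum(np.linalg.norm(Ha - Hb)
               for Ha, Hb in zip(H_seq_a, H_seq_b))
```

Identical sequences give D = 0, and a rescaling mismatch between the two methods shows up directly as a nonzero cumulative discrepancy.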

LSTMs and GRUs
Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are prevalent forms of recurrent neural network (RNN). RNNs are networks tailored to modelling sequential data, such as time-series data or sequences of control amplitudes (c_j) [47]. For RNNs, at each time-step t there is an input x_t (such as c_t), an output y_t and a hidden-layer output h_t. The key intuitive idea behind RNNs is that h_t of the network itself becomes an input into hidden layers for the immediately subsequent time-step t + 1. LSTMs advance upon this concept by enabling the output of hidden layers to influence not just the immediately succeeding time-step t + 1, but also potentially activation functions at later time-steps. In this sense LSTMs allow information about previous hidden layers (or states) to function as 'memory' that is carried forward.
One of the challenges regarding RNNs is the saturation of networks, where new inputs to an activation function fail to contribute significantly to its output. Intuitively, too much information is saturating the model, so additional information does not lead to material updates (manifest, for example, in flatlining loss, as seen in some examples above). One way to overcome this problem of saturation is to stochastically 'forget' certain information in order to make room for additional information, as manifest in the forget gate of an LSTM, distinct from the update gate. GRUs by contrast seek to incorporate the output of hidden layers and updates into subsequent hidden layers, as detailed below. Their popularity is often owing to their speed advantage over LSTMs in a variety of contexts.
The reset gate combines the input x_t at time t with the previous time-step hidden state h_{t−1} to define a reset output r_t [48]:

r_t = σ(w_r x_t + u_r h_{t−1} + b_r),

where w_r, u_r are updatable weight matrices and b_r is an applicable bias, with σ an activation function (in our models, the tanh function to produce control amplitudes (c_j), but usually the sigmoid function). The update gate is defined analogously:

z_t = σ(w_z x_t + u_z h_{t−1} + b_z).

This update gate is the output of the unit at time t. However, in order to output h_t, an intermediate hidden-layer state is calculated:

h̃_t = tanh(w_h x_t + u_h (r_t ⊙ h_{t−1}) + b_h),

where the (r_t ⊙ h_{t−1}) term incorporates the influence of the reset gate and previous hidden layer h_{t−1} into the estimate. The final hidden-layer output is then calculated by combining the Hadamard products of the update gate and previous hidden state together with the intermediate hidden state:

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t,

which is the ultimate output at time t. The incorporation of h_{t−1} in this way allows prior information in the sequence to influence future outputs, improving the correlation between outputs such as controls.
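The gate arithmetic described above can be sketched as a single NumPy forward step (an illustrative implementation, not the paper's TensorFlow code; note that the mixing convention h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t used here is one of two common, equivalent-up-to-relabelling conventions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update. params maps 'r', 'z', 'h' to (W, U, b) tuples
    for the reset gate, update gate and candidate state respectively."""
    W_r, U_r, b_r = params["r"]
    W_z, U_z, b_z = params["z"]
    W_h, U_h, b_h = params["h"]
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)    # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)    # update gate
    # candidate state, with the reset gate damping the previous state
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
    # Hadamard mixing of previous and candidate hidden states
    return z_t * h_prev + (1.0 - z_t) * h_tilde
```

With all-zero weights the gates sit at 0.5 and the candidate at 0, so the step halves the previous hidden state, which provides a simple sanity check of the mixing formula.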

FIG. 2 :
FIG. 2: Schema of Fully-Connected Greybox model: (a) realised U_T inputs (flattened) into a stack of feed-forward fully-connected layers with ReLU activations and dropout of 0.2; (b) the final dense layer in this stack outputs a sequence of controls (ĉ_j) using tanh activation functions; (c) these are fed into a custom Hamiltonian estimation layer to produce a sequence of Hamiltonians (Ĥ_j) using ∆; (d) these in turn are fed into a custom quantum evolution layer implementing the time-independent Schrödinger equation to produce estimated sequences of subunitaries (Û_j), which are fed into (e) a final fidelity layer for comparison with the true (U_j). Intermediate outputs are accessible via submodels in TensorFlow.
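The whitebox steps (c) and (d) of the schema can be sketched outside TensorFlow as plain NumPy maps. This is a minimal illustration, assuming a two-generator ∆ = {σ_x/2, σ_y/2} for SU(2) and a matrix exponential via eigendecomposition (the paper implements these as custom TensorFlow layers):

```python
import numpy as np

# Assumed control subalgebra Delta for SU(2): sigma_x/2 and sigma_y/2.
DELTA = [
    0.5 * np.array([[0, 1], [1, 0]], dtype=complex),
    0.5 * np.array([[0, -1j], [1j, 0]], dtype=complex),
]

def hamiltonian_layer(controls, delta=DELTA):
    """Step (c): map control amplitudes c_j (one row per segment) to
    Hamiltonians H_j = sum_k c_jk Delta_k."""
    return [sum(c * D for c, D in zip(c_j, delta)) for c_j in controls]

def evolution_layer(hamiltonians, h=0.1):
    """Step (d): per-segment Schrodinger evolution U_j = exp(-i H_j h),
    computed by eigendecomposition since each H_j is Hermitian."""
    out = []
    for H in hamiltonians:
        evals, V = np.linalg.eigh(H)
        out.append(V @ np.diag(np.exp(-1j * h * evals)) @ V.conj().T)
    return out
```

Each output U_j is unitary by construction, and composing them yields the estimate of the target propagator fed to the fidelity layer.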

FIG. 3 :
FIG. 3: Schema of GRU RNN Greybox model: (a) realised U T inputs (flattened) into a GRU RNN layer comprising GRU cells in which each segment j plays the role of the time parameter; (b) the output of the GRU layer is a sequence of control pulses (ĉ j ) using tanh activation functions; (c) these are fed into a custom Hamiltonian estimation layer to produce a sequence of Hamiltonians ( Ĥj ) by applying the control amplitudes to ∆; (d) the Hamiltonian sequence is fed into a custom quantum evolution layer implementing the time-independent Schrödinger equation to produce estimated sequences of subunitaries ( Ûj ) which are fed into (e) a final fidelity layer for comparison with the true (U j ).Intermediate outputs are accessible via submodels in TensorFlow.

FIG. 4 :
FIG. 4: Schema of SubRiemannian model: (a) realised U_T inputs (flattened) into a set of feed-forward fully-connected dense layers (with dropout ∼ 0.2); (b) two layers (red) output sets of control amplitudes for estimating the positive (ĉ+_Λ0) and negative (ĉ−_Λ0) control amplitudes using tanh activation functions; (c) these are fed into two custom Hamiltonian estimation layers to produce the positive Λ̂+_0 and negative Λ̂−_0 Hamiltonians for Λ_0 using ∆ or su(2^n), which are combined into a single Hamiltonian estimate Λ̂_0; (d) Λ̂_0 is fed into a custom subRiemannian layer which generates the control amplitudes (ĉ_j) and Hamiltonians (Ĥ_j), then implements the time-independent Schrödinger equation to produce estimated sequences of subunitaries (Û_j), which are fed into (e) a final fidelity layer for comparison with the true (U_j). Intermediate outputs (a) to (d) are accessible via submodels in TensorFlow. The SubRiemannian model achieved average gate fidelity of over 0.99 on training and validation sets when learning representations of (U_j), in comparison to existing GRU & FC Blackbox models which recorded average gate fidelities of ≈ 0.70, demonstrating the utility of greybox machine learning models in learning and synthesising unitary sequences.

FIG. 7 :
FIG. 7: Training and validation loss (MSE); comparison of MSE at different time intervals. h = 0.1, 0.5 and 1, G = SU(2), n_train = 1000, n_seg = 10, epochs = 500, Λ_0 ∈ su(2^n). This plot shows the differences in MSE on training and validation sets as the time-step h = ∆t varies from 0.1 to 1. As can be seen, larger h leads to deterioration in performance (higher MSE). However, smaller h can lead to insufficiently long geodesics, leading to a deterioration in generalisation. Setting h = 0.1 (red curves) exhibits the best overall performance. Even the jump to h = 0.5 (blue curves) exhibits an increase in MSE, and decrease in performance, of several orders of magnitude (and similarly for h = 1).

FIG. 8 :
FIG. 8: Training and validation loss (MSE); comparison of SubRiemannian, FC Greybox and GRU RNN Greybox models. G = SU(8), n_train = 1000, n_seg = 10, h = 0.1, epochs = 500, Λ_0 ∈ su(2^n). For U ∈ SU(8), we see (main plot: first 100 epochs) that the GRU RNN Greybox (blue line) performs best in terms of batch fidelity MSE on training and validation sets. As shown in the inset, the GRU RNN Greybox levels out (saturates) after about 100 epochs; overall it performed the best of the models, rendering average operator fidelities of around 0.998. The SubRiemannian model (red) performed less well than the GRU RNN, still recording high average operator fidelity but exhibiting overfitting, as can be seen by the divergence of the validation (dashed) curve from the training (smooth) curve. The FC Greybox rapidly saturates for large n_train and exhibits little in the way of learning. Nevertheless, all models render high average operator fidelity > 0.99 and saturate after around 150 epochs (see inset).

FIG. 12 :
FIG. 12: Generalisation (SubRiemannian model). G = SU(2), n_train = 1000, n_seg = 10, epochs = 500, Λ_0 ∈ ∆. Plot of generalised gate fidelity F(ÛT, ŨT) of random-angle θ ∈ [−2π, π] z-rotations, comparing generated ŨT against the reconstructed estimate ÛT, versus the average operator fidelity of the randomly generated U_T with the training inputs {U_T}_train to the model. Here there is no statistically significant correlation between U_T and the training set {U_T}_train, though higher test fidelities are evident for U_T bearing both high and low similarity to the training set (less dependence on similarity to the training set for high fidelities).

B. Differential geometry and Lie groups

1. Generating subalgebras for geodesics

a. Product Operator Basis

TABLE II :
Comparison table of batch fidelity MSE ((U j ) v.
Due to parametrisation invariance, F (a Finsler metric) can be rescaled such that T = d([U]). The overall effect is to demonstrate that, using O(n^2 d(I, U)^3) one- and two-qubit gates, it is possible to synthesise a unitary U_A satisfying ||U_T − U_A|| ≤ c, where c is a constant and U_T is the target unitary gate. Moreover, the work demonstrates the optimality of unitary synthesis via following minimal geodesics in the Lie group manifold generated by one- and two-body generators (as we focus on below). Nielsen notes that the number of one- and two-qubit terms (i.e. dim ∆) for the relevant Lie algebra is given by 3n + 9n(n − 1)/2.
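On the standard product-operator counting, the truncated count above is 3n one-body terms (3 Pauli choices at each of n sites) plus 9·n(n−1)/2 two-body terms (3 × 3 Pauli choices per unordered pair of sites). A brief sketch with a brute-force cross-check (the function names are ours):

```python
import itertools

def dim_delta(n):
    """dim(Delta): number of one- and two-body product-operator
    generators on n qubits, 3n + 9 * n(n-1)/2."""
    return 3 * n + 9 * n * (n - 1) // 2

def dim_delta_bruteforce(n):
    """Cross-check by enumerating Pauli words with exactly one or two
    non-identity factors."""
    return sum(1 for w in itertools.product("IXYZ", repeat=n)
               if 1 <= sum(s != "I" for s in w) <= 2)
```

Note that dim_delta(2) = 15 = dim su(4): for two qubits every traceless generator is at most two-body, consistent with the full controllability of SU(4) from one- and two-body terms discussed above.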