Clustering and enhanced classification using a hybrid quantum autoencoder

Quantum machine learning (QML) is a rapidly growing area of research at the intersection of classical machine learning and quantum information theory. One area of considerable interest is the use of QML to learn information contained within quantum states themselves. In this work, we propose a novel approach in which the extraction of information from quantum states is undertaken in a classical representational space, obtained through the training of a hybrid quantum autoencoder (HQA). Hence, given a set of pure states, this variational QML algorithm learns to identify, and classically represent, their essential distinguishing characteristics, subsequently giving rise to a new paradigm for clustering and semi-supervised classification. The analysis and employment of the HQA model are presented in the context of amplitude-encoded states, and can in principle be extended to arbitrary states for the analysis of structure in non-trivial quantum data sets.


I. INTRODUCTION
In recent years, the amalgamation of Quantum Mechanics and Machine Learning (ML) has instigated extensive research into the field of Quantum Machine Learning (QML) [1]. With fault-tolerant quantum computation far from realisable in the near future, one of the areas in which researchers are looking for quantum advantage is variational algorithms. These variational approaches have demonstrated robustness in the regime of noisy intermediate-scale quantum (NISQ) devices, making them a contender to first demonstrate quantum advantage [2]. In QML applications, variational methods most commonly employ a parameterised quantum circuit (PQC) [3,4], whose parameters are classically optimised in a feedback-loop routine between optimiser and PQC.
Generally, QML methods can be categorised into two distinct groups: (i) models that obtain advantage through the learning of classical data - once embedded into a quantum system - or (ii) models that learn purely quantum data sets. This paper focuses on the latter task, and proposes an approach in which the learning of quantum states is undertaken in a classical representational space. This allows for novel approaches to clustering and classifying quantum states based on their classical representations. Such representations are formed by the employment of - what we have termed - a hybrid quantum autoencoder (HQA), illustrated in Figure 1c.
Classically, an autoencoder is a specific artificial neural network (ANN) architecture that is trained to return its input as its output, whilst undergoing a crucial funnelling of its degrees of freedom [6] (shown in Figure 1a). This funnelling process generates compressed representations of data points that belong to a particular group of data.
The autoencoder is composed of two maps, an encoder, e : X → Z, and a decoder, h : Z → X, that are both approximated using ANNs and trained such that h(e(x)) ≈ x. This would be trivial if x and z had equal dimensionality; however, in the case where dim(x) > dim(z), the autoencoder is forced to encode the most important aspects of the input, x, into the latent space. The latent vector, z ∈ Z, is in essence a representation of x ∈ X in the lower-dimensional representational space of Z.
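As a toy illustration of these two maps, consider data in R^4 that actually lies on a two-dimensional subspace. A purely linear "autoencoder" built from the data's principal directions - a sketch for intuition, not the trained ANN pair described in the text - then satisfies h(e(x)) ≈ x despite dim(x) > dim(z):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data in R^4 that lies on a 2-dimensional subspace: x = G z for latent z in R^2,
# so dim(x) > dim(z).
G = rng.standard_normal((4, 2))
Z = rng.standard_normal((100, 2))
X = Z @ G.T                        # 100 points in R^4

# A linear "autoencoder": encoder e and decoder h built from the top-2
# principal directions of the data (the optimal linear compression).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:2]                         # encoder matrix, R^4 -> R^2

def e(x):                          # encoder: X -> Z
    return W @ x

def h(z):                          # decoder: Z -> X
    return W.T @ z

# Because the data lies on a 2-dim subspace, h(e(x)) ≈ x despite compression.
x = X[0]
print(np.allclose(h(e(x)), x))     # True
```

A non-linear ANN autoencoder generalises this idea to curved sub-manifolds rather than flat subspaces.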
The main advantage of an autoencoder is that it is able to learn complex compression strategies through an unsupervised learning process. Such a process requires a human to have minimal prior information regarding the data set. Hence, it is often used in the context of denoising and compressing data that lack obvious methods of dimensionality reduction. Quantum autoencoders (Figure 1b), discussed further in Section (II B), are direct analogues and hence provide non-trivial compression maps from quantum states to a subspace of their Hilbert space.
In essence, both ML and QML algorithms exploit the tendency of data to aggregate in a low-dimensional sub-manifold of the vast space of possible data points. We describe the data as lying on a sub-manifold to emphasise the fact that infinitesimal tangential translations result in remaining on the sub-manifold. It should be noted that, although in mathematics a manifold has a more formal definition, in ML it is used to describe a set of points that can be well approximated by considering only a small number of degrees of freedom, embedded in a higher-dimensional space [6]. This manifold hypothesis is essential so that data points have a neighbourhood of highly similar examples that can be accessed by applying small transformations to traverse the manifold. Hence, the main objective of a quantum autoencoder is to learn the sub-manifold that describes a particular set of quantum states. The HQA, in particular, represents the sub-manifold in an accessible classical vector space.
Quantum states can be represented as positive semidefinite operators on a complex Hilbert space, C^{2^n}, known as density matrices. Theoretically, one can imagine putting the elements of the density matrix through a classical ML algorithm to find similarities between quantum states, or even to classify states. However, the information stored in quantum states is notoriously inaccessible without exponential resources to characterise each state (through quantum state tomography [7,8]). The HQA, in Figure 1c, gets around this by using an encoder that is trained to output classical information about important aspects of a quantum state. This quantum-to-classical transformation is extreme in its dimensionality reduction: a pure input state is described by 2^n complex amplitudes, whereas the dimension of the classical space is set by the number of measured qubits. Hence one can see that if the state is able to be reconstructed from the classical space, then the classical space can only describe a relatively small set of quantum states. Nonetheless, the classical real space represents a manifold in C^{2^n} that can be learned with the construction of the HQA. It will be seen that it is this perspective that distinguishes the HQA from the QAEs explored in the literature thus far.
This paper is structured such that we first provide background into quantum neural networks and quantum autoencoders in Section (II); before then constructing the HQA in Section (III). This is then followed by applications of clustering and classification in Section (IV), including results from numerical simulations.

II. BACKGROUND

A. Quantum Neural Networks
Quantum neural networks (QNNs) are the extension of ANNs to QML. The precise form of the QNN, however, is quite non-trivial, as it would need to take advantage of unique quantum mechanical properties while also retaining the non-linear functional features of classical ANNs [9]. Hence, there are various proposals for QNN designs that claim to reproduce the non-linear dissipative dynamics of ANNs, but are yet to present clear quantum advantage [10][11][12][13]. In this paper, the implementation of the HQA will use the simplest design of a QNN: a PQC coupled with a specified observable. The choice of QNN, however, is arbitrary with respect to the overall approach of the HQA (as shown in Figure 1c). Hence a fair comparison of QNN complexity and expressibility - for the construction of the HQA specifically - is left for future work.
PQCs form the basis of hybrid quantum-classical algorithms that optimise a quantum circuit with respect to a problem-dependent cost function [14]. The optimisation is performed classically to determine a better estimate of parameters which define a variational circuit. The optimisation in this work is performed using the parameter shift rule [15], elaborated in Appendix B.
Since any quantum circuit can be defined as a gate sequence U(θ), the m parameters of the circuit are the set of θ_i which parameterise the unitaries. We define the measurement of a PQC as a function f : R^m → R, mapping the gate parameters to an expectation value,

f(θ) = Tr[ B̂ U(θ) ρ₀ U†(θ) ]
     = Tr[ B̂ ρ(θ) ],    (1)

where B̂ is a predetermined observable (most commonly a Pauli-Z) and ρ₀ is an initial arbitrary state of the circuit. For consistency, the state before measurement will have the notation ρ(θ) = U(θ) ρ₀ U†(θ), as evident from the second line in equation (1). It should be noted that this functional form of the variational circuit hides the fact that, on real devices, repeated measurements of the circuit are required to obtain this expectation value. As a result, there is a natural statistical uncertainty in the estimated f, determined by the number of samples taken of the circuit. Caution is thus required when constructing algorithms that require arbitrarily large numerical precision of f. Such algorithms may have promising results in state-vector simulations, but become computationally infeasible on real quantum devices, requiring an exponentially large number of samples. The optimisation of PQCs remains an area of research, as it is not clear that employing classical optimisers will result in optimal solutions for quantum cost-function landscapes. This has given rise to works suggesting optimisers that are aware of the underlying quantum structure of quantum states [16][17][18]. Furthermore, there exist limitations of barren plateaus in deep PQC-based algorithms, as initially realised in [19]: the exponential suppression of the gradient with increasing circuit depth, which has also been shown to be linked to the expressibility of PQCs [20]. In addition to barren plateaus, PQCs are seen to exhibit narrow gorges [21] - the occurrence of cost-landscape minima in narrow wells that steepen with increasing depth.
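The parameter-shift rule used for optimisation (Appendix B) can be illustrated on the smallest possible PQC. The sketch below assumes a single R_Y(θ) rotation with B̂ = Z on the |0⟩ state, for which f(θ) = cos θ and the two shifted circuit evaluations recover the exact gradient −sin θ:

```python
import numpy as np

Z = np.diag([1.0, -1.0])
ket0 = np.array([1.0, 0.0])

def ry(theta):
    # Single-qubit Y rotation, U(theta) = exp(-i*theta*Y/2)
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def f(theta):
    # f(theta) = <0| U†(theta) Z U(theta) |0> = cos(theta)
    psi = ry(theta) @ ket0
    return float(psi.conj() @ Z @ psi)

def parameter_shift_grad(theta):
    # df/dtheta obtained from two circuit evaluations shifted by ±pi/2
    return 0.5 * (f(theta + np.pi / 2) - f(theta - np.pi / 2))

theta = 0.7
print(np.isclose(parameter_shift_grad(theta), -np.sin(theta)))   # True
```

On hardware each f evaluation is itself a sample mean over shots, so the gradient estimate inherits the statistical uncertainty discussed above.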
The effects are not only seen with an increase in qubits, but also arise due to entanglement [22], certain cost functions [21,23], and noise [24]. These phenomena clearly have implications for QAEs that employ PQCs; hence it is important that an implementation of the HQA is efficient in its circuit depth - i.e. O(n). This is largely dependent on the chosen quantum data set {ρ_in} and the expressibility of the particular QNN component shown in Figure 1c.

B. Quantum autoencoders
Many recent works involving quantum autoencoders (QAEs) build on the structure first proposed by Romero et al. [5], using shallow PQCs to compress quantum states. In [25], QNNs of the form introduced in [26] are used to construct a QAE that successfully denoises Greenberger-Horne-Zeilinger states subject to spin-flip and random unitary noise errors. In [27], a QAE is constructed using approximate quantum adders that are obtained with classical genetic algorithms, as opposed to the more commonly used gradient methods for parameter optimisation.
All such applications work on the process of funnelling quantum states into a lower dimensional Hilbert space. This naturally returns compressed representations that disregard both stochastic noise fluctuations and irrelevant degrees of freedom. It is important to note that the set of states are assumed to have support on a subset of its Hilbert space, S ⊂ H. The existence of such support is not guaranteed, but is instead common in many sets of quantum states, due to symmetries inherent to physical processes. For example, in [5] a QAE is classically simulated to show the compression of ground states of the Hubbard model and molecular Hamiltonians.
QAEs have been experimentally realised for the compression of qutrits with photons in [28] and the compression of two-qubit states into two single-qubit states in [29]. Furthermore, an experimental realisation of the QAE via quantum adders has been shown in [30]. These QAEs are promising, but are nonetheless distinct from the HQA proposed in this paper. Specifically, the HQA aims to generate a classical representation of the sub-manifold S for classical analysis. Hence, not only can the data be compressed, but the compressed representation is an accessible classical vector that can be analysed.
There are clear limitations on an autoencoder's ability to compress data. On the compression rate of QAEs, there exists not only a fundamental limit due to the degrees of freedom in a data set, but also a quantum limitation related to the von Neumann entropy of the density operator representing the ensemble of training states [5]. In [31], it is further elaborated that the compressibility is related to the eigenvalues of the weighted ensemble density matrix. Crucially, this theoretical limit is intrinsic to all possible compression strategies and QAEs.

III. THE HYBRID QUANTUM AUTOENCODER

A. Design
This paper proposes a novel variation of the QAE that we have termed a hybrid quantum autoencoder (HQA). The hybrid nature of this model arises from the incorporation of both ML, in the form of classical ANNs, and QML, through the use of PQC-based QNNs. Figure 2 illustrates the overall design of the model, which is a combination of (i) an encoder that takes a quantum state from the Hilbert space H_2^{⊗n} to a subset of the real vector space V of dimension v = dim(V), and (ii) a decoder that performs the inverse of such an operation. In general, quantum states are positive semidefinite, unit-trace operators ρ on H_2^{⊗n}; however, the HQA is equipped to identify only pure states. Hence, we associate the vector |ψ_a⟩ to the pure state ρ_a := |ψ_a⟩⟨ψ_a|. Mathematically, the encoder and decoder have the form of the maps

E : H_2^{⊗n} → V,    D : V → H_2^{⊗n},

where V = [−1, 1]^v is termed the latent space and ξ ∈ V is referred to as the latent vector - analogous to the terminology used when dealing with classical autoencoders. This vector is in essence a classical representation of the quantum state - a perfect representation if D ∘ E(|ψ_in⟩) = |ψ_in⟩ is achieved. This indicates that the information of the state |ψ_in⟩ was preserved in the latent vector, to then be recreated without any loss of information. In information theory, this is lossless encoding of information into a compressed latent space. Though the functional forms of the encoder and decoder are defined, the models themselves have not yet been specified. As seen in Figure 2, the encoder E is a PQC parameterised by a vector, α. The PQC receives some state |ψ_in⟩ and applies a unitary U₁(α) on the combined system of the input state with (v − n) ancilla qubits. From this circuit, the Z expectation value of every qubit is measured to form the latent vector ξ, with components

ξ^(i) = ⟨ψ̃_in| U₁†(α) Z_i U₁(α) |ψ̃_in⟩,

where |ψ̃_in⟩ = |ψ_in⟩ ⊗ |0⟩^{⊗(v−n)}, Z_i is the Pauli-Z operator acting on the i-th qubit, and α parameterises the PQC that will be optimised.
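A minimal state-vector sketch of the encoder's measurement step may help build intuition. Here a Haar-random unitary stands in for the trained U₁(α), the ancillas are appended in the |0⟩ state, and each latent component is a Z expectation value; all names and sizes are illustrative:

```python
import numpy as np

def haar_unitary(dim, rng):
    # Haar-random unitary via QR decomposition: a stand-in for a trained U1(alpha)
    A = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

def z_expectation(state, qubit, n_qubits):
    # <Z_i>: weight +1 for basis states with bit i = 0, weight -1 with bit i = 1
    # (qubit 0 is taken as the most significant bit of the basis index)
    idx = np.arange(len(state))
    bits = (idx >> (n_qubits - 1 - qubit)) & 1
    return float(np.sum(np.abs(state) ** 2 * (1 - 2 * bits)))

def encode(psi_in, n, v, U1):
    # |psi~_in> = |psi_in> (x) |0>^(v-n); apply U1, then measure every qubit
    anc = np.zeros(2 ** (v - n)); anc[0] = 1.0
    psi = U1 @ np.kron(psi_in, anc)
    return np.array([z_expectation(psi, i, v) for i in range(v)])

rng = np.random.default_rng(1)
n, v = 2, 3
psi_in = rng.standard_normal(2 ** n) + 1j * rng.standard_normal(2 ** n)
psi_in /= np.linalg.norm(psi_in)
xi = encode(psi_in, n, v, haar_unitary(2 ** v, rng))
print(xi.shape)   # (3,): one latent component per measured qubit
```

Each component necessarily lies in [−1, 1], which is exactly why the latent space is V = [−1, 1]^v.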
The set ξ = (ξ^(1), ..., ξ^(v)) forms the latent vector, which is identified as the classical representation of an input quantum state. Following the encoder, the decoder is extremely similar to the map defined in equation (4). However, this time it is a mapping that returns a quantum state when given a latent vector: a classical ANN takes ξ as input and outputs the parameters θ of a second PQC, which prepares the state

D(ξ) = U₂(θ(ξ)) |ψ₀⟩.    (5)

Training changes only the weights of the ANN and not the parameters of the PQC directly; the PQC parameters are designed to be the output of the ANN. |ψ₀⟩ in Eq. (5) is an appropriate n-qubit ansatz for the type of states involved. In this paper, we simply take |ψ₀⟩ = |0⟩^{⊗n}, which requires no additional operations for ansatz preparation.
It is important to note that both encoder and decoder have hyper-parameters (such as the number of parameters that define the PQCs, the number of neurons and depth of the ANN, etc.) over which one must optimise. From here on we will assume that these hyper-parameters are accounted for, noting that there is possible future work in rigorously addressing their exact optimisation for this HQA model. Now that we have defined the encoder and decoder, the HQA model, A, is the combination

A := D ∘ E,    (6)

where we will refer to the output of the HQA as A(|ψ_in⟩) = |ψ_out⟩. Though the HQA looks as though it is a single run through both the encoder and decoder, there is an implicit sub-routine for the encoder, where the PQC must be sampled multiple times to obtain ξ. Now that the components of the HQA have been pieced together, the model is trained to copy the input such that |ψ_out⟩ ≈ |ψ_in⟩. To do this we require a measure of the distance between quantum states, which will be the foundation of the HQA cost function. There are many possible ways in which to construct a sensible loss function; the one we will consider is one minus the fidelity F between the model output and the expected training output. Hence, for a chosen training data set of K quantum states, {|ψ_in,i⟩}_{i=1}^{K}, and the fidelity defined as F(|φ⟩, |ψ⟩) = ⟨ψ|φ⟩⟨φ|ψ⟩, we define the loss

L = 1 − (1/K) Σ_{i=1}^{K} F( A(|ψ_in,i⟩), |ψ_in,i⟩ ),    (7)

where the sum is the average fidelity across all the training instances. In practice, the learner does not calculate the loss over all training instances per iteration, but rather over a small batch, or even a single instance. This is because (i) it is computationally expensive to have a loss function that sums over all training instances, and (ii) doing so may result in over-fitting to the training data.
The fidelity between two states is maximal when the states are identical and minimal when they are orthogonal. Hence, the aim is to maximise the fidelity to achieve |ψ_in⟩ ≈ |ψ_out⟩, and thereby minimise the loss, which lies in the range [0, 1]. This is a natural choice for a loss function, as the fidelity is a common distance measure between quantum states. The fidelity has also been used successfully for the construction of denoising quantum autoencoders [25] and is hence a great starting point for the construction of the HQA loss function. Selecting a method to measure the fidelity now becomes a hyper-parameter of the model - for which this paper will use the swap test (discussed in Appendix D).
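To make the loss concrete, the following sketch computes F(|φ⟩, |ψ⟩) = |⟨ψ|φ⟩|² directly from state vectors, and mimics a swap-test estimate using the standard fact that the test returns outcome 0 with probability (1 + F)/2; the shot count is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_state(dim, rng):
    psi = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
    return psi / np.linalg.norm(psi)

def fidelity(phi, psi):
    # F(|phi>, |psi>) = <psi|phi><phi|psi> = |<psi|phi>|^2
    return float(np.abs(np.vdot(psi, phi)) ** 2)

phi, psi = random_state(8, rng), random_state(8, rng)
F = fidelity(phi, psi)
loss = 1.0 - F                      # single-instance HQA loss

# The swap test returns outcome 0 with probability (1 + F)/2, so F can be
# estimated from repeated shots as F_hat = 2*p0_hat - 1.
shots = 200_000
outcomes = rng.random(shots) < (1 + F) / 2
F_hat = 2 * outcomes.mean() - 1
print(abs(F_hat - F) < 0.02)        # True: sampling estimate matches exact value
```

The 1/√shots scaling of the estimate's uncertainty is precisely the ε_fid that enters the sampling-complexity discussion below.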
Using the swap test, we can make an estimate of the training complexity. The sampling complexity per iteration is derived in Appendix E in terms of P_E = dim(α), the number of parameters in the encoder, P_D = dim(θ), the number in the decoder, ε_ξ = Δξ_i, the uncertainty in each component of the latent vector ξ, and ε_fid, the uncertainty in the fidelity measurement. Determining the required ε_ξ and ε_fid is quite non-trivial. This non-triviality will become evident when dealing with the application in Section IV, where the HQA is fundamentally unable to learn some states due to their stochasticity.

B. Order in latent space
Now that the HQA has been constructed, one can observe the powerful nature of representing states in a classical latent space. Training the HQA gives rise to order in latent space that is created purely through matching the output quantum state to the input. In other words, even though we are not directly supplying the HQA with information about the states being trained on, the model is able to learn their differences and form patterns in latent space. It is this order that we exploit to apply ML techniques to cluster and classify states in Section IV.
The HQA is trained on a set of quantum states that have an underlying symmetry. In this paper, |ψ_in,i⟩ will correspond to a set of distinct amplitude-encoded Gaussian states that can be easily analysed.
A Gaussian distribution is defined as

N(x; μ, σ) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) ),

where μ and σ are the mean and standard deviation respectively. Quantising this function over N equally spaced values gives a discrete distribution d_i(μ̃, σ̃) for i ∈ {0, ..., N − 1}, where the quantised parameters are obtained via the ceiling function ⌈a⌉, which returns the smallest integer greater than or equal to a. Now that we have a discrete distribution, d_i(μ̃, σ̃), it can be encoded into the N = 2^n amplitudes of an n-qubit quantum state,

|ψ(μ̃, σ̃)⟩ = (1/√C) Σ_{i=0}^{N−1} d_i(μ̃, σ̃) |i⟩,

where C = Σ_{i=0}^{N−1} |d_i(μ̃, σ̃)|² is the normalisation constant. The ability of such encoded states to be variationally encoded has been shown in [32].
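A short sketch of this amplitude encoding, assuming an illustrative sampling grid of N equally spaced points on [−1, 1]:

```python
import numpy as np

def gaussian_state(n, mu, sigma):
    # Discretise N(x; mu, sigma) on N = 2**n grid points and amplitude-encode.
    # The grid range [-1, 1] is an illustrative choice, not the paper's exact grid.
    N = 2 ** n
    x = np.linspace(-1.0, 1.0, N)
    d = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    C = np.sum(np.abs(d) ** 2)          # normalisation constant
    return d / np.sqrt(C)

psi = gaussian_state(n=5, mu=0.0, sigma=0.3)
print(len(psi), np.isclose(np.sum(np.abs(psi) ** 2), 1.0))   # 32 True
```

Varying (mu, sigma) sweeps out exactly the two-parameter family of states whose latent-space structure is analysed next.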
With this set of distinct quantum states, the HQA is employed to generate a classical latent space that represents the subset in which these states lie. Automatic differentiation is assumed through this unsupervised model to update the PQC parameters in the encoder and the weights of the ANN that determine the decoder. The PQC architectures of U₁(α) and U₂(θ) are shown in Appendix C. We take dim(α) = 4v and dim(θ) = 4n, and a feedforward ANN with one hidden layer of size 2v. The training was performed in batches of 2 states using the Adam optimiser (γ = 0.1) [33], for a fixed number of epochs - the number of iterations for which each training sample is used for optimisation.
In Figure 3a, we see that training convergence is robust to latent space dimension, with similar loss evolution for all v. However, in Figure 3b, larger v is seen to decrease the loss in testing. As expected we see greater expressibility of the quantum state with a larger latent space.
Once trained, to analyse this model we can apply only the trained encoder, E, to an input quantum state and observe its location in latent space - this latent vector is what we refer to as the classical representation of the input quantum state. An illustration of this process is shown in Figure 4. The latent vector can then be obtained for all the amplitude-encoded Gaussian states used for training, and then plotted. Figure 5 shows such a plot for n = 5 and a latent size v = 12, where the first two components (or parameters) of the vector are illustrated. This figure shows that Gaussian encoded states are distributed in a pattern distinguishing their means and standard deviations. In this specific example, we see that an outward radial movement in the presented latent space corresponds to decreasing the standard deviation, and a positive polar rotation corresponds to increasing the mean. Such elegant structure reflects the patterns in the original set of quantum states - the two degrees of freedom, μ and σ. However, these patterns can be extremely non-trivial to visualise, as can be seen when we extend the results of Figure 5 to a third dimension, as shown in Figure 6. Hence this non-triviality suggests the use of ML to learn patterns in latent space.
It is not necessarily true that patterns will be visible when plotting the first two parameters of latent space. A plot of, say, the 9th and 10th latent parameters shows no patterns at all, as points seem to congregate on a line or point. This indicates that these latent parameters are not being used to distinguish the quantum states, suggesting that a dimensionality reduction of the latent space is possible. This is where one can use principal component analysis (PCA) [6], which will both allow for a clearer understanding of how the latent space is used, and also transform the space so that the most principal components can be plotted. Now that we have identified a method of systematically analysing latent space, we can ask whether this pattern would still occur if one simply trained on Gaussians with different mean values, or similarly, if the model was trained on Gaussians with only varying standard deviations. The results of doing so are shown in Figure 7, where the two most principal components from PCA dimensionality reduction are plotted. The formation of patterns in latent space can only be observed when training has seen the variations in state. For example, in Figure 7, where only μ was varied and σ was kept constant, there is order formed distinguishing μ but not σ. The opposite occurs when the training is switched. This suggests the HQA naturally attempts to allocate areas to similar quantum states without the need for additional supervision. In the context of applications, this is extremely useful, as one can exploit the latent space location of a particular unknown state in reference to other known states - where similarity was previously not necessarily obvious. In general, this means that we can infer information about states by applying ML algorithms to the states in latent space, as will be explored in Section IV.

(Figure 7 caption: The two parameters shown are in the direction of the two most principal components, which account for 55% of the variation in (a) and 71% of the variation in (b). This variation refers to the spread of data in the plotted components. One should also note that, since the latent space has been transformed to the basis of the two principal components, the parameters of the latent space are not necessarily between −1 and 1.)
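The PCA step itself can be sketched directly with an SVD. The synthetic "latent vectors" below are stand-ins with a planted two-dimensional signal, purely to illustrate the projection and the explained-variance fraction quoted in the figure caption:

```python
import numpy as np

def pca(latents, k=2):
    # Project latent vectors onto their k most principal components and
    # return the fraction of total variance those components account for.
    Xc = latents - latents.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return Xc @ Vt[:k].T, explained

# Stand-in latent vectors (v = 12) with a planted 2-dimensional signal.
rng = np.random.default_rng(3)
Z = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 12)) \
    + 0.1 * rng.standard_normal((200, 12))
coords, frac = pca(Z, k=2)
print(coords.shape)   # (200, 2): ready to plot the two principal components
```

Note that, as in the figure caption, projected coordinates are no longer confined to [−1, 1] once the basis is rotated.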

IV. APPLICATION
It is evident that clear patterns emerge in latent space from training on Gaussian distributions. An extension to merely observing these patterns is obtaining information about quantum states through their latent representation. This includes both understanding what it means for states to be located near each other in latent space, and seeing whether one can infer information about states by using ML on their latent space representations.
In order to test these methods, we construct a toy problem involving two classes of states. We define amplitude-encoded skewed Gaussian distributions of the form

d_i = η_i · v_i · N_i,

where d_i is the amplitude of the i-th orthogonal state, N_i is a Gaussian distribution, v_i = max{0, a·i + b} is a linear function, and we have the class label definitions

η_i = 1,           class = "smooth",
η_i = η,           class = "non-smooth",

where η ∈ [0, 1] is a uniform stochastic term that fluctuates each time a distribution is called for training or testing.
To illustrate the power of the HQA, we introduce the artificial objective of clustering states as either smooth or non-smooth, simply by applying classical ML techniques to their latent space representations. However, before one can obtain these representations, one needs to train the HQA, raising the question of selecting training instances. In general, there is no clear answer as to how one should apportion the training instances between the two classes. However, it was seen in Section III B that order was formed when the HQA had seen different distributions, without which the latent representations appear not to separate differing distributions. Hence we look at two HQAs: (i) one that is trained with only smooth states, and (ii) a HQA trained with both classes in equal proportion. An important point to note here is that the HQA will fundamentally not be able to reproduce the states with the applied stochastic term, as the set of such distributions is far too large. Nevertheless, it will be shown - through both clustering and classification - that the HQA does not need to recreate states perfectly to be useful in the context of ML in latent space.
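A hedged sketch of how such a two-class toy data set might be generated - the constants a and b, the grid, and the exact product form are illustrative assumptions, not the paper's precise definition:

```python
import numpy as np

def skewed_gaussian_state(n, mu, sigma, smooth, a=1.0, b=0.5, rng=None):
    # Gaussian envelope times a clipped linear function v = max(0, a*x + b);
    # the "non-smooth" class multiplies each amplitude by a fresh uniform term.
    # Constants a, b and the grid range are illustrative assumptions.
    rng = rng or np.random.default_rng()
    N = 2 ** n
    x = np.linspace(-1.0, 1.0, N)
    gauss = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    v = np.maximum(0.0, a * x + b)
    eta = np.ones(N) if smooth else rng.uniform(0.0, 1.0, N)
    d = eta * v * gauss
    return d / np.linalg.norm(d)

rng = np.random.default_rng(4)
smooth_state = skewed_gaussian_state(5, 0.0, 0.4, smooth=True, rng=rng)
noisy_state = skewed_gaussian_state(5, 0.0, 0.4, smooth=False, rng=rng)
print(np.isclose(np.linalg.norm(smooth_state), 1.0))   # True: valid quantum state
```

Because the non-smooth amplitudes change on every call, no fixed latent vector can reproduce them exactly, which is the stochasticity the HQA cannot learn away.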
After HQA training, it is possible to analyse the latent representations of the quantum states, obtained by applying E to a sample of the states (the results of which are shown in Figure 8). Using both classes of distributions for training constructs a HQA (Figure 8b) that is still able to allocate regions of varying mean in its principal components. However, the HQA trained on only smooth states lacks this order (Figure 8a). Interestingly, minor components of the vector clearly distinguish the two classes of states, regardless of the training.
Having constructed classes of states, in this section we will look at (i) clustering states based on their latent representations, and (ii) providing an enhancement on classifying quantum states with semi-supervised learning.

A. Clustering
In classical ML, the most common algorithm for clustering is kmeans [34] (also referred to as Lloyd's algorithm), which attempts to find clusters in a data set by observing some classical distance measure (an introduction to kmeans is presented in Appendix F). The quantum equivalent was first proposed in [35], in the form of a so-called quantum kmeans. There, the authors are able to produce a quantum state corresponding to the k clusters with complexity that grows linearly with the number of qubits, n. However, obtaining a classical description becomes exponential, as the quantum states need to be measured using quantum state tomography (QST) techniques, such as compressed sensing [7,8], which requires O(kN² log N) where N = 2^n. In addition, the authors exploit the fact that kmeans can be expressed as a quadratic programming problem, which can be solved using a quantum adiabatic algorithm. In [36], an approach is presented with efficient quantum methods of calculating the Euclidean distance. Alternatively, the quantum approximate optimisation algorithm (QAOA) [37] is used in [38] for clustering, by association to the maximum cut problem.
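For reference, Lloyd's algorithm itself is only a few lines of numpy once the latent vectors are classical; the blobs below stand in for latent representations of two well-separated groups of states:

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    # Lloyd's algorithm: alternate nearest-centroid assignment (Euclidean
    # distance) and moving each centroid to the mean of its cluster.
    rng = rng or np.random.default_rng()
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs stand in for latent vectors of two groups of states.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(1.0, 0.1, (50, 2))])
labels, _ = kmeans(X, k=2, rng=rng)
print(labels.shape)   # (100,): one cluster label per latent vector
```

Because clustering happens in the v-dimensional latent space rather than in C^{2^n}, no tomography step is needed.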
Many of the methods in literature suffer from requiring QST techniques to classically obtain clusters, which is not required when using the HQA. We define a method of clustering states using classical representations generated by the HQA, after which classical clustering algorithms are used on a space that is exponentially smaller.
For the classical clustering of latent space, a method more sophisticated than kmeans is required. There are many advanced methods that allow non-linear clustering [39], but in general such clustering is not natural for this task. In this work we will use Gaussian mixture modelling, an extension of the kmeans clustering algorithm that has the flexibility to change the importance of certain parameters of the vector [40] - kmeans simply uses a Euclidean measure of distance.
The predicted labels from clustering latent space are shown in Figure 9. The results show remarkable agreement with the true labels when considering the minor components of the latent vector: 85.1% accuracy for the HQA trained with only smooth states and 84.4% accuracy for the HQA trained with both classes. On the other hand, clustering based on the principal components is seen to amount to guessing the class of state. The reason for this is evident when looking at the principal components: fitting using all components is seen to identify clusters relating to the mean of the distributions (grouping negative and positive means), which is also a valid clustering process. Importantly, this is only seen for the HQA trained with both classes. Such a deficiency is attributed not to a limitation of the algorithm, but rather to the non-uniqueness of the task's solution.
At this point, one should retrace the steps of this clustering method in the context of an application. It is conceivable that a quantum experiment is conducted that produces quantum states about which the user has no information. This stream of states could then be used for the training of the HQA. Importantly, however, one should note that a single sample of a quantum state is not sufficient: both the encoder sub-routine and the fidelity computation require multiple copies of the state being produced by the experiment. Once a level of convergence has been reached, a clustering algorithm - such as kmeans or Gaussian mixture modelling - could be used on the latent representations of these states to identify possible groups. Finally, one can learn to identify - possibly highly non-trivial - distinctions between quantum states. This was shown in the distinction between smooth and non-smooth states; however, more work is required to extend such a method to further applications.

B. Semi-supervised classification
In a similar process to using classical clustering methods on latent space, we now use classical supervised learning models to classify quantum states. Specifically, two ML algorithms are used: the Support Vector Machine (SVM) [41] and Logistic Regression (LR) [42]. These are both supervised learning algorithms that are successfully used for classification. The former attempts to obtain a separating hyper-plane splitting the smooth and non-smooth classes, while the latter is a form of binary regression. It is enough for the reader to understand that these are supervised learning algorithms that are fundamentally linear, but where the SVM can be extended to non-linear decision boundaries with the use of what is called a kernel. The decision boundaries of these models are shown in Figure 10.
Splitting the 1600 classical data points (seen in Figure 8) in a 3 : 10 testing/training ratio, the SVM and LR models are trained in two ways: (i) on all components of the latent vectors, and (ii) on just the principal components. The accuracy on the test data for all these variations (including kernel) is shown in Table I.
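As a stand-in for the LR baseline, a minimal logistic-regression classifier on synthetic "latent vectors" can be sketched as follows; the data and hyper-parameters are illustrative, not the paper's trained models:

```python
import numpy as np

def train_logistic(X, y, lr=0.5, epochs=300):
    # Minimal logistic regression by gradient descent: fits a linear decision
    # boundary w·x + b = 0 in latent space.
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid probabilities
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Separable stand-ins for latent vectors of "smooth" (0) and "non-smooth" (1) states.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 0.3, (80, 4)), rng.normal(1, 0.3, (80, 4))])
y = np.array([0] * 80 + [1] * 80)
w, b = train_logistic(X, y)
acc = np.mean(predict(X, w, b) == y)
print(acc)   # 1.0 for these well-separated classes
```

In the semi-supervised setting, only the small labelled subset is used in a fit like this, while the bulk of unlabelled states shaped the latent space itself during HQA training.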
In general, both SVMs with non-linear kernels are seen to have a higher classification accuracy than LR. This makes sense, as the distribution of states in latent space was seen to be highly non-linear. More importantly, the interesting result from the SVM's performance is that the HQA trained on only the smooth distributions is comparable to the HQA trained on both, when considering all components of the vector. At the same time, the HQA trained on both classes performs better when considering ML on only the principal components. To understand this behaviour, we recall that the HQA model attempts to separate states in only a few principal components. This means that when both classes were used for training the HQA, this separation was identifiable by the algorithm.
Considering all components, the polynomial-kernel SVM with an HQA trained only on smooth distributions achieves an accuracy of 0.95. This occurs because the non-smooth distributions, which were never seen by the HQA, are stored off the learned manifold in distinct, unutilised regions of the latent space. It was hence easy for both ML algorithms to distinguish between the classes, with even the linear decision boundary of LR achieving an accuracy of 0.87.
These are significant results to keep in mind; however, this is not necessarily the most natural use of the HQA. If such a classification of quantum states were the main objective, given a set of labelled states, one could instead train only the encoder E rather than an entire HQA. However, the power of the HQA lies in its ability to be trained unsupervised, i.e. without any labelling of the states being fed to it. For example, it is possible to train an HQA on the output states of some quantum experiment without knowing anything about the states themselves. Post-training, one would only be required to label a few states and perform classical ML to obtain a working quantum state classifier. The reason this works lies in the ability of the HQA to learn a manifold on which the relevant states lie. Such a manifold is far smaller than the space of all states, hence requiring far fewer labelled instances for training. In the ML literature, this process is known as semi-supervised learning: only a portion of the instances are labelled, but the unlabelled instances also aid the overall classification process.

TABLE I. Results of the binary classification problem of distinguishing smooth and non-smooth states from their latent space representations. This is done for two HQA models, with (n, v) = (7, 12), that identify different latent vectors: one HQA trained only on smooth states and another trained on both classes. The accuracy is simply defined as the proportion of correct classifications on a separate group of test data (size = 480). Importantly, neither the SVM nor the LR model is optimised for performance in this classification problem; rather, these classical models illustrate the possibility of classifying states based on their latent space representations. For completeness, the hyper-parameters used for the SVM are: C = 1.0 and γ = 2 (Linear), 20 (RBF).
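The semi-supervised workflow described above, labelling only a few states after unsupervised training and letting the unlabelled instances shape the decision, can be sketched with scikit-learn's LabelSpreading (one standard semi-supervised method; the paper itself does not prescribe a specific algorithm, and the latent vectors and cluster structure below are synthetic stand-ins):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical latent vectors from an unlabelled HQA training run.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.4, (200, 3)), rng.normal(2, 0.4, (200, 3))])
y_true = np.array([0] * 200 + [1] * 200)

# Only a handful of states are labelled after the fact;
# -1 marks an unlabelled instance in scikit-learn's convention.
y_partial = np.full(400, -1)
labelled_idx = [0, 1, 200, 201]          # just two labelled states per class
y_partial[labelled_idx] = y_true[labelled_idx]

model = LabelSpreading(kernel="rbf", gamma=0.5).fit(X, y_partial)
accuracy = (model.transduction_ == y_true).mean()
print(accuracy)
```

Because the states concentrate on a low-dimensional manifold in latent space, a few labels per class are propagated to the unlabelled instances, mirroring the argument in the text.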
Finally, the PCA reduction takes the 4 most principal components of the latent vector. The PCA-reduced space accounts for 55% of the variance for the HQA trained on both classes, and 48% for the HQA trained only on smooth states.
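A brief sketch of this PCA step, on synthetic 12-dimensional latent vectors (matching v = 12) whose variance is, purely for illustration, concentrated in a few directions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 12-dimensional latent vectors for 1600 states; the structure
# lives in a 4-dimensional subspace plus small isotropic noise.
rng = np.random.default_rng(3)
signal = (rng.normal(size=(1600, 4)) @ rng.normal(size=(4, 12))) * 2.0
latent = signal + rng.normal(scale=0.5, size=(1600, 12))

# Keep the 4 most principal components, as in the text.
pca = PCA(n_components=4).fit(latent)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
reduced = pca.transform(latent)             # shape (1600, 4)
```

In the paper's actual latent spaces the retained variance is far lower (55% and 48%), reflecting that the HQA spreads information over more than four directions.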

V. CONCLUSION
QML algorithms are yet to conclusively demonstrate advantage in the NISQ era. Crucial problems remain to be resolved before these methods can be applied on real devices. One such problem is the emergence of barren plateaus and narrow gorges in gradient-based optimisation. This is common to all models that involve the optimisation of PQCs, and it is hence critical that it be addressed for the specific HQA design implemented in this work. Another impediment is the difficulty of obtaining continuous values from qubit-based devices whose measurements are fundamentally digital. In this work, this is overcome with repeated measurement. However, it is possible to apply similar principles to Continuous-Variable (CV) quantum devices [43] for a more efficient measurement of continuous latent space vectors. This approach is left for future work, in line with the QML models proposed in [44]. Finally, the effect of noise on the HQA requires further research, with potential robustness suggested by training on stochastic states.
Due to these problems, it was important that the HQA be proposed in a manner agnostic to its specific implementation (as illustrated in Figure 1c). The crucial aspect of this work is therefore the proposed paradigm of learning quantum states through the application of ML techniques to their classical representations, representations generated by training a hybrid quantum autoencoder (HQA).
To demonstrate its successful application, the HQA was constructed using PQCs and trained on a set of Gaussian amplitude encoded states. Patterns associated with the mean and standard deviation of the Gaussian encoded quantum states were visually recognisable in their latent space representations. The emergence of order in the latent space was exploited for clustering and semi-supervised classification. In the context of (non-)smooth states (defined in Eq. (13)), we achieved 84% accuracy for clustering and 93% for classification. Though the accuracy is highly problem dependent, the states in question had non-trivial distinctions, demonstrating the robustness of this clustering and classification approach.
Finally, it is expected that an end-to-end application of the HQA will involve a set of training states obtained from supplemental quantum algorithms, such as the variational quantum eigensolver [3]. In [5], the constructed quantum autoencoder is classically simulated to compress ground states of the hydrogen molecule at various bond lengths r. Furthermore, there have been attempts at entanglement classification [45,46], a process which can potentially employ variational methods. Recently, QAEs were suggested for use in low-rank state fidelity estimation [47], for which the structured classical latent space constructed by the HQA could be exploited. These examples motivate uses of the HQA beyond the contrived application studied in this work. Nonetheless, the novel paradigm proposed in this paper lays the framework for a unique approach to extracting information and obtaining underlying structure from sets of quantum states.

Appendix A: Artificial neural networks

The real utility of ANNs lies in their ability to learn and approximate any continuous function given sufficient data. The Universal Approximation Theorem states that a feed-forward neural network with a single hidden layer is able to approximate any continuous function on R^N [52]. It is important to note that non-linear activation functions are required; without them, the network reduces to a linear model. This is evident from equation (A2), where it can be seen that linear σ_i implies a linear f, as function compositions preserve linearity. The non-trivial aspect is that there is no restriction on the non-linearity of the activation function, nor on how this relates to the number of neurons required in the hidden layer. Due to this universality of function approximation, ANNs do not have the same hyper-parameter tuning problem as other ML models. Hyper-parameters refer to parameters of the model that must be chosen by the user, such as the number of neurons in the case of an ANN.
The one hyper-parameter an ANN does not require is the choice of decision-boundary type, which in other models is crucial. At the same time, a significant problem with ANNs is their tendency to approximate the training instances too closely rather than generalising from these data points. This is known in the ML literature as over-fitting, and regularisation techniques exist to mitigate this problem.
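The remark that linear activations collapse the network to a linear model can be checked directly. This numpy sketch evaluates a one-hidden-layer network in the spirit of Eq. (A2), with identity and tanh activations; the weights are arbitrary random values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=(8, 1))
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=(1, 1))

def mlp(x, sigma):
    """One-hidden-layer network f(x) = W2 sigma(W1 x + b1) + b2 (cf. Eq. (A2))."""
    return W2 @ sigma(W1 @ x + b1) + b2

x = np.linspace(-1, 1, 50).reshape(1, -1)
f_linear = mlp(x, sigma=lambda z: z)   # identity activation
f_tanh = mlp(x, sigma=np.tanh)         # non-linear activation

# With an identity activation the composition collapses to a single affine
# map, so second differences over an evenly spaced grid vanish; with tanh,
# curvature survives.
print(np.allclose(np.diff(f_linear.ravel(), n=2), 0))
print(np.allclose(np.diff(f_tanh.ravel(), n=2), 0))
```

Since W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), the identity-activation network is a straight line in x regardless of the number of hidden neurons, which is exactly why non-linear activations are essential to the approximation power.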

Appendix B: The parameter shift rule
In this paper, the quantum gradients of continuous parameters are computed using the parameter-shift differentiation proposed in [15]. There, it is shown that the analytical gradient of a variational circuit f, defined in equation (1), can be found when the circuit is composed of gates of the form G(μ) = e^{-iμG}, generated by a Hermitian operator G with strictly two eigenvalues ±r. The derivative ∂_μ f can be estimated using two additional evaluations of the quantum device, by placing the gates G(±π/(4r)) in the original circuit next to the gate being differentiated. Since for unitarily generated one-parameter gates we have G(a)G(b) = G(a + b), this amounts to a shift of the parameter by s = π/(4r) to find the gradient,

∂_μ f = r [ f(μ + s) − f(μ − s) ],

which is aptly named the parameter-shift rule. For a generator G with more than two eigenvalues this strategy fails; however, one is able to use an ancilla qubit and perform a decomposition of the derivative of the gate to obtain a gradient, as elaborated in [15]. As an important additional benefit, the parameter-shift rule has been shown to hold on noisy quantum devices [53]. Furthermore, recent works have attempted to generalise this method of obtaining an analytic gradient to more intricate unitaries using stochastic techniques [54].
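The rule can be verified numerically for the simplest case: a Pauli rotation has r = 1/2, so s = π/2, and for the toy circuit RY(μ)|0⟩ measured in Z the output is f(μ) = cos(μ) in closed form, so no quantum library is needed for this illustrative check:

```python
import numpy as np

# For a Pauli-rotation gate the generator has eigenvalues ±r with r = 1/2,
# giving the shift s = pi/(4r) = pi/2.
r = 0.5
s = np.pi / (4 * r)

def f(mu):
    # Circuit output <psi(mu)|Z|psi(mu)> for |psi(mu)> = RY(mu)|0>,
    # known analytically for this toy example.
    return np.cos(mu)

def parameter_shift_grad(mu):
    # Parameter-shift rule: df/dmu = r [ f(mu + s) - f(mu - s) ]
    return r * (f(mu + s) - f(mu - s))

mu = 0.7
analytic = -np.sin(mu)                 # exact derivative of cos(mu)
shifted = parameter_shift_grad(mu)
print(abs(analytic - shifted))         # agreement to machine precision
```

Unlike finite differences, the two shifted evaluations give the exact gradient, which is why the rule is well suited to sampled, noisy circuit outputs.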
Having acquired the derivatives of the parameters in VQCs, one is now able to explore their use in QML, especially in hybrid models involving both classical and quantum processing units. The ability to compute gradients of VQCs means that a VQC and an ANN can be attached as components of a larger ML algorithm, over which the back-propagation algorithm remains applicable (shown in Figure 12). To implement these models, PennyLane [49], an open-source Python 3 software framework for hybrid quantum-classical optimisation and ML, is used. The library interfaces with popular machine learning libraries such as TensorFlow, PyTorch and autograd, while also providing APIs for access to publicly available quantum devices such as those by Rigetti and IBM, alleviating some of the tedious programming. With only nearest-neighbour couplings, such a structure is also known as a hardware-efficient ansatz.
FIG. 14. A circuit diagram depicting the swap test algorithm, used to compute the overlap |⟨φ|ψ⟩|² through repeated measurements of the ancilla qubit.

Writing |φ⟩ = Σ_j φ_j |j⟩ = Σ_j |φ_j⟩, and similarly |ψ⟩ = Σ_i |ψ_i⟩, one repeatedly measures the ancilla qubit of the circuit shown in Figure 14. This works since the state before measurement is

(1/2) Σ_{i,j} [ |0⟩ ( |ψ_i⟩|φ_j⟩ + |φ_j⟩|ψ_i⟩ ) + |1⟩ ( |ψ_i⟩|φ_j⟩ − |φ_j⟩|ψ_i⟩ ) ].
Now, measuring the ancilla qubit in the Z-basis, the probability of measuring the eigenvalue z = ±1 is

Pr(z = ±1) = (1 ± |⟨φ|ψ⟩|²)/2,

where we have used ⟨ψ_i|ψ_j⟩ = |ψ_i|² δ_ij, due to orthogonality and the normalisation condition Σ_i |ψ_i|² = 1, and similarly for the |φ_j⟩ states. Therefore, by measuring and recording the ancilla qubit enough times, we can work out the overlap between the two states in terms of the probability Pr(z = ±1),

|⟨φ|ψ⟩|² = 2 Pr(z = +1) − 1,    (D4)

where again we have made a simplification using the fact that the ψ_i and φ_i are normalised. This means that maximising the fidelity implies the minimisation of L_MSE, under the assumption that the amplitudes of the two states are non-negative real values. This can be shown by using the triangle inequality on equation (D3) to show that |⟨φ|ψ⟩|² ≤ Σ_i (ψ_i φ_i)², revealing that the maximisation of the fidelity implies the minimisation of L_MSE. However, the fidelity statement is stronger due to the inequality. It should, however, be noted that the swap test requires a large number of SWAP gates, and hence further research is required to enhance this fidelity measurement.
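A numerical sketch of this estimation procedure, simulating the ancilla outcomes as Bernoulli trials with Pr(z = +1) = (1 + |⟨φ|ψ⟩|²)/2; the eight-dimensional random states with non-negative real amplitudes are illustrative, matching the assumption made in the text:

```python
import numpy as np

rng = np.random.default_rng(5)

def normalise(v):
    return v / np.linalg.norm(v)

# Two states with non-negative real amplitudes, as assumed in the text.
psi = normalise(rng.random(8))
phi = normalise(rng.random(8))

fidelity = abs(psi @ phi) ** 2          # exact overlap |<phi|psi>|^2
p_plus = (1 + fidelity) / 2             # Pr(z = +1) for the swap-test ancilla

# Simulate M_swap repeated ancilla measurements (Bernoulli trials),
# then invert Pr(z = +1) = (1 + F)/2 to estimate the fidelity.
M_swap = 100_000
samples = rng.random(M_swap) < p_plus
fidelity_est = 2 * samples.mean() - 1

print(fidelity, fidelity_est)
```

The statistical error of the estimate scales as 1/√M_swap, which is the Bernoulli-trial argument used in the sampling analysis below.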
Each evaluation of the HQA requires M_E samples to determine the latent vector and M_swap samples to obtain the fidelity. We can relate the error in the fidelity, ε_fid, to M_swap such that M_swap = 1/ε_fid². This relation relies on the fact that the swap test measurement is a Bernoulli trial. Similarly, we relate the uncertainty in the latent vector components to the number of encoder samples: M_E = 1/ε_ξ², where ε_ξ is the uncertainty in each component of the latent vector.
A single evaluation of the HQA requires M_E + M_swap samples. However, training requires the calculation of gradients through the parameter-shift rule (Section B). Specifically, one must obtain the gradients with respect to all parameters in the model: {dL/dα_i}, {dL/dw_ij} and {dL/dθ_i}, where the α_i are the P_E encoder parameters, the w_ij are the ANN parameters (weights) and the θ_i are the P_D decoder parameters. The gradient evaluations follow the chain rule,

dL/dα_i = Σ_j (dL/dθ_j)(∂θ_j/∂E)(dE/dα_i),    dL/dw_ij = Σ_k (dL/dθ_k)(∂θ_k/∂w_ij),

where the output of the encoder, E, becomes the input to the ANN, f, and the output of the ANN becomes θ_i. Note that only the α_i and w_ij are optimised, but we also require the gradient with respect to the θ_i. Sampling a quantum circuit is required to obtain the gradients dF/dθ_i and dE/dα_i: dF/dθ_i must be calculated for all P_D parameters, and dE/dα_i for all P_E parameters. The number of circuit samples required for dE/dα_i and dF/dθ_i is proportional to M_E and M_swap, respectively. In addition, an evaluation of the HQA is required to calculate the loss for that particular iteration, requiring an additional M_E + M_swap samples. Putting all this together, the sampling complexity per iteration is

O( (P_E + 1) M_E + (P_D + 1) M_swap ).
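As a back-of-the-envelope illustration of this per-iteration sample budget. The parameter counts P_E, P_D and the target precisions below are hypothetical, and a factor of two per parameter is assumed for the two shifted-circuit evaluations of the parameter-shift rule:

```python
# Illustrative sample budget for one training iteration of the HQA.
# eps values and parameter counts are hypothetical choices.
eps_fid, eps_xi = 0.01, 0.01
M_swap = round(1 / eps_fid ** 2)   # Bernoulli-trial shot count for the fidelity
M_E = round(1 / eps_xi ** 2)       # shots per latent-vector estimate

P_E, P_D = 24, 24                  # hypothetical encoder/decoder parameter counts

# Two shifted-circuit evaluations per parameter (parameter-shift rule),
# plus one forward evaluation of the HQA for the loss itself.
grad_samples = 2 * P_E * M_E + 2 * P_D * M_swap
loss_samples = M_E + M_swap
total = grad_samples + loss_samples

print(total)
```

Even for these modest parameter counts the budget approaches a million shots per iteration, which motivates the interest in more sample-efficient fidelity estimation noted in the text.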