An end-to-end trainable hybrid classical-quantum classifier

We introduce a hybrid model combining a quantum-inspired tensor network and a variational quantum circuit to perform supervised learning tasks. This architecture allows for the classical and quantum parts of the model to be trained simultaneously, providing an end-to-end training framework. We show that compared to the principal component analysis, a tensor network based on the matrix product state with low bond dimensions performs better as a feature extractor for the input data of the variational quantum circuit in the binary and ternary classification of MNIST and Fashion-MNIST datasets. The architecture is highly adaptable and the classical-quantum boundary can be adjusted according to the availability of the quantum resource by exploiting the correspondence between tensor networks and quantum circuits.


Introduction
Quantum computing (QC) has demonstrated superiority in problems intractable on classical computers [1,2], such as factorization of large numbers [3] and searching in an unstructured database [4]. Recent growth of the quantum volume in noisy intermediate-scale quantum (NISQ) [5] devices has stimulated rapid development in circuit-based quantum algorithms. Due to the noise associated with the quantum gates and the lack of quantum error correction on NISQ devices, performing quantum computation with large circuit depth is currently impossible. It is therefore highly desirable to develop quantum algorithms that are resilient to noise and require only moderate circuit depth. Variational quantum algorithms [6] are a class of algorithms currently under rapid development in many fields. In particular, quantum machine learning (QML) [7-9] using variational quantum circuits (VQCs) shows great potential for surpassing the performance of classical machine learning (ML). One of the major advantages of VQC-based QML over its classical counterpart is the drastic reduction in the number of model parameters, potentially mitigating the overfitting common in classical ML. Moreover, it has been shown that under certain conditions, QML models may learn faster or achieve higher testing accuracies than their classical counterparts [10,11]. A modern QML architecture typically includes a classical and a quantum part. Famous examples in this hybrid genre include the quantum approximate optimization algorithm [12] and quantum circuit learning [13], where the VQC plays a crucial role as the quantum component with the circuit parameters updated via a classical computer. Various architectures and geometries of VQC have been suggested for tasks ranging from binary classification [11,13-15] to reinforcement learning [16-18].
One of the key challenges in the NISQ era is that available quantum hardware has limited quantum volume and is only capable of executing quantum operations with small circuit depth. This means that most of the datasets commonly used for classical ML tasks are too large for NISQ devices. To process data with an input dimension exceeding the number of available qubits, it is necessary to apply dimension reduction techniques to first compress the input data. For example, in [19], a pre-trained classical deep convolutional neural network is used to compress high-resolution images into a low-dimensional representation. However, since the pre-trained model there has a huge number of parameters, it is not clear how much the quantum circuit contributes to the whole workload.
On the other hand, a major challenge in building a QML model is how to encode high-dimensional classical data into a quantum circuit efficiently. With the limitations imposed by NISQ devices in mind, the encoding process should be designed to consume as few gate operations as possible. Amplitude encoding is one of the encoding methods that can provide a significant advantage in terms of the number of qubits required to handle the input data. For an N-dimensional vector, amplitude encoding requires only $\log_2 N$ qubits; however, the quantum circuit depth needed to prepare such an encoded state exceeds the current limits of NISQ devices. Other approaches such as single-qubit rotations require only a shallow circuit, but it is unclear how to employ such encoding schemes to load high-dimensional data into a quantum circuit. This can potentially be mitigated by preprocessing the input data with classical methods to perform dimension reduction. Principal component analysis (PCA) is a simple dimension reduction method and has been widely used in QML research; yet it lacks the representation power to retain enough information. More powerful and expressive models such as neural networks are not commonly utilized due to the requirement of pre-training and the significant number of parameters involved. Therefore, it is necessary to devise a data compression scheme which can be naturally integrated with the VQC.
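To make the qubit-count trade-off concrete, the short sketch below compares the two encodings for a flattened 28 × 28 image. The function names are hypothetical, and the rotation-encoding count assumes one feature per qubit, which is an illustrative simplification rather than a statement about any particular circuit.

```python
def qubits_for_amplitude_encoding(n_features: int) -> int:
    """Smallest number of qubits whose 2**n amplitudes can hold n_features values."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2 and avoids float rounding.
    return max(1, (n_features - 1).bit_length())


def qubits_for_rotation_encoding(n_features: int) -> int:
    """Assumption: each feature sets the rotation angle of its own qubit."""
    return n_features


n_pixels = 28 * 28  # flattened MNIST or Fashion-MNIST image
print("amplitude encoding:", qubits_for_amplitude_encoding(n_pixels), "qubits")
print("rotation encoding: ", qubits_for_rotation_encoding(n_pixels), "qubits")
```

The gap between the two counts is what motivates compressing the input classically before a shallow rotation-based encoding is applied.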
In this paper, we propose a hybrid framework where a matrix product state (MPS) [20,21], the simplest form of tensor networks (TN) [22], is used as a feature extractor to produce a low-dimensional feature vector. This information is subsequently fed into a VQC for classification. Unlike other QML schemes where the classical neural network has to be pre-trained, our framework is trained as a whole. This end-to-end training means the quantum-classical boundary can be adjusted based on the available quantum resource. Furthermore, since an MPS can be realized precisely by a quantum circuit [23], it is possible to replace the classical component with a quantum circuit, making the scheme highly adaptable. Our scheme has been shown to be superior in the binary classification task for the MNIST dataset [24]. Here, we apply the scheme to more difficult tasks such as the ternary classification of MNIST and the classification tasks of Fashion-MNIST.
The rest of the paper is organized as follows. Section 2 gives a brief introduction to tensor networks and their application in classical ML. Section 3 describes the VQC used in this study. Section 4 introduces the hybrid TN-VQC architecture. The performance of the model is shown in section 5. Finally we conclude in section 6.

Tensor network
Tensor networks are efficient representations of data residing in high-dimensional space. Originally developed in the context of condensed matter physics, TNs have gained attention in the deep learning community, for both theoretical understanding and computational efficiency [25]. They have provided new inspiration for machine learning algorithms and shown encouraging success in both discriminative [26-30] and generative learning tasks [31-33]. In addition, the quantum entanglement inherent in the formulation of tensor networks points to a new direction in understanding the mechanism of deep neural networks and may provide a better way to design new network architectures [27,34,35].
It is common to use graphical notation to express tensor networks. A tensor is represented as a closed shape, typically a circle, with emanating lines representing tensor indices (figure 1). A line joining two tensors indicates that the corresponding index is contracted, as in the Einstein convention where repeated indices are summed over. The MPS, also known as the tensor train, has been widely used in physics to study low-dimensional quantum systems. In an MPS, tensors are contracted through the 'virtual' indices (α's in figure 1(d)). The dimension of these virtual indices is called the bond dimension and is denoted by χ. In the MPS representation of a quantum wave function, the bond dimension indicates the amount of quantum entanglement the MPS can represent across the bond. In the context of ML, this corresponds to the representation power of the MPS.
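As an illustration of contraction over virtual indices, the following NumPy sketch builds a small random MPS with open boundaries and contracts all bonds to recover the full tensor. The helper names, the random initialization and the sizes (10 sites, physical dimension 2, χ = 2) are assumptions chosen only for this example.

```python
import numpy as np


def random_mps(n_sites, phys_dim, chi, seed=0):
    """Random open-boundary MPS: one (left bond, physical, right bond) tensor per site."""
    rng = np.random.default_rng(seed)
    tensors = []
    for i in range(n_sites):
        left = 1 if i == 0 else chi              # boundary bonds have dimension 1
        right = 1 if i == n_sites - 1 else chi
        tensors.append(rng.normal(size=(left, phys_dim, right)))
    return tensors


def contract_mps(tensors):
    """Sum over every virtual index and return the full (phys_dim,) * n_sites tensor."""
    result = tensors[0]
    for t in tensors[1:]:
        # Join the right bond of the partial result with the left bond of the next site.
        result = np.tensordot(result, t, axes=([-1], [0]))
    return result.squeeze(axis=(0, -1))          # drop the two trivial boundary bonds


mps = random_mps(n_sites=10, phys_dim=2, chi=2)
full = contract_mps(mps)
n_mps_params = sum(t.size for t in mps)
print("full tensor shape:", full.shape)
print("MPS parameters:", n_mps_params, "vs full tensor entries:", full.size)
```

The parameter count of the MPS grows linearly with the number of sites and quadratically with χ, whereas the full tensor grows exponentially with the number of sites.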
In the realm of machine learning, the MPS has also found numerous applications. For example, the MPS is used for feature extraction of multidimensional data modeled as high-order tensors in [36]. In [37], it has been used to compress the weight matrices in deep neural networks to perform classification tasks. Stoudenmire and Schwab [26] and Efthymiou et al [29] use the MPS directly to perform supervised learning tasks for the classification of the MNIST and Fashion-MNIST datasets. The MPS has also been used for generative modeling [31], probabilistic modeling [38] and sequence modeling [39]. Finally, it has been shown that the restricted Boltzmann machine, an important building block of deep learning, is equivalent to an MPS [40].
In addition to the MPS, other examples of TNs with distinct entanglement structures exist, such as the tree tensor network (TTN), the multi-scale entanglement renormalization ansatz (MERA) and the projected entangled pair state (PEPS). The successful application of a specific TN can also give insights into the hidden correlations in the data. The quantumness inherent in the TN gives it a great advantage over other architectures in the application of QML. In particular, since each TN can be mapped to a quantum circuit, although the TN is treated classically in the current scheme, it is possible to replace the whole or part of the TN by an equivalent quantum circuit when a bigger quantum volume is available. This gives our scheme the flexibility to move the quantum-classical boundary based on the available resources.
Figure 2. MPS as a feature extractor. Data is encoded into a product state (red nodes), which is contracted with an MPS (blue nodes), and a class label or output is generated. The input data is first transformed via the feature map shown in equation (3) and then loaded into the MPS according to equation (2).
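The contraction of a product state with an MPS can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the local feature map x_i → [1 − x_i, x_i] is an assumed form, the placement of the output index on a single site and the random initialization are illustrative choices, and the function names are hypothetical.

```python
import numpy as np


def feature_map(x):
    """Assumed local map x_i -> [1 - x_i, x_i]; returns an array of shape (N, 2)."""
    x = np.asarray(x, dtype=float)
    return np.stack([1.0 - x, x], axis=-1)


def init_mps(n_sites, chi, out_dim, out_site, seed=0):
    """Hypothetical initializer: random tensors, bond dimension 1 at both ends."""
    rng = np.random.default_rng(seed)
    tensors = []
    for i in range(n_sites):
        left = 1 if i == 0 else chi
        right = 1 if i == n_sites - 1 else chi
        # The site at out_site carries one extra index of size out_dim.
        shape = (left, 2, out_dim, right) if i == out_site else (left, 2, right)
        tensors.append(rng.normal(scale=0.5, size=shape))
    return tensors


def mps_features(x, tensors, out_site):
    """Contract the product state of x with the MPS; returns a vector of length out_dim."""
    phi = feature_map(x)
    env = np.ones(1)                                   # trivial left boundary
    for i, A in enumerate(tensors):
        if i == out_site:
            site = np.einsum("s,lsor->lor", phi[i], A)   # absorb the physical index
            env = np.einsum("l,lor->or", env, site)      # env now carries the output index
        else:
            site = np.einsum("s,lsr->lr", phi[i], A)
            env = env @ site                             # sweep one bond to the right
    return env[:, 0]                                     # right boundary bond has dimension 1


x = np.linspace(0.0, 1.0, 8)                           # toy 8-pixel input in [0, 1]
tensors = init_mps(n_sites=8, chi=2, out_dim=4, out_site=4)
features = mps_features(x, tensors, out_site=4)
print(features.shape)
```

Here `out_dim` plays the role of the feature-vector length passed on to the quantum circuit, and `chi` is the bond dimension χ. For realistic image sizes the tensors would need a more careful initialization to keep the long product numerically stable.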
We will use the MPS as a feature extractor to compress the input data. Following [26], we approximate a feature extractor by the MPS decomposition as illustrated in figure 2. The input data is then loaded into the MPS by contracting the product state of the input with the MPS tensors (equation (2)).

VQC
VQCs originate from the variational quantum eigensolver [41], a family of quantum algorithms used to compute ground states. This family of algorithms has recently drawn significant attention and numerous efforts have been made to extend their applications [6]. VQCs have been successfully applied to function approximation [10,13], classification [11, 13-15, 19, 42-46], generative modeling [47-51], metric learning [52,53], deep reinforcement learning [16,17,54], sequential learning [10,55], speech recognition [56] and transfer learning [19]. It has been shown that VQCs are more expressive than conventional neural networks [57-60] with respect to the number of parameters or the learning speed. It has been demonstrated that, with a similar number of parameters, VQC-based models outperform classical models on testing accuracies [11], and achieve optimal accuracy in function approximation tasks with fewer training epochs than their classical counterparts [10]. Of particular interest for NISQ applications, it has been shown that such circuits are potentially resilient to noise in quantum hardware [12,61,62], and such robustness has been demonstrated empirically on either noisy simulators or real quantum hardware [16,53]. This strongly suggests that VQC-based architectures are suitable for building ML applications on NISQ devices.
The VQC used in this work consists of three parts (figure 3). The first part is the encoding part, which consists of the Hadamard gate H and the single-qubit rotation gates $R_y(\arctan(x_i))$ and $R_z(\arctan(x_i^2))$, representing rotations about the y-axis and z-axis by the angles $\arctan(x_i)$ and $\arctan(x_i^2)$, respectively. The Hadamard gate H is used to create an unbiased initial state as described in appendix A. Notice that the rotation angles $\arctan(x_i)$ and $\arctan(x_i^2)$ are for state preparation and come directly from the input classical data.
The data encoding part should be designed with respect to the problem of interest and plays a crucial role in the overall architecture [63]. Potential quantum advantage depends heavily on the encoding scheme together with the hardware limitations incorporated in the design.
The second part is the variational part, which consists of CNOT gates used to entangle quantum states from each qubit and $R(\alpha_i, \beta_i, \gamma_i)$, the general single-qubit unitary gate with three parameters $\alpha_i$, $\beta_i$ and $\gamma_i$ to be learned (dashed block in figure 3). These circuit parameters can be regarded as the weights in classical neural networks. In principle, one can repeat this variational block as many times as desired depending on the model complexity required for the machine learning problem of concern. In our experiments for image classification, we find that a total of four blocks is sufficient to produce optimal results. The final part is the measurement part, which outputs the Pauli-Z expectation values via multiple runs of the quantum circuit. The retrieved values (logits) go through classical processing such as softmax to generate the probability of each possible class. The quantum measurement is performed on the first k qubits, where k is the number of classes.
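The measurement and post-processing step can be sketched with the standard library alone. The shot counts below are made up for illustration, the function names are hypothetical, and the bitstring convention (the first qubit is the leftmost character) is an assumption.

```python
import math


def pauli_z_expectations(counts, k):
    """Estimate <Z> on the first k qubits from a dict of bitstring -> shot count."""
    shots = sum(counts.values())
    expvals = []
    for q in range(k):
        # Outcome 0 contributes +1 and outcome 1 contributes -1 to <Z>.
        total = sum((1 if bits[q] == "0" else -1) * n for bits, n in counts.items())
        expvals.append(total / shots)
    return expvals


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


# Made-up counts from 1000 runs of a four-qubit circuit, three classes (k = 3).
counts = {"0000": 600, "1000": 150, "0100": 200, "1110": 50}
logits = pauli_z_expectations(counts, k=3)
probs = softmax(logits)
print("logits:", logits)
print("class probabilities:", probs)
```

Because each expectation value lies in [−1, 1], the softmax probabilities are never sharply peaked unless the logits are rescaled; whether and how to rescale is a design choice outside the scope of this sketch.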
Finally, we note that with a single variational block (n = 1), the VQC can be efficiently simulated classically through the addition of ancilla and post-selection. Therefore, for the VQC to be a quantum model, instead of a quantum-inspired model, the number of variational blocks must be larger than one (n > 1).

Hybrid TN-VQC architecture
Figure 4 shows the architecture of the hybrid TN-VQC model. The input image of N = 28 × 28 = 784 pixels from MNIST or Fashion-MNIST is flattened into a 784-dimensional vector $x = (x_1, x_2, \ldots, x_N)$, and each component is normalized such that $x_i \in [0, 1]$. The vector is mapped to a product state using the feature map of [26] (equation (3)) and further processed by the MPS to generate a compressed representation. The feature vector is then encoded into the quantum circuit using the variational encoding (see appendix A). At the end of the VQC, the quantum measurement is performed to generate the logits for classification. Both the TN and the VQC have tunable parameters, labeled as $\theta_1$ and $\theta_2$ respectively in figure 4, which are optimized via gradient descent methods. Gradients of the quantum circuit parameters are calculated using the parameter-shift method (see appendix B), which avoids the use of finite difference calculation. This method is similar to the computation of gradients in neural networks; therefore, the end-to-end training of this TN-VQC model follows the standard backpropagation method as in the training of deep neural networks, and no pre-trained classical model is needed.
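The parameter-shift idea can be checked on a toy example. The sketch below uses a single-qubit circuit, $R_y(\arctan(x))$ followed by a trainable $R_y(\theta)$ and a Pauli-Z measurement, which is an assumption made for illustration and much simpler than the circuit in figure 3. The rotation convention $R_y(\theta) = e^{-i\theta Y/2}$ and the function names are likewise assumptions.

```python
import math


def expval_z(x, theta):
    """<Z> after R_y(arctan(x)) (encoding) and R_y(theta) (trainable) act on |0>."""
    angle = math.atan(x) + theta            # successive R_y rotations add their angles
    amp0, amp1 = math.cos(angle / 2), math.sin(angle / 2)
    return amp0 ** 2 - amp1 ** 2            # P(0) - P(1)


def parameter_shift_grad(f, x, theta, shift=math.pi / 2):
    """Gradient from two evaluations of the same circuit at shifted parameters."""
    return 0.5 * (f(x, theta + shift) - f(x, theta - shift))


x, theta = 0.7, 0.3
grad = parameter_shift_grad(expval_z, x, theta)
analytic = -math.sin(math.atan(x) + theta)  # since <Z> = cos(arctan(x) + theta)
print(grad, analytic)
```

Unlike a finite-difference estimate, the shift here is large (π/2) and the result is exact for this gate family, so it does not amplify sampling noise by dividing by a small step.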
Since the classical and quantum parts of the model can be trained simultaneously, it allows for more flexibility in terms of implementation on the quantum hardware. When more qubits are available, one simply increases the dimension of the feature vector out of the MPS to match the input of the VQC and retrain the model. On the other hand, the modular architecture also has the advantage that the classical and

Experiments and results
We study the capabilities of the hybrid TN-VQC architecture by performing classification tasks on the standard benchmark datasets MNIST [64] and Fashion-MNIST [65]. Results for the binary classification of MNIST have been presented in [24]. Here, we perform ternary classification for MNIST and both binary and ternary classifications for Fashion-MNIST. As a baseline, we perform the same tasks on a hybrid PCA-VQC model, where the PCA serves as the simple feature extractor and the VQC as the classifier. As a comparison, we also present results using the MPS as a classifier to demonstrate the role of the VQC in the workload. The computational tools we use for the simulation of variational quantum circuits and tensor networks are PyTorch [66], PennyLane [67] and Qulacs [68]. We include four variational blocks (n = 4) in the VQC used in the experiments. Details of the simulations such as the hyperparameters and optimizers for each experiment are summarized in appendix C.

Binary classification
We perform binary classification of the Fashion-MNIST dataset (class 5 vs 7), which is a more difficult task than the binary classification of MNIST performed in [24]. The results from different models are shown in  However, the small bond dimension is enough for the MPS-VQC hybrid model to learn properly and reach a test accuracy of 96.05%. As the number of parameters of the VQC part is far smaller than that of the MPS, this suggests that our VQC possesses greater power in classification and dominates the workload. It is also clear that the MPS, compared to the PCA, serves as a better feature extractor for a VQC discriminator.
To further demonstrate this point, we compare the test accuracy of the PCA-VQC and MPS-VQC models with different numbers of variational blocks in the VQC (figure 4). The results are shown in table 1. For the PCA-VQC model, doubling the number of blocks from two to four leads to a noticeable increase in the test accuracy from 82.10% to 85.35%. On the other hand, for the MPS-VQC model, the increase is marginal, from 95.55% to 96.05%. We note that the MPS-VQC model can already achieve higher accuracy with two variational blocks than the PCA-VQC model because the MPS contains trainable parameters, which are trained together with the VQC, whereas the capacity of PCA is fixed. Even when the number of VQC blocks is increased to four, the PCA-VQC model with four blocks still under-performs the MPS-VQC model with two blocks, again showing that the MPS is a better feature extractor.
We note that in the case of χ = 1 MPS-VQC, the overall effect of the MPS feature extractor is a simple rescaling of the input image; in contrast, for the case of χ > 1, each tensor node in the MPS part is a matrix. In principle, increasing bond dimensions should provide more degrees of freedom. However, we find that χ = 2 does not significantly improve the accuracy for this task, and for χ = 3, the training fails to converge. This suggests that the classification task might be too simple for the proposed architecture. For a more difficult task, as demonstrated in the following, a higher bond dimension will be needed.

Ternary classification
In the ternary classification, we consider both the MNIST (classes 0, 3, 6) and Fashion-MNIST (classes 5, 7, 9) datasets. The results for the MPS-VQC model and the baseline model PCA-VQC are shown in figure 6. Since ternary classification is a more difficult task, a larger bond dimension is required for the MPS part. In terms of performance, all results show that the MPS is superior to the PCA as a feature extractor. With χ = 2, MPS-VQC is able to reach a test accuracy over 98% in MNIST and 92% in Fashion-MNIST. Furthermore, the representation power of an MPS feature extractor is tunable via χ, which is an advantage absent in PCA. In the case of Fashion-MNIST, we observe a 2% increase in test accuracy as the bond dimension increases, indicating a better data compression capability due to the increased representation power of the MPS. We expect that for more complex classification problems, the performance of PCA-VQC will further deteriorate while that of MPS-VQC can be improved by increasing χ.

Conclusion
In this work, we present a hybrid quantum-classical classifier by integrating a quantum-inspired tensor network and a variational quantum circuit. Such a hybrid TN-VQC architecture enables researchers to build QML applications capable of dealing with larger dimensional inputs and potentially to implement these QML models on NISQ devices with a limited number of qubits and shallow circuit depth. We further demonstrate the superiority of this framework by comparing it with the baseline study of a PCA-VQC model on ternary classification tasks on the MNIST and Fashion-MNIST datasets as well as a binary classification task on the Fashion-MNIST dataset. One clear advantage is that the representation power of the trainable MPS feature extractor is tunable with the bond dimension of the tensors. Our results point to the future application of the hybrid TN-VQC model in different quantum machine learning scenarios and its potential implementation on NISQ devices. The extension of this architecture to more complicated datasets such as CIFAR-10 should further test the robustness and capability of the model. We note that this requires more computing resources and better optimized simulators.
The core concept in the current design is to leverage classical computing resources (the MPS part) to assist the existing or near-term quantum computers in processing high-dimensional data. We use the cases of binary and ternary classification simply to demonstrate that such an algorithm indeed works. In these cases, we find that four input qubits are enough to yield good results. For more difficult tasks such as 10-class classification, it would be necessary to scale up the VQC to more than ten input qubits. Since the number of CNOT gates scales linearly with the number of qubits, the circuit depth would also increase accordingly. Simulating such a large non-Clifford circuit is not classically efficient. In contrast, a quantum device could readily handle such a calculation. In short, our hybrid algorithm is designed for quantum devices to perform multi-class classification on high-dimensional data that is classically hard to simulate, and the number of input qubits depends on the number of classes to be discriminated, not on the number of pixel inputs. The proposed hybrid model can be integrated with various kinds of TNs or VQCs, given suitable encoding methods. For example, one can replace the MPS by other TNs with a different entanglement structure such as TTN, MERA and PEPS, whose potential in the supervised learning context has been demonstrated [30,69,70]. They may also serve as good feature extractors for datasets that contain special structure and correlations. Another way to build a more sophisticated feature extractor is to stack multiple TN layers. It has been shown that stacking more layers in classical deep neural networks can increase the model performance [71,72]. However, attempts to build a deep convolutional tensor network [73] do not show the same improvement generally observed in classical deep neural networks.
We can also replace the simple VQC in our architecture with novel VQC architectures such as quantum convolutional neural networks [11,74-80]. How to apply these ideas to improve the performance of the current model is worth further investigation.

Data availability statement
The data that supports the findings of this study is available from the corresponding author upon reasonable request.

Appendix A. Encoding into quantum states
In our hybrid framework, the outputs from the classical parts need to be encoded such that they can be used by the quantum circuit. A general N-qubit quantum state can be represented as
$$|\psi\rangle = \sum_{(q_1, q_2, \ldots, q_N) \in \{0,1\}^N} c_{q_1, \ldots, q_N} |q_1\rangle \otimes |q_2\rangle \otimes |q_3\rangle \otimes \cdots \otimes |q_N\rangle,$$
where $c_{q_1, \ldots, q_N}$ are complex numbers, the amplitudes of each basis state, and $q_i \in \{0, 1\}$. The squared magnitude of the amplitude $c_{q_1, \ldots, q_N}$ is the probability of obtaining the measurement result $|q_1\rangle \otimes |q_2\rangle \otimes |q_3\rangle \otimes \cdots \otimes |q_N\rangle$, and the total probability should sum to 1, i.e.
$$\sum_{(q_1, q_2, \ldots, q_N) \in \{0,1\}^N} |c_{q_1, \ldots, q_N}|^2 = 1.$$
In this work, we choose the variational encoding method to encode our classical data into the quantum states. The initial quantum state $|0\rangle \otimes \cdots \otimes |0\rangle$ first undergoes the $H \otimes \cdots \otimes H$ operation to create the unbiased state $|+\rangle \otimes \cdots \otimes |+\rangle$, where H is the Hadamard gate. For an n-qubit system, the corresponding unbiased initial state is
$$\frac{1}{\sqrt{2^n}} \left( |0\rangle \otimes \cdots \otimes |0\rangle + \cdots + |1\rangle \otimes \cdots \otimes |1\rangle \right).$$
This initial quantum state then goes through the encoding part, which consists of $R_y$ and $R_z$ rotations. These rotation operations are parameterized by the input vector $\vec{x} = (x_1, x_2, \ldots, x_n)$. On the ith qubit with i = 1, …, n, $R_y$ rotates the state by an angle of $\arctan(x_i)$ and $R_z$ by $\arctan(x_i^2)$. The encoded state is then processed by the variational quantum circuit with optimizable parameters, as shown in the dashed box in figure 3.
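The encoding described in this appendix can be reproduced numerically with a small state-vector sketch. It assumes the common conventions $R_y(\theta) = e^{-i\theta Y/2}$ and $R_z(\phi) = e^{-i\phi Z/2}$, applies $R_y$ before $R_z$ on each qubit, and orders the amplitudes with the first qubit as the most significant bit; these conventions and the function names are assumptions of this example.

```python
import cmath
import math


def encode_qubit(x):
    """Amplitudes of R_z(arctan(x^2)) R_y(arctan(x)) H |0> for one qubit."""
    theta, phi = math.atan(x), math.atan(x ** 2)
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    a0 = a1 = 1 / math.sqrt(2)                    # H|0> = |+>
    b0, b1 = c * a0 - s * a1, s * a0 + c * a1     # apply R_y(theta)
    # R_z(phi) multiplies |0> by exp(-i phi/2) and |1> by exp(+i phi/2).
    return (cmath.exp(-1j * phi / 2) * b0, cmath.exp(1j * phi / 2) * b1)


def encode(features):
    """Product state over all qubits as a flat list of 2**n complex amplitudes."""
    state = [1 + 0j]
    for x in features:
        q = encode_qubit(x)
        # Tensor product: the newly added qubit becomes the least significant bit.
        state = [amp * q_amp for amp in state for q_amp in q]
    return state


state = encode([0.1, 0.5, 0.9, 0.3])              # four features -> four qubits
norm = sum(abs(a) ** 2 for a in state)
print(len(state), norm)
```

Since the encoding uses only single-qubit gates, the encoded state is a product state; entanglement enters only through the CNOT gates of the variational blocks.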

Appendix B. Calculation of gradients of quantum functions
Here the models are trained via the gradient-descent methods widely used in training deep neural networks. To calculate the gradients with respect to the parameters of quantum circuits, we employ the parameter-shift method [13,67,81]. Suppose we can compute the expectation value of an observable $\hat{P}$ as the quantum function
$$f(x; \theta_i) = \langle 0 | U_0^\dagger(x) U_i^\dagger(\theta_i) \hat{P} U_i(\theta_i) U_0(x) | 0 \rangle,$$
where x is the classical input vector (e.g. the output values from the PCA or MPS parts), $U_0(x)$ is the quantum encoding routine that prepares the classical value x as a quantum state, i is the circuit parameter index for which the gradient is to be evaluated, and $U_i(\theta_i)$ represents the single-qubit rotation generated by the Pauli operators X, Y, Z. It can be shown [13] that the gradient of this quantum function f with respect to the parameter $\theta_i$ is