Parallel Hybrid Networks: an interplay between quantum and classical neural networks

Quantum neural networks represent a new machine learning paradigm that has recently attracted much attention due to its potential. Under certain conditions, these models approximate the distribution of their dataset with a truncated Fourier series. The trigonometric nature of this fit could result in angle-embedded quantum neural networks struggling to fit the non-harmonic features in a given dataset. Moreover, the interpretability of neural networks remains a challenge. In this work, we introduce a new, interpretable class of hybrid quantum neural networks that pass the inputs of the dataset in parallel to 1) a classical multi-layered perceptron and 2) a variational quantum circuit, and then linearly combine the outputs of the two. We observe that the quantum neural network creates a smooth sinusoidal foundation on the training set, and then the classical perceptrons fill the non-harmonic gaps in the landscape. We demonstrate this claim on two synthetic datasets sampled from periodic distributions with added protrusions as noise. The training results indicate that the parallel hybrid network architecture could improve the solution optimality on periodic datasets with additional noise.


I. INTRODUCTION
Machine learning and quantum computing have become attractive research areas in recent years. The quest for an efficient quantum neural network (QNN) has dominated the intersection of these two technologies. Many suggestions have been made for the potential inner workings of a classically intractable quantum machine learning model [1-4], but theoretical and hardware limitations could make them challenging to implement. The noise-free barren plateau problem [5] and the curse of dimensionality [6] are examples of the theoretical challenges with QNNs. At the same time, the hardware limitations stem from the current industry limits on the accuracy and the number of qubits [7,8]. Therefore, contemporary practical use of quantum technologies in machine learning should come from complementary quantum-classical architectures, called hybrid quantum neural networks (HQNN), that employ relatively small, realisable quantum circuits and classical multi-layered perceptrons (MLP), where the two work in tandem. The works in [9-14] explored the applicability and performance of sequential HQNNs, where MLPs and QNNs are connected in series, passing the information from one network to another. Sequential HQNNs could introduce information bottlenecks, which could limit the expressivity of the network. This work explores the theoretical basis of parallel HQNNs, where variational quantum circuits (VQC) and MLPs process information in parallel. The approach is based on universality theorems from two sources: 1) MLPs can produce non-harmonic functions [15] and 2) QNNs fit smooth truncated Fourier series on the training data [16]. This work was inspired by the Fourier neural operator introduced in Ref. [17].
In Sec II, we review the theoretical foundations of MLPs and VQCs, and in Sec III, we introduce the design and experimental results of the parallel hybrid network (PHN). In Sec III B, we address the potential problem of component primacy in training PHNs, where either the VQC or the MLP could dominate the training, and propose a remedy to it. Finally, in Sec IV, we summarise our findings and discuss future directions.

II. THEORETICAL FOUNDATION
In this study, we concentrate on solving a supervised regression problem using a dataset $(x_i, y_i)$, where $x_i \in X$ is a feature vector and $y_i \in Y$ is the label. We aim to discover a function $f(x)$ that can approximate the labels $y$ of out-of-sample features. To achieve this, we create a machine learning model with parameters $\theta$ that realises the function $f_\theta(x)$. We adjust these parameters according to the training sample to maximise the probability of obtaining the correct label for a given feature. The general functionality $f$ is a machine learning architecture, while a specific realisation of its parameters, $\theta$, is a machine learning model. In the subsequent sections, we will explore two well-known architectures and then use that theoretical foundation to justify the PHN architecture in Sec III.

A. Multi-layered perceptrons
MLPs constitute a large class of successful machine learning architectures. They are directed graphs whose nodes are ordered in one-dimensional layers, which take input from the previous layer and provide the next layer with their outputs. In the case of fully connected MLPs (FCN), all neurons of each layer feed information to all the neurons in the next layer. Each edge of the graph has an associated multiplicative factor (weight), and each neuron has an associated additive quantity (bias), which together form the parameters of the MLP. The first neural layer is called the input layer, and the last is called the logits. Ref. [18] provides a comprehensive overview of neural networks and their properties.
arXiv:2303.03227v2 [quant-ph] 1 Nov 2023
Ref. [19] showed that MLPs are asymptotically universal approximators whose fit on the training data becomes perfect as the number of neurons in the intermediate neural layers approaches infinity. Moreover, Refs. [15,20] proved this using a novel graphical method that showed a fully connected network with a single intermediate neural layer can approximate any function by fitting a superposition of rectangular waves. To utilise this graph as a machine learning architecture, we could encode the features of a data point taken from the sample, $x$, onto the input layer and then propagate their values through the graph by multiplying them by the weights of the architecture, $w$, and adding the biases, $b$:

$$h_i = \sigma(w_{ij} x_j + b_i),$$

where $\sigma$ is known as the activation function, $x_j$ are the features of the data point, $h_i$ indicates the values of the consecutive neural layer, and Einstein's summation notation is implied. The propagation process can be repeated through the entire graph until we arrive at the terminal nodes, which hold the prediction of the MLP.
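The propagation rule described above (multiply by the weights, add the biases, apply the activation function) can be illustrated with a minimal sketch. The 1-3-1 layer sizes and the hand-picked parameter values are hypothetical and chosen only to make the arithmetic easy to follow; the ReLU and sigmoid activations match those reported for the hidden and output layers in the appendix on architectures.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, w, b, activation):
    """One propagation step: h_i = activation(sum_j w[i][j] * x[j] + b[i])."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(w, b)
    ]


# Hypothetical parameters of a 1-3-1 network (illustration only).
w1 = [[1.0], [-1.0], [0.5]]   # hidden-layer weights, shape 3 x 1
b1 = [0.0, 0.0, -1.0]         # hidden-layer biases
w2 = [[1.0, 1.0, 2.0]]        # output-layer weights, shape 1 x 3
b2 = [0.0]                    # output-layer bias


def mlp(x):
    """Propagate a feature vector through the hidden and output layers."""
    hidden = dense(x, w1, b1, relu)
    return dense(hidden, w2, b2, sigmoid)


prediction = mlp([2.0])
```

For the input 2.0, the hidden layer evaluates to (2, 0, 0), so the prediction is the sigmoid of 2.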

B. Variational quantum circuits
Variational quantum circuits (VQC) employ variational group rotations to create a machine learning architecture on a quantum computer [4, 21-24]. To construct a VQC, one could start by creating a quantum node of several qubits in the ground state. Then, a series of variational and fixed quantum gates can be applied to the circuit. The variational gates could include the Pauli rotation gates, which require single-qubit time evolution Hamiltonians of the respective Pauli gate. The fixed quantum gates might consist of the controlled-NOT (CNOT) and the Hadamard (H) gates. We split the variational gates into embedding gates, which encode the features, $x$, and trainable gates, which act as the model parameters, $\theta$. At the end of the circuit, we measure the qubits in a specified basis, such as the Pauli bases, and obtain either a 0 or a 1. After many repetitions of the circuit, we can estimate the probabilities of obtaining a 0 and a 1, and by taking the average, we can obtain the expectation value of the circuit. By the Born rule [25], we can find this probability by taking the expectation of the measurement matrix, $M$, and then using it as the output of our model:

$$f_\theta(x) = \langle \psi(x, \theta) | M | \psi(x, \theta) \rangle,$$

where $|\psi(x, \theta)\rangle$ denotes the state of the quantum circuit before the measurement. We can improve this approximation to the labels by optimising the parameters of the VQC, $\theta$. Ref. [16] proved that VQCs are also universal approximators, and that they work by fitting a truncated Fourier series over the samples:

$$f_\theta(x) = \sum_{n=-L}^{L} c_n(\theta) e^{i n x},$$

where $c_n(\theta)$ are the Fourier coefficients and $L$ is the highest degree Fourier term expressible by the VQC.
In the functioning of VQCs, the interaction between the variational and fixed quantum gates forms a critical aspect. The variational gates navigate the quantum system through a sequence of transformations, manipulating quantum states based on the data $x$ and trainable parameters $\theta$. On the other hand, fixed quantum gates, such as CNOT and Hadamard gates, exhibit predictable behaviours and serve to entangle and manipulate qubits. While the fixed gates maintain system structure, the variational gates enhance adaptability, thereby enabling the circuit to learn and adapt to new data.
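To make the harmonic nature of the VQC output concrete, the following is a minimal sketch of a single-qubit circuit simulated with plain Python complex arithmetic. The gate layout (trainable RY rotations alternating with RX encodings of the feature) is an assumption made for illustration and is not the circuit used in this work. With all trainable angles set to zero, one encoding layer yields an expectation value of cos(x) and two encoding layers yield cos(2x), so each additional encoding raises the highest accessible frequency.

```python
import math


def rx(angle):
    """Single-qubit rotation about the X axis."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Multiply a 2x2 gate into a two-component state vector."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def expval_z(state):
    """Expectation value of Pauli-Z: P(0) - P(1)."""
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


def vqc(x, thetas):
    """Hypothetical layout: RY(theta_0), then [RX(x), RY(theta_k)] per remaining theta."""
    state = [1.0 + 0j, 0.0 + 0j]           # qubit starts in |0>
    state = apply(ry(thetas[0]), state)     # first trainable layer
    for theta in thetas[1:]:
        state = apply(rx(x), state)         # encoding layer re-uploads the feature
        state = apply(ry(theta), state)     # trainable layer
    return expval_z(state)


one_layer = vqc(0.7, [0.0, 0.0])        # equals cos(0.7)
two_layers = vqc(0.7, [0.0, 0.0, 0.0])  # equals cos(1.4)
```

Because the feature enters only through rotation angles, the output is periodic in x with period 2π for any choice of the trainable angles.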

III. RESULTS - PARALLEL HYBRID NETWORKS
We split the HQNN hybrid interfaces into two categories: 1) sequential, where the classical and quantum parts feed directly into each other, and 2) parallel, where a classical multi-layered perceptron and a variational quantum circuit process the same information in parallel. In this section, we take an in-depth look into HQNNs of the latter type and the functions they represent; Appendix B provides an empirical comparison between the two categories. We shall refer to these networks as parallel hybrid networks (PHN). Fig. 1 shows the general architecture of PHNs. The combination is a weighted linear addition with trainable weights. These weights determine the contribution of each network to the final output. The specific VQC used here is a generalised data re-uploading VQC, where $K$ qubits are initialised in the state $|0\rangle^{\otimes K}$. Then, a series of variational and encoding layers are applied in alternation. The encoding layers, $S$, take the input features, $\{x_1, \cdots, x_N\}$, and encode them in a unitary transformation which is then applied to the state of the qubits. The variational layers, $U$, are unitaries that encapsulate the VQC model parameters as an operator that acts on the quantum state of the network. Finally, the measurements are where the quantum information collapses into $M$ classical outputs, which can be obtained by taking the expectation value of the circuit with respect to the measurement observable. Note the difference between $M$, the number of classical outputs of the VQC, and $K$, the number of qubits: they are not necessarily the same, as often we are only required to measure some of the qubits.

FIG. 1: The general architecture of the PHN. The PHN takes an input vector of features and passes them to an angle-embedded VQC (a VQC that uses single-qubit, Pauli gate embedding of features without applying any non-linear kernel on the features) of the appropriate architecture in parallel to a multi-layered perceptron with a single hidden layer of the appropriate size. The outputs of the VQC are then combined linearly with the outputs of the MLP to produce a final output vector.
In parallel, the fully connected MLP also takes in the $N$ features and passes them to a single layer of hidden neurons of size $F$ by multiplying the feature vector by a weight matrix of size $N \times F$. Then, biases are added to these values and the result is passed through an activation function. An activation function is necessary for adding non-linearity to an otherwise linear system. Then, the neurons are propagated to $M$ MLP output neurons with their own biases and activation functions, denoted as $\{c_1, \cdots, c_M\}$. The MLP and VQC outputs are then combined, using a two-to-one linear weight layer, to form the PHN outputs, $\{o_1, \cdots, o_M\}$. This final layer combines the first output of the VQC, $q_1$, with the first output of the MLP, $c_1$:

$$o_1 = s_{q,1} q_1 + s_{c,1} c_1,$$

and similarly for all the $M$ outputs, where $(\{s_q\}, \{s_c\})$ are trainable parameters.
In Sec II, we saw that an MLP with a single hidden layer created a non-harmonic functional fit for the dataset and that the VQC created a truncated Fourier series, a harmonic function. Thus, a network combining these results could map the smooth, sinusoidal parts through the VQC and fill the protruding sections via the MLP. This complementary setting has the potential to approximate a function that fits the dataset both in the position space (MLP) and in the conjugate momentum space (VQC). We could compare this duality to the Fourier neural operator in Ref. [17] or the models with benign overfitting in Ref. [26]. The scope of this work includes architectures that use multiple VQCs (MLPs) in parallel, as they can always be combined to form a single VQC (MLP).
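The weighted linear combination of the two branches described above can be sketched as follows for a single feature and a single output. This is an illustrative sketch rather than the implementation used in this work: the VQC branch is replaced by a stand-in that returns cos(x), which is the Z expectation value of a single RX(x) embedding, and the MLP parameters are hypothetical.

```python
import math


def vqc_output(x):
    """Stand-in for the quantum branch: <Z> after RX(x) on |0> is cos(x)."""
    return math.cos(x)


def mlp_output(x, w1, b1, w2, b2):
    """Single-hidden-layer MLP with ReLU hidden units and a sigmoid output."""
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))


def phn_output(x, s_q, s_c, mlp_params):
    """Two-to-one weight layer: o = s_q * q + s_c * c."""
    q = vqc_output(x)
    c = mlp_output(x, *mlp_params)
    return s_q * q + s_c * c


# Hypothetical MLP parameters: two hidden neurons, one output.
params = ([1.0, -1.0], [0.0, 0.0], [1.0, 1.0], 0.0)
combined = phn_output(0.5, 0.7, 0.3, params)
```

Setting s_c to zero recovers the purely harmonic VQC output, and setting s_q to zero recovers the MLP output, which is why the two weights can be read as the contribution of each branch.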

A. Performance
We start with a ground truth consisting of an overall single-frequency sinusoidal function and then introduce high-frequency perturbations to this system. Specifically, the functional form was $f(x) = \sin(x) + 0.05\sin(8x) + 0.03\sin(16x) + 0.01\sin(32x)$, which was scaled to lie between $-1$ and $1$. One hundred equally spaced data samples were taken from this distribution for training. We train a simple PHN, described in detail in Appendix A with the exact structure of the MLP and the VQC, to recreate the ground truth as accurately as possible. We then train the individual constituents of the same PHN architecture to see their performances. Fig. 2 shows the training loss curves, and Fig. 3 shows the best fits that each architecture created for the ground truth.

FIG. 3: The functional fits of each architecture to the ground truth. The VQC, expectedly, produced a sinusoidal curve. The MLP created an overall curve close to a sinusoidal curve but with jagged edges. The PHN, however, produced the best result, predicting the protrusion at the peak.
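The one-dimensional training set described in this section can be reproduced along the following lines. The sampling interval of [-π, π] and the use of min-max scaling are assumptions: the text states only that the function was scaled to the range from -1 to 1 and sampled at 100 equally spaced points.

```python
import math


def ground_truth(x):
    """Single-frequency sinusoid with three high-frequency perturbations."""
    return (
        math.sin(x)
        + 0.05 * math.sin(8 * x)
        + 0.03 * math.sin(16 * x)
        + 0.01 * math.sin(32 * x)
    )


def make_dataset(n=100, lo=-math.pi, hi=math.pi):
    """Sample n equally spaced points on [lo, hi] (assumed domain), labels in [-1, 1]."""
    xs = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    ys = [ground_truth(x) for x in xs]
    y_min, y_max = min(ys), max(ys)
    # Min-max scaling to [-1, 1]; the exact scaling used is an assumption.
    ys = [2.0 * (y - y_min) / (y_max - y_min) - 1.0 for y in ys]
    return xs, ys


xs, ys = make_dataset()
```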
The PHN trains to a lower MSE training loss than its constituents, which suggests that adding the VQC improves the overall expressivity of the MLP. Furthermore, by examining the loss curves, we see that the PHN inherits the same speedy descent as the VQC but also shares many of the features present in the MLP loss curve, such as the spikes or the gradual flattening near the end of training.
We see that the PHN outperforms both individual components, which means that both the VQC and MLP contribute to the training, and neither becomes redundant. In Sec III B, we explore how to measure the contribution of each and how the tuning of hyper-parameters could change this contribution. Moreover, Appendix C explores the fairness of this comparison, Appendix D tests the case where the higher frequency term has the larger amplitude, and Appendix E investigates the generalisation ability of the PHN on this dataset.

B. PHN primacy
A way to understand the relative contributions of each network is by inspecting the weights of the combination phase, $s_q$ and $s_c$, for the VQC and MLP, respectively. In this section, we look at how this contribution can unfold when training the PHN.
When training a PHN, we must be wary that the VQC and MLP train at different rates. We define the primacy of one of the constituent architectures (VQC or MLP) over another as when the last weights preceding the PHN output layer vanish for one of the components. Equivalently, the PHN output becomes independent of the vanished component, which means that the prediction curve is constructed solely by either the MLP or the VQC. A primacy of this type could prevent the PHN from reaching the global minimum, as it is limited to what only one of the components could offer.
The ratio of the final weights, $r = |s_c| / |s_q|$, was used to track intervals of different primacy regimes recorded for different hyper-parameterisations of the PHN. Specifically, we fixed the learning rate of the VQC at 0.01 and then selected the learning rate of the MLP from 54 values, $\alpha_c \in \{1.0 \times 10^{-7}, 2.0 \times 10^{-7}, \cdots, 9.0 \times 10^{-7}, 1.0 \times 10^{-6}, \cdots, 9.0 \times 10^{-2}\}$. We then trained the PHN for 1000 epochs at a fixed initialisation point. Lastly, we recorded the ratios $r$ throughout each training. The bigger the ratio, the more the MLP would contribute compared with the VQC. Notably, even a small contribution could make a critical difference, and primacy occurs only when one of the contributions completely vanishes.
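The learning-rate sweep and the primacy ratio described above are simple to write down. The sketch below generates the 54 MLP learning rates (mantissas 1 to 9 across the six decades from 1e-7 to 1e-2) and computes the ratio of the final weights; the function names are hypothetical, and returning infinity when the VQC weight vanishes is a convention chosen here for illustration.

```python
import math


def mlp_learning_rates():
    """Mantissas 1..9 for each decade 1e-7 .. 1e-2, giving 54 values."""
    return [m * 10.0 ** e for e in range(-7, -1) for m in range(1, 10)]


def primacy_ratio(s_c, s_q):
    """r = |s_c| / |s_q|; larger r means the MLP contributes more."""
    if s_q == 0.0:
        return math.inf  # the VQC contribution has vanished entirely
    return abs(s_c) / abs(s_q)


rates = mlp_learning_rates()
example_ratio = primacy_ratio(0.5, -1.0)  # a ratio of 0.5
```

A ratio of zero corresponds to a vanished MLP weight, and an infinite ratio to a vanished VQC weight; these are the two extreme primacy regimes.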
The dependence of the final loss on the ratio of the final weights for the VQC and MLP, shown in Fig. 2(b), highlights the potential for either component to dominate training in the PHN architecture. The results exhibit an optimal range for the ratio between 0.1 and 1, indicating that a balanced contribution from the VQC and MLP is desirable for achieving the best results. It is also evident that the regime where the ratio approaches 0, in which the MLP contribution vanishes, leads to worse final losses. However, we also observe that adjusting the learning rates of the two components can sometimes improve the loss. Therefore, tuning the learning rates of the VQC and MLP is crucial to achieve a balanced contribution from both parts and to prevent either component from dominating the training.

C. Scalability and generalisation
To show the scalability of the PHN, in this section, we try a 2-dimensional problem with the view that this can be scaled to an arbitrarily complex problem with many qubits. To solve the problem in Fig. 4(a), a simple PHN, described in Appendix A, was employed. The ground truth was created from a function with a coarse harmonic structure (its first four terms) as well as high-frequency noise (its last three terms). Note that this function (similarly to the 1D case) was chosen entirely at random and not engineered to showcase the PHN in a favourable light. One hundred equidistant points were sampled from this ground truth to create a training set. The VQC alone, the MLP alone, and the complete PHN were then each trained on this set for 10,000 epochs. The trained models were then tested on 10,000 equidistant data points to see the generalisation ability of each architecture. Figs. 4(b), (c), and (d) respectively showcase the fit of the VQC, MLP, and PHN, and Fig. 5 shows the evolution of their training loss. The VQC creates a symmetric, sinusoidal pattern, whereas the MLP creates jagged regions to fit the ground truth. However, the PHN can generalise the ground truth by employing both elements and thus creates a closer fit, which could mean that for such datasets, the PHN could provide a higher generalisation power than the MLP or the VQC.

IV. CONCLUSION
Overall, our findings demonstrate the potential of PHN as a powerful tool for quantum machine learning. It is a hybrid architecture that can extract harmonic and non-harmonic features from a dataset. By leveraging its unique architecture, the PHN can learn complex patterns and relationships within the data that might be difficult to capture using traditional machine learning algorithms.
However, it is essential to note that the performance of the PHN is highly dependent on the choice of hyperparameters. The number of layers, neurons in each layer, activation functions, and learning rate are crucial in determining how well the network performs on a given task. Therefore, hyperparameter tuning is a critical step in training a successful PHN. One potential direction for future research is to explore using a custom learning rate scheduler to modify the learning rate during training. A learning rate scheduler can dynamically adjust the learning rate based on the network's performance on the training set, allowing the model to learn more efficiently and converge faster. Implementing a learning rate scheduler may further improve the performance of the PHN on a wide range of tasks.

Appendix A
Figs. 6(a) and 6(b) illustrate the PHN example architectures used to produce the 1D and 2D results, respectively. In both cases, the quantum measurement in $q_{out}$ was made in the Z-basis, and the MLP utilised a single hidden layer with the rectified linear unit (ReLU) and sigmoid activation functions for the hidden and output layers, $c_{out}$, respectively. The MLPs were fully connected, included weights and biases, and had hidden layers of 256 and 128 neurons, respectively. The outputs of the MLP and VQC were linearly combined after being weighted by $s_c$ and $s_q$, respectively. In the 1D case, the MLP had 769 parameters, the VQC had 3, and the final weighting layer had 2.
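The parameter count quoted for the MLP can be checked by counting the weights and biases of a fully connected network with one hidden layer. The helper below is a hypothetical illustration; it assumes one input feature and one output neuron for the 1D case, which is consistent with the 769 parameters reported.

```python
def mlp_parameter_count(n_in, n_hidden, n_out):
    """Weights and biases of a fully connected MLP with a single hidden layer."""
    hidden_params = n_in * n_hidden + n_hidden   # input-to-hidden weights + biases
    output_params = n_hidden * n_out + n_out     # hidden-to-output weights + biases
    return hidden_params + output_params


mlp_1d = mlp_parameter_count(n_in=1, n_hidden=256, n_out=1)  # 769
phn_1d_total = mlp_1d + 3 + 2  # plus 3 VQC parameters and 2 combination weights
```

Applying the same formula to a network with two inputs, 128 hidden neurons, and one output gives 513 parameters; this figure is derived from the formula and is not stated in the text.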
Fig. 6(b) depicts a simple 2-dimensional PHN used to demonstrate the scalability of the PHN. The activation functions employed in the MLP were ReLU and sigmoid for the first and second layers, respectively. The VQC produced a single output, $q_1$, which resulted from measuring the state of the VQC in the $Z \otimes I$ basis, where the identity measurement $I$ is excluded from the diagram. A learning rate of 0.01 was used for the VQC parameters and 0.001 for all others. We also utilised the Adam optimiser and a learning rate scheduler that multiplied all learning rates by $\gamma = 0.99$ every ten epochs.
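The step-wise decay described above has a simple closed form. The sketch below assumes that the first multiplication by γ happens at epoch 10 (epochs counted from zero); the exact epoch at which each decay step is applied is not specified in the text, and the function name is hypothetical.

```python
def scheduled_lr(base_lr, epoch, gamma=0.99, step=10):
    """Learning rate after multiplying by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


vqc_lr_start = scheduled_lr(0.01, 0)      # unchanged during the first ten epochs
vqc_lr_end = scheduled_lr(0.01, 1000)     # after 100 decay steps
mlp_lr_end = scheduled_lr(0.001, 1000)
```

Under this assumption, 1000 epochs correspond to 100 decay steps, so every learning rate ends at roughly 37% of its initial value, and the ratio between the VQC and MLP learning rates stays fixed throughout training.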

Appendix B: Information bottlenecks in sequential hybrid networks
This section focuses on the information bottlenecks in sequential hybrid networks, which are the primary motivation behind the invention of the PHN. According to Ref. [16], an angle-embedded VQC, such as the ones in Fig. 6, produces a truncated Fourier series of the feature set to approximate the labels. Additionally, Ref. [15] showed that MLPs fit the function in the position space using rectangular protrusions. In a sequential hybrid network, the information flow depends on the order of the quantum and classical processes. Therefore, information processing is inherently limited by the processing capacities of either the VQC or the MLP. This represents a bottleneck in the information flow, given that the information output of one process becomes the input for the other.
When the outputs of a VQC, for example, are passed onto an MLP in sequential models, the MLP is limited by the truncated Fourier series of the VQC. This might result in an incomplete or imprecise approximation of the labels, as the MLP's ability to fit the function in the position space may be constrained by the quality of the VQC's output. Similarly, one could reverse the setting, where the output of the MLP is passed onto the VQC. In that case, the VQC might struggle to process the rectangular protrusions provided by the MLP accurately. These limitations in information flow and processing capabilities are what we refer to as the information bottleneck in sequential hybrid networks.
In contrast, parallel hybrid networks (PHN) sidestep these bottlenecks by allowing simultaneous information processing in the quantum and classical domains. Instead of constraining the system by the sequential passage of information, the PHN enables more efficient utilisation of both the quantum and classical capabilities. Consequently, the processing power of the PHN is not restricted by the limitations of a single component but instead is governed by the cumulative capacity of all its parts. Fig. 7 shows the results of sequentially connecting the VQC to the MLP and the MLP to the VQC when trained on the one-dimensional dataset in Sec. III A.

The VQC, with its newfound capability to identify the sin(8x) term, highlights the potency of feature scaling in unlocking the full potential of VQCs. Meanwhile, the PHN also found a superior fit for the lower frequency term, outstripping the performances of both the MLP and the VQC. These results accentuate the proficiency of the PHN in handling functions with dominant high-frequency components, particularly when coupled with feature scaling. Furthermore, they underscore the potential value of adopting a PHN for such cases, with its performance outpacing other networks considered in this study.

Appendix E: Generalisation
A further investigation was carried out to test the generalisation ability of the PHN. The one-dimensional dataset was split in two: the models were trained only on the positive half and tested on their ability to generalise to the unseen, negative half.
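The extrapolation test described above amounts to a split by the sign of the feature. The sketch below is illustrative only: it assumes 100 equally spaced points on [-π, π] and uses sin(x) as a stand-in for the labels, so that a perfectly periodic model and a constant model can be compared on the held-out negative half.

```python
import math


def split_by_sign(xs, ys):
    """Train on the non-negative half of the domain, test on the negative half."""
    train = [(x, y) for x, y in zip(xs, ys) if x >= 0.0]
    test = [(x, y) for x, y in zip(xs, ys) if x < 0.0]
    return train, test


def mse(model, data):
    """Mean squared error of a callable model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Assumed domain and stand-in labels (illustration only).
xs = [-math.pi + 2 * math.pi * k / 99 for k in range(100)]
ys = [math.sin(x) for x in xs]
train, test = split_by_sign(xs, ys)

periodic_test_loss = mse(math.sin, test)        # a periodic model extrapolates
constant_test_loss = mse(lambda x: 0.0, test)   # a flat model does not
```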
The MLP successfully fitted the training data yet failed to predict behaviour outside this range, as visible in Fig. 10. For periodic functions, the VQC generalises by default, owing to the periodic nature of the special unitary operators it employs as linear mappings.
The PHN likewise inherits this extrapolation ability from the VQC, although no improvement in generalisation beyond the capability of the VQC is evident in this case.
The final test losses were 0.008 for the VQC, 0.094 for the PHN, and 0.133 for the MLP.
These findings demonstrate the superior extrapolation capability of the VQC and the PHN, particularly in the context of periodic functions, while highlighting the limitations of the MLP in this regard. Furthermore, they suggest that although the PHN inherits the extrapolation capabilities of the VQC, it does not offer any additional benefit in terms of extrapolation in this particular case.

FIG. 2: (a) The training losses of the individual elements of the PHN when trained separately, as well as the full PHN. (b) The scatter plot of final losses and their respective final ratios after 1000 training epochs. The optimal loss value is achieved at non-zero ratios, where ratios to the side of this value provide sub-optimal losses. Note that this figure only includes the runs with learning rates whose final loss is low enough for comparison.

FIG. 4: (a) shows the ground truth for our 2D problem in the form of a contour map. The predictions shown are for the VQC (b), MLP (c), and PHN (d). We see that the prediction of the VQC is smooth and convex, whereas the MLP creates jagged shapes. Taking advantage of both of these properties, the PHN represents the harmonic functions of the VQC with the added, necessary protrusions.

FIG. 6: (a) The schematic diagram describes the joint work of the VQC with one qubit and a basic MLP architecture.(b) The architecture includes a two-qubit VQC with only a measurement applied to the first qubit and an MLP with 128 neurons in its singular hidden layer.

FIG. 7: The sequential architectures. In the MLP-to-VQC structure, the MLP takes in the dataset features and passes its outputs to the VQC, whose output is then used as the prediction of the sequential model. The VQC-to-MLP model performs this operation in reverse order. In this problem, the PHN clearly has a better fitting ability than the sequential architectures.