Benchmarking simulated and physical quantum processing units using quantum and hybrid algorithms

Powerful hardware services and software libraries are vital tools for quickly and affordably designing, testing, and executing quantum algorithms. A robust large-scale study of how the performance of these platforms scales with the number of qubits is key to providing quantum solutions to challenging industry problems. This work benchmarks the runtime and accuracy for a representative sample of specialized high-performance simulated and physical quantum processing units. Results show the QMware simulator can reduce the runtime for executing a quantum circuit by up to 78% compared to the next fastest option for algorithms with fewer than 27 qubits. The AWS SV1 simulator offers a runtime advantage for larger circuits, up to the maximum 34 qubits available with SV1. Beyond this limit, QMware can execute circuits as large as 40 qubits. Physical quantum devices, such as Rigetti's Aspen M-2, can provide an exponential runtime advantage for circuits with more than 30 qubits. However, the high financial cost of physical quantum processing units presents a serious barrier to practical use. Moreover, only IonQ's Harmony quantum device achieves high fidelity with more than four qubits. This study paves the way to understanding the optimal combination of available software and hardware for executing practical quantum algorithms.


I. INTRODUCTION
Quantum computing is a rapidly growing field of technology with increasingly useful applications across both industry and research. This new paradigm of computing has the potential to solve classically-intractable problems, by exploiting an exponentially-increasing computational space. This allows quantum algorithms to dramatically reduce the runtime for solving computationally resource-intensive problems.
There is a plethora of quantum algorithms, of which parameterized quantum circuits represent the most general form. Quantum neural networks (QNNs) are quantum machine learning (QML) algorithms [1][2][3][4] that leverage powerful techniques developed for classical neural networks to optimize this parameterized structure, and have already been applied to solve a number of industrial problems [5][6][7][8][9][10][11][12]. The complexity and performance of classical neural networks employed to solve data-intensive problems have grown dramatically in the last decade. Although algorithmic efficiency has played a partial role in improving performance, hardware development (including parallelism and increased scale and spending) is the primary driver behind the progress of artificial intelligence [13,14]. Unlike their classical counterparts, QNNs are able to learn a generalized model of a dataset from a substantially smaller training set [15][16][17] and typically have the potential to do so with polynomially or exponentially simpler models [18][19][20]. Thus, they provide a promising opportunity to subvert the scaling problem encountered in classical machine learning [5,[21][22][23][24][25][26][27], which presents a serious challenge for data-intensive problems that are increasingly bottlenecked by hardware limitations [28][29][30].
Nonetheless, even for a small dataset, training QNNs requires on the order of a million circuit evaluations. This is a consequence of the multiplicative number of data points, evaluations required for calculating the gradient [31], and iterations before a solution is reached. This makes them a relatively challenging and resource-intensive use case for quantum processing units (QPUs). Therefore, QNNs require stable, on-demand, and accurate quantum circuit execution. A plethora of different options for executing quantum circuits exist. These are either physical QPUs or classical hardware simulating quantum behaviour. In both cases multiple vendor options and services are available. Establishing the combination of software and hardware that provides the optimum runtime, cost, and accuracy is crucial to the future of democratizing quantum software development.
In this study, different pairings of software development kits (SDKs) and hardware platforms are compared in order to determine the fastest and most cost-efficient route to developing novel quantum algorithms. This benchmark is performed using QNNs, which represent the most general form of a quantum algorithm. The benchmark indicates an advantage in using the QMware basiq simulator for circuits with 2 to 26 qubits, AWS SV1 for 28-34 qubits, and QMware basiq for 36-40 qubits. Additionally, QPUs from four different vendors (IonQ, Oxford Quantum Circuits (OQC), IBM, and Rigetti) were benchmarked for runtime, accuracy, and cost. The results show QPUs could become time-competitive in a practical use case for circuits with 30 qubits or more. However, the current low fidelity attained by many of these systems precludes their application to industrial problems. The Python implementation code for the benchmarks presented in this study is available in Ref. [32].
This investigation is not an exhaustive benchmark of all available systems, and it is worth acknowledging the existence of other state-of-the-art services. This includes hardware quantum simulators such as IBM's simulator statevector and Atos's Quantum Learning Machine [33], as well as software backends such as the IBM Qiskit machine learning suite and the Qulacs package [34,35]. It also includes QPUs such as IBM's Eagle processor [36], Honeywell's System Model H1 [37], and Google's Sycamore processor [38]. The authors also acknowledge other works that involved a similar technique in performing benchmarks on quantum hardware. These include hardware specific performance benchmarks such as Refs. [39,40] as well as metric-specific tests such as Refs. [41][42][43][44][45][46][47].
The work is organized in five sections. Sec. II describes the methodology, including the benchmark algorithms, the hardware and software tested, as well as the results of the runtime benchmark. Sec. III details the cost of executing the proposed quantum circuits on various QPUs and compares their fidelity. Sec. IV discusses the execution of large quantum circuits, with as many as 40 qubits, and the implementation of multi-threading in simulators. Finally, Sec. IV C provides a summary of the findings and outlines key challenges for the quantum computing industry.
II. RUNTIME BENCHMARKING

Quantum simulators are bipartite systems, consisting of a software library and the hardware on which the software is run. Both play a crucial role in the development and execution of a quantum circuit. Although the various software libraries available for quantum simulation are often implemented in a hardware-agnostic way, the internal implementation of the linear algebraic methods and the manner in which quantum logic gates are compiled will have a significant impact on performance. This is true for the execution of any quantum circuit, but particularly relevant for gradient calculations when optimizing a QNN during the training phase, due to the use of gradient-based optimization techniques. In particular, the standard parameter-shift method of calculating the gradient of the circuit output with respect to each of the n trainable parameters increases the number of expectation values evaluated by a factor of 2n. By comparison, the forward pass of a trained QNN requires evaluating just a single expectation value, which can usually be obtained with fewer than one thousand circuit shots. The specification of the simulator hardware also plays an important role in the ability to quickly and efficiently optimize variational quantum algorithms. Moreover, the synergy between software and hardware influences how much computational overhead is required and the ease with which quantum algorithms can be designed, tested, and deployed.
For many quantum computing applications, the open-source PennyLane Python library is extremely popular [48]. It is also the recommended quantum simulation library for the AWS Braket computing service, which provides a performance-optimized version of the PennyLane library [49]. PennyLane offers a variety of qubit devices. The most commonly used is default.qubit, which implements a Python backend using NumPy, TensorFlow, JAX, and PyTorch. More recently the lightning.qubit device was introduced, which implements a high-performance C++ backend [48]. QMware's cloud computing service provides a quantum simulator stack which supports open-source, hardware-agnostic libraries, such as PennyLane, in addition to QMware's bespoke quantum computing Python library basiq [50]. The basiq library also supports a PennyLane plugin, which translates circuits built using the PennyLane SDK into circuits that can be executed using the basiq backend.
Initially, the performance of the QMware HQC4020 simulator is compared to the performance of the AWS Braket ml.m5.24xlarge simulator. In both instances the PennyLane lightning.qubit backend is used in order to evaluate the performance of the underlying hardware systems, referred to as QPL and APL, respectively. The performance of the QMware basiq library is then benchmarked using native basiq implementations of both the QNN and the HQNN. This runtime benchmark is executed on the QMware HQC4020 simulator using 384 vCPUs across all circuit sizes. Results are compared to the high-performance AWS Braket SV1 simulator (ASV), as well as the previous QPL and APL benchmarks. Finally, the runtime performance of these simulator stacks is also compared to runtimes achieved for (H)QNN inference (forward pass) using real QPUs. The QPUs included are IonQ's Harmony, OQC's Lucy, Rigetti's Aspen M-2, and IBM Quantum's Falcon r5.11.
The ml.m5.24xlarge AWS notebook instance provides 96 vCPUs and 384 GiB of random access memory (RAM). By comparison, the QMware HQC4020 simulator has 12 TB of RAM and 384 vCPUs in total, of which a maximum of 48 are utilized throughout the benchmarking tests executed on the QMware service. The exception to this is when benchmarking the PennyLane lightning.qubit device, which implements no parallelization for the parameter-shift method. Hence the results for the QMware PennyLane lightning.qubit (QPL) and AWS PennyLane lightning.qubit (APL) benchmarks represent single-core calculations on both hardware services.
The metric used in the benchmarks is the training time per epoch per training sample, which is measured using the Python time library. The expectation value of each circuit measurement, corresponding to the output prediction of the (H)QNN, is obtained using 1000 circuit shots. Measuring the execution time for multiple circuit shots has the effect of reducing the proportion of circuit initialization time in the quoted runtime for each stack. In practice, quantum circuits usually require hundreds or thousands of circuit shots to obtain an accurate expectation value. Benchmarks are performed for a range of circuit sizes (number of qubits). This is achieved by varying the dimension of the dataset, n_d ∈ [1, 15]; the number of qubits increases linearly with the dimension, n_q = 2n_d (see Sec. II B for details).
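The timing metric can be sketched with the Python standard library alone; `train_epoch` here is a hypothetical callable standing in for one full training epoch, not the benchmark's actual training loop:

```python
import time

def time_per_epoch_per_sample(train_epoch, n_samples, n_repeats=3):
    """Wall-clock training time per epoch, per training sample.

    `train_epoch` is any callable that runs one full epoch over
    `n_samples` training samples.
    """
    times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        train_epoch()
        times.append((time.perf_counter() - start) / n_samples)
    # average over repeats; a real benchmark would also report the spread
    return sum(times) / len(times)

# toy stand-in for an epoch of circuit evaluations
runtime = time_per_epoch_per_sample(lambda: sum(i * i for i in range(10_000)),
                                    n_samples=100)
```

In the study itself, the per-epoch work is the full set of circuit evaluations required by the parameter-shift gradient, so this quantity dominates the cost of training.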
Although the IBM Quantum Cloud development platform allows users to retrieve quantum processor runtimes, the AWS Braket computing platform precludes the possibility of measuring low-level process times. As a result, all times quoted, for both the AWS and QMware benchmarks, include the interaction with the runtime environment and the backend that compiles, initializes, and invokes the quantum circuit. This follows the standard practice in quantum benchmarks and reflects a real-world use case encompassing the full software and hardware stack [51]. For the benchmark involving the QPU devices available via AWS Braket, quoted times also include any potential queue time incurred on the vendor backend. The proportion of the total runtime this constitutes varies according to the number and complexity of tasks submitted by other users of the same QPU (as detailed in the Amazon Braket Developer Guide) and cannot be accurately estimated through the AWS user interface.

B. Benchmark Dataset
The dataset for the benchmarks presented here is an n-dimensional abstracted version of the well-documented two-dimensional sklearn.datasets.make_circles for binary classification tasks [52]. It consists of points sampled from two concentric circular shells, distributed randomly about two nominal mean radii, r_inner < r_outer. The distribution of the data points about the mean radius is described by a normal distribution, and both shells use the same standard deviation. The method for creating the dataset is adapted from that proposed in Ref. [53] for sampling points on a sphere. First, an n_d-dimensional vector ⃗v is created with components v_i sampled randomly from a standard normal distribution with mean µ = 0 and standard deviation σ = 1. The vector is then normalized to length r, ⃗u = r⃗v/|⃗v|, to obtain a point sampled randomly with uniform probability from the surface of an n-sphere. A vector of random noise, ⃗ρ, with components sampled independently from the distribution N(0, σ²), is then added to give ⃗x = ⃗u + ⃗ρ. Data points ⃗x are sampled from two such distributions to create an outer shell with classification label y_i = 0 and an inner shell with classification label y_i = 1. In the present work, r_outer = 1.0 is used to create points in the outer shell and r_inner = 0.2 to create points in the inner shell, with σ = 0.3 for both shells. The dimension of the dataset determines the number of features used as input to the neural network. By linearly increasing the dimension of the dataset, circuits with a varying number of qubits (n_q = 2n_d) can be benchmarked without changing the underlying rubric of the classification problem itself. Examples of the one-, two-, and three-dimensional datasets are shown in Fig. 2.
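The sampling procedure above can be sketched in NumPy. The function names and seeded generator are ours, chosen for illustration; the parameter defaults match the values quoted in the text:

```python
import numpy as np

def make_shell_points(n_points, n_dim, radius, sigma=0.3, rng=None):
    """Sample points near the surface of an n-sphere of the given radius.

    Each point is drawn uniformly from the sphere surface (by normalizing
    a standard-normal vector to the radius) and then perturbed with
    independent Gaussian noise N(0, sigma^2) on each component.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal((n_points, n_dim))               # v_i ~ N(0, 1)
    v *= radius / np.linalg.norm(v, axis=1, keepdims=True)   # scale to radius
    return v + rng.normal(0.0, sigma, size=v.shape)          # add noise rho

def make_circles_nd(n_per_class, n_dim, r_inner=0.2, r_outer=1.0,
                    sigma=0.3, seed=0):
    """Two concentric noisy shells: label 0 = outer shell, 1 = inner shell."""
    rng = np.random.default_rng(seed)
    outer = make_shell_points(n_per_class, n_dim, r_outer, sigma, rng)
    inner = make_shell_points(n_per_class, n_dim, r_inner, sigma, rng)
    X = np.vstack([outer, inner])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y
```

Because only the dimension `n_dim` changes between benchmark runs, the classification task itself stays fixed while the circuit width grows.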

C. Learning Models
The following sections describe the architecture of the hybrid quantum-classical (HQNN) and pure QNNs used in the benchmarking tests. In all cases, the networks are trained using a binary cross-entropy loss function and the Adam optimizer with a learning rate of α = 0.3 [54].

Quantum Neural Network
The QNN used in this benchmark consists of a multi-qubit variational quantum circuit and is based on the model proposed in Ref. [15]. The model employs a sequential R_x R_z two-axis rotation encoding scheme to embed each of the data features in a single-qubit state on the odd-numbered qubits. The feature encoding is followed by an entangling layer of sequential 'nearest-neighbor' CNOT gates across all qubits. A layer of trainable single-axis R_y rotations is then applied to each qubit, followed by a final layer of sequential 'cascading' CNOT gates across all qubits. Finally, the expectation value of the Pauli-Z operator (σ_z) is measured for each of the even-numbered qubits. The mean expectation value across the n_d measurement qubits is interpreted as the probability of the input belonging to the class y_i = 1.
To obtain the gradients of the loss function with respect to each of the parameters, the standard analytical parameter-shift algorithm is used, in which the gradient of an expectation value with respect to the i-th parameter is obtained via measurement of two additional expectation values. This method requires a total of 2n_w + 1 circuit evaluations to obtain the gradient of the loss with respect to n_w circuit parameters and thus scales linearly with the number of trainable parameters. Unlike the more efficient adjoint and backpropagation methods, which are usually available for quantum simulators, parameter-shift is the only currently available algorithm that can be implemented natively on a QPU. It therefore provides an upper bound on the cost of training a QNN using a QPU, and on the runtime of the benchmarked simulators [55,56].
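The parameter-shift rule is easy to verify on a one-parameter toy circuit: for R_y(θ)|0⟩ the expectation is ⟨Z⟩ = cos θ, so the rule should recover the derivative −sin θ exactly. The closed-form `expval` below stands in for a real device execution:

```python
import numpy as np

def expval(theta):
    """<Z> after Ry(theta)|0> -- a closed-form stand-in for a circuit run;
    a real implementation would estimate this from ~1000 shots."""
    return np.cos(theta)

def parameter_shift_grad(f, params, shift=np.pi / 2):
    """Analytic parameter-shift rule: two extra circuit evaluations per
    parameter, hence 2*n_w evaluations (plus one for the loss) overall."""
    params = np.asarray(params, dtype=float)
    grad = np.empty_like(params)
    for i in range(len(params)):
        plus, minus = params.copy(), params.copy()
        plus[i] += shift
        minus[i] -= shift
        grad[i] = (f(plus) - f(minus)) / 2.0
    return grad

# d/dtheta cos(theta) = -sin(theta); the shift rule recovers this exactly
grad = parameter_shift_grad(lambda p: expval(p[0]), [0.3])
```

The loop over parameters is what drives the linear cost scaling quoted above: every trainable weight adds two more circuit executions per gradient step.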

Hybrid Quantum Neural Network
The HQNN is a bipartite model consisting of a QNN and a multi-layer perceptron (MLP) classical neural network, with the expectation value of each measurement qubit in the QNN used as an input feature for the first layer of the MLP. The quantum component of the HQNN follows the same architecture as described in Sec. II C 1, and uses the same parameter-shift method to obtain the gradients of the expectation values. The MLP is built using the PyTorch library and contains three linear layers with sizes [n_d, 40, 1], with ReLU and Sigmoid activation functions applied to the input and hidden layer, respectively [57]. The gradients in the classical component of the HQNN are computed using the standard back-propagation algorithm, via the native implementation in the PyTorch library. For the inference benchmark that utilizes QPUs or the AWS SV1 device, the classical part of the circuit is executed using the AWS ml.m5.24xlarge compute instance in all cases except the IBM Quantum Falcon r5.11, for which the classical part was executed on the QMware HQC4020.

D. Results
This section presents the results of the benchmark methodology outlined in Sec. II A. The results of the benchmark are illustrated in Fig. 2. A table of these results is also given in App. B, Tab. III and Tab. IV. The measured runtime values (runtime per epoch, per training sample) are averaged across 100 repeats in order to obtain a mean and standard deviation for circuits up to 24 qubits in size for the training benchmark, and 26 qubits for the inference benchmark. A similar approach to averaging is impractical for larger circuit sizes, owing to the significantly longer total runtime. Thus, only a single value with no standard deviation is quoted for larger circuits. In all cases, Chauvenet's criterion is applied in order to filter anomalous runtime measurements that arise due to extraneous hardware processes [58].

Training
In general, the QMware HQC4020 and the AWS ml.m5.24xlarge hardware achieve similar performance using the PennyLane Lightning backend (QPL and APL, respectively; see Tab. II for a list of abbreviations). When benchmarking with the QNN, QPL performs similarly to APL for circuits with fewer than 27 qubits, with a relative runtime of t_QPL/t_APL = 0.90(17). When benchmarking with the HQNN, the average relative runtime is t_QPL/t_APL = 0.95(16).
Comparing the QMware HQC4020 hardware using a native basiq implementation (QBN) to the PennyLane Lightning implementation (QPL) shows a clear advantage for the native basiq implementation across all circuit sizes for both QNNs and HQNNs. The QBN implementation achieves an average speedup of t_QBN/t_QPL = 0.56(32) over QPL in the QNN benchmark, and t_QBN/t_QPL = 0.54(30) for HQNNs. It is also notable that the native QMware implementation is most performant for models using more than 20 qubits, where an average speedup of t_QBN/t_QPL = 0.22(4) is obtained for the QNNs and t_QBN/t_QPL = 0.21(2) for the HQNNs.
The AWS Braket SV1 device is a high-performance managed device designed for simulating large quantum circuits, up to a maximum of 34 qubits. Correspondingly, it outperforms QBN for circuits with 28 or 30 qubits, with an average relative runtime across QNNs and HQNNs of t_QBN/t_ASV = 1.35(11). Conversely, it performs very poorly for small and medium circuits, with a relative runtime approximately two to four orders of magnitude slower than the other available simulator stacks for circuits with fewer than 20 qubits. Across all circuit sizes smaller than 28 qubits, the QBN implementation outperforms ASV for both QNNs and HQNNs. The average relative runtime is t_QBN/t_ASV = 0.00041(25) for circuits with 14 qubits or fewer, t_QBN/t_ASV = 0.0049(35) for circuits with 16 to 20 qubits, and t_QBN/t_ASV = 0.36(29) for circuits with 22 to 26 qubits.

Inference
When training a QNN, the number of circuit evaluations increases linearly with the number of trainable parameters. In contrast, the inference (forward pass) requires only a single evaluation of the output expectation values. As a result, the total runtime is reduced significantly. The results of the inference runtime benchmark broadly follow the same trends as for the training runtime. One exception is the QPL and APL relative runtimes, which are identical within one standard deviation for circuits smaller than 16 qubits. The average relative runtime is t_QPL/t_APL = 1.04(10) for QNNs and t_QPL/t_APL = 1.03(22) for HQNNs. For larger circuits, APL performs marginally better than QPL, with t_QPL/t_APL = 1.11(10) and t_QPL/t_APL = 1.12(9) for QNNs and HQNNs, respectively.
Crucially, the inference tests include a benchmark of physical QPUs, as listed in Tab. I. For low numbers of qubits, the QPU runtimes are orders of magnitude longer than their simulator counterparts. This is likely due to the additional overhead incurred in compiling and initializing a QPU task, as well as the queue time incurred on the vendor side, where multiple AWS users may be accessing the QPU device simultaneously. On the other hand, the runtime for QPUs increases linearly with the number of qubits. For large numbers of qubits, the exponentially increasing simulator runtime exceeds the fixed time cost associated with QPUs. This results in a 'threshold' circuit size above which it becomes exponentially faster to execute a QNN using a QPU device.
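This crossover can be illustrated with a toy runtime model. The constants below are hypothetical, chosen only to show the threshold behaviour; they are not fitted to the measured data:

```python
def crossover_qubits(a, b, c, n_max=60):
    """Smallest qubit count n at which an exponential simulator runtime
    a * 2**n exceeds a linear QPU runtime b + c * n, or None if no
    crossover occurs at or below n_max."""
    for n in range(1, n_max + 1):
        if a * 2 ** n > b + c * n:
            return n
    return None

# hypothetical constants: a fast simulator vs a QPU with 30 s of fixed
# overhead (compilation, initialization, queueing) plus 1 s per qubit
n_star = crossover_qubits(a=1e-6, b=30.0, c=1.0)   # -> 26 for these constants
```

Whatever the constants, an exponential curve always overtakes a linear one eventually; the benchmark's contribution is locating that threshold empirically, at around 30 qubits.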
The limited number of qubits available for many QPUs means that, in most cases, this threshold is not attainable. Of the four QPUs in the present study, only Rigetti's Aspen M-2 has a sufficient number of qubits to be time-competitive with the simulator stacks tested. The threshold occurs at 30 qubits for QBN and ASV, where the relative runtime is t_ASV/t_M-2 = 1.03 and t_QBN/t_M-2 = 1.53 for the QNN, and t_ASV/t_M-2 = 1.02 and t_QBN/t_M-2 = 1.51 for the HQNN. For smaller circuit sizes, OQC's Lucy produces runtimes that are faster by a factor of t_Lucy/t_M-2 = 0.22(3) compared to Rigetti's Aspen M-2. In contrast, the runtime for IonQ's Harmony QPU is on average a factor t_Harmony/t_M-2 = 13.5(52) slower than Aspen M-2. The IBMQ Falcon r5.11 QPU performs similarly to the Rigetti Aspen M-2, with an average relative runtime of t_Falcon/t_M-2 = 0.87(18).
Notably, vendor management of the QPUs results in a runtime that varies significantly from job to job. The percentage variance in runtime measured for inference on a QPU is generally higher than for simulator execution. Specifically, the average percentage standard deviation across the 10 repeat measurements is on the order of 125%, 16%, 21%, and 7% for IonQ, Aspen M-2, Lucy, and Falcon r5.11, respectively.

A. Cost Evaluation
As described in Sec. I, a primary consideration in developing and testing QNNs and HQNNs is the financial cost of training a network. In general, the pricing structure of publicly available QPUs is such that the training cost is proportional to the number of distinct quantum circuits that must be evaluated during the training process. When using the parameter-shift method for gradient calculations, the number of distinct quantum circuits is proportional to the number of trainable parameters. Hence, for the QNN and HQNN considered in Sec. II C 1 and Sec. II C 2, the training cost scales linearly with the number of qubits. For an increasing number of epochs, and for an increasing number of training samples, this rapidly results in millions or billions of distinct quantum circuits that must be initialized and evaluated on the QPU.
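The multiplicative scaling can be captured in a small cost model. The per-task and per-shot prices below are illustrative placeholders, not published vendor rates, and the one-trainable-parameter-per-qubit count follows the QNN architecture described earlier:

```python
def training_cost_usd(n_qubits, epochs=100, samples=100, shots=1000,
                      per_task=0.30, per_shot=0.00035):
    """Sketch of a per-task + per-shot pricing model.

    The default prices are placeholders for illustration. The QNN has one
    trainable Ry parameter per qubit, and parameter-shift gradients need
    2*n_w + 1 circuit evaluations per training sample per epoch.
    """
    n_params = n_qubits
    tasks = epochs * samples * (2 * n_params + 1)
    return tasks * (per_task + shots * per_shot)

cost_2q = training_cost_usd(2)     # 50,000 tasks for a 2-qubit QNN
cost_26q = training_cost_usd(26)   # cost grows linearly with qubit count
```

Even at these modest placeholder rates, the task count alone (epochs × samples × gradient evaluations) reaches tens of thousands for a two-qubit model, which is why the measured prices in Fig. 3(b) climb so quickly.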
In the present work, QPUs are accessed through AWS Braket and IBM Cloud Qiskit Runtime. The AWS pricing scheme includes both a per-task cost, incurred for the execution of a given quantum circuit, and a per-shot cost, which is applied to the number of shots specified for that quantum circuit. By comparison, Qiskit Runtime implements a pricing scheme on the basis of runtime, with a fixed price per second for executing a quantum circuit. Fig. 3(b) illustrates an example of how the estimated cost scales with circuit size for the four QPUs presented in this work. The training price estimates assume that the model is trained over 100 epochs, with 100 training samples and 1000 circuit shots per expectation value; for the IBM Falcon r5.11, prices are obtained in CHF and converted at a rate of 1.06 USD = 1.00 CHF. The consequence of this QPU pricing structure is that, for circuits larger than a few qubits, training a QNN on a QPU becomes prohibitively expensive. Prices range from approximately 1000 USD for a two-qubit QNN using Rigetti's Aspen M-2 or OQC's Lucy, to more than ten million USD for a 26-qubit QNN using IBM's Falcon r5.11. It is worth clarifying that the training in this benchmark treats every quantum evaluation as a new circuit on which many shots can be executed. The IBM Qiskit Runtime offers an alternative method, via the IBM Cloud Estimator service, that could reduce the IBM prices by up to two orders of magnitude: the quantum circuit is set up once for a large initial cost, and all circuit executions in that epoch then reuse the same setup with varying parameters depending on the dataset and the parameter-shift rule.
Assuming a circuit set-up that takes t_1 = 5 s and a per-shot runtime of t_2 = 250 µs, the cost of training the same circuit could be dramatically reduced from tens of millions of USD to 212,800 USD. The multi-parameter, multi-circuit functionality is specific to the IBM Cloud, and thus the authors decided to set up every circuit in all cases in the interest of fairness.
In contrast to the high cost of training, the forward pass of a QNN does not entail a gradient calculation and thus requires only a single task with around 100-1000 circuit shots. The cost of using a QPU for the inference stage of a trained QNN is therefore considerably lower. For the QPUs accessed through AWS Braket, the cost is fixed for a given number of shots, irrespective of circuit size. For QPUs accessed through Qiskit Runtime, the cost increases linearly with circuit size (QPU runtime increases linearly with circuit size). A typical forward pass of a QNN can be executed at a cost of approximately 1 USD for QPUs accessed through AWS Braket, or between 10 and 40 USD for QPUs accessed through Qiskit Runtime.

B. Accuracy Evaluation
Understanding the role noise plays in executing quantum circuits is critical to leveraging quantum computers. Vendors often provide esoteric measures of fidelity, gate accuracy and characteristic timescales. Although these are valuable metrics for assessing the relative performance of different QPUs, evaluating the overall effect of these different sources of error on a typical quantum circuit can be challenging. A holistic measure of noise in a quantum circuit can be achieved with a straightforward empirical procedure.
First, the parameters of all trainable gates are fixed at a value of π/4. The input feature values are each set to π²/4, and the QNN is augmented by applying the adjoint of the entire circuit prior to measurement. In a noiseless circuit, this results in a final state vector with zero amplitude for all basis states except the computational ground state, i.e., |00...0⟩. In a physical QPU, various noise sources degrade the fidelity of the circuit, resulting in a non-zero probability of measuring one of the other 2^{n_q} − 1 computational basis states. An accuracy measure is then obtained by counting the proportion of states measured in the computational ground state over 1000 circuit shots. This is commonly referred to as the fidelity of a quantum state, F = |⟨00...0|ψ⟩|². In this case, the fidelity of the final quantum state is measured relative to the computational ground state. This fidelity measurement is repeated by executing 10 such jobs on each QPU to obtain a mean accuracy. The state vectors needed to obtain fidelities for individual shots are not available through the PennyLane SDK with AWS. Consequently, the AWS Braket software library is used instead to construct the same quantum circuits described in Sec. II C for the physical QPU accuracy tests presented here, with the exception of the IBM Falcon r5.11, which was tested on QMware and accessed through IBM Cloud's Qiskit Runtime API.
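The resulting accuracy measure reduces to a simple calculation on the measured shot counts. The dictionary-of-bitstrings format assumed below is typical of QPU and simulator APIs, but is an assumption rather than a specific vendor's return type:

```python
from collections import Counter

def ground_state_fidelity(counts, n_qubits):
    """Fraction of shots measured in |00...0>, an empirical estimate of
    F = |<00...0|psi>|^2 for the circuit-plus-adjoint test described above.

    `counts` maps measured bitstrings to shot counts.
    """
    zero = "0" * n_qubits
    total = sum(counts.values())
    return counts.get(zero, 0) / total

# hypothetical 1000-shot result on a noisy two-qubit device
counts = Counter({"00": 930, "01": 40, "10": 20, "11": 10})
fidelity = ground_state_fidelity(counts, 2)   # 0.93 for these counts
```

Averaging this quantity over 10 jobs, as in the text, gives the mean accuracy and its spread for each device.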
The results, shown in Fig. 3(a) and Tab. IX, vary significantly between QPUs. The IonQ Harmony device attains greater than 99% fidelity across circuit sizes from two to ten qubits. The fidelity of the OQC Lucy device is high for small circuits with only two qubits, but degrades significantly for larger circuits, with 67 ± 2%, 64 ± 2%, and 57.6 ± 0.9% fidelity for four, six, and eight qubits, respectively. The performance of the Rigetti Aspen M-2 and IBMQ Falcon r5.11 QPUs is similarly high for the two-qubit circuit, with fidelities of 89 ± 6% and 93.2 ± 0.4%, respectively. However, the fidelity of these devices approaches zero for circuits larger than approximately eight qubits.
Note that the IonQ Harmony device has full connectivity between its 11 qubits, potentially resulting in a shallower transpilation depth that could contribute to its higher overall accuracy. Importantly, there are additional considerations that might contribute to this result, such as internal optimizations or error correction algorithms. Since this study focuses on performing a practical benchmark, these results were obtained from the end-user's perspective with minimal modification or tuning. Thus, this work does not rule out potential gains for other devices if the appropriate tuning is performed.

A. Large Quantum Circuits
The amount of random access memory (RAM) utilized in simulating any noiseless quantum circuit is a function of the dimension of the vector space, and hence grows exponentially with the number of qubits. An n-qubit state is specified by 2^n complex amplitudes, and thus requires approximately 16 × 2^n bytes of memory. This equates to approximately 16 GB of RAM for a 30-qubit circuit.
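This memory requirement is a one-line calculation:

```python
def statevector_memory_bytes(n_qubits, precision_bits=64):
    """Memory for a dense n-qubit statevector: 2^n complex amplitudes,
    each stored as two floats of the given precision (16 bytes per
    amplitude at double precision, 8 bytes at single precision)."""
    bytes_per_amplitude = 2 * precision_bits // 8
    return bytes_per_amplitude * 2 ** n_qubits

gib = 2 ** 30
mem_30q = statevector_memory_bytes(30) / gib              # 16 GiB, double
mem_40q_single = statevector_memory_bytes(40, 32) / gib   # 8192 GiB, single
```

Even at single precision, a 40-qubit statevector needs 8 TiB, which is why such circuit sizes are out of reach for all but the largest simulator instances.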
QNNs using large numbers of qubits are highly susceptible to the barren plateau phenomenon, where the gradients of a QNN with randomly initialized parameters vanish exponentially as the number of qubits increases [59]. This could present a serious obstacle to training a QML model with a large number of qubits unless a variety of mitigation methods are employed [60][61][62][63]. Solving the barren plateau problem is crucial to achieving highly expressive models that could outperform classical machine learning methods on complex datasets. Access to development services that are able to simulate QNNs with a large number of qubits is essential to solving the barren plateau problem.
To explore the performance of QMware's HQC4020 with an increasing number of qubits, additional benchmarks are performed for QNNs and HQNNs with up to 40 qubits. A 40-qubit simulation is achievable by reducing the floating-point precision to a single-precision (32-bit) representation; for all other circuit sizes, double precision (64-bit) is retained. Fig. 4(d) illustrates the exponential increase in memory usage for a range of circuit sizes up to 40 qubits, for both single- and double-precision representations.
Tab. XI gives the inference runtimes for circuits with up to 40 qubits, up to a maximum of 34 qubits in the case of ASV. The runtimes for the QBN and ASV simulators increase exponentially, with QBN reaching a runtime greater than 13 hours for the 40-qubit QNN. Owing to the long simulator runtimes for large circuits, multiple repeats are not possible, and thus it is hard to draw meaningful conclusions from the single QBN and ASV trials presented for circuits larger than 27 qubits. Nonetheless, in the case of QNNs, ASV achieves a relative runtime advantage of t_QBN/t_ASV = 1.34(9) for circuits with 28 to 34 qubits. In the case of HQNNs there is no clear advantage, with t_QBN/t_ASV = 0.95(22).

B. Multi-threading
The runtime performance for circuits with many qubits can be improved using various parallelization techniques. A common method for achieving a substantial increase in runtime performance for linear algebra operations is to compute matrix-vector and matrix-matrix products with the aid of multi-threading. The QMware basiq library provides native support for a multi-threaded approach to low-level C++ linear algebra operations. However, determining the optimal number of threads to execute a circuit with a given number of qubits is not straightforward. Fig. 4(a) illustrates a cross-sectional study of runtime for circuits with up to 20 qubits with multi-threading across as many as 24 threads. In Fig. 4(b), the optimum number of threads for a given qubit count is shown. These results can be applied generally to achieve best-in-class performance with the QMware HQC4020 using the basiq software library.
PennyLane provides some parallelization for gradient calculations using the adjoint operator method [56], which is applicable to parameterized quantum algorithms. However, the authors are not aware of any low-level parallelization in the PennyLane library for the general linear algebra calculations encountered in generic quantum circuits.

C. Conclusion
This work presents a comprehensive study of various quantum computing platforms, using both simulated and physical quantum processing units. The results presented in Sec. II D demonstrate a clear runtime advantage for the QMware basiq library executed on the QMware cloud computing service. Relative to the next fastest classical simulator, QMware achieves a runtime reduction of up to 78% across all algorithms with fewer than 27 qubits. In particular, the QMware basiq library achieves a relative runtime of 0.56(32) compared to the PennyLane library for QNNs and 0.54(30) for HQNNs.
The QMware HQC4020 hardware benchmarked against AWS ml.m5.24xlarge achieves a comparable relative runtime of 0.90(17) for QNNs and 0.95(16) for HQNNs. Thus, the advantage offered by the QMware cloud computing service is primarily due to a harmonious interplay between software and hardware, which can be attributed to the superior multi-threading support in the basiq library. AWS also provides the SV1 simulator, which performs marginally better than QMware basiq for QNN circuits with more than 27 qubits (up to the SV1 maximum of 34 qubits). There is no clear advantage for either QMware or SV1 in the case of HQNNs. Additionally, QMware is the only tested simulator capable of simulating circuits with 34-40 qubits.
The price and scarcity of quantum hardware mean it is more time- and cost-efficient to develop algorithms and train QNNs with quantum simulators such as Amazon Web Services Braket or QMware's cloud computing service. This is particularly true for variational quantum algorithms and QNNs, which represent a promising application of quantum technologies to artificial intelligence on NISQ computers. In contrast to the exponential runtime scaling encountered in quantum simulators, the runtime of QPUs scales linearly with circuit size. Publicly available QPUs are already able to achieve runtime improvements over simulator hardware for large numbers of qubits. For example, Rigetti's Aspen-M2 can execute 40-qubit circuits in approximately one minute, which is less than 0.02% of the runtime measured for QMware's simulator.
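The crossover described above can be made concrete with a toy model: exponential simulator runtime against linear QPU runtime. The constants below are illustrative placeholders, not measured values from this work:

```python
# Toy model: simulator runtime grows exponentially with qubit count,
# while QPU runtime grows roughly linearly. All constants are hypothetical.
def simulator_runtime_s(n: int, base: float = 1e-6, factor: float = 2.0) -> float:
    return base * factor ** n

def qpu_runtime_s(n: int, per_qubit: float = 1.5, overhead: float = 10.0) -> float:
    return overhead + per_qubit * n

# First qubit count at which the (toy) simulator becomes slower than the QPU.
crossover = next(n for n in range(1, 60)
                 if simulator_runtime_s(n) > qpu_runtime_s(n))
print(f"Toy-model crossover at {crossover} qubits")
```

However large the constants, any exponential curve eventually overtakes any linear one, which is why the QPU advantage in this regime is structural rather than an artifact of the chosen parameters.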
As quantum hardware improves and the number of available qubits grows, it will become possible to gain a substantial runtime advantage over simulator hardware when executing large quantum circuits. However, the fidelity tests presented in Sec. III B indicate that accurate inference with such large quantum circuits is not yet possible. Moreover, the cost of accessing these quantum devices makes training a QNN on currently available QPUs prohibitively expensive. Owing to their exponential computational space, QNNs with relatively few qubits are able to tackle challenging data science and industry problems. Thus, the key to success in the field of quantum computing is to improve the cost and accuracy of QPUs and to integrate them well within classical infrastructure. The hybrid interplay between quantum and classical machines is key to seamlessly harnessing the best performance of QPUs and simulators depending on the use case.

Appendix A: The quantum benchmarking landscape

As described in Ref. [64], quantum benchmarking faces many more challenges than its classical counterpart. After all, there is no singular "quantum technology": quantum devices are made from different materials and used for vastly different purposes. This makes it difficult to predict how the field will progress, as each type of technology achieves breakthroughs at a different rate. Additionally, quantum devices are ultimately compared not only to other quantum computers but also to the performance of classical devices, itself an ever-moving target.
Given the multitude of challenges, it is important to question how any benchmark result compares to others. In this appendix, a non-exhaustive list of key takeaways from other benchmarking papers is presented to better address where this paper sits in the wider landscape. The year of each paper's publication can be viewed in Figure 5. For more information on the current state of quantum benchmarking, Ref. [65] provides an excellent summary of many benchmarking papers. In that paper, Wang et al. separate quantum benchmarks into three classes: physical, aggregative, and application-level.
Physical benchmarks depend on hardware specifics. These tasks measure the engineering capabilities and limitations of quantum devices, such as the impacts of decoherence and noise, which have been established as major speed bumps in the development of fully functional quantum devices [41,66]. Examples of physical benchmarking papers include Ref. [39], a measurement of the qubit coherences and gate fidelities of IonQ's trapped ytterbium qubits, and Ref. [67], which measures T1 and T2 coherence times as well as the CCNOT gate and the adjoint Quantum Fourier Transform on a 20-qubit IBM device.
Aggregated benchmarks measure how well multiple qubits and gates work together. Perhaps the most widely used of these hardware-agnostic metrics is Quantum Volume. Introduced by IBM in Ref. [68], Quantum Volume measures the maximum circuit width (number of qubits) and depth (layers of gates) that a device can run such that the two numbers are equal and the fidelity is maintained above a certain threshold. Given a square circuit of width and depth N, its Quantum Volume is defined as the exponentiation of the side length of this square, 2^N, and the side length itself is sometimes referred to as the number of Algorithmic Qubits [69]. For example, a quantum device that can attain high fidelity with 4 qubits and a circuit depth of 4 has a Quantum Volume of 2^4 = 16 and is said to have 4 Algorithmic Qubits.
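As a worked example of the definition above, a minimal sketch of the conversion between Algorithmic Qubits and Quantum Volume (the function names are hypothetical):

```python
import math

def quantum_volume(n_algorithmic_qubits: int) -> int:
    """Quantum Volume as defined in the text: 2**N for the largest
    square circuit (width == depth == N) run above the fidelity threshold."""
    return 2 ** n_algorithmic_qubits

def algorithmic_qubits(qv: int) -> int:
    """Invert the definition: the side length of the largest square circuit."""
    return int(math.log2(qv))

assert quantum_volume(4) == 16       # the example given in the text
assert algorithmic_qubits(16) == 4
```

The exponential relationship is why a modest increase in reported Quantum Volume corresponds to only a small gain in usable circuit size.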
Lastly, application-based benchmarks, which include this paper, demonstrate the ability of a device to perform specific algorithms and tasks. A large portion of the excitement around future quantum computers is their potential for speedups in a variety of common algorithms and optimizations. As the applications for quantum devices are vast, application-based benchmarks vary dramatically. For the sake of clarity, this category of benchmarks is divided into subcategories, presented in the following paragraphs in roughly chronological order: quantum chemistry, multiple task suites, visual representation, and quantum annealing.
One application of quantum devices is the simulation of chemical reactions, which are by nature quantum mechanical. The usefulness of this task has led quantum chemistry simulation to become a persistent benchmark for quantum computers. For example, Ref. [40] uses simulated alkali metal hydrides to test the Variational Quantum Eigensolver (VQE) on IBM's and Rigetti's superconducting devices, reaching sufficient accuracy to reproduce the results of these chemical reactions. In a similar vein, Ref. [46] proposes a problem from solid-state physics, the one-dimensional Fermi-Hubbard model. In one dimension, this model has an exact solution, which makes it straightforward to verify the results of the VQE. By layering VQE on top of itself, they recreated the theoretical one-dimensional results on Google's Sycamore hardware.
Later, entire suites of benchmarks were developed to test various tasks and metrics, such as QASMBench [43], a benchmarking suite of computational problems for quantum devices based on the OpenQASM assembly language. It presents metrics for multiple tasks, including linear algebra, chemistry, optimization, and cryptography. Other suites have also been developed, such as QScore [45] and QPack [47], both of which center their benchmarks on Max-Cut and similar optimization problems.
Building on the framework of IBM's volumetric benchmarking, Ref. [44] considers specific tasks for which deep circuits (greater depth) and shallow circuits (greater width) are better suited, such as state preparation and IQP circuits, respectively. Ref. [42], meanwhile, uses volumetric benchmarks as a backdrop to measure a suite of quantum algorithms, such as the Quantum Fourier Transform and Grover's Search. These algorithms are tested many times at different depths and widths. Consistently, as the tasks increase in size and move farther from the center of the quantum volume square, the fidelity drops. In this way, the relationship between a device's Quantum Volume and the fidelity of quantum operations is made readily apparent through visually accessible figures that show when the devices begin to fail. The use of visual assessments allows for an intuitive understanding of benchmarking and has become more standard. For example, Ref. [70] tests 21 quantum devices with algorithms chosen not only for performance but also specifically with visual accessibility in mind.
Finally, benchmarks from the field of quantum annealing, which exploits the equivalence between a physical Ising model and quadratic unconstrained binary optimization (QUBO) to solve combinatorial optimization tasks, are necessary for annealing devices. Ref. [71] showed that specific instances of quantum annealing on D-Wave's Advantage produce solutions within 0.5% of the best-known solution in a fraction of the time required by the fastest classical algorithms. It should be noted that this is not a fundamental speedup; nevertheless, order-of-magnitude differences in execution time have substantial industry potential.
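To make the QUBO formulation concrete, the sketch below brute-forces a small Max-Cut instance as a QUBO minimization; the 4-vertex cycle graph and all names are hypothetical examples, not problem instances from Ref. [71]:

```python
import itertools
import numpy as np

def qubo_maxcut(adjacency: np.ndarray):
    """Minimise x^T Q x over binary x, where Q encodes Max-Cut:
    Q_ii = -deg(i) and Q_ij = w_ij, so the energy equals minus the cut value."""
    n = adjacency.shape[0]
    Q = adjacency - np.diag(adjacency.sum(axis=1))
    # Exhaustive search over all 2**n binary assignments (tractable only for tiny n;
    # annealers target exactly the regime where this blows up).
    return min(
        ((np.array(x) @ Q @ np.array(x), x)
         for x in itertools.product((0, 1), repeat=n)),
        key=lambda t: t[0],
    )

# 4-cycle graph: the maximum cut separates alternating vertices (cut value 4).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
energy, assignment = qubo_maxcut(A)
print(energy, assignment)  # energy -4.0 corresponds to a cut of 4
```

On an annealer, the same Q matrix is mapped to Ising couplings and the hardware searches for the low-energy state physically rather than by enumeration.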
Application-based benchmarks span a wide field of practical industry tasks as well as theoretical problems tailored to comparing devices. Ultimately, while many benchmarks have been proposed, they share a common theme: each attempts to quantify the degree to which a given quantum device can perform industry tasks and to communicate the results effectively. With this in mind, Figure 6 provides an inexact categorization of these benchmarks based on the specificity of their metrics (that is, the generality of a given benchmark with respect to the hardware required and the algorithms available) as well as their practicality on current noisy quantum computers. This paper's testing of supervised learning on various devices serves as an application-based benchmark that yields preliminary results on the role of neural networks in quantum machine learning. It tests a range of real and simulated quantum devices, comparing their runtime and accuracy on a classification task. The future role that quantum machines will play in high-level neural networks and deep learning remains to be determined, but these results demonstrate the practical effects of quantum hardware as well as quantum-inspired classical hardware.

Appendix B: Raw numerical results
In this section, the raw results of the benchmarks are provided in tabular form. Tab. III presents the data obtained during the training process, whereas Tab. IV shows the data obtained during the inference process. The blank spaces represent two types of non-applicability: 1) the standard deviations of training (inference) runtimes were calculated only up to 24 (26) qubits, to avoid long execution times and to limit our carbon emissions, and 2) not all QPUs have enough qubits for every benchmark, so their results are presented up to the largest even qubit count available.
Tabs. V, VI, VII, and VIII show the dependence of the QBN runtimes on the number of threads for quantum training, quantum inference, hybrid inference, and hybrid training, respectively. In all cases, runtimes generally improve as the number of threads increases. Notably, in some cases it is beneficial to use more than a single thread even for low-qubit circuits.
Furthermore, Tab. IX provides the numerical results for the QPU accuracy tests, and the corresponding runtimes are shown in Tab. X. These runtimes largely agree with those in Tab. III; however, it is worth noting that these circuits are twice as deep as those used in the runtime benchmarks, as explained in Sec. III B. Tab. XI shows the runtimes associated with running large quantum circuits. A characteristic feature of this regime is the favorable runtime scaling of quantum hardware, which is unavailable to any classical machine. This hints at a decisive quantum advantage in this region, provided the accuracy also becomes competitive.
Finally, Tab. XII shows the pricing of the various QPUs used in this work for training and inference.

FIG. 6. Key benchmarking papers from the past 5 years, plotted by specificity versus practicality. Specificity describes the hardware requirements and the generality of the algorithms presented by a benchmarking paper. Practicality refers to the use of metrics that apply to real-world industry problems, as opposed to theoretical problems constructed for the purposes of benchmarking itself. It should be reiterated that all quadrants have important uses. Naturally, the placement of each paper is subjective, and others may come to different conclusions about where each benchmark belongs.