Quantum speed-up in global optimization of binary neural nets

The performance of a neural network (NN) for a given task is largely determined by the initial calibration of the network parameters. Yet, it has been shown that the calibration, also referred to as training, is generally NP-complete. This includes networks with binary weights, an important class of networks due to their practical hardware implementations. We therefore suggest an alternative approach to training binary NNs. It utilizes a quantum superposition of weight configurations. We show that the quantum training guarantees with high probability convergence towards the globally optimal set of network parameters. This resolves two prominent issues of classical training: (1) the vanishing gradient problem and (2) common convergence to sub-optimal network parameters. We prove that a solution is found after approximately 4n² log(n/δ)√Ñ calls to a comparing oracle, where δ represents a precision, n is the number of training inputs and Ñ is the number of weight configurations. We give the explicit algorithm and implement it in numerical simulations.


Introduction
Artificial neurons are elementary many-to-one logical gates whose action is controlled by tunable parameters, often referred to as weights. Networks of interconnected neurons, called (artificial) neural networks (NNs) [1][2][3], have proven immensely successful for a large variety of tasks, notably including pattern recognition [4,5], language processing [6,7], and simulation of molecular dynamics [8,9], leading to changes in practice in fields such as medicine, pharmaceutical research [10][11][12][13], and finance [14]. As digital logic naturally employs binary encoding, binary neural networks (BNNs) [15], for which the weights and outputs can only assume bit values, were recently introduced. Such simplifications drastically reduce the memory size and run time during the execution of the network [16].
A key reason for the success of NNs in the tasks mentioned above is their trainability. While computers were originally built to process information according to a pre-defined algorithm, NNs, including BNNs [15], can undergo training to learn how to process data themselves. In a standard set-up, data for which the desired output is already known is input, and the network parameters are adjusted until the outputs of the NN coincide with the desired ones. This adjustment is performed according to training algorithms [1][2][3]. In the literature, such a calibration of the learning architecture falls under the umbrella of supervised learning.
Designing and understanding training algorithms for NNs is accordingly a critical area of research. Training NNs optimally was shown to be NP-complete [17,18], leading to long training times [17] and large consumption of memory [19,20]. Furthermore, common training methods, such as gradient descent, heavily rely on the shape of the optimisation landscape of the network parameters. Since the landscape is generally non-convex, such methods can become trapped far from the global optimum.

[Figure 1. Simple instance of an FNN. The inputs a_1, a_2 are forwarded to the neurons N_1 and N_2, which process them and forward their outputs to the next layer, consisting of neurons N_3, N_4 and N_5. Afterwards, the outputs of the second layer are forwarded to N_6 and N_7, which output a′_1, a′_2. Each of the three layers is connected with edges carrying weights w ∈ {w_i}_{i=1}^{12}.]

In a NN, the neurons are interlinked in a specific way, constituting the architecture of the network. This allows neurons to exchange information: after a neuron has operated on some input, it can send the resulting output to the next neuron for further processing. The strength of such an interaction is described by the network parameters {w_i}_{i=1}^N, commonly referred to as the weights, which are most generally real numbers.
One prominent architecture is feed-forward neural networks (FNNs). Here, the neurons are grouped into consecutive layers. Neurons within the same layer do not interact, while all neurons in two consecutive layers are connected with one another. The number of neurons can differ between layers. Time flows linearly in the sense that each neuron in layer i receives its inputs at a single time t_i (no back-action or cycles are allowed), and the outputs of the neurons are forwarded to the next layer i + 1 only. This way, the user inputs data to the input layer, and after a cascade of intermediate layers (commonly referred to as hidden layers) the output layer returns the response of the FNN. Figure 1 depicts a simple instance of an FNN with two inputs, two outputs and one hidden layer. The FNN architecture has proven immensely successful for tasks like pattern recognition [4,5] and classification [52].
The task of training binary neural nets. The ability of NNs, including FNNs, to learn certain tasks like pattern recognition [4,5] and classification [52] comes with a high computational cost. One of the reasons for this is that the inputs to a neuron and its weights {w_i}_{i=1}^N can be arbitrary real numbers, which generally require large amounts of memory and arithmetic operations [15]. Only recently, Hubara et al [15] proposed BNNs as a simplified model for NNs. Here, the outputs of the neurons and the weights are restricted to bits, and consequently neuron processing reduces to bit-wise operations. This causes drastic reductions of memory size and speed-ups in data processing, compared to NNs with continuous parameters. Remarkably, despite the strong nature of the simplifications, the accuracy of BNNs was shown to be similar to that of NNs for certain tasks [15].
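The hardware appeal can be seen in a few lines: for weights and activations in {−1, +1}, encoded as bits, a neuron's dot product reduces to an XNOR and a popcount. A minimal sketch (the bit encoding and function names are illustrative, not the paper's circuit):

```python
# Sketch of why binary neurons are cheap: with weights and activations in
# {-1, +1} encoded as bits (1 -> +1, 0 -> -1), the dot product reduces to
# a bitwise XNOR followed by a popcount; no floating-point arithmetic needed.

def binary_neuron(weight_bits: int, input_bits: int, n_bits: int) -> int:
    """Return the sign activation (0 or 1) of a binarised dot product."""
    xnor = ~(weight_bits ^ input_bits) & ((1 << n_bits) - 1)  # bitwise XNOR
    matches = bin(xnor).count("1")       # popcount: number of agreeing bits
    dot = 2 * matches - n_bits           # dot product in the {-1,+1} picture
    return 1 if dot >= 0 else 0          # sign activation

# Example: 4-bit weights 0b1010 and inputs 0b1011 agree on 3 of 4 positions,
# so the {-1,+1} dot product is 2*3 - 4 = 2 >= 0 and the neuron fires.
print(binary_neuron(0b1010, 0b1011, 4))
```

The same XNOR-popcount pattern underlies the memory and run-time savings reported for BNN hardware implementations.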
At the core of each learning process lies the calibration of the network parameters {w_i}_{i=1}^N, sometimes referred to as training of the NN. To this purpose, pairs of training data (a, a*) consisting of inputs a and corresponding desired outputs a* (also called labels) are used. This happens by sending the inputs through the NN and comparing the network outputs a′ to the correct outputs a*. The weights in the NN are tuned according to the deviation of the outputs from the desired ones. To quantify the deviation, a task-specific cost function C({a′, a*}) is defined. Note that the outputs a′ implicitly depend on the choice of weights. Hence, the task is to find the configuration of weights which minimizes the cost function C. Typically, this is done via methods based on gradient descent, where a weight w is updated according to its induced change in the cost, i.e. proportionally to ∂C/∂w. The training of classical NNs is known to have three main drawbacks: (1) there is no efficient method, and indeed training has been shown to be NP-complete even for small networks [17,18] (see appendix D for more explanations). (2) The cost landscape is usually non-convex. The solutions found in training methods based on gradient descent are only locally optimal and can be globally very suboptimal [22,23]. (3) The gradient of the cost function with respect to a specific weight can be vanishingly small or explodingly large, preventing convergence to a solution. This issue is often referred to as the problem of vanishing or exploding gradients [53].
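The training task can be stated compactly: minimise a 0-1 cost over all 2^N weight strings. A minimal classical brute-force baseline (the global search against which the quantum algorithm is later compared), with a hypothetical threshold neuron standing in for the network:

```python
from itertools import product

# Minimal sketch of the training task for a binary network: find the weight
# string minimising a 0-1 cost over labelled data. The neuron model `f` is a
# hypothetical stand-in; any binary-valued network fits this template.

def f(w, a):
    # toy neuron: fires if at least two weighted inputs are active
    return 1 if sum(wi & ai for wi, ai in zip(w, a)) >= 2 else 0

# labels generated by the (unknown to the learner) rule w* = (1, 1, 0)
data = [(a, 1 if a[0] + a[1] == 2 else 0) for a in product((0, 1), repeat=3)]

def cost(w):
    # number of training pairs the weight string gets wrong
    return sum(f(w, a) != target for a, target in data)

# classical global search: exhaustive over all 2**N weight strings
best = min(product((0, 1), repeat=3), key=cost)
print(best, cost(best))
```

Exhaustive search over the 2^N strings guarantees the global optimum but is exactly the exponential cost that motivates the quantum approach below.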
Quantum neural networks. The processing of neurons in classical NNs is commonly highly non-linear. In computational models, however, quantum processing of information is fundamentally reversible. Modelling the action of a single neuron by a channel N, this would mean that N is required to be unitary. Any channel can, with the addition of reference systems, be seen as a unitary dynamics on a larger system [54]. For this reason, we treat quantum extensions of classical NNs in this work as a collection of interlinked unitary channels corresponding to the quantumly extended neurons. In section 3 (see also appendix E for further concrete examples) we illustrate this point by constructing an explicit quantum extension of the most common instance of classical feed-forward BNNs (FBNNs).

[Figure 2. Marking of the target weights. A QBN acts as a unitary U on the input |a⟩, ancilla |0⟩ and coherent weight state Σ_i |w_i⟩. The output of U is compared to the desired output |a*⟩ by the oracle Λ. To decouple the weights, an uncomputation U^{−1} is applied.]
Since every neuron is unitary, the full network is unitary. This allows us to treat the training framework in large generality: the only assumptions we make are that the network subject to training (1) acts unitarily, (2) has a finite number N of parameters, and (3) the parameters assume values from a finite set of cardinality d (as we consider BNNs, each parameter takes a bit value; the framework considered here is, however, more general and allows for any arbitrary finite number d of values). In other words, although we focus on training classical FBNNs, the results of this paper hold in general for the more abstract task of discriminating the optimal unitary evolution, out of a set of d^N candidates, for executing a certain task. In the remainder of this paper we denote by U the full action of the quantumly extended FBNN (QFBNN).
The quantum training algorithm. Here, we propose a fully quantum training protocol for QFBNNs. In contrast to previously suggested protocols [35][36][37][43], we maintain coherence throughout the training and do not collapse the state with each update of parameters. For illustration purposes we restrict the training to a single neuron. This readily generalises to networks, as both a single quantumly extended binary neuron (QBN) and a full QFBNN are, on an abstract level, described by a unitary with a finite set of two-level parameters. We comment further in appendix E on how the direct translation to QFBNNs works.
The general training strategy proceeds in the following three steps. (a) Marking of the target weights: the initialization of the full set of parameters is denoted as a string of values (weights). Each possible weight string is multiplied by a phase factor which depends on how many times it leads to the correct output, for the input data in the training set. This is achieved coherently by initializing a quantum superposition of different QFBNNs and acting on it unitarily. (b) Binarisation of the marking: the phase factors from step (a) are binarised to the values {1, −1} using an algorithm based on quantum phase estimation. Strings with negative sign are candidates for solving the training problem. (c) Extracting the optimal set of weights: the binarisation step can in general mark M ⩾ 1 different weight states, some of which are good but not optimal. In the last step, a binary search reduces M to the minimum by sorting out the suboptimal marked states; the remaining M states all perform equally well. Afterwards, the corresponding amplitudes are amplified and measured.
We prove that this yields, with high probability, the globally optimal set of weights, quadratically faster (in the number of weight strings) than classical global search. While the main text discusses the subroutines in detail, appendix G illustrates the functioning of the training method with concrete examples.
Marking the target weights. Let us assume we have in total N binary parameters (weights) {w_1, w_2, …, w_N}, with values encoded as |0⟩ and |1⟩. Consequently, there are in total Ñ = 2^N different combinations of values for the weights. Each of these combinations will be denoted by a bit string w of length N, for which the corresponding quantum state reads |w⟩. Furthermore, let n be the number of data pairs (a_1, a*_1), (a_2, a*_2), …, (a_n, a*_n) used for the training. Here, the value a*_i (sometimes called a label) denotes the desired output for the given input a_i (note that a single QBN produces exactly one output a′ for a given set of inputs). Then, a QBN takes as inputs one set of training data a ∈ {a_i}_{i=1}^n, the corresponding weight state |w⟩, and an ancilla qubit |0⟩ to realize the unitary embedding of the classical binary neuron (CBN) (see appendix E for more information).
The first subroutine is a phase accumulation approach that quantifies the quality of each weight string, see figure 2.
Step 1: initialize all Ñ = 2^N possible weight strings in a coherent superposition |W⟩ = (1/√Ñ) Σ_i |w_i⟩. This yields the overall initial state

|In⟩ = |W⟩ ⊗ |a⟩ ⊗ |0⟩ ⊗ |a*⟩,   (1)

where |0⟩ is an ancillary qubit and (a, a*) ∈ {(a_1, a*_1), (a_2, a*_2), …, (a_n, a*_n)} is one pair of training data. Due to the superposition, all weight states are simultaneously combined with the input |a⟩. Next, the unitary action U of the QBN acts on |W⟩, the input |a⟩ and the ancilla |0⟩, encoding the output |a′⟩ on the ancillary system. The overall state is then given by

(1/√Ñ) Σ_i |w_i⟩ ⊗ |ã_i⟩ ⊗ |a′_i⟩ ⊗ |a*⟩,   (2)

where the QBN transforms the input |a⟩ and ancilla |0⟩ into |ã_i⟩ and |a′_i⟩ respectively if the control state is |w_i⟩. Note that, except for the desired output |a*⟩, the systems in equation (2) are entangled in general.
Step 2: call the oracle Λ to compare the output |a′_i⟩ with the desired output |a*⟩. If the two states coincide, Λ adds a phase e^{iπ/n} to |a′_i⟩. This leads to

(1/√Ñ) Σ_i e^{iΔ_{w_i}} |w_i⟩ ⊗ |ã_i⟩ ⊗ |a′_i⟩ ⊗ |a*⟩,   (3)

where Δ_{w_i} = π/n if the oracle comparison between a′_i and a* was successful, and Δ_{w_i} = 0 otherwise.
Step 3: decouple the weights by uncomputation. By inverting the unitary action U of the QBN, the weights get decoupled from the remaining system and assume the state

|W′⟩ = (1/√Ñ) Σ_i e^{iΔ_{w_i}} |w_i⟩.   (4)

The full output state is given by |Out_3⟩ = |W′⟩ ⊗ |a⟩ ⊗ |0⟩ ⊗ |a*⟩.
Step 4: accumulation of phases. The state |W′⟩ is used as the initial weight state for the marking with a new training pair. By repeating steps 1–3 for all n pairs of training data, we achieve a coherent phase accumulation for all weight strings, resulting in the output weight state

|W̃⟩ = (1/√Ñ) Σ_i e^{iπN_i/n} |w_i⟩.   (5)

Here, N_i ∈ [0, n] denotes the number of times the oracle comparison Λ was successful for weight string w_i during the n rounds. The most frequently marked weights are closest to a phase of −1, whereas bad weight strings maintain a phase close to +1. In appendix G we discuss the quantum circuit implementations of several basic examples of the above marking scheme and their simulation results. The simulations have been done via Huawei's quantum computing cloud platform HiQ (hiq.huaweicloud.com), as numerical evidence of the proposed quantum training algorithm. Now, let Ũ be the unitary describing the full accumulation process, namely

Ũ |w_i⟩ ⊗ (⊗_{j=1}^n |a_j⟩ ⊗ |a*_j⟩) ⊗ |0̄⟩ = e^{iπN_i/n} |w_i⟩ ⊗ (⊗_{j=1}^n |a_j⟩ ⊗ |a*_j⟩) ⊗ |0̄⟩,   (6)

where |0̄⟩ = |0⟩^⊗n is the ancillas' state. It can be seen easily that the states |w_i⟩ ⊗ (⊗_{j=1}^n |a_j⟩ ⊗ |a*_j⟩) ⊗ |0̄⟩, i = 1, 2, …, Ñ, are eigenvectors of Ũ with corresponding eigenvalues e^{iπN_i/n}. This is an important insight for the next step, in which the phases will be converted to a binary classifier for the quality of the strings.
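The accumulated weight state of equation (5) is easy to preview numerically; a small numpy sketch with hypothetical match counts N_i:

```python
import numpy as np

# Sketch of the accumulated weight state of equation (5): each weight string
# w_i picks up the phase exp(i*pi*N_i/n), where N_i counts its successful
# oracle comparisons over the n training pairs. The match counts below are
# illustrative placeholders, not derived from a specific network.

n = 4                                   # number of training pairs
N_counts = np.array([2, 4, 1, 0])       # hypothetical N_i for strings 00,01,10,11
N_tilde = len(N_counts)                 # number of weight strings

W = np.exp(1j * np.pi * N_counts / n) / np.sqrt(N_tilde)

# A perfectly performing string (N_i = n) sits at phase -1; a useless one
# (N_i = 0) at +1; intermediate strings lie on the upper unit half-circle.
print(np.round(W * np.sqrt(N_tilde), 3))
```

This is only the bookkeeping of the phases; on hardware the counts N_i are never computed explicitly but accumulate coherently through the n oracle calls.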
Binarising the marking. The output state in equation (5) features a coherent encoding of the performance of each possible weight string for the given training data set. Next, we need to find a way to extract this information without destroying the coherence of the state. Standard quantum searches fail in this situation: it was shown in [55] that for arbitrary phases, the optimal number k * of iterations of the quantum search is in general unknown.
Instead, we propose a subroutine to binarise the phases in the weight state |W̃⟩. To this purpose, we define a threshold count N_τ, which denotes the minimum required number of successful oracle calls for a weight string to get marked. If N_i ⩾ N_τ, the corresponding weight state gets a phase of −1. If N_i < N_τ, the phase is set to +1. This conversion can be done via phase estimation (PE) followed by a threshold oracle O_±1. The complete process of converting the phases in equation (5) to a binary marking is visualised in figure 3.
The PE is executed on the unitary Ũ corresponding to the full phase accumulation routine (see equation (6)); we write PE(Ũ). The eigenvectors of Ũ are given by |u_i⟩ = |w_i⟩ ⊗ (⊗_{j=1}^n |a_j⟩ ⊗ |a*_j⟩) ⊗ |0̄⟩, with corresponding eigenvalues e^{2πiφ^(i)}, φ^(i) = N_i/(2n) ⩽ 1/2. Note that φ^(i) can be represented in binary as φ^(i) = 0.φ^(i)_1 φ^(i)_2 φ^(i)_3 ….

[Figure 3. Binarising the marking. PE(Ũ) yields estimates of the phases φ^(i) = N_i/(2n) subject to the estimation. Then, the oracle O_±1 is called, which classifies a state as good if the phase is larger than a certain threshold N_τ and bad otherwise. Finally, uncomputation of the PE achieves a decoupling of the marked weight state from the other systems.]

Omitting normalisation for readability, we input the superposition |u⟩ = Σ_{i=1}^Ñ |u_i⟩ of eigenvectors to PE(Ũ) and obtain an estimate for all φ^(i), i = 1, 2, …, Ñ, in the superposition (for a brief overview of the PE algorithm see the appendix). Let us assume that the ancilla space of PE is given by t qubits. Then, for each term in the superposition, PE(Ũ) converts the number φ^(i) into a t-qubit state |φ^(i)_1 φ^(i)_2 … φ^(i)_t⟩. For t large enough, it can then be determined from the estimate |φ^(i)_1 … φ^(i)_t⟩ whether N_i ⩾ N_τ. For illustration purposes, let us analyse some simple examples. Assume first we have t = 1, such that PE(Ũ) gives the first binary digit φ_1. Then φ_1 = 1 if and only if N_i = n, i.e. only the perfectly performing strings are singled out. For t = 3, the estimate |φ_1 φ_2 φ_3⟩ assumes the value 011 if N_i ⩾ 3n/4, and 010 if n/2 ⩽ N_i < 3n/4. Thus, even though the number of uses of Ũ grows exponentially in t, the search interval (up to errors in the PE discussed later) is narrowed down exponentially fast in t. Hence, in order to single out the largest φ^(i), only a small number t of ancilla qubits is necessary. Note that t only depends on our required precision, not on N. We comment later in greater detail on how to choose t and its impact on the performance.
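What the t-qubit register holds after PE can be previewed classically: the first t binary digits of φ = N_i/(2n). A short sketch reproducing the t = 3 intervals quoted above (exact PE, ignoring the estimation error discussed later):

```python
# What the t-qubit PE register encodes, classically: the first t binary
# digits of phi = N_i / (2n). For t = 3 this reproduces the intervals quoted
# in the text: N_i = n -> '100', N_i >= 3n/4 -> '011', n/2 <= N_i < 3n/4 -> '010'.

def pe_digits(N_i: int, n: int, t: int) -> str:
    phi = N_i / (2 * n)              # eigenphase of the accumulation unitary
    k = int(phi * 2**t)              # truncate to t binary digits
    return format(k, f"0{t}b")

n = 8
print(pe_digits(8, n, 3))   # N_i = n
print(pe_digits(6, n, 3))   # N_i = 3n/4
print(pe_digits(4, n, 3))   # N_i = n/2
```

On hardware the digit extraction happens in superposition over all Ñ strings at once; this sketch only shows which bit pattern each count maps to.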
Finally, after PE and the oracle O_±1, the weight state together with the phase register reads

(1/√Ñ) Σ_i (−1)^{N_i ⩾ N_τ} |w_i⟩ ⊗ |φ^(i)_1 φ^(i)_2 … φ^(i)_t⟩,   (7)

where the exponent N_i ⩾ N_τ is to be understood as a Boolean variable which is 1 in case the condition is true and 0 else. The state in equation (7) is entangled between the weight strings and the binary encoding of phases from the PE. In order to decouple the two systems while maintaining coherence of the weight state, we uncompute the PE, which transforms (7) to

|Ŵ⟩ = (1/√Ñ) Σ_i (−1)^{N_i ⩾ N_τ} |w_i⟩.   (8)

In the next step, the amplitudes of the marked strings in the state |Ŵ⟩ are amplified such that the concluding measurement detects one of the marked strings with high probability.
Full training cycle. It can be seen from equation (8) that the marked state |Ŵ⟩ is equivalent to the marked state of standard Grover search. Hence, in order to amplify the amplitudes of the marked elements the standard diffusion operator D = H^⊗N Λ_0 H^⊗N can be used, with Λ_0 = 2(|0⟩⟨0|)^⊗N − I (see the appendix for a brief revision of Grover search). The quantum training for the binary neuron then succeeds by applying k* ≈ (π/4)√(Ñ/M) iterations of the binary marking via PE and the subsequent diffusion, assuming the number M of target weight strings is known. This is generally not the case: M depends greatly on the training data. Yet, even without knowledge of M, we show in the appendix that one can find an optimal weight string in an expected time that scales as O(√(Ñ/M)) with a subroutine based on Boyer et al [56].

[Figure 4. After the binary marking of figure 3, the diffusion operator D = HΛ_0 H is applied to the weight register. Amplification of the marked elements is achieved by looping the full marking subroutine and the subsequent diffusion k* times. A final measurement in the computational basis yields the optimal weight string w*.]

An alternative approach writes the number of matches N_i for a weight configuration w_i into qubit states. We elaborate in appendix F on this method and compare it with the training algorithm from the main text.
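The iteration count and its effect can be checked with the standard Grover rotation formula; the sketch below evaluates k* ≈ (π/4)√(Ñ/M) and the success probability sin²((2k+1)θ) for illustrative numbers (when M is unknown, the Boyer et al subroutine keeps the same expected scaling):

```python
import math

# Iteration count for the amplification step: with M marked strings among
# N_tilde = 2**N, about (pi/4)*sqrt(N_tilde/M) Grover iterations rotate the
# state close to the marked subspace. Numbers here are illustrative.

def grover_success(N_tilde: int, M: int, k: int) -> float:
    """Probability of measuring a marked string after k Grover iterations."""
    theta = math.asin(math.sqrt(M / N_tilde))
    return math.sin((2 * k + 1) * theta) ** 2

N_tilde, M = 2**10, 1
k_star = round(math.pi / 4 * math.sqrt(N_tilde / M))
print(k_star, grover_success(N_tilde, M, k_star))
```

With Ñ = 1024 and a single marked string, 25 iterations already push the success probability above 99%, illustrating the quadratic saving over the ~512 expected classical queries.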

Circuit implementation
We illustrate the training algorithm with an elementary instance of a NN, consisting of a single neuron with two inputs. In addition to studying the dynamics of the training with a small data set, we present the explicit circuit implementations of the phase accumulation sub-routine, as well as the full quantum training. Further details and more involved examples are given in the appendix.
Consider the example of a single QBN with two inputs and two weights. Figure 5 illustrates the example, and a circuit implementation of the phase accumulation subroutine is given in figure 6. In the circuit, the six input qubits are (from top to bottom): the two weights |w_1⟩ and |w_2⟩, the two training inputs |a_1⟩ and |a_2⟩, the ancilla storing the output of the neuron computation, and the desired output |a*⟩. The initial training data point is (a_1, a_2, a*) = (0, 0, 0). In the first step, the Hadamard transformations create the coherent superposition of all possible weight strings, |W⟩ = (1/2)(|00⟩ + |01⟩ + |10⟩ + |11⟩) (the individual weight strings are simply |w⟩ = |w_1 w_2⟩). Next, the circuit in the first dashed blue box depicts the unitary action of the neuron: the CNOT gates implement the multiplication between weights and inputs, and the following Toffoli gate implements the addition (bitcount operation) of the weighted inputs together with the sign activation function. The output is then saved in the ancillary qubit on wire 5. After the oracle Λ is called, the uncomputation of the neuron is performed (the circuit in the second blue box). A single phase accumulation cycle is collectively denoted by the gate N.
The full phase accumulation subroutine succeeds by repeating N for n = 4 times, where n is the number of different training inputs. New input data is initialized in the circuit by applying X gates on |a_1⟩, |a_2⟩, |a*⟩. In this example, the training set we adopt is {(a_1, a_2, a*)} = {(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)}. Due to the small number of inputs, we exhaust all four possible data points as the training data.
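Given this training set, the match counts N_i that the accumulation subroutine encodes can be tabulated classically. The gate-level activation is fixed by the circuit in figure 6; purely as an illustrative stand-in we use output = (w_1 AND a_1) XOR (w_2 AND a_2), a hypothetical choice (realisable, e.g., with two Toffoli gates targeting the output ancilla):

```python
from itertools import product

# Match counts N_i for the two-weight example, over the full training set
# {(a1, a2, a*)} given in the text. The neuron below is an illustrative
# stand-in, NOT the exact gate sequence of figure 6.

training = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
n = len(training)

def neuron(w, a):
    # hypothetical activation: XOR of the two weighted inputs
    return (w[0] & a[0]) ^ (w[1] & a[1])

counts = {w: sum(neuron(w, a) == target for a, target in training)
          for w in product((0, 1), repeat=2)}
print(counts)
```

Under this stand-in activation, exactly one weight string reaches N_i = n = 4; it would accumulate the phase e^{iπ} = −1 and be the unique target of the subsequent marking and amplification.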
As an illustrative example of the full quantum training algorithm consider figure 7. Referring to figure 7, we first identify the circuit's input qubits (all initialized in state |0⟩), from top to bottom. The very first qubit |0⟩_1 plays the role of a control for the binary marking oracle O_±1 and is only needed later in the training algorithm. As described in the main text, the oracle executes a binary classification of the weight strings. This is achieved by comparing the performance of the weight strings with a threshold N_τ. If the performance is above N_τ, then O_±1 flips the sign of the quantum state encoding the string.
The next three input qubits |0⟩_2, |0⟩_3, |0⟩_4 are the ancillary control systems for the PE. This means that in the example here the subroutine estimates t = 3 binary digits of the number φ in the phase e^{2πiφ} (up to some error described in the appendix when introducing the PE algorithm). The remaining qubits play the following roles: the multi-qubit register |0⟩_5 represents the weight register, which is transformed into the coherent superposition |W⟩ = (1/√Ñ) Σ_w |w⟩ of all weight strings. The last qubits |0⟩_6 are the remaining inputs (n pairs of training data and ancillas for saving the output of the NN). The training inputs and desired outputs are all initialized to zero. As we will see later based on concrete examples, the unitary action Ũ can be designed to include changing the training input qubits into the intended training pairs. The first subroutine in the blue box executes the PE. The PE algorithm first initializes the t = 3 control qubits in the |+⟩ state. Each qubit controls 2^j, j = 0, 1, 2 uses of the full marking unitary Ũ, acting on the full set of inputs to Ũ (see the main text for more details). The circuit implementation of Ũ is discussed in detail in appendix G. Then, the inverse quantum Fourier transform (QFT) converts the control ancillas into the state |φ_1 φ_2 φ_3⟩, which represents a truncated binary encoding of the factor N_i/(2n).

[Figure 6. Quantum circuit implementation of the phase accumulation sub-routine for the single neuron with two inputs and two weights. The first blue box implements the neuron. Then the Λ-oracle is applied to add a phase if the output is correct. In the second blue box the neuron is uncomputed. Next the inputs are altered by a Pauli X, followed by a repeat of neuron, Λ, uncompute neuron, labelled as N, and so on.]

[Figure 7. Circuit implementation for the full quantum training algorithm.]
After initialising the threshold oracle's workspace qubit (qubit 1) and the weight register (qubit group 5), the threshold oracle is fed the output of a PE circuit, which includes the neural net within the controlled unitary. The PE is then uncomputed and the Grover diffusion operator is applied. These steps after the initialisation, collectively called Q here, are then repeated a number of times.
Next, the red box implements the oracle O_±1. In case the integer N_i exceeds the threshold N_τ, the state is changed as |φ_1 φ_2 φ_3⟩ → −|φ_1 φ_2 φ_3⟩. In this specific example depicted in figure 7, the threshold is set to N_τ = n (namely, the optimal weight should be good for all the inputs). When converting to binary digits, this condition translates to φ_1 = 1, φ_2 = 0, φ_3 = 0. Therefore we apply a multi-controlled gate to realise the sign flip: only when |φ_1 φ_2 φ_3⟩ = |100⟩ does the controlled-NOT gate act on the ancilla. For this reason, the ancilla has been pre-set to |−⟩ by applying X and a Hadamard gate on it. In this case, |−⟩ → −|−⟩ if the threshold is met. Different thresholds can be implemented by switching between controls and anti-controls in figure 7 (whilst a controlled unitary U has the form |0⟩⟨0| ⊗ I + |1⟩⟨1| ⊗ U, an anti-controlled U, drawn with a white dot, has the form |1⟩⟨1| ⊗ I + |0⟩⟨0| ⊗ U). The resulting 8 possible O_±1 can be seen to implement the desired threshold by binary expansion of the corresponding |φ_1 φ_2 φ_3⟩. Example: N_τ = (3/4)n is associated with the 'threshold' state |011⟩, the binary encoding of (3/4)n/(2n) = 3/8, such that the top two dots are black and the lowest white. The full routine is thereafter uncomputed, with the purpose of decoupling the marked weight state |Ŵ⟩ = (1/√Ñ) Σ_i (−1)^{N_i ⩾ N_τ} |w_i⟩ from the other systems. The uncomputation of the PE is depicted by the second blue box. After the decoupling of the weight state, the marked superposition |Ŵ⟩ undergoes amplitude amplification: the diffusion operator D = −HΛ_0 H, Λ_0 = 2(|0⟩⟨0|)^⊗N − I, amplifies the amplitudes of the good weight strings and dampens the bad ones. The full cycle, marking and amplifying, is then collated into one box, called Q, which is repeated k* times, followed by a measurement of the weight register in the computational basis.
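The action of O_±1 on the phase register is a diagonal sign flip, which can be sketched directly (the circuit realises it with a multi-controlled NOT on an ancilla pre-set to |−⟩; the cut-off indices below follow from the binary encodings discussed above):

```python
import numpy as np

# Sketch of the threshold oracle O_+-1 on the t = 3 phase register: a diagonal
# operator flipping the sign of every basis state |k> whose bits encode a
# phase at or above the threshold. For N_tau = n the cut-off is k = 4 ('100',
# phi = 1/2); for N_tau = (3/4)n it is k = 3 ('011', phi = 3/8).

def threshold_oracle(t: int, k_min: int) -> np.ndarray:
    """Diagonal O_+-1: flip the sign of every |k> with k >= k_min."""
    diag = np.where(np.arange(2**t) >= k_min, -1.0, 1.0)
    return np.diag(diag)

O = threshold_oracle(3, 4)               # threshold N_tau = n
state = np.zeros(8)
state[4] = 1.0                           # the |100> phase-register state
print(O @ state)                         # the |100> component acquires a -1
```

This is only the algebraic effect; in the circuit the same diagonal arises from choosing controls and anti-controls according to the bits of the threshold state.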
This yields an optimal weight string, up to some errors induced by the precision t of the PE algorithm and δ of the binary search (see the appendix for more details).

Quantum training performance
While speed-ups over classical training algorithms were achieved in gradient-based algorithms utilizing quantum effects [35][36][37][43], convergence-related issues remained unsolved. Classically, failure of convergence, as well as convergence to locally optimal parameters rather than globally optimal ones, can only be addressed by global search techniques. However, global search worsens the efficiency problem rather than shortening run-times.
We show in the following that the quantum training algorithm presented in this work addresses all of these drawbacks of classical training methods. As we will see shortly, the algorithm trains classical FNNs with a provable speedup. Moreover, it overcomes prominent issues of gradient-based training methods, both classical and quantum.
As a performance measure, we firstly describe the time, and secondly the number of qubits required for the quantum training algorithm.
One way to quantify the time requirement for the training is to count (1) the number of calls to the NN and (2) the number of calls to the oracles involved in the training algorithm. Counting calls as a way of quantifying performance is reasonable, since the probability of a system being lost, or the amount of noise more generally, will often scale with the number of calls. Moreover, the cost to experimenters is often the amount of time the experiment takes, which may also scale with the number of calls.
There are two 'elementary' oracles. The oracle Λ compares two quantum states |ψ⟩, |φ⟩ in a binary way: they are either found to be equal or different from one another. In case of equality, Λ 'tags' one of the states by a label (by adding a fixed phase). The second oracle O_±1 functions as a threshold check: given a number q encoded into a quantum state |q⟩, and a threshold α encoded into |α⟩, the oracle O_±1 acts as |q⟩ → (−1)^{q ⩾ α} |q⟩. In other words, if the number q is found to be at least as large as the threshold α, then the oracle flips the sign of the quantum state encoding q.
In the quantum training presented above there are two separate contributions to the number of calls: (i) for each training input (a_i, a*_i) there is one call to the accumulation subroutine (consisting of one call to the comparing oracle and two NN calls: one call and one inverse call for the uncomputation), and there are n accumulation cycles in total (recall n is the number of training inputs), (ii) the quantum search protocol on the weight states is expected to take a number of calls scaling as O(√(Ñ/M)), with M being the estimated number of solutions and Ñ = 2^N being the total number of weight states. Hence, in terms of the number of training data pairs, n, and the total number of weight states Ñ, we call the combined Grover oracle

N_G = Σ_{i=1}^{log(n/δ)} (π/4)√(Ñ/M_i)   (9)

times. Here, M_i is the number of optimal weights for the ith loop in the binary search and M_i ⩾ 1. The sum in equation (9) comes from the binary search with precision δ described in the appendix. The comparing oracle is then called

N_C = 2n(2^t − 1) N_G ≈ 4n² log(n/δ) √Ñ   (10)

times (the factor of 2 comes from the uncomputation of the PE), and the QBNN is called 2N_C times. The last approximation in equation (10) follows from the fact that the estimated phases are given by N_i/(2n) and the smallest possible precision of N_i is δ = 1. In order to be able to give an estimate up to δ = 1, we require the tth binary digit in 0.φ^(i)_1 φ^(i)_2 … φ^(i)_t to be such that 2^{−t} = δ/(2n) = 1/(2n). This gives 2^t − 1 ≈ 2n. This does not take into account the small randomness in the PE algorithm itself, which can be avoided with a small number of extra ancillary qubits (see the appendix).
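Plugging in numbers, and taking the quoted scaling at face value, gives a feel for the resources (constant factors are indicative only; the PE register size follows from 2^{−t} = δ/(2n)):

```python
import math

# Resource-estimate sketch: t = ceil(log2(2n/delta)) PE ancillas, and roughly
# 4 * n**2 * log(n/delta) * sqrt(2**N) comparing-oracle calls. Illustrative
# constants only; see equations (9) and (10) for the derivation.

def resources(N: int, n: int, delta: float = 1.0):
    t = math.ceil(math.log2(2 * n / delta))               # PE register size
    calls = 4 * n**2 * math.log(n / delta) * math.sqrt(2**N)
    return t, calls

t, calls = resources(N=20, n=100)
print(t, f"{calls:.3g}")
```

For N = 20 weights and n = 100 training pairs, eight PE ancillas suffice, and the comparing-oracle count is dominated by the √(2^N) search factor.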
The number of qubits needed consists of (i) the number of qubits needed for the PE and (ii) the number of qubits needed for the QFBNN. For (i), the PE requires t = log(2n/δ) ancillary qubits, plus a small extra number (see the appendix), to estimate with precision δ of the binary search. For (ii), the number of qubits needed for the QFBNN, Q_QFBNN, consists of the number of qubits needed for the training inputs, the desired outputs, the N weight qubits, and the ancilla qubits. The following lemma is useful to calculate Q_QFBNN in terms of the number of weights and the number of input bits. Lemma 1: the number of input qubits Q_input, ancilla qubits Q_ancilla, weight qubits Q_weight and output qubits Q_output in the QFBNN satisfy a fixed relation determined by the network architecture; the proof is given in appendix C. Noting that Q_desired output = Q_output, we see that once the number of weights and the input/output size are determined, the number of qubits can be readily calculated.
Quantum training faster than classical. The outcome of the quantum training method discussed above is guaranteed with high probability to yield the globally optimal set of weights. To ensure classically that the globally optimal parameters are found would require a brute-force search over a list of size n × Ñ, where Ñ = 2^N, N is the number of weights in the network, and n is the number of data pairs in the training set. This list consists of all weight strings and input pairs with an assignment of a cost value. As each call of that list requires comparing desired outputs with actual outputs, such a brute-force search calls the comparing oracle N_C^cl = n × Ñ times. On the other hand, from equations (9) and (10), our quantum training calls the comparing oracle N_C^qm ≈ 4n² log(n/δ)√Ñ times, where δ represents the precision of the threshold in the binary search for globally optimal weights (see the binary search in the appendix). The ratio of N_C^cl and N_C^qm is

N_C^cl / N_C^qm ≈ √Ñ / (4n log(n/δ)).   (14)

Note that, for training to be practical, n should scale reasonably with the number of weights, and is thus expected to be bounded by a polynomial of N, at least in cases where the training data has benign scaling.
Learning theory suggests that a number of samples (the sample complexity) that is linear in the number of weights is optimal [57][58][59], and in some practical scenarios a rule-of-thumb of 10 training inputs per model variable (here, weight) is employed [60]. Thus equation (14) can be termed an exponential advantage in N for such cases. If n were to increase exponentially with N, our approach may not have an advantage, as N_C^qm ∝ n² whereas N_C^cl ∝ n. We are not aware of cases with exponential scaling of n in N. As an example of linear or sub-linear scaling, within the architecture of a two-hidden-layer feed-forward network that is enough to learn n training samples [61], we have the relation n < N (see appendix H for details). Therefore, for this example we have

N_C^cl / N_C^qm ≳ √Ñ / (4N log(N/δ)) = 2^{N/2} / (4N log(N/δ)).   (15)

The value of N is often large, in the order of thousands to millions [62], such that the quantum algorithm can exhibit dramatic reductions in training time relative to global classical search.
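With the rule-of-thumb n = 10N, the ratio of equation (14) can be evaluated directly; a hedged sketch (the 4n log(n/δ) denominator follows the call count quoted above, constants indicative only):

```python
import math

# Speedup-ratio sketch: classical brute force needs n * 2**N comparing-oracle
# calls versus roughly 4 * n**2 * log(n/delta) * sqrt(2**N) quantum ones,
# giving a ratio ~ sqrt(2**N) / (4 * n * log(n/delta)). With the rule-of-thumb
# n = 10*N training samples, the advantage grows exponentially in N.

def speedup(N: int, delta: float = 1.0) -> float:
    n = 10 * N                          # sample complexity linear in weights
    return math.sqrt(2**N) / (4 * n * math.log(n / delta))

for N in (20, 40, 60):
    print(N, f"{speedup(N):.3g}")
```

For small N the quantum overhead dominates and the ratio is below one; the advantage sets in as N grows toward the practically relevant sizes quoted above.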
Overcoming vanishing and exploding gradients. During the classical training, the weights are updated with respect to the change they cause in the cost. Binary activation functions are usually replaced by continuous approximations such as the sigmoid function. Yet, it can be shown that the gradient of the sigmoid function is bounded by the interval (0, 1/4]. Hence, in a network with $L$ layers, approximating the derivative $\partial C/\partial w$ of the cost function $C$ with respect to a weight $w$ in layer $\ell$ by back-propagation leads to a multiplication of $\gamma = L - \ell$ terms, all smaller than one: the gradient vanishes exponentially in $\gamma$. Conversely, in certain cases the derivatives can be very large and the gradient explodes. Both situations obstruct convergence and cause the training algorithm to fail.
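The exponential decay can be made concrete in a few lines. The sketch below multiplies $\gamma$ layer-local sigmoid derivatives, each evaluated at the sigmoid's steepest point where the derivative attains its maximum of 1/4 (the depth $\gamma = 50$ is an arbitrary illustration):

```python
import math

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), always in (0, 1/4].
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

gamma = 50    # number of layers between the weight and the output
grad = 1.0
for _ in range(gamma):
    grad *= sigmoid_grad(0.0)   # 0.25: the most favourable case
print(grad)  # 0.25**50 ~ 8e-31, i.e. the gradient has all but vanished
```

Even in this best case the back-propagated factor shrinks as $4^{-\gamma}$; away from the steepest point the decay is faster still.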
A common approach to address this issue is to use specific activation functions, such as the ReLU function, which mitigates the issue but does not resolve it in general [63]. Other solutions have been proposed, such as classical global search methods [26,64] and special architectures for NNs [65][66][67], but these approaches too were shown to suffer from significant drawbacks [24] or to fail to resolve the exploding gradient problem [25].
The quantum training method proposed in this work is a global search. It therefore finds globally optimal solutions while at the same time avoiding gradient-related issues. While improving on this problem classically is still an active field of research, the quantum search resolves it completely. This makes the quantum search a powerful option for training NNs whose parameters come from a finite set of values.

Summary and outlook
In this work we formulated a fully quantum algorithm to train classical feed-forward binary neural networks. The working principle behind the training algorithm is to establish a quantum superposition of all possible NNs for a given architecture, in the light of recent progress in the coherent control of processes [47][48][49]. The algorithm then singles out the networks with the best performance by amplifying their weightings in the superposition. This guarantees with high probability the convergence of the training to a network which fits the training data best.
The algorithm avoids two prominent issues of gradient-descent based training (both classical and quantum): (1) convergence towards a suboptimal choice of training parameters and (2) training failures due to vanishing or exploding gradients.
Further, a performance analysis of the algorithm is presented. Here, we derive analytical expressions for the resources (run time and number of qubits) needed to execute the algorithm. It is proven that the algorithm achieves a quadratic saving in training time compared to classical training algorithms with the same convergence guarantees.
Finally, numerical simulations for several instances of BNN training were given. The algorithm was shown to function correctly and the quantum speed-ups were observed in the simulations.
We see several experimental and theoretical possibilities for developments. Once larger and less noisy quantum computers can be built in specialised laboratories, these could use this kind of training to identify optimal BNNs off-line, which are then built and implemented classically on-line. Whilst implementing large networks experimentally is currently not feasible, training a single binary neuron, an elementary module of the network, with the methods described above requires approximately 13 qubits. This is within range of current state-of-the-art experiments, albeit with noise [68,69]. Error correction would of course add further experimental complexity, e.g. as in [70]. A theoretical question concerns the fact that the ability to find the globally optimal weights may also induce a risk of over-fitting. There are several classical methods for avoiding over-fitting, including dropping out certain layers at times during the training, restricting the number of weights, and weight regularisation techniques [71,72]. It thus seems well-motivated to build on this quantum training framework to explore analogous quantum techniques. For example, quantum versions of drop-out could involve dynamic quantum network architectures, e.g. using quantum control systems or teleportation between non-successive layers.
Finally, we note that our algorithm could be employed more generally to find globally optimal unitary dynamics, given a task and a finite set of candidate unitaries. The task itself is encoded in a set of training data and might be unknown to the experimenter. For instance, one might think of an unknown device which is tested by sending in a list of inputs and recording the corresponding outputs. The algorithm in this work then identifies which of the candidate unitary dynamics best fits the data. From this perspective the algorithm appears promising for tasks like hypothesis testing [73], regression [74,75] and causal discovery [76].

Code availability
The numerical implementations were done via Huawei's quantum computing cloud platform HiQ (hiq.huaweicloud.com) and the open-source version of our code is available at www.github.com/phylyd/QBNN.

Author contributions
All authors contributed substantially to the research presented in this paper and to the preparation of the manuscript.
Figure A1. Circuit for the quantum PE. The $t$ ancillary qubits act as control systems for the unitary $K$. The system input $|u\rangle$ is the eigenvector corresponding to the eigenvalue $e^{2\pi i\varphi}$ of $K$. The $l$-th ancilla applies $K$ $2^{l-1}$ times on the system. After the in total $2^t - 1$ controlled applications of $K$, the control systems are transformed by an inverse QFT $F_n^{-1}$, which yields a binary encoding of the number $\varphi$.

Appendix A. Preliminaries: Grover search, quantum phase estimation
We commence with a brief introduction to Grover search and quantum phase estimation.
Grover search. Grover's algorithm amplifies the amplitudes of the $M$ marked solutions among $N$ candidates by alternating calls to an oracle and a diffusion operation. The optimal number of iterations of the oracle and the diffusion before the concluding measurement is given by $k^* \approx \frac{\pi}{4}\sqrt{N/M}$ [28]. It is, hence, crucial to have knowledge about $M$, as stopping at the wrong time may result in a random output rather than a valid solution.
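A minimal numerical check of this stopping rule, using the textbook expression $\sin^2((2k+1)\theta)$ with $\theta = \arcsin\sqrt{M/N}$ for the success probability after $k$ Grover iterations (this closed form is standard, not derived in this appendix):

```python
import math

def success_prob(N, M, k):
    # Probability of measuring a marked item after k Grover iterations.
    theta = math.asin(math.sqrt(M / N))
    return math.sin((2 * k + 1) * theta) ** 2

def k_star(N, M):
    # Optimal iteration count k* ~ (pi/4) * sqrt(N/M).
    return (math.pi / 4) * math.sqrt(N / M)

N, M = 64, 7   # sizes of the 2-2-1 example in appendix G
print(k_star(N, M))           # ~ 2.37
print(success_prob(N, M, 2))  # near 1 when stopping close to k*
print(success_prob(N, M, 5))  # overshooting k* lowers the probability again
```

The last line illustrates why knowledge of $M$ matters: iterating past $k^*$ rotates the state away from the solution subspace.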
Quantum phase estimation. Suppose $|u\rangle$ is an eigenvector of a unitary operator $K$, with eigenvalue $e^{2\pi i\varphi}$. To estimate $\varphi$ up to some error $\epsilon$, we initiate $t$ qubits in the state $|0\rangle$. The number $t$ is determined by the precision of the estimation of $\varphi$, as well as the probability of success of the estimation. It can be shown [54] that for a PE with probability of error $\epsilon$, in order to obtain the phase $N_i/(2n)$ accurate to $m$ bits, we need a precision (and number of ancillary control systems)

$$t = m + \left\lceil \log_2\!\left(2 + \frac{1}{2\epsilon}\right)\right\rceil.$$

The PE algorithm proceeds in two steps and is depicted in figure A1.
Step 1. Apply a Hadamard gate to each of the $t$ qubits in the first register, yielding the state $|+\rangle^{\otimes t}$. Then, the first ancilla acts as a control for $2^0 = 1$ use of the unitary operator $K$, which is applied to $|u\rangle$. The second ancilla acts as a control for $2^1 = 2$ uses of $K$, and so on. In general, the $l$-th ancilla acts as a coherent control for $2^{l-1}$ uses of $K$. In the end, the collective output state $|T\rangle$ of the $t$ ancillas reads [54]

$$|T\rangle = \frac{1}{\sqrt{2^t}} \sum_{k=0}^{2^t - 1} e^{2\pi i \varphi k} |k\rangle.$$

Step 2. Apply the inverse QFT $F_n^{-1}$ in order to convert $\varphi$ of the state $|T\rangle$ into a $t$-qubit register. More precisely, the inverse QFT acts as

$$F_n^{-1} |T\rangle = |\varphi_1 \varphi_2 \ldots \varphi_t\rangle,$$

where the phase to be estimated is assumed to have the binary form $\varphi = 0.\varphi_1 \varphi_2 \ldots$, i.e. $\varphi_j$ is the $j$-th binary digit of the number $\varphi$. It is important to remark that, with the PE algorithm being unitary, it acts linearly on superpositions $\sum_i \alpha_i |u_i\rangle$ of eigenvectors $|u_i\rangle$ [54].
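For a small register, the two steps can be simulated directly as state vectors. The sketch below assumes the standard convention $e^{2\pi i\varphi}$ for the eigenvalue and exploits that `np.fft.fft` uses the kernel $e^{-2\pi i jk/T}$, which is exactly the inverse QFT up to normalisation:

```python
import numpy as np

def phase_estimation_probs(phi, t):
    """Outcome distribution of PE with t ancillas on an eigenstate of
    eigenvalue exp(2*pi*1j*phi); outcome j estimates phi ~ j / 2**t."""
    T = 2**t
    k = np.arange(T)
    # Register state after Step 1 (Hadamards + controlled powers of K):
    state = np.exp(2j * np.pi * phi * k) / np.sqrt(T)
    # Step 2: inverse QFT; fft applies sum_k e^{-2 pi i jk/T}, renormalise.
    amps = np.fft.fft(state) / np.sqrt(T)
    return np.abs(amps) ** 2

probs = phase_estimation_probs(phi=3 / 8, t=3)  # phi = 0.011 in binary
print(np.argmax(probs))  # 3, i.e. the register reads |011> and phi = 3/8 exactly
```

When $\varphi$ is not exactly representable in $t$ bits, the distribution concentrates on the nearest $t$-bit approximations instead of a single outcome, which is why the error budget $\epsilon$ above enters the choice of $t$.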

Appendix B. Binary search for globally optimal solution
We now present in detail the binary search algorithm to train a QFBNN without knowing the number of good parameter configurations. For a fixed precision $t$ of the PE subroutine, the number of solutions $M$ depends on the training set and is generally unknown. Indeed, there might be multiple marked weight strings, or even none above the threshold $N_\tau$. To resolve this problem, we propose a binary search on the threshold intervals as a subroutine to ensure, up to the precision of the search and up to imprecisions in the PE algorithm, that the optimal solution is marked and amplified. For a solution $w^*$ to the training, the possible range of the number of matches $N^*$ in the accumulative marking scheme is initially an integer in the interval $[0, n]$. At every step of the search, the possible range is cut evenly into two parts. Thus the size of the possible range goes as $n 2^{-i}$ at the $i$-th step of the binary search. The search stops when the size of the possible range reaches a small enough number which quantifies our desired precision, and which we call $\delta < n$; i.e. the search runs until $i$ is large enough so that $n 2^{-i} \leq \delta$. To determine whether the updated possible range is the lower or upper half of the cut, we mark the weight strings in the upper half by $-1$ using the procedures described above. If after the search the measured weight string is in the upper half, this becomes the new possible range. Otherwise, the possible range becomes the lower half. Thus eventually this binary search identifies, with high probability, an $N^*$ which is within $\delta$ of the global optimum. This takes $\lceil\log(n/\delta)\rceil$ steps, which scales efficiently in $n$.
• For $i \leq \lceil\log(n/\delta)\rceil$, loop the following steps:
  * Choose an integer $s \in [1, m - 1]$ uniformly at random.
  * With the current $N_\tau$, apply $s$ iterations of the full training cycle from figure 4. Measure the weight state and obtain a candidate solution $w^*$.
  * Run the classical BNN on $w^*$ only, to obtain $N^*$, the number of times $w^*$ gave correct outputs.
  * If $N^* \geq N_\tau$, the observed $w^*$ is indeed a solution for the current $N_\tau$, and there might be better solutions for larger $N_\tau$; so update $N_\tau \to N_\tau + \Delta N_i/2$, where $\Delta N_i = n2^{-i}$, and go to the next step to update $i$. Else (i.e. $N^* < N_\tau$), set $m$ to $\min(\mu m, \sqrt{\tilde N})$, and go back to step (i). Time-out rule for this inner loop on $m$: when the loop of steps 1-4 continues to the point that the full training cycle has been executed $O(\sqrt{\tilde N})$ iterations in total, but still $N^* < N_\tau$, this means there is no solution for the current $N_\tau$; then update $N_\tau \to N_\tau - \Delta N_i/2$, where $\Delta N_i = n2^{-i}$, and go to the next step to update $i$.
  * Update $i$ as $i \to i + 1$.
• Output the last value of $w^*$ that satisfies $N^* \geq N_\tau$ and end the training. This $w^*$ is the optimal set of weights the algorithm found. The above search homes in on a solution that is globally optimal up to the precision $\delta$.
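The outer loop of this search can be sketched classically by abstracting one full round of quantum training plus classical verification into a single predicate. In the sketch below, `has_solution(tau)` is a hypothetical stand-in that reports whether some weight string reaches at least `tau` matches; in the actual algorithm this answer is produced by the Grover loop and the classical check of the measured $w^*$:

```python
def threshold_binary_search(n, delta, has_solution):
    # Possible range for the optimal match count N*, initially [0, n].
    lo, hi = 0.0, float(n)
    # Halve the range until its size n * 2**-i drops to the precision delta.
    while hi - lo > delta:
        mid = (lo + hi) / 2
        if has_solution(mid):
            lo = mid   # a weight string clears the threshold: raise it
        else:
            hi = mid   # nothing clears the threshold: lower it
    return lo          # within delta of the true optimum N*

# Toy stand-in: pretend the (unknown) optimum is N* = 13 out of n = 16.
estimate = threshold_binary_search(16, 1, lambda tau: tau <= 13)
print(estimate)  # 13.0, reached after log2(16/1) = 4 halvings
```

The loop body runs $\lceil\log_2(n/\delta)\rceil$ times, matching the step count stated above.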

Appendix C. Equality for the number of qubits needed in the QBNN
To calculate the total number of qubits needed by the algorithm, which we derived in equation (13), we made use of the following lemma:

$$Q_{\mathrm{input}} + Q_{\mathrm{ancilla}} = Q_{\mathrm{weight}} + Q_{\mathrm{output}}, \tag{C.1}$$

where $Q_{\mathrm{input}}$ is the number of input qubits, $Q_{\mathrm{ancilla}}$ the number of ancilla qubits, $Q_{\mathrm{weight}}$ the number of weight qubits and $Q_{\mathrm{output}}$ the number of output qubits. Here, we derive this equality and illustrate it with two examples.
Figure C1. A single neuron with two inputs. The inputs are $|a_1\rangle, |a_2\rangle$, the two weights are $|w_1\rangle, |w_2\rangle$ and there is an ancilla input in $|0\rangle$.
Figure C2. The case of 3 neurons in 2 layers. $U_1$, $U_2$ and $U_3$ are the neurons. All neurons have two inputs and two associated weights. There are three ancilla qubits, initialised as $|0\rangle_1, |0\rangle_2, |0\rangle_3$.
Let us begin with the elementary case of a single neuron, depicted in figure C1 below. It is straightforward to count and verify the equality in equation (C.1), by noting that Q ancilla = Q output = 1 and that every input comes on an edge with an assigned weight, i.e. Q input = Q weight . Thus equality (C.1) holds for the case of a single neuron.
Next, we consider a more involved network with three neurons in two layers, see figure C2. Direct counting verifies that equation (C.1) also holds here. The reason in general is that in the first layer the input qubits and weight qubits always come in pairs, i.e.

$$Q_{\mathrm{input}} = Q_{\mathrm{weight}}^{(\mathrm{in})},$$

where the superscript denotes that the count is for the weights in the first layer.
In the hidden layers, the ancilla qubits from the previous layer, which carry the outputs of the neuron computations, together with the fan-out ancilla qubits, serve as inputs for the next layer, and each of them is paired with a weight qubit of the hidden layers.
Let the number of weight qubits $Q_{\mathrm{weight}}$ be broken into two parts, those associated with the input layer and the rest, associated with the hidden layers:

$$Q_{\mathrm{weight}} = Q_{\mathrm{weight}}^{(\mathrm{in})} + Q_{\mathrm{weight}}^{(\mathrm{hidden})}.$$

For example, in figure C2, $Q_{\mathrm{weight}} = 6$, $Q_{\mathrm{weight}}^{(\mathrm{in})} = 4$ and $Q_{\mathrm{weight}}^{(\mathrm{hidden})} = 2$. Similarly, let the number of ancilla qubits $Q_{\mathrm{ancilla}}$ be broken into three parts, those associated with the input layer, those associated with the hidden layers, and those associated with the output layer:

$$Q_{\mathrm{ancilla}} = Q_{\mathrm{ancilla}}^{(\mathrm{in})} + Q_{\mathrm{ancilla}}^{(\mathrm{hidden})} + Q_{\mathrm{ancilla}}^{(\mathrm{out})}.$$
For example, in figure C2, $Q_{\mathrm{ancilla}} = 3$, $Q_{\mathrm{ancilla}}^{(\mathrm{in})} = 2$, $Q_{\mathrm{ancilla}}^{(\mathrm{hidden})} = 0$ and $Q_{\mathrm{ancilla}}^{(\mathrm{out})} = 1$. In general, some of $Q_{\mathrm{ancilla}}^{(\mathrm{hidden})}$ may be associated with fan-out gates, copying classical information to send to several neurons in later layers.
For the output layer, each output qubit corresponds to an ancilla qubit in this layer:

$$Q_{\mathrm{output}} = Q_{\mathrm{ancilla}}^{(\mathrm{out})}.$$

All ancillas except the output ancillas are paired up with weight qubits of the hidden layers, as they turn into inputs to hidden layers and each input is paired with a weight. Accordingly,

$$Q_{\mathrm{ancilla}}^{(\mathrm{in})} + Q_{\mathrm{ancilla}}^{(\mathrm{hidden})} = Q_{\mathrm{weight}}^{(\mathrm{hidden})}.$$

Altogether, we see that the equality

$$Q_{\mathrm{input}} + Q_{\mathrm{ancilla}} = Q_{\mathrm{weight}} + Q_{\mathrm{output}} \tag{C.7}$$

follows for a general QBNN according to our design. Equation (C.7) is the lemma we wished to prove.
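The lemma can be checked mechanically. The helper below simply evaluates equation (C.1) on explicitly supplied counts, using the two figures of this appendix as test cases:

```python
def lemma_holds(q_input, q_ancilla, q_weight, q_output):
    # Equation (C.1): Q_input + Q_ancilla = Q_weight + Q_output
    return q_input + q_ancilla == q_weight + q_output

# Single neuron with two inputs (figure C1):
assert lemma_holds(q_input=2, q_ancilla=1, q_weight=2, q_output=1)
# Three neurons in two layers (figure C2):
assert lemma_holds(q_input=4, q_ancilla=3, q_weight=6, q_output=1)
```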

Appendix D. Standard classical training of neural networks: gradient descent
A widely used approach in training NNs is to approach the minimum cost through the method of gradient descent. Here, each weight $w$ is updated between the $t$-th iteration of the NN and the next one as

$$w^{(t+1)} = w^{(t)} - \eta \frac{\partial C}{\partial w}.$$

Above, $\eta$ is a positive number, usually referred to as the learning rate. The direct evaluation of the term $\partial C/\partial w$ is computationally expensive: approximating the derivatives numerically requires looping over all network inputs. In addition, even if the values of the network parameters and inputs are restricted (e.g. for BNNs), commonly the weights are kept real-valued and only discretised after the evaluation of the partial derivatives [15,77]. Hence, the improvements in memory consumption and execution speed are missing in the training stage of BNNs.
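A minimal sketch of this update rule on a toy one-parameter cost, with $\partial C/\partial w$ approximated by a central finite difference; note that for a real NN each of the two cost evaluations would itself loop over all training inputs (all names below are illustrative):

```python
def numerical_grad(cost, w, eps=1e-6):
    # Central finite difference: two full cost evaluations per derivative.
    return (cost(w + eps) - cost(w - eps)) / (2 * eps)

def gd_step(cost, w, eta=0.1):
    # One gradient-descent update: w <- w - eta * dC/dw.
    return w - eta * numerical_grad(cost, w)

cost = lambda w: (w - 3.0) ** 2   # toy cost with its minimum at w = 3
w = 0.0
for _ in range(100):
    w = gd_step(cost, w)
print(w)  # converges towards 3.0
```

On this convex toy cost the iteration converges to the global minimum; the paragraphs below explain why the same dynamics can get trapped in local minima on realistic cost landscapes.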
Moreover, gradient descent based training algorithms have the disadvantage of converging towards local minima in the cost landscape, rather than global ones. It was shown that locally optimal solutions to the training algorithm can be strongly sub-optimal [21]. To find a global minimum is in general NP-hard, consistent with the exponential-in the number of neurons-number of weight configurations [18,78].
There are several methods to simplify the evaluations in gradient descent. The most prominent one is back-propagation. Here, the evaluation of the gradients is approximated: while the value $C(\{a, a^*\})$ depends on the full history of weights between the layers, back-propagation computes gradients relying only on the latest input of each neuron rather than the full processing history of the network. Such an evaluation gives a chain of layer-wise derivatives which, multiplied with one another, often leads to good approximations of the gradient of $C(\{a, a^*\})$. Yet, back-propagation is still subject to slow training times, such that for deep networks training is usually done on supercomputers before distributing the NN to the end user. Further, back-propagation suffers from gradient-related problems such as the ones mentioned above. This can be seen easily: as the derivatives are approximated through a product of several local derivatives, the direction of convergence remains the same, leaving the issue of finding local minima open. Moreover, if exploding or vanishing gradients occur, then the evaluation of the local gradients suffers from the same problems.

Appendix E. Quantum embedding of classical neural networks
A classical neuron implements a function which takes multiple inputs and produces a single output. This many-to-one mapping makes the processing irreversible, which is a priori not compatible with the reversible dynamics of quantum computing. However, the mapping can be generalised to a quantum neuron in the same way that quantum computing generalises classical computing, along the paradigm of [35]. To define the quantum generalisation, first, classical gates are extended to reversible gates, which are a subset of quantum unitaries. Afterwards, general unitaries are allowed for. In this section we illustrate the quantum extension of CBNs with a concrete example.
Finding a reversible extension. We consider the most common instance of binary neuron processing, where each CBN $j$ implements the Heaviside activation function $H(x)$ with activation threshold $\alpha_j$. More precisely, for a vector $\mathbf{x}$ and a threshold $\alpha_j$, the activation proceeds as

$$H(\mathbf{x}) = \begin{cases} 1, & \text{if } \sum_i x_i \geq \alpha_j \\ 0, & \text{else.} \end{cases}$$

The corresponding CBN is depicted below in figure E1.
Figure E1. Functioning of a CBN. An XNOR operation multiplies the inputs $a_1, a_2$ with the corresponding weights $w_1, w_2$. The resulting values $s_1 = w_1 a_1$, $s_2 = w_2 a_2$ are forwarded to the bitcount, which outputs $s = s_1 + s_2$. Finally, the sign function $f(s)$ determines the activation of the neuron.
For the instance depicted in the figure, the neuron takes the input $a = (a_1, a_2)$ with the two components $a_1$ and $a_2$, and has weights $w_1, w_2 \in \{1, -1\}$ on the edges. An XNOR gate multiplies the inputs with the weights, yielding $s_i = w_i a_i$, $i = 1, 2$. Afterwards, the operation bitcount sums the results to the value $s = w_1 a_1 + w_2 a_2$. Finally, the sign function $f(s)$ determines the activation value of the neuron. Now, we introduce ancillary inputs and outputs for each operation in the CBN, such that the number of inputs coincides with the number of outputs. These ancillas carry the output values of the operations, while the inputs are preserved. The result is a reversible embedding $U_{\mathrm{XNOR}}$ of XNOR and $U_{\mathrm{bit+}}$ of bitcount, see figure E2. Here, $U_{\mathrm{bit+}}$ includes both the bitcount operation and the activation through the activation function $f$.
Unitary embedding of reversible binary neurons. The reversible logical gates in figure E2, with inputs $x_i$ and outputs $y_i$, can be implemented as quantum unitaries $U = \sum_i |y_i\rangle\langle x_i|$. This defines a quantum binary neuron (QBN). In order to find a quantum circuit implementation of a QBN, we decompose the operations $U_{\mathrm{XNOR}}$ and $U_{\mathrm{bit+}}$ into elementary quantum gates.
The multiplication of the input and weight can be achieved by a controlled-NOT (CNOT) gate on each input $|a_i\rangle$, controlled by the state of the corresponding weight $|w_i\rangle$ (see figure E3). It can easily be seen from the truth tables of the XNOR and CNOT gates that the operations indeed coincide. The gate $U_{\mathrm{bit+}}$ is realised by the Toffoli gate [79] on the ancilla, where the states $|s_1\rangle$ and $|s_2\rangle$ act as control systems.
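The coincidence of the truth tables is a two-line check. The sketch below assumes the encoding bit $b \leftrightarrow$ value $(-1)^b$ for both inputs and weights (this particular encoding is our assumption, not stated explicitly in the text); under it, the CNOT output bit encodes exactly the $\pm 1$ product $w \cdot a$:

```python
def cnot(control, target):
    # CNOT flips the target bit iff the control bit is 1.
    return control, target ^ control

for w_bit in (0, 1):
    for a_bit in (0, 1):
        _, s_bit = cnot(w_bit, a_bit)
        product = (-1) ** w_bit * (-1) ** a_bit  # multiplication of +-1 values
        assert (-1) ** s_bit == product          # CNOT realises the product
```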
Quantum binary neurons can form networks. So far, we proposed a unitary extension of a classical neuron. A NN consists of multiple interlinked neurons. Hence, the quantumly extended NN acts, overall, unitarily on inputs as well. Generalising figure E3, a quantum feed-forward binary neural network (QFBNN) takes a set of $N$ weights $\{w_1, w_2, \ldots, w_N\}$ as inputs, together with a set of data $a = (a_1, a_2, \ldots, a_p)$. The set $\{w_1, w_2, \ldots, w_N\}$ denotes all weights in the network, including the input layer, hidden layers and output layer. The set of data $a = (a_1, a_2, \ldots, a_p)$ assumes that there are $p$ neurons in the input layer (with assigned weights $w_1, w_2, \ldots, w_p$). Each of the $r$ neurons in the last layer of the QFBNN outputs one value, leading to the overall outputs $a' = (a'_1, a'_2, \ldots, a'_r)$. Then, depicting the action of the QFBNN as a single unitary $U$, the fully quantum training method introduced in the main text generalises straightforwardly from a single neuron to general QFBNNs. The only change that requires some more explanation is the comparison oracle $\Lambda$. The direct generalisation would be to define the action as

$$\Lambda(|a'\rangle|a^*\rangle) = \begin{cases} e^{i\pi/n} |a'\rangle|a^*\rangle, & \text{if } a' = a^* \\ |a'\rangle|a^*\rangle, & \text{else.} \end{cases} \tag{E.2}$$

Other generalisations, which only require a certain number $r' \in [0, r]$ of outputs to coincide, are possible as well.
When generalising from a single neuron to a FNN, the output values of the neurons have to be copied and distributed to each neuron in the next layer. While classically this is trivial to do, quantum mechanics prohibits the exact copying of arbitrary data, a restriction known as the no-cloning theorem [80]. For classical BNNs, this issue is resolved by applying CNOT gates to the output of each neuron, one for each copy operation. The CNOT acts on an ancillary qubit initialised in $|0\rangle$, controlled by the output qubit $|a\rangle$ of the neuron. Hence, for bit-valued $a$ the CNOT gate acts as

$$|a\rangle|0\rangle \mapsto |a\rangle|a\rangle.$$

In total, from one layer with $\ell_1$ neurons to the next with $\ell_2$ neurons, we need $\ell_1 \times \ell_2$ ancillas. When the overall state has coherence, the CNOT gates in general do not produce perfect copies but rather entanglement between the states and the ancillas carrying the output values. We call this imperfect spread of information the fan-out operation [35].
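Both behaviours, exact copying of classical bit values and entanglement instead of copies on superpositions, can be verified with a two-qubit state-vector sketch:

```python
import numpy as np

# CNOT with the first qubit as control, in the basis |00>, |01>, |10>, |11>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

# Classical bit values are copied exactly: |a>|0> -> |a>|a>.
for a in (ket0, ket1):
    assert np.allclose(CNOT @ np.kron(a, ket0), np.kron(a, a))

# A superposition is not copied but entangled:
# (|0>+|1>)|0>/sqrt(2) -> (|00>+|11>)/sqrt(2), not a product of |+> states.
plus = (ket0 + ket1) / np.sqrt(2)
fanned = CNOT @ np.kron(plus, ket0)
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
assert np.allclose(fanned, bell)
assert not np.allclose(fanned, np.kron(plus, plus))
```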

Appendix F. Alternative approach to quantum training algorithms: counter registers
Quantum training via PE includes an overhead of neuron calls that scales exponentially in the precision $t$ of the estimation. There is an alternative approach which is sensitive to the size of the training set but avoids that overhead: rather than accumulating phases, the number of matches between the QFBNN outputs and the desired outputs can be encoded directly into a multi-qubit register. The algorithm is similar to the training cycle presented in the results, up to some changes: in figure 2, the oracle $\Lambda$ adding incremental phases to good weight strings is replaced by a new oracle $O_{\mathrm{count}}$, acting on a $\lceil\log(n+1)\rceil$-qubit counter register as

$$O_{\mathrm{count}}: |h\rangle \mapsto |h + 1\rangle \ \text{if the QFBNN output matches the desired output,} \quad |h\rangle \mapsto |h\rangle \ \text{otherwise.}$$

The number $h$ of matches is then encoded in binary into the qubits. Initially the counter is set to $|0\rangle$ and increased coherently during the marking. This way, after the full set of training data, the output weight state reads

$$\frac{1}{\sqrt{\tilde N}} \sum_i |w_i\rangle |N_i\rangle,$$

where $N_i$ is the number of matches for the weight string $w_i$. Next, we again define a threshold count $N_\tau$ and a new oracle $O_{\pm 1}$. If the condition $N_i \geq N_\tau$ holds, the oracle marks the corresponding weight string $w_i$ by adding a factor of $-1$. Otherwise, the string is left invariant. The resulting state reads

$$\frac{1}{\sqrt{\tilde N}} \sum_i (-1)^{[N_i \geq N_\tau]} |w_i\rangle |N_i\rangle.$$

In order to decouple the weight states from the counter register, the counter needs to be reset, in analogy with the uncomputation of the PE in the main text. To maintain coherence, such a reset needs to be unitary. This can be achieved by uncomputing the $n$ rounds of comparing the QBN outputs with the desired outputs.
The weight state then decouples from the counter, and the same amplification procedure as for the training algorithm with PE can be applied. The choice of $N_\tau$ can be found by adapting the binary search presented in appendix B. The alternative training algorithm with counter registers has the advantage of avoiding the exponential overhead in the number of neuron calls of the PE. However, there is one fundamental drawback: for the training with PE we extracted the fraction $N_i/(2n)$, $N_i \in [0, n]$, whereas here we would encode the absolute number of counts $N_i$. The number of training data pairs $n$ is typically large. Hence, while small numbers $t$ of ancilla qubits are sufficient for the PE subroutine, we need a large register to encode the number of matches. Such an overhead could in practice be the bottleneck for implementations on real quantum computers.
Figure G3. Circuit for oracle $\Lambda$. The first qubit is an output to be compared with the second qubit, which is the desired output. An ancilla system, the third qubit, is also used. In circuit (i), controlled-Xs (CNOTs) are applied, followed by a controlled phase gate and uncomputation of the CNOTs. Circuit (ii) is an alternative, equivalent representation of circuit (i).
Figure G4. Training data used for the example of a single neuron with 3 weights and 3 inputs. Each training tuple consists of three inputs $a_1, a_2, a_3$ and one desired output $a^*$. The aim of the training is to find weight configurations that can generate the desired output from the corresponding inputs. We consider two independent tasks (enumerated by 1 and 2) and study the training according to the data.
Figure G5. Task 1: the evolution of the probabilities to measure the weight strings. For the first task there exists only one optimal weight configuration, namely (000). The vertical bars present the probabilities of the weight configurations for each iteration of the full training cycle.
The probability of the optimal weight is amplified and reaches its maximum after two iterations, in accordance with the optimal stopping time $k^* = (\pi/4)\sqrt{8} \approx 2.22$ of the quantum search.
* Apply the controlled phase gate $|1\rangle\langle 1| \otimes e^{i\phi}\mathbb{I} + |0\rangle\langle 0| \otimes \mathbb{I}$. This adds the phase $e^{i\phi}$ if the ancilla is in state $|1\rangle$.
* Uncompute the CNOTs, i.e. redo (a) and (b). Now the ancilla is certainly in state $|1\rangle$ again.
Next, the action of the QBN is uncomputed (the circuit in the second dashed blue box). Finally, the full phase addition is repeated $n$ times, where $n$ again denotes the number of training input data. For readability, we only depict two data points $\{(a_1, a_2, a_3, a^*)\} = \{(0, 0, 0, 0), (1, 0, 0, 0)\}$ of the training set in the figure and use an ellipsis to indicate that the loop indeed runs over all remaining training pairs. The two independent sets of full training data considered for this architecture are specified in figure G4.
During the loops of the full training (see figure 3 in the main text), the probabilities of the weight strings change. Here we present the probability evolution of the weight strings in the quantum training, with the highest $N_\tau$ for which there exists at least one optimal weight string: for task 1, figure G5 shows the probabilities of the weight strings evolving over the training cycles. Correspondingly, figure G6 illustrates the evolution for task 2.
G.2. Three-layer five-neuron network with 6 weights, 2 inputs and 1 output
Next, we study the circuit implementation of the phase accumulation subroutine for a network with five neurons and layer configuration 2-2-1. This is depicted in figures G7 and G8. In total, there are two inputs $|a_1\rangle, |a_2\rangle$ and six weight states $|w_1\rangle, |w_2\rangle, \ldots, |w_6\rangle$, leading to $2^6 = 64$ possible weight strings $|w_1 w_2 \ldots w_6\rangle$. The two dashed blue boxes show the circuit for the computation and uncomputation of the NN.
Figure G6. Task 2: the evolution of the probabilities to measure the weight strings. For the second task there are two optimal weight configurations: (000) and (101). The probabilities of the optimal weights are amplified and reach their maximum after the first iteration, in accordance with the optimal stopping time $k^* = (\pi/4)\sqrt{8/2} \approx 1.57$.
Figure G7. Five-neuron network with layer configuration 2-2-1. The network takes two inputs and has in total 6 weights.
Figure G8. Circuit implementation for a network with layer configuration 2-2-1, with 2 inputs and 6 weights. The qubits indexed 1-4 are the weights for the first layer and the qubits with index 5 and 6 are the weights for the second layer. Qubits with index 7-10 are 4 training inputs duplicated from the 2 inputs in the first layer, for the fan-out to the second layer. To save space in the diagram, the fan-outs in the first layer take place to the left, outside the diagram. Thus the inputs are already copied when they enter the depicted circuit. Qubits 11 and 12 are ancillas storing the outputs within the QBNN. Qubit 13 encodes the desired outcome. The dashed blue box depicts the action of the QBNN. After applying the oracle $\Lambda$, the action of the QBNN is uncomputed. The circuit depicts one round of marking for one data point in the training set, for the phase accumulation process.
For the QBNN in this example, we use the training set $\{(a_1, a_2, a^*)\} = \{(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)\}$. In figure G9, we present the probability evolution of the weight strings in the quantum training, with the highest $N_\tau$ for which there exists at least one optimal weight string.
Figure G9. The vertical bars present the probabilities of the weight configurations for each iteration during the training. The probabilities of the optimal weights are amplified over the iterations and reach their largest value after two iterations. This result agrees with the expected stopping point given by $(\pi/4)\sqrt{N/M}$ to achieve the largest amplification, which is $(\pi/4)\sqrt{64/7} \approx 2.37$ for this instance.

G.3. Three-layer six-neuron network with 8 weights, 3 inputs and 1 output
We conclude the set of examples with a six-neuron network in configuration 3-2-1, with 3 inputs and 8 weights (see figure G10). In total, this network uses 23 qubits. The circuit implementation of a single cycle of phase accumulation is shown in figure G11. We trained the QBNN in this example on the same two tasks presented in figure G4 above. For task 2, we present the probability evolution of the weight strings in figure G12. The threshold $N_\tau$ is set such that there exists a single solution of weights to the training algorithm.
For this example there are 8 optimal weight configurations among the total of $2^8 = 256$ configurations. The horizontal plane presents all 256 weight configurations. Each point in the grid represents one weight configuration, with the two axes indicating the weights in the two layers respectively. The vertical bars present the probabilities of all the weight configurations for each iteration during the training. The probabilities of the optimal weights are amplified over the iterations and reach their maximum at iteration 4, in agreement with the expected stopping point $(\pi/4)\sqrt{256/8} \approx 4.44$. Figure G13 shows the probability of obtaining an optimal weight string as a function of the training cycle, for both tasks. Recall the comparison between our quantum training and classical training stated in the performance analysis in the main text. Classical global searches call the comparing oracle $N_C^{\mathrm{cl}} = n \times \tilde N$ times, whereas our quantum training algorithm only needs approximately $N_C^{\mathrm{qm}} = N_G \times 2n \times (2n - 1)$ calls, with $N_G = \log(n/\delta)\sqrt{\tilde N}$ and $\delta$ being the precision of the binary search to single out optimal weight configurations (see the appendix).
Figure G10. NN with 6 neurons in layer configuration 3-2-1.
Figure G11. Circuit implementation for a network with layer configuration 3-2-1. Qubits 1-6 are the weights for the first layer and qubits 7 and 8 are the weights for the second layer. Qubits 9-14 are the 6 training inputs, duplicated from 3 inputs in the first layer, for the fan-out to the second layer. (To save space in the diagram, the fan-outs in the first layer take place to the left, outside the diagram. Thus the inputs are already copied when they enter the depicted circuit.) Qubits 15 and 16 are ancillas storing the outputs of the neurons within the QBNN and qubit 17 encodes the output of the QBNN. Qubit 18 encodes the desired output. The dashed blue box contains the action of the QBNN, consisting of weighting the inputs with the weights and adding up the weighted inputs together with the subsequent activation function (the gate implementation of the addition and activation is included in the dashed red box). After applying the oracle $\Lambda$, the action of the QBNN is uncomputed (the circuit in the second dashed blue box). The entire circuit depicts one round of marking for one data point in the training set, for the phase accumulation process.
We begin with task 1. Inserting the values $n = 8$ and $\tilde N = 256$ for this example, one can see that even for this small network there is a quantum advantage: while $N_C^{\mathrm{cl}} = 8 \times 256$, we find for our quantum training algorithm $N_C^{\mathrm{qm}} = 8 \times 180$. Note that for large networks the quadratic advantage in $\tilde N$ will be much more significant, as discussed in the main text.

Appendix H. Correlation between n and N for a two-hidden-layer feed-forward network
In [61] the author proved that for a two-hidden-layer feed-forward NN with $m$ output neurons, the number of hidden nodes that are enough to learn $N$ samples with negligibly small error is given by

$$2\sqrt{(m+2)N}. \tag{H.1}$$

Specifically, the sufficient number of hidden nodes in the first hidden layer is suggested to be

$$L_1 = \sqrt{(m+2)N} + 2\sqrt{\frac{N}{m+2}},$$

and in the second hidden layer

$$L_2 = m\sqrt{\frac{N}{m+2}}.$$

In this architecture, the number of weights between the two hidden layers is the product of the number of nodes in the two hidden layers, $L_1 \times L_2$.
Figure G12. Probability evolution of the weight strings in the six-neuron QBNN with configuration 3-2-1, task 2. Initially all weights have the same amplitude, but the quantum search technique increases the amplitudes of the best weights until the 4th iteration, which is the optimal moment to measure the system. (The quantum search technique we employ does not require knowing at which moment to measure.)
Figure G13. Probability of obtaining an optimal set of weights, for a six-neuron QBNN with layer configuration 3-2-1. The graphs show the optimal stopping time of the training for both tasks 1 and 2. We see that the probability of success is close to unity after 6 iterations for task 1 and 4 iterations for task 2.
The total number of weights $N_{\mathrm{total}}$ in this optimal architecture is larger than $L_1 \times L_2$ (since we also have weights between the input layer and the first hidden layer, and weights between the second hidden layer and the output layer). Thus, we find

$$N_{\mathrm{total}} > L_1 \times L_2 > mN. \tag{H.7}$$

Using the notation in our work, i.e. $N_{\mathrm{total}} \to N$, $N \to n$, we have

$$N > m \times n, \tag{H.8}$$

where $N$ is the number of weight qubits, $n$ is the number of training samples and $m$ is the number of neurons in the output layer. Since $m$ is a positive integer, $N > mn \geq n$. This gives us the correlation $n < N$ used in the main text.
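Taking the hidden-layer widths of [61] at face value (the explicit forms of $L_1$ and $L_2$ used below are our reconstruction of that reference), the chain $N_{\mathrm{total}} > L_1 L_2 > mn \geq n$ can be checked numerically for a few sizes:

```python
import math

def hidden_layer_sizes(n, m):
    # Sufficient widths of the two hidden layers for n samples and m outputs,
    # as reconstructed from [61]; their sum is 2 * sqrt((m + 2) * n).
    L1 = math.sqrt((m + 2) * n) + 2 * math.sqrt(n / (m + 2))
    L2 = m * math.sqrt(n / (m + 2))
    return L1, L2

for n in (100, 10_000):
    for m in (1, 2, 10):
        L1, L2 = hidden_layer_sizes(n, m)
        # Consistency with the total node count (H.1):
        assert abs((L1 + L2) - 2 * math.sqrt((m + 2) * n)) < 1e-6
        # The chain used in the main text: L1*L2 > m*n >= n.
        assert L1 * L2 > m * n >= n
```

Algebraically, $L_1 L_2 = mn + 2mn/(m+2)$, so the inequality $L_1 L_2 > mn$ holds for every $n \geq 1$ and $m \geq 1$.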