The speed of quantum and classical learning for performing the kth root of NOT

We consider quantum learning machines (quantum computers that modify themselves in order to improve their performance in some way) that are trained to perform a certain classical task, i.e. to execute a function that takes classical bits as input and returns classical bits as output. This allows a fair comparison between the learning efficiency of quantum and classical learning machines in terms of the number of iterations required to complete learning. We find an explicit example of a task for which numerical simulations show that quantum learning is faster than its classical counterpart. The task is extraction of the k-th root of NOT (NOT = logical negation), with $k = 2^m$ and $m \in \mathbb{N}$. The reason for this speed-up is that the classical machine requires memory of size $\log k = m$ to accomplish the learning, while the memory of a single qubit is sufficient for the quantum machine for any k.

Learning can be defined as the changes in a system that result in an improved performance over time on tasks that are similar to those performed in the system's previous history. Although learning is often thought of as a property associated with living things, machines or computers are also able to modify their own algorithms as a result of training experiences. This is the main subject of the broad field of "machine learning". Recent progress in quantum communication and quantum computation [1], the development of novel and efficient ways to process information on the basis of the laws of quantum theory, provides motivation to generalize the theory of machine learning to the quantum domain [2]. For example, quantum learning algorithms have been developed for extracting information from a "black-box" oracle for an unknown Boolean function [3,4].
The main ingredient of the quantum machine is a feed-back system that is capable of modifying its initial quantum algorithm in response to interaction with a "teacher", such that it yields better approximations to the intended quantum algorithm. Feed-back systems have been studied extensively in the literature. Examples include quantum neural networks [5], estimation of quantum states [7], and automatic engineering of quantum states of molecules or light with a genetic algorithm [8,9,10]. Quantum neural networks deal with many-body quantum systems and refer to the class of neural network models that explicitly use concepts from quantum computing to simulate biological neural networks [6]. Standard state-engineering schemes optimize unitary transformations to produce a given target quantum state. The present approach of quantum automatic control contrasts with these methods. Instead of a quantum state, it optimizes quantum operations (e.g. unitary transformations) to perform a given quantum information task. It also differs from the problems studied in Refs. [3,4], where one does not learn a task but rather a specific property of a black-box oracle.
An interesting question arises in this context: (1) Can a quantum machine learn to perform a given quantum algorithm? This question has been answered in the affirmative for special tasks, such as quantum pattern recognition [11], matching of unknown quantum states [12], and learning quantum computational algorithms such as the Deutsch algorithm [13], the Grover search algorithm and the discrete Fourier transform [14].
Another interesting question is: (2) Can one have quantum improvements in the speed of learning, in the sense that a quantum machine requires fewer steps than the best classical machine to learn some classical task? By "classical task" we mean an operation or a function that has classical input and classical output. Quantum machines such as the quantum state discriminator, the universal quantum cloner or the programmable quantum processor [15] do not fall into this category. Quantum computational algorithms do perform classical tasks, but no investigation has been undertaken to compare the speed of learning of these algorithms with that of their classical counterparts. To our knowledge, question (2) is still open. In this letter we give evidence for the first explicit classical computational task that quantum machines can learn faster than their classical counterparts. In both cases a certain set of independent parameters must be optimized to learn the task. We will show that the fraction of the parameter space that corresponds to (approximately) successful completion of the task is exponentially smaller for the classical machine than for the quantum one. This analytical result supports our numerical simulations, which show that the quantum machine learns faster than the classical one.
We first define the family of problems of interest. Let the m-th member ($m \in \mathbb{N}$) of this family be the k-th root of NOT with $k = 2^m$, where the roots of NOT are defined as follows.

Definition 1. An operation is a k-th root of NOT if, when applied successively $nk$ times to a Boolean input of 0 or 1, it returns the input for even n and its negation for odd n. We denote this operation by $\sqrt[k]{\mathrm{NOT}}$.
(We want to discard the cases in which, for example, the operation acts as the k-th root of NOT when performed once, but does not return the identity when performed twice.) The machine that performs this operation takes one input bit and returns one output bit. This bit will be called the "target bit". In general, however, the machine could use many more auxiliary bits that might help the performance. Specifically, in the classical case the input $i$ and output $j$ are vectors with binary components. Any operation is defined by a probability distribution $p(i, j)$, which gives the probability that the machine will generate the output $j$ from the input $i$. Thus, one has $\sum_j p(i, j) = 1$. The readout of the target bit is a map $j \to \{0, 1\}$. Without loss of generality we assume that the target bit is the first component of the input and the output vector. The remaining components are auxiliary bits, which play the role of the machine's memory.
In the quantum case no auxiliary (qu)bits are necessary, as a single qubit is enough to implement any $\sqrt[k]{\mathrm{NOT}}$. The input of the machine is a single qubit and the machine itself is a unitary transformation. The input state will be either $|0\rangle$ or $|1\rangle$, corresponding to the Boolean values of the classical bits "0" and "1", respectively. The readout procedure is a measurement in the computational basis $\{|0\rangle, |1\rangle\}$, and we consider the state onto which the qubit is projected as the output of the machine.
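The claim that one qubit suffices can be checked numerically. The sketch below assumes, for illustration only, the candidate root $U = \exp(-i\frac{\pi}{2k}\sigma_x)$ (a global phase is irrelevant here); the function names `root_of_not` and `flip_probability` are hypothetical and not taken from the paper.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)

def root_of_not(k):
    """Candidate k-th root of NOT, assumed to be exp(-i*pi/(2k)*sigma_x)."""
    theta = np.pi / (2 * k)
    # exp(-i*theta*sigma_x) = cos(theta)*I - i*sin(theta)*sigma_x
    return np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * SIGMA_X

def flip_probability(U, applications):
    """Probability that input |0> is read out as |1> after repeated use of U."""
    V = np.linalg.matrix_power(U, applications)
    return abs(V[1, 0]) ** 2

for k in (2, 4, 8):
    U = root_of_not(k)
    # Flip probability after k applications (n = 1) and after 2k (n = 2).
    print(k, flip_probability(U, k), flip_probability(U, 2 * k))
```

Under this assumption the flip probability should be numerically close to 1 after $k$ applications and close to 0 after $2k$, independently of $k$.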
In both cases the term learning is used for the process of approximating the function $\sqrt[k]{\mathrm{NOT}}$, to which we will refer as the target function. We will consider that learning has been accomplished when the learning machine returns, with high probability, correct outputs for both inputs. A learning process is then reduced to approximating the target function in a sequence of taking the inputs, performing transformations on the inputs, returning the outputs, estimating the fidelity between the actual outputs and the ones that the target function would have produced, and making adjustments to the transformations accordingly. Schematic diagrams of both types of machines are shown in Fig. 1. We now describe the learning in both cases in more detail.
Quantum learning: In every learning trial the following steps are performed:
1. Select a new unitary operator $U$ using a Gaussian random walk (the first $U$ is initialized randomly using the Haar measure).
2. Run the unitary $U^k$ on an input qubit state chosen to be $|0\rangle$ or $|1\rangle$ with equal probability. Measure the output qubit in the computational basis. Repeat this for M input states and store the results (classical bits). The number M defines the size of the teacher's (classical) memory of the quantum machine.
3. Estimate how close the actual operation is to the target one. To do so, count the number of times the operation is successful in approximating the target function (i.e. it produces $|1\rangle$ when the input was $|0\rangle$, and it produces $|0\rangle$ when the input was $|1\rangle$). The number of successes is denoted by $s_{\mathrm{new}}$ and $s_{\mathrm{old}}$ in the executed and the previous trial, respectively.
4. If $s_{\mathrm{new}} \geq s_{\mathrm{old}}$, go to 1 with the current unitary operator as the center of the Gaussian; otherwise, go to 1 with the unitary operator chosen in the previous trial as the center of the Gaussian.

Any single-qubit rotation can be parameterized by Euler angles as follows:

$$U = e^{i\alpha} R_z(\beta)\, R_y(\gamma)\, R_z(\delta), \tag{1}$$

where $R_y$ and $R_z$ denote rotations about the y and z axes. Since the global phase $\alpha$ is irrelevant for the present application, we are left with the parameters $\beta$, $\gamma$ and $\delta$. In every new learning trial these parameters are selected independently from a normal probability distribution centered around the values from the previous run, and the widths of the Gaussians are taken as free parameters of the simulation.
There are two free parameters of the learning procedure: $\sigma_\gamma$ and $\sigma_\beta$ (with $\sigma_\delta = \sigma_\beta$). In all simulations these parameters are optimized to minimize the number of learning steps. Note that if the quantum machine performs the task for n = 1 perfectly, then it will also perform the task perfectly for all n. This is why our quantum machine is trained only to learn the task for n = 1. Nevertheless, after the learning has been completed one should compare how close the performance of the learning machine is to that of the target operation for all n. We define a set of figures of merit $\{P_n\}_{n=1}^{\infty}$ (Eq. (2)), where $b = 0$ if n is even and $b = 1$ if n is odd, and $\oplus$ denotes addition modulo 2. Note that each subsequent $P_n$ is more demanding, in the sense that more constraints from the definition of $\sqrt[k]{\mathrm{NOT}}$ are taken into account. This is reflected by the results of the learning procedure, which are presented in Fig. 2.
The memory size of the teacher, M, is another free parameter of the quantum machine. The learning ability depends very strongly on M, as can be seen from Fig. 3. For lower values of M the learning is faster at the beginning (up to about $4 \times 10^4$ trials), before it slows down and saturates. At saturation the size of the memory does not allow one to distinguish between sufficiently "good" operations, for all of which $s_{\mathrm{new}} = M$. For higher values of M the learning is slower, but it reaches higher fidelities. To combine the high speed with the high fidelity of learning we apply the learning procedure with variable M: the machine starts with M = 1, and whenever it obtains the number of successes $s_{\mathrm{new}} = M$ it increments M by one. With this kind of algorithm the learning has one free parameter fewer. All our simulations were done for variable M, unless stated otherwise.
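The quantum learning loop with variable teacher memory can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: it assumes the standard Z-Y-Z Euler decomposition with the global phase dropped, draws the number of successes from a binomial distribution (for a single-qubit unitary the flip probability is the same for both basis inputs), and compares $s_{\mathrm{new}}$ with $s_{\mathrm{old}}$ even when M has changed in between. All function names are hypothetical.

```python
import numpy as np

def unitary(beta, gamma, delta):
    """R_z(beta) R_y(gamma) R_z(delta), global phase dropped (assumed convention)."""
    def rz(t):
        return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])
    def ry(t):
        return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                         [np.sin(t / 2), np.cos(t / 2)]])
    return rz(beta) @ ry(gamma) @ rz(delta)

def success_probability(params, k):
    """Probability that U^k flips a computational basis state."""
    Uk = np.linalg.matrix_power(unitary(*params), k)
    return min(1.0, abs(Uk[1, 0]) ** 2)  # clip rounding errors above 1

def quantum_learning(k, trials, sigma_gamma=np.pi / 4, sigma_beta=np.pi / 8, seed=0):
    rng = np.random.default_rng(seed)
    sigmas = np.array([sigma_beta, sigma_gamma, sigma_beta])  # beta, gamma, delta
    # Haar-random start up to global phase: beta, delta uniform, cos(gamma) uniform.
    center = np.array([rng.uniform(0, 2 * np.pi),
                       np.arccos(rng.uniform(-1, 1)),
                       rng.uniform(0, 2 * np.pi)])
    M, s_old = 1, 0
    for t in range(trials):
        # Step 1: Gaussian random walk around the current center.
        candidate = center if t == 0 else center + rng.normal(0.0, sigmas)
        # Steps 2-3: M runs of U^k on random basis inputs, count the flips.
        s_new = rng.binomial(M, success_probability(candidate, k))
        # Step 4: keep the candidate if it did at least as well as before.
        if s_new >= s_old:
            center, s_old = candidate, s_new
        # Variable teacher memory: grow M after a perfect score.
        if s_new == M:
            M += 1
    return center, M

params, M = quantum_learning(k=4, trials=2000)
print(M, success_probability(params, 4))
```

The default widths are the values quoted in the figure captions; the seed and the number of trials are arbitrary choices for the example.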
Next we describe the classical learning procedure.

Classical learning:
The classical learning is an iterative process of finding the optimal probability distribution $p(i, j)$ for the classical machine to extract $\sqrt[k]{\mathrm{NOT}}$. The speed of learning depends on the number $N^2 - N$ of independent parameters (independent probabilities $p(i, j)$), where $N = 2^{\dim(i)}$ and $\dim(i)$ is the dimension of the input vector $i$ and of the output vector $j$. We will refer to N as the memory size of the classical learning machine, because it is equal to the total number of distinguishable internal states of the machine. To minimize the number of learning trials required to complete the learning, and thus to maximize the speed of learning, N should be as small as possible. The minimal memory size is given by the following lemma.

Lemma. A classical machine that performs $\sqrt[k]{\mathrm{NOT}}$ perfectly must have memory size $N \geq 2k$.

Proof. Each probabilistic classical machine can be considered as a convex combination of deterministic ones. If it performs some task perfectly, then there must also be a deterministic machine that does the same. This means that in this proof we can restrict ourselves to deterministic machines without any loss of generality. Any (deterministic, classical) machine can be represented as an oriented graph, with vertices corresponding to the internal states. An edge pointing from vertex $i$ to vertex $j$ means that the operation on input $i$ generates the output $j$. Any (finite) machine must have at least one loop and, if the machine is run successively a large number of times, it will eventually end up in that loop. Since the definition of the task involves arbitrarily large n, we may start our analysis from n large enough that the machine is already in the loop. Since we will prove the lemma by giving a lower bound on the size of the loop, we may assume that the whole graph is one loop and that each vertex is part of it.
Let the length of the loop be N, and let g be the greatest common divisor $g = \mathrm{GCD}(k, N)$. Then there exist numbers x and y with $\mathrm{GCD}(x, y) = 1$ such that

$$k = gx, \qquad N = gy.$$

If the machine is initially in a vertex that corresponds to input "1" of the target bit and we apply the operation $Nk$ times, we will always end up in the same vertex "1", since $Nk = 0 \bmod N$. Since, however, the task is defined such that for N odd the ending vertex should correspond to the value "0" of the target bit, one concludes that N must be even. Therefore, we can write N as $N = 2^K c$, where c is odd and $K \geq 1$. We also have $Nx = 0 \bmod N$, but since $Nx = gyx = ky$, then $ky = 0 \bmod N$. According to our definition of $\sqrt[k]{\mathrm{NOT}}$ this implies that y is even and, since $\mathrm{GCD}(y, x) = 1$, x is odd. Also

$$2^K c\, x = 2^m y.$$

Since y is even and both x and c are odd, $K \geq m + 1$ must hold. We conclude with

$$N \geq 2^{m+1} = 2k.$$

The Lemma implies that if the machine is to perform $\sqrt[k]{\mathrm{NOT}}$ perfectly, it needs to have $\log k = m$ auxiliary bits in addition to the target bit. It is easy to check that this is not only a necessary but also a sufficient condition. One just needs to design a machine that is a loop of length $2k$ in which the vertices corresponding to the initial target input bits 0 and 1 are at a distance k from each other. The number of functions with this property, divided by the total number of functions $f: \{0, 1\}^{2k} \to \{0, 1\}^{2k}$, gives the fraction of target functions. The target functions thus constitute an exponentially small fraction of all functions.

Next, we consider probabilistic classical machines, which in order to approximate the target functions with high probability need to be sufficiently "close" to them in the probability space (e.g. in the sense of the Kullback-Leibler divergence). In this way both the quantum and the classical machines "search" in a continuous space of parameters; however, the relative fraction of this space that is close to the target functions is much larger in the quantum case.
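The lemma and the loop construction can be checked by brute force for small k. The sketch below is an illustration under an assumed encoding: internal states are the integers $0, \dots, N-1$ with the target bit as the most significant bit, and inputs 0 and 1 start with all auxiliary bits set to 0. Checking n up to $2N$ is enough because the orbit of a deterministic machine becomes periodic after at most N steps, with period at most N. The function names are hypothetical.

```python
from itertools import product

def performs_root(f, k, aux_bits):
    """True if the deterministic machine f (a sequence mapping state -> state)
    acts as a k-th root of NOT on the target bit for every n checked."""
    n_states = 2 ** (1 + aux_bits)
    for bit in (0, 1):
        state = bit << aux_bits              # target bit first, auxiliary bits 0
        for n in range(1, 2 * n_states + 1):
            for _ in range(k):               # k more applications: n*k in total
                state = f[state]
            if (state >> aux_bits) != (bit ^ (n % 2)):
                return False
    return True

def count_solutions(k, aux_bits):
    """Number of deterministic machines with the given memory that solve the task."""
    n_states = 2 ** (1 + aux_bits)
    return sum(performs_root(f, k, aux_bits)
               for f in product(range(n_states), repeat=n_states))

def loop_machine(k):
    """Single loop of length 2k; inputs 0 and 1 sit at positions 0 and k."""
    return [(s + 1) % (2 * k) for s in range(2 * k)]

print(count_solutions(2, 0))   # square root of NOT with no auxiliary bit
print(count_solutions(2, 1))   # square root of NOT with one auxiliary bit
print(count_solutions(4, 1))   # 4th root of NOT with one auxiliary bit
print(performs_root(loop_machine(8), 8, 3))
```

If the lemma holds, the first and third counts are zero (too little memory), the second is positive, and the loop of length 16 performs the 8th root of NOT.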
In the case of the quantum machine any root of NOT can be performed with only one qubit. The operation that performs $\sqrt[k]{\mathrm{NOT}}$ is, up to a global phase, $U = \exp\left(-i\frac{\pi}{2k}\sigma_x\right)$, where $\sigma_x$ is the spin matrix along direction x. Therefore, the memory requirements for our family of problems grow as $\log k$ in the classical case, while remaining constant in the quantum one.

Next, we introduce the classical learning procedure. We assume that the classical machine is initially in a "random" state for which $p(i, j) = \frac{1}{2k}$. The learning process consists of the following steps:
1. Set initially the internal state of the machine such that its first bit (target bit) is 0 or 1 with equal probability. All auxiliary bits are 0.
2. Apply the operation k times and after each application read out the output $j_r$, with $r \in \{1, \dots, k\}$. We observe a sequence $i \equiv j_0 \to j_1 \to j_2 \to \dots \to j_k$ of machine states. If the target bit of the final state $j_k$ is the inverse of the target bit of the initial state $i$, move to step 3. Otherwise move to step 4.
3. Increase every probability $p(j_{r-1}, j_r)$ that led to success by adding a factor $1 \geq K_s \geq 0$. Renormalize the probability distribution such that $\sum_j p(i, j) = 1$ and go back to step 2.
4. Decrease every probability $p(j_{r-1}, j_r)$ that led to a failure by subtracting a factor $1 \geq K_f \geq 0$ (if the probability then becomes negative, set it to 0). Renormalize the probability distribution and go back to step 1.
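The four steps above can be sketched in code. This is a minimal illustration, not the authors' simulation: it assumes $N = 2k$ internal states encoded as integers whose quotient by k is the target bit, counts each k-fold run of step 2 as one trial, and resets a row to the uniform distribution if all of its entries have been driven to zero (a case the text does not specify). The function name is hypothetical.

```python
import numpy as np

def classical_learning(k, trials, K_s=0.25, K_f=0.25, seed=0):
    rng = np.random.default_rng(seed)
    N = 2 * k                            # internal states; state // k is the target bit
    p = np.full((N, N), 1.0 / N)         # initial "random" machine: p(i, j) = 1/(2k)
    state = None
    for _ in range(trials):
        if state is None:
            # Step 1: random target bit, all auxiliary bits 0.
            state = int(rng.integers(2)) * k
        # Step 2: apply the operation k times and record the visited states.
        path = [state]
        for _ in range(k):
            path.append(int(rng.choice(N, p=p[path[-1]])))
        success = (path[-1] // k) != (path[0] // k)
        # Step 3 (reward) or step 4 (penalty) for every transition that was used.
        change = K_s if success else -K_f
        for a, b in zip(path, path[1:]):
            p[a, b] = max(p[a, b] + change, 0.0)
        # Renormalize the rows that were modified.
        for a in set(path[:-1]):
            total = p[a].sum()
            p[a] = p[a] / total if total > 0 else 1.0 / N
        # After a success continue from the final state, otherwise restart.
        state = path[-1] if success else None
    return p

p = classical_learning(k=2, trials=5000)
print(np.round(p, 2))
```

The defaults $K_s = K_f = 0.25$ are the values quoted for the 2nd root in the caption of Fig. 4; other roots use other values there.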
Note that by repeating steps 2 and 3 the classical machine gradually learns to perform the task for all n. The learning has two free parameters, $K_s$ and $K_f$, exactly like the quantum learning (with a variable teacher's memory size M). To estimate how close the machine's functioning is to that of the target machine we use the set of figures of merit for all n, $\{P_n\}_{n=1}^{\infty}$, which are similar to those of Eq. (2). For example, $P_2 \equiv P_k(0, 1) + P_k(1, 0) + P_{2k}(0, 0) + P_{2k}(1, 1)$, where $P_k(1, 0)$ is the probability that the target bit has been changed from 1 to 0 after applying the transformation k times, and the other probabilities are defined similarly.
We have performed computer simulations of both the quantum and the classical learning process. The results are presented in Fig. 4. We see that the learning in the quantum case is much faster for k > 2. This speed-up can be understood if one realizes that for the present problem the process of learning is an optimization of a square matrix: the unitary transformation $U$ in the quantum case, and a matrix with entries $p(i, j)$ in the classical one. While the size of $U$ remains 2 (with complex entries), the size of the matrix with entries $p(i, j)$ grows linearly with k. It is clear that optimization of significantly larger matrices requires more iterative steps and thus leads to slower learning.
The classical learning algorithm given here is not the most general and might not be optimal. The general framework for finding optimal learning procedures is still not fully understood. We have chosen the quantum and classical learning algorithms such that the comparison between them is most evident. The two tasks, i.e. finding a unitary operator for the k-th root of NOT and finding a classical probability distribution that generates the k-th root of NOT, though different from the physical point of view, both require optimization of matrix elements. Since for a given task the classical machines require a significantly larger number of independent parameters to be optimized (of which only a small fraction leads to the desired matrix), it is natural to assume that they also require a larger number of learning steps to accomplish learning, regardless of the explicit learning procedure employed. This is exactly what our numerical simulations show.
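To make the size comparison concrete, the following sketch counts the independent parameters on each side. It assumes the minimal classical memory $N = 2k$ with $N^2 - N$ independent probabilities, and three Euler angles for the single-qubit unitary once the global phase is dropped. The function name is hypothetical.

```python
def parameter_counts(k):
    """Return (classical, quantum) numbers of independent parameters."""
    n_states = 2 * k                       # assumed minimal classical memory size N = 2k
    classical = n_states ** 2 - n_states   # N^2 - N independent probabilities p(i, j)
    quantum = 3                            # Euler angles, global phase dropped (assumption)
    return classical, quantum

for k in (2, 4, 8):
    print(k, parameter_counts(k))
```

For k = 2, 4 and 8 this gives 12, 56 and 240 classical parameters against 3 quantum ones, so while the matrix size grows linearly with k, the number of parameters to optimize grows quadratically.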
Quantum information processing has been shown to allow a speed-up over the best possible classical algorithms in computation and has advantages over its classical counterpart in communication tasks, such as secure transmission of information or communication complexity.In this paper we extend the list with a novel task from the field of machine learning: learning to perform the k-th root of NOT.

FIG. 1: Diagram of classical and quantum learning machines. The learning procedure consists of a sequence of taking the inputs, performing transformations on them, returning the outputs, estimating the figure of merit between the outputs obtained and the expected ones, and making adjustments to the transformations accordingly. For the task of extracting the k-th root of NOT (see text for definition) the dimension of the space of parameters for a classical machine is log 2k larger than that for a quantum machine.

FIG. 2: Quantum learning for performing the 4-th root of NOT. Different figures of merit $P_s$ (s = 1, 5, 10) as a function of the number of learning trials ($\times 10^3$). The size of the teacher's memory M is varied to achieve the maximal value of the figures of merit for a given number of trials. The free parameters have the values $\sigma_\gamma = \pi/4$ and $\sigma_\alpha = \sigma_\beta = \pi/8$.

FIG. 3: Quantum learning for performing the 4-th root of NOT. Figure of merit $P_{10}$ as a function of the number of learning trials ($\times 10^4$) for different sizes of the teacher's memory M (blue = 300, green = 100, red = 50). The free parameters have the values $\sigma_\gamma = \pi/4$ and $\sigma_\alpha = \sigma_\beta = \pi/8$.

FIG. 4: The figure of merit ($P_{10}$) of classical (CL) and quantum (QL) learning for performing different k-th (k = 2, 4, 8) roots of NOT as a function of the number of learning trials ($\times 10^3$). The values of the free parameters are chosen to maximize the figure of merit. Already for k = 4 quantum learning is faster than the classical one. For the 8-th root of NOT, the figure of merit of classical learning is that of a random choice (= 0.5) on the given time scale. The free parameters have the values $\sigma_\gamma = \pi/4$ and $\sigma_\alpha = \sigma_\beta = \pi/8$ (for all roots) in the quantum case and the values $K_s = K_f = 0.25$ (2nd root), $K_s = K_f = 0.75$ (4th root) and $K_s = 0.75$, $K_f = 0.25$ (8th root) in the classical one.