Experimental semi-autonomous eigensolver using reinforcement learning

The characterization of observables, expressed via Hermitian operators, is a crucial task in quantum mechanics. For this reason, an eigensolver is a fundamental algorithm for any quantum technology. In this work, we implement a semi-autonomous algorithm to obtain an approximation of the eigenvectors of an arbitrary Hermitian operator using the IBM quantum computer. To this end, we use only single-shot measurements and pseudo-random changes handled by a feedback loop, reducing the number of measurements on the system. Due to the classical feedback loop, this algorithm can be cast into the reinforcement learning paradigm. Using this algorithm, for a single-qubit observable, we obtain both eigenvectors with fidelities over 0.97 with around 200 single-shot measurements. For two-qubit observables, we get fidelities over 0.91 with around 1500 single-shot measurements for the four eigenvectors, which is a comparatively low resource demand, suitable for current devices. This work is useful for the development of quantum devices able to decide with partial information, which helps to implement future technologies in quantum artificial intelligence.

Increasing the computational capabilities of machines is an essential goal in artificial intelligence. In this context, machine learning algorithms have emerged with great force in the last decades 1,2 . This class of algorithms can be divided into two families: learning from big data and learning from interactions. Learning from big data can be classified into two categories: supervised and unsupervised learning. In the supervised learning paradigm, we have a set of labeled data called the training data, from which we want to infer a classification function to sort new, unlabeled data. Unsupervised learning algorithms do not use training data. In this paradigm, the goal is to extract the statistical structure of an unsorted data set and divide it into different groups according to some criterion (clustering problem) 3-8 . In the category of learning from interactions we have the Reinforcement Learning (RL) algorithms 9-18 . The idea in this paradigm is that a known and manipulable system called the agent (A) interacts with a non-manipulable system called the environment (E). Here, the goal is to optimize a task G(A, E), which depends on the states of A and E. For this, we use feedback loops to change the state of A using the information extracted from the interaction with E. Some impressive recent examples of RL are the AI players for strategy games like Go 19 , Chess 20 , and StarCraft II 21 .
On the other hand, it has been shown that quantum computing 22 can overcome some fundamental limits of classical computing, e.g., in searching problems 23 , factorization algorithms 24 , solving linear equation systems 25,26 , and for linear differential equations 27 . Therefore, it was natural to merge machine learning techniques with the advantages of quantum computing in the topic known as Quantum Machine Learning (QML) [28][29][30][31][32][33][34][35] .
With the development of Noisy Intermediate-Scale Quantum (NISQ) devices 36 , research on simple quantum information protocols (suitable for NISQ quantum computers) and on QML has grown in recent years. The IBM quantum computer is one of the most famous open NISQ devices; it can be programmed using Qiskit 37 , an open-source Python package, to create and run quantum programs through the IBM quantum cloud service 38 .
Among the most useful algorithms for linear algebra, and hence for quantum mechanics, are the quantum eigensolvers. Hybrid quantum-classical algorithms like the variational quantum eigensolver (VQE) 39-41 are attractive due to their easy implementation in NISQ devices. The main idea of this class of algorithms is to calculate some expectation value (like the energy) with a quantum processor, and then use a classical optimizer (like a variational one) to reach the solution 42 . Nevertheless, an algorithm that uses a quantum optimizer has recently been proposed 43 . Each iteration of the classical optimizer involves many single-shot measurements on the quantum system, which are required to calculate an expectation value. The development of an algorithm with more quantum features thus calls for a more primitive classical subroutine.
In this paper, we implement the semi-autonomous eigensolver proposed in Ref. 44 . The protocol obtains an approximation of all eigenvectors of an arbitrary observable using single-shot measurements instead of expectation values. Here, we use the most basic classical subroutine, which involves only pseudo-random changes handled by the outcome of each single-shot measurement and a feedback loop. Due to this feedback loop, the algorithm can be classified within the RL paradigm. Using our protocol, we can obtain a high-fidelity approximation of all eigenvectors. In the single-qubit case, we get fidelities larger than 0.97, and larger than 0.91 for a two-qubit observable, in around 200 and 1500 single-shot measurements, respectively. This work opens the door to exploring alternative paradigms in hybrid classical-quantum algorithms, which is useful for developing semi-autonomous quantum devices that decide with incomplete information.

Methods
Basics on RL paradigm. We briefly describe the basic components of the RL paradigm. As mentioned above, in an RL algorithm we define two systems: the agent A and the environment E. The interaction between these systems can be divided into three basic ingredients: the policy, the reward function (RF), and the value function (VF). The policy refers to the general rules of the algorithm and can be subdivided into three stages: first, the interaction, where we specify how A and E interact; second, the action, which refers to how A changes its perception of E by modifying some internal parameters; and third, the information extraction, which defines the process used by A to infer information from E. The information extraction can be done directly by A or, if A cannot read the response of the environment, using an auxiliary system named the register.
The RF is the criterion to reward or punish A in each iteration using the information collected from E. This step is the most important in any RL algorithm, because the right choice of RF ensures the optimization of the desired task G(A, E). Finally, the VF evaluates a figure of merit related to the task G(A, E), which quantifies the utility of the algorithm. The main difference between the RF and the VF is that the former evaluates each iteration, improving the performance locally in time without considering the history of the algorithm, while the VF depends on the history of the algorithm, taking a large number of iterations into account and thus measuring its global performance.

RL protocol. We now define the basic parts of our protocol as an RL algorithm. The state of the agent is denoted by

|A_k^(j)⟩ = D̂_k |j⟩,    (1)

where D̂_k is a unitary transformation that prepares the desired agent state, |j⟩ is the initial state provided by the quantum processor in the computational basis, and the subindex k denotes the iteration of the algorithm. The environment is expressed as an unknown Hermitian operator Ô written as

Ô = Σ_j α^(j) |E^(j)⟩⟨E^(j)|,    (2)

with α^(j) and |E^(j)⟩ the jth eigenvalue and eigenvector of Ô, respectively. The environment acts on the agent through the unitary evolution

Ê = e^{−iτÔ},    (3)

after which the agent is measured in the rotated basis {D̂_k|m⟩}, obtaining the outcome m with probability

c^(m) = |⟨m| D̂_k† Ê D̂_k |j⟩|².    (4)

The task G is set to maximize the fidelity between the state of the agent and the eigenvectors of Ô. If |A_k^(j)⟩ is equal to some eigenvector of Ô, we obtain c^(j) = 1 in Eq. (4). Using this condition, we define the following rule for the action. If the outcome is m ≠ j, so that c^(j) ≠ 1, then |A_k^(j)⟩ is not an eigenvector of Ô. In this case (m ≠ j), we modify the agent for the next iteration, defining the operator D̂_{k+1} as

D̂_{k+1} = û(θ_k, φ_k, λ_k) D̂_k,    (6)

where, up to a global phase, û(θ, φ, λ) is a general rotation in the {|j⟩, |m⟩} subspace [Eq. (7)]. The angles are random numbers given by

{θ_k, φ_k, λ_k} ∈ w_k [−π, π],    (8)-(10)

where the range amplitude w_k is updated in each iteration according to the RF, which will be specified below. Now, for the case m = j, the state |A_k^(j)⟩ could be an eigenvector of Ô, so we define

D̂_{k+1} = D̂_k.    (11)

We can summarize Eqs. (6) and (11) in a single update rule, Eq. (12). Now, we define the reward function as

w_{k+1} = p w_k   if m ≠ j,
w_{k+1} = r w_k   if m = j,    (13)

where p > 1 is the punishment ratio and 0 < r < 1 is the reward ratio. This means that each time we obtain the outcome m ≠ j, we increase the range amplitude w_{k+1}, because m ≠ j means that we are further away from an eigenvector and greater corrections are required. In the other case, m = j means that we are closer to an eigenvector, so we reduce the value of w_{k+1}, obtaining smaller changes in future iterations. Finally, the value function is the last value of the range amplitude, w_N, after N iterations. If w_N → 0, we have measured m = j several times, so that c^(j) ≈ 1, which implies that we have obtained a good approximation of an eigenvector.
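The action and reward rules above [Eqs. (6), (11), and (13)] amount to a small classical update routine. The following is a minimal single-qubit sketch, where the subspace rotation û is the full 2×2 gate; the helper names (`su2_rotation`, `rl_step`) are illustrative and do not appear in the original protocol, and the default p = 1/r, r = 0.9 matches the first experiment reported below:

```python
import numpy as np

rng = np.random.default_rng(7)

def su2_rotation(theta, phi, lam):
    """General single-qubit rotation (the U3 form used by Qiskit),
    playing the role of u(theta, phi, lambda) in the {|j>, |m>} subspace."""
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)]])

def rl_step(D, w, m, j, r=0.9, p=1 / 0.9):
    """One iteration of the feedback rule, Eqs. (6), (11)-(13):
    outcome m != j -> random correction and punishment (w grows);
    outcome m == j -> keep D and reward (w shrinks)."""
    if m != j:
        theta, phi, lam = w * rng.uniform(-np.pi, np.pi, size=3)
        D = su2_rotation(theta, phi, lam) @ D   # Eq. (6)
        w = p * w                               # punishment, Eq. (13)
    else:
        w = r * w                               # reward, Eq. (13)
    return D, w
```

Each call consumes one single-shot outcome m; the full algorithm iterates `rl_step` until w drops below a threshold (0.1 in the experiments below).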

Results
Single-qubit case. We implement the algorithm described above on the IBM quantum computer. We start with the simplest case, which is to find the eigenvectors of a single-qubit observable. Since there are only two eigenvectors, we only need to obtain one of them, because the second one is determined by the orthogonality property. Figure 1 shows the circuit diagram for this case. As we can see in Fig. 1, the agent in each iteration is given by

|A_k⟩ = D̂_k |0⟩.    (14)

In this case, we have only one rotation (û_{1,0}) of the form of Eq. (7); then, for simplicity, we redefine the operator D̂_k = D̂(θ_k, φ_k, λ_k) as

D̂(θ_k, φ_k, λ_k) = e^{−i φ_k σ^(z)/2} e^{−i θ_k σ^(y)/2} e^{−i λ_k σ^(z)/2},    (15)

where σ^(a) is the a-Pauli matrix, and

θ_{k+1} = θ_k + Δθ,  φ_{k+1} = φ_k + Δφ,  λ_{k+1} = λ_k + Δλ,    (16)

with {Δθ, Δφ, Δλ} ∈ w_k [−π, π] and w_k given by Eq. (13), considering only two outcomes (m ∈ {0, 1}) and j = 0 for the whole algorithm. The gate in Eq. (15) has the form of the general qubit rotation provided by Qiskit; therefore, it can be efficiently implemented on the IBM quantum computer. We denote by F the maximal fidelity between the final agent state |A_N⟩ and one of the eigenvectors at the end of the algorithm. We find that F is related to the probability P_0 of obtaining the outcome m = 0 by (see Appendix A)

P_0 = 1 − 4F(1 − F) sin²(Δ/2),    (17)

where Δ = τ|α^(1) − α^(0)| is the gap between the eigenvalues of τÔ [see Eqs. (2) and (3)]. Figure 2 shows P_0 as a function of the fidelity F for different values of Δ.
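The relation P_0 = 1 − 4F(1 − F)sin²(Δ/2) follows from expanding |A_N⟩ in the eigenbasis of Ô, and it can be checked numerically. A small sketch, where the eigenvalues α^(0) = 0.3, α^(1) = 1.7 and the fidelity F = 0.9 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# a random single-qubit observable built from a random orthonormal eigenbasis
V, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
alpha = np.array([0.3, 1.7])     # eigenvalues alpha^(0), alpha^(1) (illustrative)
tau = 1.0
E = V @ np.diag(np.exp(-1j * tau * alpha)) @ V.conj().T   # environment unitary

def p0_from_fidelity(F, delta):
    """Right-hand side of the relation between P_0 and F."""
    return 1 - 4 * F * (1 - F) * np.sin(delta / 2) ** 2

delta = tau * abs(alpha[1] - alpha[0])        # eigenvalue gap of tau*O
F = 0.9                                       # fidelity with eigenvector |E^(0)>
chi = 0.4                                     # arbitrary relative phase (cancels)
A = np.sqrt(F) * V[:, 0] + np.sqrt(1 - F) * np.exp(1j * chi) * V[:, 1]
p0 = abs(A.conj() @ E @ A) ** 2               # probability of the outcome m = 0

assert abs(p0 - p0_from_fidelity(F, delta)) < 1e-9
```

Note that P_0 depends only on F and on the gap Δ, not on the relative phase of the superposition, which is why a single outcome probability suffices to certify convergence.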
For the implementation we use the initial values θ_1 = φ_1 = λ_1 = 0, w_1 = 1, and the quantum processor "ibmqx2". The algorithm is run until w_N < 0.1. Since the algorithm converges stochastically to the eigenvectors, we perform 40 experiments in order to characterize the performance of the algorithm by the central values of the data set. Also, we compare the performance of our algorithm with the VQE algorithm for the same environments using the same quantum processor. To test the algorithm, we use three different environment Hermitian operators. For the first one, we choose the reward ratio r = 0.9 and the punishment ratio p = 1/r. The results of the 40 experiments are collected in the Appendix Table 1 (Supplemental material) and summarized in the histograms of Fig. 3. From Fig. 3a, we can see that the probability P_0 is bigger than 0.85 in 36 cases, which implies, as shown in Fig. 3b, that most cases give fidelities larger than 0.94. Also, we have 36 experiments with F > 0.96; the average fidelity is F = 0.98 and the standard deviation is σ = 0.019, which represents 2% of the average fidelity F. The average number of iterations of the algorithm over the 40 experiments is N = 103, the minimum number of iterations is N_min = 25, and the maximum number of iterations is N_max = 528. This number may look large, but we remark that we use only one single-shot measurement per iteration. In comparison, if we want to calculate a given expectation value, we require at least 1000 single-shot measurements for a single qubit. Thus, for this case, our algorithm requires fewer resources than any other classical-quantum algorithm that utilizes expectation values. For the VQE algorithm, we first choose 500 single-shot measurements per step and COBYLA as the classical optimization method.
VQE needs 33 COBYLA iterations to converge, which means 16500 single-shot measurements in total, i.e., 100 times the resources needed by our algorithm, obtaining a fidelity of 0.997. If we change the number of single-shot measurements to 8192 per step (the maximum number of shots allowed by IBM), we need 35 COBYLA iterations to converge, which means 286720 single-shot measurements, 1000 times more resources than our algorithm; nevertheless, the fidelity is 0.999.
For the second environment operator, we choose the reward ratio r = 0.9 and the punishment ratio p = 1.5/r. The results of the 40 experiments are collected in the Appendix Table 2 (see supplemental material) and summarized in the histograms of Fig. 4. From Fig. 4a we can see that the probability P_0 is bigger than 0.9 in 35 cases, which implies, as shown in Fig. 4b, that most cases give fidelities larger than 0.94. Also, we have 30 experiments with F > 0.96; the average fidelity is F = 0.97 and the standard deviation is σ = 0.022, which represents 2.3% of the average fidelity F. The average number of iterations of the algorithm over the 40 experiments is N = 116, the minimum number of iterations is N_min = 25, and the maximum number of iterations is N_max = 572; again, for this case our algorithm uses fewer resources than algorithms that use expectation values. As in the previous case, we compare the results with the VQE algorithm. For 500 shots per step, we get a fidelity of 0.883 with 23 COBYLA iterations, which means 11500 single-shot measurements, i.e., 100 times more resources than our algorithm. For 8192 shots per step, the fidelity is 0.891 and we need 23 COBYLA iterations; the total number of single-shot measurements is 188416, i.e., 1000 times more resources than in our algorithm.

For the third environment operator, we choose the reward ratio r = 0.9 and the punishment ratio p = 1.5/r, as in the previous case. The results of the 40 experiments are collected in the Appendix Table 3 (see supplemental material) and summarized in the histograms of Fig. 5. From Fig. 5a we can see that the probability P_0 is bigger than 0.85 in 39 cases, which implies, as shown in Fig. 5b, that most cases give fidelities larger than 0.94. Also, we have 30 experiments with F > 0.98; the average fidelity is F = 0.98 and the standard deviation is σ = 0.015, which represents 1.6% of the average fidelity F. The average number of iterations of the algorithm over the 40 experiments was N = 227, the minimum number of iterations N_min = 26, and the maximum number of iterations N_max = 782. In this case, as N_max is around 800, we compare with the VQE algorithm using, at first, 800 shots per step, obtaining a fidelity of 0.911 with 14 COBYLA iterations, which means a total number of single-shot measurements of 11200, i.e., 50 times more resources than our algorithm. When we use 8192 shots per step, the fidelity is 0.999 and we need 14 COBYLA iterations, obtaining a total number of single-shot measurements of 114688, i.e., 500 times more resources than our algorithm.
Even if VQE allows us to reach fidelities larger than 0.98 (the mean fidelity of our algorithm), it needs considerably more resources, more than 100 times those used by our algorithm, which implies a great advantage for our proposal.

Two-qubit case. In this case, we have three different agent states, given by

|A_k^(0)⟩ = D̂_k |00⟩,  |A_k^(1)⟩ = D̂_k |01⟩,  |A_k^(2)⟩ = D̂_k |10⟩,

since the fourth state is determined by orthogonality.
We update the matrix D̂_k according to Eq. (12). To decompose the matrix D̂_k into a set of one- and two-qubit gates, we use the method already implemented in Qiskit 45 . To find all the eigenvectors, we divide the protocol into three stages. In the first stage, we consider the agent state |A_k^(0)⟩ = D̂_k|00⟩, with D̂_1 = I and w_1 = 1. The outcome of the measurement has four possibilities, m ∈ {00, 01, 10, 11}, and we run the algorithm until w_{n_1} < 0.1 (n_1 iterations). After this, |A_{n_1}^(0)⟩ = D̂_{n_1}|00⟩ is the approximation of one of the eigenvectors of Ô. In the second stage, we consider the agent state |A_k^(1)⟩ = D̂_k|01⟩, with D̂_{n_1+1} = D̂_{n_1} and w_{n_1+1} = 1. Now, we take into account only three outcomes, m ∈ {01, 10, 11}, since we suppose that |A_{n_1}^(0)⟩ is a good enough approximation. If we obtain m = 00, we consider it an error and define D̂_{k+1} = D̂_k and w_{k+1} = w_k; that is, we do nothing and do not apply the updating rules for D̂_{k+1} and w_{k+1}. We denote this error by c_00. We run this stage for n_2 iterations, until w_{n_1+n_2} < 0.1. As we do not perform rotations in the subspace spanned by {|00⟩, |01⟩} during this stage, we have that |A_{n_1+n_2}^(1)⟩ = D̂_{n_1+n_2}|01⟩ and |A_{n_1+n_2}^(0)⟩ = D̂_{n_1+n_2}|00⟩ approximate two different eigenvectors. Finally, in the third stage, we consider the agent state |A_k^(2)⟩ = D̂_k|10⟩, with D̂_{n_1+n_2+1} = D̂_{n_1+n_2} and w_{n_1+n_2+1} = 1. Now, we have only two possibilities for the measurement outcome, m ∈ {10, 11}. Here, we also suppose that D̂_{n_1+n_2}|00⟩ and D̂_{n_1+n_2}|01⟩ are good enough approximations. If we obtain m = 00 or m = 01, we again consider them as errors and do not apply the update rule, denoting these errors by c′_00 and c_01, as in the previous stage. We run this stage for n_3 iterations, until w_{n_1+n_2+n_3} < 0.1. In this stage, we only modify the subspace spanned by {|10⟩, |11⟩}; then, the four approximate eigenvectors at the end of the algorithm are {|A_{n_T}^(0)⟩ = D̂_{n_T}|00⟩, |A_{n_T}^(1)⟩ = D̂_{n_T}|01⟩, |A_{n_T}^(2)⟩ = D̂_{n_T}|10⟩, |A_{n_T}^(3)⟩ = D̂_{n_T}|11⟩}, with n_T = n_1 + n_2 + n_3.
To test the algorithm, we choose three cases. First, we consider a bi-local operator whose eigenstates and eigenvalues include

|E^(0)⟩ = ½(|00⟩ + |01⟩ + |10⟩ + |11⟩),  α^(0) = 0,
|E^(2)⟩ = ½(|00⟩ + |01⟩ − |10⟩ − |11⟩),  α^(2) = π.

We note that the ground state is degenerate; then any linear combination of the form |φ⟩ = a|E^(0)⟩ + b|E^(1)⟩ will also be a ground state of the operator, and the same holds for the other states. In this case, we define the fidelity of our algorithm by the probability of measuring the initial state |j⟩,

F_j = P_j = |⟨j| D̂†_{n_T} Ê D̂_{n_T} |j⟩|².    (21)

We run this case using the IBM backend "ibmq_vigo", and the results are shown in Appendix Table 4 (see supplemental material). We run the algorithm ten times, and the mean fidelities are F_00 = 0.931, F_01 = 0.933, F_10 = 0.932, and F_11 = 0.919. The mean number of iterations is N = 272, and the mean errors are c_00 = 10, c′_00 = 8, and c_01 = 5. Therefore, the fidelity of our algorithm was higher than 0.91 for each eigenstate in less than 300 single-shot measurements. As in the single-qubit case, we compare with the VQE algorithm. At first, we choose 300 shots per step, needing 56 COBYLA iterations, which means 16800 single-shot measurements, and obtaining a fidelity of 0.976 for the ground state. Using 8192 shots per step, VQE needs 54 COBYLA iterations to converge, which means 442368 single-shot measurements, obtaining a fidelity of 0.997 for the ground state. In this case, VQE gets a significantly more accurate result, but only for the ground state, and uses 1000 times more resources than our algorithm, which obtains all the eigenvectors.
The second example is the molecular hydrogen Hamiltonian with a bond length of 0.2 [Å] 46 :

Ĥ = g_0 I + g_1 Z_0 + g_2 Z_1 + g_3 Z_0 Z_1 + g_4 Y_0 Y_1 + g_5 X_0 X_1,

with g_0 = 2.8489, g_1 = 0.5678, g_2 = −1.4508, g_3 = 0.6799, g_4 = 0.0791, g_5 = 0.0791. In this case, the environment is given by this Hamiltonian, with its corresponding eigenvectors and eigenvalues. We choose the same method as in the previous case to calculate F, we choose the IBM backend "ibmq_valencia", and the results are shown in Appendix Table 5 (see supplemental material). We run the algorithm ten times, and the mean fidelities are F_00 = 0.989, F_01 = 0.973, F_10 = 0.976, and F_11 = 0.979. The mean errors are c_00 = 7, c′_00 = 4, and c_01 = 3, and the mean number of iterations is N = 111. In this case, we need less than 150 single-shot measurements to obtain fidelities over 0.97. For the VQE algorithm, at first we choose 120 shots per step and need 59 COBYLA iterations, which means 7080 single-shot measurements, obtaining a fidelity of 0.994 for the ground state. When we use 8192 shots per step, VQE needs 64 COBYLA iterations to converge, which means 507904 single-shot measurements, obtaining a fidelity of 0.999 for the ground state. In this case, VQE can get better fidelities (larger than 0.99) but again uses much more resources than our proposal, around 1000 times more, to get only one of the eigenvectors.

The third case that we consider to test the algorithm is a non-degenerate two-qubit operator. We run the algorithm on the IBM quantum computer "ibmq_vigo". In order to reduce the total number of iterations, we run the three stages of the algorithm four times, as follows:

1. We choose r = 0.6, p = 1/r, D̂_1 = I, w_1 = 1. Suppose that the total number of iterations after the three stages is N_1 = η_1.
2. We choose r = 0.7, p = 1/r, D̂_{η_1+1} = D̂_{η_1}, w_{η_1+1} = 1. Suppose that the total number of iterations after the three stages is N_2 = η_1 + η_2.
3. We choose r = 0.8, p = 1/r, D̂_{N_2+1} = D̂_{N_2}, w_{N_2+1} = 1. Suppose that the total number of iterations after the three stages is N_3 = η_1 + η_2 + η_3.
4. We choose r = 0.9, p = 1/r, D̂_{N_3+1} = D̂_{N_3}, w_{N_3+1} = 1, and suppose that the total number of iterations after the three stages is N = η_1 + η_2 + η_3 + η_4.
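As an illustration, the molecular-hydrogen environment can be assembled from Pauli terms. The operator form below (identity, Z_0, Z_1, Z_0Z_1, Y_0Y_1, X_0X_1 terms) is an assumption following the Hamiltonian of Ref. 46; only the coefficients g_0, ..., g_5 are quoted in the text:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# coefficients quoted in the text for a 0.2 Angstrom bond length
g0, g1, g2, g3, g4, g5 = 2.8489, 0.5678, -1.4508, 0.6799, 0.0791, 0.0791

# assumed operator form, following the Hamiltonian of Ref. 46
H = (g0 * np.kron(I2, I2) + g1 * np.kron(Z, I2) + g2 * np.kron(I2, Z)
     + g3 * np.kron(Z, Z) + g4 * np.kron(Y, Y) + g5 * np.kron(X, X))

assert np.allclose(H, H.conj().T)   # a valid (Hermitian) environment
vals, vecs = np.linalg.eigh(H)      # reference spectrum for the fidelity check
tau = np.pi / (vals[-1] - vals[0])  # scale so the gaps of tau*H stay within pi
U_env = vecs @ np.diag(np.exp(-1j * tau * vals)) @ vecs.conj().T
assert np.allclose(U_env @ U_env.conj().T, np.eye(4))  # unitary evolution, Eq. (3)
```

The eigenpairs from `np.linalg.eigh` serve as the classical reference against which the protocol's approximate eigenvectors can be scored.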
We define the fidelity of each approximation analogously to the previous cases. To obtain a data set to evaluate the performance of our protocol, we perform ten independent experiments. These data are collected in Appendix Table 6 (see supplemental material). The average fidelities that we obtain are F_00 = 0.941, F_01 = 0.933, F_10 = 0.929, F_11 = 0.935; the average number of iterations is N = 1396; and the mean errors are c_00 = 29, c′_00 = 19, and c_01 = 18. Therefore, in this case we obtain the four eigenvectors with fidelities larger than 0.92 in less than 1500 single-shot measurements, which corresponds to at most 6 expectation-value measurements, not enough for a classical-quantum algorithm that relies on the optimization of expectation values. For the VQE algorithm, we choose 2000 shots per step with 77 COBYLA iterations, which means 154000 single-shot measurements, obtaining a fidelity of 0.918 for the ground state. For 8192 shots per step, VQE needs 88 COBYLA iterations to converge, which means 720896 single-shot measurements, obtaining a fidelity of 0.944. In this case, VQE cannot surpass the performance of our algorithm, and uses more than 100 times the resources of our proposal for the ground state alone.
For an n-qubit observable (n > 2), we can use the same protocol but consider more measurement outcomes, which implies more stages in the algorithm.
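This staged generalization can be expressed compactly: in stage j the agent starts from |j⟩, outcomes m > j trigger the update rule, and all earlier outcomes are counted as errors. A minimal sketch (the helper name `stage_plan` is illustrative):

```python
def stage_plan(n):
    """Stages of the n-qubit protocol: pairs (j, active outcomes).
    Stage j starts from |j>; outcomes m < j are counted as errors,
    and m = j is the rewarded outcome."""
    d = 2 ** n
    return [(j, set(range(j + 1, d))) for j in range(d - 1)]

# the two-qubit case recovers the three stages described in the text
print(stage_plan(2))  # [(0, {1, 2, 3}), (1, {2, 3}), (2, {3})]
```

The number of stages grows as 2^n − 1, which is the main cost of extending the protocol to larger observables.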

Conclusions
In this work, we satisfactorily implement the approximate eigensolver of Ref. 44 using the IBM quantum computer. For the single-qubit case, we obtain fidelities larger than 0.97 for both eigenvectors using around 200 single-shot measurements. For the two-qubit case, we use around 1500 single-shot measurements to obtain the approximation of the four eigenvectors with fidelities over 0.9. Due to the stochastic nature of this protocol, we cannot ensure that the approximation converges asymptotically to the eigenvectors with the number of iterations. Nevertheless, it is useful to obtain a fast approximation to use as a guess in another eigensolver that can reach maximal fidelity, like the eigensolver of Ref. 43 . Also, we compare the performance of our proposal with the VQE algorithm: VQE, in general, gets better fidelities in the single-qubit case but uses more than 100 times the number of resources of our algorithm. For two qubits, the advantage of VQE in maximal fidelity is small in comparison with our algorithm, but again, VQE needs considerably more resources, i.e., more than 1000 times the resources used by our algorithm to obtain all the eigenvectors. Also, the performance of the VQE algorithm depends on the variational ansatz used, which is not the case for our algorithm; this dependence allows the performance of VQE to be enhanced by a better ansatz. The main goal of our algorithm is to obtain a high-fidelity approximation of all the eigenvectors with a low number of single-shot measurements, making it suitable for current NISQ devices.