Quantum circuit architectures via quantum observable Markov decision process planning

Algorithms for designing quantum circuit architectures are important steps toward practical quantum computing technology. Applying agent-based artificial intelligence methods to quantum circuit design could improve the efficiency of quantum circuits. We propose a quantum observable Markov decision process planning algorithm for quantum circuit design. Our algorithm does not require state tomography, and hence has low readout sample complexity. Numerical simulations of entangled state preparation and energy minimization are demonstrated. The results show that the proposed method can be used to design quantum circuits that prepare target states and minimize energy.


Introduction
Quantum computers are attracting attention as computers whose computing power surpasses that of classical computers [P18,AAB19]. In fact, algorithms that efficiently solve specific problems, such as Grover's algorithm [G96] and Shor's algorithm [S94], have been proposed. In recent years, variational quantum algorithms [C21] have been actively researched, and quantum technology has been applied to various fields such as chemistry [PMS14,KMT17] and machine learning [DB18, MNK18, SK19, HCT19]. However, the design of a quantum circuit for solving a specific task under hardware constraints requires effort [SBM06, FM17, LSJ15, SSP14, MFM08, AAH16, HNYN11], sometimes relying on empirical rules and domain knowledge as well.
Reinforcement learning (RL) [SB18,RN21] has been successful in areas such as robot control [KCC13] and games [MKS15,SSS17]. Since RL can potentially solve complicated control problems, recent research has applied RL to the control of quantum systems [BSK21,SEL21,NBS19,NY17,HWN21]. Most of these studies consider low-level control at the hardware (Hamiltonian) level. But it is also important to control at the circuit level [NMM18], which is a higher level of abstraction [AU22], in order to perform concrete quantum computation. For simple circuits, it has been demonstrated that closed-loop control can lead to better control performance for trapped-ion quantum processors [NMM18]. State-of-the-art ion-trap qubits have coherence times of more than 10 min [WUZ17,WLQ21], which provides enough running time for an online decision process on a classical computer.
In this paper, we consider applying RL to more general quantum feedback control at the circuit level. The basic RL algorithms solve the Markov Decision Process (MDP), where the current state of the agent can be exactly known from the observation of the environment. But for a quantum system, the Born rule asserts that an observation result is drawn from a probability distribution over the state space. Therefore, it is necessary to formulate the problem as a partially observable one. The Quantum Observable Markov Decision Process (QOMDP) [BBA14,C16,YY18,YFY21] was proposed as a quantum extension of the Partially Observable Markov Decision Process (POMDP) framework for classical partially observable problems [PT87,RN21], but no specific application of QOMDP has been proposed. Our QOMDP planning approach is Bayesian and does not rely on state tomography [NC11,YC21,KFC21] or expectation evaluation [ZHZY20,PT20,MLWEV21]. Hence it improves the quantum machine sample complexity per time step to $O(1)$, compared with shadow tomography [A18], whose sample cost depends on the number of observables $N_{\mathrm{obs}}$ and the accuracy $\epsilon$. However, our approach still requires exponentially expensive classical planning.
In this study, we formulate quantum control at the circuit level as a QOMDP reinforcement learning problem and solve the quantum circuit design problem [K22]. The exact QOMDP Bellman equation for value iteration is derived. As a concrete algorithm, we propose a QOMDP planning algorithm with reference to planning in POMDP. In exact QOMDP planning, there are three computationally intractable parts. Firstly, the size of the history set grows exponentially in time. Secondly, the Hilbert space is an uncountable set. Thirdly, the Hilbert space dimension grows exponentially with respect to the circuit width. We introduce the point-based value iteration (PBVI) algorithm from classical POMDP planning to make approximate planning tractable and resolve the first and second issues. For the quantum Hilbert space, we perform exact filtering and do not make any approximation; hence the calculations involving the belief state scale exponentially with respect to the number of qubits. We further consider the circuit design problem for two types of applications: state preparation and energy minimization. The proposed algorithm was able to prepare the Bell state and GHZ states [GHZ89] for state preparation. Regarding energy minimization, it was able to discover low-energy states of H2 and H-He+. The experimental results show the applicability of QOMDP to quantum control at the circuit level. Compared to the variational quantum eigensolver (VQE) [PMS14,MRB16,KMT17,C21] approach, where the variational ansatz has to be chosen empirically, the QOMDP approach allows automatic search over a wide range of possible ansatzes. This paper is organized as follows. Related works are reviewed in section 2. The QOMDP planning algorithm is introduced in section 3. Numerical experiments are presented and analyzed in section 4, followed by a concluding section.

Related work
Quantum circuit synthesis has been addressed in many works without using RL.

Methods
The overview of our QOMDP-PBVI method is depicted in figure 1. The offline planning is computed with a classical simulator. The output of the offline planning is a $\Gamma$ matrix set $\tilde{\Gamma}$ which approximates the value function. The set $\tilde{\Gamma}$ is then stored in a classical agent, and the agent is able to make online decisions on a hybrid quantum-classical computer. The theory and algorithm are explained in the following sections.
$\{R_a\}$ is the set of reward operators. When action $a$ is executed in state $|s\rangle$, observation $o$ is obtained with probability

$$p(o \mid |s\rangle, a) = \| A_o^a |s\rangle \|^2, \qquad (1)$$

and the state is updated to

$$|s'\rangle = \frac{A_o^a |s\rangle}{\| A_o^a |s\rangle \|}. \qquad (2)$$

The reward of executing action $a$ in state $|s\rangle$ is calculated by

$$R(|s\rangle, a) = \langle s | R_a | s \rangle. \qquad (3)$$

$\gamma$ is the discount rate. $|s_0\rangle$ is the initial state. Regarding the interaction between the agent and the environment in QOMDP, the agent selects an action according to the policy and executes the action on the environment. The operation $A_o^a$ corresponding to the action $a$ performed on the environment is executed, and the observation $o$ is fed back to the agent. The agent also receives a reward according to equation (3). The above action-observation-reward sequence constitutes a single time step, and it is repeated until the end of the episode. The agent's goal is to maximize the expected future reward. Note the relationship between POMDP and QOMDP: the state in QOMDP corresponds to the belief state in POMDP, and the state update in equation (2) corresponds to the belief state update in POMDP. Therefore, it is natural to extend planning methods for POMDP into planning methods that solve QOMDP. In the next section, we propose a planning algorithm for QOMDP based on this idea.
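To make the interaction loop concrete, here is a minimal sketch of one QOMDP time step, assuming dense NumPy arrays for states and Kraus operators; `qomdp_step` and its arguments are names we introduce for illustration, not the paper's code.

```python
import numpy as np

def qomdp_step(state, kraus_ops, reward_op, rng):
    """One QOMDP time step: sample o by the Born rule, update the state
    by equation (2), and compute the reward of equation (3)."""
    reward = float(np.real(np.vdot(state, reward_op @ state)))   # <s|R_a|s>
    branches = [A @ state for A in kraus_ops]
    probs = np.array([np.real(np.vdot(v, v)) for v in branches]) # ||A_o^a|s>||^2
    o = rng.choice(len(branches), p=probs / probs.sum())
    next_state = branches[o] / np.linalg.norm(branches[o])       # equation (2)
    return o, reward, next_state

# Example: a one-qubit action applying H, then a computational-basis
# measurement; note sum_o A_o^dag A_o = I holds for these Kraus operators.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
rng = np.random.default_rng(0)
print(qomdp_step(np.array([1.0, 0.0]), [P0 @ H, P1 @ H], np.eye(2), rng))
```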
Firstly, we derive the value function of QOMDP. Let $\pi: S \times A \to [0, 1]$ be the policy in a QOMDP $Q$, defined by

$$\pi(|s\rangle, a) = \Pr(a_t = a \mid s_t = |s\rangle). \qquad (4)$$

The value function for $Q$ is calculated by

$$V_q^{\pi}(|s\rangle) = \mathbb{E}\bigg[ \sum_{t=0}^{\infty} \gamma^{t} R(|s_t\rangle, a_t) \,\bigg|\, |s_0\rangle = |s\rangle,\ \pi \bigg]. \qquad (5)$$

Since it is known that the value function can be expressed as a simple piecewise-linear and convex function in the classical POMDP, it is natural to expect that the value function can be expressed in some simple form in QOMDP as well. In the following, we derive the expression of the value function in QOMDP. Let $h_t \in H_t$ be the history up to time step $t$, expressed by

$$h_t = (a_1, o_1, a_2, o_2, \ldots, a_t, o_t). \qquad (6)$$

Let $\sigma: S \times H_t \to S$ be the mapping from the initial state $|s\rangle$ and the history $h_t$ to the transitioned state. This function can be calculated by

$$\sigma(|s\rangle, h_t) = \frac{A_{o_t}^{a_t} \cdots A_{o_1}^{a_1} |s\rangle}{\big\| A_{o_t}^{a_t} \cdots A_{o_1}^{a_1} |s\rangle \big\|}. \qquad (7)$$

The probability of obtaining the history $h_t$ given the initial state $|s\rangle$ is calculated by

$$p(h_t \mid |s\rangle) = \prod_{k=1}^{t} \pi\big(\sigma(|s\rangle, h_{k-1}), a_k\big) \, \big\| A_{o_t}^{a_t} \cdots A_{o_1}^{a_1} |s\rangle \big\|^2. \qquad (8)$$

The value function is calculated using equations (3), (7), and (8). The detailed derivation is presented in the appendix. The result is

$$V_q^{\pi}(|s\rangle) = \langle s | \Gamma_{\pi} | s \rangle. \qquad (9)$$

Equation (9) shows that in QOMDP the value function can be expressed in the form of the expectation value of a $\Gamma$ matrix with respect to the state. Since the optimal value function $V_q^{*}(|s\rangle)$ is the maximum of the value function over policies,

$$V_q^{*}(|s\rangle) = \max_{\pi} \langle s | \Gamma_{\pi} | s \rangle. \qquad (10)$$
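As a small illustration of equations (7)-(9), the Born-rule factor of the history probability and the quadratic-form value can be computed as follows (a sketch under the same dense-array assumption; the function names are ours).

```python
import numpy as np

def history_born_factor(state, kraus_sequence):
    # The Born-rule factor of equation (8) for a fixed action sequence:
    # || A_{o_t}^{a_t} ... A_{o_1}^{a_1} |s> ||^2
    v = state
    for A in kraus_sequence:
        v = A @ v
    return float(np.real(np.vdot(v, v)))

def value_from_gamma(state, gamma_matrix):
    # Equation (9): V(|s>) = <s| Gamma |s>, a quadratic form in the state
    return float(np.real(np.vdot(state, gamma_matrix @ state)))
```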

Point-based value iteration algorithm
Since the policy $\pi$ has continuous parameters, $\Gamma_{\pi}$ also has continuous parameters. Therefore the set of all $\Gamma$ matrices, $\tilde{\Gamma}_{\mathrm{all}} = \{\Gamma_{\pi} \mid \pi\}$, becomes uncountably infinite. Since the maximization in

$$V_q^{*}(|s\rangle) = \max_{\Gamma \in \tilde{\Gamma}_{\mathrm{all}}} \langle s | \Gamma | s \rangle \qquad (11)$$

cannot be calculated over an uncountably infinite set, the optimal value function in equation (11) is approximated as follows using a finite set $\tilde{\Gamma}$ of $\Gamma$ matrices:

$$V_q^{*}(|s\rangle) \approx \max_{\Gamma \in \tilde{\Gamma}} \langle s | \Gamma | s \rangle. \qquad (12)$$

As we have confirmed that the value function can be expressed as equation (12), next we explain how to update this value function. When we update the value function, we update the $\Gamma$ matrix set $\tilde{\Gamma}$. The value function can be calculated from the previous value function by the Bellman equation as follows:

$$V'(|s\rangle) = \max_{a \in A} \Big[ \langle s | R_a | s \rangle + \gamma \sum_{o \in O} \big\| A_o^a |s\rangle \big\|^2 \, V\!\Big( \frac{A_o^a |s\rangle}{\| A_o^a |s\rangle \|} \Big) \Big] \qquad (13)$$

$$= \max_{a \in A} \Big[ \langle s | R_a | s \rangle + \gamma \sum_{o \in O} \max_{\Gamma \in \tilde{\Gamma}} \langle s | (A_o^a)^{\dagger} \Gamma A_o^a | s \rangle \Big], \qquad (14)$$

where the second line uses equation (12) and the identity $\langle s | (A_o^a)^{\dagger} \Gamma A_o^a | s \rangle = \| A_o^a |s\rangle \|^2 \langle s' | \Gamma | s' \rangle$ for the normalized successor state $|s'\rangle$. For each state $|s\rangle$ and action $a$, we define

$$\Gamma_a^{|s\rangle} = R_a + \gamma \sum_{o \in O} (A_o^a)^{\dagger} \Gamma_o^{a,|s\rangle} A_o^a, \qquad \Gamma_o^{a,|s\rangle} = \operatorname*{arg\,max}_{\Gamma \in \tilde{\Gamma}} \langle s | (A_o^a)^{\dagger} \Gamma A_o^a | s \rangle. \qquad (15)$$
The $\Gamma$ matrix set $\tilde{\Gamma}$ will be updated by

$$\tilde{\Gamma}' = \bigcup_{|s\rangle \in S} \Big\{ \operatorname*{arg\,max}_{\Gamma_a^{|s\rangle},\, a \in A} \langle s | \Gamma_a^{|s\rangle} | s \rangle \Big\}. \qquad (16)$$

However, it should be noted here that equation (16) cannot be calculated because $S$ is an uncountably infinite space. Therefore, it is necessary to update the $\Gamma$ matrix set $\tilde{\Gamma}$ without using equation (16) directly. In this research, we propose an algorithm that updates the $\Gamma$ matrix set based on point-based value iteration [PGT03], a classical POMDP planning method.
In this section, we propose our QOMDP planning algorithm based on the classical POMDP planning algorithm PBVI [PGT03]. In the point-based method, the problem that the union in equation (16) cannot be calculated is dealt with by approximating the uncountable state space. Since the state space $S$ is a Hilbert space and has infinitely many elements, we approximate it with a finite set of states $\tilde{S}$. As a result, the calculation of equation (16) can be performed and the $\Gamma$ matrix set $\tilde{\Gamma}$ can be updated as follows:

$$\tilde{\Gamma}' = \bigcup_{|s\rangle \in \tilde{S}} \Big\{ \operatorname*{arg\,max}_{\Gamma_a^{|s\rangle},\, a \in A} \langle s | \Gamma_a^{|s\rangle} | s \rangle \Big\}. \qquad (17)$$
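The following sketch implements the point-based backup of equations (15)-(17) for a finite state set, again assuming dense NumPy matrices; `pbvi_backup` is a hypothetical name, and a fuller implementation would also record the action attached to each returned $\Gamma$ matrix for later use by the policy.

```python
import numpy as np

def pbvi_backup(states, gamma_set, reward_ops, kraus, discount):
    """One point-based backup of the Gamma matrix set (equation (17)).

    states:     finite state set S~, a list of unit vectors
    gamma_set:  current list of Gamma matrices
    reward_ops: dict action -> reward operator R_a
    kraus:      dict action -> list of Kraus operators [A_o^a for all o]
    discount:   the discount rate gamma
    """
    new_set = []
    for s in states:
        best_val, best_gamma = -np.inf, None
        for a, ops in kraus.items():
            g = reward_ops[a].astype(complex)
            for A in ops:
                # <s|A^dag G A|s> = p(o) <s'|G|s'>, so maximizing the
                # back-projected form picks the best Gamma at the successor.
                back = [A.conj().T @ G @ A for G in gamma_set]
                j = int(np.argmax([np.real(np.vdot(s, B @ s)) for B in back]))
                g = g + discount * back[j]                  # equation (15)
            val = float(np.real(np.vdot(s, g @ s)))
            if val > best_val:
                best_val, best_gamma = val, g
        new_set.append(best_gamma)   # one Gamma per point: |new_set| = |S~|
    return new_set
```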
The approximate state set $\tilde{S}$ is expanded alternately with the update of the value function in each iteration.
For the expansion, new candidate states are generated from each state in $\tilde{S}$, and the shortest distance from each candidate to the current set, measured by a distance between quantum states, is used in this work. The candidate with the largest distance among the obtained shortest distances is added to $\tilde{S}$ as a new state. Since the state set is expanded by executing the above process for all the states belonging to the state set before the expansion, the size of the set at most doubles. The initial condition for the matrix set is given by equation (18). The value function is updated as many times as the horizon length, then the state set is expanded. These value function updates and state set expansions are executed alternately. The pseudocode is shown in figure 2: Algorithm 1 for state set expansion and Algorithm 2 for value function update.
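A sketch of the expansion step follows. For illustration we assume a fidelity-based distance $1 - |\langle s | s' \rangle|^2$ and one sampled successor per action; the paper's Algorithm 1 may differ in these details.

```python
import numpy as np

def expand_state_set(states, kraus, rng):
    """Greedy expansion of S~: from each state, simulate one random step
    per action and keep the candidate farthest from the current set."""
    new_states = list(states)
    for s in states:
        candidates = []
        for ops in kraus.values():
            probs = [float(np.real(np.vdot(A @ s, A @ s))) for A in ops]
            o = rng.choice(len(ops), p=np.array(probs) / sum(probs))
            v = ops[o] @ s
            candidates.append(v / np.linalg.norm(v))
        # shortest distance from a candidate to the current set
        def dist_to_set(c):
            return min(1.0 - abs(np.vdot(c, t)) ** 2 for t in new_states)
        best = max(candidates, key=dist_to_set)
        if dist_to_set(best) > 1e-9:    # skip (near-)duplicates
            new_states.append(best)     # at most one new state per old one
    return new_states
```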

Policy for decision making
In this section, we explain the policy, i.e., how to decide an action based on the updated value function. Let $\tilde{\Gamma}$ be the $\Gamma$ matrix set after executing the point-based value iteration algorithm. The value function is represented by

$$V(|s\rangle) = \max_{\Gamma \in \tilde{\Gamma}} \langle s | \Gamma | s \rangle. \qquad (19)$$

In equations (17)-(19), there is an action corresponding to each $\Gamma$ matrix. The optimal action $a^*$ is decided as the action corresponding to the highest-valued $\Gamma$ matrix:

$$a^* = a\Big( \operatorname*{arg\,max}_{\Gamma \in \tilde{\Gamma}} \langle s | \Gamma | s \rangle \Big), \qquad (20)$$

where $a(\Gamma)$ denotes the action associated with $\Gamma$ in the update of equation (17). Notice that the elements of $\tilde{\Gamma}$ are only indexed by $|s\rangle \in \tilde{S}$, so the size of the set is $|\tilde{\Gamma}| = |\tilde{S}|$. For a real quantum device, the agent updates the value function by point-based value iteration using only a classical computer, and then executes the action decided by the policy on the real device. The agent executes an action calculated by the policy, gets an observation from the real device, updates the belief state by equation (2) using the action and the observation, and calculates the next action based on the updated belief state.
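The greedy policy of equation (20) reduces to an argmax over quadratic forms. In this sketch, `gamma_actions` is the per-matrix action bookkeeping that the backup would record (an assumption about the data layout, not the paper's code).

```python
import numpy as np

def select_action(state, gamma_set, gamma_actions):
    """Greedy policy (equation (20)): the action attached to the
    highest-valued Gamma matrix at the current belief state."""
    values = [float(np.real(np.vdot(state, G @ state))) for G in gamma_set]
    return gamma_actions[int(np.argmax(values))]
```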

Complexity analysis
In this section, we explain the computational complexity of the point-based method. We first note the sample complexity advantage of our method over traditional state-tomography-based methods [KFC21].
As noted in the introduction, exact QOMDP planning has three intractable parts. (1) The size of the history set grows as $O((|A|\,|O|)^t)$, i.e., exponentially in time. (2) The Hilbert space $S$ is uncountably infinite. (3) The Hilbert space dimension is $|S| = 2^n$ for $n$ qubits. We use a finite set approximation to tackle the first two intractabilities. We employ the notation that $|A|$ is the number of actions, $|O|$ is the number of observations, $|\tilde{\Gamma}|$ is the number of $\Gamma$ matrices in the previous update step, and $|\tilde{S}|$ is the number of states in the state set. Evaluating the point-based backup of equation (17) then requires $O(|\tilde{S}|\,|A|\,|O|\,|\tilde{\Gamma}|)$ back-projections $(A_o^a)^{\dagger} \Gamma A_o^a$, each a product of dense $2^n \times 2^n$ matrices.
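Under this notation, a back-of-the-envelope operation count for one point-based backup, matching the loop structure of the backup sketch above (the factor 2 counts the two matrix products in each back-projection), looks as follows.

```python
def backup_flops(n_actions, n_obs, n_gamma, n_states, n_qubits):
    """Rough cost of one point-based backup: |S~| |A| |O| |Gamma~| dense
    back-projections on 2^n x 2^n matrices, each ~2 matrix products."""
    d = 2 ** n_qubits
    return n_states * n_actions * n_obs * n_gamma * 2 * d ** 3

# e.g. 10 states, 10 actions, 2 outcomes, 10 Gamma matrices, 4 qubits
print(backup_flops(10, 10, 2, 10, 4))   # ~1.6e7 dense multiply-adds
```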

Applications: quantum circuit design
We define quantum circuit design as an RL problem using QOMDP. Quantum circuit design is a task to arrange the gates and qubits in a circuit in order to solve a specific problem on a quantum computer. When formulating quantum circuit design within the framework of RL, it is necessary to pay attention to the partial observability of the quantum system. Therefore, it is appropriate to use QOMDP, which can handle partial observability. We implement quantum circuit design using QOMDP as follows. We prepare two subsystems, 'System' and 'Ancilla', and apply a unitary to the whole system as an action. An observation is a measurement outcome of 'Ancilla'. The circuit is designed from an action sequence that maximizes the sum of rewards defined based on the state of 'System'. 'System' is the main system and 'Ancilla' is an auxiliary system for indirectly acquiring information about 'System'. Since the quantum state would collapse if 'System' were measured, this is a partially observable problem in which the state of 'System' is inferred by measuring 'Ancilla' without measuring 'System'.
The specific flow is as shown in figure 3(a). The agent is classically implemented, and the environment contains the quantum circuit. At each step, the agent selects a quantum gate to be executed in the circuit as an action from the action set, and then executes the action in the environment. The operations performed for a given action include executing the selected quantum gate in the circuit, measuring 'Ancilla', and obtaining the measurement result. The reward is calculated from the state of 'System', and the measurement result and the reward are fed back to the agent. The agent makes a classical update based on the obtained measurement result and chooses the next action accordingly. The above flow is repeated until the evaluation result of the task reaches the threshold value or the number of steps reaches the maximum.
In this quantum circuit design, each item of the QOMDP is as follows. $S$ is the Hilbert space of 'System+Ancilla'. $O$ is the set of measurement outcomes of 'Ancilla'. The unitaries applied for actions are shown in figure 3(b). $Rx_i(\theta)$, $Ry_i(\theta)$, and $Rz_i(\theta)$ are the Rx, Ry, and Rz gates with rotation angle $\theta$ applied to the $i$-th qubit, and $H_i$ is the Hadamard gate applied to the $i$-th qubit. $CX_{i,j}$ is the controlled-NOT gate whose control bit is the $i$-th qubit and target bit is the $j$-th qubit.
At the end of each action, a measurement is performed on the ancilla, and hence the corresponding Kraus operator $A_o^a$ is defined by

$$A_o^a = (I_{\mathrm{system}} \otimes |e_0\rangle\langle e_o|)\, U_a, \qquad (21)$$

where $I_{\mathrm{system}}$ is the identity on 'System', $U_a$ is the unitary of action $a$, and $|e_o\rangle$ is an orthonormal basis vector of the state space of 'Ancilla'; the ancilla is reset to $|e_0\rangle$ after the measurement. $\{R_a\}$ is defined for each task so that the reward can be calculated. $\gamma$ is the discount rate. $|s_0\rangle$ is $|0\rangle$. We note that this approach requires fast resetting of qubits, which can be done for superconducting qubits [YT21].
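A sketch of this construction: given the total unitary $U_a$ of an action, the Kraus operators of equation (21) are obtained by projecting the ancilla onto each basis outcome and resetting it. The qubit ordering ('System' qubits as the leading tensor factor) and the function name are our assumptions.

```python
import numpy as np

def kraus_from_gate(U, n_system, n_ancilla=1):
    """Build {A_o^a} for one action: apply the action unitary U to
    System+Ancilla, measure 'Ancilla' in the computational basis, and
    reset it to |e_0>."""
    d_sys, d_anc = 2 ** n_system, 2 ** n_ancilla
    I_sys = np.eye(d_sys)
    ops = []
    for o in range(d_anc):
        e0 = np.zeros((d_anc, 1)); e0[0, 0] = 1.0
        eo = np.zeros((1, d_anc)); eo[0, o] = 1.0
        # A_o = (I_system (x) |e_0><e_o|) U : outcome o, then ancilla reset
        ops.append(np.kron(I_sys, e0 @ eo) @ U)
    return ops   # completeness: sum_o A_o^dag A_o = U^dag U = I
```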
In this paper, we demonstrate two quantum circuit design examples: state preparation and energy minimization. Each task is explained as follows.

Task 1: state preparation

State preparation is a task to design a quantum circuit that generates a target state $|s_{\mathrm{target}}\rangle$. This task is implemented by using fidelity as the reward in the above quantum circuit design: the reward is the fidelity between the 'System' state and the target state [A21]. The experiments were averaged over 10 different random seeds. The hyperparameters were set as follows. The maximum number of steps in circuit design is 100, the threshold of fidelity to end an episode is 0.99, and the hyperparameters for the value iteration algorithm in Algorithm 2 are 10 for the horizon H, 9 for the number of iterations I, and 10 for the minimum initial size of the state set N. To evaluate the results, state generation is executed 100 times using the policy obtained after the update is completed for each iteration. The evaluation is performed by averaging the obtained fidelity and the number of steps taken. Higher fidelity and fewer steps are better.
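For illustration, the reward operator for this task can be taken as the projector onto the target 'System' state extended by the identity on 'Ancilla', so that $\langle s | R | s \rangle$ equals the fidelity of the reduced 'System' state with the target; this explicit form is our assumption, not a formula quoted from the paper.

```python
import numpy as np

def fidelity_reward_op(target_state, n_ancilla=1):
    """R = |target><target| (x) I_ancilla: then <s|R|s> is the fidelity
    of the reduced 'System' state with the target (assumed form)."""
    proj = np.outer(target_state, target_state.conj())
    return np.kron(proj, np.eye(2 ** n_ancilla))

# Example: 2-qubit Bell state target on 'System'
bell = np.zeros(4); bell[0] = bell[3] = 1 / np.sqrt(2)
R = fidelity_reward_op(bell)   # acts on the System+Ancilla space
```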
First, we describe the case where the target state is a Bell state. The planning result is shown in figure 4. The experiment was performed by changing the U_ancilla gate, applied to 'Ancilla' in figure 3(b), to {Rx, Ry, Rz} gates with rotation angles {π/3, π/2.5, π/1.6, π/1.5}. The horizontal axis shows the number of iterations of the value iteration method, and the vertical axis shows the fidelity obtained and the number of steps taken when the circuit design was executed by the policy obtained from the iterations. Figures 4(a)-(c) show the results when the U_ancilla gate is set to the Rx, Ry, and Rz gates with various rotation angles. One example of the circuits obtained by performing the QOMDP planning algorithm is shown in figure 5(a). The state obtained when the circuit was executed is shown in figure 5(b).
Second, we describe the case where the target state is the GHZ state and the number of qubits of 'System' is 3. The planning result is shown in figure 6. The experiment was performed by changing the U_ancilla gate in figure 3(b) to {Rx, Ry, Rz} gates with rotation angles {π/3, π/2.5, π/1.6, π/1.5}. The horizontal axis shows the number of iterations of the value iteration method, and the vertical axis shows the fidelity obtained and the number of steps taken when the circuit design was executed by the policy obtained from the iterations. Figures 6(a)-(c) show the results when the U_ancilla gate is set to the Rx, Ry, and Rz gates with various rotation angles. One example of the generated circuits is shown in figure 7(a). The state obtained when the circuit was executed is shown in figure 7(b).
Third, we describe the case where the target state is the 4-qubit GHZ state. The presentation is similar to that of the 3-qubit case. The data is depicted in figure 8. The circuit and the generated density matrix are depicted in figure 9.
In figures 4(c), 6(c), and 8(c), we observe that the learning curves for Bell-GHZ states are constant functions and are independent of the ancilla rotation angle $\phi$ when $U_{\mathrm{ancilla}} = Rz_{\mathrm{ancilla}}(\phi)$. We also observe similar behavior in the other panels of figure 6. To explain these observations, we introduce a lemma. The proof for the lemma is provided in the appendix.

Lemma
We use the notation $n_s$ for the number of 'System' qubits. Furthermore, $n_s$-qubit Bell-GHZ states are equal superpositions of the all-zero state $|0\rangle^{\otimes n_s}$ and the all-one state $|1\rangle^{\otimes n_s}$. The all-zero state always has even parity, while the all-one state has the same parity as the number of system qubits $n_s$.
This implies that for $n_s = 3$, the target Bell-GHZ state is an equal superposition of an odd-parity state and an even-parity state. The agent would then not be able to distinguish the target state from other system states using the parity information carried by the ancilla measurements.
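The parity argument can be checked numerically: $\langle \mathrm{GHZ}_n | Z^{\otimes n} | \mathrm{GHZ}_n \rangle = (1 + (-1)^n)/2$, which vanishes for $n = 3$. A small sketch:

```python
import numpy as np
from functools import reduce

Z = np.diag([1.0, -1.0])

def mean_parity_of_ghz(n):
    """<GHZ| Z^(x)n |GHZ>: +1 means definite even parity, 0 means the
    state mixes even and odd parity equally."""
    ghz = np.zeros(2 ** n); ghz[0] = ghz[-1] = 1 / np.sqrt(2)
    P = reduce(np.kron, [Z] * n)
    return float(ghz @ P @ ghz)

for n in (2, 3, 4):
    print(n, mean_parity_of_ghz(n))   # 2 -> 1.0, 3 -> 0.0, 4 -> 1.0
```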

Task 2: energy minimization
We show the experimental results for H2 and H-He+. The Hamiltonians of the molecules are derived using OpenFermion [BM]. In these experiments, the orbital basis is STO-3G and the fermion-qubit transformation is Jordan-Wigner. Since the minimum energy of the molecule is not known in advance in the energy minimization experiment, it is difficult to set the energy threshold that ends an episode. Therefore, we set this threshold to a value that can never be reached, so the episode ends only when the number of steps reaches the maximum. The set thresholds were −2 for H2 and −10 for H-He+. The hyperparameters were set as follows. The maximum number of steps in circuit design is 10, and the hyperparameters for the value iteration algorithm in Algorithm 2 are 10 for the horizon H, 9 for the number of iterations I, and 10 for the minimum initial size of the state set N. The energy unit is Hartree for all the experiments.
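For illustration, the reward operator for this task can be taken as the negated qubit Hamiltonian on 'System' extended by the identity on 'Ancilla', so that maximizing the reward minimizes the energy expectation; the sign convention and helper name are our assumptions.

```python
import numpy as np

def energy_reward_op(h_qubit, n_ancilla=1):
    """R = -H (x) I_ancilla: maximizing <s|R|s> minimizes the 'System'
    energy expectation (sign convention assumed for illustration)."""
    return np.kron(-h_qubit, np.eye(2 ** n_ancilla))

# Usage sketch: h_qubit would be the dense matrix of the Jordan-Wigner
# transformed molecular Hamiltonian, e.g. from OpenFermion via
# openfermion.get_sparse_operator(qubit_hamiltonian).toarray().
```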
To evaluate the results, the energy minimization circuit design is executed 100 times using the policy obtained after the update is completed for each iteration. The evaluation is performed by averaging the obtained energy expectation values; a smaller energy expectation value is better.
First, we describe the energy minimization circuit design experiment for H2. The planning result for H2 with bond length 1.0 Å is shown in figure 10. The experiment was performed with the U_ancilla gate in figure 3(b) as the Ry gate with rotation angles {π/3, π/2.5, π/1.6, π/1.5}. The horizontal axis shows the number of iterations of the value iteration method, and the vertical axis shows the final energy expectation value when the circuit design was executed by the policy obtained from the iterations. Figure 10 demonstrates that our algorithm can solve a simple quantum circuit design problem if a suitable ancilla unitary is chosen. One example of the generated circuit is shown in figure 11(a). The change of the energy expectation value when the circuit was executed is shown in figure 11(b). In figure 11(b), the horizontal axis shows the number of steps in the episode, and the vertical axis shows the energy expectation value.
The next experiment was conducted with bond lengths spaced by 0.1 Å from 0.2 Å to 3.0 Å. The result is shown in figure 12. The experiment was performed with the U_ancilla gate in figure 3(b) as the Ry gate with rotation angles {π/3, π/2.5, π/1.6, π/1.5}. The horizontal axis shows the bond length, and the vertical axis shows the energy value. Since a policy is obtained at each rotation angle and each iteration, the best policy is the one with the smallest average energy expectation value over the four possible angles and nine iterations. In figure 12, the average and minimum energy expectation values over 100 executions of the circuit design using the best policy are plotted. The exact minimum energy obtained by diagonalizing the Hamiltonian of H2 is also plotted in figure 12, where the minimum energy obtained by our method is represented by black dots and the exact minimum energy by the orange line. A kink is observed around 1.5 Å in the QOMDP curve. Similar phenomena appear in VQE calculations for the LiH molecule [KMT17]. This might be due to incorrectness of the spin wavefunction [STS20], which might be improved by modifying the QOMDP action space. Notice that the VQE algorithm has polynomial complexity with respect to the number of qubits, while our algorithm requires exponentially large classical planning resources. VQE simulation achieving chemical accuracy (0.0016 Hartree) for the H2 molecule has been reported [KMT17] with circuit depth below 10 and fewer than 10000 function calls. Our algorithm fails to reach high accuracy around bond length 1.5 Å. The potential advantage of our algorithm is that the agent can automatically search for an ansatz instead of relying on a human-designed ansatz based on prior knowledge. However, there is no guarantee that the QOMDP agent can always find the global minimum.
Second, we describe the case where the molecule is H-He+. The presentation is similar to that for H2. The planning result and the execution result are depicted in figures 13 and 14. The minimum energy is depicted in figure 15. Figure 15 shows that for all bond lengths, the minimum energy obtained by QOMDP, represented by the black dots, is almost the same as the exact minimum energy represented by the orange line.

Conclusion
In this work, a QOMDP-based planning algorithm is designed to solve the problem of quantum circuit architecture search. Point-based approximation is used to resolve the intractability due to the planning history and the continuous Hilbert space. We implement the algorithm, and the simulation results suggest that it can successfully find circuits that produce entangled states and minimize energy functionals for simple molecules. Our algorithm requires only a small number of readouts from quantum circuits for online decision making. However, it costs exponentially large classical resources in the planning stage. One possible approach to scaling up our method is to equip the classical agent with a tensor network simulator [CHHGK21] to tackle the exponential scaling with respect to the circuit width. Future investigations are required to make the method suitable for large-scale quantum computations.