Optimizing ZX-Diagrams with Deep Reinforcement Learning

ZX-diagrams are a powerful graphical language for the description of quantum processes with applications in fundamental quantum mechanics, quantum circuit optimization, tensor network simulation, and many more. The utility of ZX-diagrams relies on a set of local transformation rules that can be applied to them without changing the underlying quantum process they describe. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications. However, finding an optimal sequence of transformation rules is generally an open problem. In this work, we bring together ZX-diagrams and reinforcement learning, training an agent whose policy is encoded as a graph neural network to reduce the number of nodes in random ZX-diagrams; the learned strategy outperforms a greedy strategy, simulated annealing, and handcrafted state-of-the-art optimizers, and generalizes to diagrams much larger than those seen during training.


Introduction
ZX-calculus is a diagrammatic language for the representation of quantum processes as graphs equipped with a set of local transformation rules. Due to the utility of these transformation rules, ZX-calculus has been applied to a wide range of problems ranging from fundamental quantum mechanics [1] over the description of measurement-based quantum computing [2] and analyzing variational quantum circuits [3] to quantum error correction [4,5]. In particular, ZX-calculus has proven a promising candidate for speeding up tensor network simulations [6] and quantum circuit optimization [7][8][9][10][11]. However, finding the optimal sequence of transformation rules to achieve a given task is often non-trivial. Therefore, we bring together ZX-diagrams with reinforcement learning (RL), a machine learning technique in which an agent iteratively interacts with an environment to learn a policy predicting an optimal sequence of actions. RL has been successfully applied to various domains such as game playing [12,13], robotics [14,15], quantum chemistry [16,17], and problems in quantum computing like quantum error correction [18][19][20], quantum control [21,22], and circuit optimization [23,24]. For optimizing ZX-diagrams, RL has two advantages over other machine learning methods: First, it doesn't require a training set of optimized diagrams, which is not available in our case. Second, by iteratively rewriting the diagram using RL instead of directly generating the optimized diagram in a single shot using supervised methods, we can verify the equivalence of the unoptimized diagram and its optimized counterpart, which is otherwise exponentially hard in the number of inputs and outputs of the diagram. Ensuring this equivalence is crucial for the task we set out to solve.
To capitalize on the graph structure of ZX-diagrams, we encode the policy of our reinforcement learning agent as a graph neural network (GNN) [25]. As a proof-of-principle application, we train the agent to reduce the number of nodes in random ZX-diagrams for the following reasons:
• It is intuitive for humans to understand, and good heuristic algorithms are available for benchmarking.
• It allows the use of a complete set of transformation rules, making it a more fundamental task as opposed to more specialized problems such as quantum circuit optimization [10].
We show that the agent learns a non-trivial strategy outperforming a custom greedy strategy, simulated annealing, and handcrafted state-of-the-art ZX-diagram optimizers originally developed for quantum circuit optimization [7,10]. Moreover, the agent's policy generalizes well to diagrams much larger than those seen during training.
Our work lays the foundation for applying the combination of RL and ZX-calculus to a broad range of tasks like minimizing the gate count of quantum circuits, speeding up tensor network simulations, or checking quantum circuit equivalence, by changing the optimization goal of the agent in future work.

ZX-diagrams
A ZX-diagram is a graph representation of a quantum process defined by an arbitrary complex matrix of size 2^k × 2^j, where j is the number of ingoing and k the number of outgoing edges of the diagram. For example, a given matrix can be represented either as a quantum circuit consisting of single-qubit X- and Z-gates and a CNOT gate or as a ZX-diagram. The central building blocks of ZX-diagrams are Z-spiders (white) and X-spiders (grey), defined as

Z(α): |0⟩^⊗m ⟨0|^⊗n + e^{iα} |1⟩^⊗m ⟨1|^⊗n,    X(α): |+⟩^⊗m ⟨+|^⊗n + e^{iα} |−⟩^⊗m ⟨−|^⊗n,    (2)

where |1/0⟩ (|+/−⟩) are the eigenvectors of the Pauli-Z (Pauli-X) matrix, α is an angle, and n and m are non-negative integers specifying the number of input and output edges of the spider. In addition to unitary operations, ZX-diagrams can also contain states and post-selected measurements. Therefore, a translation from quantum circuits into ZX-diagrams is straightforward, while the reverse is not always possible. While multiple different ZX-diagrams can describe the same underlying matrix, they can be transformed into each other by the set of local transformation rules depicted in Figure 2, which are correct up to a non-zero scalar factor [27]. These rules also imply that multiple edges connecting spiders of the same color can be reduced to just one edge and multiple edges between spiders of differing colors can be taken modulo two. Therefore, and due to the inherent symmetries of the Z- and X-spiders, ZX-diagrams can be regarded as simple graphs [28]. For a more detailed introduction to ZX-diagrams, see Appendix A.
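The spider definitions above can be turned into explicit matrices; the following numpy sketch ignores the scalar normalization conventions of the calculus and assumes each spider has at least one leg (so the two corner entries are distinct):

```python
import numpy as np

def kron_pow(m, k):
    """k-fold Kronecker power of m (k = 0 gives the 1x1 identity)."""
    out = np.array([[1.0]])
    for _ in range(k):
        out = np.kron(out, m)
    return out

def z_spider(n_in, n_out, alpha):
    """Z-spider with n_in inputs and n_out outputs:
    |0..0><0..0| + e^{i alpha} |1..1><1..1| (assumes n_in + n_out >= 1)."""
    m = np.zeros((2 ** n_out, 2 ** n_in), dtype=complex)
    m[0, 0] = 1.0                    # |0...0><0...0| term
    m[-1, -1] = np.exp(1j * alpha)   # e^{i alpha} |1...1><1...1| term
    return m

def x_spider(n_in, n_out, alpha):
    """X-spider: the same map expressed in the |+>/|-> (Hadamard) basis."""
    h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    return kron_pow(h, n_out) @ z_spider(n_in, n_out, alpha) @ kron_pow(h, n_in)
```

As a sanity check, a one-input one-output Z-spider with α = 0 is the identity, an X-spider with α = π is the Pauli-X gate, and composing two one-legged Z-spiders adds their phases, mirroring the Fuse rule.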

Optimization of ZX-diagrams as a reinforcement learning problem
Reinforcement learning (RL) is a machine learning technique where an agent iteratively interacts with an environment during a trajectory comprising multiple steps. At each step t, the agent uses its policy to select an action (in our case a graph transformation) based on an observation describing the environment's state (in our case a ZX-diagram), as depicted in Figure 1. This action then modifies the state of the environment, and a new observation and a numerical value, the reward r_t (in our case the difference in node number between the old and new diagram), are supplied to the agent. This scheme continues until the environment terminates the trajectory after a fixed number of steps or the agent chooses a special Stop action. The agent is trained by repeating two phases: During the sampling phase, the agent interacts with the environment for a fixed number of steps. Then, during the training phase, the agent's policy is updated to maximize the expected cumulative reward over a complete trajectory, ⟨∑_t γ^t r_t⟩, where γ is the discount factor [29]. To enable the use of graph neural networks to encode the agent's policy, we use a custom implementation of a state-of-the-art reinforcement learning algorithm named Proximal Policy Optimization (PPO) [30] to train the agent (for details on the algorithm and ablation studies of its features, see Appendix C).
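The discounted return ∑_t γ^t r_t maximized during training can be computed per trajectory with a standard backward recursion; a minimal sketch:

```python
def returns_to_go(rewards, gamma):
    """Discounted return sum_{t' >= t} gamma^(t'-t) * r_{t'} for every step t.
    Element 0 is the full cumulative discounted reward of the trajectory."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out
```

The backward pass reuses each partial sum, so the whole trajectory is processed in linear time.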
Each of the transformation rules of ZX-diagrams acts only on the local neighborhood of an edge or node. We can, therefore, identify each possible action of the RL agent with either

Figure 1: Schematic of the optimization loop. At each step, the reinforcement learning agent is provided with a ZX-diagram in the form of a graph. The agent then uses a graph neural network to suggest action probabilities of local graph transformations (color-coded), which act on either a unique edge (orange) or node (blue). Finally, an action is sampled from this probability distribution and applied to the diagram. In total, there are 6 separate actions per node and edge, some of which are not allowed in their local environment and, therefore, masked (grey dots). For a definition of the graph transformations, see Figure 2.
a unique node or a unique edge, as indicated by the blue lines in Figure 2. The agent's policy then predicts the unnormalized log-likelihood of each possible action. By normalizing over the whole diagram, we build a probability distribution from which we sample an action that is applied to the diagram (see Figure 1). Some of the transformations are symmetric and only implemented in one direction (arrows), resulting in one action. For example, the Color change transformation changes the color of a spider by inserting Hadamards on all connected edges. Because of the Identity transformation, an implementation in the other direction would be redundant. Other transformations need to be implemented in both directions (equal signs). For example, the Hadamard (un)fuse transformation either fuses three spiders into a single Hadamard or splits up a Hadamard into three spiders and therefore needs to be implemented as two separate actions.
The Unfuse transformation is especially challenging to implement since it requires choosing a subset of the edges connected to the selected node. As the number of edges connected to a spider is in principle unbounded, defining multiple Unfuse actions, each corresponding to separating the spider with a specific division of its edge set, is not feasible. As a solution, we split up the Unfuse transformation into multiple consecutive actions of three types. First, the start Unfuse action selects a spider. After that, the selected spider is marked by a new node feature and the agent can only select one of two actions at each step: It can either use the Mark edge action on another edge connected to the selected spider or select the stop Unfuse action. Once the stop Unfuse action is selected, the spider is split and all previously marked edges (orange edges in Figure 2) are moved to the newly created spider. The angle remains fully at the original spider. As an extension, multiple different stop Unfuse actions could be defined, each standing for a different angle of the newly created spider, to also split the angle between both spiders. Due to the symmetry of the Bialgebra right transformation, it cannot be identified with a single unique edge. Instead, it is applied if the agent selects one of the corresponding 4 colored edges. Due to its potentially global properties, the Copy transformation is only implemented in one direction. In principle, the other direction could be implemented similarly to the Unfuse action by first iteratively marking all participating nodes.
In total, the agent can choose from 6 different actions for each node and each edge of the considered diagram. Additionally, the agent can always select a global Stop action to end a trajectory if it expects that it can't optimize the diagram any further. To enable more efficient training, we mask actions that are not allowed in their local environment by setting their probability to 0. Finally, after each step, possible Identity and Hadamard loop transformations are applied automatically and redundant edges are removed. We also delete parts of the ZX-diagram that are disconnected from all ingoing and outgoing edges, as they correspond to simple scalar factors.
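Setting the probability of disallowed actions to 0 is conveniently done by assigning them a log-probability of −∞ before normalizing; a sketch (assuming at least one action is always allowed, which holds here since the global Stop action always is):

```python
import numpy as np

def masked_policy(logits, mask):
    """Softmax over per-node/per-edge logits with disallowed actions masked out.
    logits: flat array of unnormalized log-probabilities, one per candidate action.
    mask:   boolean array, True where the action is allowed in its local environment."""
    masked = np.where(mask, logits, -np.inf)
    z = masked - masked.max()   # shift for numerical stability; exp(-inf) = 0
    p = np.exp(z)
    return p / p.sum()
```

Masked actions end up with exactly zero probability, so they can never be sampled, and the gradient of the policy never flows through them.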
To encode ZX-diagrams as observations supplied to the agent, we represent them as undirected graphs with one-hot encoded node features. Each node has a color feature that can either be Z-spider, X-spider, Hadamard, Input, or Output and is, therefore, represented as a 5-dimensional vector. For example, the color feature of an X-spider would be [0, 1, 0, 0, 0]. The Input and Output nodes are used to define the ingoing and outgoing edges of the diagram by being connected to their otherwise open ends. Additionally, each node has an angle feature that can either be an unspecified placeholder angle α, a multiple of π/2, or specify that the node is not a spider and doesn't have an angle. The angle feature, therefore, is a 6-dimensional vector. The discrete multiples of π/2 are necessary to evaluate the applicability of the transformation rules depicted in Figure 2. Finally, each node has a binary feature indicating whether the node has been marked by the start Unfuse action. The complete feature vector x_0^n of each node n given to the agent's policy is then all of its features concatenated into a single 12-dimensional vector. The feature vector e_0^{(n,m)} of the edge connecting nodes n and m contains just a single number that is 0 if the edge has not been marked by the Mark edge action and 1 otherwise.
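The encoding can be sketched as follows; the ordering of the one-hot entries beyond the X-spider example above is an assumption:

```python
# 5-dim color one-hot; the X-spider entry matches the [0, 1, 0, 0, 0] example.
COLORS = ["Z", "X", "H", "IN", "OUT"]
# 6-dim angle one-hot: placeholder alpha, the four multiples of pi/2, or "no angle".
ANGLES = ["alpha", "0", "pi/2", "pi", "3pi/2", "none"]

def node_features(color, angle, marked):
    """12-dim node feature: color one-hot + angle one-hot + Unfuse mark bit."""
    c = [1.0 if color == k else 0.0 for k in COLORS]
    a = [1.0 if angle == k else 0.0 for k in ANGLES]
    return c + a + [1.0 if marked else 0.0]
```

The edge feature is then just the single mark bit, e.g. `[0.0]` for an unmarked edge.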
Finally, a vital part of every RL algorithm is the definition of the agent's reward, as it determines the optimization goal. To demonstrate the utility of our algorithm, we choose a reward that is computationally cheap to evaluate and intuitive for humans to understand but still requires non-trivial strategies to maximize: the difference in node number of the diagram before and after action application. As the sum of these differences corresponds to the total change in node number, the agent tries to minimize the total number of nodes in the diagram at the end of the trajectory. Optimally reducing the node number of ZX-diagrams includes checking whether a given ZX-diagram corresponds, up to a global phase, to the identity operation, which is a QMA-complete problem [26].
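The claim that the per-step rewards sum to the total change in node number is a telescoping sum, easy to verify directly; `node_counts` below is a hypothetical sequence of diagram sizes along a trajectory:

```python
def step_reward(nodes_before, nodes_after):
    """Reward for one action: the decrease in node number (negative if it grew)."""
    return nodes_before - nodes_after

def episode_reward(node_counts):
    """The per-step rewards telescope: their sum equals initial minus final count."""
    return sum(step_reward(b, a) for b, a in zip(node_counts, node_counts[1:]))
```

A trajectory that temporarily grows the diagram (e.g. 12 → 14 → 9 → 7) still earns the net shrinkage 12 − 7 = 5, which is why non-greedy intermediate actions can pay off.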

Neural network architecture
Figure 3 (caption fragment): ...against each action taken for the RL agent (orange), a greedy strategy (blue), and simulated annealing (green). For the RL agent and simulated annealing, multiple trajectories are plotted (transparent). The RL agent and simulated annealing significantly outperform the greedy strategy in terms of cumulative reward, with the RL agent requiring an order of magnitude fewer steps than simulated annealing (inlay). Actions taken by the RL agent that intermittently increase the node number (i.e. non-greedy actions) are indicated by arrows. (c) Average number of nodes after optimization of 1000 ZX-diagrams with 10-15 initial spiders (left), which is the size the agent was trained on, and 100-150 initial spiders (right). We compare the RL agent (orange) to simulated annealing (green), a custom greedy strategy (blue), the full_reduce function of the PyZX software package (red), a combination of full_reduce and the greedy strategy (turquoise), and a combination of full_reduce and the RL agent (yellow). Hyperparameters for simulated annealing are optimized to give good performance on two example diagrams and then kept fixed for all diagrams. The RL agent outperforms the other strategies.

The use of a graph neural network (GNN) to encode the agent's policy has several advantages: As we suppose the ideal policy depends only on the local structure of the ZX-diagram, we expect the GNN to train more efficiently and generalize better to unseen diagrams than other neural network architectures. Also, unlike a dense neural network, the GNN can handle any size of input data. Therefore, it can be efficiently trained on relatively small diagrams and later straightforwardly applied to much bigger diagrams. As input, the GNN directly takes the graph representation of a ZX-diagram. First, 6 message-passing layers [31] are applied to the graph. At each layer i, the node feature vectors x_i^n are updated according to

x_{i+1}^n = ϕ_i(x_i^n, ∑_{m ∈ N_n} ψ_i(x_i^n, x_i^m, e_i^{(n,m)})),    (3)

where N_n are the nearest neighbors of node n, and ϕ_i and ψ_i are single dense neural network layers. We also update the edge feature vectors at each layer according to

e_{i+1}^{(n,m)} = θ_i(x_i^n, x_i^m, e_i^{(n,m)}),    (4)

where θ_i is also a single dense neural network layer. After the message-passing layers, we apply the multi-layer perceptron χ_node(x_f^n) to the final features x_f^n of each node and the multi-layer perceptron χ_edge(e_f^{(n,m)}) to the final features of each edge. The networks χ_node and χ_edge have 6 output neurons each, which are interpreted as the unnormalized log-probabilities of the possible actions (see Figure 1).
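The node and edge feature updates above can be sketched as one message-passing layer; here `phi`, `psi`, and `theta` are stand-in callables for the single dense layers, and combining their arguments by concatenation (rather than some other scheme) is an assumption:

```python
import numpy as np

def mp_layer(x, e, edges, phi, psi, theta):
    """One message-passing update over an undirected graph (a sketch).
    x:     dict node -> feature vector
    e:     dict edge (n, m) -> feature vector
    edges: list of undirected edges (n, m)"""
    agg = {n: 0.0 for n in x}
    for (n, m) in edges:  # undirected: each edge sends a message both ways
        agg[n] = agg[n] + psi(np.concatenate([x[m], e[(n, m)]]))
        agg[m] = agg[m] + psi(np.concatenate([x[n], e[(n, m)]]))
    x_new = {n: phi(np.concatenate([x[n], np.atleast_1d(agg[n])])) for n in x}
    e_new = {(n, m): theta(np.concatenate([x[n], x[m], e[(n, m)]]))
             for (n, m) in edges}
    return x_new, e_new
```

Stacking six such layers means each node's final features depend on its 6-hop neighborhood, which is the locality radius analyzed later in the paper.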
As the Stop action of the agent depends not only on the local structure of the graph but also on global features, we treat it differently from the other actions by computing its unnormalized log-probability according to

χ_stop(MEAN_n(x_f^n), MEAN_{(n,m)}(e_f^{(n,m)}), C),    (5)

where χ_stop is a multi-layer perceptron, the MEAN functions are taken over the final node/edge features, and C is a vector containing global information about the amount of each node type, edges, and allowed actions.
For an efficient implementation of the GNN, we use the TensorFlow-GNN software package [32] with custom layers to handle undirected edges.For a graphical representation of the network and further details on the network architecture and implementation see Appendix D.

Training
We train the agent to reduce the node number in randomly sampled ZX-diagrams with 10-15 initial spiders (for details on the diagram sampling, see Appendix B). The agent is trained for a total of 36 × 10^6 actions. However, it already reaches its optimal performance around 9 × 10^6 actions, as shown in Figure 3 (a). To evaluate the trained agent, we sample 1000 new ZX-diagrams of the same size as the training set and optimize them for 200 steps. We then calculate the average of the minimum number of nodes found during optimization, which is significantly lower than the number of initial nodes. Next, we want to answer the question of whether the learned policy can straightforwardly be applied to larger diagrams by repeating the same evaluation on ZX-diagrams with 100-150 initial spiders. Even though the agent was only trained on diagrams an order of magnitude smaller, it can reduce the number of nodes in the diagram substantially, thereby highlighting the powerful generalization ability of GNNs [see Figure 3 (c)]. To demonstrate the need for the non-trivial strategies of the trained agent to achieve these results, we show two selected actions that initially increase the spider number but later lead to an overall positive cumulative reward in Figure 3 (d).
Training the agent takes around 41 hours on a single compute node with 32 CPUs and 2 GPUs. We run multiple environments in parallel on the CPUs during the sampling phase and train the agent distributed across both GPUs. The implementation of the algorithm could directly take advantage of larger compute nodes to speed up training.

Comparison with other techniques
To better estimate the agent's performance, we compare it with various other strategies. The greedy strategy always selects the action with the highest possible reward as long as actions with a non-negative reward are available. If multiple actions lead to the highest possible reward, the greedy strategy chooses randomly among them. Simulated annealing is a probabilistic strategy for non-convex global optimization problems [33]. We optimize its hyperparameters, i.e. the start temperature and temperature annealing schedule, by hand on two example diagrams [used for Figure 3 (b) and (e)] and then keep them fixed. For more details on the simulated annealing algorithm, see Appendix E. The PyZX strategy uses the most powerful routine of the PyZX software package, the full_reduce function, which is based on the circuit optimization algorithms presented in [7,34]. Because the full_reduce function is designed for circuit optimization, it uses an incomplete set of transformation rules and doesn't perform well on the node reduction task. Therefore, we also apply the greedy strategy (PyZX + Greedy) or the RL agent (PyZX + RL) after the full_reduce function.
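The core of the simulated annealing baseline is a Metropolis acceptance rule; the generic form below is a sketch (the exact schedule and variant used are detailed in Appendix E): improving or neutral moves are always taken, worsening moves with a probability that shrinks as the temperature drops.

```python
import math
import random

def accept(delta_reward, temperature, rng=random.random):
    """Metropolis acceptance rule for a proposed move with reward delta_reward:
    always accept non-negative moves; accept worsening moves with
    probability exp(delta_reward / temperature)."""
    if delta_reward >= 0:
        return True
    return rng() < math.exp(delta_reward / temperature)
```

At high temperature almost any move is accepted (broad exploration); as the temperature is annealed toward zero, the rule degenerates into the greedy strategy.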
To compare the different strategies, we evaluate them on the same two sets of 1000 diagrams as the RL agent. The RL agent outperforms even the best non-RL strategy [see Figure 3 (c)] without relying on human-designed, task-specific algorithms, suggesting promising results when applied to related ZX-diagram optimization tasks in future work. The PyZX + RL strategy is slightly better than the pure RL agent, indicating that some of the composite rules used in the PyZX strategy are difficult for the agent to learn.
The RL agent needs on average less than 4 s to simplify a diagram with 100-150 initial spiders running on a single GPU and a single CPU, while simulated annealing with 20000 steps needs over 36 s and the greedy strategy over 100 s, albeit running on only a CPU. Both PyZX and PyZX + Greedy are significantly faster than the RL agent. However, the run-times of the PyZX and RL approaches are expected to scale equally with the size of the diagram (see Section 5.4).
ZX-diagrams containing only Clifford spiders, i.e. spiders with phases that are multiples of π/2, can be efficiently simulated classically [35]. Therefore, it is often more important to reduce the number of non-Clifford spiders, i.e. the α spiders with an unspecified angle. Thus, we evaluate the performance of the same trained RL agent for reducing α spiders and find that it outperforms the other strategies on this task as well (see Table 2).

Analysis of learned policy
While deep neural networks have been successfully employed to solve a wide range of problems, they are often regarded as a 'black box method' due to difficulties in interpreting their learned strategies. However, it is in principle highly desirable to gain some insight into how the neural networks arrive at their predictions [36]. For graph neural networks, an interesting quantity is how local their learned strategy is, i.e. from how far away predictions on nodes or edges are influenced by the node and edge features of the diagram. Therefore, we evaluate how far away from a chosen action the ZX-diagram still influences the agent's decision.
To this end, we optimize ZX-diagrams with the agent until 1000 actions of each type are sampled. For each sampled action and the corresponding ZX-diagram, we then build up the diagram in layers around the node/edge identified with the action. Layer n is defined as all nodes that can be reached in n steps by traversing the diagram from the starting point. For each layer n and the corresponding sub-diagram spanning only nodes up to this layer, we compute the agent's unnormalized probability of sampling the original action, P_layer. We deliberately choose not to normalize the probabilities, as otherwise far-away action probabilities would influence our results through the normalization constant even though no actual information traveled through the GNN.
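Building the diagram up in layers around an action amounts to a breadth-first search truncated at depth n; a sketch over a plain adjacency-list graph:

```python
from collections import deque

def nodes_within(adj, start, n_layers):
    """All nodes reachable from `start` in at most n_layers steps (BFS),
    i.e. the node set of the sub-diagram spanning layers 0..n_layers."""
    dist = {start: 0}
    q = deque([start])
    while q:
        v = q.popleft()
        if dist[v] == n_layers:   # outermost layer: do not expand further
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return set(dist)
```

Running this for increasing `n_layers` and re-evaluating the policy on each induced sub-diagram yields the P_layer values compared against the full-diagram probability.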
In Figure 4 (a) we plot the average over the 1000 sampled actions of the quantity ϵ, which captures how different P_layer is from the unnormalized probability in the full diagram, P_complete. We define ϵ as

ϵ = max(P_layer / P_complete, P_complete / P_layer) − 1,

where ϵ = 0 indicates that P_layer and P_complete are equal. The max function is necessary to give meaningful values when averaging this quantity. We find that to predict the agent's policy with an accuracy of 1%, information from 3-5 layers is required. Results for all action types are shown in Figure 10. Finally, we compare the agent's policy in a simple scenario to the, in this case, known optimal policy. Specifically, we take a closer look at the Copy action by evaluating its probability in a class of example diagrams, as shown in Figure 4 (b). A phaseless Z-spider is connected to a phaseless X-spider with n_out additional edges. On n_extra of those edges, Z-spiders with arbitrary phase are inserted (see inlay). We plot the probability P_copy of applying the Copy action to the edge connecting the phaseless spiders against n_extra for several n_out. The agent learns this ideal strategy to good approximation even though it was only trained on random ZX-diagrams and never specifically on diagrams of the type considered here.
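A ratio-based deviation measure with the stated properties, i.e. zero when the truncated-diagram and full-diagram probabilities agree, and a max so deviations in either direction stay positive under averaging, can be sketched as follows; the paper's exact definition of ϵ may differ:

```python
def epsilon(p_layer, p_complete):
    """Deviation of the truncated-diagram action probability from the
    full-diagram one; 0 means they are identical. Taking the max of the two
    ratios keeps over- and under-estimates from cancelling in an average."""
    return max(p_layer / p_complete, p_complete / p_layer) - 1.0
```

Averaging this over many sampled actions, a value below 0.01 corresponds to the 1% accuracy threshold discussed above.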

Scaling
Each application of the GNN requires linear time in the number of nodes and edges in the diagram, after which, as currently implemented, one action is applied. When scaling to larger diagrams, this could be improved: After each evaluation of the GNN, multiple actions can be applied as long as no information has passed through the GNN between the action locations. Since, in our case, the GNN consists of 6 message-passing layers, the actions should be separated by at least 6 layers, as introduced in the previous section. How many actions can be applied simultaneously depends on the connectivity of the ZX-diagram. If the ZX-diagram is extracted from a quantum circuit or measurement-based computation, it typically does not contain long-range connections between spiders, and we expect that O(n_nodes) actions can be applied after each GNN evaluation. Using this approach, the run-time of our RL algorithm scales the same as the full_reduce algorithm of the PyZX software package [34], albeit with a worse prefactor. The total number of actions required to simplify a diagram depends heavily on its structure, and no clear statement can be made about how it depends on the number of nodes in the diagram.
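The idea of applying several actions per GNN evaluation can be sketched as a greedy scheduler that only admits action locations separated by a minimum graph distance; the function names and the exact separation criterion (pairwise distance of at least `min_dist` hops) are assumptions:

```python
from collections import deque

def too_close(adj, source, min_dist):
    """Nodes at graph distance < min_dist from `source` (truncated BFS)."""
    dist = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        if dist[v] >= min_dist - 1:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return set(dist)

def schedule_actions(adj, candidates, min_dist=6):
    """Greedily pick action locations pairwise separated by at least min_dist
    hops, so they could all be applied after a single GNN evaluation."""
    chosen = []
    allowed = set(adj)
    for c in candidates:
        if c in allowed:
            chosen.append(c)
            allowed -= too_close(adj, c, min_dist)
    return chosen
```

On sparse, circuit-like diagrams this admits a number of simultaneous actions growing with the diagram size, consistent with the O(n_nodes) estimate above.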

Outlook
In this work, we have introduced a general scheme for optimizing ZX-diagrams using reinforcement learning with graph neural networks. We showed that the reinforcement learning agent learns non-trivial strategies and generalizes well to diagrams much larger than those included in the training set. The presented scheme could be applied to a wide range of problems currently tackled by heuristic and approximate algorithms or simulated annealing.
For example, in [6] the authors speed up tensor network simulations of quantum circuits by optimizing the graph property treewidth of the corresponding ZX-diagram using simulated annealing, which could straightforwardly be replaced by a reinforcement learning agent.
In [7], a deterministic algorithm for the simplification of quantum circuits using ZX-calculus is introduced. The transformation set used is restricted to just two kinds of actions to preserve a special graph property of the ZX-diagram called gFlow, guaranteeing an efficient extraction of a quantum circuit from the optimized diagrams. Later, a heuristic modification was proposed to reduce the number of two-qubit gates in the resulting circuits [8]. Meanwhile, other gFlow-preserving rules have also been found [37]. Additionally, the notion of gFlow can be relaxed to the more general Pauli flow, which permits additional transformation rules while still allowing efficient circuit extraction [38]. However, it is currently unclear when these rules should be applied for the goal of circuit optimization. In future work, a reinforcement learning agent could be trained including all gFlow- or Pauli-flow-preserving rules with a reward dependent on the efficiently extracted quantum circuit corresponding to the diagram, thereby taking advantage of new rules and replacing human heuristics with a learned strategy. The agent's reward could, for example, be the total gate, two-qubit gate, or T-gate count.
Finally, the PyZX software package [34] has been used for quantum circuit equivalence checking using the ZX-calculus [39,40]. However, this approach is only guaranteed to work for Clifford circuits due to the limited set of transformation rules of the employed algorithm. Since no circuit needs to be extracted from the optimized ZX-diagram for quantum circuit equivalence checking, an RL agent could use a complete set of transformation rules to potentially overcome this shortcoming.
During the final preparations of this manuscript, a master's thesis using reinforcement learning for quantum circuit compilation with ZX-calculus, albeit using convolutional neural networks, was released [41].

Data availability
Python code of the custom reinforcement learning algorithm using graph neural networks and neural network weights of the trained agents are publicly available on GitHub [42].

A Details on ZX-calculus
Using the definition of the spiders in Equation (2), we can derive representations of common quantum gates, basis states, and measurements in the ZX-calculus [see Figure 5 (a)]. Single-qubit Z/X-rotation gates with arbitrary angles can be represented as single Z/X-spiders with one input and one output edge and their phase corresponding to the angle of the rotation gate. Moreover, a CNOT gate can be represented by a phaseless Z-spider connected to a phaseless X-spider. Since this is a complete gate set, it follows that any quantum circuit can be represented as a ZX-diagram. We show an exemplary translation of a quantum circuit into a ZX-diagram in Figure 6. However, not every ZX-diagram can easily be represented as a quantum circuit, since ZX-diagrams can also represent states (spiders with only output legs) and post-selected measurements (spiders with only input legs). Similar to a basis change from the Z- to the X-basis in quantum circuits, the color of spiders can be changed by inserting Hadamards on all connected edges, as shown in Figure 5 (b). For more details on the ZX-calculus, see e.g. the review article [28].

B Sampled diagrams
To enable the agent to simplify a wide range of ZX-diagrams, we sample a diverse set of diagrams during training. A typical example is shown in Figure 3 (e). Each new ZX-diagram is constructed with the following steps: First, the number of inputs and outputs is sampled uniformly between 1 and 3. Since the RL agent acts locally on the ZX-diagrams without requiring access to their underlying matrix representation, we expect the agent's learned policy to perform similarly well for diagrams with any number of input and output edges. Second, the number of initial spiders n_init is sampled uniformly between 10 and 15. The number of Hadamards is then sampled between 0 and ⌊0.2 n_init⌋. The angles of the initial spiders can be one of 0, π, π/2, and α. To determine the angles of the spiders, we uniformly sample a number between 0 and 1 for each angle type, reduce the numbers for π, π/2, and α by a factor of 0.4, and then normalize the result to a probability distribution from which we sample the angle of each spider. We then uniformly sample the expected number of neighbors n_neigh per spider between 2 and 4. From this, we compute the edge probability p_edge such that, when each possible edge in the diagram is created with probability p_edge, each spider has an expected number of n_neigh neighbors.
We then add each possible edge between all pairs of spiders to the diagram with probability p_edge. Finally, we apply the automatic actions that we also apply after each action by the RL agent, i.e. removing redundant edges, removing parts of the diagram not connected to any input or output, and applying all possible Identity and Hadamard loop transformations. For the performance evaluation of the agent on bigger diagrams, we instead sample the number of initial spiders n_init between 100 and 150. If, instead of randomly sampling ZX-diagrams, we had created them by translating quantum circuits, this spider number would correspond to circuits with, for example, up to 75 single-qubit gates and 75 two-qubit gates.
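The edge-probability computation follows from linearity of expectation: each spider has n_init − 1 potential neighbors, each present independently with probability p_edge, so p_edge = n_neigh / (n_init − 1). A sketch with hypothetical helper names:

```python
import random

def sample_diagram_params(n_init_range=(10, 15), n_neigh_range=(2, 4)):
    """Sketch of the random-diagram parameter sampling (names hypothetical).
    Each of the n_init * (n_init - 1) / 2 possible edges is later created
    independently with probability p_edge, giving each spider an expected
    (n_init - 1) * p_edge neighbors; solving for p_edge yields the formula."""
    n_init = random.randint(*n_init_range)           # uniform in [10, 15]
    n_neigh = random.uniform(*n_neigh_range)         # expected neighbors in [2, 4]
    p_edge = n_neigh / (n_init - 1)
    return n_init, p_edge
```

For the larger evaluation diagrams, passing `n_init_range=(100, 150)` yields correspondingly smaller edge probabilities at the same expected degree.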

C Details on custom PPO algorithm
PPO is an actor-critic RL method with a policy network predicting action probabilities and a critic network predicting the so-called advantage of a specific action [30]. The critic network is only used during training to reduce the variance of the gradient update steps. Due to the variable size of our observations and action space, we use a custom implementation of PPO. During the sampling phase of training, we run n_env environments in parallel for n_max steps each. Then, the agent's experiences are randomly split into minibatches of size n_minibatch, on which the agent's policy and critic networks are trained for one gradient step each. After the agent has been trained on all minibatches, they are reshuffled and another round of training starts, for a maximum of n_train steps. However, if the Kullback-Leibler divergence, estimated as in [43], between the agent's newly trained policy and the policy used in the last sampling phase gets larger than the constant c_KL, we stop the training early and start a new sampling phase. This is not a standard feature of PPO algorithms but has, e.g., been implemented in [44]. We linearly anneal both the clip range c of the PPO algorithm (as defined in [30]) and the entropy coefficient ϵ, which rewards higher entropy of the policy during training, leading to more exploration. During training, we clip all gradients to a maximum of c_absgrad and also clip the norm of the gradients of a minibatch to c_normgrad. For the gradient updates, we use the ADAM optimizer [45] with a learning rate η and exponential moment decay rates β_1 and β_2. All parameter values are summarized in Table 1; they are chosen as suggested in [30,46] and not further optimized.
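For the KL-based early stopping, a commonly used low-variance estimator, computable from the log-probabilities the PPO update already stores, is the mean of (r − 1) − log r with r the new-to-old probability ratio; whether [43] prescribes exactly this form is an assumption:

```python
import math

def approx_kl(logp_new, logp_old):
    """Estimate KL(old || new) from per-action log-probabilities of actions
    sampled under the old policy, via the mean of (r - 1) - log r with
    r = p_new / p_old. Always non-negative; zero iff the policies agree
    on the sampled actions."""
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        log_r = ln - lo
        total += (math.exp(log_r) - 1.0) - log_r
    return total / len(logp_new)
```

Training on further minibatches would then be aborted as soon as this estimate exceeds the threshold c_KL.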
We perform ablation studies on some features of the PPO algorithm by switching them off and training a new agent without them. The results are summarized in Figure 7. Entropy annealing has a significant positive impact on the agent's performance when simplifying large diagrams. As a policy with high entropy is more probabilistic, it might need more than the given 200 steps to fully simplify a large diagram. All other features don't impact performance significantly. However, we did not optimize the hyperparameters of any of the features, which might further increase the performance of the agent.

D Details on network architecture
We graphically represent the steps of the GNN in Figure 8.
In the policy network, we use 6 message-passing layers. The message functions ψ_i, the node update functions ϕ_i, the edge update functions θ_i, as well as the hidden layers of the final action prediction networks χ_node, χ_edge, and χ_stop all contain 128 neurons and use the hyperbolic tangent as activation function.
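One message-passing layer can be sketched as follows. This is a schematic NumPy illustration of the update described in Figure 8, not the trained network; the exact inputs passed to ψ_i, ϕ_i, and θ_i are assumptions for illustration.

```python
import numpy as np

def message_passing_layer(x, e, edges, psi, phi, theta):
    """One message-passing layer: psi maps (neighbor features, edge
    features) to messages, which are summed per node; phi updates node
    features from the previous features and the summed messages; theta
    updates edge features from the two endpoint features and the
    previous edge features. psi must output node-sized vectors."""
    messages = np.zeros_like(x)
    for k, (a, b) in enumerate(edges):
        # messages flow along each edge in both directions
        messages[a] += psi(np.concatenate([x[b], e[k]]))
        messages[b] += psi(np.concatenate([x[a], e[k]]))
    x_new = np.stack([phi(np.concatenate([x[n], messages[n]]))
                      for n in range(len(x))])
    e_new = np.stack([theta(np.concatenate([x[a], x[b], e[k]]))
                      for k, (a, b) in enumerate(edges)])
    return x_new, e_new
```

In the actual network, ψ_i, ϕ_i, and θ_i would be dense layers with 128 neurons and tanh activations, and the layer would be applied 6 times.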
The χ_node and χ_edge multi-layer perceptrons both have only a single hidden layer, while χ_stop has two hidden layers to better learn the more complex global Stop action. In addition to the final node/edge states x_f^n / e_f^(n,m), χ_node/χ_edge also receive as input an integer number, the stop counter. The stop counter is defined as min(20, steps left in trajectory) and tells the agent when a trajectory is about to finish because the maximum number of allowed steps is being reached.
The global vector C, which is used as part of the input of χ_stop, contains the number of nodes and the number of edges. Additionally, it holds the numbers of Z-spiders, X-spiders, Hadamards, and spiders with zero/π/arbitrary angle, as well as the number of allowed Hadamard fuse and Euler actions, all normalized by the total spider number, and the numbers of allowed Fuse, Pi, Copy, Bialgebra right, and Bialgebra left actions, all normalized by the total edge number. Finally, it contains the stop counter and a binary flag indicating whether the agent has currently selected the start Unfuse action. We find that providing the agent with global information for predicting the Stop action and for predicting the advantage through the critic network is critical for achieving stable training and avoiding exploding gradients, as the GNN can otherwise only learn local quantities of the graph.
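Assembling the global vector C can be sketched as follows. The field names and the exact ordering of the entries are assumptions for illustration; only the set of quantities and the normalization denominators follow the description above.

```python
def global_features(counts, stop_counter, unfuse_selected):
    """Build the global vector C from diagram statistics (illustrative
    sketch; dictionary keys are assumed names, not the paper's API)."""
    n_spiders = max(counts["spiders"], 1)
    n_edges = max(counts["edges"], 1)
    # quantities normalized by the total spider number
    per_spider = [counts[k] / n_spiders for k in
                  ("z_spiders", "x_spiders", "hadamards",
                   "zero_angle", "pi_angle", "arbitrary_angle",
                   "hadamard_fuse_actions", "euler_actions")]
    # allowed edge actions normalized by the total edge number
    per_edge = [counts[k] / n_edges for k in
                ("fuse_actions", "pi_actions", "copy_actions",
                 "bialgebra_right_actions", "bialgebra_left_actions")]
    return [counts["nodes"], counts["edges"], *per_spider, *per_edge,
            stop_counter, float(unfuse_selected)]
```

The resulting vector feeds into χ_stop alongside the aggregated node and edge states.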
The critic network has the same network architecture as the network predicting the probability of the Stop action but shares no weights with the policy network.
We initialize all trainable parameters of the neural network layers as recommended in [46], using an orthogonal initializer with gain √2 for all hidden layers, gain 0.01 for the action prediction networks χ_node, χ_edge, and χ_stop, and gain 1 for the final layer of the critic network.
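The orthogonal initializer can be sketched in NumPy as follows; this is a minimal illustration of the scheme (QR decomposition of a Gaussian matrix, scaled by the gain), and the paper presumably uses its framework's built-in initializer.

```python
import numpy as np

def orthogonal(shape, gain, rng=np.random.default_rng(0)):
    """Orthogonal weight initializer: QR-decompose a random Gaussian
    matrix and scale the orthonormal factor by the gain."""
    rows, cols = shape
    a = rng.normal(size=shape if rows >= cols else (cols, rows))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # fix the sign convention of the columns
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```

With gain √2 for hidden layers, the resulting weight matrix W satisfies WᵀW = 2I (for square or tall matrices), which keeps activation variance roughly stable through tanh layers.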
No optimization over the network size or parameters is performed, suggesting further possibilities for improving the performance of the RL agent.

E Details on simulated annealing
Simulated annealing is a probabilistic algorithm that iteratively transforms the ZX-diagram. At each step, it randomly selects one of all allowed actions. If the immediate reward r of the action is non-negative, the action is applied. If r is negative, the action is only accepted with probability

p_accept = exp(r/T), (7)

where T is the so-called temperature. T is typically continuously decreased during the optimization process. We choose to exponentially anneal T from the start temperature T_start, so that at optimization step n_step

T = T_start exp(−c_ann n_step), (8)

where c_ann determines the speed of the temperature decay, as this performs better on the example diagrams than linearly annealing T. This may be because the exponential temperature decay leads to a longer, nearly greedy phase of the algorithm in the later stages of the optimization. We further improve the performance of the simulated annealing algorithm by changing the reward structure of the Unfuse transformation: instead of giving 0 reward when the start Unfuse action is selected and −1 reward when the stop Unfuse action is selected, we switch the order of the two rewards. This helps the algorithm avoid selecting start Unfuse in the later, nearly greedy stages of the optimization and then getting stuck, since it would never accept the negative reward of the stop Unfuse action.
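The acceptance rule and temperature schedule can be sketched as follows; the exponential form T_start·exp(−c_ann·n_step) of the schedule is an assumption based on the description above.

```python
import math
import random

def temperature(step, t_start, c_ann):
    """Exponentially annealed temperature (assumed schedule):
    T = T_start * exp(-c_ann * n_step)."""
    return t_start * math.exp(-c_ann * step)

def accept(reward, temp, rng=random.random):
    """Metropolis-style acceptance: actions with non-negative reward
    are always applied; negative-reward actions are accepted with
    probability p_accept = exp(r / T)."""
    if reward >= 0:
        return True
    return rng() < math.exp(reward / temp)
```

Note that with c_ann = 0.01 over 200 steps (and likewise for the longer runs), the temperature decays by a factor of exp(−2) by the end of the optimization, leaving the late stages nearly greedy.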
We optimize T_start and c_ann on two diagrams which the greedy strategy cannot optimize well [the diagrams used for Figure 3 (b) and (e)] and then keep them fixed while evaluating the performance of simulated annealing on the same set of diagrams on which we evaluated the RL agent. We find that T_start = 0.5 performs well with c_ann = 0.01/0.001/0.0001 for a maximum of 200/2000/20000 optimization steps. We also tried T_start = 1, which performed similarly to T_start = 0.5 on the example diagrams but considerably worse on average, as well as even higher starting temperatures, which failed to optimize even the example diagrams. As shown in Figure 9, simulated annealing performs slightly worse on average while needing far more optimization steps than the RL agent.

Figure 3 :
Figure 3: Results. (a) Training progress as the agent is trained to reduce the node number in random ZX-diagrams. Mean cumulative reward of the agent per trajectory against total steps taken in the environment. (b) Optimization of an example ZX-diagram ten times larger than the RL agent's training diagrams. Number of nodes in the ZX-diagram against each action taken for the RL agent (orange), a greedy strategy (blue), and simulated annealing (green). For the RL agent and simulated annealing, multiple trajectories are plotted (transparent). The RL agent and simulated annealing significantly outperform the greedy strategy in terms of cumulative reward, with the RL agent requiring an order of magnitude fewer steps than simulated annealing (inlay). Actions taken by the RL agent that intermittently increase the node number (i.e. non-greedy actions) are indicated by arrows. (c) Average number of nodes after optimization of 1000 ZX-diagrams with 10-15 initial spiders (left), which is the size the agent was trained on, and (d) Two examples of non-greedy actions learned by the agent (orange lines) that lead to a positive cumulative reward via consecutive Fuse actions (blue lines). (e) Example ZX-diagram sampled from the agent's training set. The greedy strategy can reduce the node number by applying 3 Fuse actions (blue lines), while the agent further optimizes the diagram beginning with a non-greedy Pi action (orange line).

Figure 4 :
Figure 4: Analysis of the learned policy. (a) Action dependence on the local environment. 1000 actions of each type are sampled by the agent. Then, for each action and the diagram in which it was chosen, sub-diagrams are built up in layers around the node/edge identified with the action (see inlay). For each sub-diagram spanning only the nodes in a specific layer, we compute the agent's unnormalized probability P_layer of sampling the chosen action and compute its difference ϵ to the probability P_complete in the full diagram, where ϵ is defined in Equation (6). We plot the average of this difference against the number of layers for 5 action types. (b) Probability of sampling the Copy action on the blue edge in the diagram depicted in the inlay, for different numbers of outputs of the diagram n_out and different numbers of additionally inserted spiders on the outputs n_extra. The ideal strategy is to select the Copy action for n_out − n_extra ≤ 2. The agent approximately learns the ideal policy.

Figure 5 :
Figure 5: (a) Translation of common quantum gates, states, and (post-selected) measurements into corresponding ZX-diagrams. The translations are true only up to a scalar factor. Square boxes are Z-/X-rotation gates with angle α. (b) By inserting Hadamards (black boxes) on all edges connected to a spider, its color can be changed.

Figure 8 :
Figure 8: Schematic of the GNN steps. (a) One-hot encoding. Each node is represented by a one-hot encoded five-dimensional color feature, a one-hot encoded six-dimensional angle feature, and an additional number indicating whether the node was previously selected by the start Unfuse action. The three features are concatenated to form a single vector. Edges have a one-dimensional feature vector indicating whether the edge was previously selected by the Mark edge action. (b) Update of the node feature vector x_i^n at message-passing layer i. First, messages for each connected edge are computed using the ψ_i dense neural network layer. Second, the messages are combined by an element-wise sum operation. Finally, the new node feature vector x_{i+1}^n is computed using the ϕ_i dense neural network layer, which takes the combined messages and the previous node feature vector as input. (c) Update of the edge feature vector e_i^(n,k) at message-passing layer i. The new edge feature vector e_{i+1}^(n,k) is computed using the θ_i dense neural network layer, which takes the feature vectors of the two nodes connected to the edge and the previous edge feature vector as input. (d) After all message-passing layers are applied, a final multi-layer perceptron χ_node/χ_edge is applied to each node/edge respectively to predict the final unnormalized probabilities of each possible action.

Figure 9:
Figure 9: Simulated annealing. (a) Average number of nodes left after optimization through simulated annealing with start temperature T_start = 0.5, evaluated over 1000 ZX-diagrams with 10-15 starting spiders (left) and 100-150 starting spiders (right). The temperature decay factor c_ann is chosen as 0.01/0.001/0.0001 for 200/2000/20000 total steps taken respectively, which results in an acceptance probability of non-greedy actions as shown in (b) for different values of the instantaneous reward of the action.
Encoding of the local transformation rules of ZX-diagrams as actions of a reinforcement learning agent. Blue colors indicate the encoding as an action of the agent acting on either an edge or a node. Some transformations are implemented in both directions as separate actions of the reinforcement learning agent (equal signs), while some are only implemented in one direction (arrows). Three dots stand for zero or more edges. Each rule also holds with the spiders' colors inverted and in both directions. Black squares represent a Hadamard gate as defined by the Hadamard fuse transformation. During the Unfuse transformation, a spider is split into two by arbitrarily splitting up its angle between the two resulting spiders, connecting them with a new edge, and transferring a subset of the originally connected edges (orange) to the new spider. In the Copy transformation, a ∈ {0, 1}. In the Euler transformation, α_1/β_1/γ_1 are related to α_2/β_2/γ_2 by trigonometric functions as defined in [27].
The ideal strategy in this diagram is to apply the Copy action if n_out − n_extra ≤ 2, as then multiple Fuse actions are enabled, leading to a cumulative positive reward.

Table 1 :
Parameter values used in the PPO algorithm and GNN. For the definitions of γ and λ, see [30].

Table 2 :
Total number of nodes (left) or number of α spiders (middle) left after optimization with different strategies. The right column shows the average run-time to optimize a ZX-diagram with 100-150 start spiders.