Quantum Computation with Machine-Learning-Controlled Quantum Stuff

We describe how one may go about performing quantum computation with arbitrary "quantum stuff", as long as it has some basic physical properties. Imagine a long strip of stuff, equipped with regularly spaced wires to provide input settings and to read off outcomes. After showing how the corresponding map from settings to outcomes can be construed as a quantum circuit, we provide a machine learning algorithm to tomographically "learn" which settings implement the members of a universal gate set. At optimum, arbitrary quantum gates, and thus arbitrary quantum programs, can be implemented using the stuff.


I. INTRODUCTION
Imagine we have some "stuff" with quantum properties. Can we use it as a quantum computer? Join us in picturing a long strip of Plasticine-like stuff, whose unknown properties are accessible only via regularly spaced setting and outcome wires. By appropriate choices of settings, followed by correct interpretations of outcomes, is it possible to implement quantum computation? If the stuff secretly contains matter known to be capable of quantum computation (say, a row of trapped ions) it is obviously possible. Suppose, for example, that the setting wires were to manipulate the local magnetic field and the timing of measurements, whose results were transmitted by the outcome wires. Any given quantum program could then be implemented, given knowledge of the details. Normally, one would first decide upon a correspondence between physical and logical operations, and then engineer a computer to respect it. Here, we discuss the reverse task: mapping arbitrary quantum computation onto the fixed physics of initially uncharacterized matter, i.e. stuff. Our iterative approach to this problem begins with a hypothesized map between physical and logical operations. For example, we might guess that feeding first 2 and then 3 to the first input wire implements the quantum CNOT gate. This guess is likely to be wrong, but we will demonstrate that, given a few assumptions about the internal dynamics of the stuff, it is possible to determine just how wrong. We can then converge towards a correct guess by gradient descent, applied to a specially constructed set of neural networks. The essential problem we are concerned with, the determination of the unknown quantum dynamics of a black-box system (in our case, the arbitrary-length strip of stuff), might be viewed as a slight generalization of so-called "quantum process tomography" [1][2][3][4][5][6]. We are aware of two major differences.
First, we assume no ability to interact with the quantum stuff except via the classical settings and outcomes on the wires. In particular, we do not presume advance knowledge of how to prepare or to measure states. This echoes previous work on randomized gate benchmarking [7][8][9][10], self-consistent quantum process tomography [11,12], and especially operational process tomography [13]. Second, we will never explicitly obtain the process matrix of the stuff, but only the ability to map chosen quantum circuits onto it. Presumably, the process matrix would need to be somehow encoded in the weights of any successfully optimized neural networks, but we make no attempt to characterize such an encoding. This situates our work within the rapidly developing field of "quantum machine learning" [14,15], and, more specifically, within the subfield concerned with applying machine learning approaches to quantum tomography [16][17][18][19][20][21]. In Sections II and III to follow, we introduce the essential ideas required to construe operations upon quantum stuff as a user-defined circuit mapped onto spacetime. These ideas are then refined and formulated precisely in Section IV. The iterative process of determining which operations correspond to which circuit, or "bootstrap tomography", is formulated in Section V. Finally, in Section VI, we propose a machine learning algorithm to realize this tomographic process.

II. QUANTUM STUFF AND COMPUTATION
Picture a length-L strip of stuff. Fix two wires at each site x_l = l∆x, l = (0, 1, 2, . . .), one to send in classical settings, and another to read off classical outcomes. Binning the duration of these interactions into discrete time intervals centred at times t_n = n∆t, n = (0, 1, 2, . . .), defines a discrete set of spacetime points, x = (x, t) = (l∆x, n∆t), each labelling a setting-outcome pair. We refer to this structure of setting-outcome pairs at discrete events as the computational lattice.

FIG. 1. On the left we see a section of a length of stuff with input and output wires placed at regular intervals. A time-gated sequence of inputs is fed into the input wires and a similar time-gated sequence of outputs is read off the wires. We represent the input/output at position x and time t by a dot as shown. On the right we see the same figure with the dots divided up into octagons and squares. Points lying on the boundary between octagons are assigned to the upper octagon.

Interaction with the stuff thus defines a map from sequences of settings to probability distributions over sequences of outcomes. To use the stuff as a quantum computer, we need to know the pertinent aspects of that mapping, construed as one between known mathematical objects. This knowledge can be usefully decomposed into that of two functions: an encoder to translate a given program (the "logical input") into physical settings, and a decoder to translate probability distributions over physical outcomes into mathematical objects (the "logical output"). The encoder/decoder pair has functioned correctly if and only if the emitted logical output is, accounting for the probabilistic nature of the physical outcomes, indeed that dictated by the logical input. Suppose, for example, we wish to perform the computation q_out = 2q_in. Then the encoder must map from the logical input "2 × q_in" to some sequence of physical settings. The decoder must then map the corresponding sequence of physical outcomes into the logical output q_out.
The encoder/decoder pair is correct, in this case, when the emitted q out approaches 2q in , having smoothed (e.g. averaged over) probabilistic fluctuations. One might reasonably expect the discovery of a correct encoder/decoder pair to be a fairly daunting task. However, in this case and, as we will see, in general, checking whether a given encoder/decoder pair is correct is quite simple. Furthermore, one can easily construct a smooth measure, or loss function, of how far from correct a given encoder/decoder pair is: the RMS error between q out and 2q in , for example, or some monotonic function of it [22]. A correct encoder/decoder pair will minimize this loss function. Later, we will detail a means to accomplish this functional minimization automatically on a classical computer, via a bespoke neural network parameterization of the encoder and decoder.
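The toy example above can be made concrete with a short sketch. Here the "stuff" is a hypothetical black box that happens to double its setting up to noise; the names and the noise model are illustrative assumptions, not a description of any actual stuff:

```python
import random

def stuff(setting):
    # Hypothetical black box standing in for the strip of stuff:
    # it happens to double its setting, up to probabilistic noise.
    return 2.0 * setting + random.gauss(0.0, 0.1)

def encoder(q_in):
    # Candidate map from the logical input "2 x q_in" to a physical setting.
    return q_in

def decoder(outcome):
    # Candidate map from a physical outcome to the logical output q_out.
    return outcome

def loss(q_in, n_shots=2000):
    # Squared error between the smoothed (shot-averaged) q_out and 2 * q_in.
    avg = sum(decoder(stuff(encoder(q_in))) for _ in range(n_shots)) / n_shots
    return (avg - 2.0 * q_in) ** 2
```

A correct encoder/decoder pair drives this loss toward zero; an incorrect one (say, an encoder that adds an offset) does not.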

III. GATES, CIRCUITS, AND TESSELLATIONS
We would like to implement arbitrary quantum computations, not just multiplication by 2. It is well known that arbitrary quantum computations can be decomposed into elemental quantum gates forming a "universal gate set" (UGS). It would thus be sufficient for our encoder/decoder pair to correctly implement all members of some UGS. Quantum gates, however, map between quantum states, not classical information. By assumption, we do not have direct access to the internal quantum dynamics of the stuff. We will thus instead concern ourselves with quantum circuits: wirings together of quantum gates that are entirely characterized by classical settings and outcomes. Any given quantum computation may be expressed as a quantum circuit. Doing so further refines the formulation of our task. The logical input thus becomes the classical settings and gate labels defining a particular quantum circuit. The logical output is that circuit's theoretical outcome distribution. The encoder/decoder pair has functioned correctly if the emitted outcome distribution matches the theoretical one. The loss function can be any of various distance metrics between those distributions. Later we will demonstrate that, given a few physical assumptions about the stuff, there exist "tomographically complete" sets of quantum circuits. If an encoder/decoder pair, acting upon stuff with the assumed properties, correctly implements all the circuits in a tomographically complete circuit set, it correctly implements all the gates in a UGS. The combination of encoder, stuff, and decoder in that case forms a universal computer. The degree to which it fails to do so can be taken, most primitively, as the summed losses of all the circuits in the set. We want not only to implement arbitrary computations with the stuff, but also to bound the number of setting/outcome retrievals required to do so.
To achieve this, we will require the encoder's implementation of each individual gate to occupy only a finite volume of spacetime. Specifically, we group together events in the computational lattice x = (l∆x, n∆t), depicted in the left panel of Figure 1, into a tessellation of octagons and squares, depicted in the right panel of Figure 1. Each octagon, or tessel, will be used to implement one of a universal set of two-qubit gates. The qubits for these gates will, under this implementation, be input at the lower slanted edges, and output at the upper slanted edges. Octagons are convenient because they adjoin to form causal-diamond-like structures. The squares appear by geometric necessity and will implement a fixed "do nothing" gate, different from the identity. We will eventually view the encoder and decoder as neural networks to be trained by a machine learning algorithm. One failure mode of this algorithm will occur if the tessellation encloses too few events per gate to implement the UGS properly. In that case, we can uniformly scale up the tessellation and try again. We label each octagon by its midpoint, x. Let X_octagons be the set of octagon positions we consider in some given tessellation over the length L and some time duration, T.
The two-qubit gates situated on the octagons will form a regular lattice which we will call the gate lattice. A universal gate set (UGS) must include enough gates to do universal quantum computation with respect to the gate lattice. An example of a universal set of gates, complete with respect to this lattice, is given in Table I. Each of the gates in the table is a two-qubit gate. The first subset of gates are the usual gates included in a UGS, adapted for the lattice. Because we assume control over the stuff only through the classical settings, we need to include two more subsets of gates not usually mentioned as part of a UGS. The second subset are the identity and swap gates, which allow us to transport qubits around the circuit. The third subset of gates are the preparation and measurement gates. These gates are non-unitary. The first gate in subset 3 performs the identity measurement on the incoming qubits and then prepares two qubits, each in the 0 state. The second gate in subset 3 projects the left qubit onto the 0 basis while leaving the right qubit unchanged. We have similar notation for the other gates. Unlike the other gates in the table, these measurement gates have outcomes associated with them. It is necessary to include the second and third subsets of gates in our UGS because we need to train our stuff to implement them the same as any other gate. They do not come for free. We can connect gates together to form what we will call a fragment (denoted by F). Whenever we build a fragment, there must be some gates having one or two open inputs. We will call such gates initial gates.

FIG. 2. On the left we see an example of a circuit built with gates from our universal gate set. Note that the circuit is closed off from external influences because, at the bottom, input signals are absorbed by the identity measurement and, at the sides, quantum information coming into the circuit is shunted back out. On the right we see how gates can be assigned to octagons.

A circuit, C, is a special case of a fragment for which all initial gates are of the type that ignore any open inputs. The gates in the table provide two ways to do this. First, the preparation gate in subset 3 simply absorbs incoming quantum systems so they do not affect probabilities for the circuit. Second, any gates that include an identity map can be used to shunt quantum information coming through an open input back out through an open output so probabilities for the circuit are not affected. The circuit in Figure 2 contains examples of both types. We can calculate a probability for a circuit using the rules of Quantum Theory. A fragment that is not a circuit is subject to outside influences and so will not necessarily have a probability associated with it. We will choose some particular UGS, call it G, to proceed. It does not have to be the one described here, and there might be UGSs more suited to this project. However, the gate set must have elements that enable us to close off circuits in the manner just described. There must also be some gates with outcomes, so we can read off the results of the computation.
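The fragment/circuit distinction can be sketched in a few lines of Python. The gate labels and the lattice adjacency below are illustrative assumptions (the real adjacency is fixed by the octagon tessellation), but the closure test mirrors the definition in the text:

```python
# Hypothetical gate labels; "PREP00", "ID", and "SWAP" stand in for the
# subset-2 and subset-3 gates that absorb or shunt any open inputs.
IGNORES_OPEN_INPUTS = {"PREP00", "ID", "SWAP"}

def feeders(pos):
    # Illustrative adjacency on the octagon gate lattice: a gate at (x, t)
    # receives its inputs from the two octagons diagonally below it.
    x, t = pos
    return {(x - 1, t - 1), (x + 1, t - 1)}

def initial_gates(fragment):
    # A fragment is a dict {octagon_position: gate_label}; a gate is
    # "initial" if no other gate in the fragment feeds its inputs.
    return {pos: g for pos, g in fragment.items()
            if not (feeders(pos) & set(fragment))}

def is_circuit(fragment):
    # A circuit is a fragment all of whose initial gates ignore open inputs.
    return all(g in IGNORES_OPEN_INPUTS
               for g in initial_gates(fragment).values())
```

For example, a lone CNOT is a fragment but not a circuit, while a preparation gate feeding a CNOT is closed off from below.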

IV. BUILDING CIRCUITS
In this section, we describe how sequences of operations upon the stuff may be arranged in such a way as to permit comparison with a given quantum circuit. Thus, within each octagon of the tessellation, we will encode (by inputting an appropriate sequence of signals into the setting wires attached to the stuff) and decode (by selecting on an appropriate sequence of output signals from the outcome wires attached to the stuff) in an attempt to implement a putative element of the UGS, G.
Initially we do not know what the appropriate encoding and decoding are. Thus we start with some initial choice and then, through the machine learning algorithm, train until we settle on encodings and decodings that minimize the loss function. Let the putative encoding (decoding) for gate g ∈ G at octagon x during the nth training step be E^n[x, g] (D^n[x, g]). We are, then, admitting the possibility that the same gate might require different encodings and decodings in different regions of spacetime. One could imagine eventually adjusting the machine learning algorithm to be introduced to exploit any assumed homogeneity, but we will not explore this issue further. At the nth step we will have a particular encoding and decoding scheme,

Y^n = {(E^n[x, g], D^n[x, g]) : x ∈ X_octagons, g ∈ G}.

This specifies an encoding for every element of G for every octagon. We will attempt to implement various quantum circuits using this encoding. Then, using the machine learning algorithm trained upon the resulting empirical information, we will iterate the encoding and decoding scheme, obtaining a new one to be used during the (n + 1)th step. We consider a fragment, F, made from the gates, g, in our UGS, with locations assigned to the positions of some of the octagons in some octagon-square tessellation. The fragment F is specified by the set of pairs {(g, x) : x ∈ X}, where X is the set of positions of octagons in the tessellation at which a gate is placed. A circuit, C, is a special case of a fragment in which every initial gate is of the type that ignores any open inputs into it.
The attempted implementation of circuit C during the nth step, Y^n[C], is now given by applying the encodings and decodings of Y^n, gate by gate, in the octagons of C. Actually, this is not entirely sufficient, since we must also specify what happens operationally in regions of spacetime outside the given octagons (tessels).
We will return to this point at the end of this section. Each gate g in C maps quantum information to classical outcomes with various probabilities. The full circuit, however, fixes this internal quantum information (by including state preparation and measurement "gates", for example), and is thus characterized by a single probability p_C for a specific set of classical outcomes to be observed. Quantum mechanics can be formulated [23] as an assignment of a p_C to every possible circuit C.
In the next section, we will show that under certain assumptions, the converse is also true. That is, if each circuit within a tomographically complete circuit set indeed occurs with its predicted p C , the underlying operational map is indeed that specified by the relevant sequence of gates.
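For concreteness, the assignment of a p_C by quantum theory can be sketched for a tiny two-qubit circuit. The gate choices here are illustrative (they are not the UGS of Table I): prepare |00⟩, apply a Hadamard and a CNOT, and ask for the outcome (1, 1):

```python
import numpy as np

# Single-qubit building blocks.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

def p_circuit():
    # p_C for: prepare |00>, apply H on the left qubit, then CNOT,
    # then measure both qubits, asking for the outcome (1, 1).
    psi = np.zeros(4); psi[0] = 1.0          # preparation "gate": |00>
    psi = CNOT @ np.kron(H, I2) @ psi        # unitary gates
    proj = np.zeros(4); proj[3] = 1.0        # measurement outcome |11>
    return abs(proj @ psi) ** 2
```

The circuit produces the Bell state (|00⟩ + |11⟩)/√2, so this particular p_C is 1/2.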
In order to certify that the computation within the stuff is indeed quantum mechanical, we thus seek encoder/decoder pairs that perform this same assignment via the stuff. That is, the theoretical map, from circuit settings and gate labels C to outcome probabilities p_C, is also the operational one, from the tessellation of C and a given Y^n[C] to the observed outcome probability p^n_C. We can then define one of several loss functions measuring how closely Y^n[C] indeed implements some C,

L[Y^n[C], C] = err(p^n_C, p_C),

where err(x, y) is some convenient positive function with a global minimum at x = y, for example err = |x² − y²| [22]. Now suppose we have fixed a set S of circuits that we wish to simultaneously implement, for example because they form a tomographically complete circuit set. A loss function over the full set can then be written as

L[Y^n, S] = Σ_{C ∈ S} L[Y^n[C], C].    (4)

Since the internal quantum information of the stuff might depend upon occurrences outside the spacetime region we have identified with our circuit, for this to work in practice we must also specify what happens at the spacetime points, X̄, that are not in the octagons labelled by x ∈ X. This includes those in the squares and those that occur elsewhere.
One strategy out of potentially many is to choose a "null" encoding E 0 at each "external" spacetime point, and to simply ignore signals on the outcome wires there, so that we need not specify any decoding. The null encoding might also be iterated ("trained") in order to, for example, appropriately "zero out" quantum information in the external regions; in that case it would be denoted E n 0 . Similar considerations apply to the squares. We do not need to concern ourselves with the encoding for spacetime points in the future of the circuit, C, via the causal assumption that influences cannot travel backwards in time.
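The per-circuit and summed loss functions just defined might be sketched as follows; the choice of err and the dictionary representation of the probability assignments are illustrative:

```python
def err(x, y):
    # A convenient positive function minimized (over probabilities) at x == y.
    return abs(x ** 2 - y ** 2)

def circuit_loss(p_observed, p_ideal):
    # Loss for one circuit: the empirical p^n_C against the theoretical p_C.
    return err(p_observed, p_ideal)

def set_loss(observed, ideal, circuits):
    # Loss function (4): the per-circuit losses summed over a set S of
    # circuits, here represented by dicts keyed by circuit label.
    return sum(circuit_loss(observed[c], ideal[c]) for c in circuits)
```

A correct encoder/decoder pair drives set_loss to its global minimum of zero over the chosen set.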

V. BOOTSTRAP TOMOGRAPHY
Now let us discuss the construction of so-called "tomographically complete circuit sets". If the loss function, L[Y^n, S], is minimized when summed over a big enough set of test circuits, we would like it to be the case that any circuit gives the correct probabilities (to within some small error). In this section we state a theorem (proven in Appendix 1): if the empirical probabilities, p^n_C, are exactly equal to the ideal probabilities, p_C (so the loss function (4) is minimized), for a certain set of circuits, S_tom, then this is true for all circuits. The circuits in S_tom have the property that they are bounded in size. We conjecture that a robust version of this theorem also holds, namely that the loss function (4) over S_tom bounds the loss function for all circuits. We will need the important notion of a bounded fragment (or circuit). This is one that fits inside a box of some constant size, ∆L and ∆T, where this box size does not increase as we increase L and T. It is shown in Appendix 1 that we can associate a vector, r^n_X[F], with any fragment in region X. This vector linearly relates the given fragment to a tomographically complete set of fragments for the given region. In the case of Quantum Theory, the vector r_X[F] is linearly related to the superoperator associated with the fragment. The vectors, r^n_X[F], are used to calculate the probabilities for circuits. We can determine the vectors, r_X[F], by doing tomography on a set of circuits F ∪ F̄ for different F̄. We say we have fragment tomography boundedness if we can do tomography on a bounded set of fragments (pertaining to X) by means of a bounded set of circuits. In Appendix 1 we define a composition tomograph, Λ_X^{X₁,X₂,...}, which tells us how to combine r vectors pertaining to non-overlapping (though possibly adjacent) regions X₁, X₂, . . . to obtain the tomographic information pertaining to the region X = X₁ ∪ X₂ ∪ . . . .
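As a cartoon of this composition, r vectors and a composition tomograph can be represented as arrays; the dimensions and tensors below are arbitrary stand-ins, not the causaloid objects of Appendix 1, and serve only to illustrate the bilinear bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative r vectors for two non-overlapping regions X1 and X2
# (their true dimensions depend on the fragment sets; 4 is arbitrary).
r1 = rng.normal(size=4)
r2 = rng.normal(size=4)

# An illustrative composition tomograph for X = X1 u X2: a tensor that
# bilinearly combines the r vectors of the parts into the r vector of
# the whole, r_X[a] = sum_{b,c} Lam[a, b, c] * r1[b] * r2[c].
Lam = rng.normal(size=(4, 4, 4))

r_composite = np.einsum("abc,b,c->a", Lam, r1, r2)
```

Larger regions would be built up by repeating this contraction, so every circuit probability reduces to calculations on bounded parts.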
We say we have composition tomography boundedness if the composition tomograph for a composite region formed from any number of regions can be determined from the composition tomographs for composites that fit inside bounded boxes. In other words, we can do the calculation for any circuit from calculations pertaining to smaller bounded parts of that circuit. In Appendix 1 the following theorem is proven.
Theorem 1. If we have fragment tomography boundedness and composition tomography boundedness, then there exists a bounded set of circuits S_tom such that if, at iteration ñ, we have p^ñ_C = p_C ∀ C ∈ S_tom, then p^ñ_C = p_C for any C ∈ S_circuits. S_tom is then called a "tomographically complete circuit set". We conjecture (but do not prove) that a robust version of this theorem holds.
Conjecture 1. If we have fragment tomography boundedness and composition tomography boundedness, then there exists a bounded set of circuits S_tom such that if, at iteration ñ, we have |p^ñ_C − p_C| ≤ ε ∀ C ∈ S_tom, then |p^ñ_C − p_C| ≤ B N_C ε for any C ∈ S_circuits, where B is a constant and N_C the number of gates in C. The motivation for this conjecture is that the tomography process will fix the parameters in the gates to some error. If we enlarge the set, S_tom, we can expect to get a better bound on the error since we then collect more information. The causaloid framework [24][25][26][27] is used to prove Theorem 1. This framework was originally developed for modelling indefinite causal structure in the context of Quantum Gravity. These results mean that we can rely on the quantum stuff to implement an arbitrary circuit as long as the measured probabilities for circuits in S_tom are close enough to the ideal probabilities calculated from Quantum Theory. This is good because it would not be practical to measure the probabilities for all C ∈ S_circuits, since the size of this set grows very rapidly with L and T. For the two tomographic boundedness properties to hold requires in each case (i) that a mathematical prerequisite holds and (ii) that the physics of the stuff accords. The mathematical prerequisite is that the properties hold for ideal circuits constructed on the given lattice from the given universal gate set. To check this requires a mathematical calculation. Since we have a universal gate set, it is immediately clear that this mathematical prerequisite holds for fragment tomography boundedness (as we can use the UGS to construct a tomographically complete set of fragments that are bounded). We conjecture in Appendix 1 that the mathematical prerequisite holds for composition tomography boundedness for any UGS. If the mathematical prerequisites hold, then we can consider whether the physical properties accord. This will, most likely, be settled through working with the stuff.
However, we can always cook up situations in which the stuff fails to have the properties. For example, the properties will fail if the stuff has hidden signalling between far-separated locations. The bit of stuff at x might, say, send a radio-frequency signal to the bit of stuff at some x′ in the deep future, outside any bounding box, where this radio signal cannot be tomographically probed by circuits that live in a bounding box. While it would be satisfying to prove Conjecture 1 mathematically, it is possibly more useful to test it empirically. The conjecture does, in any case, necessarily involve assuming the boundedness properties, which are themselves in need of empirical investigation. We test the conjecture empirically by determining the extent to which minimizing the loss function on various bounded sets of circuits allows us to reproduce the probabilities for sets of larger circuits.

VI. RANDOM CIRCUIT SAMPLING
In the preceding sections we have illustrated that minimization of the loss function (4) by an implementation Y^n[C], acting upon the stuff with respect to a tomographically complete set of quantum circuits S_tom, certifies that each individual gate in a UGS has also been correctly implemented by Y^n[C]. The loss function (4) is inconvenient to use directly, however, because a) computing a probability for every circuit in S_tom for every training iteration is likely to be expensive, and b) the differing contributions to the overall gradients by the (possibly many) terms in the sum over S_tom are expected to confound gradient descent optimization. We will instead operate upon randomly constructed circuits, with each training iteration acting upon either a single example or perhaps a small minibatch of them. Random circuits can be constructed in various ways. For example, we could start by randomly assigning gates from the UGS to a few positions, resulting in a fragment. Next, we can identify the locations of open wires, and close the circuit by assigning random preparation and measurement gates to each. Alternatively, one could begin with the preparation gates, randomly add gates from the UGS capable of receiving inputs from those already present, and at some also-random point close the construction with a measurement gate, repeating the process from scratch if it has failed to deliver a circuit. Any such random construction will be controlled by some parameters which dictate the average size of the circuits. Consider a set (minibatch), S_rand, of circuits generated randomly by some such technique. We make the following random sampling assumption.
Assumption (Random circuit sampling). There exists a bounded set of circuits S_rand such that if, at iteration ñ, we have |p^ñ_C − p_C| ≤ ε ∀ C ∈ S_rand, then |p^ñ_C − p_C| ≤ D N_C ε for any C ∈ S_circuits, where D is a constant and N_C the number of gates in C. For this assumption to hold, we must have a machine learning algorithm that does not know what random set of circuits is going to be chosen from one iteration to the next. In other words, our procedure must indeed be "random" in the sense that the algorithm never learns to predict its output in advance. The practical loss function is thus

L[Y^n] = Σ_{S_rand} Σ_{C ∈ S_rand} L[Y^n[C], C],    (5)

where each S_rand is a randomly constructed subset of S_tom. Since S_rand would typically contain only one or a few elements, the interior sum is much simpler than that of (4). In addition, by virtue of the random circuit sampling assumption, the exterior sum can be treated by individually optimizing each of its summands. In other words, repeatedly optimizing L[Y^n[C], C] with respect to randomly constructed C from S_tom also optimizes (5) and, by assumption, (4).
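The first construction strategy described above (randomly assigning gates, then closing open wires) might be sketched as follows. The gate labels and the lattice adjacency are hypothetical stand-ins:

```python
import random

UNITARY = ["CNOT", "H_I", "SWAP", "ID"]   # illustrative gate labels
PREP = "PREP00"                           # absorbs (ignores) open inputs

def random_fragment(width, depth, fill=0.5, seed=None):
    # Randomly assign gates from the UGS to a few octagon positions.
    rnd = random.Random(seed)
    return {(x, t): rnd.choice(UNITARY)
            for x in range(width) for t in range(1, depth)
            if rnd.random() < fill}

def close_fragment(fragment):
    # Close every open lower wire with a preparation gate; inputs to a
    # gate at (x, t) are taken to come from (x - 1, t - 1) and
    # (x + 1, t - 1) (a hypothetical adjacency), so every initial gate
    # of the result ignores its open inputs, i.e. the result is a circuit.
    circuit = dict(fragment)
    for (x, t) in list(fragment):
        for pos in [(x - 1, t - 1), (x + 1, t - 1)]:
            if pos not in circuit:
                circuit[pos] = PREP
    return circuit
```

Parameters such as width, depth, and fill dictate the average size of the generated circuits.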

A. Neural network implementation of encoder/decoder
We have repeatedly alluded to our intention to view the encoder/decoder pair as neural networks to be trained by a machine learning algorithm. In this section we will elaborate upon this process. Recall that in the preceding sections we have formulated the problem of translating arbitrary quantum programs into operations upon the stuff as bootstrap tomography: the functional minimization of the loss function (4) (in practice (5)) obtained by comparing the observed and predicted outcome distributions with respect to the encoder/decoder pair at training iteration n, E^n[x, g] and D^n[x, g]. Here g = g(t, x) is the gate label assigned to the point x = (t, x) by the tessellation of the given circuit. The problem of automatically varying a function to optimize a loss function is the central concern of machine learning. The subfield of deep learning [28,29] has become explosively popular in the past decade or so, as sufficient computational resources have become available to produce nearly magical (and highly profitable) results in the most impressive cases. Deep learning becomes especially useful relative to other approaches when the function to be optimized represents, or depends upon, a complicated probability distribution, as one might expect the encoder/decoder pair to.

FIG. 3. The user decides on a program (circuit), which via the tessellation assigns a gate label to each spacetime event. Each spatial point is assigned an encoder, mapping these gate labels to inputs to the stuff. Output from the stuff is then processed by the decoder into simulated logical output from the program. The decoder receives the gate label g and the central spatial point of the tessel x in addition to its depicted input. Training: by choosing circuits from a tomographically complete circuit set, the comparison between predicted and actual output from each can be used as a loss function for the encoder and decoder, such that all circuits are correctly implemented at optimum. This is achieved by representing the encoder and decoder as neural networks, and descending their weights towards this optimum, using randomly constructed circuits as input.
A deep learning optimization, or "training", begins by representing the target function as a layered composition of simple parameterized maps and smooth activation functions, called a neural network. Different network structures are defined by different arrangements of activation functions, with "deep" learning being somewhat vaguely defined by its focus upon network structures formed of "many" successive layers. The activation functions themselves are parameterized by their weights θ, so that different functions decompose into a given network structure by different choices of weights. Every function theoretically has some neural network representation [29]. While this in itself is not especially impressive, the layered structure and smoothness of a neural network permit functional derivatives with respect to the full network to be expressed as sums of partial derivatives with respect to the weights, essentially via the chain rule. The gradient of the full function with respect to some loss can then be efficiently descended by descending those of the individual activation functions, a process known as backpropagation [29][30][31]. Precisely how backpropagation is best applied in a given situation is somewhat problem-dependent. We will save detailed consideration of this matter for future studies, when we implement bootstrap tomography on a small, classically simulated spin chain. For now we will instead treat the problem at the more abstract level depicted in Figure 3 and schematized in Algorithm 1.

Algorithm 1 Schematized optimization of the encoders and decoder.
1: procedure Train
2:   for number of training iterations do
3:     Circuit ← a randomly generated circuit.
4:     Feed the settings emitted by the encoders for Circuit to the stuff.
5:     Decode the resulting outcomes into an empirical probability.
6:     Loss ← the error between the empirical and predicted probabilities.
7:     Descend the gradients of θ^E_x and θ^D to minimize Loss, as dictated by the chosen optimization strategy.
8:   end for
9: end procedure

We will denote the neural network representation of the decoder as D[θ^D; x, g], so that

D^n[x, g] = D[θ^D_n; x, g].

The network representation D itself is fixed for all n.
Training updates are instead implemented by varying the weights θ^D between training iterations; they are otherwise fixed.
The encoder E^n[x, g] controls the internal quantum information of the stuff, and thus, unlike the decoder, must interact with it in real time. It nevertheless needs to share information within tessels in order to track that quantum information's flow. This point will be elaborated upon in the upcoming Subsection B. For now we briefly note that we handle this problem by representing E^n[x, g] with a "fleet" of "recurrent" neural networks, one at each spatial point x. We denote the encoder weights at some x as θ^E_x, and the full set of weights over all spatial points as {θ^E_x}. Thus

E^n[x, g] = E[θ^E_{x,n}, M; x, g].

The network representation E is again fixed for all training iterations n, with variation between iterations instead implemented by manipulations of the weights θ^E_x. The output of the network additionally depends upon a memory vector M. This vector is passed between the different encoders within a tessel, allowing encoders following different constant-x "worldlines" to communicate. We will elaborate upon this point in Subsection B. Let us now follow the logic of Figure 3 in words. The user selects a program, which is mapped by the tessellation into a field of gate labels g(t, x). Each gate label at x is passed along with the appropriate memory vector M to the encoder at that same x, E[θ^E_x, M; x, g]. This yields a raw input signal, to be sent to the setting wire at x. Once an entire tessel has been implemented, the corresponding "raw" outcome signals from the stuff are passed into the decoder, which maps them into "logical" output. If the weights θ^D and θ^E_x are optimal, as indicated by the loss function (5) reaching its global minimum, the logical output may be interpreted as the correct result of the program. Otherwise, the gradients of the weights in the direction of decreasing (5) can be calculated, as by the "optimizer" in Figure 3, and descended along to obtain new, better optimized weights. In machine learning parlance, this process is called "training" the networks.
The next iteration operates upon a new randomly constructed circuit, and the process is repeated until a desired convergence threshold is reached. Illustrative pseudocode is provided as Algorithm 1.
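The overall training loop can be caricatured in a few lines. The sketch below is a deliberately drastic simplification with our own assumed names throughout: the "stuff" is replaced by a fixed unknown linear map S, the decoder by a single weight matrix theta_D, and each "randomly constructed circuit" by a random logical target. Only the shape of the loop (sample a random instance, compute the loss, descend its gradient, repeat) reflects the procedure in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4

# Stand-in for the stuff: a fixed, unknown linear map from raw
# settings to raw outcomes. (Regenerate until well conditioned so
# that plain stochastic gradient descent converges comfortably.)
S = rng.normal(size=(dim, dim))
while np.linalg.cond(S) > 10:
    S = rng.normal(size=(dim, dim))

# Decoder weights theta_D, varied only between training iterations.
theta_D = np.zeros((dim, dim))
lr = 0.005

for iteration in range(50000):
    # Each iteration operates upon a new randomly constructed
    # "circuit": here just a random logical target.
    target = rng.normal(size=dim)
    raw_outcome = S @ target                # what the stuff emits
    logical_output = theta_D @ raw_outcome  # decoder's interpretation
    err = logical_output - target           # residual of the loss
    grad = np.outer(err, raw_outcome)       # analytic MSE gradient
    theta_D -= lr * grad                    # descend along the gradient

# At optimum, decoding inverts the stuff's map: theta_D @ S is close
# to the identity, i.e. logical output matches the logical target.
```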

B. RNN fleet implementation of encoder
As discussed previously, the encoder E_n[x, g] and decoder D_n[x, g] have differing relationships with the real-time behaviour of the stuff, which suggests a particular network implementation for the encoder that we call an "RNN fleet". We will elaborate upon this point here. Figure 4 depicts several aspects of the behaviour of the encoder/decoder algorithm in spacetime. In the top left panel, we see a single tessel, assigned to a gate labelled g. The top middle panel depicts this same tessel, implemented as operations upon the encoders and thus upon the stuff. As the stuff proceeds through time, the gate label g is fed to the encoder at each (t, x) in the tessel. As we see in the top right panel, the output from the tessel is collected into a vector. Once all of it has been collected, it is sent to the decoder, which produces the logical output of the gate. Since nothing in this procedure depends upon the logical output directly, the precise time at which the decoder D_n[x, g] operates is not especially important, within reason. The encoder E_n[x, g], on the other hand, controls the internal quantum information of the stuff. The order in which its inputs are processed is therefore critical. In addition, it must be synchronized within a tessel.

The problem of processing "time series" with a specific temporal ordering occurs repeatedly in machine learning. The prototypical example is machine translation, since correct translations depend upon prior context. A genre of network structures known as "recurrent neural networks", or RNNs [29,30,32,33], is adapted to remember such context. In addition to their weights, which are fixed except during training, such networks maintain a "memory" vector M. The output of the RNN depends on the memory as well as upon the weights and the input.

FIG. 4. Top Left: each point (t, x) in spacetime is assigned a gate label g by the tesselation. g is constant within each tessel, and takes a uniform "null" value outside of a tessel.
Bottom Left: each spatial point x is assigned a recurrent neural network (RNN) "Encoder", mapping g and a "memory" vector M to the input to the stuff, along with a new M_{n+1} (note that this n parameterizes subsequent RNN calls, not training iterations). The map is governed by "weights" θ^E_x, local to x and held fixed except during training. Top Middle, Top Right: the stuff advances through time, receiving encoded input dictated by the gate labels. Its raw outputs O(t, x) at each point in each tessel are collected into a vector, and then fed along with the gate label g to the decoder. The decoder, another neural network with weights θ^D, emits the simulated "logical output" of the gate. Bottom Middle, Bottom Right: two strategies to allow the encoder RNNs to collaborate over a region of spacetime. Rasterized memory (Bottom Middle) involves passing M_n in left-right order throughout a tessel. Causal memory (Bottom Right) involves passing it forward within each encoder's future light cone, achieving, per the locality assumption, the same end in a shorter timescale.
But unlike the weights, the memory is modified between successive function calls, allowing it to represent contextual information; which contextual information to record is in turn determined by the weights. We thus implement the encoder E_n[x, g] at each separate spatial point x as an RNN, E[θ^E_x, M; x, g], as depicted in the bottom left panel of Figure 4. Each E[θ^E_x, M; x, g] follows a particular constant-x worldline through the tesselation, partly motivating the term "RNN fleet". Additional motivation for the term comes from the need of the various RNNs to be synchronized within a tessel. Thus, the memory vector M_n is not simply passed forward along a worldline, but is instead shared within each tessel between networks of different weights. Consequently, we need to synchronize the processing of several time series at different spatial points.

The bottom middle and bottom right panels depict two strategies for synchronizing within tessels. We call the first and simplest strategy, in the bottom middle panel of Figure 4, rasterized memory. In this paradigm, encoders E[θ^E_x, M; x, g] are called sequentially at each timestep from left to right. Starting from a fixed null value, memory is thus passed along the zig-zagging "spacelike" lines depicted in the bottom middle panel of Figure 4, which resemble the "rasterized" beam path of a cathode ray tube television. Circuit simulation using the rasterized strategy is illustrated by the pseudocode of Algorithm 2. The illusion of motion created by such televisions relies upon the time required for the beam to traverse the screen being much shorter than the processing time of the eye. The ability of the rasterized memory strategy to effectively synchronize encoders within a tessel correspondingly depends upon the processing time of the encoders being much shorter than that between simulation timesteps. The causal memory strategy depicted in the bottom right of Figure 4 relaxes this assumption.
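The rasterized strategy of Algorithm 2 might be sketched as below, under our own simplifying assumptions: the encoder is a toy tanh map, gate labels are scalars, the raw setting is a single scalar per point, and tessel boundaries (at which memory restarts from its null value) are ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, L, mem_dim = 3, 4, 6

# Hypothetical per-point encoder weights theta_E_x: one matrix per x,
# mixing the scalar gate label into the memory.
theta_E = [rng.normal(size=(mem_dim, mem_dim + 1)) for x in range(L)]

def encode(x, g, M):
    """Toy stand-in for E[theta_E_x, M; x, g]: returns a raw setting
    (here just a scalar) and the updated memory vector."""
    M_new = np.tanh(theta_E[x] @ np.concatenate(([g], M)))
    return M_new[0], M_new

gate_labels = rng.integers(0, 2, size=(T, L)).astype(float)

settings = np.zeros((T, L))
M = np.zeros(mem_dim)        # fixed null initial memory
for t in range(T):           # the stuff advances through time...
    for x in range(L):       # ...while memory sweeps left to right
        settings[t, x], M = encode(x, gate_labels[t, x], M)
```

The nested loop traces exactly the zig-zag raster path of Figure 4 (bottom middle): the single memory vector visits every x at timestep t before any encoder at timestep t+1 is called.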
It is based on the assumption that the error incurred by sending encoder input out of sequence falls off with the spacetime distance between the disordered events. Instead of passing memory sideways between every event in the tessel, we thus pass it forward in spacetime within a fixed "light cone" of predetermined width s, unless doing so would cross a tessel boundary. Given locality, this should synchronize just as well as the rasterized strategy, without imposing constraints upon the relative timescales of the encoder and the simulations. Circuit simulation using the causal strategy is illustrated by the pseudocode of Algorithm 3.
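The causal strategy of Algorithm 3 might be sketched as below, again with assumed toy encoders. How memory from the several points inside the light cone is combined is our own choice here (a simple average); tessel boundaries are again ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
T, L, mem_dim, s = 3, 6, 4, 1   # s: light-cone half-width

# Hypothetical per-point encoder weights theta_E_x.
theta_E = [rng.normal(size=(mem_dim, mem_dim + 1)) for x in range(L)]

def encode(x, g, M):
    """Toy stand-in for E[theta_E_x, M; x, g]."""
    M_new = np.tanh(theta_E[x] @ np.concatenate(([g], M)))
    return M_new[0], M_new

gate_labels = rng.integers(0, 2, size=(T, L)).astype(float)
settings = np.zeros((T, L))

# Memory field: one vector per worldline, advanced in lock-step, so
# encoders at the same timestep can run in parallel.
mem = np.zeros((L, mem_dim))
for t in range(T):
    new_mem = np.zeros_like(mem)
    for x in range(L):
        # Gather memory from the past light cone of half-width s
        # (combined by averaging, our own illustrative choice).
        lo, hi = max(0, x - s), min(L, x + s + 1)
        M_in = mem[lo:hi].mean(axis=0)
        settings[t, x], new_mem[x] = encode(x, gate_labels[t, x], M_in)
    mem = new_mem
```

Unlike the rasterized sweep, no encoder call at timestep t waits on another call at the same timestep, which is what removes the constraint on relative timescales.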

VIII. ACKNOWLEDGEMENTS
A. G. M. Lewis is supported by the Tensor Network Initiative at Perimeter Institute. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Research, Innovation and Science.

APPENDIX 1

In this appendix we provide definitions of fragment tomography boundedness and composition tomography boundedness, and we prove Theorem 1. The idea is to consider tomography on fragments. We consider two types of tomography. First, we have fragment tomography, whereby we obtain a mathematical object (r^n_X[F] below) associated with the fragment. Second, we consider composition tomography, whereby we obtain the rule for composing these mathematical objects for composite regions such as X_1 ∪ X_2 ∪ X_3. If the tomography boundedness properties hold, then we need only consider fragment tomography for fragments up to a certain size, and composition tomography for composites up to a certain size. To do tomography on these fragments, the fragments and composite fragments are completed into circuits. Hence it follows that the circuits we need to consider need be no bigger than a certain size. This provides our set, S^tom ⊂ S^circuits, of circuits. If we obtain probabilities p^n_C = p_C for C ∈ S^tom (to within some bounded error), then it follows that p^n_C = p_C (to within some bounded error) for all circuits.

For a circuit C = F ∪ F̄ we can always write the probability as

p^n_{F∪F̄} = r^n_X[F] · p^n_X[F̄]    (8)

where we define the ordered set

p^n_X[F̄] = ( p^n_{F_k∪F̄} : k ∈ Ω_X )    (9)

where k ∈ Ω_X labels the elements, F_k, of some tomographic set, T^tom_X ⊆ T_X, of minimal possible rank. Here T_X is the set of all possible fragments at X. In general, the choice of tomographic set, T^tom_X, is not unique.
We can always write (8) because, in the worst case, we can choose T^tom_X = T_X; then the vector r^n_X[F] is just a list of 0's, except with a 1 at position k. In general, however, there will be some linear relationships between these probabilities, so that we can use a proper subset, T^tom_X ⊂ T_X. We can think of p^n_X[F̄] as the generalized state prepared by F̄ for region X, and of r^n_X[F] as the generalized effect associated with the fragment F performed in region X. The choice of tomographic set, T^tom_X, must be good for calculating the probability for any circuit in any region X ∪ X̄ (that is, for any X̄ associated with a fragment F̄). Using simple linear algebra [34] we can obtain the set {r^n_X[F] : ∀ F ∈ T_X} if we are given enough empirical information in the form of p^n_C for C ∈ S^tom_X. The set S^tom_X has to be big enough to make this possible: namely, it has to generate Ω_X linearly independent p^n_X[F̄] vectors. In this case we will say S^tom_X is tomographically complete for X. We can be sure this is true simply by choosing S^tom_X to be the set of all circuits in S^all having elements at positions in X. This is not very useful, however, as this set grows very rapidly with L and T. To obtain a more useful notion, we define a bounded set of circuits or fragments as one for which each element fits inside a box with bounded spatial and temporal dimensions which do not scale with L and T. Consider the following property.
Fragment tomography boundedness. We say we have fragment tomography boundedness if, for any bounded set of fragments, there exists a bounded and tomographically complete set of circuits. We can motivate the assumption that this property holds by finiteness and locality. First, note that the operationally accessed part of the Hilbert space associated with the inputs and outputs for any F ∈ T should be finite, so we require only a finite number of circuits for tomography. Furthermore, by locality, we should be able to do tomography on this Hilbert space by means of circuits that are not too much bigger than the fragments.

Consider a composite region, X_1 ∪ X_2 (where X_1 and X_2 are disjoint). Then, for the circuit C = F_1 ∪ F_2 ∪ F̄, we can write the probability as

p^n_{F_1∪F_2∪F̄} = r^n_{X_1∪X_2}[F_1 ∪ F_2] · p^n_{X_1∪X_2}[F̄]

where

p^n_{X_1∪X_2}[F̄] = ( p^n_{F_{k_1}∪F_{k_2}∪F̄} : k = (k_1, k_2) ∈ Ω_{X_1∪X_2} )
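The "simple linear algebra" by which the generalized effects r^n_X[F] are recovered from circuit probabilities can be illustrated numerically. Everything below is a toy model with made-up dimensions: we fabricate a full-rank set of generalized states, generate the circuit probabilities via the form of Eq. (8), and invert the linear relation with a pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(4)
omega = 4        # rank of the tomographic set T^tom_X
n_env = 10       # number of complementary fragments F-bar used

# Toy generalized states p^n_X[F-bar_j], one column per F-bar_j;
# random, hence linearly independent rows almost surely, i.e. a
# tomographically complete collection.
P = rng.random(size=(omega, n_env))

# The (unknown) generalized effect r^n_X[F] we want to reconstruct.
r_true = rng.random(size=omega)

# "Empirical" circuit probabilities p^n_{F ∪ F-bar_j}, per Eq. (8).
probs = r_true @ P

# Invert the linear relation: since P has full row rank, the
# pseudoinverse recovers the effect vector exactly.
r_est = probs @ np.linalg.pinv(P)
```

With fewer than omega linearly independent state vectors the system would be underdetermined, which is precisely why S^tom_X must be tomographically complete for X.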