Watch and learn -- a generalized approach for transferrable learning in deep neural networks via physical principles

Transfer learning refers to the use of knowledge gained while solving a machine learning task and applying it to the solution of a closely related problem. Such an approach has enabled scientific breakthroughs in computer vision and natural language processing where the weights learned in state-of-the-art models can be used to initialize models for other tasks which dramatically improve their performance and save computational time. Here we demonstrate an unsupervised learning approach augmented with basic physical principles that achieves fully transferrable learning for problems in statistical physics across different physical regimes. By coupling a sequence model based on a recurrent neural network to an extensive deep neural network, we are able to learn the equilibrium probability distributions and inter-particle interaction models of classical statistical mechanical systems. Our approach, distribution-consistent learning, DCL, is a general strategy that works for a variety of canonical statistical mechanical models (Ising and Potts) as well as disordered (spin-glass) interaction potentials. Using data collected from a single set of observation conditions, DCL successfully extrapolates across all temperatures, thermodynamic phases, and can be applied to different length-scales. This constitutes a fully transferrable physics-based learning in a generalizable approach.


INTRODUCTION
Machine learning has emerged as a powerful tool in the physical sciences, seeing both rapid adoption and experimentation in recent years. Already, there have been demonstrations of learning operators [1], detecting unseen patterns within data [2], "discovering" physical equations [3], and predicting trends within the scientific literature [4]. Within the field of statistical mechanics, machine learning was recently used to estimate the value of the partition function [5], solve canonical models [6], and generative models conditioned on the Boltzmann distribution have been shown to be efficient at sampling statistical mechanical ensembles [7]. The connection between physics and machine learning continues to strengthen, with new results appearing daily.
Despite these and other successes, machine learning has severe limitations. A major obstacle to the more widespread use of machine learning in the physical sciences is that learned models typically exhibit poor transferability. Using a model outside of the training or parameterization set can be unreliable [8]. This, along with the lack of well-defined and well-behaved error estimates, can produce erroneous results, the magnitude of which is often uncontrolled. While transferability can be somewhat improved through techniques such as regularization [9], there is currently no general approach which can guarantee transferability to new conditions or ensure reliability of a model when it is presented with previously unseen data.
Here we demonstrate a new approach which overcomes the transferability problem by coupling machine learning concepts with physical principles in a new way. The approach, which we call distribution-consistent learning (DCL), enables fully transferrable learning with a minimal number of observations: using it, we can collect observations at a single set of conditions, yet make accurate predictions for all others (including in different physical phases and across phase boundaries). Using DCL, it is possible to extrapolate observations over a wide range of conditions, including those far from the training set. This is achieved through the straightforward yet rigorous application of the physical concept of equilibrium and the postulate of the uniformity of physical law.
To explain DCL, we will first focus on simple classical spin models (ferromagnetic Ising and Potts models with coupling constant J = 1) for the purposes of illustration and validation. For such simple cases, the application of statistical methods to "invert" observations is not new [10,11].
Conceptually, DCL is valid when ensemble probabilities are governed by known statistics (e.g. Boltzmann or Fermi distributions, which describe equilibrium and some near-equilibrium processes) and the interaction energies do not depend on the control parameters.
Finally, we demonstrate the generalizability of DCL explicitly by applying it, without modification, to an unsolved inversion problem: a semi-local "spin-glass" where all couplings between neighbours are selected randomly from a Gaussian distribution (µ = 0, σ = 1).

PROBABILITIES AT EQUILIBRIUM
When a system is at equilibrium, its macroscopic properties can be interpreted as weighted averages over the character of a large number of micro-states, $\sigma_i$. For an example such as a polymer, such micro-states may be different molecular conformations, whereas for a spin system they are the different combinations of possible up (down) spin arrangements. Such micro-states are visited with a probability given by the Boltzmann distribution. For the case of the canonical ensemble in particular (constant particle number, $N$, volume, $V$, and temperature, $T$), the ensemble probability of a particular micro-state is given as
$$P(\sigma_i) = \frac{e^{-\beta H(\sigma_i)}}{Z}, \tag{1}$$
where $H$ is the classical Hamiltonian (and the energy, $E_i$, of $\sigma_i$ is given by $H(\sigma_i)$), $\beta$ is the inverse temperature, $1/(k_B T)$, and $Z$ is the partition function.
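To make the Boltzmann distribution above concrete, the following minimal sketch enumerates every micro-state of a very small ferromagnetic Ising lattice and normalizes the Boltzmann weights by brute force. The 3 × 3 periodic lattice, the function names, and the units ($k_B = 1$, $J = 1$) are illustrative assumptions, not details taken from the paper.

```python
import itertools
import math


def ising_energy(spins, L, J=1.0):
    """Nearest-neighbour Ising energy on an L x L periodic lattice.

    `spins` is a flat, row-major sequence of +1/-1 values. Each site
    contributes its bond to the right and its bond downwards, so every
    bond is counted exactly once (for L >= 3).
    """
    energy = 0.0
    for r in range(L):
        for c in range(L):
            s = spins[r * L + c]
            right = spins[r * L + (c + 1) % L]
            down = spins[((r + 1) % L) * L + c]
            energy -= J * s * (right + down)
    return energy


def boltzmann_probabilities(L, beta):
    """Exact P(sigma) = exp(-beta * H(sigma)) / Z by full enumeration."""
    states = list(itertools.product((-1, 1), repeat=L * L))
    weights = [math.exp(-beta * ising_energy(s, L)) for s in states]
    Z = sum(weights)  # partition function
    return {s: w / Z for s, w in zip(states, weights)}


# Assumed illustration: 3 x 3 lattice at beta^-1 = 4 (beta = 0.25).
probs = boltzmann_probabilities(L=3, beta=0.25)
ground = (1,) * 9  # all spins up, one of the two ground states
print(f"P(ground state) = {probs[ground]:.4f}")
```

Enumeration scales as $2^{L^d}$, so this is only feasible for toy systems; it is meant to show what the probabilities are, not how they would be obtained in practice.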
We note that in principle if one were able to observe an equilibrium system long enough, it would be possible to estimate the ensemble probabilities of each micro-state simply by counting: count how often the system visits a particular micro-state and divide by the number of observed events. This information, combined with the observation temperature ($\beta_O$) and Eqn. 1, would be sufficient to determine the energy differences between any two micro-states $\sigma_i$, $\sigma_j$:
$$E_i - E_j = -\frac{1}{\beta_O} \ln \frac{P(\sigma_i)}{P(\sigma_j)}. \tag{2}$$
Of course, for all but trivial state spaces, such an approach is completely impractical. The probability of visiting micro-states above the ground state at finite $T$ is exponentially small in $E$, and the probability of visiting them multiple times within an observation period (which would be necessary in order to collect reliable
FIG. 1. An input example is decomposed into regularly shaped tiles, with each tile consisting of a focus and context region. a) As a pedagogical example, we expand 4 adjacent tiles comprising a generic binary grid. For this case the focus region, $f^2$, is 1 × 1, and the context is unit width ($c = 1$). The tiles are simultaneously passed through identical neural networks with shared weights. The individual outputs are summed, producing an estimate of $\varepsilon$, an extensive quantity. In b), we show an example tile with a focus of $f = 2$, and a context of $c = 2$. The optimal selection of $f$ and $c$ depends on physical constraints determined by the target (learned) function.
invariant models are commonplace. Furthermore, photographs, handwriting, and audio recordings too large to be processed by the deep neural network can be resized, cropped, and segmented without destroying the features necessary to make a prediction [22]; there is no absolute spatial scale upon which the label of interest depends. In a physical measurement or simulation, however, the physical scale of a pixel matters critically. Consider the case of an X-ray diffraction experiment where the interference pattern recorded at the detector depends strongly on the wavelength of scattered light and the physical length scale over which the signal is collected. It is not possible to reconstruct the signal properly unless such parameters are known and are consistent.
In this work, we overcome this limitation, and propose a general method which preserves the extensivity of physical quantities, and also accommodates arbitrary input size. We propose a new organizational structure of a deep neural network which can train and operate on (effectively) arbitrary-sized inputs and length scales while maintaining the physical requirement of extensivity. We call our approach Extensive Deep Neural Networks (EDNN), employing domain decomposition to solve the problem of operator evaluation and learning across length scales. Although domain decomposition techniques have a long history in computer simulation and modelling, here we have taken a new approach, and allow the model itself to identify and self-optimize the overlap of tiles at domain boundaries.
Previous work with deep tensor neural networks has identified the necessity of extensivity [23], but the use of molecule-specific representations limits its uses. Our method is sufficiently general to allow for applications to any system that can be defined in a coordinate system in which extensivity holds. Furthermore, our method results in a model which can be evaluated on arbitrarily large system sizes.

Extensive deep neural networks
The overall objective of an EDNN is the same as that of our previous work based on convolutional deep neural networks: learn the mapping between an input structure and one or more extensive properties, $\varepsilon$ (e.g. total energy, entropy, magnetization, particle number, charge, etc.). To date, our input structures have consisted of continuous, regular, real-space grids. We note that some operators may be more conveniently decomposed in one space than another, depending on their locality length-scale (see below), and reciprocal-space representations may be beneficial.
EDNN differ from our previous work based on standard convolutional deep neural networks as they divide the task into a number of tiles (sub-domains) for training and inference. This subdivision is a user-controlled process based on physical constraints discussed in what follows.
EDNN operate on segments of data which we refer to as tiles. Each tile is itself made up of two regions: the focus and context (Fig. 1b). In our implementation, focus regions are orthorhombic structures of dimension $f_x$, $f_y$, $f_z$. In general, there is no requirement to use tiles

ESSENCE OF THE APPROACH
Rather than attempt to infer probabilities from counting, we instead consider the structure of $\sigma$ itself. It is generally known that interactions between spins can give rise to correlations on a range of scales, and that the emergence of such correlations is temperature dependent. In some sense, these correlations are similar to patterns that emerge in language due to the rules of grammar. We can "unwrap" $\sigma$ into a sequence (Fig. 1) and analyze it using language-based sequential machine learning methods. While many such sequence models exist, recurrent neural networks (RNN) are a particularly powerful deep learning technique which has seen significant use in recent years. Primarily, RNN have been used in natural language processing (e.g. predictive text and translation), as well as for predicting time-series data such as stock markets and weather. More recently, autoregressive RNN have been adapted to spatially structured inputs such as images [12].
We first train an RNN on the micro-states observed at a fixed set of experimental temperature conditions, $\beta_O^{-1}$ (for this study we used $\beta_O^{-1} = 4$), as DCL requires only data collected at a single temperature. Formally, the precise temperature of observation does not matter, so long as the system is approximately ergodic. Here we tested this explicitly by increasing the temperature by a factor of 2. This did not change our results qualitatively.
Our training set consists of observations $\sigma_i$ as well as the thermodynamic conditions under which they were collected (i.e. the observation temperature $\beta_O^{-1}$). Importantly, the Hamiltonian operator and values of the energy are not included in our training data. This is analogous to what occurs experimentally: one does not generally have knowledge of the interaction potential between particles, but can always conduct an experiment at fixed conditions and make observations of which micro-states are explored during such an experiment.
To train the RNN, we provided it with examples of spin micro-states as unwrapped configurations (Fig. 1). We experimented with different "unwrapping" techniques (which we called "snake" and "spiral", respectively); tests confirm our results are not sensitive to this choice, so long as the procedure is used consistently. Our training procedure and network architecture follow standard techniques and are reported as Supplementary Information.
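The text does not define the two unwrapping orders precisely, so the sketch below shows one plausible reading of each: "snake" as a row-by-row traversal that alternates direction, and "spiral" as a clockwise traversal from the outer boundary inwards. Both function names and both traversal conventions are assumptions made for illustration.

```python
def snake_unwrap(grid):
    """Row-by-row traversal, reversing direction on every other row."""
    seq = []
    for r, row in enumerate(grid):
        seq.extend(row if r % 2 == 0 else reversed(row))
    return seq


def spiral_unwrap(grid):
    """Clockwise spiral starting at the top-left corner, moving inwards."""
    rows = [list(row) for row in grid]
    seq = []
    while rows:
        seq.extend(rows.pop(0))  # consume the current top row
        # Rotate what is left counter-clockwise so that the next edge of
        # the spiral becomes the new top row.
        rows = [list(r) for r in zip(*rows)][::-1]
    return seq


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(snake_unwrap(grid))   # [1, 2, 3, 6, 5, 4, 7, 8, 9]
print(spiral_unwrap(grid))  # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```

Whichever order is chosen, the same mapping has to be applied to every training and evaluation configuration, which is the consistency requirement mentioned above.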
Once the RNN has been trained, we can use it to determine a particular $P(\sigma_i)$ by computing a series of conditional probabilities of the unwrapped $\sigma_i$ sequence. To do this, we initialize the RNN with the first spin value, $\sigma_i[1] = s_1$, and record its prediction for $P(s_1)$. Applying this now-initialized RNN model to $s_2$ produces a conditional probability prediction $P(s_2 \mid s_1)$. The total probability $P(\sigma_i)$ is found from the product of many conditional probabilities,
$$P(\sigma_i) = P(s_1) \prod_{k=2}^{L^d} P(s_k \mid s_1, \ldots, s_{k-1}), \tag{3}$$
where $L$ is the linear dimension of the system and $d$ is the dimension.
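The product of conditionals described above can be sketched without any neural network by substituting a hand-written conditional model for the trained RNN. In the sketch below, `toy_conditional_up` is a hypothetical stand-in (a first-order rule that favours repeating the previous spin); only the accumulation of log-probabilities reflects the procedure in the text. Working in log space avoids underflow when $L^d$ is large.

```python
import itertools
import math


def toy_conditional_up(prefix, p_same=0.7):
    """Hypothetical stand-in for the RNN: P(next spin = +1 | prefix)."""
    if not prefix:
        return 0.5  # no information before the first spin
    return p_same if prefix[-1] == 1 else 1.0 - p_same


def sequence_log_prob(seq, conditional_up=toy_conditional_up):
    """log P(sigma) as a sum of log conditional probabilities."""
    log_p = 0.0
    for k, s in enumerate(seq):
        p_up = conditional_up(seq[:k])
        log_p += math.log(p_up if s == 1 else 1.0 - p_up)
    return log_p


# Any autoregressive model of this form is normalized by construction:
# the probabilities of all sequences of a given length sum to one.
total = sum(
    math.exp(sequence_log_prob(s))
    for s in itertools.product((-1, 1), repeat=4)
)
print(f"sum over all length-4 sequences = {total:.6f}")
```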
For a two-state system (Ising), the RNN has only a single output since $P(\mathrm{up}) = 1 - P(\mathrm{down})$. Ising has an input vector of size one (for the spin) and an equivalent output vector. We generalize this for the Potts model, which has an input vector of size $Q$ (either a 1 or a 0 in each element). The output is also a vector of width $Q$.
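A minimal sketch of the size-$Q$ input described above, assuming that Potts states are labelled $0, \ldots, Q-1$ and that exactly one element of each input vector is set to 1 (a one-hot encoding). The function names are illustrative.

```python
def one_hot(state, Q):
    """Encode a Potts state in {0, ..., Q-1} as a length-Q 0/1 vector."""
    if not 0 <= state < Q:
        raise ValueError(f"state {state} is outside 0..{Q - 1}")
    return [1 if q == state else 0 for q in range(Q)]


def encode_sequence(states, Q):
    """Encode an unwrapped configuration as a list of one-hot vectors."""
    return [one_hot(s, Q) for s in states]


print(encode_sequence([0, 2, 1], Q=3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```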
Using Eqn. 3, we can compare the probabilities of any two micro-states, $P(\sigma_i)$, $P(\sigma_j)$. Inserting these into Eqn. 2 (combined with the experimental conditions under which the samples were obtained, $\beta_O$) provides us with an estimate of the difference of internal energy between the micro-states $\sigma_i$, $\sigma_j$. We are able to compute such an estimate for any pair, including configurations which were never visited during the training process. Henceforth, we refer to this process of estimating $E_i$ as the RNN energy model.
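The inversion from probabilities to energy differences follows directly from the Boltzmann form: the partition function cancels in the ratio, leaving $E_i - E_j = -\beta_O^{-1}\,[\ln P(\sigma_i) - \ln P(\sigma_j)]$. The sketch below checks this on a toy set of energy levels whose exact log-probabilities stand in for the RNN output; the level values and names are assumptions for illustration.

```python
import math


def energy_difference(log_p_i, log_p_j, beta_obs):
    """E_i - E_j recovered from two log-probabilities at inverse temperature beta_obs."""
    return -(log_p_i - log_p_j) / beta_obs


beta_obs = 0.25  # beta_O^-1 = 4, the observation temperature used in the text

# Toy energy levels; in DCL these would be unknown to the model.
energies = [-8.0, -4.0, 0.0, 4.0]
log_Z = math.log(sum(math.exp(-beta_obs * e) for e in energies))
log_probs = [-beta_obs * e - log_Z for e in energies]

# Only the log-probabilities and beta_obs are used in the inversion.
recovered = energy_difference(log_probs[3], log_probs[0], beta_obs)
print(f"recovered E_3 - E_0 = {recovered:.6f}")  # true value: 12
```

Only differences are recoverable this way; the absolute energy scale is fixed up to an additive constant, which is all that Metropolis sampling requires.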

USING THE RNN ENERGY MODEL
The top row of Fig. 2 shows the performance of the RNN energy models across a range of energies for both the Ising and Potts models. In both cases, the RNN energy model does an excellent job of estimating the energy difference between any two micro-states. We note that it is exactly this quantity which is needed to perform finite-temperature Metropolis Monte Carlo simulations to predict the thermodynamic properties of a spin system.
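To illustrate why the energy difference is the only quantity Metropolis Monte Carlo needs, the sketch below runs single-spin-flip Metropolis with a pluggable energy function. The exact Ising energy is used here as a stand-in for a learned energy model; the lattice size, sweep counts, and seed are arbitrary assumptions, with $k_B = 1$ and $J = 1$.

```python
import math
import random


def ising_energy(spins, L, J=1.0):
    """Exact nearest-neighbour Ising energy (periodic), used as a stand-in model."""
    energy = 0.0
    for r in range(L):
        for c in range(L):
            s = spins[r * L + c]
            energy -= J * s * (spins[r * L + (c + 1) % L]
                               + spins[((r + 1) % L) * L + c])
    return energy


def mean_abs_magnetization(L, temperature, energy_fn,
                           sweeps=200, burn_in=50, seed=0):
    """Average |m| per spin from single-spin-flip Metropolis sampling."""
    rng = random.Random(seed)
    beta = 1.0 / temperature
    spins = [1] * (L * L)  # start from the ordered state
    energy = energy_fn(spins, L)
    total = 0.0
    for sweep in range(sweeps):
        for _ in range(L * L):
            k = rng.randrange(L * L)
            spins[k] = -spins[k]  # trial move
            delta = energy_fn(spins, L) - energy
            # Accept with probability min(1, exp(-beta * delta)).
            if delta <= 0 or rng.random() < math.exp(-beta * delta):
                energy += delta
            else:
                spins[k] = -spins[k]  # reject: undo the flip
        if sweep >= burn_in:
            total += abs(sum(spins)) / (L * L)
    return total / (sweeps - burn_in)


cold = mean_abs_magnetization(4, temperature=1.0, energy_fn=ising_energy)
hot = mean_abs_magnetization(4, temperature=10.0, energy_fn=ising_energy)
print(f"<|m|> at T=1: {cold:.3f}, at T=10: {hot:.3f}")
```

Because `energy_fn` is re-evaluated in full for every trial move, this loop also mirrors the cost structure of a model that cannot perform local updates.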
Using these models, we carry out such numerical simulations under conditions which are far away from the training set. When we compare results for the phase transitions generated using the true Hamiltonian with those generated with the RNN energy models, we see excellent agreement for both Ising and Potts models (Fig. 2, bottom row). This confirms that the RNN model errors in the prediction of $\Delta E_{ij}$ are small enough that they do not alter the essential physics of systems under study.
FIG. 2. The black diagonal line is a guide to the eye. Using these models of the energy with Metropolis Monte Carlo, it is possible to predict properties over a wide range of temperatures (bottom row, b and d). We note that RNN energy models are trained only on samples which are recorded at $T_O$ (denoted as a vertical dashed line), yet can be used to make predictions across a wide range of physical conditions, including through a phase transition. The red shaded area indicates the variance observed in our numerical simulations.
Since the RNN already allows us to estimate ensemble probabilities (and thus energy labels from observation), we might think that nothing more can be extracted from our initial observations. It turns out, however, we can incorporate more physics knowledge through the use of a second neural network. We will demonstrate that incorporating this second network will both overcome some limitations of the RNN energy model and improve the accuracy of our predictions without any new observations or labels.
One of the most obvious disadvantages of the RNN energy model is that it has no concept of locality with respect to energy updates. Whenever we wish to know the difference in energy between $\sigma_i$ and $\sigma_j$, we must reevaluate all of the conditional probabilities which make up $P(\sigma_i)$ and $P(\sigma_j)$. For interactions which are short ranged (as is the case for both the Ising and Potts models), this is very inefficient. In conventional simulations of such systems, it is customary to reevaluate only interactions which have changed as a result of the MC trial move.
With the RNN energy model, there is no general way to achieve such "local" energy updates. This is because for spin flips at some locations in the lattice one may need to recompute only some conditionals, but in general one has to recompute an extensive number of them. Another limitation is that the RNN energy model is only able to make predictions for system sizes which are the same as those within the training set. Ideally, one would like to be able to observe an $L \times L$ system, learn something from it, and then make predictions for a larger $M \times M$ case.
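For comparison, the conventional local update for a nearest-neighbour Ising model with a known Hamiltonian touches only the four bonds attached to the flipped spin, giving $\Delta E = 2 J s_k \sum_{n} s_n$ over the neighbours $n$ of site $k$. The sketch below checks this against a full recomputation; the lattice size and names are illustrative assumptions.

```python
import random


def ising_energy(spins, L, J=1.0):
    """Full nearest-neighbour Ising energy on an L x L periodic lattice."""
    energy = 0.0
    for r in range(L):
        for c in range(L):
            s = spins[r * L + c]
            energy -= J * s * (spins[r * L + (c + 1) % L]
                               + spins[((r + 1) % L) * L + c])
    return energy


def local_flip_delta(spins, L, k, J=1.0):
    """Energy change from flipping spin k, using only its four neighbours."""
    r, c = divmod(k, L)
    neighbours = (spins[r * L + (c + 1) % L] + spins[r * L + (c - 1) % L]
                  + spins[((r + 1) % L) * L + c] + spins[((r - 1) % L) * L + c])
    return 2.0 * J * spins[k] * neighbours


rng = random.Random(1)
L = 4
spins = [rng.choice((-1, 1)) for _ in range(L * L)]

k = 5
flipped = list(spins)
flipped[k] = -flipped[k]
full = ising_energy(flipped, L) - ising_energy(spins, L)  # O(L^2) work
local = local_flip_delta(spins, L, k)                     # O(1) work
print(full, local)
```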

THE EDNN ENERGY MODEL
In previous work [13], we showed that with the proper construction, neural networks have the ability to directly learn the locality length-scale, $l$, of operators such as the Hamiltonian. By locality length-scale, we mean the amount of information in the neighbourhood of a focus region, $f$, which is necessary to compute extensive properties. Magnetization, for example, is a fully local ($l = 0$) operator. The task can be divided: compute the magnetization of every site in the system, record each value, and sum them all at the end (since magnetization is an extensive quantity). For a nearest-neighbour spin model, additional context, $c$, is needed in order to determine site energies, i.e. the values of the spins in the neighbourhood. Using an extensive deep neural network (EDNN), these locality scales can be learned directly from the data through hyperparameter optimization of $c$.
Initially, our motivation for using an EDNN in this investigation was to be able to learn from small-scale systems (an 8 × 8 spin model in this case) and apply that learning to a larger system (e.g. 16 × 16). Training an EDNN on labels produced by the RNN energy model yields an EDNN energy model. As expected, the EDNN is able to take the small-scale examples and transfer that learning to larger systems (Fig. 3 shows the performance of the EDNN energy model). Creating an EDNN energy model had another, unforeseen benefit, however. By construction, EDNN topologies require that physical laws are the same everywhere. They are designed to learn a function which, when applied across a configuration, maps the sum of outputs to an extensive quantity such as the internal energy. Interestingly, in this case, we find that this physics-based network design requirement results in improved performance in predicting the underlying interactions even when the labels used in training have noise introduced by the imperfect RNN energy model.
As discussed above, we first train an RNN to predict energy differences $\Delta E$. In the second step, we train an EDNN to reproduce predictions from the RNN. The EDNN has never seen labels other than those estimated by the RNN. Despite this, when we compare EDNN predictions with the true values, the EDNN performs better than the RNN itself. We suspect that the physics-based construction of the EDNN enables it to see through noise introduced by the imperfect RNN and achieve a better estimate of the underlying operator (the root mean squared error, RMSE, of the RNN energy model for Ising is 2.06J compared with 1.47J for the EDNN, Fig. 4). EDNN build the postulate of uniformity of physical law into our training procedure; our performance improves as a result.
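The role of focus and context can be sketched with hand-written per-tile functions standing in for the shared neural network. With $f = 1$ and $c = 0$ each tile sees one site, which is enough for magnetization; with $f = 1$ and $c = 1$ each tile also sees the four neighbours, which is enough for a nearest-neighbour energy. Periodic boundaries, the half-bond bookkeeping in `tile_energy`, and all names are assumptions for illustration.

```python
import random


def extract_tiles(grid, focus, context):
    """Cut a periodic L x L grid into tiles of side focus + 2 * context.

    Focus regions do not overlap; context regions do. L must be a
    multiple of `focus`.
    """
    L = len(grid)
    span = range(-context, focus + context)
    return [
        [[grid[(r0 + dr) % L][(c0 + dc) % L] for dc in span] for dr in span]
        for r0 in range(0, L, focus)
        for c0 in range(0, L, focus)
    ]


def tile_magnetization(tile):
    """Fully local operator: needs focus = 1, context = 0."""
    return float(tile[0][0])


def tile_energy(tile, J=1.0):
    """Site energy for focus = 1, context = 1; each bond is shared by two tiles."""
    centre = tile[1][1]
    neighbours = tile[0][1] + tile[2][1] + tile[1][0] + tile[1][2]
    return -0.5 * J * centre * neighbours


def extensive_sum(grid, focus, context, tile_fn):
    """Apply the same function to every tile and sum the outputs."""
    return sum(tile_fn(t) for t in extract_tiles(grid, focus, context))


def ising_energy_grid(grid, J=1.0):
    """Reference energy computed directly on the whole grid."""
    L = len(grid)
    return -J * sum(
        grid[r][c] * (grid[r][(c + 1) % L] + grid[(r + 1) % L][c])
        for r in range(L) for c in range(L)
    )


rng = random.Random(0)
small = [[rng.choice((-1, 1)) for _ in range(4)] for _ in range(4)]
large = [[rng.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
for grid in (small, large):
    print(extensive_sum(grid, 1, 1, tile_energy), ising_energy_grid(grid))
```

The same `tile_energy` is applied unchanged to the 4 × 4 and 8 × 8 grids, which is the sense in which a tile-wise model carries over to larger systems.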

A GENERAL METHOD
As a demonstration of the generality of DCL, we now consider the case of a spin-glass where interactions between neighbours are sampled from a random distribution (the Edwards-Anderson model). Even with a much more complex and rich Hamiltonian, the RNN is able to learn only from observations at a fixed temperature, yet can be used to make accurate predictions across a wide range of conditions. Again, only unlabelled observations at a single fixed temperature are required to determine the behaviour of the system under arbitrary conditions, Fig. 5 (see S.I. for another example of random couplings).
By applying DCL to a spin-glass, we have demonstrated, for the first time, the ability to effectively learn the Hamiltonian operator directly without ever knowing its form or symmetries, or seeing directly labelled examples. This is the first time such an inversion has been demonstrated for non-trivial systems.
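For reference, a nearest-neighbour spin-glass energy with Gaussian couplings can be written in a few lines. The sketch below assumes the common sign convention $H = -\sum_{\langle ij \rangle} J_{ij} s_i s_j$, periodic boundaries, and one independent coupling per bond; these choices, and all names, are assumptions rather than details confirmed by the text.

```python
import random


def sample_couplings(L, rng, mu=0.0, sigma=1.0):
    """Draw one Gaussian coupling for the right and down bond of every site."""
    right = [[rng.gauss(mu, sigma) for _ in range(L)] for _ in range(L)]
    down = [[rng.gauss(mu, sigma) for _ in range(L)] for _ in range(L)]
    return right, down


def spin_glass_energy(grid, right, down):
    """H = -sum over bonds of J_ij * s_i * s_j on a periodic L x L lattice."""
    L = len(grid)
    energy = 0.0
    for r in range(L):
        for c in range(L):
            energy -= right[r][c] * grid[r][c] * grid[r][(c + 1) % L]
            energy -= down[r][c] * grid[r][c] * grid[(r + 1) % L][c]
    return energy


rng = random.Random(0)
L = 4
right, down = sample_couplings(L, rng)
grid = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
print(f"E = {spin_glass_energy(grid, right, down):.4f}")
```

Unlike the ferromagnetic case, the couplings differ from bond to bond, so the energy landscape depends on the particular realization of `right` and `down`.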

CONCLUSION
With knowledge only of the experimental constraints (i.e. the statistical ensemble) and a statistically significant number of observations from only a single set of thermodynamic conditions, distribution-consistent learning (DCL) is able to extract enough information about a physical system to fully predict its behaviour over a wide range of unseen conditions, across different length-scales, and through phase transitions.
DCL can predict the relative energy of micro-states with sufficient accuracy that it can be used to reproduce the energetic cost of excitations between states in a size-consistent manner. The model consists of two deep neural network topologies which are able to learn co-operatively. By using a combination of deep learning and physical constraints, we have shown that full transferability is possible. We expect that there will be many applications of this new method, including optical lattices and the growth of molecules on surfaces, among others.
model, this is ≈ $5 \times 10^5$. At the extremes, this is equivalent to reselecting certain configurations over and over again (consistent with what occurs physically). Test and train data were chosen such that they did not overlap. Our RNN are very simple, consisting of a single gated recurrent unit [14] (we also tried up to 4 stacked units, which gave only a modest improvement in our results). The size of the hidden state was between 378 and 512 neurons. All RNN models converged quickly; results reported here are based on only 30 epochs (learning rates between $2.5 \times 10^{-4}$ and $5 \times 10^{-4}$, and batch sizes between 4000 and 8000). For all models, we used a dropout rate of 0.9. All of our networks are implemented in TensorFlow and are available online along with training data (http://clean.energyscience.ca/codes).
For the EDNN, we used $192 \times 10^4$ random spin configurations for training data and achieved a converged result within only 60 epochs for all models. The EDNN was built using a previously reported architecture [13] ($f = 2$, $c = 1$, and 2 fully connected layers with 32 and 64 neurons, respectively). Throughout, we used rectified linear units (ReLU) as activation functions. Our goal was to use a simple and consistent set of parameters and training; it is very likely that there exist better choices of hyper-parameters than those presented here.
We also note that the RNN we have used here are very simple in form. Recently, attention mechanisms [15,16] have been shown to improve the performance of sequence models significantly (e.g. reducing the number of samples needed to achieve fixed fidelity). We expect that more advanced sequence models, including attention-only "Transformer" [17] networks and related models, could also be of benefit here, particularly with experimental data.
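The forward pass of a network with the stated shape ($f = 2$, $c = 1$, fully connected layers of 32 and 64 ReLU units) can be sketched in plain Python. This is not the authors' TensorFlow implementation: the weights are random and untrained, the final linear output neuron and periodic boundaries are assumptions, and all names are hypothetical. The point of the sketch is the structure, i.e. shared weights applied to every tile followed by a sum.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random (untrained) dense layer: weight rows plus zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def extract_flat_tiles(grid, focus=2, context=1):
    """Periodic tiles of side focus + 2 * context, flattened to vectors."""
    L = len(grid)
    span = range(-context, focus + context)
    return [
        [float(grid[(r0 + dr) % L][(c0 + dc) % L]) for dr in span for dc in span]
        for r0 in range(0, L, focus)
        for c0 in range(0, L, focus)
    ]


def ednn_forward(grid, layers, focus=2, context=1):
    """Sum of per-tile outputs from one network shared by all tiles."""
    total = 0.0
    for h in extract_flat_tiles(grid, focus, context):
        for layer in layers[:-1]:
            h = dense(h, layer)
        total += dense(h, layers[-1], relu=False)[0]  # assumed linear output
    return total


rng = random.Random(0)
tile_inputs = (2 + 2 * 1) ** 2  # 4 x 4 tile for f = 2, c = 1
layers = [init_layer(tile_inputs, 32, rng),
          init_layer(32, 64, rng),
          init_layer(64, 1, rng)]

small = [[rng.choice((-1, 1)) for _ in range(4)] for _ in range(4)]
# Periodic 2 x 2 replication of the small grid: four times the system size.
large = [[small[r % 4][c % 4] for c in range(8)] for r in range(8)]
print(ednn_forward(small, layers), ednn_forward(large, layers))
```

Replicating a periodic configuration multiplies the output by the number of copies regardless of the weights, which is the extensivity property that the tiling enforces by construction.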