Thermodynamic Machine Learning through Maximum Work Production

Adaptive thermodynamic systems -- such as a biological organism attempting to gain survival advantage, an autonomous robot performing a functional task, or a motor protein transporting intracellular nutrients -- can improve their performance by effectively modeling the regularities and stochasticity in their environments. Analogously, but in a purely computational realm, machine learning algorithms seek to estimate models that capture predictable structure and identify irrelevant noise in training data by optimizing performance measures, such as a model's log-likelihood of having generated the data. Is there a sense in which these computational models are physically preferred? For adaptive physical systems we introduce the organizing principle that thermodynamic work is the most relevant performance measure of advantageously modeling an environment. Specifically, a physical agent's model determines how much useful work it can harvest from an environment. We show that when such agents maximize work production they also maximize their environmental model's log-likelihood, establishing an equivalence between thermodynamics and learning. In this way, work maximization appears as an organizing principle that underlies learning in adaptive thermodynamic systems.


I. INTRODUCTION
A debate has carried on for the last century and a half over the relationship (if any) between abiotic physical processes and intelligence. Though taken up by many scientists and philosophers, one important thread focuses on issues that lie decidedly at the crossroads of physics and intelligence.
Perhaps unintentionally, James Clerk Maxwell laid foundations for the physics of intelligence with what Lord Kelvin (William Thomson) referred to as "intelligent demons" [1]. Maxwell in his 1871 book Theory of Heat had argued that a "very observant" and "neat fingered being" could subvert the Second Law of Thermodynamics [2]. In effect, his "finite being" uses its intelligence (Maxwell's word) to sort fast from slow molecules, creating a temperature difference that drives a heat engine to do useful work. Converting disorganized thermal energy to organized work energy, in this way, is forbidden by the Second Law. The cleverness in Maxwell's paradox turned on equating the thermodynamic behavior of mechanical systems with the intelligence of an agent that can accurately measure and control its environment. This established an operational equivalence between energetic thermodynamic processes, on the one hand, and intelligence, on the other.
We will explore the intelligence of physical processes, substantially updating the setting from the time of Kelvin and Maxwell, by calling on a wealth of recent results on the nonequilibrium thermodynamics of information [3,4]. In this, we directly equate the operation of physical agents descended from Maxwell's demon with notions of intelligence found in modern machine learning. While learning is not necessarily the only capability of a presumed intelligent being, it is certainly a most useful and interesting feature.
The root of many tasks in machine learning lies in discovering structure from data. The analogous process of creating models of the world from incomplete information is essential to adaptive organisms, too, as they must model their environment to categorize stimuli, predict threats, leverage opportunities, and generally prosper in a complex world. Most prosaically, translating training data into a model corresponds to density estimation [5], where the algorithm uses the data to construct a probability distribution.
This type of model-building at first appears far afield from more familiar machine learning tasks, such as categorizing pet pictures into cats and dogs or generating a novel image of a giraffe from a photo travelogue. Nonetheless, it encompasses them both [6]. Thus, by addressing the thermodynamic roots of model estimation, we seek a physical foundation for a wide breadth of machine learning. More to the point, we imagine a future in which the pure computation employed in a machine learning system is instantiated so that the physical properties of its implementation are essential to its functioning. And, in any case, we hope to show that this setting provides a workable, though simplified, approach to the physical and informational trade-offs facing adaptive organisms.
To carry out density estimation, machine learning invokes the principle of maximum likelihood to guide intelligent learning. This says that, of the possible models consistent with the training data, an algorithm should select the one with the maximum probability of having generated the data. Our exploration of the physics of learning asks whether a similar, thermodynamic principle guides physical systems as they adapt to their environments.
The modern understanding of Maxwell's demon no longer entertains violating the Second Law of Thermodynamics. In point of fact, the Second Law's primacy has been repeatedly affirmed in modern nonequilibrium theory and experiment. That said, what has emerged is an understanding of how intelligent (demon-like) physical processes can harvest thermal energy. They do this by exploiting an information reservoir [7][8][9]. That reservoir and the organization of the demon's control and measurement apparatus are how modern physics views the embodiment of its intelligence [10].
Machine learning estimates different likelihoods of different models given the same data. Analogously, in the physical setting of information thermodynamics, different demons harness different amounts of work from the same environmental information. Leveraging this commonality, we introduce thermodynamic learning as a physical process that infers optimal demons from environmental information. Thermodynamic learning selects demons that produce maximum work, paralleling parametric density estimation's selection of models with maximum likelihood. The surprising result is that these two principles of maximization are the same, when compared in a common setting.
Technically, we show that a probabilistic model of its environment is an essential part of the construction of an intelligent work-harvesting demon. That is, the demon's work production from environmental "training data" is proportional to the log-likelihood of the demon's environment model. Thus, if the thermodynamic training process selects the maximum-work demon for given data, it has also selected the maximum-likelihood model for that same data. In this way, thermodynamic learning is machine learning for thermodynamic machines: it infers models in the same way a machine learning algorithm does. Thus, work itself can be interpreted as a thermodynamic performance measure for learning. In this framing, learning is physical. While it is natural to argue that learning confers benefits, our result establishes that the benefit is fundamentally rooted in the physics of energy and information. Once these central results are presented and their interpretation explained, but before we conclude, we briefly recount the long-lived narrative of the thermodynamics of organization. This places the results in a historical setting and compares them to related work. We must first, however, explain the framework in which thermodynamic learning arises and then lay out the necessary technical background in density estimation, computational mechanics, thermodynamic computing, and thermodynamically-efficient computations. With these addressed, we explore the use of work as a measure of learning performance, ultimately deriving the equivalence between the conditions of maximum work and maximum likelihood.

II. FRAMEWORK
While demons continue to haunt discussions of physical intelligence, the notion of a physical process trafficking in information and energy exchanges need not be limited to mysterious intelligent beings. Most prosaically, we are concerned with any physical system that, while interacting with an environment, simultaneously processes information at some energetic cost or benefit. Avoiding theological distractions, we refer to these processes as thermodynamic agents. In truth, any physical system can be thought of as an agent, but only a limited number of them are especially useful for, or adept at, commandeering information to convert between various kinds of energy, such as between thermal energy and work. Here, we posit a setting that shows how to find the physical systems that are most capable of processing information to effect thermodynamic transformations.
Consider an environment that produces information in the form of a time series of physical values at regular time intervals of length τ. We denote the particular state realized by the environment's output at time jτ by the symbol y_j ∈ Y_j. Just as the agent must be instantiated by a physical system, so must the environment and its outputs to the agent. Specifically, Y_j represents the state space of the jth output, which is a subsystem of the environment.
An agent has no access to the internals of its environment and thus treats it as a black box. It can only access and interact with the environment's output system Y_j over each time interval t ∈ (jτ, (j + 1)τ). In other words, the state y_j realized by the environment's output is also the agent's input at time jτ. For instance, the environment may produce realizations of a two-level spin system Y_j = {↑, ↓}, which the agent is then tasked to manipulate through Hamiltonian control.
The aim, then, is to find an agent that produces as much work as possible using these black-box outputs. To do so, the agent must know something about the black box's structure. This is the principle of requisite complexity [11]: thermodynamic advantage requires that the agent's organization match that of its environment. We implement this by introducing a method for thermodynamic learning, as shown in Fig. 1, which selects a specific agent from a collection of candidates.
Peeking into the internal mechanism of the black box, we wait for a time Lτ, receiving the L symbols y_{0:L} = y_0 y_1 ··· y_{L−1}. This is the agent's training data, which is copied as needed to allow a population of candidate agents to interact with it. As each agent interacts with a copy, it produces an amount of work. Work energy is stored such that it can be retrieved later, for instance by raising a mass as depicted in Fig. 1. Then, the agent producing the most work is selected. This is "thermodynamic learning" in the sense that it selects a device based on measuring its thermodynamic performance: the amount of work it extracts. Ultimately, the goal is that the agent selected by thermodynamic learning continues to extract work as the environment produces new symbols. However, we leave analyzing the long-term effectiveness of thermodynamic learning to future work. Here, we concentrate on the condition of maximum work itself, deriving and interpreting it.
For clarity, note that thermodynamic learning differs from thermodynamic systems that, evolving in time, spontaneously adapt to their environment [12][13][14]. Work maximization as described here is thermodynamic in its objective, while these previous approaches are thermodynamic in their mechanism. That said, the perspectives are closely linked. In particular, it was suggested that thermodynamic systems spontaneously decrease the work absorbed from driving [13]. Note that work absorbed by the system is opposite the work produced. And so, as they evolve over time, these thermodynamic systems appear to seek higher work production, paralleling how thermodynamic learning selects for the highest work production. Moreover, the adaptation by which a thermodynamic system decreases work absorption is often compared to learning [13]. Reference [14] goes further, comparing the effectiveness of thermodynamic evolution to maximum-likelihood estimation employing an autoencoder. Notably, it reports that the machine learning technique performs markedly better than the thermodynamic evolution, for the particular physical system considered there.
The framework here also compares thermodynamic learning to machine learning algorithms that use maximum likelihood to select models consistent with given data. As Fig. 1 indicates, each agent has an internal model of its environment, a connection that Sec. VI formalizes. Each agent's work production is then evaluated for the training data. Thus, arriving at a maximum-work agent also selects that agent's internal model as a description of the environment. Moreover, and in contrast with Ref. [14], which compares thermodynamic and machine learning methods quantitatively, the framework here leads to an analytic derivation of the equivalence between thermodynamic learning and maximum-likelihood density estimation.

III. BACKGROUND
Directly comparing thermodynamic learning and density estimation requires explicitly demonstrating that thermodynamically-embedded computing and machine learning share the framework just laid out. The following introduces what we need for this: concepts from machine learning, computational mechanics, and thermodynamic computing. (Readers preferring more detail should refer to App. A.)

A. Parametric Density Estimation
Parametric estimation determines, from training data, the parameters θ of a probability distribution. In the present setting, θ parametrizes a family of probabilities Pr(Y_{0:∞} = y_{0:∞}|Θ = θ) over words of arbitrary length, where Y_j is the random variable for the environment's output at time jτ and Θ is the random variable for the model distribution. For convenience, we introduce new random variables Y^θ_j that define the model: Pr(Y^θ_{0:L} = y_{0:L}) ≡ Pr(Y_{0:L} = y_{0:L}|Θ = θ). With training data y_{0:L}, the likelihood of the model θ is the probability of the data given the model: L(θ|y_{0:L}) = Pr(Y^θ_{0:L} = y_{0:L}). Parametric density estimation seeks to optimize the likelihood L(θ|y_{0:L}) [5,15]. However, the procedure for finding maximum-likelihood estimates usually employs the log-likelihood instead: ℓ(θ|y_{0:L}) = ln Pr(Y^θ_{0:L} = y_{0:L}), (1) since it is maximized by the same models but converges more effectively [16].
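As a minimal illustration of this selection rule, the following sketch scores candidate models by log-likelihood and keeps the best. The i.i.d. binary model family and all variable names here are our own illustrative choices, not part of the formal setup:

```python
import math

def log_likelihood(theta, data):
    """ln Pr(Y_{0:L} = data | theta) for an i.i.d. binary model,
    where theta is the probability of emitting symbol 1."""
    return sum(math.log(theta if y == 1 else 1.0 - theta) for y in data)

data = [1, 1, 0, 1, 0, 1, 1, 1]       # toy training series y_{0:L}
candidates = [0.25, 0.5, 0.75]        # candidate parameters Theta
best = max(candidates, key=lambda th: log_likelihood(th, data))
# With 6 ones out of 8 symbols, theta = 0.75 attains the highest likelihood.
```

Because ln is monotonic, maximizing the log-likelihood and maximizing the likelihood select the same model, which is why the more numerically convenient log form is used.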

B. Computational Mechanics
Given that our data is a time series of arbitrary length starting with y_0, we must choose a model class whose possible parameters Θ = {θ} specify a wide range of possible distributions Pr(Y^θ_{0:∞}), the semi-infinite processes. ε-Machines, a class of finite-state machines introduced to describe bi-infinite processes Pr(Y^θ_{−∞:∞}) [17], provide a systematic means to do this. As described in App. A, these finite-state machines comprise just such a flexible class of representations; they can describe any semi-infinite process. This follows from the fact that they are explicitly constructed from the process.
A process's ε-machine consists of a set of hidden states S, a set of output states Y, a start state s* ∈ S, and conditional output-labeled transition matrices θ^(y)_{s→s′} over the hidden states: θ^(y)_{s→s′} specifies the probability of transitioning to hidden state s′ and emitting symbol y, given that the machine is in state s. In other words, the model is fully specified by the tuple: θ = {S, Y, s*, {θ^(y)_{s→s′}}_{s,s′∈S, y∈Y}}.
As an example, Fig. 2 shows an ε-machine that generates a periodic process with initially uncertain phase.
ε-Machines are unifilar, meaning that the current causal state s_j along with the next k symbols uniquely determines the following causal state through the function: s_{j+k} = f(s_j, y_{j:j+k}). This yields a simple expression for the probability of any word in terms of the model parameters: Pr(Y^θ_{0:L} = y_{0:L}) = Π_{j=0}^{L−1} θ^{(y_j)}_{s_j→s_{j+1}}, where s_0 = s* and s_{j+1} = f(s_j, y_j). Thus, in addition to being uniquely determined by the semi-infinite process, the ε-machine uniquely generates that same process, meaning that our model class Θ is equivalent to the class of possible distributions over time series data. Moreover, knowledge of the causal state of an ε-machine at any time step j contains all the information about the future that can be predicted from the past. In this sense, the causal state is predictive of the process. These and other properties have motivated a long investigation of ε-machines, in which the memory cost of storing the causal states is frequently used as a measure of process structure. Appendix A gives an extended review.
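Unifilarity makes the word probability a single product along the causal-state path. The following sketch makes this concrete; the two-state machine below is a hypothetical example of ours (it never emits two consecutive 1s), not the figure's process:

```python
def word_probability(machine, start, word):
    """Pr(Y^theta_{0:L} = word): follow the unifilar transitions from the
    start state s*, multiplying the emission probabilities theta^(y)_{s->s'}."""
    s, p = start, 1.0
    for y in word:
        if (s, y) not in machine:
            return 0.0                 # the model assigns this word zero probability
        s, prob = machine[(s, y)]      # unifilar: (state, symbol) -> unique next state
        p *= prob
    return p

# machine[(s, y)] = (next_state, probability)
machine = {('A', 0): ('A', 0.5), ('A', 1): ('B', 0.5), ('B', 0): ('A', 1.0)}
p = word_probability(machine, 'A', [1, 0, 1])   # 0.5 * 1.0 * 0.5 = 0.25
```

Note that no sum over hidden-state paths is needed: unifilarity collapses the usual hidden-Markov forward algorithm to a single trajectory.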

C. Thermodynamic Computing
Computation is physical: any computation takes place embedded in a physical system. Here, we refer to it as the system of interest. Its physical states, denoted Z = {z}, are taken as the information-bearing degrees of freedom [8]. The system's dynamic evolves the state distribution Pr(Z_t = z_t), where Z_t is the random variable describing the state at time t. Computation over the time interval t ∈ [τ, τ′] addresses how the dynamic maps the system from the initial time t = τ to the final time t = τ′. It consists of two components:
1. An initial distribution over states Pr(Z_τ = z_τ) at time t = τ.
2. Application of a Markov channel, characterized by the conditional probability of transitioning to the final state z_{τ′} given the initial state z_τ: M_{z_τ→z_{τ′}} = Pr(Z_{τ′} = z_{τ′}|Z_τ = z_τ).
Together, these comprise the logical elements of the computation. In this, z_τ is the input to the physical computation, z_{τ′} is the output, and M_{z_τ→z_{τ′}} is the logical architecture. Figure 3 illustrates a computation's physical implementation. The system of interest Z is coupled to a work reservoir, depicted as a mass hanging from a string, that controls the system's Hamiltonian along a trajectory H_Z(t) over the interval of the computation t ∈ [τ, τ′] [18]. This is the basic definition of a thermodynamic agent.
In a classical system, this control determines each state's energy E(z, t). As a result of the control, changes in energy due to changes in the Hamiltonian correspond to work exchanges between the system of interest and the work reservoir. If the system Z follows the state trajectory z_{τ:τ′} over the time interval t ∈ [τ, τ′], where z_t is the system state at time t, then the work production for that state trajectory is the integrated change in energy due to the Hamiltonian's time dependence [18]: W|z_{τ:τ′} = −∫_τ^{τ′} dt ∂_t E(z_t, t). Note that this decomposes the state trajectory z_{τ:τ′} into intervals of duration dt, chosen short enough to yield infinitesimal changes in state probabilities and the Hamiltonian. In this way, the state trajectory z_{τ:τ′} mirrors the time series notation used for our training data. Assuming the system computes while coupled to a thermal reservoir at temperature T, Landauer's Principle [8] relates a computation's logical processing to its energetics. In its contemporary form, it bounds the average work production, denoted ⟨W⟩, by a term proportional to the system's entropy change. Setting H[Z_t] = −Σ_z Pr(Z_t = z) ln Pr(Z_t = z) as the Shannon entropy in natural units, the Second Law of Thermodynamics implies [4]: ⟨W⟩ ≤ k_B T (H[Z_{τ′}] − H[Z_τ]). Here, the average ⟨W⟩ is taken over all possible trajectories.
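The bound is straightforward to evaluate numerically. The sketch below uses the standard bit-erasure numbers as an illustration of ours, not an example from the text:

```python
import math

def shannon(dist):
    """Shannon entropy in nats: H[Z] = -sum_z p(z) ln p(z)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

kB, T = 1.380649e-23, 300.0      # Boltzmann constant (J/K), room temperature
p_initial = [0.5, 0.5]           # one bit of initial uncertainty
p_final = [1.0, 0.0]             # reset to a known state (erasure)
work_bound = kB * T * (shannon(p_final) - shannon(p_initial))
# Erasure decreases entropy, so the bound is negative: at least
# kB * T * ln 2 joules of work must be spent, recovering Landauer's limit.
```

The same two lines of arithmetic show the converse: a computation that increases the system's entropy can, at best, produce that much work.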

IV. ENERGETICS OF COMPUTATIONAL MAPPINGS
This bound concerns the average work production over the ensemble of all possible states. However, thermodynamic learning uses the performance given a particular set of training data. Thus, the following evaluates the work production of a particular computational mapping z_τ → z_{τ′}, which ignores the details of the state trajectory z_{τ:τ′} by tracking only the input z_τ and output z_{τ′}. To determine performance, we first consider an efficient mapping and then connect efficiency to maximum likelihood.

A. Efficient Computations
Let's estimate the maximum work production associated with a computational map z_τ → z_{τ′}. That is, given a system following a state trajectory beginning in state z_τ and ending in z_{τ′}, what is its associated work production at temperature T?
To do so, we first derive a useful relation between work W|z_{τ:τ′} and entropy production Σ|z_{τ:τ′} along a full state trajectory z_{τ:τ′}. The total entropy produced by thermodynamic control is the sum of (i) the system of interest's entropy change [19]: k_B ln Pr(Z_τ = z_τ) − k_B ln Pr(Z_{τ′} = z_{τ′}), and (ii) the thermal reservoir's entropy change, which derives from the heat: Q|z_{τ:τ′}/T. Thus, recalling the First Law of Thermodynamics (system energy change is opposite work and heat production, ΔE_Z = −W − Q), the total entropy production of a particular state trajectory can be expressed in terms of the work production: W|z_{τ:τ′} = φ(z_τ, τ) − φ(z_{τ′}, τ′) − T Σ|z_{τ:τ′}. (2) This collects the excess quantities into a change-of-state function, called the pointwise nonequilibrium free energy: φ(z, t) = E(z, t) + k_B T ln Pr(Z_t = z). (3) Its name derives from the averaged quantity ⟨φ(z, t)⟩_{Pr(Z_t=z)} = F_neq(t), which is known as the nonequilibrium free energy [20].
When considering computational maps, we care only about the initial and final states of the system. As such, we take a statistical average over all trajectories beginning in z_τ and ending in z_{τ′}, which defines ⟨W|z_τ→z_{τ′}⟩: the computational-mapping work production for z_τ → z_{τ′}. This is how much energy is stored in the work reservoir, on average, when the computation realizes this particular input-output pair. Taking the same average of the entropy production in Eq. (2), conditioned on input and output, gives: ⟨W|z_τ→z_{τ′}⟩ = φ(z_τ, τ) − φ(z_{τ′}, τ′) − T⟨Σ|z_τ→z_{τ′}⟩. This suggests a relation between computational-mapping work and the change in pointwise nonequilibrium free energy φ(z, t). The relation becomes exact for thermodynamically-efficient computations. In such scenarios, where the average total entropy production over all trajectories vanishes, App. B shows that this, combined with Crooks' fluctuation theorem [21], implies that any individual trajectory produces zero entropy: Σ|z_{τ:τ′} = 0 for any z_{τ:τ′}. This is expected from linear response [22]. Thus, substituting zero entropy production into Eq. (2), we arrive at our result: work production for thermodynamically-efficient computations is the change in pointwise nonequilibrium free energy: W|z_{τ:τ′} = φ(z_τ, τ) − φ(z_{τ′}, τ′). Substituting Eq. (3) then gives: W|z_{τ:τ′} = E(z_τ, τ) − E(z_{τ′}, τ′) + k_B T ln[Pr(Z_τ = z_τ)/Pr(Z_{τ′} = z_{τ′})]. This also holds if we average over intermediate states of the system's state trajectory, yielding the work production of a computational mapping: ⟨W|z_τ→z_{τ′}⟩ = E(z_τ, τ) − E(z_{τ′}, τ′) + k_B T ln[Pr(Z_τ = z_τ)/Pr(Z_{τ′} = z_{τ′})]. (5) The energy required to perform efficient computing is independent of intermediate properties. It depends only on the probabilities and energies of the initial and final states.
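Under the zero-entropy-production assumption, the mapping work reduces to bookkeeping with φ. A short sketch in natural units; the probabilities are illustrative and `phi` is simply our name for the pointwise nonequilibrium free energy:

```python
import math

kB, T = 1.0, 1.0                 # natural units

def phi(energy, prob):
    """Pointwise nonequilibrium free energy: phi(z, t) = E(z, t) + kB T ln Pr(Z_t = z)."""
    return energy + kB * T * math.log(prob)

# Flat energy landscape (E = 0 before and after): an input state the agent's
# design assigns high probability, mapped into a uniform two-state output.
w = phi(0.0, 0.9) - phi(0.0, 0.5)
# Positive: a confidently-predicted input yields work when it is randomized.
```

The sign of `w` flips if the input state is one the design deems unlikely, which previews the misestimation cost discussed next.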

B. Thermodynamics of Misestimation
Even with perfectly-efficient thermodynamic control, misestimating the environment comes at a thermodynamic cost. If we estimate the input distribution Pr(Z^θ_τ) and output distribution Pr(Z^θ_{τ′}), the natural choice is to design the computation to be efficient for those estimates. By minimizing the entropy production for the estimated distributions, we guarantee that the thermodynamic agent produces as much work as possible when it receives the estimated inputs. However, if it misestimates the input, such that over the computation interval t ∈ [τ, τ′] the actual and estimated distributions differ, Pr(Z^θ_t) ≠ Pr(Z_t), then the computation must dissipate at least a minimum given by [23,24]: ⟨Σ⟩ ≥ k_B [D_KL(Pr(Z_τ)‖Pr(Z^θ_τ)) − D_KL(Pr(Z_{τ′})‖Pr(Z^θ_{τ′}))]. If Hamiltonian control misestimates inputs and outputs such that this lower bound on entropy production is positive, then the computation is inefficient. One consequence is that the work production in Eq. (5) is no longer satisfied. Failing Eq. (5), how can we find the work production of a protocol designed to be efficient for the estimated input Pr(Z^θ_τ) and output Pr(Z^θ_{τ′})? Fortunately, the work produced by a computational mapping does not depend on the actual initial or final distribution. This work explicitly conditions on the initial state z_τ and final state z_{τ′}. And so, it shields the resulting probability of the intervening state trajectory from the initial or final distributions. Thus, since Eq. (5) is satisfied when the input and output distributions are as expected (Pr(Z^θ_t) = Pr(Z_t)), it does not change when the agent receives any other input distribution. As a result, the work production of a computational mapping is entirely determined by the distributions for which the physical computation is designed to be efficient: W_{Z^θ_τ, Z^θ_{τ′}}(z_τ → z_{τ′}) = E(z_τ, τ) − E(z_{τ′}, τ′) + k_B T ln[Pr(Z^θ_τ = z_τ)/Pr(Z^θ_{τ′} = z_{τ′})]. (6) The subscripts Z^θ_τ and Z^θ_{τ′} are added to the work production to indicate which distributions were anticipated by the Hamiltonian control. Equation (6) now gives an explicit relationship between useful work production W, finite data z_τ and z_{τ′}, and the agent's model θ. This is the first step in establishing work production as a thermodynamic performance metric for learning.
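The dissipation bound above can be checked numerically. In this sketch the distributions are our own illustrative choices; with a fully randomized output, the output-side divergence vanishes and only the input misestimate contributes:

```python
import math

def kl(p, q):
    """Relative entropy D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

actual_input = [0.8, 0.2]        # true input distribution Pr(Z_tau)
model_input = [0.5, 0.5]         # agent's estimate Pr(Z^theta_tau)
uniform_output = [0.5, 0.5]      # actual and estimated outputs agree (both uniform)
min_dissipation = kl(actual_input, model_input) - kl(uniform_output, uniform_output)
# Positive: misestimating the input forces dissipation even with efficient control.
```

In units of k_B, this minimum dissipation is just the divergence between what the environment actually supplies and what the agent's model anticipates.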
Focusing on the energetic benefit deriving from the information itself, rather than from changing energy levels, we set the beginning and ending energies to be the same. Thus, ΔE_Z = 0 and the resulting work production from a computational mapping is: W_{Z^θ_τ, Z^θ_{τ′}}(z_τ → z_{τ′}) = k_B T ln[Pr(Z^θ_τ = z_τ)/Pr(Z^θ_{τ′} = z_{τ′})]. This measures the energetic gain from a single data realization as it transforms during a computation, as opposed to the ensemble average. Through precise control of the energy landscape, this energetic benefit is achievable. Section VIII describes a method to extract this work from a two-level system using an alternating process of instantaneous quenching, quasistatic evolution, then quenching again. This procedure generalizes to any computation, as shown in App. C. The following uses this result to design efficient agents that harvest energy from a time series. However, before exploring thermodynamic learning from time series, let's apply these results to training on data coming from the system of interest Z itself. What work can be produced from an input z_τ, regardless of the output, and how does it depend on the agent's internal model θ? And can an agent maximize the work from each input?
To address these questions, we design the computational architecture to fully randomize the output, M_{z→z′} = 1/|Z|, such that the final distribution is uniform: Pr(Z_{τ′} = z_{τ′}) = 1/|Z|. In this setting, an agent extracts the maximum possible energy by expanding into the state space as much as possible. The resulting work produced, on average, from a particular input z_τ is: ⟨W|z_τ⟩ = k_B T (ln |Z| + ln Pr(Z^θ_τ = z_τ)). Thus, an agent whose model maximizes the probability of particular input data produces the most work from that data.
From the machine learning perspective, the work production of an efficient agent operating on a single system increases proportionally to the log-likelihood (Eq. (1)) of the model θ given the input data: ⟨W|z_τ⟩ = k_B T (ℓ(θ|z_τ) + ln |Z|). That is, thermodynamic learning leads to a work-maximizing agent that also maximizes the likelihood of its model of the given input. Thus, by maximizing work production, a designer builds thermodynamic agents that, as in machine learning, employ maximum-likelihood models.
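This identification is direct to verify: ranking candidate models by the randomized-output work or by log-likelihood gives the same winner. The two candidate probabilities below are invented for illustration:

```python
import math

kB, T = 1.0, 1.0

def work_from_input(model_prob, n_states):
    """<W|z> = kB T (ln |Z| + ln Pr(Z^theta = z)) for a fully randomized output."""
    return kB * T * (math.log(n_states) + math.log(model_prob))

# Two candidate models assign different probabilities to the same observed input z.
models = {'theta_a': 0.3, 'theta_b': 0.6}
best_by_work = max(models, key=lambda m: work_from_input(models[m], 2))
best_by_loglike = max(models, key=lambda m: math.log(models[m]))
# Both selection rules pick the same model: work ranks models exactly as likelihood does.
```

The ln |Z| term is model-independent, so it shifts every candidate's work by the same constant and cannot change the argmax.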

V. WORK PRODUCTION FROM A TIME SERIES
To determine the work production, as just set up, of a time series of separate inputs y_{0:L}, it is tempting to take as our controlled system of interest Z the joint variable of all inputs and then apply a quasistatic channel that controls the energy of all inputs simultaneously. However, this violates temporal modularity in the inputs: the fact that each must be interacted with separately. Such a strategy requires a global energy landscape E(y_{0:L}, t). This is nonsensical, since each y_j is made available at a different time jτ. However, this does not mean that correlations between inputs cannot be addressed.
To harness the temporal correlations in a time series, we turn to information ratchets [25,26]. These generalize Maxwell's demon to harvest work from a series of inputs. Combining physical inputs with additional agent memory states that store the inputs' temporal correlations, we find the work production for single-shot short input strings. This contrasts with prior results that focused, instead, on ensemble-average work production [11,[25][26][27][28][29][30][31].
Given a sequence y_{0:L} of inputs, with y_j being the input provided at time t = jτ, the information-ratchet strategy for extracting work is to let each input interact, in turn, with an autonomous agent that stores memory of past inputs. This modifies our notion of thermodynamic computing only slightly. As Fig. 4 illustrates, over the jth time step the information-bearing system of interest becomes the joint system Z = X × Y_j, consisting of the agent state and the jth interaction symbol. In an information ratchet, each symbol Y_j interacts with the agent's memory X over the interaction interval [jτ, jτ + τ′], transforming the symbol stored in Y_j from input to output. Then, the agent's memory decouples and couples to the next symbol over the interval [jτ + τ′, (j + 1)τ] while its state preserves memory of past interactions. In this way, the agent uses its memory to transform a series of inputs y_{0:L} into a series of outputs y′_{0:L}.
The functional and energetic components of this procedure occur during the interaction interval [jτ, jτ + τ′], with the interval [jτ + τ′, (j + 1)τ] serving simply as buffer time between the jth and (j + 1)th interactions. Y_j is the interaction-symbol subsystem during the interaction interval and, along with the agent state, it evolves according to the framework for thermodynamic computing laid out in Sec. III C. Specifically, jτ takes the place of the initial time τ and jτ + τ′ takes the place of the final time τ′. The Hamiltonian control over the joint space H_{X×Y_j}(t) updates states according to a Markov transition matrix in the same way. This specifies the logical architecture of the ratchet's operation at each time step.
For convenience, we factor the joint system-of-interest random variable at the start of the interaction interval into separate components Z_{jτ} = X_j × Y_j. This gives a shorthand for the jth state of the agent X_j and the jth input Y_j. We also factor the system into component variables at the end of the interaction interval Z_{jτ+τ′} = X_{j+1} × Y′_j, with Y′_j representing the symbol emitted back to the environment and X_{j+1} representing the agent memory after the jth interaction, which is preserved for the next input. Figure 5 illustrates how this results in a general representation of thermodynamic information transduction [32]. Expressed in terms of the newly defined variables for inputs, outputs, and agent memory states, the agent's logical architecture is: M_{(x,y)→(x′,y′)} = Pr(X_{j+1} = x′, Y′_j = y′ | X_j = x, Y_j = y). The results on efficient thermodynamic control above now apply to this joint system. If the agent has a model Pr(X^θ_j = x, Y^θ_j = y) of its inputs at the beginning of each interaction interval, then it also has estimates of the output at the end of the interaction interval: Pr(X^θ_{j+1} = x′, Y′^θ_j = y′) = Σ_{x,y} M_{(x,y)→(x′,y′)} Pr(X^θ_j = x, Y^θ_j = y). With the computation designed to be efficient, the estimates determine the work production for a transition: W(x, y → x′, y′) = k_B T ln[Pr(X^θ_j = x, Y^θ_j = y)/Pr(X^θ_{j+1} = x′, Y′^θ_j = y′)]. (11)
When the estimated input distribution matches the actual distribution Pr(X_j = x, Y_j = y), the average work production takes on a familiar form [30,33]: ⟨W⟩ = k_B T (H[X_{j+1}, Y′_j] − H[X_j, Y_j]). More to the point, Eq. (11)'s computational-mapping work production now allows us to calculate the work produced for particular input sequences. We do this by first considering the work production of a particular sequence of agent memory states x_{0:L+1} and outputs y′_{0:L}: the sum over time steps j of the per-transition work W(x_j, y_j → x_{j+1}, y′_j) of Eq. (11).
If the agent is designed to start in a distribution Pr(X^θ_0) uncorrelated with its estimated input distribution Pr(Y^θ_{0:L}), then it anticipates the distribution over the sequence of inputs, outputs, and agent states: Pr(X^θ_{0:L+1} = x_{0:L+1}, Y^θ_{0:L} = y_{0:L}, Y′^θ_{0:L} = y′_{0:L}) = Pr(X^θ_0 = x_0) Pr(Y^θ_{0:L} = y_{0:L}) Π_{j=0}^{L−1} M_{(x_j,y_j)→(x_{j+1},y′_j)}. This gives the estimated distribution over the agent and input at time jτ from the marginal: Pr(X^θ_j = x_j, Y^θ_j = y_j) = Σ_{x_{0:j}, y_{0:j}, y′_{0:j}} Pr(X^θ_0 = x_0) Pr(Y^θ_{0:j+1} = y_{0:j+1}) Π_{i=0}^{j−1} M_{(x_i,y_i)→(x_{i+1},y′_i)}. Though challenging to calculate, the next section shows that there are particular agents, with an efficient logical architecture M, for which this is straightforward to evaluate. Before exploring the simplifications that arise for efficient agents, it is worth finding a general expression for the average work production resulting from a particular input string y_{0:L}. We do this by summing over all possible agent-state trajectories x_{0:L+1} and output series y′_{0:L}. Using both the actual and estimated initial distributions over agent states {Pr(X_0), Pr(X^θ_0)}, the agent's logical architecture M_{(x,y)→(x′,y′)}, and the actual and estimated distributions over inputs {Pr(Y_{0:L}), Pr(Y^θ_{0:L})}, the average work, in full detail, is: ⟨W|y_{0:L}⟩ = k_B T Σ_{x_{0:L+1}, y′_{0:L}} Pr(X_0 = x_0) [Π_{j=0}^{L−1} M_{(x_j,y_j)→(x_{j+1},y′_j)}] Σ_{j=0}^{L−1} ln[Pr(X^θ_j = x_j, Y^θ_j = y_j)/Pr(X^θ_{j+1} = x_{j+1}, Y′^θ_j = y′_j)]. (13) This is the average energy harvested by an agent that transduces inputs y_{0:L} according to the logical architecture M, given that it is designed to be as efficient as possible for the input distribution Pr(Y^θ_{0:L}). It is a far cry from the simple expression for the work production of a single-step computation, expressed as a difference of log-likelihoods. Without simplifications, the difficulty of calculating this quantity scales poorly as the length L of the input increases. We include the expression here primarily to reinforce the challenge of calculating work production from the estimated input distribution Pr(Y^θ_{0:∞}) and agent distribution Pr(X^θ_0). Barring simplifications, it seems quite challenging to maximize an agent's work production for long input strings.
Despite its unwieldiness, Eq. (13)'s work production is a deeply interesting quantity. In point of fact, since transducers are stochastic Turing machines [34], this is the work production for any general form of computation that maps inputs to output distributions Pr(Y' 0:L |Y 0:L = y 0:L ) [32]. Thus, Eq. (13) determines the work benefit possible for universal thermodynamic computing. Prior analyses showed that these ratchets have thermodynamic functionality including, but not limited to, (i) expending work to generate patterns [26,31] and (ii) harnessing temporal correlations to extract work [11,27].
The following focuses on the latter, specifically restricting attention to information engines designed to produce maximum work from correlated inputs. These are the information extractors of Refs. [31,33]. This leads to considerable simplifications.

VI. DESIGNING EFFICIENT AGENTS
As Sec. III B noted, our estimated semi-infinite input process Pr(Y θ 0:∞ ) has a unique minimal model, the ε-machine. What does the estimated model θ tell us about an agent that effectively transforms one of those inputs into useful work? The answer is found in the tuple {M, Pr(X θ j , Y θ j )}, which characterizes the agent. Recall that M is the agent's logical architecture and Pr(X θ j , Y θ j ) is the estimated joint distribution over agent memory and input at time-step j. Together, these fully determine the work production of an efficient agent. For an information extractor to avoid the thermodynamic cost of modularity, the logical architecture M xy→x'y' must be constructed such that the memory-state variables X j are predictive of the input [33]. Thus, the ε-machine generator provides a prescription for the transitions between the states of the joint input-and-memory variable.
Appendix E shows that the ε-machine specifies exactly how to construct an agent {M, Pr(X θ j , Y θ j )} that efficiently harvests work from the input Pr(Y θ 0:∞ ). The logical architecture M is given by the ε-machine's ε-function that maps histories to causal states. The estimated input is given by the ε-machine's hidden Markov model. Conversely, the efficient agent, characterized by its logical architecture and anticipated input distributions {M, Pr(X θ j , Y θ j )}, also specifies the ε-machine's model of the estimated distribution. Through the ε-machine, the agent also specifies its estimated input process. Figure 6 illustrates an example in which an initially uncertain-phase process (left) drives an agent with a matching internal model (middle). The result is a thermodynamically-efficient agent (far right). This demonstrates the bijection between (i) the anticipated input distribution, (ii) the ε-machine, which is the estimated model, and (iii) the agent that is designed to efficiently harness that input. In this way, the agent's effectiveness at harnessing work from finite data is directly associated with the model that underlies that agent's architecture. And so, from this point forward, when discussing an estimated process or an ε-machine that generates that guess, we are also describing the unique thermodynamic agent designed to produce maximal work from the estimated process.
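The construction can be sketched in code. This is a minimal illustration with hypothetical names (`build_agent`, `eps`, `can_emit`), assuming the rule described in this section: memory states track causal states, outputs are fully randomized to 1/|Y|, and inputs the ε-machine cannot emit leave the memory state unchanged so the agent waits for an acceptable input.

```python
# Minimal sketch of building an efficient agent's logical architecture M
# from an epsilon-machine (names hypothetical).
ALPHABET = ("0", "1")

def build_agent(states, eps, can_emit):
    """M[(x, y)] -> {(x_next, y_out): prob}.
    eps(x, y): causal-state update; can_emit(x, y): whether Pr(y | x) > 0.
    Disallowed inputs leave the memory state unchanged (the agent waits),
    preserving unifilarity."""
    M = {}
    for x in states:
        for y in ALPHABET:
            x_next = eps(x, y) if can_emit(x, y) else x
            # Outputs are uniform over the alphabet to maximize work.
            M[(x, y)] = {(x_next, y_out): 1.0 / len(ALPHABET)
                         for y_out in ALPHABET}
    return M

# Alternating (period-2) process: state A emits only "0", state B only "1".
eps = lambda x, y: {"A": "B", "B": "A"}[x]
can_emit = lambda x, y: (x, y) in {("A", "0"), ("B", "1")}
M = build_agent(("A", "B"), eps, can_emit)
```

The resulting `M` is stochastic in the outputs but deterministic (unifilar) in the memory-state update, mirroring the ε-machine's causal-state transitions.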

VII. WORK AS A PERFORMANCE MEASURE
Recall that our goal is to explore work production as a performance measure for a model estimated from a time series y 0:L . Section V calculated the work production from a time series for an agent. The result, though, was an unwieldy expression with no seeming connection to the agent's underlying model of the data. However, using predictive models θ, Sec. VI conveniently provided the design of agents that efficiently harness work energy with the model θ built in. Appendix D shows that using efficiently-designed predictive agents leads to a considerable simplification of the sequence work production. This vastly reduces the complexity of the work-production expression to a difference of the log-probabilities between the input distribution Pr(Y θ 0:L = y 0:L ) and the uniform output distribution Pr(Y' θ 0:L = y' 0:L ) = 1/|Y|^L. This form is familiar, nearly exactly reproducing that of Eq. (8), which determines the work harvested by an efficient agent extracting energy from a single realization of a physical system Z τ . However, it should be emphasized that achieving this work production relies on including the additional agent memory X . That memory allows us to account for the temporal modularity of the input by storing temporal correlations [33].
Thus, thermodynamically-efficient pattern extractors offer a substantial simplification when calculating agent work production. Much of the advantage derives from unifilarity, which guarantees a single hidden-state trajectory x 0:L+1 for an input y 0:L . Even calculating the probability of the output reduces to tracking, for the particular causal-state trajectory s 0:L+1 with s j = ε(s*, y 0:j ), the transition terms $\theta^{(y_j)}_{\epsilon(s^*, y_{0:j}) \to \epsilon(s^*, y_{0:j+1})}$.
Moreover, the model-dependent term in thermodynamically-efficient agent work production is familiar and interpretable: the model θ's log-likelihood. Since efficient agents are characterized by the model of their environment, the ε-machine, Eq. (17) suggests a parallel between machine learning and the thermodynamic processes that harness work from a finite string. If we treat y 0:L as training data for the model, then the log-likelihood is maximized when the agent's model anticipates the input with highest probability. This is the same condition for the thermodynamic agent extracting maximum work. Thus, the criterion for creating a good model of an environment is the same as that for extracting maximal work.
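The relation can be sketched numerically. This assumes the simplified form discussed above, W = k_B T (ℓ + L ln |Y|), where ℓ is the model's log-likelihood of the input string; the function name is hypothetical.

```python
import math

def work_production(log_likelihood, L, n, kT=1.0):
    # Assumed form for an efficient, output-randomizing agent:
    # W = kT * (ln Pr(Y^theta_{0:L} = y_{0:L}) + L * ln |Y|), with |Y| = n.
    return kT * (log_likelihood + L * math.log(n))

# A fair-coin model of a length-4 binary string breaks even: W = 0.
assert abs(work_production(4 * math.log(0.5), 4, 2)) < 1e-12
# A model that assigns the observed string higher probability gains work.
assert work_production(4 * math.log(0.8), 4, 2) > 0
```

Maximizing work over θ is therefore the same search as maximizing log-likelihood, since L ln |Y| is model-independent.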

VIII. TRAINING SIMPLE AGENTS
We now outline a simple version of thermodynamic learning that is experimentally implementable using a controllable two-level system. We first introduce a straightforward method to implement the simplest possible efficient agent. Second, we show that this physical process achieves the general maximum-likelihood result arrived at in the last section. Lastly, we find the agent selected by thermodynamic learning along with its corresponding model. As expected, this maximum-work-producing agent learns the features of its environment.

A. Efficient Computational Trajectories
The simplest possible information ratchets have only a single internal state A and receive binary data y j from a series of two-level systems Y j = {↑, ↓}. These agents' internal models correspond to memoryless ε-machines, as shown in Fig. 7. The model's parameters are the probabilities of emitting ↑ and ↓, denoted θ ↑ and θ ↓ . Our first step is to design an efficient computation that maps an input distribution Pr(Z jτ ) to an output distribution Pr(Z jτ +τ ) over the jth interaction interval [jτ, jτ + τ ]. The agent corresponds to the Hamiltonian evolution H Z (t) = H X ×Yj (t) over the joint space of the agent memory and jth input symbol. The resulting energy landscape E(z, t) is entirely specified by the energies of the two input states, E(A × ↑, t) and E(A × ↓, t).
Appropriately designing this energy landscape allows us to implement the efficient computation shown in Fig. 8. The thermodynamic evolution there instantaneously quenches the energy landscape into equilibrium with the estimated distribution at the beginning of the interaction interval Pr(Z θ jτ ), then quasistatically evolves the system in equilibrium to the estimated final distribution Pr(Z θ jτ +τ ), and, finally, quenches back to the default energy landscape. In Fig. 8, the system undergoes a cycle, starting and ending with the same flat energy landscape, such that ∆E Z = 0. This cycle evolves the distribution over the joint states A × ↑ and A × ↓ from Pr(Z θ jτ = {A × ↑, A × ↓}) = {0.8, 0.2} to Pr(Z θ jτ +τ = {A × ↑, A × ↓}) = {0.4, 0.6}. Note that this strategy can be used to evolve between any initial and final distributions.
We control the transformation over the time interval t ∈ (jτ, jτ + τ ) such that the time scale of equilibration in the system of interest is much shorter than the interval length τ . This slow quasistatic control means that the states remain in equilibrium with the energy landscape throughout the interval. In this case, the state distribution is the Boltzmann distribution: Pr(Z t = z) = e^{(F_{eq}(t) − E(z,t))/k_B T}.
To minimize dissipation for the estimated distribution, the state distribution must equal the estimated distribution, Pr(Z t = z) = Pr(Z θ t = z). And so, we set the two-level-system energies to be in equilibrium with the estimates: E(z, t) = F_{eq}(t) − k_B T ln Pr(Z θ t = z). The resulting quasistatic process produces zero work and maps Pr(Z jτ ) to Pr(Z jτ +τ ) without dissipation.
With the quasistatic transformation producing zero work, the total work production is that of the two quenches: minus the change in energy of the initial joint state x × y during the initial quench and minus the change in energy of the final joint state x' × y' during the final quench. The two-level system's state is fixed during the instantaneous energy changes. Thus, if the joint state follows the computational mapping x × y → x' × y' , the work production is, as expected, directly connected to the estimated distributions:
$$
W|_{x \times y \to x' \times y'} = k_B T \ln \Pr(Z^\theta_{j\tau} = x \times y) - k_B T \ln \Pr(Z^\theta_{j\tau+\tau} = x' \times y') .
$$
Recall from Sec. V that the ratchet system variable Z θ jτ = X θ j × Y θ j splits into the random variables for the jth agent memory state and the jth input. Similarly, Z θ jτ +τ = X θ j+1 × Y' θ j splits into the (j+1)th agent memory state and the jth output. This work production achieves the efficient value described in Eq. (7).
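The bookkeeping above can be sketched directly: only the two instantaneous quenches contribute work, giving W = k_BT ln p^θ_initial(z) − k_BT ln p^θ_final(z'). The helper name is hypothetical, and the distributions reproduce the {0.8, 0.2} → {0.4, 0.6} example of Fig. 8.

```python
import math

def mapping_work(p_est_initial, p_est_final, z_in, z_out, kT=1.0):
    # Quench-quasistatic-quench protocol: the quasistatic stage does no work,
    # so W|_{z -> z'} = kT ln p^theta_init(z) - kT ln p^theta_final(z').
    return kT * (math.log(p_est_initial[z_in]) - math.log(p_est_final[z_out]))

p0 = {("A", "up"): 0.8, ("A", "down"): 0.2}  # Pr(Z^theta at j*tau)
p1 = {("A", "up"): 0.4, ("A", "down"): 0.6}  # Pr(Z^theta at j*tau + tau)

# Remaining in the likely state ("up") extracts kT ln 2 of work.
assert abs(mapping_work(p0, p1, ("A", "up"), ("A", "up")) - math.log(2)) < 1e-12
```

Trajectories that start in an unlikely estimated state can instead cost work, as the sign of the log ratio dictates.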
Appendix C generalizes the thermodynamic operation above to any computation M zτ →zτ ' . While it requires an ancillary copy of the system Z to execute the conditional dependencies in the computation, it is conceptually identical in that it uses a sequence of quenching, evolving quasistatically, and quenching again. This appendix extends the strategies outlined in Refs. [31,33] to computational-mapping work calculations.

B. Efficient Information Ratchets
With the method for efficiently mapping inputs to outputs in hand, we can design a series of such computations to implement a simple information ratchet that produces work from a series y 0:L . As prescribed in Eq. (14) of Sec. VI, to produce the most work from the estimated model θ, the agent's logical architecture should randomly map every state to all others, since there is only one causal state A:
$$
M_{A \times y \to A \times y'} = \frac{1}{|\mathcal{Y}|} .
$$
In conjunction with Eq. (15), we find that the estimated joint distribution of the agent and interaction symbol at the start of the interaction is given by the parameters of the model,
$$
\Pr(X^\theta_j = A, Y^\theta_j = y) = \theta_y ,
$$
where we again used the fact that A is the only causal state. In turn, the estimated distribution after the interaction is uniform:
$$
\Pr(X^\theta_{j+1} = A, Y'^\theta_j = y') = \frac{1}{|\mathcal{Y}|} .
$$
Thus, assuming the agent has model θ built in, Eq. (18) determines that the work production for mapping A × y to output A × y' for a particular symbol y is W|_{A×y→A×y'} = k_B T ln(θ_y |Y|). Since A is the only memory state and the work does not depend on the output symbol y' , the average work produced from an input y is the same. With the work production expressed for a single input y j , we can now consider how much work our designed agent harvests from the binary training data y 0:L . Summing the work production of each input yields a simple expression in terms of the model θ:
$$
W(y_{0:L}) = k_B T \sum_{j=0}^{L-1} \ln\left(\theta_{y_j} |\mathcal{Y}|\right) = k_B T \left( \ln \prod_{j=0}^{L-1} \theta_{y_j} + L \ln |\mathcal{Y}| \right) .
$$
Due to the single causal state, the product within the logarithm simplifies to the probability of the word given the model, Pr(Y θ 0:L = y 0:L ).
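The per-symbol sum can be sketched in a few lines. This assumes the per-input contribution k_B T ln(θ_y |Y|) discussed above; the helper name is hypothetical and symbols are written "1"/"0" in place of ↑/↓.

```python
import math

def memoryless_work(data, theta, kT=1.0):
    # One-state agent: each symbol y contributes kT * ln(theta[y] * |Y|).
    n = len(theta)
    return kT * sum(math.log(theta[y] * n) for y in data)

# A model biased toward the symbols actually observed yields positive work;
# a fair model breaks even on any string.
assert memoryless_work("1111", {"1": 0.8, "0": 0.2}) > 0
assert abs(memoryless_work("1010", {"1": 0.5, "0": 0.5})) < 1e-12
```

The sum is exactly k_B T (ln Pr(Y θ 0:L = y 0:L ) + L ln |Y|), connecting the harvested work back to the model's log-likelihood of the data.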

C. Inferring Memoryless Models
Leveraging the explicit construction for efficient information ratchets, we can search for the agent that maximizes work from the input string y 0:L . To infer a model through work maximization, we label the frequency of ↑ symbols in this sequence f (↑) and the frequency of ↓ symbols f (↓). The corresponding log-likelihood of the model is
$$
\ell(\theta \mid y_{0:L}) = L \left[ f(\uparrow) \ln \theta_\uparrow + f(\downarrow) \ln \theta_\downarrow \right] .
$$
Thus, for the corresponding agent, the work production is
$$
W(y_{0:L}) = k_B T \left( \ell(\theta \mid y_{0:L}) + L \ln 2 \right) .
$$
Selecting from all possible memoryless agents, the model parameters maximizing work production are given by the frequencies of symbols in the input: θ ↑ = f (↑) and θ ↓ = f (↓). The resulting work production is
$$
W_{\max}(y_{0:L}) = L \, k_B T \left( \ln 2 - H[f(\uparrow)] \right) ,
$$
where H[f (↑)] is the Shannon entropy of a binary variable Y with Pr(Y = ↑) = f (↑), measured in nats.
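The inference step can be sketched end to end: match the model parameters to the symbol frequencies and evaluate the resulting maximal work L k_B T (ln 2 − H[f]). The function name is hypothetical and "1"/"0" stand in for ↑/↓.

```python
import math

def train_memoryless(data, kT=1.0):
    # Work-maximizing one-state model: emission probabilities equal the
    # observed symbol frequencies; W_max = L * kT * (ln 2 - H[f]).
    L = len(data)
    f_up = data.count("1") / L
    H = -sum(f * math.log(f) for f in (f_up, 1 - f_up) if f > 0)  # nats
    W_max = L * kT * (math.log(2) - H)
    return {"1": f_up, "0": 1 - f_up}, W_max

theta, W = train_memoryless("11111111")   # fully biased training data
assert theta["1"] == 1.0
assert abs(W - 8 * math.log(2)) < 1e-12   # kT ln 2 per fully-predictable bit
```

An unbiased string gives H = ln 2 and hence zero extractable work, consistent with the text's interpretation of the work as the gain from randomizing L biased bits.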
This simple example of learning statistical bias serves to explicitly lay out the stages of thermodynamic learning. It is too simple, though, to illustrate the full power of the new learning method. That said, it does confirm that thermodynamic work maximization leads to useful models of data in the simplest case. As one would expect, the simple agent found by thermodynamic learning discovers the frequency of ↑ symbols in the input and, thus, learns about its environment. The corresponding work production is the same as the energetic gain of randomizing L bits distributed according to the frequency f (↑).
However, this neglects the substantial thermodynamic benefits possible with temporally-correlated environments. To illustrate how to extract this additional energy, we design and analyze memoryful agents [27] in a sequel.

IX. SEARCHING FOR PRINCIPLES OF ORGANIZATION
Introducing a principle of maximum work production comes at a late stage of a long line of inquiry into what kinds of thermodynamic constraints and laws govern the emergence of organization and, for that matter, biological life. So, let's historically place the seemingly-new principle. In fact, it enters a crowded field.
Within statistical physics the paradigmatic principle was found by Kirchhoff [35]: in electrical networks, current distributes itself so as to dissipate the least possible heat for the given applied voltages. Generalizations, for equilibrium states, are then found in Gibbs' variational principle of entropy for heterogeneous equilibrium [36], Maxwell's principles of minimum heat [37, pp. 407-408], and Onsager's minimizing the "rate of dissipation" [38].
Close to equilibrium, Prigogine introduced minimum entropy production [39], identifying dissipative structures whose maintenance requires energy [40]. However, far from equilibrium the guiding principles can be quite the opposite. And so, the effort continues today, for example, with recent applications of nonequilibrium thermodynamics to pattern formation in chemical reactions [41]. That said, statistical physics misses at least two related but key components: the dynamics of, and the information in, thermal states.
Dynamical systems theory takes a decidedly mechanistic approach to the emergence of organization, analyzing the geometric structures in a system's state space that amplify fluctuations and eventually attenuate them into macroscopic behaviors and patterns. This was eventually articulated by pattern formation theory [42][43][44]. A canonical example is fluid turbulence [45]; a dynamical explanation for its complex organization occupied much of the 1970s and 1980s. Landau's original theory of incommensurate oscillations was superseded by the mathematical discovery in the 1950s of chaotic attractors [46,47]. This approach, too, falls short of yielding a principle of emergent organization. Patterns emerge, but what exactly are they and what complex behavior do they exhibit?
Answers to this challenge came from a decidedly different direction: Shannon's theory of noisy communication channels and his measures of information [48,49], appropriately extended [50]. While adding an important new perspective, that organized systems store and transmit information, this, also, did not go far enough, as it sidestepped the content and meaning of information [51]. Inroads to these appeared in the theory of computation inaugurated by Turing [52]. The most direct and ambitious approach to the role of information in organization, though, appeared in Wiener's cybernetics [53,54]. While it eloquently laid out the goals to which principles should strive, it ultimately never harnessed the mathematical foundations and calculational tools needed. Likely, the earliest overt connection between statistical mechanics and information appeared with Jaynes' Maximum Entropy [55] and Minimum Entropy Production Principles [56], a link that in many ways is responsible for modern machine learning.
So, what is new today is the synthesis of statistical physics, dynamics, and information. This, finally, allows one to answer the question, How do physical systems store and process information? The answer is that they intrinsically compute [57]. With this, one can extract from behavior a system's information processing, even going so far as to discover the effective equations of motion [58][59][60][61]. One can now frame questions about how a physical system reacts to, controls, and adapts to its environment.
All such systems, however, are embedded in the physical world and require resources to operate. More to the point, what energetic resources underlie computation? Initiated by Brillouin [62] and Landauer and Bennett [8,63], today there is a nascent physics of information [4,64]. Resource constraints on computing by thermodynamic systems are now expressed in a suite of new principles. For example, the information processing Second Law [26] places a lower bound on the work required to perform a given amount of information processing. The principle of requisite complexity [11] dictates that maximally-efficient interactions require an agent's internal organization to match its environment's organization. And, thermodynamic resource costs arise from the modularity of an agent's architecture [33].
To fully appreciate organization in life processes, one must also address the dynamics of agent populations, first on the time scale of agent life cycles and second on the scale of generational reproduction. In fact, tracking the complexity of individuals reveals that selection pressures spontaneously emerge in purely-replicating populations [65] and that replication itself necessarily dissipates energy [66].
As these pieces assembled, a picture has come into focus. Intelligent, adaptive systems learn to harness resources from their environment, expending energy to live and reproduce. Taken altogether, this historical perspective suggests we are moving close to realizing Wiener's cybernetics [53].

X. CONCLUSION
Coming in this historical setting, our main development introduced thermodynamic machine learning, a means for training intelligent agents to maximize work production from complex environmental stimuli supplied as time-series data. This involved constructing a framework that describes the thermodynamics of computation at the single-shot level, enabling us to evaluate the work an agent can produce from individual data realizations. Key to the framework is its generality: it applies to agents exhibiting arbitrary adaptive input-output behavior, implemented within any physical platform. We found that the performance of such agents increases proportionally to the log-likelihood of the model they use for predicting their environment. As a consequence, our results show that thermodynamic learning exactly mimics parametric density estimation in machine learning. Thus, work is a thermodynamic performance measure for physically-embedded learning. This result further solidifies the connections between agency, intelligence, and the thermodynamics of information, hinting that energy harvesting and learning may be two sides of the same coin.
These connections suggest a number of exciting future directions. From the technological perspective, these results hint at a natural method for designing intelligent energy harvesters, establishing that present tools of machine learning can be directly applied to the automated design of more efficient information ratchets and pattern engines [11,31,67]. Meanwhile, recent results hint that quantum systems are capable of generating certain complex adaptive behaviors with fewer resources than their classical counterparts [68][69][70]. The challenge now is to explore how the principle of maximum work production generalizes to quantum agents. Does this lead to new classes of quantum-enhanced energy harvesters or learners?
Ultimately, energy is an essential currency for life. This elevates the question, To what extent is work optimization a natural tendency of driven physical systems? Indeed, recent results indicate physical systems evolve to increase work production [13,14], opening a fascinating possibility. Could the equivalence between work production and learning then indicate that the universe itself naturally learns? The fact that complex intelligent life emerged from the lifeless soup of the universe might be considered a continuing miracle: a string of unfathomable statistical anomalies strung together over eons. It would certainly be wondrous if this evolution has a physical basis, a hidden Fourth Law of Thermodynamics that guides the universe toward creative entities capable of extracting maximal work.

estimation [5,71]. However, we take the random variable for the estimated distribution Z θ to denote the model, for notational and conceptual convenience.
The Shannon entropy [49]
$$
H[Z] = - \sum_{z} \Pr(Z = z) \ln \Pr(Z = z)
$$
measures uncertainty in nats, a "natural" unit for thermodynamic entropies. The Shannon entropy easily extends to joint probabilities and to all information measures composed from them (conditional and mutual informations). For instance, if the environment is composed of two correlated subcomponents Z = X × Y, the probability and entropy are expressed as
$$
\Pr(Z = x \times y) = \Pr(X = x, Y = y)
$$
and
$$
H[X, Y] = - \sum_{x, y} \Pr(X = x, Y = y) \ln \Pr(X = x, Y = y) ,
$$
respectively. While there are many other ways to create parametric models θ, from polynomial functions with a small number of parameters to neural networks with thousands [5], the goal is to match the estimated distribution Pr(Z θ ) as well as possible to the actual distribution Pr(Z).
One measure of success in this is the probability that the model generated the data: the likelihood. The likelihood of the model θ given a data point z is the same as the likelihood of Z θ : L(θ | z) = Pr(Z θ = z). Given training data and assuming independent samples, the likelihood of the model is the product
$$
\mathcal{L}(\theta \mid z) = \prod_{i} \Pr(Z^\theta = z_i) .
$$
This is a commonly used performance measure in machine learning, where algorithms search for models with maximum likelihood [5]. However, it is common to use the log-likelihood instead, which is maximized by the same models:
$$
\ell(\theta \mid z) = \ln \mathcal{L}(\theta \mid z) = \sum_i \ln \Pr(Z^\theta = z_i) . \tag{A2}
$$
If the model Z θ were specified by a neural network, the log-likelihood could be optimized through stochastic gradient descent with back-propagation [15,72], for instance. The intention is that the procedure converges on a network model that produces the data with high probability.
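Eq. (A2) is a one-liner for a discrete model. A minimal sketch, assuming the model is given as a table of outcome probabilities (the function and variable names are hypothetical):

```python
import math

def log_likelihood(model, samples):
    # ell(theta | z) = sum_i ln Pr(Z^theta = z_i), assuming i.i.d. samples.
    return sum(math.log(model[z]) for z in samples)

model = {"a": 0.5, "b": 0.5}
# Two equiprobable outcomes: ell = ln(1/2) + ln(1/2) = -2 ln 2.
assert abs(log_likelihood(model, ["a", "b"]) + 2 * math.log(2)) < 1e-12
```

For continuous or neural-network models the per-sample term ln Pr(Z θ = z_i) is replaced by the model's log-density, but the additive structure is the same.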

Thermodynamics of Information
Learning from data translates information in an environment into a useful model. What makes that model useful? In a physical setting, recalling from Landauer that "information is physical" [73], the usefulness one can extract from thermodynamic processes is work. Figure 3 shows a basic implementation of physical computation. Such an information-storing physical system Z = {z}, in contact with a thermal reservoir, can execute useful computations by drawing energy from a work reservoir. Energy flowing from the system Z into the thermal reservoir is positive heat Q. When energy flows from the system Z to the work reservoir, it is positive work production W. Work production quantifies the amount of energy stored in the work reservoir and available for later use. And so, in this telling, it represents a natural and physically-motivated measure of thermodynamic performance. In the framework for thermodynamic computation of Fig. 3, work is extracted by controlling the system's Hamiltonian. Specifically, the system's informational states are controlled via a time-dependent Hamiltonian, with energy E(z, t) of state z at time t. For a state trajectory z τ :τ ' = z τ z τ +dt · · · z τ '−dt z τ ' over the time interval t ∈ [τ, τ '], the resulting work extracted by Hamiltonian control is the temporally-integrated change in energy [18]:
$$
W|_{z_{\tau:\tau'}} = - \int_\tau^{\tau'} dt \, \frac{\partial E(z_t, t)}{\partial t} .
$$
The heat Q|_{z_{τ:τ'}} = E(z τ , τ ) − E(z τ ' , τ ') − W|_{z_{τ:τ'}} flows into the thermal reservoir, increasing its entropy by
$$
\Delta S_{\text{reservoir}} = \frac{Q|_{z_{\tau:\tau'}}}{T} ,
$$
where the thermal reservoir is at temperature T . The Second Law of Thermodynamics states that, on average, any processing of the informational states can only yield nonnegative entropy production of the universe (reservoir and system Z): ⟨Σ⟩ ≥ 0. This constrains the energetic cost of computations performed within the system Z.
A computation over the time interval t ∈ [τ, τ'] has two components:

1. An initial distribution over states Pr(Z_τ = z_τ), where Z_t is the random variable of the system Z at time t.

2. A Markov channel that transforms it, specified by the conditional probability of the final state z_{τ'} given the initial state z_τ:

M_{z_τ → z_{τ'}} = Pr(Z_{τ'} = z_{τ'} | Z_τ = z_τ) .

This specifies, in turn, the final distribution Pr(Z_{τ'} = z_{τ'}), which allows direct calculation of the pointwise change in the system's entropy [19]:

ΔS_system|_{z_{τ:τ'}} = k_B ln [ Pr(Z_τ = z_τ) / Pr(Z_{τ'} = z_{τ'}) ] .
Adding this to the thermal reservoir's entropy change yields the entropy production of the universe, which can also be expressed in terms of the work production:

Σ|_{z_{τ:τ'}} ≡ ΔS_reservoir|_{z_{τ:τ'}} + ΔS_system|_{z_{τ:τ'}} = [ φ(z_τ, τ) − φ(z_{τ'}, τ') − W|_{z_{τ:τ'}} ] / T .

Here, φ(z, t) = E(z, t) + k_B T ln Pr(Z_t = z) is the pointwise nonequilibrium free energy, which becomes the nonequilibrium free energy when averaged: ⟨φ(z, t)⟩_{Pr(Z_t = z)} = F_neq(t) [20]. Note that the entropy production is also proportional to the additional work that could have been extracted had the computation been efficient. This is referred to as the dissipated work, W_diss|_{z_{τ:τ'}} = T Σ|_{z_{τ:τ'}}. For the protocol constructed below, which is efficient for the estimated process Z^θ, the extracted work is set by the estimated surprisal change k_B T ln [ Pr(Z^θ_τ = z_τ) / Pr(Z^θ_{τ'} = z_{τ'}) ] .
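Averaged over trajectories, the dissipated work reduces to a difference of relative entropies between the actual and estimated processes, ⟨W_diss⟩ = k_B T [ D_KL(Z_τ||Z^θ_τ) − D_KL(Z_{τ'}||Z^θ_{τ'}) ]. A minimal numerical sketch, in units of k_B T; the channel and the two input distributions are illustrative assumptions:

```python
import math

kB_T = 1.0  # assumption: energies in units of k_B T

def d_kl(p, q):
    """Relative entropy D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Actual input p_in vs. the agent's estimated input q_in, and their
# images p_out, q_out under the same Markov channel M (all illustrative).
M = [[0.9, 0.1], [0.2, 0.8]]            # M[z][z2] = Pr(z2 | z)
p_in, q_in = [0.5, 0.5], [0.8, 0.2]

def push(p):
    """Push a distribution through the channel M."""
    return [sum(p[z] * M[z][z2] for z in range(2)) for z2 in range(2)]

p_out, q_out = push(p_in), push(q_in)
# <W_diss> = k_B T [ D_KL(in) - D_KL(out) ], nonnegative because
# relative entropy contracts under a common channel.
w_diss = kB_T * (d_kl(p_in, q_in) - d_kl(p_out, q_out))
print(w_diss)
```

The nonnegativity is the data-processing inequality for relative entropy, mirroring the Second Law statement ⟨Σ⟩ ≥ 0.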
Note that, while Pr(Z^θ_τ = z) is the input distribution for which the computation is efficient, other input distributions Pr(Z_τ = z) may be as well. They are only required to satisfy D_KL(Z_τ||Z^θ_τ) − D_KL(Z_{τ'}||Z^θ_{τ'}) = 0. The physical setting that we take for thermodynamic control is overdamped Brownian motion on a controllable energy landscape, described by detailed-balanced rate equations. However, if the physical state space is limited to Z, then not all channels can be implemented with continuous-time rate equations. Fortunately, this can be circumvented with additional ancillary or hidden states [78]. And so, to implement any possible channel, we add an ancillary copy Z' of our original system Z, such that the entire physical system is Z_total = Z × Z'.
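The need for ancillary states can be seen already with two states: any continuous-time rate-equation evolution leaves a positive probability of remaining in place, so a deterministic bit flip is unreachable on Z alone. A small sketch, using the exact two-state master-equation solution (the rates and time are illustrative assumptions):

```python
import math

# A continuous-time two-state rate process always leaves positive
# probability of staying put, so the deterministic NOT channel
# [[0, 1], [1, 0]] cannot be realized on Z alone -- motivating the
# ancillary copy Z'.
def channel(a, b, t):
    """Transition matrix after time t for rates a (0->1) and b (1->0)."""
    r = a + b
    decay = math.exp(-r * t)
    pi0, pi1 = b / r, a / r          # stationary distribution
    return [[pi0 + pi1 * decay, pi1 - pi1 * decay],
            [pi0 - pi0 * decay, pi1 + pi0 * decay]]

T = channel(a=5.0, b=3.0, t=10.0)
# Diagonal entries never reach zero, however long the evolution runs.
assert T[0][0] > 0 and T[1][1] > 0
```

However large t becomes, the diagonal only decays toward the stationary weights, never to zero, whereas the NOT channel requires exactly zero diagonal.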
Prescriptions have been given that efficiently implement any computation, specified by a Markov channel M_{z_τ → z_{τ'}} = Pr(Z_{τ'} = z_{τ'} | Z_τ = z_τ), using quasistatic manipulation of Z's energy levels and an ancillary copy Z' [31, 33]. However, these did not determine the work production for individual logical trajectories z_τ → z_{τ'} during the computation interval (τ, τ').
The following implements an analogous form of quasistatic computation that allows us to easily calculate the energy associated with implementing the computation M_{z_τ → z_{τ'}}, assuming the system starts in z_τ and ends in z_{τ'}. This also requires an ancillary copy of Z, denoted Z', to allow for the full range of logical operations. (Note that some are impossible if restricted to a single copy of Z and continuous-time rate-equation evolution.) Due to detailed balance, the rate-equation dynamics are partially specified by the energy E(z, z', t) of system state z and ancillary state z' at time t. This also uniquely specifies the equilibrium distribution:

Pr(Z^eq_t = z, Z'^eq_t = z') = e^{−E(z,z',t)/k_B T} / Σ_{z,z'} e^{−E(z,z',t)/k_B T} .
The normalization constant Σ_{z,z'} e^{−E(z,z',t)/k_B T} is the partition function, which determines the equilibrium free energy:

F_eq(t) = −k_B T ln ( Σ_{z,z'} e^{−E(z,z',t)/k_B T} ) .
Inverting the equilibrium distribution expresses the energy in terms of the equilibrium free energy, which is constant over the states, and the equilibrium probability:

E(z, z', t) = F_eq(t) − k_B T ln Pr(Z^eq_t = z, Z'^eq_t = z') .
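These relations are easy to verify numerically. The sketch below builds the Boltzmann distribution, partition function, and F_eq for a small joint energy landscape (the energies and the k_B T = 1 units are assumptions) and checks the identity above at every joint state:

```python
import math

kB_T = 1.0  # assumption: units where k_B T = 1
# Illustrative energies E(z, z') on a 2x2 joint state space (assumption).
E = {(z, zp): 0.5 * z + 1.5 * zp for z in (0, 1) for zp in (0, 1)}

Zpart = sum(math.exp(-e / kB_T) for e in E.values())   # partition function
F_eq = -kB_T * math.log(Zpart)                          # equilibrium free energy
p_eq = {s: math.exp(-E[s] / kB_T) / Zpart for s in E}   # Boltzmann distribution

# Check: E(z, z') = F_eq - k_B T ln Pr^eq(z, z') for every joint state.
for s in E:
    assert abs(E[s] - (F_eq - kB_T * math.log(p_eq[s]))) < 1e-12
print(round(F_eq, 4))
```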
We leverage this relationship between energy and equilibrium probability to design a protocol that achieves the work production given by Eq. (C2) for a Markov channel M. The estimated distribution over the whole space assumes that the ancillary variable is initially uncorrelated and uniformly distributed:

Pr(Z^θ_τ = z, Z'^θ_τ = z') = Pr(Z^θ_τ = z)/|Z| .

See Fig. 10. For all protocol epochs, except for epoch 3 during which the two subsystems are swapped, Z is held fixed while the ancillary system Z' follows the local equilibrium distribution. Let us detail these epochs in turn.
1. Quench: Instantaneously quench the energy from E(z, z', τ) = ξ to E(z, z', τ+) = k_B T ln(|Z|/Pr(Z^θ_τ = z)) over the infinitesimal time interval [τ, τ+] such that, if the distribution were as estimated, it would be in equilibrium: Pr(Z^eq_{τ+} = z, Z'^eq_{τ+} = z') = Pr(Z^θ_τ = z)/|Z|. If the system started in z_τ, then the associated work produced is opposite the energy change:

W^{(1)}|_{Z^θ_τ = z_τ, Z^θ_{τ'} = z_{τ'}} = ξ − k_B T ln(|Z|/Pr(Z^θ_τ = z_τ)) ,

where the subscript denotes that the work is produced in the first epoch, conditioned on the estimated distributions Z^θ_τ and Z^θ_{τ'} and on the initial and final states z_τ and z_{τ'}. Note that we also condition on Z^θ_{τ'} = z_{τ'}, even though work production in this epoch is unaffected by the end state of the computation.
2. Quasistatically evolve: Quasistatically evolve the energy landscape over a third of the total time interval, (τ, τ_1], such that the joint system remains in equilibrium and the ancillary system Z' comes to be determined by the Markov channel M applied to the system Z:

Pr(Z_{τ_1} = z, Z'_{τ_1} = z') = Pr(Z^θ_τ = z) M_{z→z'} ,
E(z, z', τ_1) = −k_B T ln [ Pr(Z^θ_τ = z) M_{z→z'} ] .
Also, hold the energy barriers between states in Z high, preventing probability flow between its states and preserving the distribution Pr(Z_t) = Pr(Z_τ) for all t ∈ (τ, τ_1]. Given that the system started in Z_τ = z_τ, the work production during this epoch is minus the temporally-integrated average change in energy:

W^{(2)}|_{Z^θ_τ = z_τ} = −∫_τ^{τ_1} dt Σ_{z,z'} Pr(Z'_t = z', Z_t = z | Z_τ = z_τ) ∂_t E(z, z', t) .
As the system Z remains in z_τ over the interval,

Pr(Z'_t = z', Z_t = z | Z_τ = z_τ) = Pr(Z'_t = z' | Z_t = z) δ_{z, z_τ} ,

the work production simplifies to an integral over the ancillary states alone. We can express the energy in terms of the estimated equilibrium probability distribution:

E(z_τ, z', t) = −k_B T ln [ Pr(Z'_t = z' | Z_t = z_τ) Pr(Z^θ_t = z_τ) ] .
And, since the distribution over the system Z is fixed during this interval:

Pr(Z'_t = z' | Z_t = z_τ) Pr(Z^θ_t = z_τ) = Pr(Z'_t = z' | Z_t = z_τ) Pr(Z^θ_τ = z_τ) .
Plugging these into the expression for the work production, we find that the evolution happens without energy exchange: W^{(2)}|_{Z^θ_τ = z_τ} = 0. The resulting joint distribution over the ancillary and primary systems matches the desired computation:

Pr(Z_{τ_1} = z, Z'_{τ_1} = z') = Pr(Z^θ_τ = z) M_{z→z'} . (C2)

3. Swap: Over the time interval (τ_1, τ_2], slowly swap the two systems Z ↔ Z', such that Pr(Z_{τ_1} = z, Z'_{τ_1} = z') = Pr(Z_{τ_2} = z', Z'_{τ_2} = z) and E(z, z', τ_1) = E(z', z, τ_2). Being reversible, this operation requires zero work, regardless of where the system starts or ends.

4. Reset ancillary: Over the interval (τ_2, τ'−], quasistatically return the ancillary system Z' to the uncorrelated uniform distribution, keeping the primary system Z fixed as in epoch 2. As in epoch 2, there is zero work production. The result is that the primary system is in the desired final distribution,

Pr(Z_{τ'} = z') = Σ_z Pr(Z^θ_τ = z) M_{z→z'} ,

having undergone the mapping from its original state at time τ, while the ancillary system has returned to an uncorrelated uniform distribution.

5. Reset: Finally, over the time interval [τ'−, τ'], instantaneously reset the energy to the default flat landscape E(z, z', τ') = ξ. The associated work production, given that the system ends in the state z_{τ'}, is:

W^{(5)}|_{Z^θ_τ = z_τ, Z^θ_{τ'} = z_{τ'}} = k_B T ln(|Z|/Pr(Z^θ_{τ'} = z_{τ'})) − ξ .
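The epoch bookkeeping can be checked numerically: only the instantaneous quenches of epochs 1 and 5 exchange work, and their sum telescopes to the estimated surprisal change. The sketch below uses the Fig. 8 input distribution together with an illustrative channel M (the channel and the k_B T = 1 units are assumptions):

```python
import math

# Hedged numerical sketch of the five-epoch work bookkeeping. Epochs
# 2-4 are quasistatic or reversible and contribute zero work; only the
# quenches of epochs 1 and 5 remain.
kB_T, xi, n = 1.0, 0.0, 2          # n = |Z|; xi = default flat energy
p_in = {"up": 0.8, "down": 0.2}    # estimated input distribution (Fig. 8)
M = {"up": {"up": 0.5, "down": 0.5},   # illustrative channel (assumption)
     "down": {"up": 0.0, "down": 1.0}}
p_out = {z2: sum(p_in[z] * M[z][z2] for z in p_in) for z2 in p_in}

def w_quench(z):      # epoch 1: work = xi - E(z, z', tau+)
    return xi - kB_T * math.log(n / p_in[z])

def w_reset(z2):      # epoch 5: work = E(z, z', tau'-) - xi
    return kB_T * math.log(n / p_out[z2]) - xi

# Trajectory work telescopes to the estimated surprisal change.
for z in p_in:
    for z2 in p_out:
        total = w_quench(z) + w_reset(z2)
        assert abs(total - kB_T * math.log(p_in[z] / p_out[z2])) < 1e-12

# Averaged over the estimated process, the extracted work equals k_B T
# times the entropy increase of the estimated distribution.
avg_w = sum(p_in[z] * M[z][z2] * (w_quench(z) + w_reset(z2))
            for z in p_in for z2 in p_out)
print(round(avg_w, 5))
```

Note that the default energy ξ cancels between the two quenches, so the trajectory work depends only on the estimated probabilities, as in the efficient-protocol expression above.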

FIG. 1. Thermodynamic learning generates the maximum-work producing agent: (Left) Environment (green) behavior becomes data for agents (red). (Middle) Candidate agents each have an internal model (inscribed stochastic state machine) that captures the environment's randomness and regularity to store work energy (e.g., lift a mass against gravity). (Right) Thermodynamic learning searches the candidate population for the best agent, the one producing maximum work.

FIG. 2. Machine generating the phase-uncertain period-2 process: With probability 0.5, an initial transition is made from the start state s* to state A, from which it emits the sequence 1010.... And, with probability 0.5, the start state transitions to state B and outputs the sequence 0101....

FIG. 3. Thermodynamic computing: The system of interest Z's physical states store information, processing it as they evolve. Work energy is supplied by the work reservoir, represented by the hanging mass. And heat energy is supplied by the thermal reservoir.
FIG. 4. Thermodynamic computing by an agent subject to an input: Information-bearing degrees of freedom Z split into the direct product of agent states X and the jth input states Y_j. Work and heat are defined correspondingly.
FIG. 5. Agent interacting with an environment via repeated symbol exchanges: A) At time jτ agent memory X_j begins interacting with input symbol Y_j. Transitioning from A) to B), agent memory and interaction symbol jointly evolve according to the Markov channel M_{xy→x'y'}. This results in B), the updated states of agent memory X_{j+1} and interaction symbol Y'_j at time jτ + τ. Transitioning from B) to C), the agent memory decouples from the interaction symbol, emitting its new state to the environment. Then, transitioning from C) to D), the agent retains its memory state X_{j+1} and the environment emits the next interaction symbol Y_{j+1}. Finally, transitioning from D) to A), the agent restarts the cycle by coupling to the next input symbol.

FIG. 7. Memoryless model of binary data, consisting of a single state A and the probabilities of outputting ↑ and ↓, denoted θ^(↑)_{A→A} and θ^(↓)_{A→A}, respectively.
FIG. 8. Joint two-level system Z = X × Y_j = {A × ↑, A × ↓} undergoing perfectly-efficient computation when it receives its estimated input, through a series of operations. The computation occurs over the time interval t ∈ (jτ, jτ + τ). At panel A), t = jτ and the system has a default flat energy landscape E(z, jτ) = E(x × y, jτ) = 0. However, it is out of equilibrium, since it is in the distribution Pr(Z^θ_{jτ} = {A × ↑, A × ↓}) = {0.8, 0.2}. The first operation is a quench, which instantaneously sets the energies to be in equilibrium with the initial distribution, as shown in panel B). The associated energy change is work. Then, a quasistatic operation slowly evolves the system in equilibrium, through panel C), to the final desired distribution Pr(Z^θ_{jτ+τ} = {A × ↑, A × ↓}) = {0.4, 0.6}, shown in panel D). This requires no work. Then, the final operation is another quench, in which the energies are reset to the default energy landscape E(z, jτ + τ) = 0, leaving the system as shown in panel E). Again, the change in energy corresponds to work invested through control. The total work production for a particular computational mapping A × y → A × y' is given by the work from the initial quench W|_{A×y}(jτ) plus the work from the final quench W|_{A×y'}(jτ + τ).

FIG. 10. Quasistatic agent implementing the Markov channel M_{z_τ → z_{τ'}} in the system Z over the time interval [τ, τ'] using ancillary copy Z', in five steps. Epoch 1: Energy landscape is instantaneously brought into equilibrium with the estimated distribution over the joint system. Epoch 2: Probability flows in the ancillary system Z' as the energy landscape quasistatically changes, making the conditional distribution in Z' reflect the Markov channel Pr(Z'_{τ_1} = z' | Z_{τ_1} = z) = M_{z→z'}. Epoch 3: Systems Z and Z' are swapped. Epoch 4: Ancillary system is quasistatically reset to the uniform distribution. Epoch 5: Energy landscape is instantaneously reset to uniform.