Nonequilibrium Statistical Mechanics and Optimal Prediction of Partially-Observed Complex Systems

Only a subset of degrees of freedom are typically accessible or measurable in real-world systems. As a consequence, the proper setting for empirical modeling is that of partially-observed systems. Notably, data-driven models consistently outperform physics-based models for systems with few observable degrees of freedom; e.g., hydrological systems. Here, we provide an operator-theoretic explanation for this empirical success. To predict a partially-observed system's future behavior with physics-based models, the missing degrees of freedom must be explicitly accounted for using data assimilation and model parametrization. Data-driven models, in contrast, employ delay-coordinate embeddings and their evolution under the Koopman operator to implicitly model the effects of the missing degrees of freedom. We describe in detail the statistical physics of partial observations underlying data-driven models using novel Maximum Entropy and Maximum Caliber measures. The resulting nonequilibrium Wiener projections applied to the Mori-Zwanzig formalism reveal how data-driven models may converge to the true dynamics of the observable degrees of freedom. Additionally, this framework shows how data-driven models infer the effects of unobserved degrees of freedom implicitly, in much the same way that physics models infer the effects explicitly. This provides a unified implicit-explicit modeling framework for predicting partially-observed systems, with hybrid physics-informed machine learning methods combining implicit and explicit aspects.


I. INTRODUCTION
Most Earth Science investigations access only a subset of a high-dimensional dynamical system's degrees of freedom due to limited instrumentation. Predicting the future behavior of partially-observed systems is a central challenge for many areas of Earth Science, and one that dates back to the earliest uses of scientific computing [1,2].
Traditional prediction employing physics-based (or process-based) models relies on explicit representations: systems are modeled via closed-form equations of motion that determine how a system evolves forward in time through interactions among all its degrees of freedom. Predictions are extracted from numerical approximations of solutions of the equations of motion. This requires knowing the full state of the system at each time, but limited instrument measurements of the true system provide only a partial view of the underlying state. Data assimilation is then used to generate a data-image through model inversion. The result is a coarse-grained approximation of the full system state that is most consistent with the instrument observations and assumptions of the underlying physics.
In contrast, data-driven prediction (typically) does not rely on explicit closed-form models and thus does not require interpolated data-images. For the prediction task that evolves only instrument measurements forward in time, data-driven models learn implicit representations for this evolution directly from the observations themselves.
The explicit nature of physics models hinges on encapsulating our understanding of a system's governing physics in closed-form differential equations-of-motion. This is what explicit representations attempt to approximate. The equivalent governing physics, the "ground truth", for the evolution of the measurement observables is given by linear, infinite-dimensional Koopman operators. The implicit representations of data-driven models thus attempt to learn projections of the Koopman operators' action [3].
In fact, the governing equations of motion for the measurement observables are given by the Mori-Zwanzig equation [4,5], derived from expanding the action of the Koopman operator in terms of projection operators onto the observable degrees of freedom [6,7]. A key insight from the Mori-Zwanzig formalism is that predictive models of partially-observed systems require a history dependence-past observations of the observable degrees of freedom generally contain information relevant for future predictions.
Recently, the connection between the history dependence of predictive models and the intrinsic geometry of delay-coordinate embeddings [8,9] has been explored [10][11][12][13]. Past values of partial observations, in the form of delay embeddings, implicitly stand in for the missing degrees of freedom. This parallels how, for physics-based models, data images act explicitly to fill in the gaps of the missing degrees of freedom when predicting partially-observed systems.
Most data-driven modeling and prediction relies on Hilbert space methods that learn a target function living in a Hilbert space of functions [14]. Optimal Hilbert space models take the form of a conditional expectation of future observations given past observations. This optimum is equivalent to a nonlinear projection of the action of the Koopman operator (that gives the future value of the measurement observables) onto the Hilbert subspace of functions of only the observable degrees of freedom. History-dependent target function models can be expressed as functions of delay-coordinate embeddings using Wiener projections [6]. The optimal model is then the nonlinear Wiener projection of the action of the Koopman operator onto functions of observed delay embeddings.
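To make this concrete, the following minimal numerical sketch (illustrative only; the toy system and all variable names are our own assumptions, not part of the formal development) estimates the optimal one-step predictor as a function of delay-coordinate vectors. For a partially-observed two-dimensional rotation, observing only one coordinate, a two-lag delay vector suffices and ordinary least squares recovers the conditional expectation exactly:

```python
import numpy as np

def delay_embed(x, k):
    """Stack k past values into delay vectors: row t is (x[t], x[t-1], ..., x[t-k+1])."""
    return np.stack([x[k - 1 - j : len(x) - j] for j in range(k)], axis=1)

# Partial observation of a 2D rotation: only the first coordinate is measured.
theta = 0.1
ts = np.arange(2000)
x = np.cos(theta * ts)              # observed degree of freedom
target = np.cos(theta * (ts + 1))   # x one step ahead (what the model must predict)

k = 2                               # two delays suffice for a 2D linear system
Z = delay_embed(x, k)               # delay vectors (x_t, x_{t-1})
y = target[k - 1 :]                 # aligned one-step-ahead targets x_{t+1}

# Least-squares estimate of the conditional expectation E[x_{t+1} | delay vector];
# here x_{t+1} = 2 cos(theta) x_t - x_{t-1} exactly, so the fit is exact.
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred = Z @ coef
print(np.max(np.abs(pred - y)))     # near machine precision
```

The least-squares solution here plays the role of the projection of the Koopman operator's action onto functions of observed delays; for nonlinear systems the linear regression would be replaced by a nonlinear estimator.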
There is evidence that Wiener projection models may converge to the true dynamics of the measurement observables if sufficient past observations are taken into account [15]. Here, we provide a new perspective on the behavior of history-dependent data-driven models and their relation to the true underlying physics of partial observations. We do so using insights from the logical inference approach to statistical mechanics given by Jaynes' Maximum Entropy principle [16]. This further builds on the connections between nonequilibrium statistical mechanics and optimal prediction of partially-observed systems [17].
Optimal Hilbert space models are typically formulated in terms of an invariant "equilibrium" measure. However, we show there is a natural family of time-dependent "nonequilibrium" measures induced by partial observations using Maximum Entropy and its time-varying generalization Maximum Caliber [18,19]. By construction, these measures support more general nonasymptotic behaviors-behaviors that cannot be modeled with an invariant measure. Importantly, though, they provide unique insights into the convergence of optimal models to the true governing physics of partial observations. They do this by directly constructing predictive distributions: probabilities over future observations given past observations. In particular, we express the possible convergence of history-dependent models as a thermodynamic limit in which the variance of predictive distributions vanishes as the length of past observations increases. This again shows how the action of Koopman operators on delay embeddings implicitly accounts for the effects of unobserved degrees of freedom.
Formulating optimal data-driven models as expectations of predictive distributions suggests a more general stochastic framework for modeling partially-observed systems. Rather than returning the expectation of predictive distributions, optimal stochastic models simply return the predictive distributions themselves [20], which then may be sampled for ensemble forecasts. Our direct construction of predictive distributions using Maximum Caliber measures leads naturally to such optimal stochastic models for partially-observed systems. A sequel gives this stochastic formulation of optimal prediction of partially-observed systems.

A. Implicit versus explicit representations
The physical insights that emerge shed light on why data-driven models can outperform traditional physics models for predicting systems with relatively few observed degrees of freedom. Indeed, this has become increasingly common for hydrological systems [21][22][23][24]. In these cases, the implicit approach that uses delay-coordinate embeddings is more effective than the explicit approach that uses data assimilation. For example, while many details, e.g., subsurface morphology, are crucial for geophysical prediction, given limited available subsurface measurements, reconstructing informative data-images for them is exceedingly difficult. This renders physics-based methods that rely on such data-images less effective.
Perhaps unsurprisingly in this light, given their empirical successes in scientific applications, data-driven predictive models are increasingly employed.
That said, they are widely considered to be an entirely new paradigm-a paradigm with little to no relation to governing physics and physics-based models. We aim to show that they are in fact quite similar.
Our framework, together with numerical examples, shows that data-driven models do implicitly what physics-based models do explicitly to account for unobserved degrees of freedom. We also clarify how the action of Koopman and Perron-Frobenius operators on delay-coordinate embeddings may converge to the true system dynamics on the full system state. Generating partitions on maps of the unit interval are discussed as a rigorous example displaying this behavior. Said another way, the physics underlying history-dependent data-driven models is the same as the physics underlying traditional physics-based models.
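The unit-interval example can be sketched numerically. Below, a doubling map (assumed here purely for illustration of the maps-of-the-interval discussion) is observed through its two-cell generating partition; longer symbol histories pin the unobserved state into ever-smaller cells:

```python
import numpy as np

def doubling(x):
    """Bernoulli shift on the unit interval: x -> 2x (mod 1)."""
    return (2.0 * x) % 1.0

def itinerary(x, n):
    """n symbols from the generating partition {[0, 1/2), [1/2, 1)}."""
    syms = []
    for _ in range(n):
        syms.append(0 if x < 0.5 else 1)
        x = doubling(x)
    return tuple(syms)

# Points sharing an n-symbol itinerary lie in a dyadic interval of width 2^-n:
# observing more symbol history implicitly pins down the unobserved state.
rng = np.random.default_rng(0)
pts = rng.random(100_000)
n = 10
cells = {}
for p in pts:
    cells.setdefault(itinerary(p, n), []).append(p)
widths = [max(v) - min(v) for v in cells.values()]
print(len(cells), max(widths))   # 2^10 = 1024 cells, each narrower than 2**-10
```

Each additional observed symbol halves the uncertainty in the hidden state, which is the sense in which histories of coarse observations implicitly recover the full degrees of freedom.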
The resulting unified modeling framework shows that the distinction is not so much "data-driven versus physics-based", but rather the emphasis should be on where approaches land in the "implicit versus explicit" representation spectrum. The class of physics-informed machine learning models [25][26][27], now rapidly gaining popularity, are thus seen to lie between fully-explicit physics-based models and fully-implicit data-driven models. Such hybrid models explicitly enforce certain physical properties as inductive biases [28,29], with any remaining properties learned implicitly from the data.

B. Synopsis
Our development unfolds as follows. Section II introduces Platonic models as the true dynamics of a given physical system. This is what models attempt to predict. Next, Section III formalizes partial observations and the resulting stochastic processes over the observable degrees of freedom, which we call dynamical processes. These are the main objects of study. To set the stage for the development of implicit data-driven models, Section IV first reviews the explicit physics-based modeling approach. Next, Section V gives the physics of partial observations expressed in terms of Koopman and Perron-Frobenius operators. This section also discusses connections to statistical mechanics and introduces Maximum Entropy measures.
Section VI overviews implicit data-driven models and their Hilbert space formulation for the case of instantaneous prediction. Section VII details the Mori-Zwanzig formalism, motivating history-dependent models. Section VIII discusses histories of past observations in the form of delay-coordinate embeddings. Section IX then expresses the Mori-Zwanzig formalism in terms of delay embeddings using Wiener projections. This provides the formulation of history-dependent Hilbert space models, using both the equilibrium invariant measure and nonequilibrium Maximum Caliber measures. In the nonequilibrium case, the Maximum Caliber measures allow for the direct construction of predictive distributions, providing insights into the convergence behavior of optimal history-dependent models. Section X provides examples demonstrating the ability of data-driven models to implicitly learn the effects of the unobserved degrees of freedom. Finally, Section XI uses the prior development to formally connect implicit data-driven models with explicit physics-based models. This shows the underlying similarity between the two approaches and offers a unified implicit-explicit modeling framework.

II. SYSTEMS AND PLATONIC MODELS
After centuries of intellectual inquiry, physical scientists have collectively come to believe they have a solid grasp of the basic physics governing measurable phenomena. For example, many Earth Science systems are governed by classical field theories. Atmospheric circulation, shown in Fig. 1, is governed by the laws of fluid mechanics and thermodynamics [30].
Saying that one "understands" these systems' basic physics means, more specifically, that the governing principles are encapsulated in the form of explicit differential equations-of-motion [31]. Formally, the system state ω evolves according to:

ω̇ = Φ(ω), (1)

where the governing equations Φ are a function of ω. For spatially-extended field theories, ω itself is a function of spatial coordinates, too. Φ then typically includes finitely-many spatial derivatives of ω, signifying that the state dynamics are governed by local interactions.
The "unreasonable effectiveness of mathematics" in physics has been repeatedly noted since Ref. [32] highlighted the puzzle. Noting that governing equations Φ are almost always given in closed form the effectiveness is all the more intriguing. Our development further highlights that demanding physical systems always be expressed in closed form rather restricts the class of mathematical models used to describe the physical world.
Here, we represent a given physical system as a differential dynamical system (Ω, Φ) that, as a shorthand, we call the Platonic model. A system's true dynamics, given by the Platonic model, may be well approximated with closed-form equations of motion. The Navier-Stokes partial differential equations come to mind as an approximation to the Platonic model of fluid flow. However, we need not assume a particular functional form for Platonic models.
That said, there are three important properties we do assume for Platonic models. Note that we are primarily concerned here with phenomena that occur at classical energy scales, such as those found in Earth systems. The first property is that system states evolve continuously-they are continuous trajectories in the state space over time.
The next two properties define what the system state ω ∈ Ω actually is. The second property assumes Platonic models are Markovian: Determining a later state ω t = Φ t ω 0 only requires knowing the state at a single prior time ω 0 . The third property assumes Platonic models are deterministic: The same initial condition ω 0 always produces the same later state ω t = Φ t ω 0 . The latter two properties impose a closure relationship among the degrees of freedom constituting the system state ω. That is, ω is considered a vector with each component ω i being a degree of freedom. The dynamic Φ(ω) captures the physically-relevant interactions among the degrees of freedom by determining how they evolve forward in time. The system's governing physics is appropriately captured or modeled when, with sufficiently many degrees of freedom comprising ω, there is a closure in their dynamics: For every ω i , its time evolution is a deterministic and Markovian function of a subset of the other degrees of freedom {ω j }, i.e., of the system state ω.
As there are many parallels to statistical mechanics, note that there is an important property we are not assuming of (Ω, Φ)-that the system is Hamiltonian. In the partially-observed setting, introduced shortly, the Platonic model (Ω, Φ) is analogous to a "microsystem". Statistical mechanics would take it to be Hamiltonian. This is too restrictive for our purposes. Importantly, Hamiltonian systems are conservative and volume-preserving, via Liouville's theorem [4]. Volume-preserving dynamics admit a natural invariant probability distribution over Ω, known as the microcanonical ensemble in statistical mechanics [4]. While such invariant probability measures are convenient mathematically, many physical systems of interest display transient nonasymptotic behavior that cannot be captured by invariant measures. This is particularly notable for fluid flows.
To accommodate nonasymptotic behaviors within our formalism, we do not assume Platonic models are necessarily volume-preserving, although they may be. More generally, while it is standard to assume the dynamics is measure-preserving, such that there is a probability measure over Ω that is invariant under Φ, our formalism does not require an invariant measure. Rather, one of our main contributions is introducing natural time-dependent measures for partially-observed systems that can support nonasymptotic behaviors. In the language of statistical mechanics, our approach is a nonequilibrium formalism that generalizes the equilibrium setting using asymptotic invariant measures. For more details on ergodicity, invariant measures, and dissipative systems, see Appendix A.
Additionally, in what follows, we assume a system's dynamic is reversible, so that each flow map Φ t is invertible with inverse Φ −t : Φ −t (Φ t (ω)) = ω. This, however, is an assumption for notational convenience and simplicity; it can be lifted without much difficulty. An added advantage of our time-dependent formulation is that we need not assume reversible dynamics. Note, though, that many systems of interest are reversible in this way, such as all finite-dimensional systems of ordinary differential equations.

III. PARTIAL OBSERVATIONS AND DYNAMICAL PROCESSES
The semigroup formalism of dynamical systems [33,Ch. 7] is particularly apt for our development. Consider a dynamical system (Ω, Σ Ω , ν, Φ). The state space Ω is a Euclidean space or manifold for finite-dimensional systems or a general Hilbert space for spatially-extended systems. Σ Ω is the Borel σ-algebra and ν the Lebesgue reference measure that gives a "volume" to state space.
Φ is the dynamic-the infinitesimal generator of a continuous semigroup of measurable flow maps {Φ t : Ω → Ω} t∈R , with:

Φ t+s = Φ t ◦ Φ s and Φ 0 = 1.

Thus, the orbits {ω t = Φ t (ω 0 ) : t ∈ R (or R ≥0 )} are continuous functions of time t. When the dynamic is specified by a system of differential equations, Φ is the time derivative of the orbits: ω̇ t = Φ(ω t ).

For a given dynamical system under study, let x ∈ X be the subset of system variables that are observable, measurable, or generally accessible. Through experimental or observational measurements or numerical simulations, they may be collected in a time series {x 0 , x 1 , . . . , x T −1 }-a time-ordered set of observations of x taken at uniform time intervals {t 0 , t 1 , . . . , t T −1 } with t i = i∆t. The observations x are generated by the dynamical system under the continuous and measurable mapping X : Ω → X , so that x t = X(ω t ). In practice, the measurement observables are given as a vector of real numbers, so that X = R n .
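As a concrete sketch of this setup (the Lorenz system, the integrator, and all parameter choices here are illustrative assumptions, not part of the formal development), the following generates a time series {x 0 , . . . , x T −1 } by sampling a scalar observable X of a three-dimensional flow at uniform intervals:

```python
import numpy as np

# "Platonic" dynamic Phi: the Lorenz equations.
def phi(w, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = w
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

# One flow-map step Phi^dt, approximated with 4th-order Runge-Kutta.
def flow_step(w, dt=0.01):
    k1 = phi(w)
    k2 = phi(w + 0.5 * dt * k1)
    k3 = phi(w + 0.5 * dt * k2)
    k4 = phi(w + dt * k3)
    return w + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Observation map X: only the first coordinate of omega is measurable.
X = lambda w: w[0]

w = np.array([1.0, 1.0, 1.0])    # unknown full state omega_0
series = []
for _ in range(1000):            # uniform sampling interval Delta-t = dt
    series.append(X(w))
    w = flow_step(w)

print(len(series), series[0])    # time series {x_0, ..., x_{T-1}} with x_0 = X(omega_0)
```

The list `series` is exactly the kind of partially-observed time series the text describes: the other two degrees of freedom evolve the state but never appear in the data.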
We are interested in the case of a partially-observed dynamical system for which the map X is many-to-one and not invertible. Due to this, an observation x t is insufficient for determining the full state ω t of the underlying dynamical system at any given time. That is, there are unobservable, unmeasurable, or inaccessible degrees of freedom in ω. And so, measurement data can only ever provide a limited view of the system's true state ω. An important example is weather prediction, shown in Fig. 1.
We refer to collections of arbitrarily-long time series of observables {. . . , x −1 , x 0 , x 1 , . . .} as a dynamical process, signifying that it is a stochastic process derived from a deterministic dynamical system through partial observations. These are the objects we wish to model. If the underlying system is governed by noninvertible dynamics, we consider the time index of a dynamical process to correspond to observation time. That is, x 0 is not an initial condition, but rather the present moment of observation. The leading dots then indicate that we allow measurements from arbitrarily far in the past.
Various properties of dynamical processes will be given shortly, using the Koopman and Perron-Frobenius operators. First though, we detail the standard approach for modeling partially-observed systems using physics-based models.

IV. EXPLICIT PREDICTIVE MODELS
Given a physical system's Platonic model-its governing physics-and partial observations from instrument measurements, how do we predict the system's future behavior? Our main interest is to explain the effectiveness of implicit approaches learned by data-driven models. To set the stage, though, we first overview the more familiar explicit approach using physics-based models. Figure 1 shows the relation between data-driven and physics-based methods for modeling systems from partial observations. After formulating the physics underlying implicit data-driven models, a formal connection with explicit physics models is given in Section XI.

A. Physics-Based Models
FIG. 1. Predicting complex systems (left) from partial observations: instantaneous data-driven modeling (middle) versus physics-based models (right). Instrument measurements provide a partial view x of the true system state ω through a noninvertible mapping X. From an initial measurement observation x 0 = X(ω 0 ), the value of the instruments at later time t is found by letting the actual system evolve, given by the dynamic Φ t , and taking a measurement x t = X(ω t ) at this time. Koopman operators U t provide an alternate point of view by providing a future measurement function x t = U t X(ω 0 ) that gives the instrument readings x t at time t from the current system state ω 0 . This provides the ground truth that data-driven models T t try to approximate as functions from current observation x 0 to future observation x t . In contrast, physics-based models create a coarse-grained approximation u 0 of the full system state ω 0 most consistent with past observations using data assimilation and model inversion as u 0 = ← a (x 0 , x −1 , x −2 , . . .). Data images are then evolved through numerical approximation u t = Φ t (u 0 ) of the system dynamics ω t = Φ t (ω 0 ).

In essence, physics-based models simply attempt to solve the governing equations that constitute or well-approximate the Platonic model Φ. There are three main challenges when predicting a physical system using differential equation models: nonlinearity, high-dimensionality, and calibration. First, most systems of interest are governed by nonlinear equations that cannot be solved analytically. Thus, numerical approximations to solutions are necessary. For complicated systems like those encountered in Earth Sciences, this introduces a second challenge.
Even with the arrival of massive high performance computing, today's largest machines still do not have the computational resources required to fully account for all known physical effects in a system at the necessary scales. This second challenge is certainly the case for numerical models of the atmosphere, as depicted in Fig. 1's right column. The effects that are not directly computed are accounted for using parametrization schemes to replace processes that are too small-scale or complicated to be directly computed in the model. This simplifying procedure produces a deterministic, Markovian closure.
While parametrization schemes are often heuristic choices, increasingly they are being informed by separate models specifically targeting the effects being parameterized. This includes, for example, using cloud-resolving models to inform cloud formation parametrizations in large-scale atmospheric models. Moreover, for spatially-extended field theories, continuous spatial coordinates must be discretized into a finite grid or mesh, and then the effects of subgrid-scale processes must be parameterized using approximate closure models. However they are arrived at, the parametrizations represent a modeler's choices, and these choices necessarily induce conceptual error that affects the model's predictive capabilities. Poor choice of generic model may also lead to conceptual error in prediction.
Assume, for a given physical system, that we know effective differential equations Φ(ω) that govern the system. A generic model of Φ(ω) is an auxiliary set of equations Φ τ α : u t → u t+τ whose solutions {u t } can be computed numerically and that approximate the solutions of Φ. The generic model typically contains a set of parameters α that include those associated with parametrization schemes, as well as physical parameters of the model, such as viscosity in the Navier-Stokes equations. The generic model Φ τ α acts on data images u t that are coarse-grained approximations of the state ω t of the physical system. Neglecting numerical round-off error, numerical models are also Markovian and deterministic, like the differential equation models they approximate.
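A toy version of this structure (entirely illustrative; the two-variable system and the closure y ≈ αu are our own stand-ins, not from the formal development) shows a generic model Φ α evolving a data image u, with a parametrization replacing an unresolved fast mode:

```python
import numpy as np

# "True" two-variable system: a slow mode x driven by a fast, unresolved mode y.
def true_step(x, y, dt=0.001, eps=0.05):
    dx = -x + 0.5 * y
    dy = (x - y) / eps            # y relaxes quickly toward x
    return x + dt * dx, y + dt * dy

# Generic model Phi_alpha: evolves a data image u of x alone; the unresolved
# fast mode is replaced by the parametrization y ~ alpha * u (a closure).
def model_step(u, alpha, dt=0.001):
    return u + dt * (-u + 0.5 * alpha * u)

x, y = 1.0, 0.0
u = x                              # coarse-grained initial data image u_0
alpha = 1.0                        # closure parameter: quasi-steady balance y ≈ x
for _ in range(5000):
    x, y = true_step(x, y)
    u = model_step(u, alpha)

print(abs(u - x))                  # small: the parametrization stands in for y
```

The parameter α here plays the role of the parametrization-scheme parameters in the set α of the text: it encodes the average effect of a degree of freedom the model never evolves.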

B. Data Assimilation
The final challenge in using a generic physics model to predict the future behavior of a partially-observed physical system comes during calibration. The generic model must be made into a specific model that appropriately captures the particular circumstances of the physical system of interest. This includes specification of the parameters α and boundary conditions, as well as initialization of the model. (For simplicity, we include specification of the boundary conditions in α.) The Markov property allows for generating an orbit u t of the specific model from a single initial state u 0 . For this orbit to provide a prediction of the true system's orbit {ω t } requires aligning the model's initial state u 0 as well as possible to the true physical system's initial state ω 0 .
To emphasize the difficulty of initialization in particular, consider the commonly-encountered case of predicting a spatially-extended system using approximated solutions of a classical field theory-i.e., Φ is a set of partial differential equations. It is not possible to determine the system's configuration over a continuum of spatial coordinates. Rather, as depicted in the top of Fig. 1's middle column, measurements derive from a variety of instruments collecting data over a relatively small subset of the spatial domain. However, solving the model equations-say, using a finite element method-requires an initial condition on a grid over the spatial domain. Our instruments, though, do not necessarily provide full coverage over the grid. Thus, the calibration methods produce a data image u 0 (top right of Fig. 1) that represents inferred values over the full grid.
Model calibration-including parameter and boundary condition specification, as well as initialization of the data image u 0 -is carried out using model inversion and data assimilation [34,35]. Since these techniques require multiple past observations, calibration is sometimes also referred to as history matching. Given a history of past observations ← x k t := {x t , x t−1 , . . . , x t−k }, calibration attempts to find the initial data image u t = ← a ( ← x k t ) and parameter set α such that the model output reproduces the observation history ← x k t as closely as possible. (Recall that we are assuming reversible dynamics for notational simplicity, but this is not strictly required.) Due to the many-to-one nature of the partial observation map X, the calibration process is typically not unique: multiple parameter sets and initial data images may produce orbits of data images that are equally consistent with the observations up to time t. There will thus be multiple specific models that are equally consistent with past observations, but that make different predictions for future behaviors. Therefore, a specific model used for prediction generally has calibration error. This, combined with conceptual error, leads to the model's overall prediction error. Prediction error can accumulate rapidly, particularly for deterministic chaotic systems whose inherent instabilities exponentially amplify small variations. This is one reason why weather is so hard to predict.
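In the linear case, calibration by model inversion reduces to least squares, as in this sketch (the system matrix, observation map, and history length are illustrative assumptions): the data image u 0 most consistent with an observation history is recovered, including its unobserved component.

```python
import numpy as np

# Linear "system": w_{t+1} = A w, with only the first component observed.
A = np.array([[0.9, 0.5],
              [-0.5, 0.9]])                  # damped rotation
C = np.array([1.0, 0.0])                     # observation map X

rng = np.random.default_rng(1)
w0_true = rng.standard_normal(2)             # true (hidden) initial state

# A history of k observations constrains w0 through the rows C A^t.
k = 6
H = np.stack([C @ np.linalg.matrix_power(A, t) for t in range(k)])
obs = H @ w0_true                            # observed history

# Calibration as least squares: the data image u0 most consistent with history.
u0, *_ = np.linalg.lstsq(H, obs, rcond=None)
print(np.max(np.abs(u0 - w0_true)))          # recovers the hidden component too
```

Because the pair (A, C) is observable here, the inversion is unique; with fewer observations or an unobservable hidden mode, multiple u 0 would fit the history equally well, which is exactly the calibration non-uniqueness described above.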
We stress here that model parametrizations and initialization u 0 = a(x 0 ) are both means of explicitly accounting for unobserved or unrepresented degrees of freedom not in x = X(ω). For instance, in atmospheric circulation, imagine we do not have instruments on a remote island in the Pacific. As a consequence, atmospheric variables-temperature, pressure, wind speed, and the like-at that spatial location are not in x = X(ω). However, when a data image u is created these variables are approximated at that location.
Note also that the primary concerns are the predictive capability of physics models and how it relates to the Platonic model's true dynamics. In a sense, though, we are agnostic as to whether a specific model is valid or not [36,37]. Loosely speaking, validity measures how well a physical system's specific model approximates its Platonic model. The specific model's predictive skill is, of course, related to how well it approximates the Platonic model. And this is a question we care about here.
That said, there is a deeper concern about how well a specific model approximates the Platonic model. This involves the question of how much we can infer about the underlying physical and causal processes governing the true system, given a specific model of that system and its predictive capability. That is, how well can we explicitly formulate a Platonic model given a skillful specific model? It is in this deeper mechanistic sense that we are agnostic to the question of model validity. As the saying goes, "All models are wrong, but some are useful" [38].

V. PHYSICS OF PARTIAL OBSERVATIONS
Platonic equation-of-motion models are given in terms of the underlying system state-the full set of degrees of freedom. Due to their explicit nature, the connection between physics models and the system's true governing physics described by Platonic models is clear. Data-driven models of partially-observed systems do not generally attempt to explicitly infer the full Platonic system state, as physics-based models do. And so, it is less clear how they relate to the physics of Platonic models. Discussing Platonic models and the true governing physics in a meaningful way for partially-observed systems requires the operator-theoretic formulation of dynamical systems [3,33]. Both Koopman and Perron-Frobenius operators, defined shortly, provide alternative descriptions of a system's temporal evolution: Koopman operators give the evolution of observables, while Perron-Frobenius operators evolve state distributions. These evolution operators are the classical analogs of the Heisenberg and Schrödinger formulations of quantum mechanics, respectively.

A. Koopman Operators
A Koopman operator U t acts on functions of the system state, known as observables f : Ω → R, where f is an element of a function space F. The action of U t : F → F on observable f is given by composition with the dynamic Φ t , also known as the pullback of f along Φ t :

U t f = f ◦ Φ t . (2)

That is, U t 's action on observable f ∈ F gives the time-shifted observable f t = U t f whose value at state ω is obtained by evaluating f at the future state ω t = Φ t (ω).
Recall that the flow maps {Φ t } form a semigroup; the Koopman operators inherit this structure, with U t+s = U t U s . Each U t is a linear, infinite-dimensional operator when F is a vector space. As discussed below in relation to Perron-Frobenius operators, it is most natural to take F = L ∞ (Ω, ν)-the bounded functions of ω-but the square-integrable functions L 2 (Ω, ν) are often used for mathematical convenience. The following uses Koopman operators on L 2 (Ω, ν), since it is a Hilbert space and the development requires orthogonal projections.
Recall that we are interested in observable functions X that are generally multidimensional. With this, an observable function f is a component of a vector-valued observable function; e.g., f = X i . A Koopman operator that acts on an observable X is then the product over the component operators acting on the X i . To avoid excessive notation, we denote these product operators as U t . For a dynamical system with initial condition ω 0 at time t 0 , the measurement observable at a later time t > t 0 is given by:

x t = U t−t 0 X(ω 0 ) = X(Φ t−t 0 (ω 0 )). (3)

The dynamical process, therefore, is a function of the underlying system's (unknown) initial condition:

{. . . , x −1 , x 0 , x 1 , . . .} = {. . . , U −1 X(ω 0 ), X(ω 0 ), U 1 X(ω 0 ), . . .}.

This transparently relates the evolution of partial measurement observations of the dynamical process {. . . , x −1 , x 0 , x 1 , . . .} to the physics of the Platonic model Φ t (ω 0 ) through the action of Koopman operators on the observable map X.
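For a rigid rotation of the circle (an illustrative choice of system, assumed here for simplicity), the observables cos and sin span a Koopman-invariant subspace, so the operator's action reduces to a finite matrix; the sketch below checks the pullback relation U f = f ◦ Φ numerically:

```python
import numpy as np

# System: rigid rotation Phi(theta) = theta + alpha (mod 2 pi).
alpha = 0.3
Phi = lambda th: (th + alpha) % (2 * np.pi)

# On the pair f1 = cos, f2 = sin the Koopman operator acts linearly, since
# U f1 = cos(alpha) f1 - sin(alpha) f2 and U f2 = sin(alpha) f1 + cos(alpha) f2.
U = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])

th = 1.234                                           # some system state omega
f = np.array([np.cos(th), np.sin(th)])               # observables evaluated at omega
lhs = U @ f                                          # Koopman evolution of observables
rhs = np.array([np.cos(Phi(th)), np.sin(Phi(th))])   # pullback: f composed with Phi
print(np.max(np.abs(lhs - rhs)))                     # agree to machine precision
```

In general the Koopman operator is infinite-dimensional; this system is special in that a two-dimensional subspace of observables is exactly invariant, making the linearity tangible.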

B. Perron-Frobenius Operators
Koopman operators connect dynamical processes to the Platonic model (Ω, Φ) via an unknown initial Platonic state ω 0 . If we do not seek to directly infer ω 0 , as done with physics models, it becomes useful to formulate the problem in terms of distributions over possible ω 0 . The dynamics of these distributions is provided by Perron-Frobenius operators.
In appropriately defined spaces, Perron-Frobenius operators are dual to Koopman operators. It is most common to consider Perron-Frobenius operators acting on L¹(Ω, ν) densities, so that their Koopman duals evolve observables in L^∞(Ω, ν). However, as often done, the following considers both operators acting on L²(Ω, ν) functions. In this case, Perron-Frobenius operators act on L² measures. If the L² measure ν_ρ is absolutely continuous with respect to the reference measure ν, then ν_ρ is related to its density ρ through the reference measure:

    ν_ρ(B) = ∫_B ρ dν , for B ∈ Σ_Ω .

For continuous-time dynamical systems there is a continuous semigroup {P^t} of Perron-Frobenius operators that evolve measures µ through the pushforward of µ along Φ^t:

    [P^t µ](B) = µ(Φ^{−t}(B)) , for B ∈ Σ_Ω .    (4)

The measure µ_t = P^t µ_0 defines the probability space (Ω, Σ_Ω, µ_t) that quantifies uncertainty in the system state ω_t at time t. In turn, this casts observables, given by the measurable map X : Ω → X, as random variables X_t distributed according to the pushforward measure Pr(X_t ∈ B_X) = µ_t(X^{−1}(B_X)) for B_X ∈ Σ_X. Thus, we can write X_t's distribution in terms of the initial measure µ_0:

    Pr(X_t ∈ B_X) = [P^t µ_0](X^{−1}(B_X)) .    (5)

This is analogous to writing, as done in Eq. (3), observations x_t in terms of the initial state ω_0 and the action of Koopman operators on the measurement observable X. Recall that the two operators are dual, so these two perspectives are equivalent. If there is initial uncertainty over system states, then the observables become random variables. The Koopman operator then evolves observable random variables that are distributed according to the action of Perron-Frobenius operators on the initial distribution.
Thus, given an initial uncertainty measure over system states, a dynamical process is a stochastic process {…, X_{−1}, X_0, X_1, …}, a time series of random variables, with realizations {…, x_{−1}, x_0, x_1, …}. Note that the random variables in {…, X_{−1}, X_0, X_1, …} are actually (measurable) functions of two variables: X_t = X(t, ω_0). Fixing ω_0 produces a realization, or sample path, …, x_{−1}, x_0, x_1, … of the stochastic process. From our setup, the realization for a given ω_0 is the result of applying the map X to each ω_t in the orbit generated by ω_0. We consider continuous maps X, so that realizations vary continuously with the underlying orbit. We emphasize again that the evolution operators are defined in Eqs. (2) and (4) in terms of the Platonic model Φ. As such, they are yet other ways of expressing the true governing physics of a given system. In particular, they provide the true physics of partial observations.

C. Nonequilibrium Statistical Mechanics
Equation (3) expresses the time series of measurement observations in terms of Koopman operators and an unknown initial Platonic state ω 0 . In contrast, Eq. (5) expresses the measurement observables as a continuous stochastic process using Perron-Frobenius operators and an initial probability distribution µ 0 over the Platonic states. To compensate for not knowing the exact initial state ω 0 , one can ask, is there a natural choice for an initial distribution µ 0 over Ω induced by observations?
This key question leads directly to statistical mechanics.
The standard choice for µ_0 is the invariant measure µ_* given by P^t µ_* = µ_*. For the ergodic systems considered here, µ_* is guaranteed to exist and to be reached asymptotically (see Appendix A). The following employs this commonly-invoked "equilibrium case" to review instantaneous data-driven models. Note that, by definition, taking the invariant measure µ_* as µ_0 leads to µ_t = µ_* for all t. Due to this, the random-variable observables in Eq. (5) have time-independent distributions, and the stochastic process over measurement observables is a stationary stochastic process. Clearly though, assuming the invariant measure µ_* precludes the nonasymptotic "nonequilibrium" behaviors that we ultimately wish to also capture.
The preceding defined dynamical processes as stochastic processes generated by deterministic dynamical systems. To set the stage for the nonequilibrium generalization with time-dependent measures used later for history-dependent models, recall that underlying system states ω_t cannot be uniquely identified from an observation x_t due to the noninvertibility of the measurement observable function X. This setup admits a natural nonasymptotic measure induced by a single observation x_t = X(ω_t), which we now define. Consider a dynamical system (Ω, Σ_Ω, ν, Φ) and a single observation x_t = X(ω_t) at an arbitrary time t. Since the observation mapping X is not invertible, there can be many ω_t ∈ Ω yielding the observed value x_t under X. (This is directly related to the non-uniqueness of model inversion when assimilating physics-based models.) Thus, for a given observation x_t, define the set B_t ∈ Σ_Ω as:

    B_t := X^{−1}(x_t) = {ω ∈ Ω : X(ω) = x_t} .    (6)

Note that B_t is ν-measurable. Following Refs. [16,19]'s minimal-bias argument, there is a natural measure dµ_t = ρ_t dν defined through a density ρ_t that is constant over B_t and zero elsewhere:

    ρ_t(ω) = 1_{B_t}(ω) / ν(B_t) ,

where 1_{B_t} is the indicator function of B_t. The Maximum Entropy Principle (MEP) says that the distribution which maximizes entropy subject to known constraints is the minimally-biased prior distribution: it is spread out as much as possible, up to the given constraints. If the only constraint given is the support set, the MEP reduces to the Principle of Indifference and assigns uniform probability over the set.
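A concrete toy illustration of this construction, under a hypothetical finite discretization of Ω (our choice, purely for demonstration): the MEP measure induced by one observation is uniform over the preimage B_t and zero elsewhere.

```python
import numpy as np

# Minimal sketch (hypothetical discretization): a finite Platonic state set
# Omega with a noninvertible observable X.  The MEP measure induced by a
# single observation x_t is uniform over the preimage B_t = X^{-1}(x_t)
# and zero elsewhere.
omega = np.arange(12)            # discretized Platonic states
X = lambda w: w % 4              # noninvertible measurement observable
x_t = 2                          # the single observed value

B_t = omega[X(omega) == x_t]     # all states consistent with x_t
rho_t = np.where(X(omega) == x_t, 1.0 / len(B_t), 0.0)  # MEP density

print(B_t)                       # the preimage: states 2, 6, 10
print(rho_t.sum())               # a probability density: sums to 1
```

Here three Platonic states are consistent with the observation, so each receives probability 1/3, the Principle-of-Indifference assignment.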
In what way is the noninvariant measure µ t a nonequilibrium generalization of the equilibrium measure µ * ? The nonequilibrium behaviors allowed by ergodic systems with dynamics Φ which have no explicit time-dependence are those of relaxation processes [39]. According to the attractor-basin formalism described in Appendix A, these processes limit to equilibrium distributions given by the invariant measure µ * . Theorem 4.5 in Ref. [39] establishes the correspondence between the invariant measure µ * and thermodynamic equilibrium for these systems. Hence, any other measure µ is a nonequilibrium distribution that asymptotically limits to the equilibrium distribution. The measure µ t is a nonequilibrium measure naturally induced through partial observations.
Note that there is a wide range of nonequilibrium phenomena beyond relaxation processes. For instance, an invariant measure may correspond to an equilibrium steady state or a nonequilibrium steady state [40] that absorbs and dissipates energy from its surroundings. These more general far-from-equilibrium processes that include thermal driving require explicit time-dependence in the dynamics [41]. Detailed thermodynamic analysis is not our primary concern as yet, but the formalism introduced here readily extends to such settings by including explicit time-dependence in the Koopman and Perron-Frobenius operators. See, for example, the dynamics governed by Ref. [42]'s time-dependent rate-matrices. In that language, the development here applies in the special case of a fixed time-independent protocol with relaxation to the associated invariant distribution.
As seen shortly, the two measures µ_* and µ_t represent different sets of assumptions used to motivate and interpret the behavior of data-driven models. Neither is typically known explicitly; rather, each is inferred approximately from observations. Kernel methods are particularly useful for this in practice [3,43]. The nonequilibrium measure µ_t is more closely aligned with the modeling approach of physics-based models: its construction requires knowledge of the set B_t of Platonic states consistent with the observation x_t, much like the construction of the data image u_t, the approximation of the Platonic state most consistent with x_t. Ultimately though, both µ_* and µ_t are insightful in their own ways for understanding implicit data-driven models, and so both are discussed in detail in what follows.

VI. INSTANTANEOUS IMPLICIT MODELS
With the physics of dynamical processes laid out using the machinery of Koopman and Perron-Frobenius evolution operators, we return to the question of optimal prediction. Section IV outlined the challenges of using physics-based modeling for prediction. Circumventing explicit inference of unobserved degrees of freedom requires learning, directly from observations, evolution rules for the variables that are accessible through instrument measurements. Pushing this further, we explore learning implicit models that predict the evolution of the observables: models given in a more flexible, possibly more abstract, form than differential equations of motion.

A. Instantaneous Predictive Distributions
The most basic form of prediction for a dynamical process is instantaneous: Given a single observation x t = X(ω t ), predict the observable at a single time in the future x t+τ = X(ω t+τ ). Before reviewing the functional Hilbert space approach for learning instantaneous implicit models, we first theoretically analyze the problem using evolution operators. We argue that the maximal instantaneous predictive information available is given in an instantaneous predictive distribution. We will later see that the optimal Hilbert space model for instantaneous prediction is the expectation value of the instantaneous predictive distribution.
Given a single observation x_t, Eq. (6) defined B_t as the set of all possible Platonic states ω_t consistent with the observation, such that X(ω_t) = x_t. This in turn defines the set of all possible observables x_{t+τ} that may be seen at a later time, found by evolving each ω_t ∈ B_t under Φ^τ and applying the observable mapping X. Said another way, the set of all possible future observables is given through the action of the Koopman operator by applying the time-shifted observable U^τ X to each state in B_t:

    {x_{t+τ} = [U^τ X](ω_t), for all ω_t ∈ B_t} .

Furthermore, we use the MEP measure µ_t over B_t and the Perron-Frobenius operator to define the distribution over possible future observables, supported on this set. This distribution, the instantaneous predictive distribution, is given as the pushforward of the time-evolved measure µ_{t+τ} = P^τ µ_t along X, following Eq. (5):

    Pr(X_{t+τ} ∈ B_X) = [P^τ µ_t](X^{−1}(B_X)) .    (7)

To define the instantaneous predictive distribution we need some initial measure µ_t. Without additional information on the system, the choice of a MEP measure is most natural. What matters for our purposes is that µ_t is supported on the set B_t and, thus, the instantaneous predictive distributions are supported on the image of B_t under U^τ X. In practice, these measures are estimated empirically from data and the MEP is not typically invoked for µ_t's empirical construction. Also note that the formalism for instantaneous models we now review is given in terms of the equilibrium measure µ_*, as is standard. However, instantaneous predictive distributions cannot be expressed directly in terms of µ_*. That said, the nonequilibrium construction of predictive distributions just given is instructive for understanding instantaneous models built using µ_*. The equilibrium and nonequilibrium formulations of data-driven models are connected more closely below for history-dependent models using Wiener projections.
Our introduction of nonequilibrium measures is most impactful for history-dependent models, as they provide insights not available through use of the invariant measure.

B. Instantaneous Data-Driven Models
We now review instantaneous data-driven models and their Hilbert space formalism. Following established practice, we use the invariant measure µ * .
Given the current observation data x_t, the goal is to construct a model T_τ : X → X, called a target function, that predicts what the instruments will read at a later time t + τ [14]. This is depicted in Figure 1's middle column. On the one hand, recall that for physics-based models we assume the system's governing equations of motion Φ are known, and that one of the main challenges is to infer from partial observations x_t = X(ω_t) the underlying Platonic state ω_t that the equations evolve. On the other hand, the data-driven paradigm flips this around to work directly with the measurement observations x_t, without directly inferring ω_t. In point of fact, an appropriate set of governing equations for x_t is generally not known a priori. They may not even be desired. Instead, the goal is to learn a model T_τ from the measurement data.
In some cases we can learn T τ as closed-form equations using Galerkin projections of Φ onto X [44]. In many cases, though, the evolution of x t cannot be adequately described by a set of closed-form equations [45]. Thus, we seek more general forms for T τ that are measurable mappings from X into X . For example, neural networks [46] are universal function approximators [47] and so are able, in principle, to represent T τ .
As the name suggests, a modeler cannot simply write down an implicit model. Rather, implicit models are implemented algorithmically and learned from data. From the discussion above, the Koopman operator provides the ground truth for prediction, the equivalent of the Platonic model. Following Ref. [14], the mean-squared error for a model T_τ is given as:

    E(T_τ) = ∫_Ω ||T_τ(X(ω)) − [U^τ X](ω)||² dµ_*(ω) .    (8)

From the Koopman operator definition and Fig. 1's commutation relations, the measurement data {x_0, x_1, …, x_T}, used to learn T_τ, contain samples of the Koopman operator's action. That is, for an observation x_t = X(ω_t), a later observation is:

    x_{t+τ} = [U^τ X](ω_t) .    (9)

Empirically, the Koopman operator's action is approximated through the action of the shift operator [3,14]. And so, the ground truth for training T_τ is found by simply looking up in the observed data {x_0, x_1, …, x_T} what happens after time τ. In this way, with the ground truth given by x_{t+τ} ∈ {x_0, x_1, …, x_T}, a parametric model (e.g., a neural network) T_τ is trained by minimizing ||T_τ(x_t) − x_{t+τ}||² over the training data, 0 ≤ t < (T − τ).
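The shift-operator training scheme can be sketched with hypothetical toy data (a sampled rotation, our choice): training pairs are read directly off the observed series, and a simple parametric T_τ (here a least-squares linear map, standing in for a neural network) is fit to them.

```python
import numpy as np

# Minimal sketch (hypothetical data): approximate the Koopman action with
# the shift operator.  Training pairs (x_t, x_{t+tau}) are read directly
# off the observed series, then a simple parametric model T_tau (here a
# least-squares linear map) is fit by minimizing ||T(x_t) - x_{t+tau}||^2.
theta = 2 * np.pi * 0.05 * np.arange(500)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # toy 2D observations

tau = 3
inputs, targets = x[:-tau], x[tau:]          # shift-operator lookup

# Fit T_tau(x) = A x by least squares.
A, *_ = np.linalg.lstsq(inputs, targets, rcond=None)

pred = inputs @ A
mse = np.mean(np.sum((pred - targets) ** 2, axis=1))
print(mse)  # near zero: rotation is exactly linear in these coordinates
```

For this toy system the optimal T_τ happens to be linear; in general the same pairs would be fed to a nonlinear model class.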

C. Analog Forecasting
Analog forecasting, dating back at least to Ref. [48], is one of the oldest methods for approximating T τ implicitly from data. The basic procedure is to predict a system's future by finding the value recorded in past observations (the analog) that is most similar to the present observation and then use the following value in the recorded history as the forecast.
Formally, let {x_0, x_1, x_2, …, x_T} be the finite set of historical observations, the training data set. Then, given the current observation X_t = x, with t > T, identify x's analog x_a in the training set. This is typically implemented with Euclidean distance:

    a = argmin_{0 ≤ s ≤ T−τ} ||x − x_s|| .

The forecast of x for τ time steps into the future is then given by the analog forecast x_{a+τ} ∈ {x_0, x_1, x_2, …, x_T}. That is, the analog forecast simply looks up what happened in the training data set τ time steps after the analog. Analog forecasting is used for the data-driven prediction examples given below in Section X.
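The procedure fits in a few lines; the data here are hypothetical (a periodic toy signal, our choice) purely to exercise the lookup.

```python
import numpy as np

# Minimal sketch (hypothetical data): analog forecasting on a scalar series.
# Find the training observation nearest the current one (the analog), then
# report what followed it tau steps later as the forecast.
def analog_forecast(train, x_now, tau):
    # restrict the search so the looked-up future stays inside the record
    candidates = train[: len(train) - tau]
    a = np.argmin(np.abs(candidates - x_now))   # index of the analog
    return train[a + tau]                        # what happened tau steps on

t = np.arange(400)
train = np.sin(0.1 * t)          # recorded history of a periodic system
x_now = np.sin(0.1 * 400.0)      # present observation, after the record ends

tau = 5
forecast = analog_forecast(train, x_now, tau)
truth = np.sin(0.1 * (400 + tau))
print(abs(forecast - truth))     # small: the past contains a good analog
```

For multidimensional observations the absolute difference becomes a Euclidean norm over rows, and in practice the search is accelerated with a k-d tree or similar nearest-neighbor structure.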

D. Optimal Hilbert Space Models
As data-driven models, target functions T τ map a single input to a single output. As such, target functions live in a function space. Since inner products and orthogonal projections play an important role in the development, we seek target functions as elements of a Hilbert space. The following reviews the Hilbert space formulation of instantaneous data-driven models [3,14,15].
Recall that partial observations of a dynamical system (Ω, Φ) induce a time-dependent probability measure µ_t over Ω. For simplicity, when using instantaneous models we work in the asymptotic limit and employ the invariant ergodic measure µ_*. This leads to the probability space (Ω, Σ_Ω, µ_*), with measurement observables as random variables given by the measurable map X : Ω → X. The space X is often referred to as the covariate space and X the covariate map.
More generally, we may consider a response space Y and the (measurable) response map Y : Ω → Y. The target function is then a measurable mapping from covariates to a response: T_τ : X → Y. In our dynamical setting, the covariate and response spaces are the same, the observation instrument readings: Y = X = R^n. The response map is the time-shifted measurement observable Y = X_τ = X ∘ Φ^τ : Ω → X. Note that X and Y = X_τ are both random variables over the same probability space (Ω, Σ_Ω, µ_*), since x_t = X(ω_t) and y_t = x_{t+τ} = X_τ(ω_t). This is what makes T_τ predictive.
Consider the Hilbert spaces:

    H := L²(Ω, Σ_Ω, µ_*) ,
    H_X := {f ∈ H : f = g ∘ X for some measurable g} , and    (10)
    V := {g : X → X : g ∘ X ∈ H} .

Note that H = L²(µ_*) is the set of square-integrable functions of the full system state ω, and H_X is the Hilbert subspace of L²(µ_*) containing functions f_X that depend only on the observable degrees of freedom x_t = X(ω_t). Then V = L²(µ_*^X), where µ_*^X is the pushforward of µ_* along X, is the set of functions over the measurement observables whose composition with the observable map X is square-integrable. Since the observables X ∈ H are what is accessible, the set V is what is at our disposal for building implicit models. However, since Koopman operators U^τ act on functions in H, we must compose elements of V with X for proper comparison with U^τ's action. Specifically, the ground truth for the future observation is given by X_{t+τ} = U^τ X, which lives in the space H, while the functions available for us to learn are in the subspace H_X.
Equation (8) identifies the unique minimizer, the optimal T_τ. Denote it Z_τ. In statistics, this estimator is known as the regression function, or the conditional expectation function. That is:

    Z_τ := E[U^τ X | X] .    (11)

The conditional expectation E[·|X] is the (nonlinear) orthogonal projection P_X : H → H_X from H onto H_X. Thus, Z_τ = P_X U^τ X. This is the best approximation of X_{t+τ} = U^τ X ∈ H available using functions restricted to the subspace H_X. For a given learned target function (data-driven model) T_τ, we decompose its error via the regression function Z_τ:

    ||T_τ − U^τ X||² = ||T_τ − Z_τ||² + ||Z_τ − U^τ X||² = Θ(T_τ) + Ξ_X^τ .

The excess generalization error Θ(T_τ) = ||T_τ − Z_τ||² measures how far a given model is from the optimal solution, while Ξ_X^τ = ||Z_τ − U^τ X||² is the intrinsic error due to the partial observations X of a given system (Ω, Φ). For a given physical system with instrument measurements X, the regression function Z_τ represents the maximum predictive skill an instantaneous data-driven model can achieve. Ξ_X^τ is then the unavoidable error incurred from only being able to measure X. Increasing instrument coverage and expanding X can decrease Ξ_X^τ. The conditional expectation E[U^τ X|X] can be expressed as the expectation of a probability measure supported on the set {x_{t+τ} = [U^τ X](ω_t), for all ω_t ∈ B_t}: the instantaneous predictive distribution. Although we directly formulated the instantaneous predictive distribution in Eq. (7) using the nonequilibrium measure µ_t, these predictive distributions cannot be directly formulated as an L²(µ_*) measure [3]. That said, they provide insight into the intrinsic error of Z_τ, regardless of which measure is used for the nonlinear projection of Eq. (11). As Z_τ is the expectation of this distribution, the intrinsic error is seen to be the variance of the predictive distribution. Later, we give explicit constructions of history-dependent generalizations of predictive distributions.
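A toy numerical illustration of the regression function and its intrinsic error, using a hypothetical system of our own choosing (a rigid rotation on the circle observed through the noninvertible map X(ω) = cos ω): the binned conditional mean estimates Z_τ, its residual error is the variance of the predictive distribution, and any other instantaneous model can only do worse.

```python
import numpy as np

# Minimal sketch (hypothetical system): a rotation on the circle observed
# through the noninvertible map X(omega) = cos(omega).  The regression
# function Z_tau = E[U^tau X | X], estimated here by binning on x, is the
# optimal instantaneous predictor; its residual mean-squared error is the
# intrinsic error from partial observation.
rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 2 * np.pi, 200_000)  # samples from mu_*
tau = 0.7
x = np.cos(omega)                 # current observation X(omega)
y = np.cos(omega + tau)           # future observation [U^tau X](omega)

bins = np.linspace(-1, 1, 41)
idx = np.digitize(x, bins) - 1
z = np.array([y[idx == b].mean() for b in range(40)])  # binned E[y | x]

y_hat = z[np.clip(idx, 0, 39)]    # regression-function prediction
mse_z = np.mean((y - y_hat) ** 2)      # ~ intrinsic error Xi
mse_persist = np.mean((y - x) ** 2)    # a suboptimal model for comparison
print(mse_z, mse_persist)         # the regression function does no worse
```

Here the sign of sin ω is lost by the measurement, so even the optimal model retains the variance of the two-branch predictive distribution (analytically, sin²τ (1 − x²) on average), which is exactly the intrinsic error Ξ_X^τ of the text.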
Empirical Hilbert Spaces. While L²(µ_*) is theoretically convenient, it is not a workable space for empirical models [49], because functions in L²(µ_*) cannot be distinguished with a finite set of samples, which is all that is empirically available. Data-driven algorithms thus typically employ reproducing kernel Hilbert spaces (RKHSs), which have well-defined point evaluations. For more on RKHS methods, as well as empirical sample measures and their convergence, see Refs. [3,14,50,51].

VII. MORI-ZWANZIG FORMALISM OF DYNAMICAL PROCESSES
Although Eq. (11) defines the optimal instantaneous model, that model still carries an associated intrinsic error: the variance of the instantaneous predictive distribution. With the same Hilbert space and projection-operator formalism used to define Z_τ, the Mori-Zwanzig formalism provides the full equations of motion for the observable degrees of freedom by projecting the system dynamics onto those degrees of freedom [5]. The structure of the Mori-Zwanzig equation reveals terms in addition to Z_τ that produce the intrinsic error when not accounted for in instantaneous models. Crucially, the additional terms show that partial observation induces a memory dependence in dynamical processes. This motivates the use of history-dependent models for increased predictive skill over instantaneous models.
The Mori-Zwanzig setting is identical to that for the partially-observed dynamical systems considered so far. There is an underlying true system (Ω, Φ) and a noninvertible mapping X : Ω → X. The variables x = X(ω) are known as ω's resolved degrees of freedom. Denoting the remaining unresolved degrees of freedom x̃, the full state decomposes as ω = (x, x̃). The standard formulation of the Mori-Zwanzig equation in statistical mechanics assumes (Ω, Φ) to be Hamiltonian and considers projections of densities ρ(ω) and their time evolution by the Liouville operator [4]. Importantly, the equation can be derived in our more general setting of dissipative systems using the Koopman operator [6,15].
The goal is to predict the future values of the resolved degrees of freedom using only information available from them. That is, the task is to express the evolution of the resolved variables-the dynamics governing the dynamical process-in terms of the resolved variables as much as possible. We do this by projecting the Koopman operator's action onto the resolved degrees of freedom. This is possible since the dynamics of dynamical processes is given in terms of Koopman operators, as shown above.
Referring to Eq. (10)'s Hilbert spaces, i.e., H = L²(µ_*), the discrete-time derivation expands the Koopman operator U^{t+1} : H → H via the Dyson formula:

    U^{t+1} = Σ_{k=0}^{t} U^{t−k} P U (QU)^k + (QU)^{t+1} ,    (14)

where U ≡ U^1 is the unit-step Koopman operator. In this, P is an orthogonal projection operator from H to a subspace H_Ψ ⊆ H_X ⊂ H spanned by basis functions Ψ(x) that depend only on the resolved variables x = X(ω), and Q = I − P is the complementary orthogonal projection onto the unresolved subspace.
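The discrete Dyson expansion U^{t+1} = Σ_{k=0}^{t} U^{t−k} P U (QU)^k + (QU)^{t+1} is an operator identity that relies only on linearity and P + Q = I, so it can be checked numerically with finite-dimensional stand-ins (a random matrix U and a random orthogonal projection P, both hypothetical and unrelated to any particular system):

```python
import numpy as np

# Sketch: verify the discrete Dyson expansion
#   U^{t+1} = sum_{k=0}^{t} U^{t-k} P U (Q U)^k + (Q U)^{t+1},  Q = I - P,
# with finite-dimensional stand-ins: a random matrix U and a random
# orthogonal projection P (hypothetical, purely for the identity check).
rng = np.random.default_rng(2)
n, t = 6, 5
U = rng.standard_normal((n, n))

# Build an orthogonal projection onto a random 2D subspace.
B = np.linalg.qr(rng.standard_normal((n, 2)))[0]
P = B @ B.T
Q = np.eye(n) - P

QU = Q @ U
lhs = np.linalg.matrix_power(U, t + 1)
rhs = sum(
    np.linalg.matrix_power(U, t - k) @ P @ U @ np.linalg.matrix_power(QU, k)
    for k in range(t + 1)
) + np.linalg.matrix_power(QU, t + 1)

print(np.allclose(lhs, rhs))  # True: the expansion is exact
```

The identity holds for any linear U and any projection pair with P + Q = I; the physics enters through the particular choice of P onto resolved-variable functions.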
Recall that X ∈ H is the observation function that returns data gathered from measurement recordings of an underlying physical system Ω. Forming new observable functions in the projected space H_Ψ uses the observation measurements in X and functions ψ(x) of them. This is in contrast to introducing new measurement instruments, which would enlarge the resolved-variable space H_X. In finite-subspace projection algorithms, P projects into the subspace H_Ψ spanned by the basis of dictionary functions Ψ.
In discrete time, the dynamics of the resolved variables are generated via the Koopman operator:

    x_{t+1} = [U^{t+1} X](ω_0) .

The discrete-time Mori-Zwanzig equation then follows by applying the expansion in Eq. (14) to the unit-shift observable X_{t+1} = U^{t+1} X ∈ H. Skipping the algebra and notational simplifications [15], this yields:

    x_{t+1} = M_0(x_t) + Σ_{k=1}^{t} M_k(x_{t−k}) + ξ_{t+1}(ω_0) .    (15)

The key point is that this expression is exact. It gives the true evolution of the measurement observables, equivalent to the action of the full infinite-dimensional Koopman operator.
The first term M_0(x_t) describes Markovian evolution. It gives the best Markov approximation of Φ^1 under the projection P; that is, M_0 is the best approximation of the unit-step dynamics by a function of the current observable only. It is the optimal target function Z_1 defined above when the nonlinear projection given in Eq. (11) is used in Eq. (14). The last term ξ_{t+1}(ω_0) is the orthogonal term originating from the initial unresolved components. The second term captures longer-range statistical dependencies with a discrete convolution over memory kernels that depend on the orthogonal terms:

    M_k(x_{t−k}) = [P U (QU)^k X](ω_{t−k}) , k = 1, …, t .

(Statistical mechanics refers to this orthogonal dependence of the memory as a fluctuation-dissipation relation.) All terms depend on the particular projection operator P used. (For example, the terms M_0 and {M_k} may be linear, i.e., matrices, for certain choices of P [7].)

A comparison is in order: the Mori-Zwanzig perspective of Koopman operator projections in Eq. (15) versus the data-driven approaches for finite-dimensional Galerkin projections of the Koopman operator. The latter are presented in Appendix C; namely, Dynamic Mode Decomposition (DMD) and Extended Dynamic Mode Decomposition (EDMD). Both DMD and EDMD are instantaneous models and, as such, only approximate the Markovian term M_0. Both do so using linear finite-subspace projections onto H_Ψ ⊆ H_X. DMD uses the simple dictionary Ψ = {f_X} consisting of only the identity function f_X, while EDMD uses arbitrary dictionaries Ψ of basis functions. In contrast, while data-driven approaches to Mori-Zwanzig evolution operators also use finite-subspace projections H_Ψ ⊆ H_X [5,7,52], they do so for the memory kernels as well as for the Markovian component.
Comparing further, EDMD seeks a Galerkin approximation of the Koopman operator itself, with a single matrix, while Mori-Zwanzig evolution approximates projections of the Koopman operator's action specifically on functions of the resolved degrees of freedom. Paraphrasing Ref. [7]: "EDMD seeks a point U_τ Ψ(x_t) in H_Ψ that minimizes the error between the point and Ψ(x_{t+τ}), whereas Mori-Zwanzig simply projects Ψ(x_{t+τ}) onto H_Ψ".
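A minimal EDMD-style sketch, under hypothetical choices of our own (the logistic map as a stand-in system and a monomial dictionary): a single matrix K is fit by least squares so that K Ψ(x_t) ≈ Ψ(x_{t+1}), and one-step prediction of the observable x is read off the corresponding row.

```python
import numpy as np

# Minimal EDMD sketch (hypothetical dictionary and data): approximate the
# Koopman operator by a single matrix K acting on dictionary evaluations,
#   K = argmin sum_t || K Psi(x_t) - Psi(x_{t+1}) ||^2 ,
# solved by least squares.  The logistic map serves as a stand-in system.
def psi(x):
    return np.stack([np.ones_like(x), x, x**2, x**3])  # monomial dictionary

r = 3.7
x = np.empty(2000)
x[0] = 0.4
for t in range(1999):                 # logistic-map trajectory
    x[t + 1] = r * x[t] * (1 - x[t])

Psi0, Psi1 = psi(x[:-1]), psi(x[1:])  # dictionary before and after the shift
K = Psi1 @ np.linalg.pinv(Psi0)       # least-squares Koopman matrix

# One-step prediction of the observable x is read off the linear row of K.
pred = (K @ psi(x[:-1]))[1]
print(np.max(np.abs(pred - x[1:])))   # tiny: the dictionary contains x^2
```

The x-row is recovered essentially exactly here because x_{t+1} = r x_t − r x_t² lies in the dictionary's span; the higher-order rows are only approximated, illustrating the closure problem that motivates either richer dictionaries or the Mori-Zwanzig memory terms.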
The crucial insight of the Mori-Zwanzig equation (15) is that partially observing a dynamical system induces a memory dependence in the observable degrees of freedom. Optimal instantaneous models are thus not fully optimal among data-driven Hilbert space models: history-dependent models will reduce the intrinsic error and improve predictive skill. Moreover, the dependence of the memory kernels on the orthogonal unresolved degrees of freedom indicates that the memory dependence accounts for the effects of the unresolved variables on the dynamics of the resolved variables. Recall that much of the effort in physics-based modeling goes into explicitly inferring the unobserved degrees of freedom and their dynamical effects.

VIII. DELAY-COORDINATE EMBEDDINGS
A key step in bridging physics-based and data-driven approaches to prediction comes through the formulation of memory as reconstruction embeddings [53]. Their intrinsic geometry illuminates how memory of partial observations implicitly encodes effects of the unobserved degrees of freedom.
Starting with a scalar time series {x_t, t ∈ N+}, the task is to reconstruct an effective state space of embedding dimension m in which the effective states evolve as a deterministic dynamical system. In short, m is set large enough that orbits in the reconstructed state space do not intersect. A derivative-coordinate embedding develops a reconstructed state space from successive time derivatives of the measurement observable x_t, while a delay-coordinate embedding uses the present observation together with its recent past:

    ←x_t^m := (x_t, x_{t−δ}, …, x_{t−(m−1)δ}) .

Due to its familiarity we discuss delay-coordinate embeddings, despite the extra optimization over the lag δ that may be required in practice. Unless otherwise stated, we take δ = 1. For continuous-time systems the lag is given in units of the measurement sample interval Δt.
The original work on coordinate embeddings established that the geometry of the asymptotic attractor of (Ω, Φ) can be reconstructed, up to diffeomorphism, from embeddings ←x^m of partial observations x = X(ω) for sufficiently large m [8,9]. The intuitive idea is that the additional values in the embedding ←x^m act to fill in the degrees of freedom of ω missing from X. The reconstructed orbit ←x_t^m of the embedding traces out an attractor that is geometrically equivalent to the one generated by the full system state ω_t.
Geometrically, embeddings encode the unobserved degrees of freedom in the histories of the observed degrees of freedom. Moreover, the Koopman operator acting on delay-embedding observables implicitly encodes the unobserved degrees of freedom in a dynamically useful way [11-13]. In fact, the Koopman operator acting on delay-coordinate embeddings corresponds to the Laplace-Beltrami operator describing the attractor geometry [12].
In the asymptotic limit with evolution on the attractor, this correspondence allows employing geometric tools, such as heat kernels and diffusion maps, in data-driven modeling [3]. The details of how evolution operators acting on delay-coordinate embeddings dynamically encode the unobserved degrees of freedom are examined thoroughly below. First, though, we show the classical example of how delay embeddings geometrically encode unobserved degrees of freedom using the Lorenz 63 attractor. Later, we return to this example to demonstrate the dynamical encoding of the unobserved degrees of freedom using analog forecasting.
Example Reconstruction The following gives an empirical demonstration that delay embeddings can geometrically "fill in the gaps" of missing degrees of freedom for the three-dimensional Lorenz 63 system:

    dx/dt = σ(y − x) ,
    dy/dt = x(ρ − z) − y ,
    dz/dt = xy − βz .

Fig. 2(b) shows the attractor reconstructed using delay-coordinate embeddings of the x variable alone, with δ = 4 and m = 7. The delay-reconstructed attractor is a "squished" version of the original, but is (approximately) geometrically equivalent. The simulation and delay-embedding reconstruction were performed using the DynamicalSystems.jl package in Julia [54].
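A comparable sketch in Python (the figure itself uses DynamicalSystems.jl; the integrator, step size, and initial condition here are our own illustrative choices, with the standard chaotic parameter values): integrate Lorenz 63 with a simple RK4 stepper, then build the delay-coordinate embedding of the x variable alone.

```python
import numpy as np

# Sketch (hypothetical numerical choices): integrate Lorenz 63 with a basic
# RK4 stepper, then build the delay-coordinate embedding of the x variable
# with lag delta and embedding dimension m.
def lorenz(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(s, dt):
    k1 = lorenz(s)
    k2 = lorenz(s + 0.5 * dt * k1)
    k3 = lorenz(s + 0.5 * dt * k2)
    k4 = lorenz(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def delay_embed(x, m, delta):
    # rows are embeddings (x_t, x_{t-delta}, ..., x_{t-(m-1)delta})
    start = (m - 1) * delta
    return np.stack([x[start - k * delta : len(x) - k * delta]
                     for k in range(m)], axis=1)

dt, n = 0.01, 5000
traj = np.empty((n, 3))
traj[0] = (1.0, 1.0, 1.0)
for t in range(n - 1):
    traj[t + 1] = rk4_step(traj[t], dt)

emb = delay_embed(traj[:, 0], m=7, delta=4)   # reconstruct from x alone
print(emb.shape)                               # (4976, 7)
```

Plotting three columns of `emb` against each other reproduces the familiar "squished" two-lobed attractor, even though the y and z degrees of freedom never enter the embedding.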

IX. HISTORY-DEPENDENT MODELS
Many history-dependent model classes, such as recurrent neural networks [46], are more readily understood as mappings ←T_τ from pasts (delay embeddings) to future observations, rather than as fitting the paradigm of Markov terms and memory kernels from the Mori-Zwanzig equation. Generalizing the instantaneous case given in Eq. (8), the mean-squared error of history-dependent Hilbert space models is:

    E(←T_τ) = ∫_Ω ||←T_τ(←X^m(ω)) − [U^τ X](ω)||² dµ_*(ω) ,    (16)

where ←X^m denotes the delay-embedding observable built from X. As in the instantaneous case, we identify the unique minimizer of Eq. (16) as the optimal history-dependent Hilbert space model. This optimum is achieved by formally connecting the Mori-Zwanzig formalism with delay-coordinate embeddings using Wiener projections of the Koopman operator.
Before detailing Wiener projections, it is helpful to first review two standard orthogonal projections. Both use the L²(µ) inner product. For now, we follow Ref. [6] and use the invariant measure µ_*:

    ⟨f, g⟩ = ∫_Ω f(ω) g(ω) dµ_*(ω) .

Equation (11) defined the optimal instantaneous model Z_τ as the conditional expectation function that minimizes the L²(µ_*) norm between g ∘ X and U^τ X. This is known as the nonlinear or infinite-rank projection, used by Ref. [55], of U^τ X from H into H_X.
In contrast, the linear projection used by Ref. [56], also known as a finite-rank or finite-subspace projection, is defined in terms of an orthonormal set Ψ = {ψ_1, …, ψ_N} of size N on the space V of functions of the observed variables. That is:

    P f = Σ_{j=1}^{N} ⟨f, ψ_j ∘ X⟩ (ψ_j ∘ X) .    (17)

In the infinite-rank limit Ψ → V, this linear projection converges to the nonlinear conditional-expectation projection. As with EDMD, the challenge for data-driven methods that employ finite-subspace projections [7,14,15,57] is to find an effective finite basis Ψ.
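A small numerical sketch of such a finite-rank projection, with a hypothetical basis and measure of our own choosing (Legendre polynomials normalized against the uniform probability measure on [−1, 1]): the inner products are estimated by Monte Carlo, mirroring how empirical sample measures stand in for µ_* in practice.

```python
import numpy as np

# Minimal sketch (hypothetical basis and measure): the finite-rank projection
#   P f = sum_j <f, psi_j> psi_j
# onto an orthonormal set, with L^2(mu) inner products estimated by Monte
# Carlo sampling from mu (here, uniform on [-1, 1]; the psi_j are Legendre
# polynomials normalized against that probability measure).
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 1_000_000)        # samples from the measure mu

basis = [
    lambda u: np.ones_like(u),               # psi_0
    lambda u: np.sqrt(3.0) * u,              # psi_1
    lambda u: np.sqrt(5.0) * 0.5 * (3 * u**2 - 1),  # psi_2
]
f = lambda u: u**3                           # function to project

coeffs = [np.mean(f(x) * psi(x)) for psi in basis]   # <f, psi_j> by MC
Pf = lambda u: sum(c * psi(u) for c, psi in zip(coeffs, basis))

# Exact projection of x^3 onto polynomials of degree <= 2 is (3/5) x.
print(coeffs[1], np.sqrt(3.0) / 5)           # these should nearly agree
```

The residual f − Pf is the part of the observable invisible to this finite dictionary, which is precisely what a better-chosen basis Ψ aims to shrink.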

A. Equilibrium Wiener Projections
Using these instantaneous projections, the standard Mori-Zwanzig formalism embodies the history dependence of the observed variables in the collection of memory kernels. In contrast, Wiener projections incorporate history dependence directly into the projection operators via delay-coordinate embeddings. Specifically, linear Wiener projections replace the single L²(µ_*) inner product in Eq. (17) with the family of inner products:

    ⟨f, U^{−k}(ψ_j ∘ X)⟩_{L²(µ_*)} , k = 0, 1, 2, … .

That is, L²(µ_*) inner products with the reverse-time-shifted observables are taken at all times into the infinite past. Applying a discrete-time Wiener projection P^W to the discrete-time Koopman operator, as first introduced in Ref. [6], results in:

    x_{t+1} = ←M_0(←x_t) + W_{t+1}(ω_0) .    (19)

As expected, there is no longer a temporal convolution over memory kernels; only a Markov term and an orthogonal term remain. In the setting of quantum statistical mechanics, Ref. [58] gives a similar expression using time-dependent projection operators. Furthermore, Ref. [15] argues that, if the conditions of the delay-embedding theorem [9] are met, the orthogonal term vanishes. Thus, in the ideal case, Mori-Zwanzig evolution with delay-coordinate embeddings reduces to a single Markov term that corresponds to the nonlinear Wiener projection of U^{t+1} X:

    x_{t+1} = ←M_0^m(←x_t^m) ,

for embedding dimension m sufficiently large to satisfy the delay-embedding theorem.
In the nonideal case, particularly with finite k, the orthogonal term does not vanish and there may be memory effects at Markov orders larger than that spanned by finite pasts ←x_t^k. Therefore, as with standard (instantaneous) Mori-Zwanzig Markov approximations, the stochastic evolution of finite pasts is modeled by the finite Markov operator ←M_0^k(←x^k) plus an effective "noise" term. A finite model of this form is found in the HAVOK method [10,13], based on Hankel DMD [11]. It finds the best-fit linear approximation for ←M_0^k using the leading components of the singular value decomposition of the Hankel matrix, whose columns are time-ordered delay embeddings. The last few components are then fit to the noise.
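A minimal HAVOK-style sketch of this pipeline, under the simplifying assumptions of a pure cosine signal and hand-picked sizes (the full method's treatment of the trailing "forcing" components is omitted):

```python
import numpy as np

# Minimal HAVOK-style sketch: build a Hankel matrix of delay embeddings,
# take its SVD, and fit a best-fit linear Markov map on the leading
# singular vectors.  The cosine signal and all sizes are illustrative.

t = np.arange(0, 2000) * 0.01
x = np.cos(t)                        # observed scalar time series
k, r = 50, 2                         # embedding depth, retained rank

# Columns are time-ordered length-k delay embeddings.
H = np.stack([x[i:i + k] for i in range(len(x) - k)], axis=1)
U, s, Vt = np.linalg.svd(H, full_matrices=False)
V = Vt[:r].T                         # leading delay-coordinate amplitudes

# Best-fit linear Markov approximation on the leading coordinates:
# V[t+1] ~ V[t] @ M
M, *_ = np.linalg.lstsq(V[:-1], V[1:], rcond=None)
err = np.max(np.abs(V[:-1] @ M - V[1:]))
print(err)                           # tiny: a pure cosine is linear in 2D
```

A chaotic signal would leave a nonzero residual here, which HAVOK treats as the intermittent forcing carried by the discarded singular vectors.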
We emphasize that the Wiener projection approach to Mori-Zwanzig evolution is useful as it provides a direct connection to delay-coordinate embeddings and their intrinsic geometry. Theoretically, though, it merely rearranges memory dependence in the observable degrees of freedom. Delineating the practical advantages or disadvantages of Wiener projections over the standard Mori-Zwanzig formalism requires further investigation. Note, though, that the algorithms given by Ref. [7] for reconstructing the Markov and memory kernels of the latter employ two-time correlation functions of the observed variables. These are closely related to the instantaneous predictive distributions described above.

B. Nonequilibrium Wiener Projections
For another perspective on how the orthogonal term in Eq. (19) may vanish, it is instructive to formulate Wiener projections using nonequilibrium time-dependent measures, rather than the equilibrium invariant measure. In statistical mechanics, in fact, the invariant measure is taken to be exactly the equilibrium distribution, the L² inner products are equilibrium correlations, and the Mori-Zwanzig equation's validity holds only near equilibrium [41].
First, we introduce the history-dependent generalization of the time-dependent Maximum Entropy measures introduced in Section V. The history-dependent generalization of the Maximum Entropy Principle is known as Maximum Caliber [18,19]. In short, given a time series of constraints up to the present moment, Maximum Caliber constructs the least biased distribution at the current time by maximizing the entropy while accommodating all time-evolving constraints. If the constraints are given in the form of expectation values, as is typical in statistical mechanics, this results in generally intractable spacetime path integrals.
As in the instantaneous case, though, constraints for partially-observed systems are simply support sets of possible ω consistent with observations x = X(ω). Now, however, there are multiple time-evolving observations {x_t, x_{t−1}, ..., x_{t−k}}, in the form of delay embeddings, that constrain the support sets over Ω.
Equation (6) defined the instantaneous set B_t ∈ Σ_Ω as the set X^{−1}(x_t) of all ω_t consistent with the observation x_t, such that X(ω_t) = x_t. Rather than a single instantaneous observation, consider now two sequential observations x_{t−1} and x_t. Define ←B_t^2 ∈ Σ_Ω as the set of ω_t consistent with both observations, such that X(ω_t) = x_t and X(Φ^{−1}(ω_t)) = x_{t−1}. If the two sets are not equal, we say that ←B_t^2 ⊂ B_t is a refinement of B_t. Extending to depth-k pasts ←x_t^k gives the set ←B_t^k of all ω_t whose reverse orbits are consistent with each of the observations. Given the set ←B_t^k, the Maximum Caliber distribution is uniform over ←B_t^k and zero elsewhere, as with the instantaneous Maximum Entropy case. For finite k, ←B_t^k is ν-measurable and the Maximum Caliber measure ←µ_t^k is defined through the density ←ρ_t^k that is constant over ←B_t^k and zero elsewhere. If lim_{k→∞} ←B_t^k is a discrete set, ←µ_t^∞ is given as a sum of equally-weighted delta distributions.
With the Maximum Caliber measures in hand, we can define nonequilibrium Wiener projections using nonequilibrium inner products, with f_t = f(ω_t) and g_t = g(ω_t) to emphasize that the integrals in each term of the sum are carried out over values of ω_t. This inner product is identical to its equilibrium counterpart, except that the equilibrium measure is replaced by the Maximum Caliber measure ←µ_t^k defined by the observed past ←x_t^k. In particular, we can analyze the sequential refinement behavior of the measures ←µ_t^k and their support sets ←B_t^k with increasing depth k of the observed past ←x_t^k. This is shown in Fig. 3. Moreover, we can directly construct the predictive distributions whose expectation gives Eq. (20)'s conditional expectation using the (nonlinear) nonequilibrium Wiener projections.

C. Predictive Distributions
In the instantaneous case, recall that B_t = X^{−1}(x_t) is the set of all ω_t consistent with observation x_t, such that X(ω_t) = x_t. The set of possible observables x_{t+τ} that may be seen at a later time τ is given by the Koopman operator as {x_{t+τ} = [U^τ X](ω_t) for all ω_t ∈ B_t}. This set is the support of the instantaneous predictive distribution µ_{t+τ}^X-the pushforward of P^τ µ_t along X.
For two sequential observations, we may expect that there are some, if not many, ω_t in B_t that are not in ←B_t^2. That is, there may be ω_t such that X(ω_t) = x_t but X(Φ^{−1}(ω_t)) ≠ x_{t−1}. As more observations are recorded, there may be increasingly fewer ω_t whose reverse orbits are consistent with the observed values ←x_t^k. Therefore, ←B_t^k ⊆ B_t: the sets are monotonically nonincreasing as k grows and, in some cases, strictly decrease with increasing k. Figure 3 illustrates the dynamical refinement of Maximum Caliber support sets.
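This refinement is easy to exhibit in a toy setting. The sketch below adopts the doubling map and its binary partition as illustrative assumptions; since that map is noninvertible, it tracks the forward-time counterpart of ←B_t^k-the set of window-initial states consistent with the observed symbol sequence-whose Lebesgue measure halves with each added observation:

```python
import numpy as np

def symbols(omega, k):
    """First k binary-partition symbols of the doubling-map orbit."""
    seq = []
    for _ in range(k):
        seq.append(0 if omega < 0.5 else 1)
        omega = (2.0 * omega) % 1.0
    return seq

omega_true = 0.613                    # hidden Platonic state
grid = np.arange(65536) / 65536.0     # fine grid over [0, 1)

results = {}
for k in (1, 4, 8, 12):
    obs = symbols(omega_true, k)
    consistent = [w for w in grid if symbols(w, k) == obs]
    results[k] = len(consistent) / len(grid)   # Lebesgue measure of the set
    print(k, results[k])              # halves with each symbol: 2^-k
```

For this map the binary partition is generating, so the measure of the consistent set is exactly 2^{-k} and vanishes as k → ∞, mirroring ν(←B_t^k) → 0.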
We are now ready to examine the consequences of refinement on history-dependent predictive distributions. The latter are constructed as for instantaneous predictive distributions, using uniform initial measures over ←B_t^k rather than B_t. The predictive distribution Pr(U^{t+τ}X | ←X_t^k = ←x_t^k) is then given by the pushforward of P^τ ←µ_t^k along X, as shown in Fig. 4(b).
If ν(←B_t^k) < ν(B_t), then the initial measure ←µ_t^k is more constrained than µ_t. And so, P^τ ←µ_t^k is also more constrained than P^τ µ_t. In this case, the predictive distribution Pr(U^{t+τ}X | ←X_t^k = ←x_t^k) has lower entropy than the instantaneous Pr(U^{t+τ}X | X_t = x_t). Here, "entropy" refers to the size of a distribution's support.
Note that for distributions with a density ρ, the size of the effective support set-the typical set-is given by 2^{h(ρ)}, where h(ρ) is the differential entropy of ρ [59].
Taking τ = 1, consider the optimal history-dependent target function ←Z^k = E[U^{t+1}X | ←X_t^k = ←x_t^k] and the instantaneous optimal Z = E[U^{t+1}X | X_t = x_t]. From the arguments above, ν(←B_t^k) < ν(B_t) implies that ←Z^k is a more accurate estimator of x_{t+1} = [U^{t+1}X](ω_t) than the instantaneous optimal Z. This follows since there is less variance in Pr(U^{t+1}X | ←X_t^k = ←x_t^k)-it is more tightly concentrated about its mean than Pr(U^{t+1}X | X_t = x_t). Such a conclusion is in line with the intuition that the intrinsic error of instantaneous models derives from the unobserved degrees of freedom and that including past observations in the form of delay-coordinate embeddings accounts for the missing degrees of freedom.

It is natural to ask what happens in the k → ∞ limit of infinitely-many past observations. Reference [15] concludes that, for sufficiently large k, the estimator E[U^{t+1}X | ←X_t] = X_{t+1} is the identity map that yields the true evolution of x_{t+1}. In the nonequilibrium case, this implies that, as k → ∞, the size ν(←B_t^∞) vanishes, with only a single ω_t consistent with the infinite set of observations in ←x_t^∞. As a consequence, there is a unique future value, and the predictive distribution is then a δ-distribution with support on the single x_{t+1} = X(Φ^1(ω_t)). See Fig. 4(c).

[Figure caption fragment (support-set refinement): Unit-length orbits of ω_τ ∈ Ω consistent with the subsequent observation x_{τ+1} are shown in blue, while those inconsistent with the subsequent observation are red. The depiction shows strict refinement at each time.]
This clearly shows how delay-coordinate embeddings recover the unobserved degrees of freedom and effectively act as an equivalent to the full underlying Platonic state ω. In the ideal case, as the length of an embedding increases to infinity, the size of the set ←B_t^k of possible initial conditions goes to zero, as does the variance of the resulting predictive distribution. Thus, there is an a.e. one-to-one correspondence between infinite-length delay embeddings and Platonic system states ω. The associated predictive distribution converges, in a thermodynamic limit, to the true value of the next observable, given by the Wiener projection of U^{t+1}X.
Consider, for example, the simple harmonic oscillator. The underlying state ω is two-dimensional: ω = (r, p). Assume access only to position: X(ω) = r. The evolution of position is described by sine waves r(t) = sin(t). At a given time instant, we cannot determine the momentum from the instantaneous position alone and, therefore, cannot determine the full underlying state ω. However, there are only two momenta associated with each instantaneous position-call them positive and negative. Following Fig. 4's procedure for constructing predictive distributions, there are two corresponding system states and so two corresponding future position values at time t + dt-call these up (for positive momentum) and down (for negative momentum). Including a single infinitesimal past value from time t − dt is a.e. sufficient to distinguish whether the momentum is positive or negative at that time. Thus, except at the turning points (which are measure zero), including that single past value produces a δ-distribution for the inferred µ_0 and, thus, gives the exact future prediction using Eq. (20).
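This argument can be checked numerically. The sketch below assumes unit mass and unit frequency, so p = cos(t); the candidate-state construction and tolerances are illustrative:

```python
import numpy as np

# Sketch of the harmonic-oscillator argument: a single position
# observation is consistent with two momenta, but one past value
# eliminates the wrong-sign candidate almost everywhere.

dt = 1e-3
t = 1.0
r_now, r_past = np.sin(t), np.sin(t - dt)

# Two Platonic states consistent with the instantaneous observation:
# (r, +p) and (r, -p), with |p| fixed by energy conservation.
p = np.sqrt(1.0 - r_now**2)
candidates = [(r_now, +p), (r_now, -p)]

# Keep only candidates whose reverse orbit matches the past observation:
# r(t - dt) = r cos(dt) - p sin(dt).
consistent = [
    (r, mom) for (r, mom) in candidates
    if np.isclose(r * np.cos(dt) - mom * np.sin(dt), r_past, atol=1e-9)
]

# The surviving state gives the exact next observation.
r, mom = consistent[0]
r_next = r * np.cos(dt) + mom * np.sin(dt)
print(len(consistent), r_next, np.sin(t + dt))
```

One candidate survives, and its forward evolution reproduces sin(t + dt) to machine precision, i.e., the predictive distribution has collapsed to a δ-distribution.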
This convergence, however, is not guaranteed. The size of ←B_t^k may decrease with increasing k, but it does not necessarily always do so. Interestingly, while chaotic instabilities make future predictions challenging, they actually aid in this convergence: trajectory divergence in forward time [31] means convergence in reverse time.
Generating partitions X_G of symbolic processes, detailed in Appendix B, are a rigorous case where this is known to hold [60]. In that setting, ←B_t^k is the element of the dynamical refinement of the generating partition corresponding to the observed symbol sequence ←x_t^k. In the limit of infinitely-many observations, the size ν(←B_t^k) of the refined partition elements vanishes and almost-every infinite-length symbol sequence corresponds to a unique system state ω ∈ [0, 1]. Generating partitions on chaotic maps of the unit interval are thus a rigorous case where the a.e. convergence of ←B_t^∞ to a single {ω_t} is achieved.

FIG. 4. Construction of predictive distributions. (b) Finite depth-k past ←x_t^k. (c) Limiting case, with infinite-depth past ←x_t^∞. In all cases, the initial measure µ_t is uniform on the set of all ω_t consistent with the observations. The initial measure is then propagated forward in time with the Perron-Frobenius operator P^τ and, finally, the predictive distribution is given as the pushforward of µ_{t+τ} = P^τ µ_t along the observable mapping X. Panel (c) depicts the case when there is one and only one initial Platonic state ω_t consistent with the ∞-length observation ←x_t^∞; the prediction then converges to the true evolution x_{t+τ} = X(Φ^τ(ω_t)) = [U^τ X](ω_t).

X. SUPPORTING EXAMPLES
We now provide examples demonstrating the ability of delay-coordinate embeddings and Wiener projections of Koopman operators to dynamically encode unobserved degrees of freedom in practice. The data-driven models for these demonstrations employ analog forecasting, Eq. (9), due to its simplicity and flexibility. Various classes of analog forecasting target functions are formed based on what inputs are given, with appropriate distances computed to find the analog. Fully-observed instantaneous, partially-observed instantaneous, and partially-observed history-dependent target functions will all be considered.
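Analog forecasting itself takes only a few lines. The sketch below-with an illustrative sine library, made-up embedding settings, and a plain nearest-neighbor rule that may be simpler than Eq. (9)'s exact form-returns the successor of the nearest recorded analog of a query embedding:

```python
import numpy as np

def delay_embed(x, m, delta):
    """Rows are embeddings (x_t, x_{t-delta}, ..., x_{t-(m-1)delta})."""
    start = (m - 1) * delta
    return np.stack([x[start - j * delta:len(x) - j * delta]
                     for j in range(m)], axis=1)

def analog_forecast(library, m, delta, query):
    """One-step analog forecast: successor of the nearest delay analog."""
    emb = delay_embed(library, m, delta)
    targets = library[(m - 1) * delta + 1:]   # successor of each row
    emb = emb[:len(targets)]                  # last row has no successor
    i = np.argmin(np.linalg.norm(emb - query, axis=1))
    return targets[i]

x = np.sin(0.1 * np.arange(5000))             # illustrative training library
q = np.array([x[4000], x[3996], x[3992]])     # query embedding, m=3, delta=4
print(analog_forecast(x, 3, 4, q), x[4001])
```

Since the query here lies in the library, its nearest analog is exact and the forecast returns the true successor; out-of-library queries return the successor of the closest recorded behavior instead.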
A. Lorenz 63

First, Fig. 5 provides a dynamical complement to Fig. 2's geometric demonstration of encoding unobserved degrees of freedom in the Lorenz 63 system. That is, the analog forecast target functions for the two cases shown to be geometrically equivalent in Fig. 2 have the same predictive skill.
As a baseline, consider the fully-observed case with X(ω_t) = ω_t = (x_t, y_t, z_t). Figure 5(a) shows an instantaneous analog forecast that employs the full state variable as T(x_t, y_t, z_t). This is plotted alongside a numerical integration of the equations of motion in units of Lyapunov time.
For comparison, Fig. 5(b) shows an instantaneous analog forecast via T(x_t) using only the first coordinate x. While the analog forecast using the full state variable (x, y, z) tracks the numerical integration for almost four Lyapunov times, the analog forecast using only the instantaneous x variable diverges immediately. Despite this quantitative divergence, the analog forecast using only x produces a qualitatively consistent forecast, capturing behavior expected of the Lorenz system with oscillations within, and jumps between, the two attractor lobes.

Figure 5(c) shows an analog forecast using delay-coordinate embeddings ←x_t^m of the x variable with T(x_t, x_{t−δ}, ..., x_{t−(m−1)δ}). These are, in fact, the same delay embeddings used above in Fig. 2(b), with δ = 4 and m = 7. As with the forecast shown in Fig. 5(b), the forecast in (c) only has access to information in the x variable. However, as can be seen, including information from past observations of x, in the form of delay embeddings, results in a model with essentially the same predictive skill as the fully-observed case shown in (a), with divergence again occurring at about four Lyapunov times. This shows that, at least in simple low-dimensional cases, delay embeddings fill in the gaps and so become effective proxies for the missing degrees of freedom in implicit predictive models. As far as we are aware, the equivalence of predictive skill between an instantaneous full-state model and a history-dependent partially-observed model has not been previously demonstrated.

B. Lorenz 96
The Lorenz 63 model usefully connects the geometric encoding of unobserved degrees of freedom by delay embeddings with their dynamical encoding by history-dependent models. However, it is a low-dimensional system: even when just a single degree of freedom is accessible, only two are inaccessible.
In this example, we demonstrate the effects of increasing-length input pasts in a higher-dimensional system using a 50-dimensional Lorenz 96 model. Each degree of freedom x^i in the model evolves according to local interactions as

dx^i/dt = (x^{i+1} − x^{i−2}) x^{i−1} − x^i + F ,

with periodic boundary conditions. We set F = 4.6 and numerical integration is performed with a time step ∆t = 0.01. Again we use analog forecasting as the data-driven model, which has access to only a single degree of freedom x_t = X(ω_t) = x_t^1. Figure 6 shows four analog forecast predictions of x_t^1 in the Lorenz 96 system using history-dependent target functions of the form T(x_t, x_{t−1}, ..., x_{t−k}). Each prediction is made with a different value of past depth k. As delay-coordinate embeddings, the embedding dimension is k and we use a unit lag δ = 1. Like those in Fig. 5, predictions are made iteratively, with the target functions outputting a single prediction at the next time step.
The lowest-memory prediction is made with k = 3 and is shown in Fig. 6(a). The forecast follows the numerical integration for over 200 integration time steps before diverging. The mean-squared error of the prediction over the 1000 time-step window is 3.35. Predictions made with k = 20 are shown in Fig. 6(b). While this forecast diverges from the numerical integration sooner than the k = 3 model, the quasi-periodicity of the Lorenz 96 system allows it to closely track the numerical integration again at later times. The mean-squared error for the k = 20 model is thus slightly lower, at 3.18. Increasing the past depth to k = 80, shown in Fig. 6(c), provides a more noticeable improvement in predictive skill. This is apparent visually, and is reflected in the mean-squared error value of 1.72. Finally, increasing to k = 120, roughly the ideal 2N + 1 embedding dimension, provides a dramatic improvement. The data-driven model closely follows the numerical integration for most of the 1000 time-step prediction window, giving a mean-squared error of 0.15. Similar results are shown in Fig. 4 of Ref. [15] for a 5-dimensional Lorenz 96 model with F = 8.0, using the more sophisticated kernel analog forecasting algorithm, first introduced in Ref. [57].
We emphasize again that convergence of history-dependent data-driven models to the true dynamics is not guaranteed. The results shown in these experiments should not be expected as generic behavior for data-driven models. However, they serve as clear demonstrations of the ability of data-driven models to implicitly encode the effects of unobserved degrees of freedom. This ability provides a physical basis for the efficacy of data-driven predictive models and a bridge connecting them to physics-based models.

XI. A UNIFIED FRAMEWORK
Taken together, our development provides a unified framework for modeling complex systems from partial observations. With it laid out, we can now connect physics-based and data-driven prediction. We find that the two seemingly disparate paradigms fall at the two extremes of the same spectrum: physics-based models are fully explicit and data-driven models are fully implicit.
Recall that physics models reconstruct a coarse-grained data image u_t = ←a(←x_t^k) that explicitly fills in missing degrees of freedom using data assimilation over past observations. The model dynamic Φ_α^τ evolves data images u_t by explicitly computing the interactions among its degrees of freedom. In this way, the orbits of data images are generated by u_{t+τ} = Φ_α^τ(u_t). This is reminiscent of Hilbert space models ←T^τ(←x^k), although the above technically evolves data images. However, due to the explicit nature of data images, we can define simulated measurements x_t = X(u_t) that produce instrument readings for the data images. In this way, we predict instrument readings using physics models via

x_{t+τ} = X(Φ_α^τ(←a(←x_t^k))) .    (24)

Again, this is all implemented explicitly in terms of interactions among the observed and inferred unobserved degrees of freedom. Equation (24) now clearly parallels the optimal history-dependent Hilbert space model ←T^τ(←x_t^k) = E[U^{t+τ}X | ←X_t^k = ←x_t^k]. Rather than explicitly fill in the missing degrees of freedom with assimilated data images, data-driven models use the intrinsic geometry of coordinate embeddings to implicitly fill in the missing variables. Whereas physics models attempt to directly approximate Platonic differential equations of motion, data-driven models attempt to approximate the action of the Platonic Koopman operator on embedding coordinates via Wiener projections. And, as we demonstrated, they may converge to the Platonic model in the limit of infinite-length embeddings.

Markovian closure in their dynamics motivated our introducing Platonic models. Model parametrizations, though, are used in physics models if the data images cannot provide adequate closure. Analogously, if a finite-dimensional embedding does not provide adequate closure, there is still a nonzero orthogonal component in the Mori-Zwanzig equation resulting from the Wiener projection of the action of the Koopman operator.
In this case, a noise term can be added as a stochastic parametrization to alleviate the lack of closure [52].
Between fully-explicit and fully-implicit models lies a spectrum that includes history-dependent models combining implicit and explicit modeling. In particular, the spectrum encompasses the recent trend of physics-informed machine learning (PIML) [25][26][27]. There, known physical constraints are explicitly incorporated into the model, usually in the form of conservation laws. The model is then trained from data to implicitly learn the dynamics while maintaining the explicitly enforced constraints. The unified framework allows clearly evaluating the advantages and disadvantages of the various modeling paradigms. Having explicit access to degrees of freedom in physics-based models allows for their direct manipulation in the model. This greatly facilitates, for example, making projections of future climate outcomes under various anthropogenic forcing scenarios. Such uses of physics-based models are becoming increasingly common under the heading of digital twins [61,62].
That said, on the one hand, difficulties arise with physics models. This is particularly the case for prediction and especially when confronted with the poor coverage provided by the observed degrees of freedom. Said simply, it is challenging to construct good data images that approximate the system state well. Similarly, if there are many important interactions to track, as in the climate system, it is impossible to explicitly account for all interactions and so parametrizations are required. Compounding these problems, it is generally not clear how to construct appropriate or effective parametrizations. Due to all these challenges, implicit approaches are increasingly being added to physics models, particularly to provide data-driven parametrizations [63,64].
On the other hand, though famously difficult to interpret, data-driven models often excel at straightforward prediction and forecasting tasks. This is no longer surprising: the unified framework provides a physical explanation for this success. While data-driven models can converge to the Platonic model in the limit, in practice they must be learned from finite resources. Deep learning models, in particular, can be prohibitively expensive to train. Adding known physical constraints, when applicable, can help such models converge more quickly. The lesson is that models should not implicitly learn already-known features; the latter should be incorporated explicitly.

XII. CONCLUSION
Many modeling applications attempt to predict a physical system's future behavior but can access only a small subset of the system's degrees of freedom. Historically, predictions with partial observations have relied on physics-based models that explicitly fill in missing degrees of freedom using data assimilation and parametrization. In this, the physics models provide approximate solutions to the governing equations of motion-the system's true dynamics or Platonic model. Data-driven approaches, in contrast, learn the dynamics of the observed variables using implicit representations of all the degrees of freedom through delay-coordinate embeddings.
We demonstrated how the maximal predictive information available to a data-driven model-information from past observations of the accessible variables-is given by the predictive distributions. We gave an explicit construction using Koopman and Perron-Frobenius operators. Most data-driven models are Hilbert space models, in the form of a target function, that map past observations forward in time. Maximum Caliber measures were used to develop a nonequilibrium version of Wiener projections for the Mori-Zwanzig formalism. Using this, we showed that optimal Hilbert space models correspond to expectation values of the predictive distributions. Building on the intuition from generating partitions of symbolic processes, this insight illuminated how optimal Hilbert space models converge to the true evolution of the accessible variables, in the limit of infinite-length coordinate embeddings. We also showed how Wiener projections provide a clear theoretical connection between data-driven and physics-based models.
At first blush, the empirical success of data-driven models is counterintuitive. Indeed, by definition, they know nothing of the underlying physics governing the full system. And yet, they still learn, from only partial observations, to predict the true evolution; i.e., they converge to the Platonic model. Upon reflection, however, we recognize that our understanding and mathematical formulation of physical laws did not spontaneously manifest. They formed and evolved over generations precisely through our observations of and interactions with the natural world. The development of science has been data-driven, and successful at that. And so, it is not surprising that data-driven models "learn physics" from observations alone.
What is perhaps discomforting, though, is the implicit and often uninterpretable manner in which most data-driven methods learn to approximate the governing physics. We hope that the detailed investigation of implicit models given here alleviates at least some of this discomfort. It must also be remembered that initially there was a great deal of discomfort with explicit numerical physics models. After all, complicated numerical models are very much "black box" in ways similar to data-driven models, particularly deep learning models. The behavior that emerges in complicated physics models often cannot be deduced directly from the inputs given to the model. If a numerical model produces unphysical or otherwise pathological behaviors, it is often not immediately clear how to diagnose and address the concern [36,37].
Returning to our motivating question: Is there a best way to predict a given physical system from partial observations? At present, it does not seem that there is a universally "best" approach. When working with finite data and finite computational resources, all methods have their advantages and disadvantages. We sought to convey the commonality among seemingly disparate approaches to predicting complex systems from partial observations. Our goal was to illuminate the theoretical underpinnings of implicit and explicit models. Finding commonality in a unified predictive framework should help build confidence in the models currently employed. And, hopefully, this will pave the way forward to models with ever more predictive skill and structural interpretability. At that point, the science of complex systems will have moved closer to automated theory building [65].

Appendix A

In contrast to conservative Hamiltonian systems-the default assumption for statistical mechanics-many physical, chemical, and biological systems display dissipative and nonasymptotic behaviors that demand attention for a full understanding. We now define these behaviors in detail. This, in turn, highlights the application breadth of the unified framework.
A system's phase space consists of all of its allowed configurations. A primary goal of dynamical systems theory is to identify the key state-space structures that guide and constrain a system's complex behaviors [31]. We wish to capture them explicitly in our development. This requires a slightly more general presentation than is usually given for ergodic theory.
Invariant sets are subsets of a system's states that map onto themselves under the system's dynamic. When perturbations from them return, they are stable invariant sets, called attractors. The set of states that tend asymptotically to a given attractor is the attractor's basin of attraction. A given dynamical system can be decomposed into its invariant sets, including attractors, their basins, and the basin boundaries. Specifying these objects delineates a system's attractor-basin portrait-its comprehensive dynamically-relevant architecture.
Multistable systems are those with multiple attractors. Transient, nonasymptotic behaviors reflect relaxation to an attractor from states starting in its basin. The asymptotic stability of attractors meanwhile allows for the standard long-time analysis of ergodic systems. That is, the standard setup for the measure-preserving and ergodic dynamical systems of interest to us describes the evolution on an attractor.
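These notions are simplest to see in a minimal bistable system. The sketch below adopts the illustrative flow dx/dt = x − x³, which has attractors at ±1 separated by the basin boundary {0}, and labels initial conditions by the attractor to which they relax:

```python
import numpy as np

def evolve(x0, dt=0.01, steps=2000):
    """Relax dx/dt = x - x^3 with forward Euler for time steps*dt."""
    x = x0
    for _ in range(steps):
        x = x + dt * (x - x**3)
    return x

def basin_label(x0):
    """Label x0 by its attractor: +1, -1, or 0 for the basin boundary."""
    xf = evolve(x0)
    for a in (1.0, -1.0):
        if np.isclose(xf, a, atol=1e-3):
            return a
    return 0.0

print(basin_label(1.5), basin_label(-0.5), basin_label(0.0))  # → 1.0 -1.0 0.0
```

Every x0 > 0 relaxes to +1 and every x0 < 0 to −1: the two basins partition the line into equivalence classes, with the unstable fixed point at 0 forming the basin boundary.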
We now formally define these concepts. Consider the measure space (Ω, Σ_Ω, ν) and a dynamic Φ, where Ω is the state space, Σ_Ω its Borel algebra, and ν the Lebesgue measure.
Note that all Φ^t-invariant sets are necessarily also forward-invariant, but not all forward-invariant sets are Φ^t-invariant. An attractor of (Ω, Σ_Ω, ν, Φ) is a set A ∈ Σ_Ω with the following properties:
• A is forward-invariant: Φ^t(A) ⊆ A for all t ≥ 0.
• There exists an open set B ⊃ A, called the basin of attraction of A, such that for every ω ∈ B, lim_{t→∞} dist(Φ^t(ω), A) = 0.
• There is no proper subset of A with the first two properties.
By definition, an attractor is a forward-invariant set. However, due to the existence of its basin of attraction, an attractor is not a Φ^t-invariant set: there are points ω ∈ B \ A that are in the pre-image (Φ^t)^{−1}(A) but not in A. The full basin of attraction B of a given attractor A, however, is Φ^t-invariant. (The attractor itself is in its basin, A ⊂ B.) Every state in B limits to its attractor A, and any state in the pre-image of B maps into B and so also limits to A; by definition, then, it is in B. In fact, an alternative definition of the basin B of attractor A is as the limit of pre-images of A: B = lim_{t→∞} (Φ^t)^{−1}(A).

Attractors and their basins of attraction decompose a dynamical system into its dynamically-independent components-the system's attractor-basin portrait. For a multistable system with multiple attractors, the basins of attraction partition the state space Ω into equivalence classes of states based on the attractor to which they limit, since orbits never cross basin boundaries. Without loss of generality, the development considers dynamical systems with a single attractor and Ω its basin of attraction, unless explicitly stated otherwise. For multistable systems, each attractor and its basin may be analyzed separately, as if it were its own system. The full attractor-basin portrait becomes relevant, though, when one executes independent experimental trials that select a wide range of initial states. Moreover, real-world systems are never fully isolated, and this typically introduces fluctuations that can drive a system between otherwise noncommunicating basins.

Decomposing a dynamical system into independent components raises the issues of ergodicity and ergodic measures [33]. A dynamical system (Ω, Σ_Ω, µ, Φ) is ergodic, and µ is an ergodic measure, if every Φ^t-invariant set B is such that µ(B) = 1 or µ(B) = 0.
For an ergodic system, all Φ^t-invariant sets are trivial subsets of Ω. From the definition of basins of attraction, a dynamical system with a single basin of attraction, or a multistable system restricted to a single basin, is ergodic.
Ergodic theory often considers a dynamical system (Ω, Σ_Ω, µ, Φ) with measure µ to also be measure-preserving: µ((Φ^t)^{−1}(B)) = µ(B) for B ∈ Σ_Ω. An equivalent statement is that the measure µ is invariant under the dynamics Φ. This is mathematically convenient for casting the behavior of dynamical processes as stationary stochastic processes. However, it is too restrictive for our purposes, as it does not capture relaxation to an attractor. Transient behavior during relaxation to an attractor A is dissipative if it involves measurable subsets of B not in A, known as wandering sets. In essence, measure is "carried away" by wandering sets, and so the support of an invariant measure cannot include wandering sets. This can also be seen from the definitions of invariant measures and the Perron-Frobenius operator above.

It is often of particular concern whether or not the Lebesgue reference measure ν is invariant under the dynamics. Since ν provides a measure of state space volume, dynamics that preserve ν are said to be volume preserving. Wandering sets, by definition, preclude volume preservation. Hamiltonian systems, on the other hand, are volume preserving due to Liouville's theorem [4]. Because we consider only ergodic systems, there will always be a physical invariant probability measure that may be used, whether the system preserves volume or not. If the Lebesgue measure is invariant (and so volume is preserved), the microcanonical distribution gives the equilibrium invariant probability distribution. If the Lebesgue measure is not invariant, there will still be a unique asymptotic invariant measure. (More on this shortly.) Our formalism works in all cases, but is particularly useful for generalizing to nonasymptotic behaviors of systems that do not preserve phase space volume.
Note that when considering probability measures, the terminology of measure-preserving dynamics should not be confused with what we might call conservation of measure (or conservation of probability). As is standard, we assume the dynamics Φ to be nonsingular, such that µ(Φ^{-t}(B)) = 0 for all sets B with µ(B) = 0. This ensures that probability measures evolved by Perron-Frobenius operators remain probability measures.
To include transient behavior (relaxation to an attractor), the following does not assume an invariant measure. However, restricting to ergodic dynamics (considering single basins of attraction at a time) guarantees the existence of a unique asymptotic invariant measure. This follows since the L^1 Perron-Frobenius operators have a unique invariant density ρ* for ergodic dynamics: P^t ρ* = ρ* [39, Thm 4.5]. (These measures play a role roughly analogous to equilibrium macrostates in thermodynamics.) That is, this measure is preserved by the dynamics on the attractor, to which the system is restricted in the limit. Therefore, in the asymptotic limit the ergodic theorem applies and time averages equal state-space averages for observables f. In the asymptotic limit, the system trajectories settle on the attractor and the resulting dynamical process is distributed according to the asymptotic invariant measure µ*(B). Thus, by definition, the process is stationary only in the limit. Generally, though, µ_t ≠ µ_{t+τ}.
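The ergodic theorem's equality of time and state-space averages can be illustrated numerically. The sketch below uses the fully chaotic logistic map on [0, 1] as a stand-in ergodic system; the map, observable, and orbit length are our illustrative choices, not taken from the text. Its invariant density ρ*(ω) = 1/(π√(ω(1 − ω))) gives a state-space average of 1/2 for the observable f(ω) = ω.

```python
def logistic(w):
    # Fully chaotic logistic map on [0, 1]; ergodic, with invariant
    # density rho*(w) = 1 / (pi * sqrt(w * (1 - w))).
    return 4.0 * w * (1.0 - w)

def time_average(f, w0, n_steps, burn_in=1000):
    # Time average of observable f along a single orbit, after
    # discarding a transient (relaxation onto the attractor).
    w = w0
    for _ in range(burn_in):
        w = logistic(w)
    total = 0.0
    for _ in range(n_steps):
        total += f(w)
        w = logistic(w)
    return total / n_steps

# Time average of f(w) = w along one long orbit...
t_avg = time_average(lambda w: w, w0=0.123456789, n_steps=200_000)

# ...approaches the state-space average under the invariant measure,
# here computable in closed form: integral of w * rho*(w) dw = 1/2.
s_avg = 0.5
```

The time average converges to 1/2 as the orbit length grows, as the ergodic theorem guarantees for almost every initial condition.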

Appendix B: Optimal Finite-Precision Instruments for Continuous Observables
The following temporarily leaves behind the fully-continuous dynamical processes setting. Instead, it considers discrete-time, discrete-valued symbolic processes [60] and how they relate to discrete measurements of continuous dynamical systems. Here, the mapping X corresponds to a coarse-grained partition P of the state space Ω. As with dynamical processes, X is many-to-one and noninvertible, yielding fully-discrete stochastic processes of observations. Rather than interpreting X as accessing only a subset of the degrees of freedom in ω, for symbolic processes X is interpreted as a collection of measurement instruments, each with access to all relevant degrees of freedom, but each reporting only the result of a finite-precision observation [66]. To avoid confusion, we denote the observation function for symbolic processes as X_P.

Symbolic Processes from Generating Partitions
A symbolic measurement function X_P : Ω → A generates a finite partition P of the state space Ω, with every ω ∈ Ω mapping to a partition element P_i, such that P_i ∩ P_j = ∅ for i ≠ j and ∪_i P_i = Ω. Each partition element P_i carries a label, or symbol, a_i ∈ A. Without loss of generality, we take label(P_i) = i, with A = {0, 1, . . . , K − 1} for the K partition elements of P. Using this, we can explicitly write the piecewise-constant symbolic measurement function X_P in terms of the partition elements as:

X_P(ω) = Σ_{i=0}^{K−1} a_i 1_{P_i}(ω) ,   (B1)

where 1_{P_i}(ω) is the indicator function for partition element P_i.

Paralleling our development of dynamical processes, we now consider symbol sequences generated by measuring orbits of the underlying system (Ω, Φ). For an initial value ω_0 there is an initial symbol a_0 = X_P(ω_0), labeling an element of the partition P. Similarly, X_P ◦ Φ induces a partition over Ω, denoted Φ^{−1}P, such that each element (Φ^{−1}P)_i is the set of all ω for which X_P(Φ(ω)) = a_i. That is, Φ^{−1}P is a partition over Ω at the initial time t_0 where every ω_0 in the same element of Φ^{−1}P emits the same symbol a_1 = X_P(ω_1) = X_P(Φ(ω_0)) at the next time t_1. Each time step t_n generates a new partition Φ^{−n}P whose elements are the sets of all points ω_0 ∈ Ω for which Φ^n(ω_0) lies in the same partition element P_i.

Importantly, an iterated partition refines the previous partition. For two partitions P and Q, the refinement P ∨ Q = {P_i ∩ Q_j, for all P_i ∈ P and Q_j ∈ Q} is also a partition. The first refinement of P under Φ is P ∨ Φ^{−1}P. Its elements are all the points ω_0 ∈ Ω that emit the same symbol X_P(ω_0) at time t_0 and the same symbol X_P(Φ(ω_0)) at the next time t_1. Therefore, the refinement P ∨ Φ^{−1}P maps from Ω to two-symbol sequences a_0 a_1 in A × A. In the limit, the full dynamical refinement P ∨ Φ^{−1}P ∨ Φ^{−2}P ∨ · · · maps points in Ω to infinite-length symbol sequences in A × A × A × · · · .
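A minimal sketch of the construction above, with illustrative choices not fixed by the text (the logistic map as Φ and the binary partition of [0, 1] at ω = 1/2 as P): the symbolic measurement function reports partition labels, and sampling two-symbol words enumerates the elements of the first refinement P ∨ Φ^{-1}P.

```python
def x_p(w, boundaries=(0.5,)):
    # Piecewise-constant symbolic measurement function X_P: report the
    # label i of the partition element P_i containing w.
    for i, b in enumerate(boundaries):
        if w < b:
            return i
    return len(boundaries)

def phi(w):
    # Underlying dynamics; the fully chaotic logistic map is our
    # illustrative choice, not specified by the text.
    return 4.0 * w * (1.0 - w)

def symbol_word(w0, length):
    # Word a_0 a_1 ... = X_P(w0), X_P(Phi(w0)), X_P(Phi^2(w0)), ...
    word, w = [], w0
    for _ in range(length):
        word.append(x_p(w))
        w = phi(w)
    return tuple(word)

# Points in the same element of the refinement P v Phi^{-1}P emit the
# same two-symbol word; sampling [0, 1] finds all |A|^2 = 4 words,
# one per element of the first refinement.
words = {symbol_word(w / 1000.0, 2) for w in range(1, 1000)}
```

Longer words enumerate the successively finer elements of the full dynamical refinement in the same way.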
A partition P is generating if there is a one-to-one correspondence, almost everywhere, between an initial condition ω_0 ∈ Ω and the infinite sequence of symbols {X_P(ω_0), X_P(Φ(ω_0)), X_P(Φ^2(ω_0)), . . .} generated by ω_0. Thus, while the initial measurement symbol a_0 = X_P(ω_0) is far from sufficient to fully determine ω_0, the full infinite sequence of subsequent symbols does (almost everywhere) fully determine ω_0 if P is a generating partition. This occurs since the size of the dynamical-refinement partition elements goes to zero in the infinite-time limit. And, in turn, this requires the system to be chaotic: exponential spreading of orbits in forward time corresponds to exponential convergence in reverse time.
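The one-to-one correspondence is easiest to see for the doubling map Φ(ω) = 2ω mod 1 with the binary partition G = {[0, 1/2), [1/2, 1)}, a standard example of a generating partition (our illustrative choice, not an example from the text): the symbol sequence is the binary expansion of ω_0, so k measurements localize ω_0 to a refinement element of width 2^{-k}.

```python
def doubling(w):
    # Doubling map Phi(w) = 2w mod 1; the binary partition
    # G = {[0, 1/2), [1/2, 1)} is generating for this map.
    return (2.0 * w) % 1.0

def measure(w):
    # X_G: report which partition element w lies in.
    return 0 if w < 0.5 else 1

def observe(w0, k):
    # First k symbols of the orbit: the leading k binary digits of w0.
    symbols, w = [], w0
    for _ in range(k):
        symbols.append(measure(w))
        w = doubling(w)
    return symbols

def reconstruct(symbols):
    # The dynamical-refinement element matching these symbols is an
    # interval of width 2^-k; return its left endpoint.
    return sum(a / 2.0 ** (i + 1) for i, a in enumerate(symbols))

w0 = 0.37128956
w_hat = reconstruct(observe(w0, 20))
# |w_hat - w0| < 2^-20: each further measurement halves the
# uncertainty, recovering w0 in the infinite-observation limit.
```

Doubling the number of observed symbols squares the precision of the reconstruction, which is the finite-time content of the one-to-one correspondence.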
Due to all this, generating partitions provide a rigorous notion of a "good" measurement device, for which information lost by a coarse single-time measurement is recovered through an infinite-time limit of measurement observations. The one-to-one correspondence property of generating partitions emerged above when discussing the potential convergence of data-driven models of partially-observed systems. The set ←B_t^k is the element of the dynamical refinement of the generating partition corresponding to the observed symbol sequence ←x_t^k. In the limit of infinitely many observations, the size ν(←B_t^k) of the refined partition elements vanishes and almost every infinite-length symbol sequence corresponds to a unique system state ω ∈ [0, 1].
An important bridge between symbolic and dynamical processes arises from the fact that the evolution of partitions Φ −n P (not the dynamical refinements) is governed by discrete-time Koopman operators. The partition Φ −n P is generated by the time-shifted symbolic measurement function X P • Φ n = U n X P . Therefore, again paralleling dynamical processes, the symbol sequences are given by {X P (ω 0 ), [U X P ](ω 0 ), [U 2 X P ](ω 0 ), . . .}.
Note this function is in the general form of Eq. (B1). The single-time evolved partition Φ^{−1}G, also shown in Fig. 7, is given by the single-time-shifted symbolic measurement function. The new boundary points (1 − 1/√2)/2 and (1 + 1/√2)/2 of Φ^{−1}G are the pre-images {Φ^{−1}(ω)} of the original boundary point ω = 1/2 of G.
Finally, the first dynamical refinement G∨Φ −1 G, mapping Ω to two-symbol sequences, is also shown in Fig. 7. From Fig. 7 we see that the dynamical refinement adds the boundary points of Φ −1 G to the original boundary point of G.
Beyond rigorously formulating good measurement devices (generating partitions), symbolic processes were historically important for introducing concepts and methods from discrete information and computation theories into dynamical systems and ergodic theory, as noted above. In particular, the Shannon entropy rate of a symbolic process has a (possibly nonunique) supremum over all possible partitions for a given iterated map. This is the Kolmogorov-Sinai entropy. That is, the supremum is achieved for generating partitions [69, 70]. Moreover, the Kolmogorov-Sinai entropy is bounded by the positive Lyapunov exponents of the underlying system [71]. This provides a rigorous link between the geometric instabilities of deterministic chaos and observed randomness. And this explains, in part, why the weather is hard to predict [72, 73].

FIG. 8. EDMD algorithm's commuting diagram for the finite-dimensional Galerkin approximation U_X^τ of U^t onto H_Ψ ⊆ H_X ⊂ H.

observation time series.
A more serious difficulty in using EDMD for prediction comes from its lack of closure: leakage out of the subspace H_Ψ. If H_Ψ is not a finite Koopman-invariant subspace [81], then after several iterations U^τ[f_X ◦ X] eventually no longer lies within H_Ψ. Due to this, the action of U_X^τ differs from the true evolution given by the action of U^τ. Note that if H_X is not a Koopman-invariant subspace, then all instantaneous models T^τ accrue a similar prediction error. This is the intrinsic error Ξ_X^τ discussed above, which is incurred for having only partial observations X of Ω.
In the infinite-dictionary limit, Ψ → V and so U_X^τ[f_X ◦ X] = g ◦ X is always in H_Ψ = H_X, for some g ∈ V. Thus, EDMD converges in this limit to the optimal target function, the regression function Z^τ in Eq. (11). Given that the Koopman operator provides the ground truth for data-driven models, it is not surprising that the EDMD approximation method for U^τ recovers the optimal instantaneous target function.
The difficulty is that EDMD in practice never reaches the Ψ → V limit. Generally, then, the leakage of U_X^τ[f_X ◦ X] out of H_Ψ may still lie within H_X. Unlike leakage out of H_X, this leakage is avoidable, given a better choice of, or a larger, dictionary Ψ. Moreover, the prediction error from the leakage compounds over time. The choice of Ψ thus substantially impacts EDMD's predictive skill. In general, a finite invariant subspace cannot be determined a priori and, for that matter, may not exist for a given physical system Ω with a given X, the set of measurements that can be made on Ω. Recent deep learning approaches [77-79, 82] attempt to learn an optimal Ψ from data. Similarly, kernel methods [83-85] are used to create a very large, implicitly-defined dictionary.
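As a concrete sketch of the EDMD regression described above (a minimal illustration of our own: the logistic map, a monomial dictionary, and plain least squares are all illustrative choices, not prescribed by the text), the following fits the finite-dimensional approximation K from snapshot pairs and uses it for one-step prediction. Because Φ(x) = 4x(1 − x) is itself a polynomial, the observable f(x) = x evolves exactly within the span of the dictionary {1, x, x²}, so the one-step prediction is essentially exact; iterating K for multi-step prediction, by contrast, accrues the subspace-leakage error discussed above.

```python
import numpy as np

def phi_map(x):
    # Underlying dynamics: fully chaotic logistic map (illustrative).
    return 4.0 * x * (1.0 - x)

def dictionary(x):
    # Monomial dictionary Psi = {1, x, x^2}; row n is Psi(x_n).
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Snapshot pairs (x_n, x_{n+1}) from one long orbit.
rng = np.random.default_rng(0)
x = np.empty(2000)
x[0] = rng.uniform(0.1, 0.9)
for n in range(1999):
    x[n + 1] = phi_map(x[n])

G = dictionary(x[:-1])   # Psi evaluated at times n
A = dictionary(x[1:])    # Psi evaluated at times n + 1

# EDMD: least-squares solve for K with Psi(x_{n+1}) ~ Psi(x_n) K.
K, *_ = np.linalg.lstsq(G, A, rcond=None)

# One-step prediction of the observable f(x) = x, i.e. the second
# dictionary coordinate: x_{n+1} ~ Psi(x_n) K b with b = (0, 1, 0).
b = np.array([0.0, 1.0, 0.0])
x_pred = G @ K @ b
one_step_err = np.max(np.abs(x_pred - x[1:]))
```

Swapping in a dictionary that does not span 4x − 4x² makes the one-step error finite, which is the avoidable leakage attributable to the choice of Ψ.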
Note that employing the trivial dictionary Ψ = {f_X}, which includes only the identity, yields the exact Dynamic Mode Decomposition (DMD) algorithm [86]. For prediction, DMD finds the optimal matrix (i.e., linear) solution for T^τ that minimizes the instantaneous target-function error in Eq. (8).
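To make the trivial-dictionary case concrete, here is a minimal DMD-style sketch (our illustration, with an arbitrarily chosen 2×2 linear system): with Ψ = {f_X}, the regression reduces to the best-fit linear map between snapshots, which is exact when the dynamics are in fact linear.

```python
import numpy as np

# A genuinely linear system, so the DMD least-squares fit is exact.
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])

# One trajectory of snapshots x_{n+1} = A_true x_n.
x = np.empty((2, 21))
x[:, 0] = [1.0, 1.0]
for n in range(20):
    x[:, n + 1] = A_true @ x[:, n]

X, Y = x[:, :-1], x[:, 1:]

# DMD: least-squares linear model Y ~ A X, i.e. A = Y pinv(X).
A_dmd = Y @ np.linalg.pinv(X)
```

For nonlinear dynamics the same fit returns only the best linear approximation of T^τ, which is precisely where EDMD's richer dictionaries come in.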
For complex, nonlinear systems, using a linear model for prediction may seem like a bad idea. Interestingly, though, "linear plus noise" models, such as Linear Inverse Modeling (LIM) [86], can be reasonably effective and are frequently used in climate science [87]. We are not aware of attempts to generalize this to an Extended Linear Inverse Model that implements EDMD plus noise. The efficacy of LIM models suggests the tolerance induced by noise may help alleviate the effects of subspace leakage.
Equations of Motion From Data

A popular approach for data-driven modeling learns an explicit closed-form equation model for T^τ. This is referred to as equation discovery. The most common approach performs a dictionary regression, sometimes also called symbolic regression [45, 88]. Like EDMD, a dictionary Ψ = [ψ_1, . . . , ψ_k] of functions is chosen and a (typically sparse) regression is performed to find the best-fit coefficients a_i that minimize ‖ẋ_t − Σ_i a_i ψ_i(x_t)‖². In fact, the dictionary regression approach to equation discovery is a special case of gEDMD [80].
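A minimal sketch of such a dictionary regression follows. It is our own illustration, not an algorithm from the text: the logistic ODE ẋ = x − x², a monomial dictionary, and a simple hard-thresholded least squares stand in for the sparse solvers used in practice.

```python
import numpy as np

# Synthetic data from a system whose equation we pretend not to know:
# the logistic ODE xdot = x - x^2 (an illustrative choice).
x = np.linspace(0.05, 0.95, 200)
xdot = x - x ** 2

# Dictionary Psi = [1, x, x^2, x^3].
Psi = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# Least-squares fit of xdot ~ sum_i a_i psi_i(x), then hard-threshold
# small coefficients to promote a sparse, closed-form model.
a, *_ = np.linalg.lstsq(Psi, xdot, rcond=None)
a[np.abs(a) < 0.05] = 0.0

# Surviving coefficients give the closed-form model:
# xdot = a[1] * x + a[2] * x^2, recovering x - x^2.
```

In practice ẋ must itself be estimated from noisy time series (e.g., by finite differences), which is where sparse regularization earns its keep.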
Whatever form of equation discovery is used, the ultimate goal is to approximate ẋ = Φ_X(x) with a closed-form expression for Φ_X. For partially-observed dynamics, though, it is not guaranteed that Φ_X will be well-represented by closed-form equations of motion, even if Φ is [45]. The insight that dictionary regression is a special case of the gEDMD algorithm for approximating the Koopman generator illustrates that forcing T^τ to be closed-form is an unnecessary restriction. There are certainly many advantages to having closed-form models, including interpretability and extracting adjustable physical parameters. For prediction, in contrast, our unified framework demonstrates that it is often advantageous to use implicit models for T^τ.