Phase Transitions in Nonlinear Filtering

It has been established under very general conditions that the ergodic properties of Markov processes are inherited by their conditional distributions given partial information. While the existing theory provides a rather complete picture of classical filtering models, many infinite-dimensional problems are outside its scope. Far from being a technical issue, the infinite-dimensional setting gives rise to surprising phenomena and new questions in filtering theory. The aim of this paper is to discuss some elementary examples, conjectures, and general theory that arise in this setting, and to highlight connections with problems in statistical mechanics and ergodic theory. In particular, we exhibit a simple example of a uniformly ergodic model in which ergodicity of the filter undergoes a phase transition, and we develop some qualitative understanding as to when such phenomena can and cannot occur. We also discuss closely related problems in the setting of conditional Markov random fields.


Introduction
Let (X k , Y k ) k≥0 be a bivariate Markov chain. Such a model represents the setting of partial information: it is presumed that only (Y k ) k≥0 can be observed, while (X k ) k≥0 defines the unobserved dynamics. In order to understand the behavior of the unobserved process given the observations, it is natural to 'lift' the unobserved dynamics to the level of conditional distributions, that is, to investigate the nonlinear filter π k := P[X k ∈ · |Y 1 , . . . , Y k ].
Under standard assumptions on the observation structure (cf. section 2), the process (π k ) k≥0 is itself a measure-valued Markov chain. The fundamental question that arises in this setting is to understand in what manner the probabilistic structure of the model (X k , Y k ) k≥0 'lifts' to the conditional distributions (π k ) k≥0 .
Of particular interest in this context is the behavior of ergodic properties under conditioning. It is natural to suppose that the ergodic properties of (X k , Y k ) k≥0 will be inherited by the filter (π k ) k≥0 : for example, if X k forgets its initial condition as k → ∞, then the optimal mean-square estimate of X k (and therefore the filter π k ) should intuitively possess the same property. Such a conclusion was already conjectured by Blackwell as early as 1957 [1], and a proof was provided by Kunita in 1971 [28]. Unfortunately, both the proof and the conclusion are erroneous: it is elementary to construct a finite-state Markov chain (X k , Y k ) k≥0 that is 1-dependent (as strong an ergodic property as one could hope for) with observations of the form Y k = h(X k−1 , X k ) such that the corresponding filtering process (π k ) k≥0 is nonergodic, see Example 2.1 below.
Despite the appearance of counterexamples already in the most elementary setting, recent advances have provided a surprisingly complete picture of such problems in a general setting. On the one hand, it has been shown under very general assumptions [49,45] that ergodicity of the underlying model is inherited by the filter when the observations are nondegenerate, that is, when the conditional law of each observation P[Y k ∈ · |X] has a positive density with respect to some fixed reference measure. This is a mild condition in classical filtering models that serves mainly to rule out the singular case of noiseless observations: for example, the addition of any observation noise to the above counterexample would render the filter ergodic. On the other hand, even in the noiseless case, ergodicity is inherited in the absence of certain symmetries that are closely related to systems-theoretic notions of observability [47,48,50,9].
One can therefore conclude that while there exist elementary examples where the ergodicity of the model fails to be inherited by the filter, such examples must be very fragile as they require both a singular observation structure and the presence of unusual symmetries, either of which is readily broken by a small perturbation of the model.
The theory outlined above provides a satisfactory understanding of conditional ergodicity in classical filtering models. Some care must be taken, however, in interpreting this conclusion. The ubiquitous applicability of the theory hinges on the notion that most filtering models possess observation densities, an assumption made almost universally in the filtering literature (cf. [13] and the references therein). This assumption is largely innocuous in finite-dimensional systems. The situation is entirely different in infinite dimension, where singularity of probability measures is the norm. There exists almost no mathematical literature on filtering in infinite dimension, despite the substantial practical importance of infinite-dimensional filtering models in data assimilation problems that arise in areas such as weather forecasting or geophysics [43]. The aim of this paper is to draw attention to the fact that, far from being a technical issue, the infinite-dimensional setting gives rise to new probabilistic phenomena and questions in filtering theory that are fundamentally different than those that have been studied in the literature to date, and whose understanding remains limited.
Remark 1.1. The singularity of measures in infinite dimension is already a significant hurdle in the ergodic theory of infinite-dimensional Markov processes, as it precludes application of standard methods based on Harris recurrence (e.g., [33]). An infinite-dimensional unobserved process is not, however, the primary source of the problems considered in the sequel: as long as the observations are nondegenerate, it was shown in [45] that ergodicity is still inherited by the filter. The problem is that nondegeneracy is a restrictive assumption on the observations in infinite dimension: it implies that the observations must be effectively finite-dimensional [45,Remark 5.20]. It is in models where the observations are genuinely infinite-dimensional that new phenomena appear.
To model a filtering problem in infinite dimension, we suppose that (X k , Y k ) k≥0 is a Markov chain in the product state space E V × F V , where E, F are local state spaces and V is a countably infinite set of sites (for concreteness, we fix V = Z d throughout). Each element of V should be viewed as a single dimension of the model. A more practical interpretation is that V defines a spatial degree of freedom and that (X k , Y k ) k≥0 describes the dynamics of a time-varying random field, as is the case in data assimilation applications. In accordance with this interpretation, we will assume that the dynamics of the state X k and the observations Y k are local in nature: that is, the conditional distributions of the local state X v k given the previous state X k−1 , and of the local observation Y v k given the underlying process X, depend only on X w k−1 and X w k for sites w ∈ V that are neighbors of v. In essence, our basic model therefore consists of an infinite family of local filtering models (X v k , Y v k ) k≥0 whose dynamics are locally coupled according to the graph structure of V = Z d . The general model is described in detail in section 2.
In section 3 we investigate the natural infinite-dimensional version of Blackwell's Example 2.1. Recall that it was crucial in the finite-dimensional setting that the observations Y k = h(X k−1 , X k ) are noiseless: the addition of any noise renders the observations nondegenerate and then ergodicity is preserved. This is no longer the case in infinite dimension: even if the local observations Y v k are nondegenerate, the failure of the filter to inherit ergodicity can persist. In fact, we observe a phase transition: the filter fails to be ergodic when the noise is small, but becomes ergodic when the noise strength exceeds a strictly positive threshold. The remarkable feature of this phenomenon is that no qualitative change of any kind occurs in the ergodic properties of the underlying model: (X v k , Y v k ) k≥0,v∈V is a 1-dependent random field for every value of the noise parameter. We are therefore in the surprising situation that complex ergodic behavior emerges in an otherwise trivial model when we consider its conditional distributions. Such conditional phase transitions cannot arise in finite dimension.
The above example indicates that our intuition about inheritance of ergodicity, which fails in classical filtering models only in pathological cases, cannot be taken for granted in infinite dimension even under local nondegeneracy assumptions. This raises the question as to whether there are situations in which the inheritance of ergodicity is guaranteed. In view of the finite-dimensional theory, it is natural to conjecture that this might be the case under a symmetry breaking assumption. As will be discussed in section 4, the existing observability theory is not satisfactory in infinite dimension, and even in its simplest form the verification of this conjecture remains an open problem. However, we will show in section 4.2 that the problem can be resolved in translation-invariant models for a (not entirely natural) modification of the usual filtering process using tools from multidimensional ergodic theory. In section 4.3 the analogous problem will be considered in continuous time, where the method of section 4.2 can be extended to provide a full proof of the conjecture in the translation-invariant case.
In section 5 we turn our attention to the counterpart of the filter stability problem in the setting of Markov random fields. Such problems provide a simple setting for the investigation of decay of correlations in filtering problems, and are of interest in their own right as models that arise, for example, in image analysis. Here the natural question of interest is whether the spatial mixing properties of random fields are inherited by conditioning on local observations. A direct adaptation of the example of section 3 shows that the answer is negative in general. However, we will show in section 5.3 that the conditional random field inherits mixing properties of the underlying model when the latter possesses certain monotonicity properties. This provides an entirely different mechanism for conditional ergodicity than in the observability theory of section 4.
As will rapidly become evident, the above results repeatedly exploit connections between filtering in infinite dimension, the statistical mechanics of disordered systems [5], and multidimensional ergodic theory [10,21]. Some further connections with other probabilistic problems will be discussed in the final section 6 of this paper.
The consideration of filtering problems in infinite dimension is far from esoteric. It is a long-standing problem in applied probability to develop efficient algorithms to compute nonlinear filters in high-dimensional models (current algorithms require a computational effort exponential in the dimension, a severe problem in data assimilation applications). Improved understanding of the ergodic and spatial mixing properties of the filter may be essential for progress on such practical problems. In recent work [39,40], we have shown that local filtering algorithms can attain dimension-free approximation errors in models that exhibit conditional decay of correlations. The machinery developed in [39,40] to investigate high-dimensional filtering problems is complementary to the present paper: the former provides strong quantitative results under strong ('high-temperature') assumptions, while our aim here is to address fundamental issues that arise in infinite dimension. Regardless of any practical implications, however, the investigation of emergent phenomena that arise from conditioning is arguably of fundamental interest to the understanding of conditional distributions, and provides a compelling motivation for the investigation of conditional phenomena in probability theory.

Filtering in infinite dimension
The goal of this section is to set up the basic filtering problem that will be studied in the sequel. We begin by defining a general setting for nonlinear filtering and introduce and discuss the basic ergodicity question in section 2.1. We subsequently introduce our canonical infinite-dimensional filtering model in section 2.2.

Nonlinear filtering and ergodicity
Throughout this paper, we model dynamics with partial information as a Markov chain (X k , Y k ) k≥0 that has the additional property that its transition kernel factorizes as
$$\mathbf{P}[(X_k, Y_k) \in A \mid X_{k-1}, Y_{k-1}] = \int \mathbf{1}_A(x, y)\, P(X_{k-1}, dx)\, \Phi(X_{k-1}, x, dy)$$
for given transition kernels P and Φ: the factorization corresponds to the assumption that (X k ) k≥0 is a Markov chain in its own right, and that the observations (Y k ) k≥0 are conditionally independent given (X k ) k≥0 . Such models are frequently called hidden Markov models. While not essential for the development of our theory (see, e.g., [44] for a more general setting), the hidden Markov model setting is convenient mathematically and is ubiquitous in practice as a model of noisy observations of random dynamics.
For the time being, we assume that X k and Y k take values in an arbitrary Polish space (we will define a more concrete infinite-dimensional setting in section 2.2 below). The nonlinear filter is defined as the regular conditional probability
$$\pi_k := \mathbf{P}[X_k \in \cdot \mid Y_1, \ldots, Y_k].$$
We are interested in the question of whether (π k ) k≥0 inherits the ergodic properties of the underlying dynamics (X k ) k≥0 . There are several different but closely connected ways to make this question precise (cf. Remark 2.3 below). For concreteness, we will focus attention on one particularly elementary formulation of this question that will serve as the guiding problem to be investigated throughout this paper.
We will assume in the sequel that the Markov chain (X k ) k≥0 admits a unique invariant measure λ. As P[X k , Y k ∈ ·|X k−1 , Y k−1 ] does not depend on Y k−1 due to the hidden Markov structure, the invariant measure λ extends uniquely to an invariant measure for the chain (X k , Y k ) k≥0 , and we denote the unique stationary law of this process as P. By stationarity, we can assume in the sequel that (X k , Y k ) k∈Z is defined also for k < 0.
The ergodic property of (X k ) k≥0 that we will consider is stability in the sense that
$$\big|\mathbf{P}[X_k \in A \mid X_0] - \mathbf{P}[X_k \in A]\big| \to 0 \quad \text{in } L^1 \text{ as } k \to \infty$$
for every measurable set A: that is, the law of X k 'forgets' the initial condition X 0 as k → ∞. The analogous conditional property is filter stability in the sense that
$$\big|\mathbf{P}[X_k \in A \mid X_0, Y_1, \ldots, Y_k] - \mathbf{P}[X_k \in A \mid Y_1, \ldots, Y_k]\big| \to 0 \quad \text{in } L^1 \text{ as } k \to \infty$$
for every measurable set A: that is, the conditional distribution of X k given the observed data 'forgets' the initial condition X 0 as k → ∞. It is natural to suppose that stability of the underlying dynamics will imply stability of the filter. This conclusion is incorrect, however, as is illustrated by the following classical example [1].
Example 2.1. Let (X k ) k∈Z be an i.i.d. sequence of random variables with P[X k = 1] = P[X k = −1] = 1/2, and let Y k = X k X k−1 . This evidently defines a stationary hidden Markov model with P (x, ·) = (δ 1 + δ −1 )/2 and Φ(x ′ , x, ·) = δ xx ′ . Note that
$$X_k = X_0\, Y_1 Y_2 \cdots Y_k.$$
We can therefore easily compute for every k ≥ 0
$$\mathbf{P}[X_k = 1 \mid X_0, Y_1, \ldots, Y_k] = \mathbf{1}_{X_0 Y_1 \cdots Y_k = 1}, \qquad \mathbf{P}[X_k = 1 \mid Y_1, \ldots, Y_k] = \tfrac{1}{2}.$$
Thus the filter is certainly not stable. On the other hand, the underlying dynamics (X k ) k≥0 is an i.i.d. sequence, and is therefore stable in the strongest possible sense:
$$\mathbf{P}[X_k \in A \mid X_0] = \mathbf{P}[X_k \in A] \quad \text{for all } k \ge 1.$$
Moreover, even the process (X k , Y k ) k≥0 is stable in the strongest possible sense: it is a 1-dependent sequence, so that
$$\mathbf{P}[(X_k, Y_k) \in A \mid X_0, Y_0] = \mathbf{P}[(X_k, Y_k) \in A] \quad \text{for all } k \ge 2.$$
Example 2.1 shows that the inheritance of ergodicity under conditioning cannot be taken for granted. Nonetheless, the phenomenon exhibited here is very fragile: if the observations are perturbed by any noise (for example, if we set Y k = X k X k−1 ξ k with P[ξ k = −1] = 1 − P[ξ k = 1] = p and any 0 < p < 1), the filter will become stable. The inheritance of ergodicity is therefore apparently obstructed by the singularity of the observation kernel Φ. To rule out such singular behavior, it is natural to require that the observation kernel Φ possesses a positive density with respect to some reference measure ϕ. A model with this property is said to possess nondegenerate observations. One might now expect that nondegeneracy of the observations removes the obstruction to inheritance of ergodicity observed in Example 2.1. Unfortunately, this is still not the case in complete generality, as is demonstrated by an esoteric counterexample in [51]. However, the conclusion does hold if we use a stronger uniform notion of stability.
Theorem 2.2 ([49]). Suppose that the following hold.
1. The underlying dynamics is uniformly stable in the sense
$$\big\|\mathbf{P}[X_k \in \cdot \mid X_0] - \lambda\big\| \to 0 \quad \text{in } L^1 \text{ as } k \to \infty,$$
where ‖ · ‖ denotes the total variation norm.
2. The observations are nondegenerate in the sense
$$\Phi(x', x, dy) = g(x', x, y)\, \varphi(dy), \qquad g(x', x, y) > 0 \ \text{ for all } x', x, y.$$
Then the filter is uniformly stable in the sense
$$\big\|\mathbf{P}[X_k \in \cdot \mid X_0, Y_1, \ldots, Y_k] - \mathbf{P}[X_k \in \cdot \mid Y_1, \ldots, Y_k]\big\| \to 0 \quad \text{in } L^1 \text{ as } k \to \infty.$$
This result, together with the mathematical theory behind its proof (cf. section 6.1), provides a very general qualitative understanding of the inheritance of ergodicity in classical filtering models. However, as will be explained below, this theory breaks down completely in infinite-dimensional models. In the remainder of this paper, we will see that new phenomena arise in the infinite-dimensional setting.
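The fragility of Example 2.1 can be checked numerically. The following is a minimal sketch, not part of the original argument: it runs the exact two-state filter recursion for the noisy variant Y k = X k X k−1 ξ k along one simulated path, once started from the true X 0 and once from the stationary law, and reports the gap between the two. The function names `filter_step` and `stability_gap` and the seed are illustrative choices.

```python
import random


def filter_step(prior_plus, y, p):
    """One exact filter update for Example 2.1 with flip probability p.

    prior_plus: P[X_{k-1} = +1 | observations so far]
    y:          new observation Y_k in {-1, +1}
    returns     P[X_k = +1 | observations so far, Y_k = y]
    """
    def likelihood(x_prev, x):
        # P[Y_k = y | X_{k-1} = x_prev, X_k = x] for Y_k = X_k X_{k-1} xi_k
        return 1.0 - p if y == x * x_prev else p

    weight = {}
    for x in (+1, -1):
        # P(x_prev, x) = 1/2 for every pair, since (X_k) is i.i.d. symmetric
        weight[x] = 0.5 * (prior_plus * likelihood(+1, x)
                           + (1.0 - prior_plus) * likelihood(-1, x))
    return weight[+1] / (weight[+1] + weight[-1])


def stability_gap(p, k, seed=0):
    """|P[X_k=1 | X_0, Y_1..Y_k] - P[X_k=1 | Y_1..Y_k]| along one simulated path."""
    rng = random.Random(seed)
    x_prev = rng.choice((-1, 1))
    informed = 1.0 if x_prev == 1 else 0.0  # filter that knows X_0
    uninformed = 0.5                        # filter started at the stationary law
    for _ in range(k):
        x = rng.choice((-1, 1))
        xi = -1 if rng.random() < p else 1  # observation noise, flips with prob. p
        y = x * x_prev * xi
        informed = filter_step(informed, y, p)
        uninformed = filter_step(uninformed, y, p)
        x_prev = x
    return abs(informed - uninformed)
```

In this two-state model the update is affine in the prior with slope ±(1 − 2p), so the gap equals (1/2)|1 − 2p|^k on every path: it stays at 1/2 forever when p = 0 and decays geometrically for any 0 < p < 1, in agreement with the discussion above.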
Remark 2.3. The question of inheritance of ergodic properties under conditioning can be formulated in a number of different ways. For concreteness, we focus our attention in this paper on the elementary formulation introduced above. As the choice of problem is somewhat arbitrary, let us briefly describe a number of alternative formulations.
In the setting of stability of the filter, we have considered 'forgetting' of the initial condition X 0 under the stationary measure. Similar problems can be formulated, however, in a more general setting. Denote by P µ the law of the process (X k , Y k ) k≥0 with the initial distribution X 0 ∼ µ. A natural notion of stability is to require that
$$\mathbf{P}^\mu[X_k \in \cdot\,] - \mathbf{P}^\nu[X_k \in \cdot\,] \to 0 \quad \text{as } k \to \infty$$
in a suitable topology on probability measures. If we define the filter started at µ as π µ k := P µ [X k ∈ ·|Y 1 , . . . , Y k ], we can now investigate the general filter stability problem
$$\big|\pi_k^\mu(f) - \pi_k^\nu(f)\big| \to 0 \quad \text{as } k \to \infty \ \text{ in } L^1(\mathbf{P}^\gamma)$$
for a suitable class of measures µ, ν, γ and functions f . The formulation that we consider in this paper corresponds to the special case ν = λ and µ = γ = δ x for x outside a λ-null set. Nonetheless, our formulation proves to be equivalent in a rather general setting to stability for general initial measures µ, ν, γ, cf. [13,Chapter 12] and [49,45].
A different and perhaps more natural formulation dates back to Blackwell [1] and Kunita [28]. Using the Markov property of the underlying model, it is not difficult to show that the measure-valued stochastic process (π k ) k≥0 is itself a Markov chain, cf. [51, Appendix A]. One can now ask whether the ergodic properties of the Markov chain (X k ) k≥0 'lift' to ergodic properties of the Markov chain (π k ) k≥0 . For example, if (X k ) k≥0 admits a unique stationary measure, does (π k ) k≥0 admit a unique stationary measure also? Similarly, if (X k ) k≥0 converges to its stationary measure starting from any initial condition, does the same property hold for (π k ) k≥0 ? Remarkably, while these questions appear at first sight to be quite distinct from the question of filter stability, such properties again prove to be equivalent in a very general setting to the notion of filter stability that we consider in this paper, cf. [28,42,9,13,51].
A third formulation of inheritance of ergodicity under conditioning is obtained when we consider, rather than the filter, the conditional distribution of the entire process X = (X k ) k∈Z given the infinite observation sequence Y = (Y k ) k∈Z . Using the Markov property of the underlying model, it is not difficult to establish that X is still a Markov process under the conditional distribution P[ · |Y ], albeit time-inhomogeneous and with transition probabilities that depend on the realized observation sequence Y : that is, the conditional process is a Markov chain in a random environment. One can now ask whether the process X inherits its ergodic properties under P when it is considered under the conditional distribution P[ · |Y ]. Once again, this apparently distinct formulation proves to be equivalent in a general setting to the formulation considered in this paper, a fact that is exploited heavily in the theory of [49,45].
It is now well understood that the properties described above are equivalent in classical filtering models. While some of these arguments extend directly to the infinite-dimensional setting, others do not, and it remains to be investigated to what extent these equivalences remain valid in infinite dimension. Nonetheless, the problem formulation considered here is arguably the most elementary one, and provides a natural starting point for the investigation of conditional phenomena in infinite dimension.
Remark 2.4. Even when the underlying dynamics (X k ) k≥0 is not stable, it may be the case that the filter is stable. For example, using the trivial observation model Y k = X k , the filter is stable regardless of any properties of the underlying model. More generally, the filter is expected to be stable when the observations are 'sufficiently informative,' which is made precise in [47,48,50] in terms of nonlinear notions of observability. Such results are in some sense the opposite of Theorem 2.2: the latter shows that ergodicity is inherited by the filter, while the former show that the filter can be ergodic regardless of ergodicity of the underlying model (even without nondegeneracy). None of these results prove to be satisfactory in infinite dimension: it appears that a general theory for ergodicity of the filter will require both ergodicity of the underlying model and some form of observability, as will become evident in the following sections.

The infinite-dimensional model
The aim of this paper is to investigate conditional phenomena that arise in infinite dimension. So far, no assumptions have been made on the model dimension: we have set up our theory in any Polish state space. Nonetheless, while no explicit dimensionality requirements appear, for example, in Theorem 2.2, the assumptions of previous results can typically hold only in finite-dimensional situations. To understand the problems that arise in infinite dimension, and to provide a concrete setting for the investigation of conditional phenomena in infinite dimension, we presently introduce a canonical infinite-dimensional filtering model that will be used in the sequel.
The practical interest in infinite-dimensional filtering models stems from problems that have spatial in addition to dynamical structure. To model this situation, let us assume for concreteness that the spatial degrees of freedom are indexed by the infinite lattice Z d . We also define Polish spaces E and F that describe the state of the model at each spatial location. We now assume that X k and Y k are random fields that are indexed by Z d and take values locally in E and F , respectively, for every time k: that is,
$$X_k = (X_k^v)_{v \in \mathbb{Z}^d} \in E^{\mathbb{Z}^d}, \qquad Y_k = (Y_k^v)_{v \in \mathbb{Z}^d} \in F^{\mathbb{Z}^d}.$$
Each v ∈ Z d should be viewed as a single 'dimension' of the model. We now define a hidden Markov model that respects the spatial structure of the problem by assuming that both the underlying dynamics and the observations are local: that is, we assume that the transition and observation kernels P and Φ factorize as
$$P(x, dz) = \bigotimes_{v \in \mathbb{Z}^d} P^v(x, dz^v), \qquad \Phi(x, z, dy) = \bigotimes_{v \in \mathbb{Z}^d} \Phi^v(x, z, dy^v),$$
where P v (x, ·) and Φ v (x, z, ·) depend only on x w and z w for sites w that are neighbors of v. Such a model should be viewed as a hidden Markov model counterpart of probabilistic cellular automata [29] or interacting particle systems [30] that have been widely investigated in the literature as natural models of space-time dynamics. Alternatively, one might view such a model as an infinite collection (X v k , Y v k ) k≥0 of hidden Markov models whose dynamics and observations are locally coupled to their neighbors in Z d .
While problems of this type have been rarely considered in filtering theory, the infinite-dimensional model that we have formulated is in principle a special case of the general model described in the previous section. However, its structure is such that the assumptions of a result such as Theorem 2.2 typically cannot hold. Let us consider, for example, the setting where each local observation Y v has a positive density of the form Φ v (x, z, dy v ) = g(z v , y v ) ϕ(dy v ), so that the observations are locally nondegenerate. Choose two values e, e ′ ∈ E such that g(e, ·) ≠ g(e ′ , ·), and define the constant configurations z, z ′ as z v = e and z ′v = e ′ for all v ∈ Z d . Then the measures Φ(x, z, ·) and Φ(x, z ′ , ·) are two distinct laws of an infinite number of i.i.d. random variables, and are therefore mutually singular. This immediately rules out the possibility that the observations are nondegenerate in the sense of Theorem 2.2. It is precisely this problem that lies at the heart of the difficulties in infinite-dimensional models: probability measures in infinite dimension are typically mutually singular, even when they admit densities locally (that is, for any finite-dimensional marginal). In the absence of densities, classical results in filtering theory cannot be taken for granted, and the study of filtering in infinite dimension gives rise to fundamentally different problems than have been studied in the literature to date. We initiate the investigation of such problems in the sequel.
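The passage from local equivalence to global singularity can be made quantitative in a toy case. The sketch below is an illustration under assumptions that are not in the text: the local observation laws g(e, ·)ϕ and g(e ′ , ·)ϕ are taken to be Bernoulli(a) and Bernoulli(b) on {0, 1}, and the function name `tv_product_bernoulli` is hypothetical. It computes the exact total variation distance between the n-fold product laws, which are mutually absolutely continuous for every finite n.

```python
import math


def tv_product_bernoulli(a, b, n):
    """Total variation distance between Bernoulli(a)^n and Bernoulli(b)^n.

    The likelihood ratio of the two product laws depends only on the number
    of ones, so the distance equals the total variation distance between
    Binomial(n, a) and Binomial(n, b). Requires 0 < a, b < 1.
    """
    def log_pmf(k, q):
        # log of the Binomial(n, q) probability of k ones, computed in log
        # space to avoid underflow for large n
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(q) + (n - k) * math.log(1.0 - q))

    return 0.5 * sum(abs(math.exp(log_pmf(k, a)) - math.exp(log_pmf(k, b)))
                     for k in range(n + 1))
```

For example, comparing `tv_product_bernoulli(0.5, 0.6, n)` for n = 1, 10, 100, 400 shows the distance increasing from 0.1 toward 1: every finite-dimensional marginal has a positive density, yet the laws separate completely in the limit n → ∞, which is the mechanism described above.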
Remark 2.5. The singularity of measures in infinite dimension is problematic not only for the nondegeneracy of observations, but also for the ergodic theory of Markov chains. For example, the uniform stability property in Theorem 2.2 will rarely hold in infinite dimension: it is often the case that the law of X k is singular with respect to λ for all k < ∞, which rules out total variation convergence (see [45,Example 2.3] for a simple illustration). However, this issue is surmounted in [45] using a form of localization: by performing the analysis of Theorem 2.2 locally (that is, to finite-dimensional projections of the original model), we can avoid the singularity of the full infinite-dimensional problem. This allows us to extend the conclusion of Theorem 2.2 to a wide range of infinite-dimensional models with nondegenerate observations. In practice, this implies that much of the classical filtering theory extends, at least in spirit, to models where X k is infinite-dimensional but Y k is (effectively) finite-dimensional. It is only when the observations Y k are also infinite-dimensional that new phenomena arise.
Remark 2.6. Let us note that we have used the term 'infinite-dimensional' to denote the situation where there are infinitely many independent degrees of freedom, which is the key issue in our setting. The problem of dimension is unrelated to the linear algebraic or metric dimension of the state space: indeed, even each of the local state spaces E and F in our model can itself be an arbitrary Polish space. Conversely, it is possible to have infinite-dimensional systems that are 'effectively finite-dimensional' in the sense that only finitely many degrees of freedom carry significant information. This is common, for example, in stochastic partial differential equations (see, e.g., [45]).
At the same time, it should be noted that even in finite-dimensional systems where results such as Theorem 2.2 technically apply, the qualitative information contained in such statements may be misleading from the practical point of view: in finite but high-dimensional systems, phenomena that arise qualitatively in infinite dimension are still manifested in a quantitative fashion (see [39] for quantitative results and discussion on filtering in high dimension). For example, if the filter is not stable for the infinite-dimensional model, it will often still be the case that the filter is stable for every finite-dimensional truncation of the model; however, the quantitative rate of stability will vanish rapidly as the dimension is increased. Conversely, if the filter is stable for the infinite-dimensional model, then the rate of stability of the filter for the finite-dimensional models will be dimension-free. As it is ultimately the quantitative behavior of filtering algorithms that is of importance in practice, the qualitative phenomena investigated here in infinite dimension can still provide more insight into the behavior of practical filtering problems in high dimension than classical results in filtering theory.

Model and results
The goal of this section is to develop a simple example of the general infinite-dimensional setting of section 2.2 where we observe nontrivial behavior of the inheritance of ergodicity. This model, to be described presently, is a natural infinite-dimensional variation on Blackwell's counterexample (Example 2.1 above).
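Throughout this section, $X_k = (X_k^v)_{v \in \mathbb{Z}}$ and $Y_k = (\bar Y_k^v, \hat Y_k^v)_{v \in \mathbb{Z}}$, taking values in $\{-1,1\}^{\mathbb{Z}}$ and $(\{-1,1\} \times \{-1,1\})^{\mathbb{Z}}$ respectively, are binary random fields in one spatial dimension. We let $(X_k^v)$, $(\bar\xi_k^v)$, $(\hat\xi_k^v)$, $k, v \in \mathbb{Z}$, be independent random variables with
$$\mathbf{P}[X_k^v = 1] = \tfrac{1}{2}, \qquad \mathbf{P}[\bar\xi_k^v = -1] = \mathbf{P}[\hat\xi_k^v = -1] = p,$$
and define the observations
$$\bar Y_k^v = X_k^v X_{k-1}^v\, \bar\xi_k^v, \qquad \hat Y_k^v = X_k^v X_k^{v+1}\, \hat\xi_k^v.$$
This evidently corresponds to a model of the form discussed in section 2.2. In words, the underlying dynamics is of the simplest possible type: each time and each spatial location is an independent random variable. When p = 0, the observations reveal for each site whether its current state differs from its state at the previous time and from the states of its two neighbors at the present time. When p > 0, each observation is subject to additional noise that inverts the outcome with probability p. By symmetry, it will suffice to consider the case p ≤ 1/2, which we will do from now on.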
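For readers who wish to experiment, the model of this section is easy to sample on a finite space-time window. The sketch below assumes the observation structure used in the space-time reformulation later in this section, namely that each observation is the product of the two states on an edge times an independent sign that equals −1 with probability p: a 'time' observation Ȳ on the edge between (k − 1, v) and (k, v), and a 'space' observation Ŷ on the edge between (k, v) and (k, v + 1). The function name `simulate_window`, the finite window, and the dictionary layout are illustrative choices; only a truncation of the infinite model is sampled.

```python
import random


def simulate_window(num_times, num_sites, p, seed=0):
    """Sample X, Ybar, Yhat on times 0..num_times and sites 0..num_sites-1.

    Returns (X, Ybar, Yhat) where X[k][v] is the state, Ybar[k, v] is the
    observation on the time edge (k-1, v)-(k, v), and Yhat[k, v] is the
    observation on the space edge (k, v)-(k, v+1).
    """
    rng = random.Random(seed)

    def sign(prob_minus):
        # random sign, equal to -1 with probability prob_minus
        return -1 if rng.random() < prob_minus else 1

    # underlying field: independent symmetric signs at every time and site
    X = [[sign(0.5) for _ in range(num_sites)] for _ in range(num_times + 1)]

    Ybar, Yhat = {}, {}
    for k in range(1, num_times + 1):
        for v in range(num_sites):
            Ybar[k, v] = X[k][v] * X[k - 1][v] * sign(p)
            if v + 1 < num_sites:  # the rightmost site has no neighbor in the window
                Yhat[k, v] = X[k][v] * X[k][v + 1] * sign(p)
    return X, Ybar, Yhat
```

With p = 0 every observation is exactly the product of its two endpoint states, while for p > 0 roughly a fraction p of the observations are inverted.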
The model that we have constructed is evidently a direct extension of Example 2.1 to infinite dimension. As in Example 2.1, the process (X k , Y k ) k∈Z is ergodic in the strongest sense, so that even the uniform stability assumption of Theorem 2.2 is satisfied. When p = 0, it is easily seen by the same reasoning as in Example 2.1 that the filter is not stable. However, in Example 2.1 the addition of observation noise with error probability p > 0 would yield nondegenerate observations, and thus filter stability by Theorem 2.2.
In the present setting, on the other hand, nondegeneracy fails for any p. Nonetheless, the observations are locally nondegenerate when p > 0, and one might conjecture that this suffices to ensure inheritance of ergodicity. This is not the case.
Theorem 3.1. For the model of this section, there exist constants 0 < p_⋆ ≤ p^⋆ < 1/2 such that the filter is stable for p^⋆ < p ≤ 1/2 and is not stable for 0 ≤ p < p_⋆ .
Remark 3.2. We naturally believe that one can choose p_⋆ = p^⋆ in Theorem 3.1, but we did not succeed in proving that. The proof yields some explicit bounds on p_⋆ and p^⋆ .
Theorem 3.1 shows that local nondegeneracy does not suffice to ensure inheritance of ergodicity in infinite dimension: ergodicity of the filter undergoes a phase transition at a strictly positive signal to noise ratio of the observations. Remarkably, the underlying model does not seem to exhibit any qualitative change in behavior: (X v k , Y v k ) k,v∈Z is a one-dependent random field for every value of the error probability p. Thus it is evidently possible in infinite dimension that complex ergodic behavior emerges in an otherwise trivial model when we consider its conditional distributions.
The remainder of this section is devoted to the proof of Theorem 3.1. The proof relies on standard tools from statistical mechanics [5,21]: a Peierls argument for the low noise regime and a Dobrushin contraction method for the high noise regime.

Proof of Theorem 3.1: low noise
We begin by noting that as ( , and we therefore have To prove that the filter is not stable, it therefore suffices to show that To show this, we begin by reducing the problem to finite dimension.
Proof. Let β := log (1 − p)/p > 0. We begin by noting that Define the probability measure Q such that Then under Q, the observations Ŷ m ℓ and Ŷ −m−1 ℓ , 1 ≤ ℓ ≤ k are symmetric Bernoulli and independent from all the remaining variables in the model, while the remainder of the model is the same as defined above. In particular, this implies that We therefore obtain using the Bayes formula for any A ∈ σ{X 0 , Y 1 , . . . , Y k , X v 1 , . . . , X v k : |v| ≤ m}. Define Z 0 := (X 0 1 , . . . , X 0 k ) and Z −m := (X m 1 , . . . , X m k , X −m 1 , . . . , X −m k ) for m ≥ 1. Due to the conditional independence structure of the infinite-dimensional filtering model, for every m ≥ 0. Thus (Z m ) m≤0 is a Markov chain under any regular version of the conditional distribution P[ · |X 0 , Y 1 , . . . , Y k ] (almost surely with respect to the realization of X 0 , Y 1 , . . . , Y k ). Moreover, the above estimate shows that the (random) transition kernels of this Markov chain satisfy the Doeblin condition [34,Theorem 16.2.4], so for all m ≥ 0. This completes the proof.
Lemma 3.3 reduces our problem to a finite-dimensional one. Indeed, it is clear that the filter is not stable for p = 0 (for precisely the same reason as in Example 2.1), so we will assume without loss of generality in the sequel that 0 < p ≤ 1/2. Applying Lemma 3.3, it follows that in order to prove that the filter is not stable, it suffices to show that inf k,m≥1 But the conditional independence structure of the infinite-dimensional filtering model implies that the conditional expectation inside this expression depends only on X v ℓ and Y v ℓ for 0 ≤ ℓ ≤ k and |v| ≤ m + 1. We are thus faced with the problem of obtaining a lower bound on this finite-dimensional quantity that is uniform in k, m.
To lighten the notation, it will be convenient to view (X v k ) k,v∈Z not as a sequence of spatial random fields on Z, but rather as a single space-time random field on Z 2 . To this end, we will write X q := X v k for q = (k, v) ∈ Z 2 . We will similarly write Y qr := Ȳ v k and ξ qr := ξ̄ v k if q = (k − 1, v) and r = (k, v), and Y qr := Ŷ v k and ξ qr := ξ̂ v k if q = (k, v) and r = (k, v + 1) (the order of the indices q, r is irrelevant, that is, Y qr := Y rq etc.). In this manner, we can view X = (X q ) q∈Z 2 as a random field on the lattice Z 2 , with observations Y qr attached to each edge {q, r} ⊂ Z 2 with |q − r| = 1.
Proof. By the conditional independence structure of the filtering model, we have The joint distribution of the random variables that appear in this expression is where |A| denotes the cardinality of a set A. The result now follows readily from the Bayes formula and the fact that Y qr = X q X r ξ qr by construction.

Lemma 3.4 shows that the conditional distribution
has a familiar form in statistical mechanics: it is (up to the change of variables or gauge transformation σ q = x q z q ) an Ising model with random interactions, also known as a random bond Ising model or an Ising spin glass, with inverse temperature β = log (1 − p)/p. The failure of stability of the filter for large β can now be addressed using a standard method in statistical mechanics [5, section 6.4]. For concreteness, we include the requisite arguments in the present setting, which completes the proof.
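The correspondence with a random bond Ising model can be checked by brute force on a small graph. The following Python sketch is only an illustration under stated assumptions: it takes a uniform prior on the spins, assumes each bond observation is y = x_q x_r ξ with ξ = −1 with probability p independently over edges, and uses a four-site cycle instead of the space-time lattice. The function names `bayes_posterior` and `ising_gibbs` are hypothetical, and the factor 1/2 in the exponent is a normalization convention of this sketch that may differ from the one used in Lemma 3.4.

```python
import itertools
import math


def bayes_posterior(edges, y, n_sites, p):
    """Posterior of x in {-1, 1}^n given bond observations, uniform prior on x.

    Assumed model: y_e = x_q * x_r * xi_e on each edge e = (q, r), where
    xi_e = -1 with probability p, independently over edges.
    """
    weights = {}
    for x in itertools.product((-1, 1), repeat=n_sites):
        w = 1.0
        for (q, r), y_e in zip(edges, y):
            # The observation agrees with x_q * x_r unless the noise flipped it.
            w *= (1 - p) if y_e == x[q] * x[r] else p
        weights[x] = w
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def ising_gibbs(edges, y, n_sites, beta):
    """Gibbs measure with weights exp((beta / 2) * sum_e y_e * x_q * x_r)."""
    weights = {}
    for x in itertools.product((-1, 1), repeat=n_sites):
        energy = sum(y_e * x[q] * x[r] for (q, r), y_e in zip(edges, y))
        weights[x] = math.exp(0.5 * beta * energy)
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


# Four sites arranged in a cycle, one observed bond per edge (illustrative data).
edges = [(0, 1), (1, 3), (3, 2), (2, 0)]
y = (1, -1, 1, 1)
p = 0.2
beta = math.log((1 - p) / p)

posterior = bayes_posterior(edges, y, 4, p)
gibbs = ising_gibbs(edges, y, 4, beta)
max_gap = max(abs(posterior[x] - gibbs[x]) for x in posterior)
```

In this toy setting the two measures coincide up to rounding, and both are invariant under the global flip x → −x, which is the observation symmetry discussed in the text.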
Proof. Let us fix k, m ≥ 1 throughout the proof, and define 0 := (k, 0) ∈ J. We will prove below the following claim: there exists an absolute constant 0 < $p_\star$ < 1/2 such that whenever 0 < p < $p_\star$ : that is, when the noise is sufficiently small, the conditional distribution assigns a large probability to the actually realized value of X 0 k at least half of the time (recall Lemma 3.4). Let us complete the proof assuming this claim. Note that Σ x ({z : where we have used Lemma 3.4 and the fact that {X q } and {ξ qr } are independent. The proof is now completed by a straightforward estimate. It remains to prove the above claim. To this end, we use a Peierls argument. Fix for the time being a configuration z ∈ {−1, 1} J . For any J ′ ⊆ J, define the boundary edges We will denote the set of contours as C z,x (note that the definition of a contour depends on the given configurations z and x). If z 0 = −x 0 , then there must exist a contour J ′ ∈ C z,x such that 0 ∈ J ′ : indeed, construct J ′ by choosing the maximal connected subset of J such that 0 ∈ J ′ and z q = −x q for all q ∈ J ′ , and then 'fill in the holes' to make J ′ simply connected. Thus Now note that, by the definition of a contour, x q z q = −1 whenever {q, r} ∈ EJ ′ with q ∈ J ′ , and x q x r z q z r = −1 if in addition r ∈ J\J ′ . Thus the existence of a contour implies the presence of many such edges. The basic idea of the proof is that the probability that this occurs is small under Σ x due to Lemma 3.4. Let us make this precise.
Proof. Assume without loss of generality that J ′ is simply connected. Let us use for simplicity the convention that z r = x r for r ∈ ∂J. Define the events Then we evidently have by Lemma 3.4 An elementary computation shows that .
But the ratio in this expression is unity, as the exponential term inside the sums is invariant under the transformation z q → −z q for all q ∈ J ′ . The proof is complete.
Lemma 3.6 allows us to estimate Using a standard combinatorial result [21, Lemma 6.13] as well as the simple bound But we can now evidently choose $p_\star$ > 0 sufficiently small such that c 1 ≤ 1/4 and c 2 ≤ 1/2 whenever p ≤ $p_\star$ , which readily yields the desired estimate.

Proof of Theorem 3.1: high noise
We now turn to proving that the filter is stable when the noise is strong. We begin by noting that it suffices to prove stability of finite-dimensional marginals of the filter.
for every function f and every m ≥ 1. Then the filter is stable.

Proof. Fix any measurable subset
We can estimate By stationarity the first term does not depend on k, and the assumption gives Letting m → ∞ and using the martingale convergence theorem concludes the proof.
We will in fact prove a much stronger pathwise bound than is required by the above lemma. The basic tool we will use for this purpose is the Dobrushin comparison theorem [21,Theorem 8.20], which we state here in a convenient form.
and assume that $\sup_{j\in I} \sum_{i\in I} C_{ji} < 1$.
Then $D := \sum_{n=0}^{\infty} C^n$ exists (in the sense of matrix algebra), and We will apply this result pathwise to compare the filters with and without conditioning on the initial condition. To this end, we must compute the quantities that arise in the Dobrushin comparison theorem for suitably chosen regular conditional probabilities.  [52, p. 95-96] or [45, Lemma 3.4]. That each statement in the Lemma holds for P-a.e. (x, y) can therefore be read off from Lemma 3.4. As there are countably many statements, they can be assumed to hold simultaneously on a set A of unit measure.
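The role of the Dobrushin condition (every row sum of C strictly less than one) can be illustrated numerically: under this condition the series of matrix powers converges, and its sum inverts I − C. The Python sketch below uses a small nonnegative matrix chosen only for illustration. The helper names `mat_mul` and `neumann_series` are hypothetical, and the finite matrix merely stands in for the infinite index set I of the theorem.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neumann_series(C, tol=1e-14, max_terms=10_000):
    """Sum the series D = I + C + C^2 + ... until the terms are negligible."""
    n = len(C)
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # C^0
    D = [row[:] for row in term]
    for _ in range(max_terms):
        term = mat_mul(term, C)  # next power of C
        D = [[D[i][j] + term[i][j] for j in range(n)] for i in range(n)]
        if max(abs(v) for row in term for v in row) < tol:
            return D
    raise RuntimeError("series did not converge; check the row-sum condition")


# Illustrative matrix with all row sums at most 0.5 (an assumption of this sketch).
C = [[0.1, 0.3, 0.0],
     [0.2, 0.1, 0.2],
     [0.0, 0.4, 0.1]]
assert max(sum(row) for row in C) < 1  # Dobrushin-type condition

D = neumann_series(C)
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
I_minus_C = [[identity[i][j] - C[i][j] for j in range(3)] for i in range(3)]
check = mat_mul(D, I_minus_C)  # should be close to the identity matrix
```

The smaller the row sums of C, the faster the series converges, which is the sense in which a small interaction matrix yields strong control on the comparison between the two measures.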

Then there is a set
We can now complete the proof of filter stability for p > $p^\star$ .
by Theorem 3.8 provided that the condition on the matrix C is satisfied. We proceed to estimate the matrix C using Lemma 3.9. Evidently On the other hand, note that by Lemma 3.9 It follows readily that We can now evidently choose 0 < $p^\star$ < 1/2 such that 4e tanh(4β) < 1/2 for $p^\star$ < p ≤ 1/2. Then the condition of Theorem 3.8 is satisfied. Moreover, as $\|\cdot\|_*$ is a matrix norm Thus we obtain As our estimates are valid for P-a.e. (x, y), the proof is complete.
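To get a feeling for how restrictive the condition 4e tanh(4β) < 1/2 is, one can solve it explicitly for p using β = log((1 − p)/p). The following Python sketch does this; the function names are hypothetical, and the computation only reflects the inequality stated in the proof above, not an optimized constant.

```python
import math


def beta_of(p):
    """Inverse temperature beta = log((1 - p) / p) for flip probability p."""
    return math.log((1 - p) / p)


def dobrushin_bound(p):
    """Left-hand side 4e * tanh(4 * beta) of the condition used in the proof."""
    return 4 * math.e * math.tanh(4 * beta_of(p))


def admissible_p_threshold():
    """Value of p at which 4e * tanh(4 * beta) equals 1/2 exactly.

    Solving tanh(4 * beta) = 1 / (8e) gives beta = atanh(1 / (8e)) / 4,
    and inverting beta = log((1 - p) / p) gives p = 1 / (1 + exp(beta)).
    """
    beta_max = math.atanh(1 / (8 * math.e)) / 4
    return 1 / (1 + math.exp(beta_max))


p_threshold = admissible_p_threshold()
```

Under these assumptions the threshold comes out at roughly 0.497, so this particular estimate only covers flip probabilities very close to 1/2.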
Remark 3.11. It is natural to conjecture that one can choose $p_\star = p^\star$ in Theorem 3.1.
While we certainly believe this to be true, we were not able to prove this fact using standard methods. The difficulty can be seen in Lemma 3.4, as we presently explain. Lemma 3.4 shows that the conditional distribution of X 1 , . . . , X n given Y 1 , . . . , Y n can be viewed as an Ising model in the spin variables σ q := x q z q with independent random interactions ξ qr . An Ising model is called ferromagnetic if all the interactions are positive. In the ferromagnetic case, it is standard to establish the existence of a unique phase transition point by monotonicity arguments [21, p. 100]. Unfortunately, while our model is 'ferromagnetic on average' as P[ξ qr = 1] > P[ξ qr = −1], there are always infinitely many interactions of either sign. Thus correlation inequalities cannot be used, and in their absence it is not clear how to prove the existence of a simple phase boundary. Monotonicity arguments will play a central role in section 5.3 below to rule out conditional phase transitions in a somewhat different context.
While we have made no attempt to optimize the estimates for $p_\star$ and $p^\star$ that can be extracted from the proof of Theorem 3.1, the methods used here are not expected to yield realistic values of these constants. On the other hand, the strong control provided by the Dobrushin comparison theorem has proved to be a powerful tool for the quantitative analysis of filtering algorithms in high dimension, cf. [39,40].

Symmetry breaking and observability
Theorem 3.1 shows that inheritance of ergodicity under conditioning cannot be taken for granted in infinite dimension even when the model is locally nondegenerate. Are such phenomena prevalent in infinite dimension, or are they restricted to some carefully constructed examples? We would like to understand in what situations such phenomena can be ruled out, both from the mathematical perspective and in view of the importance of filter stability (as well as spatial decay of correlations in infinite dimension) for the performance of practical filtering algorithms [39].
It is not difficult to understand the mechanism that causes the filter to be unstable in Theorem 3.1. In this model, the observations possess a global symmetry: the conditional law of Y is unchanged under the transformation X → −X. This symmetry renders the filter trivially unstable in the absence of observation noise, in precise analogy with Example 2.1. In the finite-dimensional case, however, Theorem 2.2 shows that the addition of any observation noise suffices to ensure that ergodicity of the underlying model is not broken by the additional symmetry introduced by conditioning. The surprise in infinite dimension is that the qualitative effect of the added symmetry still persists in the presence of observation noise. Thus local nondegeneracy in itself does not suffice to ensure the inheritance of ergodicity under conditioning.
On the other hand, the phenomenon exhibited in Theorem 3.1 evidently cannot arise in models that do not possess observation symmetries. It seems natural to conjecture that the presence of such symmetries is the only possible obstruction to inheritance of ergodicity under conditioning: that is, inheritance of ergodicity is ensured once observation symmetries are ruled out. It is not entirely obvious, however, how such a principle can be rigorously formulated. On the other hand, even in the absence of a general definition, this intuitive notion should certainly be satisfied in many elementary observation models. For example, let us state the following simple conjecture, which encapsulates the essence of the above intuition in the simplest possible setting.
If the underlying process (X k ) k∈Z is stable, then the filter is stable.
The idea behind this conjecture is that the direct observation structure does not possess any observation symmetries (except when p = 1/2, a case that is trivial as then Y ⊥⊥ X; we will therefore assume p ≠ 1/2 in the sequel). Thus any mechanism of the type exhibited by Theorem 3.1 is ruled out, and it seems hard to imagine another mechanism by which ergodicity of the underlying process could be obstructed due to conditioning on such informative observations. Despite the seemingly obvious nature of this conjecture, we were not able to prove such a result in a general setting.
The idea that stability of the filter is related to the absence of symmetries is not new in the infinite-dimensional setting. It arises already in classical filtering models for a somewhat different reason: it may happen that the filter is stable even when the underlying model is not ergodic. In such situations, stability properties can emerge under the conditional distribution due to the informative nature of the observations; in essence, the filter will 'forget' its initial distribution as the information contained therein is superseded by the information in the observations. This phenomenon was made precise in the papers [47,48,50]. While the theory developed in these papers is closely related to the symmetry breaking properties that we aim to exploit here, these results are not satisfactory in infinite dimension as will be explained below.
In this section, we will aim to extend such observability arguments to translation-invariant systems in infinite dimension by exploiting a technique from multidimensional ergodic theory [10]. Somewhat surprisingly, the problem proves to be more tractable in the continuous-time setting, for which we will establish validity of the natural analogue of Conjecture 4.1. In its original discrete-time formulation, however, our ultimate result falls short of establishing Conjecture 4.1 even for translation-invariant models. Nonetheless, the theory developed here provides one possible mechanism for symmetry breaking in conditional ergodic theory. An entirely different mechanism will be discussed in the context of conditional random fields in section 5.3 below.
The remainder of this section is organized as follows. In section 4.1 we will recall some basic ideas from the general observability theory developed in [47,48,50], and explain why these results are not satisfactory in infinite dimension. We also outline a slightly stronger entropic formulation that will provide the basis for a partial extension to translation-invariant systems in infinite dimension, which is developed in section 4.2. In section 4.3 we consider the continuous-time analogue of Conjecture 4.1, and provide a complete proof in this setting for the translation-invariant case.

General observability theory
Let (X k , Y k ) k≥0 be a general hidden Markov model as in section 2.1, and denote by P µ the law of the model with initial distribution X 0 ∼ µ. Define the prediction filter (note that we are conditioning the state at time k only on observations prior to time k). The basic observation behind the observability theory of [47,48,50] is as follows. Theorem 4.2 ([48,50]). Suppose that X k takes values in a compact metric space and that (X k , Y k ) k≥0 is Feller. Suppose that the following observability assumption holds: Then we have for every bounded and continuous function f. Note that no ergodicity or nondegeneracy assumptions are imposed in this result: the only assumption is that of observability, that is, that distinct initial laws give rise to distinct observation laws. The latter can evidently be viewed as a very mild type of symmetry breaking assumption. A more general form of the result that does not require compactness or Feller assumptions can be found in [50].
There is nothing in Theorem 4.2 that prohibits its application in infinite-dimensional systems. In fact, it is readily verified that the observability assumption of Theorem 4.2 is automatically satisfied in the setting of Conjecture 4.1 (provided p ≠ 1/2). Nonetheless, unlike in most finite-dimensional problems, the conclusion of Theorem 4.2 is not satisfactory in the infinite-dimensional setting, as we will presently explain.
There are two ways in which the conclusion of Theorem 4.2 falls short of the filter stability property as it was introduced in section 2.1. On the one hand, the conclusion applies to the prediction filterπ k rather than to the filter π k . This is, however, not a major issue: one could argue that the prediction filter is of equal interest in practice as the filter itself, and thus we would be quite content in first instance to resolve a variant of Conjecture 4.1 for the prediction filter. 3 Much more serious, however, is the fact that stability is only obtained for initial measures µ, ν that give rise to absolutely continuous observation laws.
If the state space of X k is countable, this is not a restriction: then δ x ≪ λ whenever λ({x}) > 0, so choosing µ = δ X0 and ν = λ in Theorem 4.2 yields the natural counterpart of the stability property of section 2.1 for the prediction filter. In a continuous state space, this argument does not work, as a point mass is singular with respect to any nonatomic measure. Nonetheless, this proves to be a minor problem in most finite-dimensional models: when the hidden Markov model possesses transition and observation densities, one can generally deduce absolute continuity of the observation even when µ and ν are mutually singular. This is discussed in detail in [13, section 11.5]. In infinite dimension, on the other hand, the existence of transition and observation densities is out of the question, and thus the conclusion of Theorem 4.2 is severely limited. In particular, there is no hope to deduce the very natural filter stability property as introduced in section 2.1, or its prediction filter counterpart, by using the result or the method of proof of Theorem 4.2.
by the Bayes formula, and thus the difference between µ and ν is entirely determined by the marginal on I). Thus the stability property for absolutely continuous initial measures does not capture the dissipation of an initial error in infinitely many coordinates, and is therefore of little relevance to the analysis of any reasonable approximation method that might arise in practice.
The proof of Theorem 4.2 in [48,50] relies on a classical martingale argument due to Blackwell and Dubins [2]. Instead of explaining this approach, let us present an alternative proof of stability, using the notion of entropy, in a special case: the finite-dimensional counterpart of Conjecture 4.1. The key mechanism behind the proof is equivalent to that of [48,50]. Nonetheless, the entropic formulation will be crucial to developing a nontrivial infinite-dimensional extension in section 4.2 below.
Then the prediction filter is stable in the sense that for every set A, provided that p ≠ 1/2. Proof. We use standard ideas from information theory [12]. Let H(X) denote entropy and H(X|Y ) denote conditional entropy of discrete random variables X, Y . Then where we have used the chain rule for entropy and symmetry of mutual information H(X) − H(X|Y ) = H(Y ) − H(Y |X) =: I(X; Y ). As H(X 0 ) ≤ r is independent of n, which is precisely the expected relative entropy between the conditional distributions . Thus by Pinsker's inequality for every function g. Now note that, by the conditional independence structure of the observations, we have Thus we can write using the tower property . . , x r ). But it is readily seen that if p ≠ 1/2, then each operator T i is invertible. Thus every function f is of the form f (x) = E[g(Y k )|X k = x] for some function g, and the proof is evidently complete.
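The invertibility step at the end of the proof can be made concrete for a single coordinate. The Python sketch below assumes the direct observation structure Y = Xξ with ξ = −1 with probability p (an assumption of this illustration, as is the name `solve_for_g`). In that case the operator g ↦ E[g(Y)|X = ·] is the 2×2 matrix with diagonal entries 1 − p and off-diagonal entries p, whose determinant 1 − 2p vanishes exactly when p = 1/2.

```python
def conditional_expectation(g, p):
    """Return (E[g(Y) | X = -1], E[g(Y) | X = +1]) for Y = X * xi.

    g is given as the pair (g(-1), g(+1)); xi = -1 with probability p.
    """
    g_minus, g_plus = g
    return ((1 - p) * g_minus + p * g_plus,
            p * g_minus + (1 - p) * g_plus)


def solve_for_g(f, p):
    """Find g with E[g(Y) | X = x] = f(x) for x in {-1, +1}.

    f is given as the pair (f(-1), f(+1)). The linear system is solvable
    for every f exactly when the determinant 1 - 2p is nonzero.
    """
    det = 1 - 2 * p
    if abs(det) < 1e-12:
        raise ValueError("p = 1/2: observations carry no information about X")
    f_minus, f_plus = f
    g_minus = ((1 - p) * f_minus - p * f_plus) / det
    g_plus = ((1 - p) * f_plus - p * f_minus) / det
    return (g_minus, g_plus)


# Illustrative values: an arbitrary target function f and noise level p.
f = (2.0, -1.0)
p = 0.3
g = solve_for_g(f, p)
recovered = conditional_expectation(g, p)  # should reproduce f
```

For r coordinates with independent noise, the same inversion can be applied coordinate by coordinate, which is one way to read the statement that each operator T i is invertible.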
Let us emphasize the two key ideas of the proof. We first establish that the conditional distribution of the next observation Y k given the observation history Y 1 , . . . , Y k−1 'forgets' the initial condition X 0 as k → ∞. That is, unlike the prediction filter which aims to predict the next hidden state X k , stability of predicting the next observation holds regardless of any properties of the model: it is a simple consequence of the finite entropy property H(X 0 ) < ∞ of the initial condition (this fails in the continuous case, and for this reason the absolute continuity assumption is essential in the general setting of Theorem 4.2). However, due to the absence of symmetry in the observations, predicting the next observation Y k is equivalent to predicting the hidden state X k . It is here that observability enters the picture.
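The first of these two ideas rests on a standard information-theoretic computation, which we sketch here in LaTeX for convenience. This is a generic identity written in the notation of the text (assuming the amsmath package for the align environment) and is not claimed to reproduce the displayed equations of the proof term by term.

```latex
\begin{align*}
\sum_{k=1}^{n} I(X_0; Y_k \mid Y_1,\dots,Y_{k-1})
  &= I(X_0; Y_1,\dots,Y_n) \\
  &= H(X_0) - H(X_0 \mid Y_1,\dots,Y_n) \;\le\; H(X_0), \\[4pt]
I(X_0; Y_k \mid Y_1,\dots,Y_{k-1})
  &= \mathbf{E}\Big[ D\big( \mathbf{P}[Y_k \in \cdot \mid X_0, Y_1,\dots,Y_{k-1}]
     \,\big\|\, \mathbf{P}[Y_k \in \cdot \mid Y_1,\dots,Y_{k-1}] \big) \Big].
\end{align*}
```

Since every term of the sum is nonnegative and the bound H(X 0 ) does not depend on n, the terms must converge to zero as k → ∞, which is the sense in which prediction of the next observation forgets the initial condition.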

Translation-invariant systems
The proof of Proposition 4.4 relies crucially on the fact that X k and Y k take values in a finite state space {−1, 1} r with r < ∞, so that the entropies H(Y 1 , . . . , Y n ) and H(X 0 ) are finite. In the infinite-dimensional setting of Conjecture 4.1 where X k and Y k take values in {−1, 1} Z , this is no longer the case. Nonetheless, entropy arguments are ubiquitous in the study of infinite-dimensional systems that are translation-invariant: in such systems, the role of entropy is replaced by that of the entropy rate (or 'specific entropy'), which plays a central role in the ergodic theory of measurable dynamical systems, cf. [23, Part 2], and in statistical mechanics where it gives rise to thermodynamic formalism, cf. [21]. The aim of this section is to develop an infinite-dimensional counterpart to Proposition 4.4 using this formalism.
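As a one-dimensional illustration of the entropy rate, consider a stationary two-state Markov chain, for which the block entropies can be enumerated directly. The Python sketch below is a toy example under assumptions that are not part of the model of this section: the chain, its transition probabilities a and b, and all function names are hypothetical. It compares the normalized block entropy H(Z 1 , . . . , Z n )/n with the conditional entropy of one symbol given the past.

```python
import itertools
import math


def stationary_law(a, b):
    """Stationary distribution of the chain with P[0 -> 1] = a, P[1 -> 0] = b."""
    return (b / (a + b), a / (a + b))


def block_entropy(n, a, b):
    """Entropy (in nats) of n consecutive symbols of the stationary chain."""
    pi = stationary_law(a, b)
    P = ((1 - a, a), (b, 1 - b))
    H = 0.0
    for path in itertools.product((0, 1), repeat=n):
        prob = pi[path[0]]
        for s, t in zip(path, path[1:]):
            prob *= P[s][t]
        if prob > 0:
            H -= prob * math.log(prob)
    return H


def entropy_rate(a, b):
    """Conditional entropy of the next symbol given the current one."""
    pi = stationary_law(a, b)
    P = ((1 - a, a), (b, 1 - b))
    return -sum(pi[s] * P[s][t] * math.log(P[s][t])
                for s in (0, 1) for t in (0, 1) if P[s][t] > 0)


a, b = 0.2, 0.4  # illustrative transition probabilities
h = entropy_rate(a, b)
normalized = [block_entropy(n, a, b) / n for n in range(1, 9)]
increment = block_entropy(8, a, b) - block_entropy(7, a, b)
```

For a Markov chain the increment of the block entropy equals the entropy rate exactly, while the normalized block entropy decreases to it as n grows. The formula in terms of the lexicographic order used below plays the analogous role for random fields on Z 2 .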
Let (X k , Y k ) k∈Z be a stationary infinite-dimensional hidden Markov model as in section This model is the same as in Conjecture 4.1. In addition, however, we will assume throughout this section that the model is translation-invariant: that is, that the local transition kernels P v (x, dz) = P w (x, dz) for all v, w ∈ Z (cf. section 2.2), and that We now state the main result of this section.  Remark 4.6. Note that Theorem 4.5 does not impose any ergodicity assumption on the underlying hidden Markov model: only observability was used to establish the result. In this sense, this result goes beyond the spirit of Conjecture 4.1, which states that ergodicity of the underlying model is inherited by the filter in models with informative observations. Indeed, the result of Theorem 4.5 cannot be interpreted as establishing the inheritance of ergodicity, as ergodicity plays no role in the argument; rather, the intermediate filter is rendered stable here entirely due to the absence of observation symmetries, even when the underlying model is not ergodic. One might therefore expect that ergodicity should play no role in Conjecture 4.1 either: after all, the mechanism by which we are exploiting the absence of observation symmetries appears to be independent of the ergodic properties of the model. However, it is possible that ergodicity must nonetheless enter the picture in order to extend the conclusion of Theorem 4.5 from the intermediate filter to the filter, so that neither ergodicity nor observability suffices by itself to ensure stability of the filter in infinite-dimensional models. Some evidence for this possibility will be discussed in section 6.3. On the other hand, the continuous-time results of section 4.3 below could be viewed as evidence to the contrary. New ideas appear to be needed to resolve these questions.
We now turn to the proof of Theorem 4.5. To this end we begin by proving the following result, which replaces the key step in the proof of Proposition 4.4.  The key to the proof of Proposition 4.7 is that the above ideas admit a multidimensional extension. Let (Z q ) q∈Z 2 be a translation-invariant random field such that Z q takes values in a finite set. In this setting, the entropy rate (or 'specific entropy') h(Z) can be expressed in terms of the lexicographic order ≺ on Z 2 [17,10]: where B n is the centered box in Z 2 with radius n and Z Bn = {Z q : q ∈ B n }, Z ≺q = {Z u : u ≺ q}. The random field analogue to the above entropy identity was obtained by Conze [10, eq. (20) for any v ∈ Z. The proof of Proposition 4.7 follows from this identity.
Remark 4.9. By arguing as in the previous remark, the result of Proposition 4.7 can be rewritten in terms of conditional entropy rates. If we define then the conclusion of Proposition 4.7 can be expressed as follows: This could be interpreted as a direct extension of the key step in the proof of Proposition 4.4 to translation-invariant systems in infinite dimension; the finite-dimensional notion of entropy is simply replaced by its infinite-dimensional counterpart, the entropy rate.
Proof of Proposition 4.7. By stationarity, it suffices to show that First, we note that where we have used the identity of Conze [10, eq. (20)] (cf. Remark 4.8 above). As But by the hidden Markov model structure, and the proof is complete.
Proposition 4.7 concerns the stability of prediction of the next observation. As in the proof of Proposition 4.4, we will transform such properties into stability properties of the filter by using the informative nature of the observations.
Proof of Theorem 4.5. By translation-invariance and Lemma 3.7, it suffices to show that in L 1 as k → ∞ for every set B, m ≥ 1 and v ∈ Z. Suppose first that v ≤ 1. By the chain rule for entropy and Proposition 4.7, we have Following verbatim the second part of the proof of Proposition 4.4 yields in L 1 as k → ∞ for every set C, and thus the result follows. Now suppose that v > 1. We can assume without loss of generality that v ≤ m (otherwise the conclusion follows from the result for m = v). We may also assume that 0 < p < 1 (otherwise the conclusion is trivial). By the Bayes formula, satisfies the analogous expression. As 0 < inf g ≤ sup g < ∞, it follows readily that for all x, and thus the proof is complete. However, we do not know how to establish the validity of Theorem 4.5 in either limit.
To obtain some insight into this idea, let us rewrite Theorem 4.5 in a measure-theoretic manner. Using the Markov property and translation invariance, we obtain as in the proof of Proposition 4.7. Letting k → ∞ and using Theorem 4.5 yields where we defined the σ-fields X k = σ{X k , X k−1 , . . .} and Y v − = σ{Y <v 0 , Y −1 , Y −2 , . . .}. Now let us attempt to take the limit as v → −∞. This yields This does not suffice to establish Conjecture 4.1 for the prediction filter. In order to deduce the latter, we would need to establish the identity . .} mod P (the notation F = G mod P indicates that the P-completions of the σ-fields F and G coincide). Indeed, if this is the case, then we obtain by Jensen's inequality In order to establish Conjecture 4.1, we would now need the identity the remainder of the argument proceeding in the same manner as for v → −∞. Neither of the above measure-theoretic identities appears to be obvious; indeed, the problem of establishing such identities is closely related to the filter stability problem itself (cf. section 6.1). Nonetheless, the conclusion of Theorem 4.5 appears to be tantalizingly close to establishing Conjecture 4.1 for translation-invariant systems, and the fact that the latter does not appear to follow directly from the former provides one more indication of the delicacy of the filter stability problem in infinite dimension. A very similar argument will be used in the following section to resolve the continuous-time analogue of the problem; the key distinction in this case is that an appropriate measure-theoretic identity can in fact be established (Lemma 4.15 below).

Continuous time
In the previous section we have developed a partial result on filter stability in the case of translation-invariant models with direct observation structure. Unfortunately, that result concerns a quantity intermediate between the filter and prediction filter, and therefore falls short of resolving Conjecture 4.1 even in the translation-invariant setting. Surprisingly, however, it turns out that this problem can be resolved if we consider the natural continuous time analogue of Conjecture 4.1, providing a complete proof of filter stability in the translation-invariant setting for continuous-time models with direct observations. This idea will be developed in the remainder of this section.
To define the continuous-time counterpart of the filtering model of Conjecture 4.1, we begin by considering a stationary Markov process X = (X t ) t∈R with càdlàg paths with values in {−1, 1} Z . We will assume that X satisfies the Feller property, that is, that is a quasilocal function for every t and bounded quasilocal function f (a function is called quasilocal if it is the uniform limit of functions that depend on a finite number of coordinates). 4 This mild condition ensures that the dynamics are local in a very weak sense (as compared to the much stronger local structure of the model in section 2.2, which was however not used in the previous section). Markov processes of this type arise broadly in the literature on interacting particle systems, cf. [30]. To define the local observations (Y t ) t∈R , we introduce a 'white noise' model of the form where (W v t ) t∈R are i.i.d. two-sided Brownian motions independent of X, and σ > 0 denotes the noise strength. The process (X t , Y t ) t∈R is a natural continuous-time analogue of the model of Conjecture 4.1, and will be used in the remainder of this section.
Remark 4.11. The details of the present model are not essential for our results. The proof is easily extended to random fields on Z d with values in a finite state space, and to observation models other than the usual white noise model (cf. [48]). For concreteness and to avoid additional notation, we will work here in the simplest setting defined above.
As in section 4.2, we will further assume that the model is translation-invariant (note that, due to the additive nature of the observations, it is the increments of Y that are translation invariant and not Y itself). In this setting, we obtain the following result.
This result evidently resolves, in the translation-invariant case, the continuous-time analogue of Conjecture 4.1. We remark once more that stability of X is not assumed.
To prove Theorem 4.12 we require a sharper version of the entropy identity of Conze that was used in the proof of Theorem 4.5 (cf. Remark 4.8). The proof of the requisite identity, which we state presently, can be found in [10, p. 17, §8].
Let us note that the identity in Remark 4.8 follows immediately as n → ∞. The problem in continuous time is that we no longer have a discrete random field as in Lemma 4.13. We address this by an appropriate discretization method, which yields the following continuous-time counterpart of Proposition 4.7.
Proof. Let us choose v = 0 for simplicity. The result for arbitrary v follows immediately by translation invariance. In the following, we fix δ > 0 and m ∈ N.
Define the random field Z = (Z q ) q∈Z 2 with Z q = (X r k , Ỹ r k ) for q = (k, r) ∈ Z 2 as Then evidently Z is translation-invariant and X r k is finite-valued, but Ỹ r k takes values in the space C 0 ([0, δ]; R m+1 ) of continuous paths ω : [0, δ] → R m+1 with ω(0) = 0 (which is Polish when endowed with the topology of uniform convergence and the associated Borel σ-field). Thus we cannot directly apply Lemma 4.13.
To surmount this problem, we employ a straightforward discretization procedure. Let {A j } j≥1 be a countable generating class for the Borel σ-field F of C 0 ([0, δ]; R m+1 ), and define the functions κ j : C 0 ([0, δ]; R m+1 ) → {0, 1} j as κ j := (1 A1 , . . . , 1 Aj ). Then F j := σ{κ j } is an increasing family of σ-fields such that $\bigvee_j F_j = F$. Now define Ỹ r k (j) := κ j (Ỹ r k ). Then the random field (X r k , Ỹ r k (j)) k,r∈Z is translation-invariant and finite-valued for every j ≥ 1. Thus we can apply Lemma 4.13 to obtain In particular, as the left-hand side of this expression is an expected relative entropy (see the proof of Proposition 4.4), we can estimate for every i ≤ j. Letting j → ∞ and using that conditioning reduces entropy gives To proceed, we write the left-hand side as an expected relative entropy as in the proof of Proposition 4.4. Using the continuity of the relative entropy in information (e.g., [14, Lemma 4.4.15]) and monotone convergence, this yields It therefore follows immediately that It remains to note that where we have used the Markov property of (X t , Y t ) t≥0 to obtain the latter equality. This establishes that E[D(ρ s ||ρ)] → 0 along the subsequence s = −(n + 1)δ, and therefore as s → −∞ as E[D(ρ s ||ρ)] is decreasing in s (as is easily verified by Jensen's inequality).
We can now complete the proof of Theorem 4.12.
Proof of Theorem 4.12. By translation-invariance and Lemma 3.7, it suffices to show for every set A and m ≥ 1. But by the martingale convergence theorem, and as {X s , Y s } s≤−t is conditionally independent of {X s , Y s } s≥−t given X −t , Y −t by the Markov property, it suffices to show for every set A and m ≥ 1 that By the martingale convergence theorem, this can be formulated equivalently as The key distinction between the continuous- and discrete-time settings is the following.
Proof. Let us prove the first statement (the second statement follows readily in the same manner). It evidently suffices to prove the stronger identity $\bigcap_{\delta>0} \sigma\{Y_{\le\delta}, X_{\le -t}\} = \sigma\{Y_{\le 0}, X_{\le -t}\}$ mod P.
To this end, it suffices to show that for every bounded random variable Z. By a standard approximation argument, it suffices to consider Z of the form Z s = f (X s+t1 , Y s+t1 , . . . , X s+tn , Y s+tn ) for n ∈ N, t 1 , . . . , t n ∈ R, and f bounded and local. Now note that by stationarity, As X is Feller, it is quasi-left continuous [41, p. 101], and therefore X t−δ → X t a.s. as δ ↓ 0. In particular, as f is local and Y is continuous by construction, we have Z s−δ → Z s a.s. as δ ↓ 0. On the other hand, as f is bounded, we obtain ] for every set A, m ≥ 1, and δ > 0: indeed, letting δ ↓ 0 then yields the expression before the statement of Lemma 4.15. We will deduce this fact from Proposition 4.14. To this end we require a lemma that replaces the analogous argument in Proposition 4.4. We now recall that as C is compact, every continuous function on C is contained in the closure of {(g * ξ n )| C : g ∈ C b (R m+1 )} with respect to the uniform convergence topology on C (here C b (R m+1 ) is the family of bounded continuous functions on R m+1 ). This follows from an elementary Hahn-Banach argument, cf. [48, Remark 5]. We can therefore choose, for each n, a bounded continuous function g n such that Then we evidently have ] for every bounded continuous function h and s ∈ [0, δ]. But this follows readily from Proposition 4.14 using Pinsker's inequality and martingale convergence.

Conditional random fields
Thus far we have considered infinite-dimensional counterparts of classical stability problems in nonlinear filtering. However, new questions arise in infinite dimension beyond stability that are of interest in their own right. In particular, it is of significant interest (cf. [39]) to understand the spatial mixing and decay of correlations properties of conditional distributions in infinite dimension, which could be viewed as spatial counterparts to the filter stability property. Such questions already arise in the absence of dynamics, and thus we proceed in this section to introduce such problems in the most basic setting of conditional random fields (that is, in models with only spatial degrees of freedom). Our motivations for such questions are threefold:
1. Random fields provide the simplest possible setting to investigate the spatial mixing properties of conditional distributions.
2. Conditional random fields are of practical interest in their own right, for example, in Bayesian image analysis applications [53,20].
3. Even in the more classical setting of the previous sections, the random field viewpoint proves to be fundamental to the understanding of filter stability in infinite dimension: indeed, the proofs in both sections 3 and 4 above and in [39,40] exploit the idea that (X v k , Y v k ) k∈Z,v∈Z d can be viewed as a space-time random field.
The remainder of this section is organized as follows. In section 5.1, we recall some basic notions from the theory of Markov random fields. In section 5.2, we develop basic properties of conditional random fields and introduce some of the relevant questions. Finally, in section 5.3, we develop a general result that ensures the inheritance of ergodicity under conditioning in random fields that possess certain monotonicity properties. The latter provides a mechanism for the resolution of the random field counterpart of Conjecture 4.1 that is quite distinct from the observability theory of section 4.

Markov random fields
A random field is a collection of random variables X v that are indexed by the spatial degree of freedom v. For simplicity, we will assume in the sequel that v ∈ Z d (but see section 6.3 below) and that each X v takes values in a finite set E.
In the following, we define for any V ⊆ Z d If V is a finite subset of Z d , we will write V ⊂⊂ Z d . We now recall a basic definition.
Definition 5.1. X = (X v ) v∈Z d is called a Markov random field if it possesses the (local) Markov property, that is, P[X V ∈ ·|X V c ] depends only on X ∂V for every V ⊂⊂ Z d .
Just as Markov chains are defined by transition probabilities, Markov random fields are defined by a family of local transition kernels called a specification [21, Chapter 1].
is called a specification. A Markov random field X is said to be specified by γ if we have P(X ∈ A|X V c ) = γ V (X, A) for every measurable set A and V ⊂⊂ Z d . The family of all laws of Markov random fields specified by γ is denoted G (γ).
where Z is the appropriate normalization factor. It is easily verified that γ = (γ V ) V ⊂⊂Z d defines a specification. The potentials ψ v and ϕ {v,w} describe the local external and interaction forces between different sites, and are defined directly in terms of the physical parameters of the problem. For example, if E = {−1, 1}, ϕ {v,w} (σ, σ ′ ) = βJσσ ′ , and ψ v (σ) = βµσ with β, J > 0 and µ ∈ R, this is the well known ferromagnetic Ising model with inverse temperature β, interaction strength J and magnetic field strength µ. The construction in terms of potentials will be inessential in the sequel, however.
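As an illustration of how the potentials determine the local kernels, the following minimal sketch computes the single-site conditional probability for the Ising potentials described above. It assumes the sign convention in which a configuration has weight proportional to the exponential of the sum of its potentials (consistent with J > 0 being ferromagnetic); the function name `ising_site_prob_plus` and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def ising_site_prob_plus(neighbor_spins, beta, J, mu):
    """P[X_v = +1 | neighbouring spins] for the single-site Ising kernel.

    Assumption: the weight of a configuration is exp(sum of potentials) with
    phi(s, s') = beta*J*s*s' and psi(s) = beta*mu*s.
    """
    field = beta * (J * sum(neighbor_spins) + mu)
    # weight(+1) = exp(field), weight(-1) = exp(-field); normalise over both.
    return 1.0 / (1.0 + math.exp(-2.0 * field))


# Nearest neighbours of a site in Z^2 (four of them), no external field.
all_plus = ising_site_prob_plus([1, 1, 1, 1], beta=0.5, J=1.0, mu=0.0)
all_minus = ising_site_prob_plus([-1, -1, -1, -1], beta=0.5, J=1.0, mu=0.0)
balanced = ising_site_prob_plus([1, 1, -1, -1], beta=0.5, J=1.0, mu=0.0)
```

With µ = 0 the kernel is symmetric under a global spin flip, so `all_plus` and `all_minus` sum to one and a balanced neighbourhood gives probability 1/2.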
Given a specification γ, there always exists a random field in G (γ) under our assumptions. However, just as a Markov chain with given transition probabilities may admit more than one stationary distribution, the random field associated to a given specification need not be unique. In fact, the structure of the set G (γ) is closely related to the spatial mixing properties of the associated random fields, as is shown by the following result [21, section 4.4, Proposition 7.11, Theorem 7.7]. To interpret the notion of extremality that arises here, note that if P and Q are the laws of two random fields in G (γ), then λP + (1 − λ)Q is also in G (γ) for 0 ≤ λ ≤ 1 [21, Chapter 7]; thus G (γ) is a convex set, and a random field is called extremal if it is an extreme point of this set.
Theorem 5.4. For a given specification γ, the following hold.
2. Uniqueness ⇔ uniform mixing: |G (γ)| = 1 iff a random field in G (γ) satisfies (see footnotes 5 and 6)
$$\lim_{W} \sup_{x} \big| P[X_V \in A \mid X_{W^c} = x_{W^c}] - P[X_V \in A] \big| = 0$$
for every set A and V ⊂⊂ Z d .
3. Extremality ⇔ mixing: the random field X is an extreme point of G (γ) iff
$$\lim_{W} E\big| P[X_V \in A \mid X_{W^c}] - P[X_V \in A] \big| = 0$$
for every set A and V ⊂⊂ Z d .
Footnote 5. Here we used the suggestive notation P[X ∈ C|X W c = x W c ] := γ W (x, C) to emphasize the significance of the mixing property. Note that P[X ∈ C|X W c ] = γ W (X, C) holds a.s. by the definition of G (γ), but the equivalence between uniqueness and uniform mixing is false if a null set is omitted in the supremum over x.
Footnote 6. The notation lim W a W denotes the limit of the net {a W }, where {W ⊂⊂ Z d } is directed by inclusion.
The mixing property in Theorem 5.4 is a direct spatial analogue of the stability property of a Markov chain introduced in section 2.1. Indeed, a Markov chain is stable if it forgets its initial condition after a long time: that is, the Markov chain has a 'finite memory.' Similarly, a random field is mixing if the distribution of any finite set of sites V is insensitive to knowledge of the configuration of the field outside a larger set W when the distance between V and W c is large. This implies in particular that distant sites are nearly independent, that is, the field has 'finite correlation length.' The uniform mixing property is a strictly stronger notion, where the forgetting property holds uniformly in the boundary configuration x ∂W (recall that by the Markov property of the random field, P[X ∈ C|X W c = x W c ] depends on x ∂W only).

Conditional random fields and conditional mixing
In the following, let us fix a specification γ and a Markov random field X = (X v ) v∈Z d that is specified by γ. In order to investigate the conditional distributions of random fields, we must introduce a suitable observation structure. To this end, in analogy with section 2.2, let us fix for each v ∈ Z d a transition kernel Φ v from the state space E of the random field to a measurable space F in which the observations take their values. We now construct the observations Y = (Y v ) v∈Z d such that
$$P[Y \in \cdot \mid X] = \bigotimes_{v \in \mathbb{Z}^d} \Phi_v(X_v, \cdot\,),$$
that is, each site of the underlying field is observed independently with P[Y v ∈ A|X v ] = Φ v (X v , A). The resulting model (X v , Y v ) v∈Z d is called a hidden Markov random field.
Remark 5.5. For notational simplicity, we have formulated our model such that the observations are attached to individual sites v ∈ Z d . One could also consider more general models, for example, where an observation Y {v,w} is attached to every edge {v, w} ⊂ Z d with ‖v − w‖ = 1. The results of this section will continue to hold in this setting with minor modifications.
We can now formulate the natural counterpart of the filter stability property in hidden Markov random fields: the model is said to be conditionally mixing if the conditional distribution of the underlying process in a finite set of sites given the observations is insensitive to knowledge of the configuration of the field at distant sites.
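Definition 5.6. The hidden Markov random field (X v , Y v ) v∈Z d is said to be conditionally mixing if
$$\lim_{W} E\big| P[X_V \in A \mid X_{W^c}, Y] - P[X_V \in A \mid Y] \big| = 0$$
for every set A and V ⊂⊂ Z d .
The basic question to be addressed in this setting is therefore: when is the mixing property inherited by conditioning, that is, when does the mixing property of the random field X imply the conditional mixing property of (X, Y )?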
It will be insightful to reformulate the problem in different terms. For simplicity, we will assume in the sequel that the observations are locally nondegenerate, that is, that .
Then the following hold.
Proof. We begin by verifying that γ y is a specification. To this end, let W ⊂ V ⊂⊂ Z d .
Thus γ y V γ y W = γ y V , and the remaining properties of a specification hold trivially. Next, we show that P[X ∈ · |Y ] is in G (γ Y ) a.s. To this end, let us fix any regular version P Y of the conditional distribution P[ · |Y ]. We must show that for a.e. observation record y, we have P y [X ∈ A|X V c ] = γ y V (X, A) for all A, that is, we must show that for every measurable A and B ∈ σ{X V c } holds for P-a.e. y. It is easily seen from the definition of a hidden Markov random field that We therefore have holds P-a.s. for a fixed choice of A, B ∈ σ{X V c }, and thus simultaneously for a countable family of sets A and B ∈ σ{X V c }. By choosing the countable family to be a generating class (note that all our σ-fields are countably generated), the above identity holds simultaneously for every A and B ∈ σ{X V c } by a monotone class argument. As there are only countably many V ⊂⊂ Z d , we have proved that P[X ∈ · |Y ] is in G (γ Y ) a.s. Finally, we consider the conditional mixing property. As the limit in the definition of (conditional) mixing is over a decreasing net (by Jensen's inequality), it suffices to consider the limit along any fixed cofinal increasing sequence W n ⊂⊂ Z d . Thus by the martingale convergence theorem, the conditional mixing property holds if and only if , the conditional mixing property is equivalent to lim n→∞ E y |P y [X ∈ A|X W c n ] − P y [X ∈ A]| = 0 for P-a.e. y for every V ⊂⊂ Z d and A ∈ σ{X V }. But by the martingale convergence theorem Thus we can again use a monotone class argument as above to remove the dependence of the P-null set on V and A.
holds for P-a.e. y, which is precisely the mixing property of P[X ∈ · |Y ].
Proposition 5.7 shows that the conditional distribution P[X ∈ · |Y ] defines again a (random) Markov random field, and gives an explicit expression for its specification γ Y . The inheritance of ergodicity can now be formulated in terms of the ergodic properties of the conditional field. In particular, we can pose two natural questions:
1. If P[X ∈ · ] is extremal in G (γ), when is P[X ∈ · |Y ] extremal in G (γ Y ) a.s.?
2. If |G (γ)| = 1, when is |G (γ Y )| = 1 a.s.?
The first question is evidently the direct spatial analogue of the filter stability problem: when is the mixing property inherited by the conditional distribution? The second question is analogous, but for the uniform mixing property. It is evident from Theorem 5.4 that |G (γ Y )| = 1 a.s. implies the conditional mixing property. The stronger conclusion |G (γ Y )| = 1 a.s. is perhaps less natural from the point of view of conditional distributions, but is of practical relevance in its own right as it is closely connected with the computational complexity of MCMC methods for Bayesian image analysis [20].
As in the filter stability problem, local nondegeneracy of the observations does not suffice to obtain an affirmative answer to either of the above questions. In fact, we have a direct analogue of the example given in section 3.
Example 5.8. Let E = F = {−1, 1}, and define the random field (X v ) v∈Z 2 such that X v are i.i.d. symmetric Bernoulli random variables. It is evident that this model is uniformly mixing in the most trivial sense (thus uniqueness and extremality both hold).
We now attach an observation Y {v,w} to each edge {v, w} ⊂ Z d with ‖v − w‖ = 1 by setting Y {v,w} = X v X w ξ {v,w} with ξ {v,w} i.i.d. and independent of X with P[ξ {v,w} = −1] = p. In this manner, we evidently obtain a direct counterpart of the model of section 3. While the observations in this model are defined on the edges rather than on the vertices as we have done in this section, a result that is entirely analogous to Proposition 5.7 holds in this setting (see also Remark 5.5 above and Remark 5.9 below).
We can now proceed identically as in the proof of Theorem 3.1 to show that there exists 0 < p ⋆ < 1/2 such that the hidden Markov random field (X, Y ) fails to be conditionally mixing for p < p ⋆ . In fact, this is precisely the idea behind the proof of Theorem 3.1 in the first place: the model (X v k , Y v k ) k,v∈Z is considered as a space-time random field, and the problem is addressed using classical methods from statistical mechanics.
The present example could be considered as a toy model in image analysis. The underlying field X represents a grid of black or white pixels of an image, and the observations Y correspond to noisy measurements of the gradient of the image at each point. Thus we see that the ability to reconstruct the image based on the noisy gradient information undergoes a phase transition at a positive signal-to-noise ratio.
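The observation symmetry underlying Example 5.8 can be seen in a small simulation. The sketch below samples the model on a finite n × n window (a finite-window restriction is an assumption made purely for illustration) and checks that flipping every pixel of X leaves the edge observations unchanged, so that Y alone can never distinguish X from −X. The helper names `grid_edges`, `sample_field`, `sample_noise`, and `observe` are hypothetical.

```python
import random


def grid_edges(n):
    """Nearest-neighbour edges {v, w} of the n-by-n window, each listed once."""
    edges = []
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < n:
                edges.append(((i, j), (i, j + 1)))
    return edges


def sample_field(n, rng):
    """I.i.d. symmetric {-1, +1} variables X_v on the window."""
    return {(i, j): rng.choice((-1, 1)) for i in range(n) for j in range(n)}


def sample_noise(n, p, rng):
    """I.i.d. edge noise xi with P[xi = -1] = p."""
    return {e: (-1 if rng.random() < p else 1) for e in grid_edges(n)}


def observe(x, xi):
    """Edge observations Y_{v,w} = X_v * X_w * xi_{v,w}."""
    return {(v, w): x[v] * x[w] * xi[(v, w)] for (v, w) in xi}


rng = random.Random(0)
n, p = 4, 0.1
x = sample_field(n, rng)
xi = sample_noise(n, p, rng)
y = observe(x, xi)

# Global spin flip: the products X_v * X_w are unchanged, hence so is Y.
x_flipped = {v: -s for v, s in x.items()}
y_flipped = observe(x_flipped, xi)
```

Here `y == y_flipped` holds for every realization, which is the finite-window shadow of the symmetry that the phase transition breaks or preserves depending on p.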
Remark 5.9. The use of edge observations in Example 5.8 is merely cosmetic: the same example can be reformulated in terms of vertex observations. Indeed, let us define the random field ( where X v and ξ {v,w} are as in Example 5.8. Then X̃ is still a uniformly mixing Markov random field, the observations Ỹ are locally nondegenerate, and P[X 1 ∈ · |Ỹ ] = P[X ∈ · |Y ]. In particular, the above conditional phase transition arises identically in this formulation.
In view of the above, the inheritance of mixing properties of random fields under conditioning cannot be taken for granted. Just as in the filter stability problem, however, it is natural to expect that conditional mixing will hold in the absence of observation symmetries. Such a conjecture is often implicit in work on Bayesian image analysis (cf. [20, p. 6]). For example, we can formulate the natural analogue of Conjecture 4.1.
If the underlying random field X is mixing, then the model is conditionally mixing.
We do not know how to prove such a conjecture in a general setting. However, we will presently establish the validity of such a result under monotonicity assumptions on the underlying field. This provides an entirely different mechanism for the inheritance of ergodicity than the observability theory that was developed in section 4 above.

Monotonicity
The goal of this section is to prove a variant of Conjecture 5.10 under monotonicity assumptions on the underlying field. In such models, the direct observation structure Y v = X v ξ v ensures that monotonicity properties are preserved under conditioning on the observations, which greatly facilitates the analysis of the conditional random field. The arguments used here are directly inspired by the methods used in [25,18,16] to investigate the global Markov property of random fields.
In this section, we will assume that E = F = {−1, 1}. Fix a specification γ and a Markov random field X = (X v ) v∈Z d specified by γ, and introduce the observations Y v = X v ξ v , where the ξ v are independent of each other and of X with P[ξ v = −1] = p v . For simplicity, we will assume throughout this section that 0 < p v ≤ 1/2 for all v. 7 The following monotonicity property is the key assumption of this section.
Definition 5.11. A specification γ for a {−1, 1}-valued random field is called monotone if for every bounded increasing function f and V ⊂⊂ Z d , the function γ V f is increasing.
This property has various useful characterizations, cf. [27,Theorem 2.27]. Here a function f on {−1, 1} Z d is called increasing if f (x) ≤ f (x ′ ) whenever x v ≤ x ′ v for all v ∈ Z d .
We can now formulate the main result of this section.
Theorem 5.12. For the model of this section, suppose that the specification γ is monotone. Then |G (γ)| = 1 implies that |G (γ Y )| = 1 a.s. In particular, under the monotonicity assumption, uniqueness of the underlying random field implies conditional mixing.
Under the monotonicity assumption, Theorem 5.12 essentially resolves the analogue of Conjecture 5.10 for uniqueness rather than extremality (this is the special case p v = p = 1/2 for all v). A partial result on the inheritance of extremality when the field is not unique can be deduced from the proof as well, see Remark 5.17 below.
Nonetheless, the result of Theorem 5.12 is arguably quite different in spirit from Conjecture 5.10, as it does not require the absence of observation symmetries. For example, we could choose the noise parameters p v such that p v = 1/2 for alternating sites in the lattice Z d : then there is certainly a large class of observation symmetries, as only every other site in the lattice is observed. Thus Theorem 5.12 is not addressing a symmetry-breaking phenomenon of the type that motivated the observability conjecture in section 4. On the other hand, as will become clear in the proof of Theorem 5.12, the present approach directly implements the idea that ergodicity is inherited from the underlying model to the conditional distribution, in contrast with the observability theory of section 4 which does not exploit at all the ergodic properties of the model.
Remark 5.13. It would be interesting to obtain a counterpart to Theorem 5.12 for the filter stability problem. Unfortunately, this does not appear to be possible, at least in the setting of section 2.2. The proof of Theorem 5.12 relies on the fact that the conditional distributions of the random field given the observations are monotone, which is essentially equivalent to the validity of Definition 5.11. However, in the filtering model of section 2.2, the associated space-time random field must generally fail to be monotone in the sense of Definition 5.11, cf. [31] for a discussion in the continuous time setting. While the space-time distributions of interacting particle systems with monotone transition probabilities do satisfy some weaker monotonicity properties, cf. [29, section 3.1] or [31], such weaker properties do not suffice for the proof of Theorem 5.12.
We now turn to the proof of Theorem 5.12. We begin by establishing the inheritance of monotonicity by the conditional specification γ y .
2. y → γ y V f is increasing for every V ⊂⊂ Z d and bounded increasing f . Proof. It will be convenient to write Define for every y ∈ F Z d and W, V ⊂⊂ Z d the transition kernel on .
Evidently γ y V = γ y V,V and γ V = γ y V,∅ , and for simplicity we will write γ y ∅,W f = f . We now prove that (γ y V,W f )(x) is increasing in x, y for every bounded increasing function f by induction on |V | + |W |. The claim is obviously true for |V | + |W | = 1 by the monotonicity of γ. In the remainder of the proof, we suppose that the claim is true for all V, W such that |V | + |W | = m, and we proceed with the induction step.
Fix V, W such that |V | + |W | = m + 1. The claim is trivially true if W ∩ V = ∅ by the monotonicity of γ. Otherwise, fix v ∈ W ∩ V and a bounded increasing function f . Then \{v},W f holds by the same argument as in the proof of Proposition 5.7. Moreover, by the induction hypothesis, (γ y V \{v},W f )(x) is increasing in x, y, and by construction (γ y V \{v},W f )(x) depends only on x v and x V c , y. To show that (γ y V,W f )(x) is increasing in x, y, it thus suffices to show that γ y V,W (x, A) is increasing in x, y for 1 A (x) = 1 xv=1 . But note that .
As γ y V,W \{v} (x, A) is increasing in x, y by the induction hypothesis, γ y V,W (x, A) is also increasing in x, y. Thus the induction step is established, and the proof is complete.
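The inheritance of monotonicity can be checked numerically in the simplest case. The sketch below is only a sanity check under explicit assumptions: the underlying field is taken to be the ferromagnetic Ising model of Example 5.3 (with the weight convention exp of the sum of potentials), the observations are Y v = X v ξ v with flip probability p, and only single-site kernels are examined. The names `conditional_site_prob_plus` and `is_monotone` are hypothetical.

```python
import math
from itertools import product


def conditional_site_prob_plus(neighbor_spins, y_v, beta, J, mu, p):
    """P[X_v = +1 | neighbouring spins, Y_v = y_v] for Ising X and Y_v = X_v * xi_v."""
    field = beta * (J * sum(neighbor_spins) + mu)
    like_plus = 1 - p if y_v == 1 else p     # P[Y_v = y_v | X_v = +1]
    like_minus = 1 - p if y_v == -1 else p   # P[Y_v = y_v | X_v = -1]
    w_plus = math.exp(field) * like_plus
    w_minus = math.exp(-field) * like_minus
    return w_plus / (w_plus + w_minus)


def is_monotone(beta, J, mu, p, degree=4):
    """Check that the single-site kernel is increasing in (neighbours, y_v)."""
    inputs = list(product((-1, 1), repeat=degree + 1))  # last entry is y_v
    for a in inputs:
        for b in inputs:
            if all(s <= t for s, t in zip(a, b)):  # a <= b coordinatewise
                pa = conditional_site_prob_plus(a[:-1], a[-1], beta, J, mu, p)
                pb = conditional_site_prob_plus(b[:-1], b[-1], beta, J, mu, p)
                if pa > pb + 1e-12:
                    return False
    return True


ferro_ok = is_monotone(beta=0.5, J=1.0, mu=0.2, p=0.3)
noise_too_large = is_monotone(beta=0.5, J=1.0, mu=0.2, p=0.8)
antiferro = is_monotone(beta=0.5, J=-1.0, mu=0.0, p=0.3)
```

In this toy check monotonicity in y fails once p > 1/2 (a reported +1 then makes X v = +1 less likely) and monotonicity in x fails for J < 0, which matches the roles of the assumptions p v ≤ 1/2 and monotone γ in this section.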
A classical consequence of the monotonicity of γ is that it implies that G (γ) has a maximal and a minimal element. This substantially simplifies the characterization of uniqueness. The following standard result (see, for example, [5, section 4.3.3]) makes this idea precise. In the sequel, we define the maximal and minimal configurations +, − ∈ {−1, 1} Z d as + v = 1 and − v = −1 for all v ∈ Z d . A function f is said to be local if it depends only on a finite number of coordinates.
Lemma 5.15. Suppose that γ is monotone. There exist laws P + , P − in G (γ) such that for every P in G (γ) and local increasing function f . In particular, |G (γ)| = 1 iff P + = P − .
Proof. Let f be a bounded increasing function and let W ⊂ V ⊂⊂ Z d . As γ W f is increasing and γ V = γ V γ W , we readily obtain γ W f (−) ≤ γ V f ≤ γ W f (+). Thus the net (γ V f (−)) V ⊂⊂Z d is increasing and the net (γ V f (+)) V ⊂⊂Z d is decreasing. Now note that {−1, 1} Z d is compact and metrizable (for the product topology). Thus (γ V (−, · )) V ⊂⊂Z d and (γ V (+, · )) V ⊂⊂Z d are precompact for the weak convergence topology. In particular, we can choose a cofinal increasing sequence V n ⊂⊂ Z d such that γ Vn (−, · ) → P − and γ Vn (+, · ) → P + weakly for some probability measures P − and P + , respectively. By the above monotonicity, it follows readily that for every local increasing function f . Moreover, for any measure P in G (γ), we have for every local increasing function f , and similarly for the upper bound. We now argue that P − and P + are in fact in G (γ). To this end, fix any W ⊂⊂ Z d . As W ⊂ V n for all n sufficiently large (as {V n } is cofinal), and as γ W f is local if f is local (by the Markov property of the random field), we can write for all local functions f, g.
. The conclusion for P + follows identically.
Finally, we argue that |G (γ)| = 1 iff P + = P − . If P + ≠ P − , then evidently |G (γ)| ≥ 2. On the other hand, if P + = P − , then the expectation of every local increasing function must coincide for all γ-specified random fields. But then all γ-specified random fields coincide, as the local increasing functions are measure-determining (the moment generating function E[e λ1Xv 1 +···+λmXv m ] determines uniquely the joint law of X v1 , . . . , X vm , and f (x v1 , . . . , x vm ) = e λ1xv 1 +···+λmxv m is local and increasing for every λ 1 , . . . , λ m ≥ 0).
We call P + and P − the maximal and minimal elements of G (γ). Now recall that if γ is monotone, then so is γ y . Thus G (γ y ) also has a maximal and a minimal element. The key step in the proof of Theorem 5.12 is the following observation due to Föllmer [18].
Lemma 5.16. Suppose that γ is monotone. Let P + and P − be the maximal and minimal element of G (γ), and let P y + and P y − be the maximal and minimal element of G (γ y ). Then
Proof. Let us prove the result for P + ; the conclusion for P − follows in the same manner. Let f, g be local increasing functions. It suffices to show that for all local increasing functions f, g. By Lemma 5.15 Now note that (γ y V f )(+) is a local increasing function of y by Lemma 5.14. It is easily verified (as is an increasing function of X V for every increasing function h. Thus we obtain using Lemma 5.15 where the last equality follows as the quantity inside the infimum is decreasing in both V and W . But note that we have for every V sufficiently large that g(y) = g(y V ) Taking the infimum over V , it follows that and the proof is complete.
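The monotone nets (γ V f (+)) and (γ V f (−)) that appear in the proof of Lemma 5.15 can be computed by brute force in a toy case. The sketch below assumes a one-dimensional ferromagnetic Ising chain on Z with the weight convention exp of the sum of potentials, windows V = {−n, . . . , n}, and f (x) = x 0 ; the function name `gamma_V_spin0` and the parameter values are hypothetical.

```python
import math
from itertools import product


def gamma_V_spin0(n, boundary, beta, J, mu=0.0):
    """(gamma_V f)(boundary) for f(x) = x_0 and V = {-n, ..., n} on the chain Z.

    `boundary` is the common value (+1 or -1) of the two spins at -n-1 and n+1.
    """
    total = 0.0
    weighted = 0.0
    for inner in product((-1, 1), repeat=2 * n + 1):
        chain = (boundary,) + inner + (boundary,)
        # Sum of pair potentials over adjacent sites plus site potentials in V.
        energy = J * sum(a * b for a, b in zip(chain, chain[1:])) + mu * sum(inner)
        w = math.exp(beta * energy)
        total += w
        weighted += w * inner[n]  # inner[n] is the spin at site 0
    return weighted / total


# Increasing windows V_1 ⊂ V_2 ⊂ ... with all-plus and all-minus boundaries.
upper = [gamma_V_spin0(n, +1, beta=0.7, J=1.0) for n in range(1, 6)]
lower = [gamma_V_spin0(n, -1, beta=0.7, J=1.0) for n in range(1, 6)]
gaps = [u - l for u, l in zip(upper, lower)]
```

In this example `upper` decreases and `lower` increases along the windows, and the gap between them shrinks, in line with the fact that the one-dimensional nearest-neighbour Ising model has a unique Gibbs measure, so that P + = P − .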
We can now easily complete the proof of Theorem 5.12.
Remark 5.17. Even when the underlying field is not unique, that is, when P + ≠ P − , Lemma 5.16 immediately shows that the maximal and minimal models P + and P − are both conditionally mixing. Indeed, Lemma 5.16 implies that the conditional distributions P + [X ∈ · |Y ] and P − [X ∈ · |Y ] are maximal and minimal in G (γ Y ) a.s., and must therefore be extremal a.s. (so that conditional mixing follows by Proposition 5.7).
Lemma 5.15 might lead one to expect that if γ is monotone, then P + and P − are the only extremal random fields in G (γ). If a monotone specification γ satisfies this property, then we have shown that mixing implies conditional mixing, that is, we have resolved Conjecture 5.10 for monotone random fields. However, it turns out that there may exist extremal random fields defined by monotone specifications that differ from P + , P − . For example, for the ferromagnetic Ising model, the situation is as follows. On the two-dimensional lattice Z 2 , the maximal and minimal fields P + , P − are the only extremal fields.
However, in higher dimensions Z d , d ≥ 3 there exist extremal fields other than P + , P − . On the other hand, in any dimension, P + , P − are the only translation-invariant extremal fields. These are highly nontrivial results, and we refer to [11] and the references therein for further details. By combining these facts, we can resolve several variants of Conjecture 5.10 for the ferromagnetic Ising model. However, the situation for general monotone fields remains unclear.

Discussion
Our main aim in this paper has been to draw attention to the existence of surprising probabilistic phenomena in nonlinear filtering that arise in infinite dimension, and that are fundamentally different in nature from the problems that have been investigated in filtering theory to date. In particular, we have exhibited the existence of conditional phase transitions, and we made some first steps toward a general theory. It is only fair to emphasize, however, that none of our general results are entirely satisfactory, and it appears in particular that we are far from understanding the issues surrounding Conjectures 4.1 and 5.10 at any level of generality. It is likely that new ideas are needed to develop a deeper understanding of these questions.
Throughout this paper, we have repeatedly exploited connections between filtering in infinite dimension and problems in statistical mechanics and in multidimensional ergodic theory. In this final section, we will briefly discuss some further connections with other probabilistic problems that did not arise in the previous sections.

Measure-theoretic identities and conditional ergodicity
In this paper we have formulated the filter stability problem as one of inheritance of ergodicity. It is insightful, however, to consider the problem from a different perspective: following Kunita [28], the filter stability problem can be equivalently rephrased in terms of the validity of a simple measure-theoretic identity (see [51] for further discussion and references). In view of this connection, the counterexamples given above provide some new insight into the failure of such measure-theoretic identities.
We discuss this issue briefly here. Let us begin by rephrasing the filter stability property. Let (X k , Y k ) k∈Z be a stationary hidden Markov model as in section 2.1. Then In particular, validity of the measure-theoretic identity is sufficient for filter stability. In many cases, this identity can also be shown to be necessary [51]. In precisely the same manner, it is not difficult to show that the unobserved Markov chain (X k ) k∈Z is stable if and only if the tail σ-field Thus we can formulate a measure-theoretic version of the filter stability problem: It is tempting to conclude that this is always the case: as Y 0 −∞ does not depend on k, −∞ and thus the filter stability property would automatically follow from stability of the underlying model. This (incorrect) reasoning was used by Kunita [28] in the proof of his main result. The conclusion is already contradicted, however, by Example 2.1! The fundamental issue is that the exchange of intersection ∩ and supremum ∨ of σ-fields is not permitted in general.
An entirely analogous measure-theoretic formulation appears in the random field setting of section 5. Indeed, let (X v , Y v ) v∈Z d be a partially observed random field model as in section 5.2, and define Then the question whether mixing implies conditional mixing can be phrased as: Once again, Example 2.1 already shows this is not always the case (even for d = 1).
Establishing the validity of the exchange of intersection and supremum of σ-fields is a measure-theoretic conundrum that poses a tantalizing problem in a diverse range of probabilistic questions [52,51]. As this problem evidently lies at the heart of the ergodic theory of nonlinear filters, it is interesting to view results in this area from a measure-theoretic perspective. For example, as the inheritance of ergodicity has been established under nondegeneracy of the observations (Theorem 2.2 and generalizations developed in [45]), one might ask whether this result is a manifestation of a more general measure-theoretic principle that enables the exchange of intersection and supremum of σ-fields. Such a hope could be justified as follows. If G n is a decreasing filtration and F is a σ-field, it is easily established that $\bigcap_n (G_n \vee F) = (\bigcap_n G_n) \vee F$ mod P when G 0 and F are independent. Using the Bayes formula, one can readily deduce that the conclusion still follows if P ≪ Q such that G 0 and F are independent under Q. The nondegeneracy assumption, however, implies a weaker property: it implies that the laws of (X k ) |k|≤K and (Y k ) |k|≤K can be made independent by an equivalent change of measure for every finite K (but not for infinite K as would be required to apply the above principle). Thus Theorem 2.2 might suggest the existence of a more general local form of the above measure-theoretic principle in the setting of stationary processes.
Unfortunately, this hope proves to be unfounded: a counterexample in [51] shows that the exchange of intersection and supremum may fail even when the observations are nondegenerate (note that this does not contradict Theorem 2.2, as its uniform ergodicity assumption is strictly stronger than the stability property: thus the stronger hypothesis is essential to utilize nondegeneracy). However, the counterexample of [51] is decidedly esoteric. In contrast, the results of the present paper show that the validity of the exchange of intersection and supremum of σ-fields can fail in a very robust manner in infinite dimension. For example, Example 5.8 defines an almost trivial model of a stationary and ergodic random field with values in a finite set for which all finite-dimensional marginals admit a density that is bounded away from zero, while the validity of exchange of intersection and supremum of σ-fields exhibits a phase transition. These new measure-theoretic examples illustrate the highly nontrivial nature of the exchange of intersection and supremum problem even in very regular models.

Spin glasses and conditional Gibbs measures
We have seen in section 5.2 that when a Markov random field is conditioned on partial observations, the conditional distribution again defines a Markov random field, albeit with a random specification (through dependence on the observations whose realization is random). This idea has played a central role throughout this paper, both in section 5 where the objects of interest were the conditional random fields themselves, and in sections 3 and 4 where the space-time random field was a basic tool in the investigation of the corresponding filtering problems in infinite dimension.
Random fields whose specification depends on external random variables have been widely studied in the statistical mechanics of disordered systems. The randomness, which is typically assumed to be i.i.d. across sites, is used here to model the irregular spatial distribution of impurities in materials such as magnetic alloys. Such disordered materials, known as spin glasses, exhibit many remarkable properties and complex behavior; we refer to [5] for an introduction to the mathematical study of spin glasses. One of the chief difficulties in spin glasses is that the random interactions in the system do not favor a particular alignment of the configurations at different sites, a phenomenon known as 'frustration'. This creates a very complicated energy landscape that gives rise to many unusual properties of spin glasses, with a large number of mathematical mysteries remaining to be understood (see, e.g., [5, section 6.5]).
In a sense, our conditional random fields could be interpreted as a form of spin glass models. For example, if the underlying field X is an Ising model (cf. Example 5.3), then the conditional random field in Example 5.8 can be viewed as a form of random bond Ising model or Ising spin glass, while the setting of Conjecture 5.10 leads to a form of random field Ising model, cf. [5,22]. There is an essential difference, however, with the usual definition of a spin glass: the random variables that enter the conditional specification are not defined by an exogenous i.i.d. disorder, but rather by observations whose law is generated by the underlying random field. Thus the statistics of the randomness is perfectly matched to the underlying model. In particular, the conditional distribution should favor the same configurations as the underlying model, so that one might expect that the effect of frustration is mitigated. It is not clear, however, how such ideas could be made precise. More generally, this highlights the fundamental challenge in investigating conditional random field problems from the point of view of spin glasses: while conditional random fields are expected to exhibit much 'nicer' behavior than general spin glasses due to the precise matching of the randomness to the model, it is entirely unclear how the latter can be taken into account in the analysis of such problems.
Remark 6.1. It should be noted that the above intuition for the special nature of the problems considered in this paper fails to hold if we were to consider model misspecification. For example, if the conditional random field in Example 5.8 is computed assuming an error probability p that is misspecified with respect to the true error probability of the underlying observations, then there is no reason to expect improved behavior with respect to general spin glasses. The misspecified setting is completely unnatural from the point of view of conditional distributions: if P_p denotes the law of the model with error probability p, then P_p[X ∈ · |Y] is not even defined under P_q for p ≠ q, as then P_p ⊥ P_q. Such problems are therefore of a fundamentally different nature than those considered in this paper. On the other hand, one can never expect to have perfect knowledge of the underlying model in real-world applications, so that the present example highlights the need for caution in the practical interpretation of our results.
Remark 6.2. Unlike in most spin glass models, the observations that enter the specification of the conditional random field fail to be independent in almost any model of interest. However, in models that possess certain symmetries, such as Example 5.8, it is possible to transform the conditional random field to a spin glass model with i.i.d. randomness by means of a gauge transformation. This fact was exploited implicitly in the proofs in section 3 to analyze the phase transition behavior in this model. The converse observation has been noted in the physics literature: in spin glasses that possess gauge symmetries, it is possible to make a special choice of the randomness so that various quantities become explicitly computable. This restricts consideration to a distinguished subset of the parameter space of the model called the Nishimori line, which corresponds precisely to the situation where the model describes a conditional distribution (after gauge transformation). We refer to [38] and the references therein.
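To make the gauge transformation of Remark 6.2 concrete, the following short computation may be helpful. It is only a sketch under assumptions that are not spelled out in this section: we take a finite graph with edge set E, suppose that the spins X_v are i.i.d. and uniform on {-1,+1}, and suppose that each edge carries an observation Y_{vw} = X_v X_w ξ_{vw}, where the ξ_{vw} are i.i.d. with P[ξ_{vw} = -1] = p for some 0 < p < 1/2. The symbols ξ, σ, β_p and Z(y) are introduced here for illustration only.

```latex
% Sketch under the assumptions stated above (finite graph, uniform i.i.d.
% spins, edge observations Y_{vw} = X_v X_w \xi_{vw}, error probability p).
\begin{align*}
\mathbf{P}[Y=y \mid X=x]
  &= \prod_{\{v,w\}\in E}
     (1-p)^{(1+y_{vw}x_vx_w)/2}\; p^{(1-y_{vw}x_vx_w)/2} \\
  &= \bigl(p(1-p)\bigr)^{|E|/2}
     \exp\Bigl(\beta_p \sum_{\{v,w\}\in E} y_{vw}\,x_v x_w\Bigr),
  \qquad
  \beta_p := \tfrac{1}{2}\log\frac{1-p}{p}, \\[4pt]
\mathbf{P}[X=x \mid Y=y]
  &= \frac{1}{Z(y)}
     \exp\Bigl(\beta_p \sum_{\{v,w\}\in E} y_{vw}\,x_v x_w\Bigr)
  \qquad\text{(the prior on $x$ is uniform).}
\end{align*}
% Gauge transformation: set \sigma_v := x_v X_v. Since X_v^2 = 1,
\begin{align*}
y_{vw}\,x_v x_w
  = \xi_{vw}\,(x_v X_v)(x_w X_w)
  = \xi_{vw}\,\sigma_v \sigma_w
  \qquad\text{when } y_{vw} = X_v X_w\,\xi_{vw}.
\end{align*}
% In the variables \sigma the couplings \xi_{vw} are i.i.d., and the inverse
% temperature is tied to the disorder by the Nishimori condition
\begin{align*}
e^{-2\beta_p} = \frac{p}{1-p}.
\end{align*}
```

In this sketch the conditional distribution becomes a random bond Ising model whose inverse temperature is determined by the error probability, which is the sense in which the conditional random field sits on the Nishimori line; a misspecified error probability as in Remark 6.1 would correspond to leaving that line.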
An entirely different manner in which conditional random fields arise in statistical mechanics is in the study of renormalization group transformations. In section 5 we have investigated Markov random fields where P[X_V ∈ · | X_{V^c}] depends on X_{∂V} only. In renormalization group analyses, one aims to perform computations at the level of a transformation Y of the random field X obtained, for example, by working on a sublattice. In general, such a transformation is no longer Markov, that is, P[Y_W ∈ · | Y_{W^c}] depends on all sites outside W ⊂⊂ Z^d. However, one still expects that the resulting model is local in the sense that P[Y_W ∈ · | Y_{W^c}] only depends significantly on Y_w for w in some finite neighborhood of W. This idea is formalized by the requirement that the functions y ↦ P[Y_W ∈ A | Y_{W^c} = y_{W^c}] are quasilocal, that is, uniform limits of local functions, for every W ⊂⊂ Z^d. Such random fields are called Gibbsian. In order for the renormalization procedure to be well defined, one therefore faces the question whether a transformation Y of a Markov random field X is guaranteed to be Gibbsian.
Surprisingly, this natural property proves to be false in general: it may be that Y fails to be Gibbsian even when X is Markov. Such renormalization group pathologies are discussed at great length in the paper [46]. This issue proves to be closely connected, at least in monotone fields, with the uniqueness property of the conditional random field P[X ∈ · |Y], and thus with the type of phenomena investigated in section 5. We refer for further details to the paper [16], which also develops the connection with the global Markov property [25,18] (already mentioned in the proof of Theorem 5.12).

Trees and other graphs
Throughout this paper, we have considered spatial variables indexed by the lattice Z^d. However, both infinite-dimensional filtering models and partially observed Markov random fields can be meaningfully formulated on any infinite graph of finite degree G = (V, E). While the lattice setting is a natural prototype for the investigation of infinite-dimensional systems, the consideration of other graphical structures could provide complementary insights on the phenomena considered in this paper.
Of particular interest is the case where the spatial graph G is a tree. Such models are motivated by applications in information theory, statistical physics, and mathematical biology, cf. [15] and the references therein. More importantly, trees play a special role in the theory of Markov random fields, as many problems that are exceedingly difficult in general graphs are amenable to exact computations on a tree by means of recursive formulas (see, for example, [21,Chapter 12]). It is therefore possible that significant insight could be obtained on conjectures that arise from this paper by investigating these on trees using the special tools that are available in this setting. We have not systematically pursued this avenue of investigation. Nonetheless, the statistical mechanics literature on spin glass models on Bethe lattices (infinite regular trees in which every vertex has exactly q ≥ 3 neighbors) already provides some suggestive evidence towards questions that arise from this paper, as we will now briefly discuss.
1. The main conjecture of this paper is that stability properties are preserved by conditioning in the absence of symmetries. However, the theory of section 4 raises the question whether stability of the underlying model even plays a role in this setting: it is possible that the stability of conditional distributions follows already from the symmetry-breaking assumption alone even when the underlying model is not stable. On the other hand, the theory of section 5.3 does require stability of the model, suggesting that the latter should play a role after all.
Some insight into this question can be obtained from results in [4] on the random field Ising model on the Bethe lattice, which can be interpreted as the conditional random field for the variant of the model of section 5.3 where the underlying field X is a ferromagnetic Ising model on the Bethe lattice. It is shown in [4] that if the underlying field X is not unique and the error probability p is sufficiently close to 1/2, then the conditional random field is also not unique for any realization of the observations. Thus the conclusion of Theorem 5.12 fails in this case, even though the symmetry-breaking condition holds here and the model is monotone and translation invariant (in the sense appropriate to the Bethe lattice). Thus, at least in the setting of trees, the symmetry-breaking property alone is not sufficient to ensure the inheritance of ergodicity in the sense of uniform mixing. This provides some evidence towards the necessity of the stability assumption in our conjectures.
2. Section 5 introduced two different notions of conditional ergodicity for Markov random fields: uniform mixing (uniqueness) and mixing (extremality) of the conditional random field. It is not clear whether inheritance of the mixing property is connected to inheritance of the uniform mixing property: for example, is it possible for the conditional distribution of a uniformly mixing random field to be mixing but not uniformly mixing? While we do not have any concrete evidence in the present setting, related results on the ferromagnetic Ising model on the Bethe lattice [3] show that it is indeed possible, albeit in a different setting, that the uniqueness and extremality properties appear at different phase transition points.
3. The conditional random field for the analogue of Example 5.8 on the Bethe lattice can be viewed as a random bond Ising model on the Bethe lattice. This model has been systematically investigated in the statistical mechanics literature [8,6,7], which leads to a detailed description of the phase diagram and, in particular, the exact location of the phase transition point on the Nishimori line (cf. Remark 6.2).
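The recursive formulas that make trees tractable can be illustrated on the simplest case mentioned above, the ferromagnetic Ising model without observations. The following Python sketch is not taken from the references; it assumes a rooted tree in which every vertex has q - 1 children, zero external field, and leaves at the given depth subject to a common boundary field. The function names and parameters are hypothetical. It uses the standard one-step recursion for the effective field, h ← (q - 1) atanh(tanh(β) tanh(h)), and shows that the root retains a memory of the boundary condition only when (q - 1) tanh(β) > 1, which is the non-uniqueness regime.

```python
import math


def critical_beta(q):
    """Inverse temperature at which (q - 1) * tanh(beta) = 1 (requires q >= 3)."""
    return math.atanh(1.0 / (q - 1))


def root_field(beta, q, depth, boundary_field=math.inf):
    """Effective field at the root of a rooted tree with q - 1 children per vertex.

    The leaves at distance `depth` from the root feel `boundary_field`
    (+inf pins them to +1, -inf pins them to -1). Keep beta moderate:
    for very large beta, tanh(beta) rounds to 1.0 and atanh would fail.
    """
    t = math.tanh(beta)
    h = boundary_field
    for _ in range(depth):
        # Each vertex sums the messages atanh(t * tanh(h)) of its q - 1 children.
        h = (q - 1) * math.atanh(t * math.tanh(h))
    return h


def root_magnetization(beta, q, depth, boundary_field=math.inf):
    """Expected spin at the root under the given boundary condition."""
    return math.tanh(root_field(beta, q, depth, boundary_field))


if __name__ == "__main__":
    q = 3
    print("critical beta:", critical_beta(q))
    for beta in (0.3, 1.0):
        plus = root_magnetization(beta, q, depth=200)
        minus = root_magnetization(beta, q, depth=200, boundary_field=-math.inf)
        print(beta, plus, minus)
```

Below the critical inverse temperature the plus and minus boundary conditions give the same root magnetization in the limit of large depth, while above it they give different limits. Recursions for the conditional random fields discussed in items 1 and 3 would additionally carry the random observations through the messages, which this sketch does not attempt.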
Another widely studied class of models in statistical mechanics are mean-field models, where the spatial graph is chosen to be the complete graph on a finite number n of spatial locations (that is, every pair of sites interacts with each other). The large-scale properties of such models do not arise at the level of a fixed graph, but rather in the limit as n → ∞ where the strength of the interactions is scaled down appropriately to obtain nontrivial limiting behavior. Mean-field analogues of Example 5.8 arise naturally in applications in statistics and computer science [37] and in information theory [35,36]. However, it is arguably less surprising that nontrivial phase transition phenomena arise in this setting, as these problems are concerned with the limiting behavior of a sequence of conditional distributions where the underlying model is scaled in a nontrivial manner. In contrast, in the problems we have investigated in this paper, the emergent phase transition phenomena arise from conditioning alone.

Stability in time vs. spatial mixing
We have discussed in this paper two different questions of inheritance of ergodicity: the stability property in time of dynamical filtering models, and the mixing property in space in the setting of static random fields. While we have utilized the notion of a space-time random field even in the dynamical setting, it is not at all clear to what extent the problems of inheritance of spatial and temporal ergodicity are related to one another. In particular, we do not know whether there exist filtering models for which the filter is stable but does not exhibit a spatial mixing property. From the practical perspective, the design of filtering algorithms in high dimension relies crucially both on the temporal stability and spatial mixing of the filter [39].
Connections between spatial and temporal mixing properties arise frequently in the ergodic theory of interacting Markov chains. For example, [26] shows a correspondence in the translation-invariant setting between invariant measures of an infinite-dimensional Markov chain and random fields corresponding to the specification of the space-time system (related ideas in the context of the global Markov property can be found in [24,19]). Similarly, connections between exponential mixing in space and time are well known, see for example [32] and the references therein. We do not know, however, whether similar connections are likely to arise for conditional distributions. Let us emphasize in this context that the arguments that were used in this paper to obtain filter stability (section 4) and conditional mixing (section 5) are of a very different nature, but this is likely to be a limitation of our analysis rather than a genuine phenomenon.