Probabilistic aspects of entropy ∗

We give an overview of some probabilistic facets of entropy, recalling how entropy shows up naturally in various different situations ranging from information theory and hypothesis testing over large deviations and the central limit theorem to interacting random fields and the equivalence of ensembles.


Entropy as a measure of uncertainty
As is well known, it was Ludwig Boltzmann who first gave a probabilistic interpretation of thermodynamic entropy. He coined the famous formula
(1.1) S = k log W,
which is engraved on his tombstone in Vienna: the entropy S of an observed macroscopic state is nothing else than the logarithmic probability of its occurrence, up to some scalar factor k (the Boltzmann constant) which is physically significant but can be ignored from a mathematical point of view. I will not enter here into a discussion of the history and physical significance of this formula; this is the subject of other contributions to this volume. Here I will simply recall its most elementary probabilistic interpretation.
Let E be a finite set and µ a probability measure on E. † In the Maxwell–Boltzmann picture, E is the set of all possible energy levels for a system of particles, and µ corresponds to a specific histogram of energies describing some macrostate of the system. Assume for a moment that each µ(x), x ∈ E, is a multiple of 1/n, i.e., µ is a histogram for n trials or, equivalently, a macrostate for a system of n particles. On the microscopic level, the system is then described by a sequence ω ∈ E^n, the microstate, associating to each particle its energy level. Boltzmann's idea is now the following: The entropy of a macrostate µ corresponds to the degree of uncertainty about the actual microstate ω when only µ is known, and can thus be measured by log N_n(µ), the logarithmic number of microstates leading to µ.
Explicitly, for a given microstate ω ∈ E^n let ‡
(1.2) L^ω_n = (1/n) ∑_{i=1}^n δ_{ω_i}
be the associated macrostate describing how the particles are distributed over the energy levels. L^ω_n is called the empirical distribution (or histogram) of ω ∈ E^n. Then N_n(µ) = #{ω ∈ E^n : L^ω_n = µ} = n! / ∏_{x∈E} (nµ(x))!, the multinomial coefficient. In view of the n-dependence of this quantity, one should approximate a given µ by a sequence µ_n of n-particle macrostates and define the uncertainty H(µ) of µ as the n → ∞ limit of the "mean uncertainty of µ_n per particle". Using Stirling's formula, we arrive in this way at the well-known expression for the entropy:
(1.3) Entropy as degree of ignorance: Let µ and µ_n be probability measures on E such that µ_n → µ and nµ_n(x) ∈ Z for all x ∈ E. Then the limit
lim_{n→∞} (1/n) log N_n(µ_n) = H(µ) := −∑_{x∈E} µ(x) log µ(x)
exists. A proof including error bounds is given in Lemma 2.3 of [5].
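To make the limit in (1.3) concrete, here is a small numerical check (plain Python; the function names and test values are mine): it computes (1/n) log N_n(µ_n) via the log-gamma function and compares it with H(µ).

```python
import math

def entropy(mu):
    """H(mu) = -sum_x mu(x) log mu(x), with natural logarithm."""
    return -sum(p * math.log(p) for p in mu if p > 0)

def log_multiplicity(counts):
    """log N_n(mu) = log( n! / prod_x (n mu(x))! ), computed via log-gamma."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)

mu = [0.5, 0.3, 0.2]
for n in (10, 100, 10000):
    counts = [round(n * p) for p in mu]   # histogram mu_n with n mu_n(x) integer
    print(n, log_multiplicity(counts) / n, entropy(mu))
```

For growing n the per-particle log-multiplicity approaches H(µ) ≈ 1.0297, with a Stirling correction of order (log n)/n.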
Though we have used the terms uncertainty and ignorance, the entropy H(µ) should not be considered as a subjective quantity. It simply counts the number of possibilities to obtain the histogram µ, and thus describes the hidden multiplicity of "true" microstates consistent with the observed µ. It is therefore a measure of the complexity inherent in µ.
To summarize: In Boltzmann's picture, µ is a histogram resulting from a random phenomenon on the microscopic level, and H(µ) corresponds to the observer's uncertainty of what is really going on there.

Entropy as a measure of information
We will now approach the problem of measuring the "uncertainty content" of a probability measure µ from a different side suggested by Shannon [35]. Whereas Boltzmann's view is backwards to the microscopic origins of µ, Shannon's view is ahead, taking µ as given and "randomizing" it by generating a random signal with alphabet E and law µ. His question is: How large is the receiver's effort to recover µ from the signal? This effort can be measured by the number of yes-or-no questions to be answered on the average in order to identify the signal (and thereby µ, after many independent repetitions). So it corresponds to the receiver's a priori uncertainty about µ. But, as observed by Shannon, this effort measures also the degree of information the receiver gets a posteriori when all necessary yes-or-no questions are answered. This leads to the following concept of information: The information contained in a random signal with prescribed distribution is equal to the expected number of bits necessary to encode the signal.
Specifically, a binary prefix code for E is a mapping f : E → ⋃_{k≥1} {0, 1}^k from E into the set of all finite zero-one sequences which is decipherable, in that no codeword f(x) is a prefix of another codeword f(y). (Such an f can be described by a binary decision tree, the leaves of which correspond to the codewords.) Let #f(x) denote the length of the codeword f(x), and µ(#f) the expectation of the random variable #f under µ. A natural candidate for the information contained in the signal is then the minimal expected length
I_p(µ) = inf{µ(#f) : f a binary prefix code for E}
of a binary prefix code for E. This quantity is already closely related to H(µ), but the relationship becomes nicer if one assumes that the random signal forms a memoryless source, in that the random letters from E are repeated independently, and one encodes signal words of length n (which are distributed according to the product measure µ^n). In this setting, I_p(µ^n)/n is the information per signal letter, and in the limit n → ∞ one obtains the
(2.1) Source coding theorem for prefix codes: The information contained in a memoryless source with distribution µ is
lim_{n→∞} I_p(µ^n)/n = H(µ)/log 2,
the entropy taken with logarithm to base 2. For a proof of a refined version see Theorem 4.1 of [5], for example. An alternative coding scheme leading to a similar result is block coding with small error probability. A binary n-block code of length ℓ with error level α > 0 is a mapping f : E^n → {0, 1}^ℓ admitting a decoding map φ : {0, 1}^ℓ → E^n with µ^n(φ ∘ f ≠ id) ≤ α. Let I_b(α; µ^n) be the minimal length ℓ of a binary n-block code at level α. The following result then gives another justification of entropy.
(2.2) Source coding theorem for block codes: The information contained in a memoryless source with distribution µ is
lim_{n→∞} I_b(α; µ^n)/n = H(µ)/log 2,
independently of the error level α > 0.
The proof of this result (see e.g. Theorem 1.1 of [5]) relies on an intermediate result which follows immediately from the weak law of large numbers. It reveals yet another role of entropy and is therefore interesting in its own right:
(2.3) Asymptotic equipartition property: For all δ > 0,
µ^n( ω ∈ E^n : | −(1/n) log µ^n(ω) − H(µ) | ≥ δ ) → 0 as n → ∞.
In other words, most ω have probability µ^n(ω) ≈ e^{−nH(µ)}. This may be viewed as a random version of Boltzmann's formula (1.1).
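The equipartition property is easy to watch by simulation. The following sketch (my own illustration, with made-up alphabet and law) draws one long word from the memoryless source and compares −(1/n) log µ^n(ω) with H(µ).

```python
import math
import random

random.seed(0)
mu = {"a": 0.5, "b": 0.3, "c": 0.2}   # an illustrative alphabet and law
H = -sum(p * math.log(p) for p in mu.values())

def sample_log_prob(n):
    """Draw one word from the memoryless source mu^n; return -(1/n) log mu^n(word)."""
    word = random.choices(list(mu), weights=list(mu.values()), k=n)
    return -sum(math.log(mu[x]) for x in word) / n

print(sample_log_prob(100000), H)   # the two numbers should be close
```

By the weak law of large numbers, the normalized log-probability of a typical word concentrates at H(µ), which is exactly the content of (2.3).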
To conclude this section, let us mention that the entropy H(µ) admits several axiomatic characterizations which underline its significance as a measure of uncertainty and information; cf. e.g. the discussion on pp. 25-27 of [5]. However, compared with the previous genuine results, these characterizations should rather be considered as a posteriori justifications.
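As a concrete check of the prefix-code theorem (2.1) in the case n = 1: an optimal binary prefix code, constructed here with Huffman's algorithm (not discussed in the text, but the standard way to attain the optimum), has expected length µ(#f) between the base-2 entropy and the base-2 entropy plus one bit.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of an optimal binary prefix code (Huffman's algorithm)."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)   # merge the two least probable subtrees
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1            # every leaf below the merge gains one bit
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

mu = [0.5, 0.3, 0.2]
L = sum(p * l for p, l in zip(mu, huffman_lengths(mu)))
H2 = -sum(p * math.log2(p) for p in mu)
print(H2, L)   # H2 <= L < H2 + 1
```

Encoding n-blocks instead of single letters shrinks the one-bit overhead to 1/n per letter, which is how the limit in (2.1) arises.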
Relative entropy as a measure of discrimination

Let E still be a finite set, and consider two distinct probability measures µ_0 and µ_1 on E. Suppose we do not know which of these probability measures properly describes the random phenomenon we have in mind (which might again be a random signal with alphabet E). We then ask the following question: How easy is it to distinguish the two candidates µ_0 and µ_1 on the basis of independent observations? This is a standard problem of statistics, and the standard procedure is to perform a test of the hypothesis µ_0 against the alternative µ_1 with error level α. In fact, if we want to use n independent observations then we have to test the product measure µ_0^n against the product measure µ_1^n. Such a test is defined by a "rejection region" R ⊂ E^n; if the observed outcome belongs to R one decides in favor of the alternative µ_1, otherwise one accepts the hypothesis µ_0. There are two possible errors: rejecting the hypothesis µ_0 although it is true (first kind), and accepting µ_0 though it is false (second kind). The common practice is to keep the error probability of the first kind under a prescribed level α and to choose R such that the error probability of the second kind becomes minimal. The minimum value is
ρ_n(α; µ_0, µ_1) = min{ µ_1^n(E^n \ R) : R ⊂ E^n, µ_0^n(R) ≤ α }.
Consequently, it is natural to say that µ_0 and µ_1 are the easier to distinguish the smaller ρ_n(α; µ_0, µ_1) turns out to be. More precisely: The degree to which µ_1 can be distinguished from µ_0 on the basis of independent observations can be measured by the rate of decay of ρ_n(α; µ_0, µ_1) as n → ∞.
An application of the weak law of large numbers completely similar to that in the source coding theorem (2.2) gives:
(3.1) Lemma of C. Stein: The measure for discriminating µ_1 from µ_0 is
lim_{n→∞} −(1/n) log ρ_n(α; µ_0, µ_1) = D(µ_0 | µ_1) := ∑_{x∈E} µ_0(x) log ( µ_0(x)/µ_1(x) ),
independently of the level α. This result was first published in [2]; see also Corollary 1.2 of [5] or Lemma 3.4.7 of [6]. D(µ_0 | µ_1) is known as the relative entropy, Kullback-Leibler information, I-divergence, or information gain.
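The weak-law mechanism behind Stein's lemma can be seen directly: under µ_0, the normalized log-likelihood ratio (1/n) ∑_i log(µ_0/µ_1)(X_i) concentrates at D(µ_0 | µ_1). A Monte-Carlo sketch with illustrative numbers:

```python
import math
import random

random.seed(1)
mu0 = [0.5, 0.5]   # hypothesis (made-up numbers)
mu1 = [0.9, 0.1]   # alternative

def D(p, q):
    """Relative entropy D(p | q) = sum_x p(x) log(p(x)/q(x))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

n = 200000
xs = random.choices(range(2), weights=mu0, k=n)   # observations under mu0
llr = sum(math.log(mu0[x] / mu1[x]) for x in xs) / n
print(llr, D(mu0, mu1))   # the empirical average is close to D(mu0 | mu1)
```

The optimal rejection regions are sublevel sets of this log-likelihood ratio, and its concentration at D(µ_0 | µ_1) is what produces the exponential decay rate in (3.1).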
If µ_1 is the uniform distribution on E, then D(µ_0 | µ_1) = log |E| − H(µ_0). Hence relative entropy is a generalization of entropy to the case of a non-uniform reference measure, at least up to the sign. (In view of the difference in sign one might prefer calling D(µ_0 | µ_1) the negative relative entropy. Nevertheless, we stick to the terminology above, which has become standard in probability theory.) Stein's lemma asserts that the relative entropy D( • | • ) measures the extent to which two probability measures differ. Although D( • | • ) is not a metric (neither being symmetric nor satisfying the triangle inequality), it can be used to introduce some kind of geometry for probability measures, and in particular some kind of projection of a probability measure on a convex set of such measures [3]. As we will see in a moment, these so-called I-projections play a central role in the asymptotic analysis of the empirical distributions (1.2). But first, as some motivation, let us mention a refinement of Stein's lemma for which the error probability of the first kind is not held fixed but decays exponentially at a given rate. The answer is in terms of L^ω_n and reads as follows.
(3.2) Hoeffding's theorem: Let 0 < a < D(µ_1 | µ_0), and consider the test of µ_0 against µ_1 on n observations with the rejection region
R_n = { ω ∈ E^n : D(L^ω_n | µ_0) ≥ a }.
Then the error probability of the first kind decays exponentially with rate a, i.e.,
lim_{n→∞} −(1/n) log µ_0^n(R_n) = a,
and the error probability of the second kind satisfies the exponential bound
µ_1^n(E^n \ R_n) ≤ exp[ −n inf{ D(ν | µ_1) : D(ν | µ_0) ≤ a } + o(n) ]
with optimal exponent.
Hoeffding's original paper is [16]; see also p. 44 of [5] or Theorem 3.5.4 of [6]. It is remarkable that the asymptotically optimal tests R_n do not depend on the alternative µ_1. One should note that K. Pearson's well-known χ²-test for the parameter of a multinomial distribution (see e.g. [31]) uses a rejection region similar to R_n, the relative entropy D(L^ω_n | µ_0) being replaced by a quadratic approximation. Hoeffding's theorem is in fact an immediate consequence of a much more fundamental result, the theorem of Sanov. This elucidates the role of relative entropy for the asymptotic behavior of the empirical distributions L^ω_n. The basic observation is the identity
(3.3) µ^n(ω) = exp[ −n ( H(L^ω_n) + D(L^ω_n | µ) ) ],
which holds for any probability measure µ on E and any ω ∈ E^n. In view of our first assertion (1.3), it follows that
(1/n) log µ^n( ω ∈ E^n : L^ω_n = ν_n ) → −D(ν | µ)
whenever ν_n → ν such that nν_n(x) ∈ Z for all x and n. This can be viewed as a version of Boltzmann's formula (1.1) and leads directly to the following theorem due to Sanov [34], cf. also p. 43 of [5] or Theorem 2.1.10 of [6].
(3.4) Sanov's large deviation theorem: Let µ be any probability measure on E and C a class of probability measures on E with dense (relative) interior, i.e., C ⊂ cl int C. Then
lim_{n→∞} (1/n) log µ^n( ω ∈ E^n : L^ω_n ∈ C ) = −inf_{ν∈C} D(ν | µ).
Sanov's theorem provides just a glimpse into large deviation theory, in which (relative) entropies of various kinds play a central role. (More on this can be found in [6] and the contributions of den Hollander and Varadhan to this volume.) Its meaning can be summarized as follows: Among all realizations with histogram in C, the most probable are those having a histogram closest to µ in the sense of relative entropy.
We will return to this point later in (5.3). Needless to say, Sanov's theorem can be extended to quite general state spaces E, see [4] or Theorem 6.2.10 of [6].

Entropy maximization under constraints
The second law of thermodynamics asserts that a physical system in equilibrium has maximal entropy among all states with the same energy. Translating this into probabilistic language and replacing entropy by the more general relative entropy, we are led to the following question: Let C be a class of probability measures on some measurable space (E, E) and µ a fixed reference measure on (E, E). What then are the probability measures in C minimizing the relative entropy D( • | µ)? The universal significance of such minimizers has been put forward by Jaynes [19, 20]. As noticed above, they come up also in the context of Sanov's theorem (3.4). In the present more general setting, the relative entropy can be defined by D(ν | µ) = sup_P D(ν_P | µ_P), where the supremum extends over all finite E-measurable partitions P and ν_P stands for the restriction of ν to P. Equivalently, D(ν | µ) = ν(log f) if ν is absolutely continuous with respect to µ with density f, and D(ν | µ) = ∞ otherwise; see Corollary (15.7) of [13], for example. (For a third expression see (4.1) below.) The first definition shows in particular that D( • | µ) is lower semicontinuous in the so-called τ-topology generated by the mappings ν → ν(A) with A ∈ E. Consequently, a minimizer does exist whenever C is closed in this topology. If C is also convex, the minimizer is uniquely determined due to the strict convexity of D( • | µ), and is then called the I-projection of µ on C. We consider here only the most classical case when C is defined by an integral constraint. That is, writing ν(g) for the integral of some bounded measurable function g : E → R^d with respect to ν, we assume that C = {ν : ν(g) = a} for suitable a ∈ R^d. In other words, we consider the constrained variational problem of minimizing D(ν | µ) subject to ν(g) = a. In this case one can use a convex Lagrange multiplier calculus as follows.
For any bounded measurable function f : E → R let P(f) = log µ(e^f) be the log-Laplace functional of µ. One then has the variational formula
(4.1) D(ν | µ) = sup_f [ ν(f) − P(f) ],
meaning that D( • | µ) and P are convex conjugates (i.e., Legendre-Fenchel transforms) of each other; cf. Lemma 6.2.13 of [6]. Let
(4.2) J_g(a) = inf{ D(ν | µ) : ν(g) = a }
be the "entropy distance" of {ν : ν(g) = a} from µ. A little convex analysis then shows that
J_g(a) = sup_{t∈R^d} [ t · a − P(t · g) ],
i.e., J_g is a partial convex conjugate of P (or, in other terms, the Cramér transform of the distribution µ ∘ g^{-1} of g under µ). Moreover, if g is non-degenerate (in the sense that µ ∘ g^{-1} is not supported on a hyperplane), then J_g is differentiable on the interior I_g = int{J_g < ∞} of its essential domain, and one arrives at the following
(4.3) Gibbs-Jaynes principle: For any non-degenerate g : E → R^d, a ∈ I_g and t = ∇J_g(a), the probability measure
µ_t(dx) = e^{t·g(x)} µ(dx) / µ(e^{t·g})
is the unique minimizer of D( • | µ) on {ν : ν(g) = a}. Generalized versions of this result can be found in [3, 4] or Example (9.42) of [37].
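The Gibbs-Jaynes principle can be illustrated numerically: one solves µ_t(g) = a for the multiplier t (here by bisection, since the tilted mean increases with t) and checks that any other measure with the same mean has larger relative entropy. A sketch with made-up data:

```python
import math

# Reference measure mu on E = {0, 1, 2}, observable g(x) = x, target mean a.
# All numbers are illustrative assumptions.
mu = [1/3, 1/3, 1/3]
g = [0.0, 1.0, 2.0]
a = 1.5

def tilt(t):
    """The exponentially tilted measure mu_t(x) proportional to exp(t g(x)) mu(x)."""
    w = [m * math.exp(t * x) for m, x in zip(mu, g)]
    Z = sum(w)
    return [v / Z for v in w]

def mean(nu):
    return sum(p * x for p, x in zip(nu, g))

# Solve mean(tilt(t)) = a by bisection.
lo, hi = -20.0, 20.0
for _ in range(100):
    t = (lo + hi) / 2
    if mean(tilt(t)) < a:
        lo = t
    else:
        hi = t
mu_t = tilt(t)

def D(p, q):
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# Any other nu with nu(g) = a has larger relative entropy to mu, e.g.:
nu = [0.25, 0.0, 0.75]           # also has mean 1.5
print(D(mu_t, mu) <= D(nu, mu))  # True
```

Solving for the multiplier t is the Lagrange step; the comparison at the end is exactly the minimality asserted in (4.3).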
In Statistical Mechanics, the measures µ_t of the form above are called Gibbs distributions, and the preceding result (or a suitable extension) justifies that these are indeed the equilibrium distributions of physical systems satisfying a finite number of conservation laws. In Mathematical Statistics, such classes of probability measures are called exponential families. Here are some familiar examples from probability theory.

Asymptotics governed by entropy
We will now turn to the dynamical aspects of the second law of thermodynamics. As before, we will not enter into a physical discussion of this fundamental law. Rather we will show by examples that the principle of increasing entropy (or decreasing relative entropy) stands also behind a number of well-known facts of probability theory.
Our first example is the so-called ergodic theorem for Markov chains. Let E be a finite set and P_t = e^{tG}, t ≥ 0, the transition semigroup of a continuous-time Markov chain on E. The generator G is assumed to be irreducible. It is well known that there is then a unique invariant distribution µ (satisfying µP_t = µ for all t ≥ 0 and, equivalently, µG = 0). Let ν be any initial distribution, and ν_t = νP_t the distribution at time t. Consider the relative entropy D(ν_t | µ) as a function of time t ≥ 0. A short computation (using the identities µG = 0 and G1 = 0) then gives the following result:
(5.1) Entropy production of Markov chains: For any t ≥ 0 we have
d/dt D(ν_t | µ) = −a(ν_t) ∑_{x,y∈E} ν̌_t(x, y) ϕ( ν̂_t(x, y)/ν̌_t(x, y) ) ≤ 0,
and in particular d/dt D(ν_t | µ) < 0 when ν_t ≠ µ. In the above, Ḡ(y, x) = µ(x) G(x, y)/µ(y) is the generator of the time-reversed chain, ϕ(s) = 1 − s + s log s ≥ 0 for s ≥ 0, a(ν) = −∑_{x∈E} ν(x) G(x, x) > 0, and the probability measures ν̂ and ν̌ on E × E are defined by ν̂(x, y) = ν(x) G(x, y)(1 − δ_{x,y})/a(ν) and ν̌(x, y) = ν(y) Ḡ(y, x)(1 − δ_{x,y})/a(ν), x, y ∈ E. The second statement follows from the fact that 1 is the unique zero of ϕ, and G is irreducible. A detailed proof can be found in Chapter I of Spitzer [36]. The discrete-time analogue was apparently discovered repeatedly by various authors; it appears e.g. in [32] and on p. 98 of [23].
The entropy production formula above states that the relative entropy D( • | µ) is a strict Lyapunov function for the fixed-time distributions ν_t of the Markov chain. Hence ν_t → µ as t → ∞. This is the well-known ergodic theorem for Markov chains, and the preceding argument shows that this convergence result fits precisely into the physical picture of convergence to equilibrium.
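The Lyapunov property is easy to watch in the discrete-time analogue mentioned above, where D(νP | µ) ≤ D(ν | µ) for any transition matrix P with invariant distribution µ. A sketch with an illustrative doubly stochastic chain (so that µ is uniform):

```python
import math

# A made-up doubly stochastic 3-state transition matrix; mu is uniform.
P = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
mu = [1/3, 1/3, 1/3]

def step(nu):
    """One transition: nu -> nu P."""
    return [sum(nu[x] * P[x][y] for x in range(3)) for y in range(3)]

def D(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

nu = [1.0, 0.0, 0.0]   # start far from equilibrium
ds = []
for _ in range(10):
    ds.append(D(nu, mu))
    nu = step(nu)
print(ds)   # strictly decreasing toward 0
```

The strict decrease for an irreducible aperiodic chain mirrors the strict inequality in (5.1) and yields ν_t → µ.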
Although the central limit theorem is a cornerstone of probability theory, it is often not realized that this theorem is also an instance of the principle of increasing entropy. (This is certainly due to the fact that the standard proofs do not use this observation.) To see this, let (X_i) be a sequence of i.i.d. centered random vectors in R^d with existing covariance matrix C, and consider the normalized sums S*_n = ∑_{i=1}^n X_i / √n. By the very definition, S*_n is again centered with covariance matrix C. But, as we have seen in Example (4.4), under these conditions the centered normal distribution µ_C with covariance matrix C has maximal differential entropy. This observation suggests that the relative entropy may again serve as a Lyapunov function. Unfortunately, a time-monotonicity of relative entropies seems to be unknown so far (though monotonicity along the powers of 2 follows from a subadditivity property). But the following statement is true.
(5.2) Entropic central limit theorem: Let ν_n be the distribution of S*_n. Then D(ν_n | µ_C) → 0 as n → ∞, provided D(ν_n | µ_C) is finite for some n. This theorem traces back to Linnik [27], whose result was put on firm grounds by Barron [1]. The multivariate version above is due to [21]. By an inequality of Pinsker, Csiszár, Kullback and Kemperman (cf. p. 133 of [11] or p. 58 of [5]), it follows that ν_n → µ_C in total variation norm (which is equal to the L¹-distance of their densities).
A similar result holds for sums of i.i.d. random elements X_i of a compact group G. Let µ_G denote the normalized Haar measure on G, and let ν_n be the distribution of X_1 · · · X_n, i.e., the n-fold convolution of the common distribution of the X_i. A recent result of Johnson and Suhov [22] then implies that D(ν_n | µ_G) ↓ 0 as n ↑ ∞, provided D(ν_n | µ_G) is ever finite. Note that µ_G is the measure of maximal entropy (certainly if G is finite or a torus), and that the convergence here is again monotone in time.
Our third example is intimately connected to Sanov's theorem (3.4). Suppose again (for simplicity) that E is finite, and let µ be a probability measure on E. Let C be a closed convex class of probability measures on E such that int C ≠ ∅. We consider the conditional probability
µ^n_C = µ^n( • | L^•_n ∈ C )
under the product measure µ^n, given that the empirical distribution belongs to the class C. (By Sanov's theorem, this condition has positive probability when n is large enough.) Do these conditional probabilities converge to a limit? According to the interpretation of Sanov's theorem, the most probable realizations ω are those for which D(L^ω_n | µ) is as small as possible under the constraint L^ω_n ∈ C. But we have seen above that there exists a unique probability measure µ* ∈ C minimizing D( • | µ), namely the I-projection of µ on C. This suggests that, for large n, µ^n_C concentrates on configurations ω for which L^ω_n is close to µ*. This and even more is true, as was shown by Csiszár [4].
(5.3) Csiszár's conditional limit theorem: For closed convex C with non-empty interior and each k ≥ 1, the projection of µ^n_C onto E^k converges to µ*^k as n → ∞, where µ* is the I-projection of µ on C.
Note that the limit is again determined by the maximum entropy principle. It is remarkable that this result follows from purely entropic considerations. Writing ν_{C,n} = µ^n_C(L^•_n) for the mean conditional empirical distribution (which by symmetry coincides with the one-dimensional marginal of µ^n_C), Csiszár [4] observes that
(1/n) D(µ^n_C | µ*^n) = −(1/n) log µ^n(L^•_n ∈ C) + ν_{C,n}( log(µ/µ*) ) ≤ −(1/n) log µ^n(L^•_n ∈ C) − D(µ* | µ).
The inequality can be derived from the facts that ν_{C,n} ∈ C by convexity and µ* is the I-projection of µ on C. Now, by Sanov's theorem, the right-hand side tends to zero, and hence so does the left-hand side. In view of the superadditivity properties of relative entropy, it follows that for each k ≥ 1 the projection of µ^n_C onto E^k converges to µ*^k, and one arrives at (5.3).
The preceding argument is completely general: Csiszár's original paper [4] deals with the case when E is an arbitrary measurable space. In fact, some modifications of the argument even allow one to replace the empirical distribution L^ω_n by the so-called empirical process; this will be discussed below in (6.7).
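In the simplest case, Csiszár's limit theorem can even be checked by exact computation. Take E = {0, 1}, µ the fair coin, and C = {ν : ν(1) ≥ 0.7}, so that the I-projection is µ* = (0.3, 0.7); the first marginal of µ^n_C can be written in terms of binomial tails (an illustration of mine, not from the text):

```python
import math

# E = {0, 1}, mu = fair coin, C = {nu : nu(1) >= 0.7}; the I-projection of mu
# on C is mu* = (0.3, 0.7).  We compute the first marginal of mu^n_C exactly.
n, k = 200, 140   # condition: at least k = 0.7 n ones among n tosses

def log_comb(n, m):
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

def prob_at_least(n, k):
    """P(Binomial(n, 1/2) >= k)."""
    return sum(math.exp(log_comb(n, m) - n * math.log(2)) for m in range(k, n + 1))

# P(omega_1 = 1 | L_n in C) = (1/2) P(Bin(n-1, 1/2) >= k-1) / P(Bin(n, 1/2) >= k)
marginal = 0.5 * prob_at_least(n - 1, k - 1) / prob_at_least(n, k)
print(marginal)   # close to mu*(1) = 0.7
```

Already at n = 200 the conditional marginal sits within a fraction of a percent of µ*(1), in line with (5.3).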

Entropy density of stationary processes and fields
Although occasionally we already considered sequences of i.i.d. random variables, our main concern so far was the entropy and relative entropy of (the distribution of) a single random variable with values in E. In this last section we will recall how the ideas described so far extend to the set-up of stationary stochastic processes, or stationary random fields; our emphasis here is on the non-independent case.
Let E be a fixed state space. For simplicity we assume again that E is finite. We consider the product space Ω = E^{Z^d} for any dimension d ≥ 1. For d = 1, Ω is the path space of an E-valued process, while for larger dimensions Ω is the configuration space of an E-valued random field on the integer lattice. In each case, the process or field is determined by a probability measure µ on Ω. We will assume throughout that all processes or fields are stationary resp. translation invariant, in the sense that µ is invariant under the shift group (ϑ_x)_{x∈Z^d} acting on Ω in the obvious way.
In this setting it is natural to consider the entropy or relative entropy per unit time resp. per lattice site, rather than the (total) entropy or relative entropy. (In fact, D(ν | µ) is infinite in all interesting cases.) The basic result on the existence of the entropy density is the following. In its statement, we write Λ ↑ Z^d for the limit along an arbitrary increasing sequence of cubes exhausting Z^d, µ_Λ for the projection of µ onto E^Λ, and ω_Λ for the restriction of ω ∈ Ω to Λ.
(6.1) Shannon-McMillan theorem: The entropy density
h(µ) = lim_{Λ↑Z^d} |Λ|^{-1} H(µ_Λ)
exists, and for the integrands we have
−lim_{Λ↑Z^d} |Λ|^{-1} log µ_Λ(ω_Λ) = h( µ( • | I)(ω) )
for µ-almost all ω and in L¹(µ). Here µ( • | I)(ω) is a regular version of the conditional probability with respect to the σ-algebra I of shift-invariant events in Ω.
For a proof we refer to Section 15.2 of [13] (and the references therein), and Section I.3.1 of [11]. In the case of a homogeneous product measure µ = α^{Z^d} we have h(µ) = H(α).
In view of Boltzmann's interpretation (1.3) of entropy, h(µ) is a measure of the lack of knowledge about the process or field per unit time resp. per site. Also, the L¹-convergence result of McMillan immediately implies an asymptotic equipartition property analogous to (2.3), whence h(µ) is also the optimal rate of a block code, and thus the information per signal of the stationary source described by µ (provided we take the logarithm to base 2).
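For a concrete non-product example in d = 1, (6.1) can be checked for a stationary two-state Markov chain, whose entropy density is the familiar conditional entropy rate −∑ π(x) P(x, y) log P(x, y) (a standard fact, used here as an assumption of this sketch; all numbers are made up):

```python
import math
from itertools import product

# A stationary two-state Markov chain (the d = 1 case).
P = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]   # stationary distribution: pi P = pi

def block_entropy(n):
    """H(mu_Lambda) for Lambda = {1, ..., n}, by summing over all words."""
    H = 0.0
    for w in product((0, 1), repeat=n):
        p = pi[w[0]]
        for x, y in zip(w, w[1:]):
            p *= P[x][y]
        if p > 0:
            H -= p * math.log(p)
    return H

# Entropy density = conditional entropy rate of the chain.
rate = -sum(pi[x] * P[x][y] * math.log(P[x][y]) for x in (0, 1) for y in (0, 1))
for n in (2, 6, 12):
    print(n, block_entropy(n) / n, rate)   # the ratio approaches the rate
```

For a Markov chain the block entropies satisfy H(µ_n) = H(µ_1) + (n − 1) h, so the convergence of H(µ_n)/n to h is exact up to an O(1/n) term.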
What about the existence of a relative entropy per unit time or per site? Here we need to assume that the reference process has a nice dependence structure, which is also important in the context of the maximum entropy problem.
Let f : Ω → R be any function depending only on the coordinates in a finite subset ∆ of Z^d. Such a function will be called local. A probability measure µ on Ω is called a Gibbs measure for f if its conditional probabilities for observing a configuration ω_Λ in a finite region Λ ⊂ Z^d, given a configuration ω_{Λ^c} outside of Λ, are almost surely given by the formula
µ( ω_Λ | ω_{Λ^c} ) = Z_Λ(ω_{Λ^c})^{-1} exp[ ∑_{z∈Z^d: (∆+z)∩Λ ≠ ∅} f(ϑ_z ω) ],
where Z_Λ(ω_{Λ^c}) is the normalization constant. Since f is local, each Gibbs measure µ is Markovian in the sense that the conditional probabilities above only depend on the restriction of ω_{Λ^c} to a bounded region around Λ. (This assumption of finite range could be weakened, but here is no place for this.) The main interest in Gibbs measures comes from their use for describing systems of interacting spins in equilibrium, and the analysis of phase transitions; a general account can be found in Georgii [13], for example. (To make the connection with the definition given there, let the potential Φ be defined as in Lemma (16.10) of this reference.) In the present context, Gibbs measures simply show up because of their particular dependence properties. We can now state the following counterpart to (6.1).
(6.2) Ruelle-Föllmer theorem: Suppose µ is a Gibbs measure for some local function f, and ν is translation invariant. Then the relative entropy density
(6.3) d(ν | µ) = lim_{Λ↑Z^d} |Λ|^{-1} D(ν_Λ | µ_Λ)
exists and is equal to p(f) − h(ν) − ν(f), where
(6.4) p(f) = lim_{Λ↑Z^d} |Λ|^{-1} log Z_Λ(ω_{Λ^c}) = sup{ h(ν) + ν(f) : ν translation invariant },
the so-called pressure of f, is the counterpart of the log-Laplace functional appearing in (4.3).
The second identity in (6.4) is often called the variational formula; it dates back to Ruelle [33]. Föllmer [10] made the connection with relative entropy; for a detailed account see also Theorem (15.30) of [13] or Section I.3.3 of [11]. An example of a non-Gibbsian µ for which d( • | µ) fails to exist was constructed by Kieffer and Sokal, see pp. 1092-1095 of [9]. As in (6.1), there is again an L¹(ν) and ν-almost sure convergence behind (6.3) [10]. In the case f = 0, when the unique Gibbs measure µ is equal to α^{Z^d} for the equidistribution α on E, the Ruelle-Föllmer theorem (6.2) reduces to (6.1).
Since D(ν | µ) = 0 if and only if ν = µ, the preceding result leads us to ask what one can conclude from the identity d(ν | µ) = 0. The answer is the following celebrated variational characterization of Gibbs measures, first derived by Lanford and Ruelle [25]. Simpler proofs were given later by Föllmer [10] and Preston, Theorem 7.1 of [30]; cf. also Section 15.4 of [13], or Theorem (I.3.39) of [11].
(6.5) Variational principle: Suppose ν is stationary. Then ν is a Gibbs measure for f if and only if h(ν) + ν(f) is equal to its maximum value p(f).
Physically speaking, this result means that the stationary Gibbs measures are the minimizers of the free energy density ν(−f ) − h(ν), and therefore describe a physical system with interaction f in thermodynamic equilibrium.
It is now easy to obtain an analogue of the Gibbs-Jaynes principle (4.3). Let g : Ω → R^d be any vector-valued local function whose range g(Ω) is not contained in a hyperplane. Then for all a ∈ R^d we have, in analogy to (4.2),
j_g(a) := sup_{t∈R^d} [ t · a − p(t · g) ] = −sup{ h(ν) : ν translation invariant, ν(g) = a },
which together with (6.5) gives us the following result, cf. Section 4.3 of [14].
(6.6) Gibbs-Jaynes principle for the entropy density: Suppose a ∈ R^d is such that j_g is finite on a neighborhood of a, and let ν be translation invariant. Then h(ν) is maximal under the constraint ν(g) = a if and only if ν is a Gibbs measure for t_a · g, where t_a = ∇j_g(a).
The next topic to be discussed is the convergence to stationary measures of maximal entropy density. The preceding Gibbs-Jaynes principle suggests that an analogue of Csiszár's conditional limit theorem (5.3) might hold in the present setting. This is indeed the case, as was proved by Deuschel-Stroock-Zessin [8], Georgii [14], and Lewis-Pfister-Sullivan [26] using suitable extensions of Sanov's theorem (3.4). We state the result only in the most interesting particular case.
(6.7) The equivalence of ensembles: Let C ⊂ R^d be closed and such that inf_{b∈C} j_g(b) = j_g(a) for a unique a ∈ C having the same property as in (6.6). For any cube Λ in Z^d let ν_{Λ,C} be the uniform distribution on the set
{ ω ∈ E^Λ : |Λ|^{-1} ∑_{x∈Λ} g(ϑ^per_x ω) ∈ C },
where ϑ^per_x is the periodic shift of E^Λ defined by viewing Λ as a torus. (The assumptions imply that this set is non-empty when Λ is large enough.) Then, as Λ ↑ Z^d, each (weak) limit point of the measures ν_{Λ,C} is a Gibbs measure for t_a · g.
In Statistical Mechanics, the equidistributions of the type ν_{Λ,C} are called microcanonical Gibbs distributions, and "equivalence of ensembles" is the classical term for their asymptotic equivalence with the (grand canonical) Gibbs distributions considered before. A similar result holds also in the context of point processes, and thus applies to the classical physical models of interacting molecules [15].
Finally, we want to mention that the entropy approach (5.1) to the convergence of finite-state Markov chains can also be used for the time evolution of translation invariant random fields. For simplicity let E = {0, 1} and thus Ω = {0, 1}^{Z^d}. We define two types of continuous-time Markov processes on Ω which admit the Gibbs measures for a given f as reversible measures. These are defined by their pregenerator G acting on local functions g as
Gg(ω) = ∑_{x∈Z^d} c(x, ω) [ g(ω^x) − g(ω) ]   resp.   Gg(ω) = ∑_{x,y∈Z^d: |x−y|=1} c(x, y, ω) [ g(ω^{xy}) − g(ω) ].
Here ω^x ∈ Ω is defined by ω^x_x = 1 − ω_x, ω^x_y = ω_y for y ≠ x, and ω^{xy} is the configuration in which the values at x and y are interchanged. Under mild locality conditions on the rate function c the corresponding Markov processes are uniquely defined. They are called spin-flip or Glauber processes in the first case, and exclusion or Kawasaki processes in the second case. The Gibbs measures for f are reversible stationary measures for these processes as soon as the rate function satisfies the detailed balance condition that c(x, ω) exp[ ∑_{z: x∈∆+z} f(ϑ_z ω) ] does not depend on ω_x, resp. an analogous condition in the second case. The following theorem is due to Holley [17, 18]; for streamlined proofs and extensions see [29, 12, 38].
(6.8) Holley's theorem: For any translation-invariant initial distribution ν on Ω, the negative free energy h(ν_t) + ν_t(f) is strictly increasing in t as long as the time-t distribution ν_t is not a Gibbs measure for f. In particular, ν_t converges to the set of Gibbs measures for f.
This result is just another instance of the principle of increasing entropy. For similar results in the non-reversible case see [24, 28] and the contribution of C. Maes to this volume.
Let me conclude by noting that the results and concepts of this section serve also as a paradigm for ergodic theory. The set Ω is then replaced by an arbitrary compact metric space with a µ-preserving continuous Z^d-action (ϑ_x)_{x∈Z^d}. The events in a set Λ ⊂ Z^d are those generated by the transformations (ϑ_x)_{x∈Λ} from a generating partition of Ω.
The entropy density h(µ) then becomes the well-known Kolmogorov-Sinai entropy of the dynamical system (µ, (ϑ_x)_{x∈Z^d}). Again, h(µ) can be viewed as a measure of the inherent randomness of the dynamical system, and its significance comes from the fact that it is invariant under isomorphisms of dynamical systems. Measures of maximal Kolmogorov-Sinai entropy play again a key role. It is quite remarkable that the variational formula (6.4) holds also in this general setting, provided the partition functions Z_Λ(ω_{Λ^c}) are properly defined in terms of f and the topology of Ω. p(f) is then called the topological pressure, and p(0) is the so-called topological entropy, describing the randomness of the action (ϑ_x)_{x∈Z^d} in purely topological terms. All this is discussed in more detail in the contributions by Keane and Young to this volume.