On entropy, entropy-like quantities, and applications

This is a review on entropy in various fields of mathematics and science. Its scope is to convey a unified vision of the classical as well as some newer entropy notions to a broad audience with an intermediate background in dynamical systems and ergodic theory. Due to the breadth and depth of the subject, we have opted for a compact exposition whose contents are a compromise between conceptual import and instrumental relevance. The intended technical level and the space limitation furthermore bore upon the final selection of topics, which cover the three items named in the title. Specifically, the first part is devoted to the avatars of entropy in the traditional contexts: many-particle physics, information theory, and dynamical systems. This chronological order helps to present the materials in a didactic manner. The axiomatic approach will also be considered at this stage to show that, quite remarkably, the essence of entropy can be encapsulated in a few basic properties. Inspired by the classical entropies, further kindred quantities have been proposed over time, mostly aimed at specific needs. A common denominator of those addressed in the second part of this review is their major impact on research. The final part shows that, along with its profound role in the theory, entropy has interesting practical applications beyond information theory and communications technology. For this purpose we have preferred examples from applied mathematics, although there are certainly nice applications in, say, physics, computer science and even social sciences. This review concludes with a representative list of references.


Introduction
Entropy is a general concept that appears in different settings with different meanings. Thus, the Boltzmann-Gibbs entropy measures the microscopic disorder in statistical mechanics; the Shannon entropy measures uncertainty and compression in information theory; the Kolmogorov-Sinai entropy measures (pseudo-)randomness in measure-preserving dynamical systems; and the topological entropy measures complexity in topological dynamics. As for its importance, entropy is the protagonist of the second law of thermodynamics, associated by some authors with the arrow of time. In information theory, coding theory, and cryptography, Shannon entropy lies at the core of the fundamental definitions (information, typicality, channel capacity,...) and results (asymptotic equipartition, channel coding theorem, channel capacity theorem, conditions for secure ciphers,...). And in ergodic theory, Kolmogorov-Sinai and topological entropy are perhaps the most important invariants of metric and topological conjugacy, respectively, which are the equivalence concepts in measure-preserving and topological dynamics.
Also remarkable is a sort of universality that entropy possesses. By this we mean that other indicators of, say, compressibility or complexity turn out to be related to, or even coincide with some of the standard entropies. This is what happens, for example, with algorithmic complexity in computer science, Lempel-Ziv complexity in information theory, finitely observable invariants for the class of all finitely-valued ergodic processes in the theory of stochastic processes (see Theorem 19 below), and permutation entropy in dynamical systems. Shannon himself characterized his function as the only one that satisfies three basic properties or axioms, which would explain its uniqueness [1]. Later on, the 'entropies' introduced by Rényi [2], Havrda-Charvát [3], and Tsallis [4] drew the researchers' attention to other solutions under weaker conditions or different sets of axioms. The suppression of the fourth Shannon-Khinchin axiom alone ("the entropy of a system -split into subsystems A and B- equals the entropy of A plus the expectation value of the entropy of B, conditional on A") opens the door to a broad variety of entropy-like quantities [5], called generalized entropies, which include the aforementioned Rényi, Havrda-Charvát and Tsallis entropies.
History shows that statistical mechanics, information theory, and dynamical systems build a circle of ideas. First of all, the theory of dynamical systems is an outgrowth of statistical mechanics, from which it borrowed the fundamental concepts: state space (read phase space), orbit (read trajectory), invariant measure (read Liouville measure), ergodicity (read Boltzmann's Ergodensatz), stationary state (read equilibrium state),... and entropy! Independently, Shannon created information theory in a purely probabilistic setting. Indeed, information is a function of a probability distribution and information sources are stationary random processes. Yet, the solution of the famous Maxwell's demon paradox in statistical mechanics (in which a microscopic 'demon' perverts the order established by the second law of thermodynamics) required the intervention of information theory in the form of Landauer's principle [6]. Furthermore, Shannon's brainchild inspired Kolmogorov's work on Bernoulli shifts. In return, information theory has benefited from symbolic dynamics. At the intersection of these three fields, amid past and on-going cross-pollination among them, lie ergodic theory and entropy.
In the last decades new versions of entropy have come to the fore. Approximate entropy [7,8], directional entropy [9], (ε, τ)-entropy [10], permutation entropy [11], sample entropy [12], transfer entropy [13], ... are some of the entropy-like quantities proposed by researchers to cope with new challenges in time series analysis, cellular automata, chaos, synchronization, multiscale analysis, etc. Their relationship to the classical mathematical entropies (i.e., the Shannon, Kolmogorov-Sinai, and topological entropies) is diverse. Thus, some of them turn out to be equivalent to a classical counterpart (e.g., metric and topological permutation entropy), or are defined with their help (e.g., transfer entropy, a conditional mutual information). In other cases, a new version can be considered a generalization of a classical one (e.g., directional entropy, (ε, τ)-entropy). Still in other cases, it is a convenient approximation devised for a specific purpose (e.g., approximate and sample entropy). We sidestep the question of whether any one of them is an 'entropy' or rather an 'entropy-like' quantity, and use the word entropy in a wide sense. In any case, along with these new proposals, the conventional entropies continue to play a fundamental role in ergodic theory and applications. Whenever it comes to capturing the elusive concepts of disorder, information (or ignorance), randomness, complexity, irregularity, ..., whether in mathematics, natural sciences, or social sciences, entropy has proved to be unrivaled.
In the remaining sections we are going to zoom in on most of the topics mentioned above. At the same time we have tried to make our exposition reasonably self-contained in regard to the basic concepts and results. For details and more advanced materials the reader will be referred to the literature. But beyond the formal issues, this review should be more than just a guided visit to the most accessible facets of entropy. It should also show that entropy, in whichever of its variants, is an exciting and useful mathematical object.

A panorama of entropy
This section is an overview of entropy in the most traditional settings. For a more advanced account, which also includes interesting historical information, the reader is referred to the excellent review [14] and the references therein.

Entropy in thermodynamics and statistical mechanics
The word entropy was coined by the German physicist R. Clausius (1822-1888) [15], who introduced it in thermodynamics in 1865 to measure the amount of energy in a system that cannot produce work. The fact that the entropy increases in irreversible adiabatic processes that take one equilibrium state to another constitutes the second law of thermodynamics and clearly shows the central role of entropy in the physics of macroscopic bodies. The asymmetry in time introduced by this law has been associated by some authors with the arrow of time. Once the atomic nature of matter was discovered, physicists set out to find the microscopic counterparts of all macroscopic quantities in general, and entropy in particular. Think, for example, of one mole of an isolated gas in equilibrium (e.g., one gram of hydrogen). Its macroscopic state or macrostate is determined by just two parameters, say its volume V and temperature T. Microscopically the system consists of N ≃ 6.022 × 10^23 molecules moving (in a first approximation) under the laws of classical point-mass mechanics. Alternatively one can think of the system as a point describing a curve in a region Ω ⊂ R^6N, called the phase space, spanned by the coordinates and momenta of the N particles. It should be clear that the huge number of particles makes it an impracticable task to solve the equations of motion, while allowing a statistical approach. Let W be the number of microscopic states or microstates (as determined by, e.g., the initial positions and momenta of the particles) consistent with the imposed macroscopic constraints (V = V_0 and T = T_0 in our example). For simplicity (and in accordance with quantum physics!), we assume here that W is finite.
In fact, the discretization of the state space was used by Max Planck as a working hypothesis (following previous ideas of Boltzmann and before Heisenberg's uncertainty principle justified it) to derive the radiation spectrum of the black body, thus ushering in the new era of quantum mechanics. The Boltzmann entropy of the system is then

S = k_B ln W,   (1)

where k_B = 1.3807 × 10^−23 J/K is called the Boltzmann constant. The logarithm in (1) is due to the fact that the entropy is additive with respect to extensive quantities such as volume, whereas the number of microstates is multiplicative. We will use natural logarithms throughout, although the base of the logarithm is not important as long as one sticks to the same choice in a calculation; a change of the logarithm base entails just a rescaling of the entropy, i.e., a change of units.
Gibbs generalized Boltzmann's formula to systems in which the constituent particles may fluctuate among states with different energies. This more general scenario is made possible by setting the system in contact with a thermal reservoir; the energy is then conserved in the composite system. Gibbs' formula for a discrete set of microstates is

S = −k_B ∑_{i=1}^W p_i ln p_i,   (2)

where p_i is the probability (i.e., the asymptotic fraction of time) that the system is in the microstate i with energy ε_i, subject to the constraint

∑_{i=1}^W p_i ε_i = U,   (3)

where U is the internal energy of the system. If p_i = 0 one sets 0 · ln 0 := 0 in (2).
Note that Gibbs' entropy reduces to Boltzmann's entropy for equiprobable microstates, the so-called microcanonical ensemble. The set of microstates consistent with the restriction (3) (along with ∑_{i=1}^W p_i = 1) is called the canonical ensemble. Regarding the microcanonical ensemble as the actual probability distribution amounts to Boltzmann's Ergodensatz (ergodic hypothesis): the trajectory of a closed system in the phase space is dense, hence the time spent in a region is proportional to its volume. In general, the ergodic hypothesis is not true in physical systems.
In turn, the Tsallis entropy [4],

S_q = k_B (1 − ∑_{i=1}^W p_i^q) / (q − 1)  (q > 0, q ≠ 1),   (4)

generalizes the Boltzmann-Gibbs entropy S, Equation (2), in the sense that lim_{q→1} S_q = S. Let us mention at this point that Havrda and Charvát had previously introduced in [3] a family of entropies that differs from (4) only in a factor that depends on q. This is the reason for the name Havrda-Charvát-Tsallis entropy used by some authors.
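These limits can be checked numerically. The following sketch (with a made-up three-state distribution; entropies in units of k_B) compares the Tsallis entropy S_q for q close to 1 with the Boltzmann-Gibbs entropy S:

```python
import math

# Made-up three-state probability distribution (any probability vector works).
p = [0.5, 0.3, 0.2]

def gibbs_entropy(p):
    """Boltzmann-Gibbs entropy S = -sum_i p_i ln p_i (in units of k_B)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsallis_entropy(p, q):
    """Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1), q > 0, q != 1 (units of k_B)."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

S = gibbs_entropy(p)
for q in (1.1, 1.01, 1.001):
    print(q, tsallis_entropy(p, q))   # approaches S as q -> 1
print(S)
```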
A fundamental property of (2) and (4) is concavity with respect to the variables p i , 1 ≤ i ≤ W . The concavity of entropy and, in general, of all thermodynamical potentials (such as internal energy, free energy, and chemical potential) guarantees the stability of macroscopic bodies.

Entropy in classical information theory
In 1948 the word entropy appeared as well in the foundational papers of Shannon on information theory, coding theory and cryptography [1]. According to [16], it was von Neumann who, aware that Shannon's measure of information was formally the same as the Boltzmann-Gibbs entropy, proposed that he call it entropy. Indeed, let (Ω, B, µ) be a probability space and X a random variable on (Ω, B, µ) with outcomes in a finite set Γ ⊂ R. This means that (i) Ω is a non-empty set, (ii) B is a σ-algebra of subsets of Ω, (iii) µ is a probability measure on the measurable space (Ω, B), and (iv) X : Ω → Γ is a measurable map, Γ being endowed with the discrete σ-algebra (i.e., all subsets of Γ are measurable). The probability mass function of X is given by p(x) = µ(X = x) := µ(X^−1(x)), x ∈ Γ.

Definition 1. The (Shannon) entropy of X is

H(X) = − ∑_{x∈Γ} p(x) ln p(x).
Therefore, H(X) can be viewed as the expected value of the random variable I(X) = − ln p(X), called the information function, and interpreted as the average information gained by knowing the outcome of the random variable X. An alternative interpretation is that H(X) measures the uncertainty about the outcome of X; see [17,Sect. 1.3] for other interpretations. Note that H(X) is actually a function of p(x). The range Γ of X is usually called alphabet in information theory. The entropy of random variables with Γ = R will be considered in Sect. 4.2 (see Equation (70)).
Another basic, entropy-like quantity is the conditional entropy of two uni- or multivariate random variables,

H(X|Y) = − ∑_{x∈Γ_X} ∑_{y∈Γ_Y} p(x, y) ln p(x|y),   (6)

where p(x|y) = p(x, y)/p(y) is the conditional probability of X = x given Y = y, and Γ_X and Γ_Y are the (finite) alphabets of X and Y, respectively. Their mutual information is

I(X; Y) = ∑_{x∈Γ_X} ∑_{y∈Γ_Y} p(x, y) ln [p(x, y) / (p(x) p(y))].   (7)

From (6) and (7) it follows that

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).   (8)

Therefore, I(X; Y) is the reduction in the uncertainty of X due to the knowledge of Y or, by symmetry, the average amount of information that one of the variables conveys about the other.
In particular, I(X; Y) ≥ 0, with equality if and only if X and Y are independent.
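The interplay between entropy and mutual information can be illustrated with a short computation; the joint distribution p(x, y) below is made up for illustration, and the identity I(X; Y) = H(X) + H(Y) − H(X, Y) used here is equivalent to the formulas above:

```python
import math
from collections import defaultdict

# Made-up joint distribution p(x, y) on a 2 x 2 alphabet.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Shannon entropy (in nats) of a pmf given as a dict of probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Marginal distributions of X and Y.
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in pxy.items():
    px[x] += p
    py[y] += p

# Mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y).
I = H(px) + H(py) - H(pxy)
print(I)   # strictly positive, since X and Y are dependent here
```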
Consider an information (or data) source that outputs letters, one per unit of time, from a finite alphabet Γ. Formally, such an information source is a discrete-time, stationary stochastic process X = (X_t)_{t∈T}, where T = Z or T = N_0 = {0, 1, ...} and the X_t are identically distributed random variables on a probability space (Ω, B, µ) with alphabet Γ. A realization of X is then a sequence (x_t)_{t∈T} = (X_t(ω))_{t∈T} ∈ Γ^T, ω ∈ Ω, called a message. A finite segment of a message, say, x_t^{t+n−1} := (x_t, ..., x_{t+n−1}), is called a word of length n, and its probability of being output is the joint probability p(x_t^{t+n−1}).
In information theory one takes usually T = N 0 because physical data sources must be turned on at some finite time.
Definition 3. The (Shannon) entropy rate (or just entropy) of the data source X is

h(X) = lim_{n→∞} (1/n) H(X_1, ..., X_n).   (9)

From the chain rule in Lemma 2(5) and the stationarity of X, one obtains the useful formula [18]

h(X) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_1).   (10)

Since the conditional entropies H(X_n | X_{n−1}, ..., X_1) build a decreasing sequence of nonnegative numbers, the convergence of (9) and (10) follows. h(X) measures the average information per time unit conveyed by the messages of the information source X. Typical examples of information sources are stationary, finite-state Markov chains, with state space Γ, or a function thereof whose domain is the state space of the chain and whose range is a subset of Γ.
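For a stationary Markov source, the limits (9) and (10) can be computed in closed form and compared with the block entropies H(X_1, ..., X_n)/n. A sketch (the two-state transition matrix P and its stationary distribution are made up for illustration):

```python
import math
import itertools

# Made-up two-state Markov source: transition matrix P (rows sum to 1)
# and its stationary distribution pi (pi P = pi).
P = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]

# Closed-form entropy rate of a stationary Markov chain:
# h = -sum_i pi_i sum_j P_ij ln P_ij.
h = -sum(pi[i] * P[i][j] * math.log(P[i][j]) for i in range(2) for j in range(2))

def block_entropy(n):
    """H(X_1, ..., X_n) from the joint probabilities of all words of length n."""
    total = 0.0
    for w in itertools.product(range(2), repeat=n):
        p = pi[w[0]]
        for a, b in zip(w, w[1:]):
            p *= P[a][b]
        total -= p * math.log(p)
    return total

for n in (2, 6, 10):
    print(n, block_entropy(n) / n)   # decreases toward h
print(h)
```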
The multivariate information function I(X_1, ..., X_n) = − ln p(X_1, ..., X_n) and h(X) are related through the so-called asymptotic equipartition principle, meaning that 'typical' messages have roughly the same probability. We say that an information source X is ergodic if the long-run relative frequency of any word in a message converges stochastically to the probability of that word. Historically, Boltzmann used the ergodic hypothesis to derive the equipartition of energy in the kinetic theory of gases.

Theorem 4. (Shannon-McMillan-Breiman) If the information source X is ergodic, then

lim_{n→∞} (1/n) I(X_1, ..., X_n) = lim_{n→∞} −(1/n) ln p(X_1, ..., X_n) = h(X)

with probability 1.

If the random variables X_n are independent, then Theorem 4 is a straightforward consequence of the law of large numbers. As often happens in other situations, ergodicity can be viewed as a property generalizing that law.
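For an i.i.d. source the convergence asserted in Theorem 4 reduces to the law of large numbers and can be observed directly; a sketch with a made-up binary source (p(1) = 0.3):

```python
import math
import random

random.seed(0)
# Made-up i.i.d. binary source with p(1) = 0.3.
p1 = 0.3
H = -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))   # entropy rate

def empirical_rate(n):
    """-(1/n) ln p(x_1, ..., x_n) for one simulated message of length n."""
    logp = 0.0
    for _ in range(n):
        x = 1 if random.random() < p1 else 0
        logp += math.log(p1 if x == 1 else 1 - p1)
    return -logp / n

for n in (100, 10_000, 200_000):
    print(n, empirical_rate(n))   # converges to H by the law of large numbers
print(H)
```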
Besides pervading all the conceptual tools of information theory, entropy is instrumental in most applications, especially in such important ones as the minimal code length, maximal channel capacity, or cipher security. For brevity we will discuss next only the compression of information, which is at the heart of modern communications and also related to Theorem 4.
Compression is any procedure that reduces the data requirements of a message without, in principle, losing information. Suppose that code words w_1, ..., w_M of lengths L_1, ..., L_M, respectively, are assigned to the outcomes x_1, ..., x_M of a random variable X with probabilities p(x_1), ..., p(x_M). The code words are combinations of letters a_1, ..., a_D, usually 0, 1 (D = 2) in modern communications (or dot/dash in the Morse code!). Then the Huffman coding is a uniquely decipherable code that minimizes the average code-word length L̄ = ∑_{i=1}^M p(x_i) L_i, which according to the noiseless coding theorem is known to satisfy [17]

H(X) ≤ L̄ < H(X) + 1,   (11)

the logarithms in H(X) being taken to base D. This result allows us to interpret the entropy H(X) as a lower bound on the average compression of the symbols x_i with relative frequencies p(x_i). There are also algorithms for lossless data compression which, unlike the Huffman coding, achieve the entropy bound (11) asymptotically with no prior knowledge of the probabilities p(x_i). Examples of such 'universal compressors' are different algorithms due to A. Lempel and J. Ziv based on pattern matching [18,19].
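The bound (11) can be verified on a toy example by constructing a binary Huffman code explicitly via the classical greedy merge; the source probabilities below are made up for illustration:

```python
import heapq
import math

# Made-up source alphabet with probabilities.
probs = {'a': 0.45, 'b': 0.25, 'c': 0.15, 'd': 0.10, 'e': 0.05}

def huffman_lengths(probs):
    """Code-word lengths of a binary Huffman code (classical greedy merge)."""
    # Heap entries: (probability, tiebreaker, {symbol: current depth}).
    heap = [(p, i, {x: 0}) for i, (x, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {x: depth + 1 for x, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

L = huffman_lengths(probs)
avg_len = sum(probs[x] * L[x] for x in probs)         # average code-word length
H2 = -sum(p * math.log2(p) for p in probs.values())   # entropy in bits (D = 2)
print(H2, avg_len)   # noiseless coding theorem: H2 <= avg_len < H2 + 1
```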
Finally, let us mention that the Rényi entropy [2],

H_q(X) = (1/(1 − q)) ln ∑_{x∈Γ} p(x)^q  (q > 0, q ≠ 1),

generalizes the Shannon entropy in the sense that lim_{q→1} H_q(X) = H(X). As a function of the parameter q, H_q(X) is non-increasing. H_0(X) := lim_{q→0} H_q(X) = ln |Γ| is also called the Hartley entropy, H_2(X) = − ln ∑_{x∈Γ} p(x)^2 the collision or quadratic entropy, and H_∞(X) := lim_{q→∞} H_q(X) = − ln(max_{x∈Γ} p(x)) the min-entropy. The Rényi entropy, especially H_2(X), has been successfully applied in information theory (see [20] and references therein).
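The limits and the monotonicity in q just mentioned are easy to verify numerically; the distribution is made up for illustration:

```python
import math

# Made-up distribution; the properties below hold for any pmf.
p = [0.5, 0.25, 0.125, 0.125]

def renyi(p, q):
    """Renyi entropy H_q = (1/(1-q)) ln sum_x p(x)^q, q > 0, q != 1 (nats)."""
    return math.log(sum(pi ** q for pi in p)) / (1.0 - q)

shannon = -sum(pi * math.log(pi) for pi in p)
print(renyi(p, 0.999), shannon)             # H_q -> Shannon entropy as q -> 1
print(renyi(p, 0.001), math.log(len(p)))    # H_0 = ln |alphabet| (Hartley)
print(renyi(p, 2.0))                        # collision entropy -ln sum p^2
print(renyi(p, 100.0), -math.log(max(p)))   # H_inf = min-entropy -ln max p

# H_q is non-increasing as a function of q:
vals = [renyi(p, q) for q in (0.5, 0.9, 1.1, 2.0, 5.0)]
assert all(a >= b for a, b in zip(vals, vals[1:]))
```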

Entropy in measure-preserving dynamical systems
Entropy was introduced by Kolmogorov in ergodic theory in 1958 [21] as a metric invariant for Bernoulli shifts. Kolmogorov's proposal was inspired by Shannon's entropy; Shannon's work on information theory was well known to him. This way he was able to show that the (1/2, 1/2)-Bernoulli shift and the (1/3, 1/3, 1/3)-Bernoulli shift are not metrically isomorphic, because the entropy of the former is ln 2 while the entropy of the latter is ln 3 (see below for details). He asked then whether entropy is a complete isomorphism invariant for the Bernoulli shifts, i.e., whether two Bernoulli shifts with the same entropy are isomorphic. One year later, Sinai generalized the Kolmogorov invariant to general measure-preserving dynamical systems [22], and this is the definition of entropy used ever since (also called metric entropy, measure-theoretic entropy, or Kolmogorov-Sinai entropy). The question on the completeness of entropy for Bernoulli shifts was answered affirmatively by Ornstein in 1969 [23,24]. Let us point out that historically (before the advent of chaos theory) only invertible transformations (i.e., automorphisms) were considered. In fact, the metric entropy is not a complete invariant for non-invertible Bernoulli shifts [25].
To avoid technical subtleties, we assume hereafter (as usually done in ergodic theory) that the probability spaces considered are isomorphic to the disjoint union of countably many point masses and an interval [a, b] ⊂ R endowed with the Lebesgue measure. These are called Lebesgue spaces [25].
One says equivalently that T is a µ-preserving map or that µ is a T -invariant measure. The transformation T introduces the 'dynamics' in the state space Ω via its iterates T n , n being interpreted as discrete time. The sequence (T n (ω)) n∈T is the orbit of ω ∈ Ω.
Here C is the product σ-algebra generated by the cylinder sets (i.e., sets of sequences with a finite number of components fixed), σ is the (left-)shift transformation (x_t)_{t∈T} ↦ (x_{t+1})_{t∈T}, and m is a probability measure that makes σ measure-preserving. If T = Z the shift is called two-sided and σ is invertible, σ^−1 being the right shift (x_t)_{t∈Z} ↦ (x_{t−1})_{t∈Z}; if T = N_0 the shift is called one-sided and σ is k-to-1. Let (i) p = (p_1, ..., p_k) be a probability vector, (ii) P = (p_ij)_{1≤i,j≤k} a stochastic matrix (i.e., p_ij ≥ 0 and ∑_{j=1}^k p_ij = 1 for every i), and (iii) pP = p. The (p, P)-Markov shift is the shift system whose measure m is determined on cylinder sets by m(x_1 = i_1, ..., x_n = i_n) = p_{i_1} p_{i_1 i_2} ··· p_{i_{n−1} i_n}; in the special case p_ij = p_j for all i, the components of a sequence are independent and one speaks of the p-Bernoulli shift.

Given a finite partition α = {A_1, ..., A_k} of Ω, define the entropy of α as

H_µ(α) = − ∑_{i=1}^k µ(A_i) ln µ(A_i).

If β = {B_1, ..., B_l} is another partition of Ω, let α ∨ β denote the least common refinement of α and β, i.e., α ∨ β = {A_i ∩ B_j : 1 ≤ i ≤ k, 1 ≤ j ≤ l}. Least common refinements of more than two partitions are defined recursively. Sinai's definition of the entropy of (Ω, B, µ, T) proceeds by 'coarse-graining' the state space Ω with a partition α and refining α by the repeated action of T. Specifically, define the maps X_t^α : Ω → {1, ..., k} by

X_t^α(ω) = i if T^t(ω) ∈ A_i.

Then one can prove [19, Sect. 1.1.3] that X^α = (X_t^α)_{t∈T} is a stationary random process on (Ω, B, µ) with alphabet {1, ..., k}, called the symbolic dynamics of T with respect to the partition α. The entropy of T with respect to the partition α is then defined as the Shannon entropy of the information source X^α,

h_µ(T, α) = h(X^α).

Since H(X_1^α, ..., X_n^α) = H_µ(α^(n)), where

α^(n) = α ∨ T^−1 α ∨ ... ∨ T^−(n−1) α,   (16)

it follows that h_µ(T, α) = lim_{n→∞} (1/n) H_µ(α^(n)).

Definition 6. The (metric) entropy of (Ω, B, µ, T), or just the entropy of T if the underlying measure-preserving system is clear from the context, is

h_µ(T) = sup_α h_µ(T, α),

where the supremum is taken over all finite partitions α of Ω.
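Definition 6 suggests a crude numerical experiment: estimate h_µ(T, α) from the empirical block entropies of a long symbolic orbit. The sketch below does this for the logistic map T(x) = 4x(1 − x) with the partition α = {[0, 1/2), [1/2, 1]}, a standard example (not taken from this review) for which α is a generator and h_µ(T) = ln 2 with respect to the absolutely continuous invariant measure:

```python
import math
from collections import Counter

def symbolic_orbit(x, n):
    """Itinerary of x under the logistic map w.r.t. the partition at 1/2."""
    symbols = []
    for _ in range(n):
        symbols.append(0 if x < 0.5 else 1)
        x = 4.0 * x * (1.0 - x)
    return symbols

def block_entropy_rate(symbols, k):
    """Empirical H(X_1, ..., X_k)/k from the overlapping words of length k."""
    words = Counter(tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1))
    total = sum(words.values())
    H = -sum(c / total * math.log(c / total) for c in words.values())
    return H / k

s = symbolic_orbit(0.1234, 200_000)   # generic (made-up) initial condition
print(block_entropy_rate(s, 8), math.log(2))   # empirical rate close to ln 2
```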
A measure-preserving dynamical system (Ω, B, µ, T) is called ergodic if every T-invariant set is trivial, i.e., if T^−1 A = A for A ∈ B implies µ(A) = 0 or 1. One also says that T is ergodic with respect to µ, or that µ is an ergodic measure. Although this modern definition is a far cry from Boltzmann's ergodicity hypothesis (a property now called minimality), it is remarkable that the equality of time averages and phase space averages of observables follows as well from the celebrated Birkhoff ergodic theorem [26]. As a useful example, the (p, P)-Markov shift (either one-sided or two-sided) is ergodic if and only if P is irreducible (i.e., for all i, j there exists n > 0 with p_ij^(n) > 0, where p_ij^(n) is the (i, j)-entry of the matrix P^n). In particular, Bernoulli shifts are ergodic.
A finite partition γ = {G_1, ..., G_k} of Ω is called a generating partition, or just a generator of (Ω, B, µ, T), if B is the smallest σ-algebra containing all the sets T^−n G_{i_n}, n ∈ T, 1 ≤ i_n ≤ k.
Theorem 7. (The Kolmogorov-Sinai theorem [22]) If γ is a generator of the dynamical system (Ω, B, µ, T), then

h_µ(T) = h_µ(T, γ).   (18)

This theorem was first proved by Kolmogorov. For the p-Bernoulli shift, the partition by the value of the 0th coordinate is a generator, and Theorem 7 yields

h_m(σ) = − ∑_{i=1}^k p_i ln p_i.   (19)

In particular, the entropy of the (1/k, ..., 1/k)-Bernoulli shift is ln k.
By virtue of Theorem 7, generators allow us to establish a bridge between the Kolmogorov-Sinai and Shannon entropies via symbolic dynamics. From a practical viewpoint, generators provide a method to compute entropy via Equation (18). There are a number of theorems guaranteeing the existence of generators. Thus, Krieger's theorem states that if T is ergodic and invertible with h_µ(T) < ∞, then T has a generator [27]. Although the proof is nonconstructive, Smorodinsky [28] and Denker [29] provided methods to construct a generator for ergodic and aperiodic invertible maps. Denker's construction could even be extended by Grillenberger [30] to the nonergodic case. The existence of generators for non-invertible maps was proved by Kowalski under different assumptions [31,32]. However, as far as we know, the construction of such generators remains an open problem as of this writing.
The fact that historically entropy entered the theory of dynamical systems through the shift systems is no coincidence since, as we have just seen, these are dynamical models of stationary random processes. Thus, the p-Bernoulli shift models, in the sense just explained, an i.i.d. random process with probability distribution p. More generally, the (p, P)-Markov shift ({1, ..., k}^T, C, m, σ) is the support of a k-state, stationary Markovian process (X_t)_{t∈T} on a probability space (Ω, B, µ) with stationary probability distribution p and transition probability matrix P, i.e., µ(X_1 = i_1, ..., X_n = i_n) = p_{i_1} p_{i_1 i_2} ··· p_{i_{n−1} i_n}. As with the Bernoulli shifts, it can be shown that the time-zero partition {[i] : 1 ≤ i ≤ k}, where [i] = {(x_t)_{t∈T} : x_0 = i}, is a generator of both one- and two-sided Markov shifts, so

h_m(σ) = − ∑_{i,j=1}^k p_i p_ij ln p_ij.   (21)

Of course, Equation (19) is just a special case of Equation (21).
If γ is a generator of (Ω, B, µ, T), then the support of the symbolic dynamics X^γ is isomorphic to (Ω, B, µ, T).
Therefore (see (20) and (13)), there is a 1-to-1 relation between the points ω ∈ Ω and the realizations X^γ(ω) for µ-almost all ω. We conclude that the orbits of deterministic systems with known generators can be used to produce truly random sequences. By way of illustration, the orbits of the tent map Λ : ω ↦ 1 − |1 − 2ω| on the probability space ([0, 1), B, λ), where B and λ stand here for the Borel σ-algebra and the Lebesgue measure of [0, 1), respectively, generate through the itineraries with respect to the generator {[0, 1/2), [1/2, 1)} the realizations of a (1/2, 1/2)-Bernoulli process. If Λ is replaced by the shift map ω ↦ 2ω (mod 1), the itineraries deliver the binary expansions of the points ω, with the same conclusion. To conclude this section, let us mention that if Ω is a compact n-dimensional manifold and T a C^{1+δ} diffeomorphism preserving an absolutely continuous measure µ, then h_µ(T) is related to the (strictly positive) Lyapunov exponents χ_i (counting multiplicities) via the celebrated Pesin formula

h_µ(T) = ∫_Ω ∑_i χ_i^+(ω) dµ(ω),   (22)

where (·)^+ stands for positive part [33]. If T is ergodic, then the χ_i are constant µ-almost everywhere and the integration in (22) can be dropped. A special case of this theorem is of historical relevance, namely, Sinai's formula for the entropy of automorphisms of the n-dimensional torus [34]. Let Ω = R^n/Z^n and T_M : ω ↦ Mω (mod 1), where M is an n × n matrix with integer entries and determinant of absolute value one. Then the entropy of T_M with respect to the Lebesgue measure λ is given by the eigenvalues Λ_1, ..., Λ_n of M as

h_λ(T_M) = ∑_{|Λ_i|>1} ln |Λ_i|.

This result showed for the first time that not only random processes but also 'deterministic processes' can have a positive entropy. Pesin's formula was generalized to arbitrary ergodic Borel measures by Ledrappier and Young [35].
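Sinai's formula for the entropy of toral automorphisms is readily evaluated in concrete cases. The sketch below uses Arnold's cat map, given by the matrix M = [[2, 1], [1, 1]] (a standard example, not discussed above):

```python
import math

# Arnold's cat map M = [[2, 1], [1, 1]]: integer entries, det M = 1.
# Its eigenvalues are (3 +- sqrt(5)) / 2; only (3 + sqrt(5))/2 exceeds 1
# in modulus, so the entropy is the logarithm of that eigenvalue.
trace, det = 3.0, 1.0
disc = math.sqrt(trace * trace - 4.0 * det)
eigs = [(trace + disc) / 2.0, (trace - disc) / 2.0]
h = sum(math.log(abs(ev)) for ev in eigs if abs(ev) > 1.0)
print(h)   # ln((3 + sqrt 5)/2) ~ 0.9624
```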

Entropy in topological dynamical systems
Let Ω be a compact metric space and T : Ω → Ω a continuous transformation. We say then that (Ω, T) is a topological dynamical system. The notion of entropy in (Ω, T), or topological entropy, was introduced by Adler, Konheim, and McAndrew [36]. Their proposal follows the definition of metric entropy but using open covers instead of partitions. Specifically, if α = {A_1, ..., A_k} is now an open cover of Ω, then the topological entropy of T relative to α is given by

h_top(T, α) = lim_{n→∞} (1/n) ln N(α^(n)),

where, similarly to (16), α^(n) = α ∨ T^−1 α ∨ ... ∨ T^−(n−1) α is the least common refinement of the open covers α, T^−1 α, ..., T^−(n−1) α, and N(α^(n)) denotes the number of sets in a finite subcover of α^(n) with smallest cardinality.
Definition 11. The topological entropy of T is defined by

h_top(T) = sup_α h_top(T, α),

where α ranges over all open covers of Ω.
Later, Dinaburg [37] and Bowen [38] introduced a different definition for continuous maps on compact metric spaces (Ω, ρ) which is equivalent to Definition 11 (thus, independent of the metric ρ) and, moreover, can be extended to noncompact metric spaces. In this approach, one introduces a new metric ρ_k in Ω that takes into account the separation of points along their initial orbit segments of length k ≥ 1:

ρ_k(ω, ω′) = max_{0≤i≤k−1} ρ(T^i(ω), T^i(ω′)).

Denote by r_k(ε) the minimal number of ε-balls with respect to ρ_k that cover the whole space. Then

h_top(T) = lim_{ε→0} lim sup_{k→∞} (1/k) ln r_k(ε).

Thus, the topological entropy measures the exponential growth rate of the number of orbits distinguishable with finite precision ε. For other approaches, also when Ω is noncompact, see [25, Sect. 7.2]. For a survey of topological entropy in different settings of topological dynamics, see [39].
Topological entropy is an invariant of the equivalence between topological dynamical systems, a notion usually called conjugacy. Two topological dynamical systems (Ω, T) and (Ω′, T′) are said to be (topologically) conjugate if there exists a homeomorphism φ : Ω → Ω′ such that φ ∘ T = T′ ∘ φ. A prototype of a topological dynamical system (Ω, T) is the full shift over k symbols ({1, ..., k}^T, σ), where {1, ..., k} is equipped with the discrete topology and {1, ..., k}^T with the product topology (generated by the cylinder sets). Indeed, {1, ..., k}^T is then a compact, metrisable space, and the shift σ is a continuous transformation (bi-continuous, hence a homeomorphism, if T = Z). More interesting though are the subshifts, which are obtained from a full shift by excluding a finite or infinite set of 'forbidden' words from the state space. Therefore, subshifts are shift spaces with constrained sequences. Let S ⊂ {1, ..., k}^T be the set of allowed sequences; the set of finite words occurring in them is also called a language. The topological entropy of the subshift σ_S = σ|_S is given by

h_top(σ_S) = lim_{n→∞} (1/n) ln |A_n|,   (28)

where A_n is the set of allowed words of length n. If the list of forbidden words is finite, one speaks of a subshift of finite type (SFT). Two SFTs σ_S and σ_S′ are said to be almost conjugate if there is a third SFT σ_R and factor maps φ : R → S, φ′ : R → S′ that are 1-to-1 on an open dense set. Topological entropy is a complete invariant of almost topological conjugacy for aperiodic and irreducible SFTs [40]. A (topological) Markov chain (or 1-step SFT) is an SFT (S, σ_S) such that, given an allowed word x_t ... x_{t+n} and a letter x, the concatenation x_t ... x_{t+n} x is an allowed word if and only if x_{t+n} x is an allowed word. This being the case, a Markov chain can be ascribed a so-called transition matrix A = (a_ij)_{1≤i,j≤k}, a_ij ∈ {0, 1}, in such a way that S = {(x_t)_{t∈T} : a_{x_t x_{t+1}} = 1 for all t ∈ T}. A Markov chain is irreducible if and only if the matrix A is irreducible.
Alternatively, a Markov chain can be described by a graph G and its adjacency matrix A_G = A (up to reordering of the vertices). In either case, h_top(σ_A) = ln ρ_A, where ρ_A is the spectral radius of A (i.e., the largest absolute value of its eigenvalues). In particular, h_top(σ) = ln k for the one- and two-sided full shifts. Note for further reference that ln k is also the metric entropy of the (1/k, ..., 1/k)-Bernoulli shift. A factor of a Markov chain is called a sofic shift [41]. Subshifts were considered by Shannon in the context of constrained data sources [1]. Indeed, due to technological feasibility or convenience, it is sometimes necessary to encode messages as sequences that satisfy certain constraints. For example, to ensure proper synchronization in magnetic or optical recording it might be necessary to limit the length of runs of 0's between two 1's when reading and recording bits [18]. Specifically, if W_n is the set of allowed words of length n that can be transmitted over a noiseless channel, then

C = lim_{n→∞} (1/n) ln |W_n|   (29)

is the capacity of the 'constrained channel'. By definition of noiseless channel, C is an upper bound for the entropy of any data source transmitting over the channel. Comparison of (29) with (28) shows that C is the topological entropy of the subshift σ_S with language S = ∪_{n≥1} W_n. Shannon proved [1] that if the data source is an irreducible Markovian random process (of order 1), then there is an assignment of transition probabilities which maximizes the entropy. In other words, if (S, σ_S) is an irreducible Markov chain, there exists a σ_S-invariant measure m* on (S, C) such that the metric entropy of (S, C, m*, σ_S) equals the topological entropy of (S, σ_S). m* is called the Parry measure (its uniqueness was proved in [42]); (S, C, m*, σ_S) is the support of the capacity-achieving Markovian source.
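The formula h_top(σ_A) = ln ρ_A can be cross-checked against the word-counting expression (28) for a concrete Markov chain. The sketch below uses the golden-mean shift, i.e., binary sequences with the word '11' forbidden (a standard example, not discussed above), whose spectral radius is the golden ratio:

```python
import math

# Transition matrix of the golden-mean shift (the word '11' is forbidden).
A = [[1, 1], [1, 0]]

def mat_mult(X, Y):
    """Product of two 2 x 2 integer matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# The number of allowed words of length n equals the sum of the entries
# of A^(n-1); estimate h as (1/n) ln(word count) for a moderately large n.
An = A
for _ in range(30):
    An = mat_mult(An, A)              # An = A^31
n = 32
count = sum(sum(row) for row in An)   # allowed words of length 32
print(math.log(count) / n)                # close to ln rho_A
print(math.log((1 + math.sqrt(5)) / 2))   # ln rho_A ~ 0.4812
```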
The study of constrained languages evolved over time to symbolic dynamics, nowadays a stand-alone branch of mathematics with important applications to areas other than dynamical systems and information theory, such as formal languages, computer science, and graph theory [43].
According to the Krylov-Bogolyubov theorem, any topological dynamical system has at least one Borel probability T-invariant measure [44]. The proof resorts to standard results of functional analysis. Set

µ_n(ω) := (1/n) ∑_{i=0}^{n−1} δ_{T^i(ω)},

where δ_{T^i(ω)} is the unit point mass at T^i(ω). Then every µ_n(ω) is a Borel probability measure on Ω, and any weak*-limit of a subsequence of (µ_n(ω))_{n∈N_0} is a T-invariant Borel probability measure on Ω, supported on the closure of the orbit (T^i(ω))_{i∈N_0}. Note that distinct periodic orbits (if any) contribute distinct invariant measures. The existence of invariant measures in any topological dynamical system raises the question about the relationship between topological and metric entropy. The answer is given by the following variational principle.
Theorem 12 ([45]). Let Ω be a compact metric space and T : Ω → Ω a continuous map. Then h_top(T) = sup_µ h_µ(T), where µ ranges over all Borel probability T-invariant measures.
The above examples of the (1/k, ..., 1/k)-Bernoulli measure and the Parry measure show that the supremum in the variational principle can be attained. A T-invariant measure µ* such that h_{µ*}(T) = h_top(T) is called a measure of maximal entropy. If, moreover, µ* is unique (as in both examples just mentioned), then µ* is the natural choice to characterize the dynamics.
Both the topological entropy and the variational principle are conveniently generalized in the so-called thermodynamic formalism [46]. One of the main characters in this approach is the topological pressure [47]. Let C(Ω, R) be the space of real-valued continuous functions on a compact metric space Ω. The pressure of µ, a T-invariant Borel probability measure, is the map P_µ(T, ·) : C(Ω, R) → R given by P_µ(T, ϕ) = h_µ(T) + ∫_Ω ϕ dµ, and the topological pressure of the 'potential' ϕ is P(T, ϕ) = sup_µ P_µ(T, ϕ). An invariant measure µ* such that P_{µ*}(T, ϕ) = P(T, ϕ) is called an equilibrium state for ϕ. Since P(T, 0) = h_top(T), a measure of maximal entropy is then an equilibrium state for the potential ϕ = 0. Beside Markov chains, there are few systems for which exact formulas or good algorithms to compute the topological entropy are known. Among the exceptions are the multimodal maps, which are relevant in the study of low-dimensional chaos [48]. See also [49,50] for recursive algorithms based on the expression h_top(f) = lim_{n→∞} (1/n) ln ℓ_n, where ℓ_n is the number of maximal monotonicity intervals ('lap number') of f^n [51].
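As a sketch of how the lap-number expression can be exploited in practice (a grid-based approximation in Python/NumPy, not one of the recursive algorithms of [49,50]; the full logistic map, with h_top = ln 2, serves as test case):

```python
import numpy as np

def lap_number(f, n, samples=200001):
    """Number of maximal monotonicity intervals of the n-th iterate f^n,
    estimated by counting sign changes of its increments on a fine grid."""
    x = np.linspace(0.0, 1.0, samples)
    y = x.copy()
    for _ in range(n):
        y = f(y)
    s = np.sign(np.diff(y))
    s = s[s != 0]                        # drop flat steps
    return 1 + int(np.sum(s[1:] != s[:-1]))

f = lambda x: 4.0 * x * (1.0 - x)        # full logistic map
n = 10
h_est = np.log(lap_number(f, n)) / n     # (1/n) ln l_n -> h_top as n grows
```

For the full logistic map ℓ_n = 2^n, so the estimate converges to ln 2; the grid must be fine enough to resolve every lap of f^n.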
Topological entropy was extended to non-autonomous dynamical systems in [52]. In the case of switching systems, a different approach was used in [53,54]. Other extensions, this time to set-valued mappings, can be found in [55].

Entropy of group actions
Let (Ω, B, µ) be a Lebesgue probability space. The set M_µ(Ω) of µ-preserving transformations T : Ω → Ω is a semigroup under composition; its invertible elements form a group.
A measure-preserving action of a group (or semigroup) G on (Ω, B, µ) is a homomorphism ψ : G → M_µ(Ω), written g ↦ T_g. If G is a group with identity element e, then ψ(e) is the identity transformation, and T_{g^{-1}} = T_g^{-1}. Sometimes one says that T_g is a (measure-preserving) action of G on Ω.
A Z or N_0 action on (Ω, B, µ) amounts to an invertible or non-invertible measure-preserving dynamical system (Ω, B, µ, T), respectively, where T generates the action, i.e., ψ(n) = T^n. For another example, let Ω be a compact abelian group endowed with its Borel σ-algebra B and Haar measure µ. Furthermore, let G be a subgroup of Ω, and let the transformation T_g : Ω → Ω be given by T_g(ω) = gω, g ∈ G. Then g ↦ T_g is a measure-preserving G action on (Ω, B, µ).
The perhaps simplest example of a measure-preserving action of an arbitrary group G is a Bernoulli action [56]. Let K be a finite set with the discrete topology, and Ω = K^G := {ω : G → K} with the product topology, so Ω is compact. Define a right action R_g ω(h) = ω(hg) and a left action L_g ω(h) = ω(g^{-1}h) of G on Ω. To construct an invariant measure for these two actions, endow K with the discrete σ-algebra D and a probability mass function p, and endow Ω with the product measure, taking for each coordinate the probability space (K, D, p). For G = Z one obtains the standard (two-sided) Bernoulli shift on |K| symbols, with σ^n = R_n = L_{-n}, n ∈ Z. A similar construction can be made for a semigroup, where now only the right action R_g is defined. For G = N_0 one obtains the one-sided Bernoulli shift.
R-actions are also called flows. A prominent example of an R action is the Bernoulli flow, i.e., a representation R ∋ x → T x ∈ M µ (Ω), such that T x is isomorphic to a Bernoulli shift [57].
The definition of metric entropy for K 1 × ... × K d actions is a straightforward generalization of the definition of metric entropy. Indeed, let α be a finite partition of Ω, and set Then, is the d-dimensional entropy of the measure-preserving K 1 × ... × K d action on (Ω, B, µ). For d = 1 we recover the definition of the Kolmogorov-Sinai entropy.
The extension of metric entropy from Z to more general groups has been done in two major steps. In a first step, Ornstein and Weiss [58] extended the necessary theoretical framework and fundamental results (such as Kolmogorov's theorem for Bernoulli shifts, and Ornstein's isomorphism theorem) to the amenable groups. These are groups to which Riesz's proof of the von Neumann ergodic theorem carries over. Incidentally, the amenable groups were introduced and studied by von Neumann in connection with the Banach-Tarski paradox. They are precisely those locally compact groups for which no paradoxical decomposition exists. For the purposes of ergodic theory, a group G is amenable if it has a sequence of finite sets F_n, called a Følner sequence, such that lim_{n→∞} |gF_n ∆ F_n|/|F_n| = 0 for any g ∈ G. Følner sequences are actually the handle that makes amenable groups amenable to the concepts and tools of ergodic theory, in particular to entropy [59]. It can be shown that abelian (e.g., Z^d), nilpotent and solvable groups are amenable.
In a second step, Lewis Bowen proved that an entropy theory can be also developed for the much larger class of sofic groups, introduced by Gromov [60,61]. The interested reader is referred to the excellent review [56]. Likewise, topological dynamical systems can be also generalized to continuous actions of groups as follows. Recall that a topological space is called σ-compact if it is a countable union of compact sets. Given a σ-compact space Ω, the set of continuous transformations T : Ω → Ω is a group (if T is invertible) or a semigroup (otherwise) under composition. We focus on the group of homeomorphisms.

Definition 14.
[62, Definition 8.1] Let G be a σ-compact metric group and Ω a σ-compact metric space. A continuous G action on Ω is a homomorphism from G to the group of homeomorphisms of Ω, g ↦ T_g, such that the map (g, ω) ↦ T_g(ω) from G × Ω to Ω is continuous.
It is therefore very satisfactory that the topological entropy of the action of an amenable group G on a compact metric space Ω and the entropies of the same action viewed as a measure-preserving one on Ω are related by a variational principle. The same happens with sofic groups. See [56] for further details and references.

Entropy is a very natural concept
It is not surprising that, owing to the paramount role of entropy in information theory, Shannon pondered over its uniqueness from the very beginning. Indeed, he proved in Appendix 2 of his foundational paper [1] that the only function H of a probability mass distribution {p_1, ..., p_n} satisfying just three properties (see below) is of the form H(p_1, ..., p_n) = -K Σ_{i=1}^n p_i log p_i, where K is a positive constant.

Entropy from an axiomatic viewpoint
There exist various axiomatic characterizations of entropy showing that it is a very natural construct, in particular under its interpretation as a measure of information or uncertainty. We list here only a few of them for the Shannon, Rényi and Tsallis entropies.
If n ∈ N, set P_n = {(p_1, p_2, ..., p_n) ∈ [0, 1]^n : Σ_{i=1}^n p_i = 1}, and let P = ∪_{n∈N} P_n.
(P8) a-RECURSIVITY: For each (p_1, p_2, ..., p_n) ∈ P_n, n > 2, and a > 0, H_n(p_1, p_2, ..., p_n) = H_{n-1}(p_1 + p_2, p_3, ..., p_n) + (p_1 + p_2)^a H_2(p_1/(p_1 + p_2), p_2/(p_1 + p_2)).
Bearing in mind that H shall measure the information or uncertainty contained in a probability distribution (p_1, p_2, ..., p_n), properties (P1)-(P8) are readily interpretable. In particular, if p_1, p_2, ..., p_m and q_1, q_2, ..., q_n are the probabilities of the outcomes i and j of two given finitely-valued random variables X and Y, respectively, then property (P6) for a = 1 expresses that, in case X and Y are stochastically independent, the information contained in X and Y adds up to the information contained in the random vector (X, Y). For a = 1, property (P6) is simply called additivity.
Moreover, (P7) for a = 1 can be interpreted as follows: If p ij , i = 1, 2, . . . , m, j = 1, 2, . . . , n, are the joint probabilities of the outcomes i and j of two random variables X and Y , respectively, then the information contained in the random vector (X, Y ) is the sum of the information contained in X and the mean information of Y given the outcome of X, that is, H(X, Y ) = H(X) + H(Y |X) in the notation of Section 2.2 (see Equation (6)).
(iv) The following statements are equivalent: Of course, f, f_q, g_q correspond to the Shannon, Rényi and Tsallis entropies, respectively. Recall that the Rényi and Tsallis entropies for q = 1 are defined as the Shannon entropy, since f_q and g_q converge to f as q → 1. Furthermore, for fixed q the Rényi entropy f_q and the Tsallis entropy g_q are monotonically related as follows: g_q = (e^{(1-q) f_q} - 1)/(1 - q). The equivalences of (a) and (b) and of (a) and (c) in Theorem 16(iv), both characterizing the Shannon entropy f, are well-known results of Khinchin [63] and Faddeev [64], respectively. Here, the properties (P7) and (P8), respectively, with a = 1 are the substantial ones; in their general form they play a central role in characterizing the Tsallis entropy g_q. Shannon already characterized his entropy in [1], by the properties (P1), (P3) and a third property essentially amounting to strong 1-additivity (additivity 'for successive choices'). Probability functionals like the Rényi and Tsallis entropies that satisfy the properties of continuity (P1), maximality (P4) and expansibility (P5) are called generalized entropies.
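The limit q → 1 and the monotone relation between f_q and g_q are easy to check numerically; a minimal sketch in plain Python (the distribution p is an arbitrary choice for illustration):

```python
import math

def shannon(p):
    """Shannon entropy f(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi(p, q):
    """Renyi entropy f_q(p); defined as Shannon entropy for q = 1."""
    if q == 1:
        return shannon(p)
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1 - q)

def tsallis(p, q):
    """Tsallis entropy g_q(p); defined as Shannon entropy for q = 1."""
    if q == 1:
        return shannon(p)
    return (1 - sum(pi ** q for pi in p if pi > 0)) / (q - 1)

p = [0.5, 0.25, 0.125, 0.125]
# As q -> 1, both renyi(p, q) and tsallis(p, q) approach shannon(p),
# and for fixed q: tsallis = (exp((1 - q) * renyi) - 1) / (1 - q).
```

The monotone relation shows in particular that, for fixed q, Rényi and Tsallis entropies order probability distributions in the same way.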
The characterization of the Tsallis entropy g q based on (P8) in Theorem 16(v) was given by Furuichi [65]. A statement generalizing the equivalence of (a) and (b) in Theorem 16(v), using mainly (P7), was provided by Suyari [66] with a correction by Ilić et al. [67]. This generalization considers the whole family of Tsallis entropies continuously depending on q and includes the family of Havrda-Charvát entropies [3]. Other characterizations of the Tsallis entropy are due to Abe [68], who uses a kind of conditional property, and to dos Santos [69], who mainly refers to q-additivity.
Note that the statements listed in (i), (ii) and (iii) of Theorem 16 not covered by (iv) or (v) are well known and can easily be shown, possibly with the exception of the statement in (ii) that the Rényi entropy f_q satisfies neither (P7) nor (P8) for any a > 0. To show this, one can argue as follows. Given q ≠ 1 and a > 0, suppose that f_q satisfies (P7) or (P8) for this a; combined with the additivity of f_q, this implies a = 1; but H = f_q satisfies neither (P7) nor (P8) for a = 1.
The Rényi entropy has been characterized in different ways; however, these characterizations are not as simple as those for the Tsallis entropy. The first characterization, given by Rényi [2], mainly uses additivity and a mean property with a special q-dependent weighting, but it is based on extending the system of distributions given by the set P to all incomplete distributions. Solving an open question posed by Rényi, Aczél and Daróczy [70] gave a first characterization without considering incomplete distributions. Of further characterizations, we only mention here one due to Jizba and Arimitsu [71].
To conclude, let us underline the difficulty of finding the most general map H : P → [0, ∞) satisfying a given set of properties. To get a taste, the interested reader is referred to [5,72].

Characterizations of the Kolmogorov-Sinai entropy
In the same way as various Shannon entropy-like concepts have been considered to quantify the diversity of probability distributions in different situations, one can try to generalize the concept of Kolmogorov-Sinai entropy by starting from entropies other than Shannon's. The results of the last fifteen years show that this line of work does not lead to anything new, at least for automorphisms.
In particular, attempts were made in [73,74,7] to quantify the complexity of dynamical systems with the help of the Rényi entropy. Takens and Verbitskiy [75] have discussed the Rényi analogue of the Kolmogorov-Sinai entropy for variable q > 0. Given a measure-preserving dynamical system (Ω, B, µ, T) and a finite partition α = {A_1, ..., A_n} of Ω, define the Rényi entropy of α as H_q(α) = (1/(1-q)) ln Σ_{i=1}^n µ(A_i)^q for q ≠ 1, the Rényi entropy of T with respect to α as h^q_µ(T, α) = liminf_{n→∞} (1/n) H_q(α^(n)) (see Equation (16)), and, finally, the Rényi entropy of T as h^q_µ(T) = sup_α h^q_µ(T, α), (34) where the supremum is taken over all finite partitions α of Ω. The surprising fact is that the new quantity (34) is not sensible in the case q < 1 and coincides with the Kolmogorov-Sinai entropy for q ≥ 1.

Theorem 17 ([75]). For each ergodic automorphism T of a Lebesgue probability space (Ω, B, µ) with h_µ(T) > 0, h^q_µ(T) = h_µ(T) for q ≥ 1, while h^q_µ(T) = ∞ for 0 < q < 1.
Recall that a µ-preserving map T of a probability space (Ω, B, µ) is called an automorphism if it is bijective and T^{-1} is also µ-preserving.
Furthermore, Theorem 17 remains true if T is not an automorphism or h µ (T ) = 0 [75]. On the other hand, as shown in [76], h q µ (T ) can be strictly smaller than h µ (T ) when ergodicity fails. Statements similar to Theorem 17 for a Tsallis variant of the Kolmogorov-Sinai entropy were given by Mesón and Vericat [77].
Then the g-entropy of T, h_µ(g, T), is defined like the Kolmogorov-Sinai entropy, with the function η(x) := -x ln x in the entropy of a partition replaced by g. It turns out that g-entropies with the same behavior near 0 as η provide the same information on the dynamical system (Ω, B, µ, T):
Theorem 18. If 0 < lim_{x→0+} g(x)/η(x) < ∞, then h_µ(g, T) = lim_{x→0+} (g(x)/η(x)) · h_µ(T), where h_µ(T) is the Kolmogorov-Sinai entropy of T.
Additional information on the relation between the Kolmogorov-Sinai entropy and the g-entropies can be found in the paper of Falniowski. The proof of Theorem 18 uses a very general characterization of the Shannon entropy of finitely-valued stochastic processes given by Ornstein and Weiss [79].
A map J from the class of finitely-valued ergodic stochastic processes (X t ) t∈Z into a metric space (M, ρ) is said to be finitely observable if there exist functions S n : R n → M, n ∈ N such that lim n→∞ S n (X 1 (ω), X 2 (ω), . . . , X n (ω)) = J((X t ) t∈Z ) for µ-a.a. ω ∈ Ω, where (X t ) t∈Z is defined on a probability space (Ω, B, µ). J is called an invariant if it takes the same values for finitely-valued stochastic processes with isomorphic sequence space models (see Definition 9).

Theorem 19. [79] Every finitely observable invariant for the class of all finitely-valued ergodic processes is a continuous function of the Kolmogorov-Sinai entropy.
A consequence of this statement is, roughly speaking, that the only useful complexity measures for dynamical systems are continuous functions of the Kolmogorov-Sinai entropy.

Other entropies
Of the various entropies and entropy-like quantities proposed in the literature, the selection below reflects both their applications and their mathematical interest.

Approximate and sample entropy
Approximate entropy and sample entropy are widely used quantities for measuring the complexity of (finite) time series (x_t)_{t=1}^N and of the underlying dynamical systems. They quantify the change in the relative frequencies of length-k time-delay vectors with increasing k. Here, for simplicity, we assume delay time τ = 1, but searching for delay times that optimally resolve the dynamical structure is an important task in nonlinear data analysis; see [80, Sect. 9.2] for several methods.
We first recall the concepts of approximate entropy and sample entropy from [8,81] and then discuss where they come from (see [82] for a more extended discussion).
Definition 20. Given a time series (x_t)_{t=1}^N, and a tolerance ǫ > 0 for accepting time-delay vectors of length k ∈ N as similar, the approximate entropy is defined as
ApEn(k, ǫ, (x_t)_{t=1}^N) = Φ_k(ǫ) - Φ_{k+1}(ǫ), (36)
where
Φ_k(ǫ) = (1/(N - k + 1)) Σ_{i=1}^{N-k+1} ln C(i, k, ǫ, (x_t)_{t=1}^N) (37)
and
C(i, k, ǫ, (x_t)_{t=1}^N) = #{j : 1 ≤ j ≤ N - k + 1, ρ_k((x_t)_{t=i}^{i+k-1}, (x_t)_{t=j}^{j+k-1}) ≤ ǫ}/(N - k + 1), (38)
and the sample entropy is defined as
SampEn(k, ǫ, (x_t)_{t=1}^N) = -ln (C(k + 1, ǫ, (x_t)_{t=1}^N)/C(k, ǫ, (x_t)_{t=1}^N)), (39)
where
C(k, ǫ, (x_t)_{t=1}^N) = #{(i, j) : 1 ≤ i, j ≤ N - k + 1, ρ_k((x_t)_{t=i}^{i+k-1}, (x_t)_{t=j}^{j+k-1}) ≤ ǫ}/(N - k + 1)^2. (40)
Thus, with respect to the maximum norm ρ_k, C(i, k, ǫ, (x_t)_{t=1}^N) is the relative frequency of the vectors (x_t)_{t=j}^{j+k-1}, 1 ≤ j ≤ N - k + 1, which are within a distance ǫ from the vector (x_t)_{t=i}^{i+k-1}, and C(k, ǫ, (x_t)_{t=1}^N) is the relative frequency of pairs of vectors (x_t)_{t=i}^{i+k-1}, (x_t)_{t=j}^{j+k-1}, 1 ≤ i, j ≤ N - k + 1, within a distance ǫ of each other. At a first glance, the quantities considered do not look much like entropies. In the following, we justify (i) the approximate entropy as an estimate of the Kolmogorov-Sinai entropy, referring to a discussion in [7] based on ideas from [83,84], and (ii) the sample entropy as an estimate of the H_2(T)-entropy, introduced, according to [85], in [12].
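A direct transcription of Definition 20 may look as follows (a Python/NumPy sketch; distances are taken in the maximum norm, the pairwise comparisons are quadratic in the series length, and, following the common convention, self-matches are included in ApEn but excluded in SampEn):

```python
import numpy as np

def _embed(x, k):
    """All time-delay vectors of length k (delay 1) from the series x."""
    return np.array([x[i:i + k] for i in range(len(x) - k + 1)])

def apen(x, k, eps):
    """Approximate entropy ApEn(k, eps) = Phi_k - Phi_{k+1}, with Phi_k the
    mean log relative frequency of eps-close length-k delay vectors
    (maximum norm, self-matches included)."""
    def phi(m):
        v = _embed(x, m)
        d = np.max(np.abs(v[:, None, :] - v[None, :, :]), axis=2)
        c = np.mean(d <= eps, axis=1)          # includes the self-match
        return np.mean(np.log(c))
    return phi(k) - phi(k + 1)

def sampen(x, k, eps):
    """Sample entropy SampEn(k, eps) = -ln(A/B), where B (resp. A) counts
    the eps-close pairs of delay vectors of length k (resp. k + 1),
    self-matches excluded."""
    def count(m):
        v = _embed(x, m)
        d = np.max(np.abs(v[:, None, :] - v[None, :, :]), axis=2)
        return (np.sum(d <= eps) - len(v)) / 2  # exclude self-matches
    return -np.log(count(k + 1) / count(k))
```

Note that including the self-match in ApEn guarantees C(i, k, ǫ, ·) > 0, so the logarithm is always defined; this is precisely the source of the well-known bias of approximate entropy that sample entropy was designed to avoid.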
The correlation integral, which is necessary to understand what is really behind approximate and sample entropies, was originally defined in [83]. Given a probability space, one can interpret the correlation integral as the probability of two independent µ-distributed points being closer than a distance ǫ; in the present setting, C(k, ǫ) := ∫_Ω µ(B_k(ω, ǫ)) dµ(ω), where B_k(ω, ǫ) is the ball of radius ǫ around ω in the metric ρ_k.
Theorem 22. Let (Ω, B, µ, T) be a measure-preserving dynamical system, where Ω is a compact metric space, µ is a non-atomic Borel probability measure, and T is a continuous map. Recall that a measure µ is called non-atomic if for any measurable set A with µ(A) > 0 there exists a measurable set B ⊂ A such that µ(A) > µ(B) > 0.
Further, the following statement can be shown (see [82]).
In general it is not known whether the limit (44) exists.
In this setting, assume that x t = X(T t (ω)), where X : Ω → R is a Borel-measurable function, i.e., the time series (x t ) N t=1 is obtained from the dynamical system (Ω, B, µ, T ) via an observable X. Here B is the Borel σ-algebra of Ω, and µ is a Borel measure. Then, for an ergodic T the term (38) is a natural estimate of µ(B k (T i (ω), ǫ)). This being the case, comparison of the definition (36) of approximate entropy with the right hand side of (45) for finite k, n and ǫ, leads to the conclusion that ApEn(k, ǫ, (x t ) N t=1 ) approximates the Kolmogorov-Sinai entropy under the conditions of Theorem 23.
The switch from the metric ρ of the state space Ω to the maximum norm metric ρ k of the time delay vectors (x i , x i+1 , ..., x i+k−1 ), Equation (26), is justified by Takens' theorem in the case of dynamical systems with certain natural differentiability assumptions. This theorem guarantees that generically the space of delay vectors 'reconstructs' the dynamical system with equivalence of the original metric and the maximum norm metric (see [85], in particular Chapter 6, for more details).
Next we consider the relationship between the H 2 (T )-entropy (see below) and its estimation by means of the correlation integral.
Definition 24. Let (Ω, B, µ, T) be a measure-preserving dynamical system, where Ω is a compact metric space, µ is a Borel probability measure, and T is a continuous map. The H_2(T)-entropy and the correlation entropy CE(T, 1) are then defined by
H_2(T) = lim_{k→∞} lim_{ǫ→0} -(1/k) ln ∫_Ω µ(B_k(ω, ǫ)) dµ(ω) (46)
and
CE(T, 1) = lim_{k→∞} lim_{ǫ→0} -(1/k) ∫_Ω ln µ(B_k(ω, ǫ)) dµ(ω). (47)
Note that the limit for ǫ → 0 in (47) exists due to the monotonicity properties of -(1/k) ∫_Ω ln µ(B_k(ω, ǫ)) dµ(ω); see [88, Lemma 2.1] for details. The limit for k → ∞ in (47) exists by Lemma 2.14 from [88].
The H 2 (T )-entropy was introduced in [12] (see also [85]). Inspired by [7,74,12], where the Rényi approach was used, Takens and Verbitskiy [75] introduced a correlation entropy CE(T, q) of order q. Here we will only consider the case q = 1. For an interesting study of properties of the correlation entropies CE(T, q), we refer also to Verbitskiy's dissertation [88].

Theorem 25. Under the conditions of Theorem 22 it holds H_2(T) ≤ CE(T, 1). (48)
Note that the inequality in (48) follows directly from Jensen's inequality.
Equation (40) turns out to be an appropriate estimate of the correlation integral from a time series (x i ) i∈N = (X(T i (ω))) i∈N , again under consideration of Takens' reconstruction theorem (see [12,84,85]). In [89] it was shown that for almost all ω ∈ Ω, lim N →∞ C(k, ǫ, (X(T t (ω))) N t=1 ) = C(k, ǫ). We refer to [85,90] for a further discussion on the estimation of the correlation integral.

Directional entropy
Directional entropy was introduced by Milnor in [9] for cellular automaton maps. In its simplest formulation, a one-dimensional cellular automaton map is a transformation F : S^Z → S^Z, where S is a finite set of symbols, such that (F(s))_i, i ∈ Z, depends only locally on the components of s = (s_i)_{i∈Z}. Specifically, there is an r ≥ 1, called the radius of F, and a 'local rule' f : S^{2r+1} → S such that
(F(s))_i = f(s_{i-r}, ..., s_{i+r}). (50)
The transformation F implements the state update of the cellular automaton (CA) (S^Z, F), so one writes s^t = F^t(s). Let us mention in passing that a map between two shift spaces of the form (50) is called a block map. They are characterized by being continuous and shift commuting [91].
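A minimal sketch of a one-dimensional CA update of the form (50) (Python; periodic boundary conditions and the elementary rule 110, in Wolfram's numbering, are our choices for illustration):

```python
import numpy as np

def ca_step(s, rule_table, r):
    """One update of a 1-D cellular automaton on a cyclic configuration s:
    (F(s))_i = f(s_{i-r}, ..., s_{i+r}), with the local rule f given as a
    lookup table over windows of length 2r + 1."""
    n = len(s)
    out = np.empty(n, dtype=int)
    for i in range(n):
        window = tuple(int(s[(i + j) % n]) for j in range(-r, r + 1))
        out[i] = rule_table[window]
    return out

# Elementary CA rule 110 (S = {0, 1}, radius r = 1): each of the 8 windows
# (s_{i-1}, s_i, s_{i+1}) is mapped to the corresponding bit of 110.
rule110 = {tuple(int(b) for b in format(w, "03b")): (110 >> w) & 1
           for w in range(8)}
s = np.zeros(16, dtype=int)
s[8] = 1                       # a single seed cell
s1 = ca_step(s, rule110, 1)    # active cells spread to positions 7 and 8
```

Iterating `ca_step` produces the space-time diagram whose complexity in a given direction is what Milnor's directional entropy quantifies.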
This situation generalizes readily to d-dimensional CA; here F : S^{Z^d} → S^{Z^d}. Cellular automata can be considered as toy models of spatially extended systems (d-dimensional space being modelled by the lattice Z^d) because the time evolution of each component s_i, i ∈ Z^d (or 'cell'), depends on the states of the neighboring components. Roughly speaking, Milnor's topological and measure-theoretic directional entropies measure the complexity of such a spatiotemporal dynamics in the direction of a vector v ∈ R^d.
Following [92], let V be a k-dimensional vector subspace of R^d, 1 ≤ k < d, and let Q ⊂ V be a unit cube in V built on an orthonormal basis of V. Likewise, let Q′ ⊂ V^⊥ be such a unit cube in the orthogonal subspace of V, centered at the origin. The set W_{a,b} := (aQ + bQ′) ∩ Z^d, a, b ∈ R, is called a window. Furthermore, let T be a measure-preserving Z^d action on a Lebesgue probability space (Ω, B, µ), and α a finite partition of Ω. The directional entropy is then defined through limits over the windows W_{a,b} (Equations (51)-(53)). The convergence of the limits in (51) and (52) was proved by Milnor in [9].
The directional entropy (53) does not depend on the orthonormal bases used to define the windows W_{a,b}. If k = d (i.e., V = R^d), then the windows have the form W_a := aQ, and h_{µ,d}(T, R^d) = h_{µ,d}(T), Equation (32). As for the relation between different directional entropies of the same action, the following is an interesting result.
For the directional entropy, including h_{µ,d}(T), a result similar to Theorem 8 holds.
The interested reader is referred to [93,94,95,96] for further insights into directional entropy.

Permutation entropy
Let I be a closed interval of R endowed with the Borel σ-algebra B. A map T : I → I is said to be piecewise, strictly monotone if there is a finite partition of I into intervals such that, on each of those intervals, T is continuous and strictly monotone. Furthermore, let r : {0, 1, ..., L − 1} → {0, 1, ..., L − 1} be the permutation sending i to r i , 0 ≤ i ≤ L − 1. For notational convenience, we will write r = (r 0 , r 1 , ..., r L−1 ).
Definition 28. We say that x ∈ I has ordinal L-pattern r, or that x is of type r, if
T^{r_0}(x) < T^{r_1}(x) < ... < T^{r_{L-1}}(x). (55)
We refer to L as the length of the ordinal pattern r. Other conventions, such as reversing the order relations in (55), can also be found in the literature.
Let us highlight at this point that the permutations of L elements (or ordinal L-patterns, for that matter) build the symmetric group of degree L, denoted by S_L, when endowed with the 'product' (actually, composition of permutations)
r • s = (r_0, r_1, ..., r_{L-1}) • (s_0, s_1, ..., s_{L-1}) = (s_{r_0}, s_{r_1}, ..., s_{r_{L-1}}), (56)
for all r, s ∈ S_L, the unity being the identity permutation (0, 1, ..., L-1). S_L is non-commutative for L ≥ 3, and its cardinality is L!. The algebraic structure of ordinal patterns will be exploited in the next section.
Let P_r be the set of x ∈ I of type r ∈ S_L, i.e., points x satisfying (55), and
π_L := {P_r : r ∈ S_L, P_r ≠ ∅}. (57)
That is, π_L is a partition of I into subsets whose points have the same ordinal L-pattern, possibly except for a set of Lebesgue measure 0 (the set of periodic points with period less than L). Alternatively, one can agree on some convention to rank T^{r_i}(x) and T^{r_j}(x), 0 ≤ r_i, r_j ≤ L-1, whenever T^{r_i}(x) = T^{r_j}(x). π_L is called an ordinal partition. Note that π_{L+1} is a refinement of π_L.
Definition 29. Let T : I → I be a piecewise, strictly monotone map, and µ be a T-invariant measure on (I, B). Then, the metric permutation entropy of T is defined as
h*_µ(T) = lim_{L→∞} (1/L) H_µ(π_L), (58)
where H_µ(π_L) = -Σ_{P∈π_L} µ(P) ln µ(P). Furthermore, the topological permutation entropy of T is defined as
h*_0(T) = lim_{L→∞} (1/L) ln |π_L|, (59)
where |π_L| is the number of elements of π_L. Note that the definitions (58) and (59) are simpler than their standard counterparts (17) and (25), respectively, since no supremum must be taken. For this reason the following result (also showing the convergence of (58) and (59)) is remarkable.
As for the topological permutation entropy of order L, it amounts to counting ordinal L-patterns. Application examples to time series analysis can be found in [98]. Moreover, Theorem 30(ii), i.e., |π_L| ∝ e^{h_top(T) L}, along with |S_L| = L! ∝ e^{L ln L}, has an interesting offshoot. Namely, the existence of forbidden ordinal patterns, that is, ordinal patterns that cannot be realized by any orbit of the system. For example, the ordinal pattern (2, 1, 0) cannot be realized by the logistic map f(x) = 4x(1 - x), 0 ≤ x ≤ 1, i.e., there is no initial condition x ∈ [0, 1] such that f^2(x) < f(x) < x [99,100]. It is trivial that if L_min is the minimum length of the forbidden patterns, then there are forbidden L-patterns for all L ≥ L_min (in fact, a superexponentially growing number of them). This result was used in [101,19,102] to discriminate deterministic noisy signals from white noise. We come back to this application in Sect. 5.4.
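The forbidden pattern (2, 1, 0) of the logistic map can be observed empirically by collecting the ordinal 3-patterns realized along a long orbit (a plain-Python sketch; the seed and orbit length are arbitrary choices):

```python
import random

def ordinal_pattern(values):
    """Ordinal L-pattern r = (r_0, ..., r_{L-1}) of L consecutive values:
    the time indices listed in increasing order of the values."""
    return tuple(sorted(range(len(values)), key=lambda i: values[i]))

def logistic(x):
    return 4.0 * x * (1.0 - x)

random.seed(1)
x = random.random()
seen = set()
for _ in range(100000):
    fx = logistic(x)
    seen.add(ordinal_pattern([x, fx, logistic(fx)]))
    x = fx

# (2, 1, 0) encodes f^2(x) < f(x) < x and never occurs: it is forbidden;
# the other 5 ordinal 3-patterns of S_3 are all realized.
```

In contrast, a white-noise sequence realizes all 3! = 6 patterns with positive probability, which is the basis of the determinism tests of [101,19,102].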
The computation of permutation entropies of finite orders benefits from the practical advantages of using ordinal patterns. Thus, h*_µ(T, L) and h*_0(T, L) are computationally fast and relatively robust against observational noise. Actually, ordinal patterns of deterministic signals are more robust than those of random signals due to a mechanism called dynamic robustness [102]. Furthermore, the calculation of ordinal patterns does not require prior knowledge of the data range, which allows real-time data processing and analysis.
Metric permutation entropy was extended to general dynamical systems along two different lines [103]. One of them follows the standard, partition-based approach to metric entropy and turns out to be equivalent to it [104]. The other generalization considers real-valued 'observables' on dynamical systems, and ordinal patterns defined on their domains. It can be shown that, under some separation conditions on the maps, this approach also leads to the metric entropy [105,106,107].
An open problem is the extension of topological permutation entropy to more general settings, as successfully done with the metric version. Misiurewicz proved in [108] that Theorem 30(ii) does not hold if T is not piecewise, strictly monotone, so no straightforward generalization can be expected. Incidentally, Theorem 30(ii) shows that, under some restrictions, topological entropy can be computed via partitions instead of open covers.

Transfer entropy
A relevant question in time series analysis of coupled random or deterministic processes is the causality relation, i.e., which process is driving and which is responding. A possible method to discriminate cause from effect, proposed originally by Wiener [109], is as follows: "For two simultaneously measured signals, if we can predict the first signal better by using the past information from the second one than by using the information without it, then we call the second signal causal to the first one". The implementation of this principle in linear time series analysis via autoregressive processes goes by the name of Granger causality [110]. Transfer entropy, introduced by Schreiber in [13] for measuring the information exchanged between two processes in both directions separately, can be considered an information-theoretic implementation of Wiener causality.
Let X = (X_t)_{t∈T} and Y = (Y_t)_{t∈T} be two stationary random processes. Then, the transfer entropy from Y to X, T_{Y→X}, is the reduction of uncertainty in future values of X, given past values of X, due to the additional knowledge of past values of Y. Set X_t^{(k)} := (X_t, X_{t-1}, ..., X_{t-k+1}) and Y_t^{(n)} := (Y_t, Y_{t-1}, ..., Y_{t-n+1}). Then, for a coupling delay Λ ≥ 1,
T_{Y→X}(Λ) = H(X_t | X_{t-1}^{(k)}) - H(X_t | X_{t-1}^{(k)}, Y_{t-Λ}^{(n)}) = I(X_t ; Y_{t-Λ}^{(n)} | X_{t-1}^{(k)}).
Unlike the (unconditioned) mutual information (Equation (8)), T_{Y→X}(Λ) is asymmetric under the exchange of the processes X and Y. Therefore, the directionality indicator
∆T_{Y→X}(Λ) = T_{Y→X}(Λ) - T_{X→Y}(Λ) (62)
measures the net transfer of information between the processes X and Y. For example, if ∆T_{Y→X}(Λ) > 0 then Y is the driving process with a coupling delay Λ. This so-called coupling direction is one of the main objectives in the study of interacting systems.
Other definitions of transfer entropy (e.g., with t − 1 instead of t, or with k = n) can be also found in the literature. For a more general notion, called momentary information transfer, see [111].
In practice, data are finite if not sparse. In the latter case, one often uses transfer entropy with the least dimension (i.e., k = n = 1), as we assume hereafter for simplicity. Furthermore, to estimate transfer entropy (as well as other statistics and observables) one resorts to symbolic representations. Indeed, this is a common technique in time series analysis which consists in trading off realizations of a random variable for symbols belonging to some convenient alphabet. Instances of this procedure are binning in traditional statistical data analysis, instantaneous phases via the Hilbert transform, and symbolic dynamics in nonlinear time series analysis (Sect. 2.3). The rationale of the latter instance is that the insight provided by a 'coarse-grained' dynamics may be sufficient for one's needs, especially when the actual, 'sharp' dynamics is too complex for a detailed analysis. In an ordinal representation, a special case of symbolic dynamics, the state space is coarse-grained by the ordinal partition π_L (57), so the symbols are ordinal L-patterns. Some of these symbolic representations have an interesting feature in common, namely, the alphabet (whether finite, countably infinite, or continuous) is a group G. Thus, G = (Z, +) when using bins, G = (S_L, •) (see (56)) when using ordinal L-patterns, and G = ([0, 1), +) when using phases.
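A plug-in estimate of transfer entropy for symbolic sequences with history lengths k = n = 1, i.e., of the conditional mutual information I(X_{t+1}; Y_t | X_t), can be sketched as follows (plain Python; the toy example, where X copies Y with a one-step delay, is an assumption for illustration):

```python
import math
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of T_{Y->X} with histories k = n = 1, computed as
    the conditional mutual information I(X_{t+1}; Y_t | X_t) in nats,
    for symbolic sequences x, y over finite alphabets."""
    n = len(x) - 1
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    singles = Counter(x[:-1])                       # x_t
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p = c / n
        te += p * math.log((p * (singles[x0] / n))
                           / ((pairs_xx[(x1, x0)] / n)
                              * (pairs_xy[(x0, y0)] / n)))
    return te

# Toy example: x copies y with one step of delay, so Y drives X.
y = [0, 1, 1, 0, 1, 0, 0, 1] * 200
x = [0] + y[:-1]
# transfer_entropy(x, y) exceeds transfer_entropy(y, x), so the
# directionality indicator correctly points from Y to X.
```

Being a conditional mutual information of the empirical distribution, the estimate is always nonnegative; with short sequences over large alphabets, however, it is biased upwards, which is one motivation for the dimensional reduction discussed below.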
To show that group-theoretic representations can provide an additional leverage in time series analysis, we consider next random processes with algebraic alphabets, i.e., G-valued random processes, where G is a finite group.
Definition 32. The transfer entropy of two processes with a common algebraic alphabet is called algebraic transfer entropy.
Actually, to warrant the convergence of information-theoretic quantities in case of representations with infinite algebraic groups, it would be sufficient that the probability of the symbols is positive only for a finite number of them, as always happens in practice.
Note that (67) equates a conditional mutual information (with three variables and |G|^3 possible values) to an unconditioned mutual information (with two variables and |G|^2 possible values) thanks to the use of transcripts. This result, called the dimensional reduction of the algebraic transfer entropy, can make a difference in time series analysis if the data sequences are short, as often happens in practice. See [114, Theorem 1] for more general results when the history lengths of the processes involved are arbitrary.
We conclude that the coupling direction between processes with a common algebraic alphabet can be determined with mutual informations of transcripts. Corollary 34 was applied in [115,114] to ordinal representations of time series, with satisfactory results.
Transfer entropy remains a hot research topic because of the crucial role of causality in data analysis. For further insights into this subtle subject, the interested reader is referred to the papers [116,117,118,119,120], just to mention a few.

Applications
The practical applications of mathematical entropy are usually associated with information theory and communications technology. Our picks in this section belong to the realm of applied mathematics. There are certainly others in physics, computer science, and social sciences.

Time series analysis
Entropy plays a prominent role in the analysis of time series, in particular when the data stem from systems assumed to be non-linear. In this context, a central objective of the data analyst is to quantify the complexity of both data and systems. In the following we review some applications of entropy to time series analysis. Needless to say, no claim of completeness is made. Rather, we are going to concentrate on sample, approximate, permutation, Tsallis and Rényi entropies applied to physiological data. Due to their very nature, these data contain a rich structure of complex patterns, besides being often recorded in large amounts. For applications of those entropies to the analysis of data from other fields such as physics, chemistry, biology, economics and geophysics, we refer to [121,122,123,124,98,103,125] and the references therein.
Early applications of approximate, sample and permutation entropies to physiological data were directly connected with the formulation of the corresponding concept or followed it a short time after (compare also [81,126]). Thus, Pincus and Viscarello analyzed fetal heart-rate variability (HRV) with approximate entropy already in 1992 [127], and Lake et al. analyzed neonatal HRV with sample entropy in 2002 [128]. Note that an abnormal heart rate is often signaled by a reduced variability of the electrocardiogram (ECG) characteristics: the right task for entropy! Frank et al. [129] also demonstrated the ability of permutation entropy to classify fetal behavioral states on the basis of HRV. Many other applications can be found in cardiovascular studies based on ECGs. For instance, Acharya et al. [130] and Voss et al. [131] compared the performances of various measures, including approximate and sample entropy, when analyzing HRV. Along the same line, Graff et al. [132] studied the ability of the approximate, sample, permutation, and fuzzy entropies to discriminate between healthy patients and patients with congestive heart failure in the case of short ECG data sets. For further applications and comparisons of entropies related to ECGs, we refer to [98,125].
Just as ECGs are indispensable for measuring heart activity, electroencephalograms (EEGs) and magnetoencephalograms (MEGs) are the principal diagnostic tools for assessing brain activity. Since abnormal brain states are often reflected by a change in the complexity of the electromagnetic activity, the interest in entropy for the analysis of EEGs and MEGs is increasing in medical research and practice.
Approximate, sample and permutation entropies have been the basic tools in the non-linear analysis of EEGs and MEGs from patients with Alzheimer's disease (AD), mainly to distinguish healthy from diseased subjects (see Hornero et al. [133], Morabito et al. [134], and references therein). Another field of applications is the quantification of anesthetic drug effects on the brain activity as measured by EEGs, including comparative testing of different anesthetics and the discrimination between consciousness and unconsciousness. Thereby different entropy measures have been shown to be sensitive to the depth of anaesthesia. For an overview on the main entropy measures used in this field and a comprehensive list of references, see Liang et al. [135].
Some additional applications of the approximate and sample entropies are related to sleep research, in particular to the separation of sleep stages based on EEG data. For example, Acharya et al. [136] compared several non-linear measures, including approximate entropy, for analyzing sleep stages in surface EEGs, while Burioka et al. [137] estimated the values of the approximate entropy for different sleep stages. Also permutation entropy has been applied to classify sleep stages in human EEGs; the first attempt in this direction was made in [138]. Moreover, sleep stage segmentation via the conditional entropy of ordinal patterns (directly related to permutation entropy) has been studied in [139].
Epilepsy research is probably the main biomedical field of application of entropy in time series analysis. The reasons for this are manifold. About one percent of the world's population is estimated to suffer from epilepsy, in a spectrum that goes from mild to severe forms, the latter being associated with serious health and psychosocial problems. Moreover, the mechanisms of epileptic activity are far from being understood. From the viewpoint of non-linear time series analysis, EEG signals related to epileptic activity are interesting because they contain special wave forms (spike waves, sharp waves, slow waves), often believed to be the result of low-dimensional dynamics.
The approximate and sample entropies have been used as biomarkers in algorithms for epileptic EEG analysis, in particular for epileptic seizure detection. In this category, Kannatal et al. [140] tested different entropy measures, among them approximate entropy, using a popular data set related to epilepsy [141]. Srinivasan et al. [142] investigated approximate entropy as an input feature for a neural-network-based automated epileptic EEG detection system, and Jouny et al. [143] considered different complexity and entropy measures, in particular sample entropy and permutation entropy, to characterize early partial seizure onset in intracranial EEG recordings.
The first applications of permutation entropy in epileptic EEG analysis are due to Cao et al. [144], who in this way detected dynamic changes during epileptic seizures in intracranial EEGs, and to Keller and Lauffer [145], who studied vagus stimulation for reducing epileptic activity. The predictability of epileptic seizures by permutation entropy has been the subject of several papers (see, e.g., Li et al. [146] and Bruzzo et al. [147]). Li et al. [148] studied how permutation entropy reflects the changes in surface EEGs due to absence seizures during different seizure phases (seizure-free, pre-seizure, seizure). The choice of parameters in applications of permutation entropy to time series from different sleep stages was studied in [139,82].
As discussed in Sect. 2, the Shannon entropy along with its generalizations, the Tsallis and Rényi entropies, are functions of discrete probability distributions, which for time series are not given a priori. The q-dependent weighting of the probabilities in the Tsallis and Rényi entropies as compared to the Shannon entropy endows data analysis with more flexibility. There are various ways of extracting probability distributions from time series. One of them (already mentioned in previous sections) is symbolic dynamics, which in practice boils down to counting symbols or patterns in time series. This being the case, it is natural to consider the Tsallis and Rényi variants of permutation entropy. First attempts in this direction have been reported in [135] in the field of anesthesia. Moreover, Rényi entropy in combination with simple symbolic dynamics has been applied to HRV by Kurths et al.; see, e.g., [149].
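For illustration, the symbolic approach just described can be sketched in a few lines of Python (a minimal sketch; the function names are ours, not taken from the references): the ordinal-pattern distribution of a time series is extracted by counting patterns over sliding windows, and the Shannon, Tsallis and Rényi entropies are then evaluated on it.

```python
import math
from collections import Counter

def ordinal_pattern(window):
    # Encode the ordering of the window by the permutation that sorts it.
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def pattern_distribution(series, L):
    # Relative frequencies of ordinal patterns of length L over overlapping windows.
    counts = Counter(ordinal_pattern(series[t:t + L])
                     for t in range(len(series) - L + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def tsallis(p, q):
    # S_q = (1 - sum p^q) / (q - 1); tends to the Shannon entropy as q -> 1.
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

def renyi(p, q):
    # H_q = ln(sum p^q) / (1 - q); also tends to the Shannon entropy as q -> 1.
    return math.log(sum(x ** q for x in p)) / (1.0 - q)
```

For a strictly monotone series only one ordinal pattern occurs and all three entropies vanish; the parameter q then lets the analyst stress frequent (q > 1) or rare (q < 1) patterns.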
Furthermore, the estimation of the probability distributions for computing entropies can also be done in the frequency domain. The interest shifts now to the distributions of frequencies and wavelet coefficients at different scales. In this context, the Tsallis entropy has been used for EEG and ECG analysis by a number of groups, e.g., Gamero et al. [150], Capurro et al. [151], and Zhang et al. [152]. As for the Rényi entropy, applications range from epilepsy detection in EEGs (see, e.g., [140]), through artifact rejection in multichannel scalp EEGs (see [153] and references therein), to early diagnosis of Alzheimer's disease in MEG data (see, e.g., [154]).

Statistical inference
Let X be a random variable on a probability space (Ω, B, µ) taking values in a measurable set Γ ⊂ R. Suppose that all we know about X are the expectation values of some measurable functions ('observables') φ_k : Γ → R, 1 ≤ k ≤ K, i.e.,

  ∫_S φ_k(x) ρ(x) dx = m_k,  1 ≤ k ≤ K,  (68)

where ρ(x) ≥ 0 is the probability density function of X, and S ⊂ Γ is the support of ρ(x) (i.e., S = {x ∈ Γ : ρ(x) > 0}). The conditions (68) are called the moment constraints. To guarantee that the integrals (68) exist, we suppose that the functions φ_k, 1 ≤ k ≤ K, are integrable over S.
Given the functions φ k and the numbers m k , 0 ≤ k ≤ K, how can we use this information to best characterize the density function ρ(x) of X?
According to the maximum entropy principle of Jaynes [155], "in making inferences on the basis of partial information we must use the probability distribution which has maximum entropy subject to whatever is known". In the case that nothing is known, this principle leads to Laplace's principle of indifference, also called the principle of insufficient reason, according to which all events are assigned equal probabilities. Therefore, the principle of maximum entropy can be considered as an extension of Laplace's principle.
Theorem 35. (Maximum entropy probability distributions) The probability density function ρ*(x) which maximizes the entropy over all probability density functions satisfying the moment constraints (68) is of the form

  ρ*(x) = e^(λ_0 + Σ_{k=1}^K λ_k φ_k(x)),  x ∈ S,

provided that there exist λ_0, ..., λ_K ∈ R such that ρ*(x) satisfies the constraints (68) and the normalization condition

  ∫_S ρ*(x) dx = 1.  (69)
See [18, Chapter 12] for a proof that does not resort to the calculus of variations. Needless to say, the parameters λ_j are the Lagrange multipliers corresponding to the constraints (69) for j = 0, and (68) for 1 ≤ j ≤ K.
In [156] one can find a table of maximum entropy probability distributions corresponding to a variety of moment constraints. The simplest cases are the following: with no moment constraints and S a bounded interval, the maximum entropy density is the uniform density on S; with fixed mean and S = [0, ∞), it is the exponential density; and with fixed mean and variance and S = R, it is the Gaussian density. The same analysis holds for multivariate random variables. Thus, if S = R^n and ⟨X_i X_j⟩ = R_ij, 1 ≤ i, j ≤ n, then the maximum entropy density is the (zero-mean) Gaussian density

  ρ*(x) = (2π)^(−n/2) (det R)^(−1/2) e^(−x^T R^(−1) x / 2),

where x = (x_1, ..., x_n)^T and R is the autocorrelation matrix (R_ij)_{1≤i,j≤n}. One finds h(ρ*) = (1/2) ln((2πe)^n det R) for the corresponding differential entropy. Random variables with finite alphabets are a particular case of Theorem 35. If Γ = {x_1, ..., x_N} and µ(X = x_n) = p_n, 1 ≤ n ≤ N, then the maximum entropy probability distribution is

  p_n* = (1/Z) e^(Σ_{k=1}^K λ_k φ_k(x_n)),  1 ≤ n ≤ N,

where the normalization factor Z = Σ_{n=1}^N e^(Σ_{k=1}^K λ_k φ_k(x_n)) is called the partition function. It follows that

  ∂ ln Z / ∂λ_k = m_k  (73)

for 1 ≤ k ≤ K. Equations (73) determine the Lagrange multipliers λ_1, ..., λ_K, provided that these can be solved as functions of m_1, ..., m_K.
The probability distribution that maximizes the Gibbs entropy (2) subject to the energy constraint (3) is the so-called Gibbs distribution for the canonical ensemble,

  p_n = e^(−ε_n / k_B T) / Σ_{m=1}^N e^(−ε_m / k_B T),  1 ≤ n ≤ N.

Here Γ = {ε_1, ..., ε_N}, φ_1(x) = x, m_1 = U, and λ_1 = −1/(k_B T). If instead of the Gibbs entropy one uses the Tsallis entropy S_q, Equation (4), then the maximizing distribution is a q-exponential distribution, p_n ∝ [1 − (1 − q) β ε_n]^(1/(1−q)) with β = 1/(k_B T), which recovers the Gibbs distribution in the limit q → 1. The fact that the probability distributions of statistical mechanics (not only for the canonical ensemble) can be deduced by information-theoretical means led Jaynes to conclude that statistical mechanics is an application of information theory. More recent and challenging applications of entropy maximization to collective behavior (e.g., tool use and social cooperation) can be found in [157] and the references therein.
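The canonical-ensemble case makes Equations (73) concrete. The sketch below (our own illustration with hypothetical energy levels, not code from the references) solves the finite-alphabet maximum entropy problem numerically: given levels ε_n and a target mean energy U, it finds the multiplier λ_1 by bisection, exploiting that the mean energy is strictly increasing in λ_1.

```python
import math

def maxent_distribution(levels, lam):
    # p_n proportional to exp(lam * eps_n); Z is the partition function.
    weights = [math.exp(lam * e) for e in levels]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean_energy(levels, p):
    return sum(e * x for e, x in zip(levels, p))

def solve_multiplier(levels, target, lo=-50.0, hi=50.0):
    # d<E>/d(lam) equals the variance of the energy, hence > 0: bisection applies.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(levels, maxent_distribution(levels, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

levels = [0.0, 1.0, 2.0]   # hypothetical energy levels eps_n
U = 0.5                    # target mean energy
lam = solve_multiplier(levels, U)
p = maxent_distribution(levels, lam)
```

For U below the midpoint of the spectrum one finds λ_1 < 0, i.e., a Gibbs distribution at positive temperature, with probabilities decaying geometrically in the energy.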

Cluster analysis
The interpretation of entropy as a measure of information, uncertainty or diversity has also led to various kinds of minimum entropy principles in different fields. As representatives of such principles, we are going to touch upon two strategies used in data clustering.
In cluster analysis one seeks a natural partitioning of a set of n objects into subsets c_1, c_2, . . . , c_k called the clusters. For simplicity, we assume the objects to be identified with the feature vectors x_1, x_2, . . . , x_n ∈ R^m, whose components are the values of m observables at each object. Clearly, what counts as a natural partition depends strongly on the practical aspects of the data analysis at hand, with the result that the number of different clustering criteria is huge. Nevertheless, a rough criterion is that feature vectors inside a cluster should be similar, while dissimilar when belonging to different clusters. An information-theoretical tool to describe the homogeneity of data within a cluster is, of course, entropy. In this context, the idea of minimizing the mean entropy of clusters is straightforward.
Assume that the feature vectors are obtained by independent draws of a continuous m-dimensional random vector X, and describe the cluster assignment by a random variable C taking on the values i = 1, 2, . . . , k, the outcome i being the label of the cluster a feature vector belongs to. This way, a realization of the model (X, C) can be identified with a feature vector and the cluster containing it, respectively. In this setting, the minimization of the mean cluster entropy amounts to searching for a cluster assignment C with minimal conditional entropy

  H(X|C) = −Σ_{i=1}^k p(i) ∫ ρ(x|i) ln ρ(x|i) dx,

where p(i) denotes the probability of the cluster c_i and ρ(x|i) the conditional density of X given C = i (see [158]). In a sense, one seeks a cluster assignment C for which the variable X representing the feature vectors contains as much information on C as possible.
From an automatic learning viewpoint, it is more natural to start from the feature vectors and look for a clustering optimally representing the structure of the set of feature vectors. This approach, called the minimum conditional entropy principle (see e.g. [159,160]), consists in minimizing the conditional entropy

  H(C|X) = −∫ ρ(x) Σ_{i=1}^k ρ(i|x) ln ρ(i|x) dx,

where ρ(x) ≥ 0 is the probability density function of X and ρ(i|x) is the conditional probability of i given x.
Both models, possibly together with additional assumptions about the probability distributions, lead to clustering strategies for given feature vectors x_1, x_2, . . . , x_n via minimizing appropriate estimates of H(X|C) or H(C|X) on the basis of the feature vectors. Note that in [159] a Tsallis variant of H(C|X) has also been discussed.
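As a toy illustration of the first strategy (a plug-in estimator of our own, not an algorithm from [158]), one can estimate H(X|C) for one-dimensional features by replacing the conditional densities with per-cluster histograms; a clustering that groups similar values then exhibits a small mean cluster entropy.

```python
import math
from collections import Counter

def histogram_entropy(values, bin_width):
    # Plug-in Shannon entropy of a histogram with the given bin width,
    # a discrete stand-in for the entropy of the conditional density rho(x|i).
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mean_cluster_entropy(features, labels, bin_width=1.0):
    # H(X|C) = sum_i p(i) * H(X | C = i), with p(i) estimated by cluster size.
    clusters = {}
    for x, i in zip(features, labels):
        clusters.setdefault(i, []).append(x)
    n = len(features)
    return sum(len(v) / n * histogram_entropy(v, bin_width)
               for v in clusters.values())
```

For two well-separated groups of values, the labeling that matches the groups gives mean cluster entropy 0 with a suitable bin width, while a labeling that mixes the groups gives a strictly positive value.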

Testing stochastic independence and determinism
To finish this sample of applications, let us discuss some stochastic independence tests based both on metric and topological permutation entropy. Along with the biomedical applications to time series analysis mentioned in Sect. 5.1, testing financial time series for stochastic independence was also one of the first applications of permutation entropy [161,162].
As already explained in Sect. 4.3, an immediate consequence of Theorem 4.11(ii) for the topological permutation entropy is the existence of forbidden ordinal patterns in the restricted setting of Definition 4.10. To be specific, if I is a closed interval of R and T : I → I is a piecewise strictly monotone map, then there exists L_min ≥ 2 such that for every L ≥ L_min there are length-L ordinal patterns r for which no x ∈ I is of type r. In other words, with the mild provisos just stated (which may be taken for granted in practice), a one-dimensional dynamics always has forbidden ordinal patterns of sufficiently large length. The opposite happens with independent and identically distributed (i.i.d.) random processes, also called white noise: in this case, all ordinal patterns of whatever length are allowed.
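This contrast is easy to observe numerically. The sketch below (our illustration; the full logistic map T(x) = 4x(1 − x) serves as a textbook piecewise strictly monotone interval map) shows that the orbit never realizes the decreasing length-3 pattern, here encoded as (2, 1, 0), whereas a white-noise sequence of comparable length realizes all 3! = 6 patterns.

```python
import random
from collections import Counter

def ordinal_pattern(window):
    # Encode the ordering of the window by the permutation that sorts it.
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def pattern_counts(series, L=3):
    return Counter(ordinal_pattern(series[t:t + L])
                   for t in range(len(series) - L + 1))

# Orbit of the logistic map T(x) = 4x(1 - x) on [0, 1].
orbit = [0.649]
for _ in range(4999):
    orbit.append(4.0 * orbit[-1] * (1.0 - orbit[-1]))

# White noise of the same length, with a fixed seed for reproducibility.
rng = random.Random(1)
noise = [rng.random() for _ in range(5000)]

logistic_counts = pattern_counts(orbit)
noise_counts = pattern_counts(noise)
# (2, 1, 0) encodes x_t > x_{t+1} > x_{t+2}: forbidden for the logistic map,
# since x_{t+2} < x_{t+1} would force x_{t+1} > 3/4, which x_{t+1} < x_t excludes.
```

The argument in the last comment is elementary: x_{t+2} < x_{t+1} requires x_{t+1} > 3/4, while x_{t+1} < x_t requires x_t > 3/4, which in turn forces x_{t+1} = 4x_t(1 − x_t) < 3/4, a contradiction.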
Let (x_t)_{t∈N_0} be the orbit of x_0 ∈ I under the action of T. In practice, an observer will not measure the true values x_t but the 'noisy' values x̂_t instead, due to a number of reasons such as measurement errors and electronic noise. This fact is modeled by adding to x_t a so-called observational noise w_t at each time step, i.e.,

  x̂_t = x_t + w_t,  t ∈ N_0,  (74)

where (w_t)_{t∈N_0} is a realization of a stochastic process. Unless one knows better, the w_t's are supposed to be white noise. If, on the contrary, there are correlations among the variables w_t, then one speaks of colored noise.
One of the preliminary tasks in non-linear time series analysis is precisely to ascertain whether the noisy data at hand have actually been output by a deterministic system, i.e., whether they have the structure (74). There are some standard methods for detecting determinism, e.g., the cross prediction error statistic [80], but denoising should first be applied.
An alternative method in one-dimensional dynamics, which does not require denoising, exploits the existence of forbidden ordinal patterns and the robustness of ordinal patterns against observational noise [101,19,102]. To this end, we use hypothesis testing. Our null hypothesis is that the noisy data (x̂_t)_{t∈N_0} are outcomes of an i.i.d. random process, that is,

  H_0 : (x̂_t)_{t∈N_0} is white noise.  (75)

A working hypothesis in non-linear time series analysis (based on the phenomenon being observed) is that, however random the data may look, there is an underlying deterministic component. This being the case, the rejection of H_0 is equated to determinism.
A further caveat is that, in practice, all time series are finite. This implies the possible existence of false forbidden patterns, i.e., allowed ordinal patterns that are missing just because the time series is finite but would appear if the series were long enough [163]. Therefore, it is always good practice to check the stability of the results against changes in the length of the data.
Thus, consider a noisy and finite time series (x̂_t)_{t=0}^{N−1} and the null hypothesis

  H_0 : (x̂_t)_{t=0}^{N−1} is white noise.  (76)

To accept or reject H_0, there are two simple-minded methods.
Method 1 [19, Chapter 9] consists of the following three steps: (a) count the missing ordinal patterns of a given length in the data; (b) generate surrogates of the data by random permutations, which destroy any temporal structure while preserving the amplitude distribution; (c) count the missing ordinal patterns of the same length in the surrogates. Use now any measure of dissimilarity for the numbers of forbidden patterns (or even for the probability distributions) to accept or reject H_0. Roughly speaking, if the results of (a) and (c) are about the same, the sequence is very likely not deterministic (or the observational noise is so strong compared to the deterministic signal that the latter has been completely masked). Otherwise, the data stems from a deterministic process.
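These three steps can be sketched as follows (a minimal illustration in our own naming, not the code of [19]; the dissimilarity measure here is simply the difference in the numbers of missing patterns):

```python
import random
from itertools import permutations

def ordinal_pattern(window):
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def missing_patterns(series, L):
    # Steps (a)/(c): ordinal L-patterns that never occur in the series.
    seen = {ordinal_pattern(series[t:t + L])
            for t in range(len(series) - L + 1)}
    return set(permutations(range(L))) - seen

def determinism_score(series, L=4, seed=0):
    # Step (b): a random shuffle destroys temporal structure but keeps amplitudes.
    surrogate = series[:]
    random.Random(seed).shuffle(surrogate)
    # Dissimilarity: difference in the numbers of missing patterns.
    return len(missing_patterns(series, L)) - len(missing_patterns(surrogate, L))
```

A score well above zero points at determinism (the data have missing patterns that the surrogate does not), while a score near zero gives no grounds for rejecting H_0.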
Method 2 divides the time series (x̂_t)_{t=0}^{N−1} into K non-overlapping windows e_k of length L. Under the null hypothesis, the ordinal patterns of the non-overlapping windows e_k will also be i.i.d. Let ν_r be the number of e_k's of type r ∈ S_L. Thus, ν_r = 0 means that the L-pattern r is missing in (x̂_t)_{t=0}^{N−1}; otherwise we say that r is visible. We apply then a chi-square goodness-of-fit hypothesis test with the statistic

  χ²(L) = Σ_{r∈S_L} (ν_r − K/L!)² / (K/L!) = (L!/K) Σ_{r∈S_L visible} ν_r² − K,

since Σ_{r∈S_L} ν_r = K. Here K/L! is the expected absolute frequency of an ordinal L-pattern if H_0 holds true. In that case, χ²(L) converges in distribution (as K → ∞) to a chi-square distribution with L! − 1 degrees of freedom. Thus, for large K, a test with approximate level α is obtained by rejecting H_0 if χ²(L) > χ²_{L!−1,1−α}, where χ²_{L!−1,1−α} is the upper 1 − α critical point of the chi-square distribution with L! − 1 degrees of freedom [164]. In our case, the convergence of χ²(L) to the corresponding chi-square distribution may be considered sufficiently good if ν_r > 10 for all visible L-patterns r, and K/L! > 5.
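The statistic is straightforward to compute; the sketch below (our code) implements both forms of χ²(L), whose agreement follows from Σ ν_r = K. For L = 3 there are L! − 1 = 5 degrees of freedom, and H_0 is rejected at level α = 0.05 when χ²(3) exceeds the critical point χ²_{5,0.95} ≈ 11.07.

```python
import math
from collections import Counter

def ordinal_pattern(window):
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def pattern_counts_nonoverlapping(series, L):
    # Ordinal patterns of the K = floor(N/L) non-overlapping windows e_k.
    K = len(series) // L
    nu = Counter(ordinal_pattern(series[k * L:(k + 1) * L]) for k in range(K))
    return K, nu

def chi2_statistic(series, L):
    # First form: sum over all L! patterns; a missing pattern contributes
    # (0 - K/L!)^2 / (K/L!) = K/L! to the sum.
    K, nu = pattern_counts_nonoverlapping(series, L)
    expected = K / math.factorial(L)
    visible = sum((c - expected) ** 2 / expected for c in nu.values())
    return visible + (math.factorial(L) - len(nu)) * expected

def chi2_statistic_short(series, L):
    # Equivalent form (L!/K) * sum nu_r^2 - K over the visible patterns only.
    K, nu = pattern_counts_nonoverlapping(series, L)
    return math.factorial(L) / K * sum(c * c for c in nu.values()) - K
```

For a strictly monotone series only one of the L! patterns is visible, so χ²(L) takes its maximal value (L! − 1)K, far beyond any reasonable critical point.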
Notice that, since this test is based on distributions, it is possible that a deterministic map has no forbidden L-patterns, i.e., ν_r > 0 for all r ∈ S_L, and yet the null hypothesis is rejected because the ν_r's are not evenly distributed.
A standard test for stochastic independence in time series is the Brock-Dechert-Scheinkman (BDS) test, which is based on the correlation dimension. Method 2 was numerically benchmarked against the BDS test in [101,102] by means of the Lorenz map and the delayed Hénon map. In practically all cases studied, Method 2 outperformed the BDS test. A similar method based on a different statistic was proposed in [165].
Finally, one can also use the metric permutation entropy to test the null hypothesis (76). To this end, let

  h*_L((x̂_t)_{t=0}^{N−1}) = −Σ_{r∈S_L} p_r ln p_r

be the empirical metric permutation entropy of order L of (x̂_t)_{t=0}^{N−1}, i.e.,

  p_r = |{(x̂_t, x̂_{t+1}, ..., x̂_{t+L−1}) of type r ∈ S_L : 0 ≤ t ≤ N − L}| / (N − L + 1).
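In code, the empirical metric permutation entropy of order L reads as follows (our sketch, mirroring the formula above with overlapping windows):

```python
import math
from collections import Counter

def empirical_permutation_entropy(series, L):
    # p_r = relative frequency of the ordinal L-patterns over the
    # N - L + 1 overlapping windows of the series.
    counts = Counter(
        tuple(sorted(range(L), key=lambda i: series[t + i]))
        for t in range(len(series) - L + 1)
    )
    n = len(series) - L + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under H_0 all L! patterns are equiprobable, so h*_L is close to its maximal value ln L!; a value significantly below ln L! speaks against white noise.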