Information-Theoretic Inequalities on Unimodular Lie Groups

Classical inequalities used in information theory such as those of de Bruijn, Fisher, and Kullback carry over from the setting of probability theory on Euclidean space to that of unimodular Lie groups. These are groups that possess integration measures that are invariant under left and right shifts, which means that even in noncommutative cases they share many of the useful features of Euclidean space. In practical engineering terms the rotation group and Euclidean motion group are the unimodular Lie groups of most interest, and the development of information theory applicable to these Lie groups opens up the potential to study problems relating to image reconstruction from irregular or random projection directions, information gathering in mobile robotics, satellite attitude control, and bacterial chemotaxis and information processing. Several definitions are extended from the Euclidean case to that of Lie groups including the Fisher information matrix, and inequalities analogous to those in classical information theory are derived and stated in the form of fifteen small theorems. In all such inequalities, addition of random variables is replaced with the group product, and the appropriate generalization of convolution of probability densities is employed.


Introduction
Shannon's brand of information theory is now more than six decades old, and some of the statistical methods developed by Fisher, Kullback, etc., are even older. Similarly, the study of Lie groups is now more than a century old. Despite their relatively long and roughly parallel history, surprisingly few connections appear to have been made between these two vast fields. One such connection is in the area of ergodic theory [1,3,11], where the Boltzmann-Shannon entropy is replaced with topological entropy [39,43,64,77]. Ergodic theory developed in parallel with information theory and remains an active area of research among mathematicians to the current day (see e.g., [83]). Both use concepts of entropy (though these concepts are quite different from each other), and some common treatments have been given over the years (see e.g., [10]). However, it should be noted that some of the cornerstones of information theory such as the de Bruijn inequality, Fisher information, Kullback-Leibler divergence, etc., do not carry over to ergodic theory. And while connections between ergodic theory and Lie groups are quite strong, connections between information theory and Lie groups are virtually nonexistent. The goal of this paper is therefore to present a unified framework of "information theory on Lie groups." As such, fifteen small theorems are presented that involve the structure and/or group operation of Lie groups. Unlike extensions of information theory to manifolds, the added structure inherent in Lie groups allows us to draw much stronger parallels with inequalities of classical information theory, such as those presented in [26,27].
In recent years a number of connections have begun to emerge linking information theory, group theory, and geometry. A cross section of that work is reviewed here, and it is explained how the results of this paper are distinctly different from prior works.
In the probability and statistics literature, the statistical properties of random walks and limiting distributions on Lie groups have been studied extensively by examining the properties of iterated convolutions [28,35,38,53,61,62,68]. The goal in many of these works is to determine the form of the limiting distribution, and the speed of convergence to it. This is a problem closely related to those in information theory. However, to the author's knowledge concepts such as entropy, Fisher information, Kullback-Leibler divergence, etc., are not used significantly in those analyses. Rather, techniques of harmonic analysis (Fourier analysis) on Lie groups are used, such as the methods described in [24,32,36,57,70,71,73,75,80]. Indeed, to the best of the author's knowledge the only work that uses the concept and properties of information-theoretic (as opposed to topological) entropy on Lie groups is that of Johnson and Suhov [40,41]. Their goal was to use the Kullback-Leibler divergence between probability density functions on compact Lie groups to study the convergence to uniformity under iterated convolutions, in analogy with what was done by Linnik [51] and Barron [6] in the commutative case. The goal of the present paper is complementary: using some of the same tools, many of the major defined quantities and inequalities of (differential) information theory are extended from R^n to the context of unimodular Lie groups, which form a broader class of Lie groups than compact ones.
The goal here is to define and formalize probabilistic and information-theoretic quantities that are currently arising in scenarios such as robotics [48,54,56,65,72,58,47,76], bacterial motion [9,74], and parts assembly in automated manufacturing systems [13,25,44,60,69]. The topics of detection, tracking, estimation, and control on Lie groups have been studied extensively over the past four decades. For example, see [14,15,18,29,42,67,24,58,76,55,5,78] (and references therein). Many of these problems involve probability densities on the group of rigid-body motions. However, rather than focusing only on rigid-body motions, a general information theory on the much broader class of unimodular Lie groups is presented here with little additional effort.
Several other research areas that would initially appear to be related to the present work have received intensive interest. For decades, Amari has developed the concept of information geometry [2] in which the Fisher information matrix is used to define a Riemannian metric tensor on spaces of probability distributions, thereby allowing those spaces to be viewed as Riemannian manifolds. This provides a connection between information theory and differential geometry. However, in information geometry, the probability distributions themselves (such as Gaussian distributions) are defined on a Euclidean space, rather than on a Lie group.
A different kind of connection between information theory and geometry has been established in the context of medical imaging and computer vision in which probability densities on manifolds are analyzed using information-theoretic techniques [59]. However, a manifold generally does not have an associated group operation, and so there is no natural way to "add" random variables.
Relatively recently, Yeung and coworkers have used the structure of finite groups to derive new inequalities for discrete information. While this heavily involves the use of the theory of finite groups, the goal is to derive new inequalities for classical information theory, i.e., that which is concerned with discrete information related to finite sets. For example, see the work of Chan and Yeung [19,20] and Zhang and Yeung [81]. Li and Chong [49] and Chan [20] have addressed the relationship between group homomorphisms and information inequalities using the Ingleton inequality. In these works, the groups are discrete, and the new inequalities that are derived pertain to classical informational quantities. In contrast, the goal of the current presentation is to extend concepts from information theory to the case where variables "live in" a Lie group.
While on the one hand work that connects geometry and information theory exists, and on the other hand work that connects finite-group theory and information theory exists, very little has been done along the lines of developing information theory on Lie groups, which in addition to possessing the structure of differential manifolds, also are endowed with group operations. Indeed, it would appear that applications such as deconvolution on Lie groups [21] (which can be formulated in an information-theoretic context [79,46]), and the field of Simultaneous Localization and Mapping (or SLAM) [72] have preceded the development of formal information inequalities that take advantage of the Lie-group structure of rigid-body motions.
This paper attempts to address this deficit with a two-pronged approach: (1) by collecting some known results from the functional analysis literature and reinterpreting them in information-theoretic terms (e.g., Gross' log-Sobolev inequality on Lie groups); (2) by defining information-theoretic quantities such as entropy, covariance and Fisher information matrix, and deriving inequalities involving these quantities that parallel those in classical information theory.
The remainder of this paper is structured as follows: Section 2 provides a brief review of the theory of unimodular Lie groups and gives several concrete examples (the rotation group, Euclidean motion group, Heisenberg group, and special linear group). An important distinction between information theory on manifolds and that on Lie groups is that the existence of the group operation in the latter case plays an important role. Section 3 defines entropy and relative entropy for unimodular Lie groups and proves some of their properties under convolution and marginalization over subgroups and coset spaces. The concept of the Fisher information matrix for probability densities on unimodular Lie groups is defined in Section 4 and several elementary properties are proven. This generalized concept of Fisher information is used in Section 5 to establish the de Bruijn inequality for unimodular Lie groups. Finally, these definitions and properties are combined with recent results by others on log-Sobolev inequalities in Section 6.

A Brief Review of Unimodular Lie Groups
Rather than starting with formal definitions, examples of unimodular Lie groups are first introduced and their common features are identified; their formal properties are then enumerated.

An Introduction to Lie Groups via Examples
Perhaps one reason why there has been little cross-fertilization between the theory of Lie groups and information theory is that the presentation styles in these two fields are very different. Whereas Lie groups belong to pure mathematics, information theory emerged from engineering. Therefore, this section reviews some of the basic properties of Lie groups from a concrete engineering perspective. All of the groups considered are therefore matrix Lie groups.

Example 1: The Rotation Group
Consider the set of 3 × 3 rotation matrices

SO(3) = {R ∈ R^{3×3} | R R^T = I, det R = +1}.

Here SO(3) denotes the set of special orthogonal 3 × 3 matrices with real entries. It is easy to verify that this set is closed under matrix multiplication and inversion. That is, if R_1, R_2 ∈ SO(3) then R_1 R_2 ∈ SO(3) and R^{-1} = R^T ∈ SO(3). Furthermore, the 3 × 3 identity matrix is in this set, and the associative law R_1(R_2 R_3) = (R_1 R_2)R_3 holds, as is true for matrix multiplication in general. This means that SO(3) is a group, and is called the special orthogonal (or rotation) group. Furthermore, it can be reasoned that the nine entries in a 3 × 3 real matrix are constrained by the orthogonality condition R R^T = I to the point where a three-degree-of-freedom subspace remains. (The condition det R = +1 does not further constrain the dimension of this subspace, though it does limit the discussion to one component of the space defined by the orthogonality condition).
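The closure properties above can be checked numerically. The following sketch (in plain Python; the helper functions Rz, Rx, matmul, etc., are illustrative and not part of the original text) multiplies two coordinate-axis rotations and verifies that the product is again in SO(3):

```python
import math

def matmul(A, B):
    # product of two 3x3 matrices stored as nested lists
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
          - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
          + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

def Rz(t):  # counterclockwise rotation about the third coordinate axis
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def Rx(t):  # counterclockwise rotation about the first coordinate axis
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

R = matmul(Rz(0.7), Rx(1.2))           # product of two rotations
RRt = matmul(R, transpose(R))          # should equal the identity
closure_err = max(abs(RRt[i][j] - (1.0 if i == j else 0.0))
                  for i in range(3) for j in range(3))
det_err = abs(det3(R) - 1.0)           # det of the product is still +1
```

Both errors are at the level of machine precision, confirming that the product again satisfies the defining constraints of SO(3).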
It is common to describe the three degrees of freedom of the rotation group using parametrizations such as the ZXZ Euler angles

R(α, β, γ) = R_3(α) R_1(β) R_3(γ),

where R_i(θ) is a counterclockwise rotation about the i-th coordinate axis. Another popular description of 3D rotations is the axis-angle parametrization

R(ϑ, n) = I + sin ϑ N + (1 − cos ϑ) N^2, (2)

where N is the unique skew-symmetric matrix such that N x = n × x for any x ∈ R^3, n is the unit vector pointing along the axis of rotation, and × is the vector cross product. The "vee and hat" notation

n̂ = N and N^∨ = n (3)

is used to describe this relationship. Here ‖n‖ = (n · n)^{1/2} = 1. The axis can be parameterized in spherical coordinates as n = n(φ, θ), and so a parametrization of the form R = R(ϑ, φ, θ) results. The angles ϑ, φ, θ are not the same as the Euler angles α, β, γ.
The group SO(3) is a compact Lie group, and therefore has finite volume. When using Euler angles, volume is computed with respect to the integration measure

dR = (1/(8π^2)) sin β dα dβ dγ, (4)

which when integrated over 0 ≤ α, γ ≤ 2π and 0 ≤ β ≤ π gives a value of 1. Indeed, this result was obtained by construction by using the normalization of 8π^2. The same volume element will take on a different form when using the axis-angle parametrization, in analogy with the way that the volume element in R^3 can be expressed in the equivalent forms dx dy dz and r^2 sin θ dr dφ dθ in Cartesian and spherical coordinates, respectively. Given any 3-parameter description of rotation, the angular velocity of a rigid body can be obtained from a rotation matrix. Angular velocities in the body-fixed and space-fixed reference frames can be written respectively as ω_r = J_r(q) q̇ and ω_l = J_l(q) q̇, where q is any parametrization (e.g., q = [α, β, γ]^T or q = [ϑ, φ, θ]^T, where T denotes the transpose of a vector or matrix).
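The normalization in (4) is easy to confirm numerically. Since the integrand is independent of α and γ, the triple integral reduces to a one-dimensional integral over β; the midpoint-rule sketch below (illustrative code, not from the text) checks that the total volume is 1:

```python
import math

# Check that (1/(8*pi^2)) * sin(beta) d(alpha) d(beta) d(gamma)
# integrates to 1 over 0 <= alpha, gamma <= 2*pi and 0 <= beta <= pi.
N = 2000
dbeta = math.pi / N
# midpoint rule for the integral of sin(beta) over [0, pi] (exact value: 2)
integral_beta = sum(math.sin((k + 0.5) * dbeta) * dbeta for k in range(N))
volume = (2 * math.pi) * (2 * math.pi) * integral_beta / (8 * math.pi ** 2)
```

The computed volume agrees with 1 to within the quadrature error of the midpoint rule.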
The Jacobian matrices J_r(q) and J_l(q) are computed from the parametrization R(q) and the definition of the ∨ operation in (3) as

J_r(q) = [(R^T ∂R/∂q_1)^∨, (R^T ∂R/∂q_2)^∨, (R^T ∂R/∂q_3)^∨] and J_l(q) = [((∂R/∂q_1) R^T)^∨, ((∂R/∂q_2) R^T)^∨, ((∂R/∂q_3) R^T)^∨].

This gives a hint as to why the subscripts l and r are used: if derivatives with respect to parameters appear on the 'right' of R^T, this is denoted with an r, and if they appear on the 'left' then a subscript l is used. Explicitly for the Euler angles, |det J_r(q)| = |det J_l(q)| = sin β, which gives the factor that appears in the volume element dR in (4). This is not a coincidence. For any parametrization of SO(3) of the form R(q), the volume element can be expressed as dR = (1/(8π^2)) |J(q)| dq_1 dq_2 dq_3, where J(q) can be taken to be either J_r(q) or J_l(q). Though these matrices are not equal, their determinants are.
Whereas the set of all rotations together with matrix multiplication forms a noncommutative (R_1 R_2 ≠ R_2 R_1 in general) Lie group, the set of all angular velocity vectors ω_r and ω_l (or more precisely, their corresponding matrices, ω̂_r and ω̂_l) together with the operations of addition and scalar multiplication forms a vector space. Furthermore, this vector space is endowed with an additional operation, the cross product ω_1 × ω_2 (or equivalently the matrix commutator [ω̂_1, ω̂_2] = ω̂_1 ω̂_2 − ω̂_2 ω̂_1). This makes the set of all angular velocities a Lie algebra, which is denoted as so(3) (as opposed to the Lie group, SO(3)).
The Lie algebra so(3) consists of skew-symmetric matrices of the form

X = Σ_{i=1}^{3} x_i X_i = [ 0, −x_3, x_2 ; x_3, 0, −x_1 ; −x_2, x_1, 0 ].

The skew-symmetric matrices {X_i} form a basis for the set of all such 3 × 3 skew-symmetric matrices, and the coefficients {x_i} are all real. Lie algebras and Lie groups are related in general by the exponential map. For matrix Lie groups (which are the only kind of Lie groups that will be discussed here), the exponential map is the matrix exponential function. In this specific case,

exp X = I + (sin ‖x‖/‖x‖) X + ((1 − cos ‖x‖)/‖x‖^2) X^2. (8)

It is well known (see [24] for derivation and references) that (8) is simply a variation on (2) with x = ϑn. An interesting and useful fact is that except for a set of measure zero, all elements of SO(3) can be captured with the parameters within the open ball defined by ‖x‖ < π, and the matrix logarithm of any group element parameterized in this range is also well defined. It is convenient to know that the angle of the rotation, ϑ(R), is related to the exponential parameters as |ϑ(R)| = ‖x‖. Furthermore, ϑ(R) = cos^{-1}((tr R − 1)/2) and log R = (ϑ(R)/(2 sin ϑ(R))) (R − R^T). Relatively simple analytical expressions have been derived for the Jacobian J_l and its inverse when rotations are parameterized as in (8):

J_l = I + ((1 − cos ‖x‖)/‖x‖^2) X + ((‖x‖ − sin ‖x‖)/‖x‖^3) X^2

and

J_l^{-1} = I − (1/2) X + (1/‖x‖^2)(1 − (‖x‖ sin ‖x‖)/(2(1 − cos ‖x‖))) X^2.

The corresponding Jacobian J_r is calculated as [24]

J_r = I − ((1 − cos ‖x‖)/‖x‖^2) X + ((‖x‖ − sin ‖x‖)/‖x‖^3) X^2.

Note that J_l = J_r^T and J_l = R J_r.
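The closed form (8) can be checked against a truncated matrix-exponential power series. The sketch below (illustrative helper functions, plain Python) also confirms that the rotation angle recovered from the trace equals ‖x‖:

```python
import math

def hat(x):
    # skew-symmetric matrix with hat(x) v = x cross v
    return [[0.0, -x[2], x[1]], [x[2], 0.0, -x[0]], [-x[1], x[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def madd(A, B, s=1.0):
    # A + s*B for 3x3 matrices
    return [[A[i][j] + s * B[i][j] for j in range(3)] for i in range(3)]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [0.3, -0.5, 0.4]
t = math.sqrt(sum(xi * xi for xi in x))     # ||x|| = rotation angle (< pi)
X = hat(x)

# closed form (8): exp(X) = I + (sin t / t) X + ((1 - cos t)/t^2) X^2
R_closed = madd(madd(I3, X, math.sin(t) / t),
                matmul(X, X), (1.0 - math.cos(t)) / t ** 2)

# truncated matrix-exponential series  sum_k X^k / k!
R_series, term = I3, I3
for k in range(1, 30):
    term = [[sum(term[i][m] * X[m][j] for m in range(3)) / k
             for j in range(3)] for i in range(3)]     # term = X^k / k!
    R_series = madd(R_series, term)

err = max(abs(R_closed[i][j] - R_series[i][j])
          for i in range(3) for j in range(3))
# recover the rotation angle from the trace: cos(theta) = (tr R - 1)/2
angle = math.acos((R_closed[0][0] + R_closed[1][1] + R_closed[2][2] - 1.0) / 2.0)
```

Since ‖x‖ < π here, the recovered angle equals ‖x‖, consistent with |ϑ(R)| = ‖x‖.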
The determinants are |det J_l| = |det J_r| = 2(1 − cos ‖x‖)/‖x‖^2.

Example 2: The Euclidean Motion Group of the Plane
The Euclidean motion group of the plane can be thought of as the set of all matrices of the form

g(x, y, θ) = [ cos θ, −sin θ, x ; sin θ, cos θ, y ; 0, 0, 1 ] (10)

together with the operation of matrix multiplication. It is straightforward to verify that the form of these matrices is closed under multiplication and inversion, and that g(0, 0, 0) = I, and that it is therefore a group. This is often referred to as the special Euclidean group, and is denoted as SE(2). Like SO(3), SE(2) is three dimensional. However, unlike SO(3), SE(2) is not compact. Nevertheless, it is possible to define a natural integration measure for SE(2) as dg = dx dy dθ.
And while SE(2) does not have finite volume (and so there is no single natural normalization constant such as 8π 2 in the case of SO(3)), this integration measure nevertheless can be used to compute probabilities from probability densities.
Note that g(x, y, θ) = exp(xX_1 + yX_2) exp(θX_3) where

X_1 = [ 0, 0, 1 ; 0, 0, 0 ; 0, 0, 0 ], X_2 = [ 0, 0, 0 ; 0, 0, 1 ; 0, 0, 0 ], X_3 = [ 0, −1, 0 ; 1, 0, 0 ; 0, 0, 0 ].

These matrices form a basis for the Lie algebra, se(2). It is convenient to identify these with the natural basis for R^3 by defining (X_i)^∨ = e_i. In so doing, any element of se(2) can be identified with a vector in R^3. The Jacobians for this parametrization are then of the form

J_r = [ cos θ, sin θ, 0 ; −sin θ, cos θ, 0 ; 0, 0, 1 ] and J_l = [ 1, 0, y ; 0, 1, −x ; 0, 0, 1 ].

Note that |det(J_l)| = |det(J_r)| = 1.
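The factorization g(x, y, θ) = exp(xX_1 + yX_2) exp(θX_3) can be verified directly: since X_1 and X_2 are nilpotent, the first exponential terminates after its linear term and is a pure translation. A short sketch (the helpers se2 and matmul are illustrative):

```python
import math

def se2(x, y, th):
    # homogeneous 3x3 representation of an SE(2) element
    c, s = math.cos(th), math.sin(th)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

x, y, th = 1.5, -0.7, 0.9
# exp(x X1 + y X2): the translation generators are nilpotent, so the
# exponential series terminates and gives a pure translation
T = [[1.0, 0.0, x], [0.0, 1.0, y], [0.0, 0.0, 1.0]]
Rot = se2(0.0, 0.0, th)                 # exp(th X3): pure rotation
g = matmul(T, Rot)
err = max(abs(g[i][j] - se2(x, y, th)[i][j])
          for i in range(3) for j in range(3))
```

The product reproduces the matrix in (10) exactly (up to floating-point roundoff).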
This parametrization is not unique, though it is probably the most well-known one.
As an alternative, consider the exponential parametrization exp : se(2) → SE(2),

g = exp(x_1 X_1 + x_2 X_2 + x_3 X_3).

Comparing this with (10) it is clear that x_3 = θ, but x ≠ x_1 and y ≠ x_2 in general. The Jacobians in this exponential parametrization are more complicated, but it follows that |det(J_l)| = |det(J_r)| = 2(1 − cos x_3)/x_3^2.

Example 3: The Heisenberg Group
The Heisenberg group, H(1), is defined by elements of the form

g(α, β, γ) = [ 1, α, γ ; 0, 1, β ; 0, 0, 1 ] (12)

and the operation of matrix multiplication. Therefore, the group law can be viewed in terms of parameters as

g(α_1, β_1, γ_1) ∘ g(α_2, β_2, γ_2) = g(α_1 + α_2, β_1 + β_2, γ_1 + γ_2 + α_1 β_2).

The identity element is the identity matrix g(0, 0, 0), and the inverse of an arbitrary element g(α, β, γ) is g(−α, −β, αβ − γ). Basis elements for the Lie algebra are

X_1 = [ 0, 1, 0 ; 0, 0, 0 ; 0, 0, 0 ], X_2 = [ 0, 0, 0 ; 0, 0, 1 ; 0, 0, 0 ], X_3 = [ 0, 0, 1 ; 0, 0, 0 ; 0, 0, 0 ].

The Lie bracket satisfies [X_1, X_2] = X_3, with all other brackets of basis elements vanishing. If the inner product for the Lie algebra spanned by these basis elements is defined as (X, Y) = tr(XY^T), then this basis is orthonormal: (X_i, X_j) = δ_{ij}. As a result of nilpotency, the matrix exponential is a polynomial in the coordinates {x_i}:

exp(x_1 X_1 + x_2 X_2 + x_3 X_3) = g(x_1, x_2, x_3 + x_1 x_2/2).

The parametrization in (12) can be viewed as the following product of exponentials:

g(α, β, γ) = exp(γ X_3) exp(β X_2) exp(α X_1).

The logarithm is obtained by solving for each x_i as a function of α, β, γ. By inspection this is x_1 = α, x_2 = β, x_3 = γ − αβ/2. The Jacobian matrices for this group can be computed in either parametrization. In terms of α, β, γ,

J_r = [ 1, 0, 0 ; 0, 1, 0 ; 0, −α, 1 ] and J_l = [ 1, 0, 0 ; 0, 1, 0 ; −β, 0, 1 ].

Similar expressions hold in terms of exponential coordinates. In both parametrizations |det J_r| = |det J_l| = 1.
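The parameter-level group law and inverse can be checked with the upper-unitriangular 3 × 3 realization of H(1) (a standard realization; the helper functions here are illustrative):

```python
def H(a, b, c):
    # element g(alpha, beta, gamma) of the Heisenberg group H(1)
    return [[1.0, a, c], [0.0, 1.0, b], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

a1, b1, c1 = 0.4, -1.1, 0.25
a2, b2, c2 = -0.3, 0.8, 1.5

# group law: g(a1,b1,c1) g(a2,b2,c2) = g(a1+a2, b1+b2, c1+c2+a1*b2)
prod = matmul(H(a1, b1, c1), H(a2, b2, c2))
law = H(a1 + a2, b1 + b2, c1 + c2 + a1 * b2)
law_err = max(abs(prod[i][j] - law[i][j]) for i in range(3) for j in range(3))

# inverse: g(a,b,c)^(-1) = g(-a, -b, a*b - c)
ident = matmul(H(a1, b1, c1), H(-a1, -b1, a1 * b1 - c1))
inv_err = max(abs(ident[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
```

Both checks hold to machine precision for arbitrary parameter values.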

Example 4: The Special Linear Group
The group SL(2, R) consists of all 2 × 2 matrices with real entries with determinant equal to unity. In other words, for a, b, c, d ∈ R, elements of SL(2, R) are of the form

g = [ a, b ; c, d ] with ad − bc = 1.

Subgroups of SL(2, R) include the one-parameter subgroups g_1(x), g_2(y), and g_3(θ) obtained by exponentiating basis elements of the Lie algebra. A basis for the Lie algebra sl(2, R) consists of traceless 2 × 2 matrices X_1, X_2, X_3, and an inner product can be defined in which this basis is orthonormal. It can be shown that any g ∈ SL(2, R) can be expressed as a product of g_1(x), g_2(y), and g_3(θ). This is called an Iwasawa decomposition of SL(2, R).
The above g_i are not the only subgroups of SL(2, R). For example, exponentiating matrices of the form ξ · (X_3 + 2X_2) also results in a one-parameter subgroup. The Iwasawa decomposition allows one to write an arbitrary g ∈ SL(2, R) in the form [70] g = g_1(x) g_2(y) g_3(θ). In this parametrization the right Jacobian J_r and the left Jacobian J_l can be computed explicitly, and it is easy to verify that |det J_r| = |det J_l|. Hence, SL(2, R) is unimodular (which means the determinants of the left and right Jacobians are the same).
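An Iwasawa-style decomposition can be computed numerically for any concrete element of SL(2, R). Since the specific factors g_i used in the text are not reproduced here, the sketch below uses the standard K (rotation), A (positive diagonal), N (upper unipotent) factors, which is one common convention; the helper mm and the sample element are illustrative:

```python
import math

# a sample element of SL(2,R): det = 2*0.875 - 0.5*1.5 = 1
g = [[2.0, 0.5], [1.5, 0.875]]
det = g[0][0] * g[1][1] - g[0][1] * g[1][0]

# factor g = K A N with K a rotation, A = diag(r, 1/r), N upper unipotent
r = math.hypot(g[0][0], g[1][0])
K = [[g[0][0] / r, -g[1][0] / r], [g[1][0] / r, g[0][0] / r]]
# K^T g is upper triangular with diagonal (r, 1/r) because det(g) = 1
u = K[0][0] * g[0][1] + K[1][0] * g[1][1]   # (1,2) entry of K^T g
A = [[r, 0.0], [0.0, 1.0 / r]]
N = [[1.0, u / r], [0.0, 1.0]]

def mm(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

recon = mm(K, mm(A, N))                     # rebuild g from its factors
err = max(abs(recon[i][j] - g[i][j]) for i in range(2) for j in range(2))
```

This is essentially a 2 × 2 QR decomposition; the unit-determinant constraint forces the triangular factor's diagonal to be (r, 1/r).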

Generalizations
Whereas several low-dimensional examples of Lie groups were presented to make the discussion concrete, a vast variety of different kinds of Lie groups exist. For example, the same constraints that were used to define SO(3) relative to R^{3×3} can be used to define SO(n) from R^{n×n}. The result is a Lie group of dimension n(n − 1)/2 with a natural volume element dR. Similarly, the Euclidean motion group generalizes as the set of all (n + 1) × (n + 1) matrices of the form

g = [ R, t ; 0^T, 1 ] with R ∈ SO(n) and t ∈ R^n,

resulting in SE(n) having dimension n(n + 1)/2 and natural volume element dg = dR dt, where dt = dt_1 dt_2 ⋯ dt_n is the natural integration measure for R^n. The following subsections briefly review the general theory of Lie groups that will be relevant when defining information-theoretic inequalities.

Exponential, Logarithm, and Vee Operation
In general an n-dimensional real matrix Lie algebra is defined by a basis consisting of real matrices {X_i} for i = 1, ..., n that is closed under the matrix commutator. That is,

[X_i, X_j] = X_i X_j − X_j X_i = Σ_{k=1}^{n} C^k_{ij} X_k,

where the constants C^k_{ij} are called the structure constants of the Lie algebra.
The parametrization g(x) = exp X, where X = Σ_{i=1}^{n} x_i X_i, is always valid in a region around the identity in the corresponding Lie group. And in fact, for the examples discussed, this parametrization is good over almost the whole group, with the exception of a set of measure zero. The logarithm map log g(x) = X (which is the inverse of the exponential) is valid except on this set of measure zero. It will be convenient in the analysis to follow to identify a vector x ∈ R^n with X via

x = X^∨ = Σ_{i=1}^{n} x_i e_i.

Here {e_i} is the natural basis for R^n.
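For so(3), the structure constants can be verified directly: C^k_{ij} = ε_{ijk}, so the vee of the bracket of basis elements equals the cross product of the corresponding natural basis vectors. A short sketch (illustrative helpers, exact integer arithmetic):

```python
def hat(x):
    # map R^3 to so(3): hat(x) v = x cross v
    return [[0, -x[2], x[1]], [x[2], 0, -x[0]], [-x[1], x[0], 0]]

def vee(X):
    # inverse of hat: extract (x1, x2, x3) from a skew-symmetric matrix
    return (X[2][1], X[0][2], X[1][0])

def mm(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def bracket(A, B):
    AB, BA = mm(A, B), mm(B, A)
    return [[AB[i][j] - BA[i][j] for j in range(3)] for i in range(3)]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

E = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # natural basis e_i of R^3
X = [hat(e) for e in E]                  # corresponding basis X_i of so(3)

# closure under the commutator: vee([X_i, X_j]) = e_i cross e_j
ok = all(vee(bracket(X[i], X[j])) == cross(E[i], E[j])
         for i in range(3) for j in range(3))
```

This confirms closure of the basis under the commutator, as required of a Lie algebra.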
In terms of quantities that have been defined in the examples, the adjoint matrices Ad and ad are the following matrix-valued functions:

Ad(g) = [(g X_1 g^{-1})^∨, ..., (g X_n g^{-1})^∨] and ad(X) = [[X, X_1]^∨, ..., [X, X_n]^∨].

The dimensions of these square matrices are the same as the dimension of the Lie group, which can be very different from the dimensions of the matrices that are used to represent the elements of the group. The function ∆(g) = det Ad(g) is called the modular function of G. For a unimodular Lie group, ∆(g) = 1.

Integration and Differentiation on Unimodular Lie Groups
Unimodular Lie groups are defined by the fact that their integration measures are invariant under shifts and inversions. In any parametrization, this measure (or the corresponding volume element) can be expressed as in the examples by first computing a left or right Jacobian matrix and then setting dg = |J(q)| dq_1 dq_2 ⋯ dq_n, where n is the dimension of the group. In the special case when q = x are the exponential coordinates,

dg = |det((I − e^{−ad(X)})/ad(X))| dx,

where x = X^∨ and dx = dx_1 dx_2 ⋯ dx_n. In the above expression it makes sense to write the division of one matrix by another because the involved matrices commute. The symbol 𝒢 is used to denote the Lie algebra corresponding to G. In practice the integral is performed over a subset of 𝒢, which is equivalent to defining f(e^X) to be zero over some portion of 𝒢.
Let f(g) be a probability density function (or pdf for short) on a Lie group G. Then f(g) ≥ 0 and ∫_G f(g) dg = 1. It can be shown that unimodularity implies the following equalities for arbitrary h ∈ G, which generally do not all hold simultaneously for measures on nonunimodular Lie groups:

∫_G f(g) dg = ∫_G f(h ∘ g) dg = ∫_G f(g ∘ h) dg = ∫_G f(g^{-1}) dg. (21)

Many different kinds of unimodular Lie groups exist. For example, SO(3) is compact and therefore has finite volume; SE(2) belongs to a class of Lie groups that are called solvable; H(1) belongs to a class called nilpotent; and SL(2, R) belongs to a class called semisimple. Each of these classes of Lie groups has been studied extensively. But for the purpose of this discussion, it is sufficient to treat them all within the larger class of unimodular Lie groups.
Given a function f(g), the left and right Lie derivatives are defined with respect to any basis element of the Lie algebra X_i ∈ 𝒢 as

(X̃_i^l f)(g) = d/dt f(exp(tX_i) ∘ g)|_{t=0} and (X̃_i^r f)(g) = d/dt f(g ∘ exp(tX_i))|_{t=0}. (22)

The use of l and r mimics the way that the subscripts were used in the Jacobians J_l and J_r in the sense that if exp(tX_i) appears on the left/right then the corresponding derivative is given an l/r designation. This notation, while not standard in the mathematics literature, is useful in computations because when evaluating left/right Lie derivatives in coordinates g = g(q), the left/right Jacobians enter in the computation as [24]

X̃^l = J_l^{-T}(q) ∇_q and X̃^r = J_r^{-T}(q) ∇_q,

where X̃^r = [X̃_1^r, ..., X̃_n^r]^T, X̃^l = [X̃_1^l, ..., X̃_n^l]^T, and ∇_q = [∂/∂q_1, ..., ∂/∂q_n]^T is the gradient operator treating q like Cartesian coordinates.

Probability Theory and Harmonic Analysis on Unimodular Lie Groups
Given two probability density functions f_1(g) and f_2(g), their convolution is

(f_1 * f_2)(g) = ∫_G f_1(h) f_2(h^{-1} ∘ g) dh. (24)

Here h ∈ G is a dummy variable of integration. Convolution inherits associativity from the group operation, but since the group operation is generally noncommutative, (f_1 * f_2)(g) ≠ (f_2 * f_1)(g) in general. For a unimodular Lie group, the convolution integral of the form in (24) can be written in the following equivalent ways:

(f_1 * f_2)(g) = ∫_G f_1(g ∘ k^{-1}) f_2(k) dk = ∫_G f_1(z^{-1}) f_2(z ∘ g) dz,

where the substitutions z = h^{-1} and k = h^{-1} ∘ g have been made, and the invariance of integration under shifts and inversions in (21) is used.
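A finite-group analogue of (24), anticipating the discrete treatment later in the paper, makes the noncommutativity concrete. The sketch below (illustrative helpers; S3 is used purely as a small noncommutative example) builds two random pdfs on the symmetric group S3 and convolves them in both orders:

```python
import random
from itertools import permutations

elems = list(permutations(range(3)))            # the symmetric group S3

def compose(p, q):                              # (p o q)(i) = p[q[i]]
    return tuple(p[q[i]] for i in range(3))

def inverse(p):
    inv = [0, 0, 0]
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

def convolve(f1, f2):
    # discrete analogue of (24): (f1 * f2)(g) = sum_h f1(h) f2(h^{-1} o g)
    return {g: sum(f1[h] * f2[compose(inverse(h), g)] for h in elems)
            for g in elems}

random.seed(0)
def random_pdf():
    w = [random.random() for _ in elems]
    s = sum(w)
    return {g: wi / s for g, wi in zip(elems, w)}

f1, f2 = random_pdf(), random_pdf()
c12, c21 = convolve(f1, f2), convolve(f2, f1)
total = sum(c12.values())                       # convolution of pdfs is a pdf
noncommutative = max(abs(c12[g] - c21[g]) for g in elems) > 1e-12
```

The convolution of two pdfs is again a pdf, but the two orders of convolution generally disagree on a noncommutative group.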
A powerful generalization of classical Fourier analysis exists. It is built on families of unitary matrix-valued functions of group-valued argument that are parametrized by values λ drawn from a set Ĝ and satisfy the homomorphism property

U(g_1 ∘ g_2, λ) = U(g_1, λ) U(g_2, λ).

Using * to denote the Hermitian conjugate, it follows that

U(g^{-1}, λ) = U^{-1}(g, λ) = U*(g, λ).

In this generalized Fourier analysis (called noncommutative harmonic analysis) each U(g, λ) is constructed to be irreducible in the sense that it is not possible to simultaneously block-diagonalize U(g, λ) by the same similarity transformation for all values of g in the group. Such a matrix function U(g, λ) is called an irreducible unitary representation (IUR). Completeness of a set of representations means that every (reducible) representation can be decomposed into a direct sum of the representations in the set.
Once a complete set of IURs is known for a unimodular Lie group, the Fourier transform of a function on that group can be defined as

f̂(λ) = ∫_G f(g) U(g^{-1}, λ) dg.

Here λ (which can be thought of as frequency) indexes the complete set of all IURs. An inversion formula can be used to recover the original function from all of the Fourier transforms as

f(g) = ∫_Ĝ trace[f̂(λ) U(g, λ)] d(λ).

The integration measure d(λ) on the dual (frequency) space Ĝ is very different from one group to another. In the case of a compact Lie group, Ĝ is discrete, and the resulting inversion formula is a series, much like the classical Fourier series for 2π-periodic functions. A convolution theorem follows from (26) as

(f_1 * f_2)^(λ) = f̂_2(λ) f̂_1(λ),

and so does the Parseval/Plancherel formula:

∫_G |f(g)|^2 dg = ∫_Ĝ ‖f̂(λ)‖^2 d(λ). (28)

Here ‖ · ‖ is the Hilbert-Schmidt (Frobenius) norm, and d(λ) is the dimension of the matrix U(g, λ). A useful definition is

u(X_i, λ) = d/dt U(exp(tX_i), λ)|_{t=0}.

Explicit expressions for U(g, λ) and u(X_i, λ) using the exponential map and corresponding parameterizations for the groups SO(3), SE(2) and SE(3) are given in [58,32]. As a consequence of these definitions, it can be shown that operational properties result which convert Lie derivatives into matrix operations in the dual space [24]. This is very useful in probability problems because a diffusion equation with drift of the form

∂ρ(g; t)/∂t = (1/2) Σ_{i,j} D_{ij} X̃_i^r X̃_j^r ρ − Σ_i d_i X̃_i^r ρ

(where D = [D_{ij}] is symmetric and positive semidefinite and given initial conditions ρ(g; 0) = δ(g)) can be solved in the dual space Ĝ, and then the inversion formula can convert it back. The solution to this sort of diffusion equation is important as a generalization of the concept of a Gaussian distribution. It has been studied extensively in the case of G = SE(3) in the context of polymer statistical mechanics and robotic manipulators [22,23,82]. As will be shown shortly, some of the classical information-theoretic inequalities that follow from the Gaussian distribution can be computed using the above analysis.
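In the special case of the finite abelian group Z_n, every IUR is a 1 × 1 character U(g, k) = e^{−2πi g k/n}, and the group Fourier transform reduces to the classical DFT. The following sketch (Z_n chosen purely as a toy example; the helpers are illustrative) checks the convolution theorem in this setting, where the matrix product of transforms reduces to a scalar product:

```python
import cmath

n = 8   # the cyclic group Z_8

def ft(f):
    # group Fourier transform on Z_n: the classical DFT
    return [sum(f[g] * cmath.exp(-2j * cmath.pi * g * k / n)
                for g in range(n)) for k in range(n)]

def convolve(f1, f2):
    # (f1 * f2)(g) = sum_h f1(h) f2(g - h mod n)
    return [sum(f1[h] * f2[(g - h) % n] for h in range(n)) for g in range(n)]

f1 = [0.05, 0.2, 0.3, 0.1, 0.05, 0.1, 0.15, 0.05]    # a pdf on Z_8
f2 = [0.1, 0.1, 0.4, 0.1, 0.05, 0.05, 0.1, 0.1]      # another pdf on Z_8
lhs = ft(convolve(f1, f2))
rhs = [a * b for a, b in zip(ft(f1), ft(f2))]         # transforms multiply
err = max(abs(a - b) for a, b in zip(lhs, rhs))
```

On a noncommutative group the same statement holds with matrix-valued transforms, with the order of the matrix factors mattering.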

Properties of Entropy and Relative Entropy on Groups
As defined earlier, the entropy of a pdf on a unimodular Lie group is

S(f) = −∫_G f(g) log f(g) dg.

For example, the entropy of a Gaussian distribution with covariance Σ(t) is S(ρ(g; t)) = log{(2πe)^{n/2} |Σ(t)|^{1/2}}, where log = log_e. The Kullback-Leibler distance between the pdfs f_1(g) and f_2(g) on a Lie group G naturally generalizes from its form in R^n as

D_KL(f_1 ‖ f_2) = ∫_G f_1(g) log(f_1(g)/f_2(g)) dg. (32)

As with the case of pdfs in R^n, D_KL(f_1 ‖ f_2) ≥ 0, with equality if and only if f_1(g) = f_2(g) at "almost all" values of g ∈ G (or, in probability terminology, "f_1(g) = f_2(g) almost surely"). That is, they must be the same up to a set of measure zero. Something that is not true in R^n but holds for a compact Lie group is that the limiting distribution under repeated convolution is the constant function equal to one. If f_2(g) = 1 is this limiting distribution, then D_KL(f_1 ‖ 1) = −S(f_1).

Convolutions Generally Increase Entropy
Theorem 3.1: Given pdfs f_1(g) and f_2(g) on the unimodular Lie group G,

S(f_1 * f_2) ≥ max{S(f_1), S(f_2)}.

Proof: Denote the result of an n-fold convolution on G as f_{1,n} ≐ f_1 * f_2 * ⋯ * f_n. Recall that a single pairwise convolution is computed as

(f_1 * f_2)(g) = ∫_G f_1(h) f_2(h^{-1} ∘ g) dh.

The n-fold convolution can be computed by performing a series of pairwise convolutions and stringing them together using the associative law. Convolution of functions on the group inherits associativity from the group law, which is reflected in the notation f_{1,n} = f_{1,n−1} * f_n. Johnson and Suhov [40,41] proved the following result for compact Lie groups:

D_KL(f_{1,n} ‖ 1) ≤ D_KL(f_{1,n−1} ‖ 1). (34)

A noncompact group cannot have f(g) = 1 as a limiting distribution, and so it does not make sense in this case to use the notation D_KL(f_{1,n} ‖ 1). Nevertheless, essentially the same proof that gives (34) can be used in the more general case of not-necessarily-compact unimodular Lie groups to show that entropy must increase as a result of convolution. This can be observed by first expanding out S(f_{1,n}) as an iterated integral, then reversing the order of integration (i.e., using Fubini's Theorem), and finally making the change of variables k = g ∘ h^{-1} together with the invariance of integration under shifts. Next, observe that by Jensen's inequality applied to the convex function x log x, the entropy of a mixture of shifted copies of f_{1,n−1} is no less than S(f_{1,n−1}). Since no direct comparison between f_{1,n} and the uniform distribution is made, Johnson and Suhov's proof of (34) that has been adapted above yields S(f_{1,n−1} * f_n) ≥ S(f_{1,n−1}).
Essentially the same proof can be used to show that S(f_1 * f_2) ≥ S(f_2). In other words, convolution in either order increases entropy.

Entropy Inequalities from Jensen's Inequality
Jensen's inequality is a fundamental tool that is often used in deriving information-theoretic inequalities, as well as inequalities in the field of convex geometry. In the context of Lie groups, Jensen's inequality can be written as

Φ(∫_G φ(g) ρ(g) dg) ≤ ∫_G Φ(φ(g)) ρ(g) dg, (40)

where Φ : R_{≥0} → R is a convex function on the half-infinite line, ρ(g) is a pdf, and φ(g) is another nonnegative measurable function on G.
Two important examples of Φ(x) are Φ_1(x) = −log x and Φ_2(x) = +x log x. If G is compact (with volume normalized to unity), any constant function on G is measurable, and ρ(g) = 1 is a pdf. Letting ρ(g) = 1, φ(g) = f(g), and Φ(x) = Φ_2(x) in (40) then gives 0 ≤ −S(f) for a pdf f(g). In contrast, for any unimodular Lie group, letting ρ(g) = f(g), φ(g) = f^α(g), and Φ(x) = Φ_1(x) gives

S(f) ≥ −(1/α) log ∫_G f^{1+α}(g) dg. (41)

This leads to the following theorem.
Theorem 3.2: Let ‖f̂(λ)‖ denote the Frobenius norm and ‖f̂(λ)‖_2 denote the induced 2-norm of the Fourier transform of f(g), and define the dispersion measures D(f) and D_2(f) as the negative logarithms of the integrals over Ĝ of ‖f̂(λ)‖^2 and ‖f̂(λ)‖_2^2, respectively. Furthermore, denote the unit Heaviside step function on the real line as u(x) and let

B ≐ ∫_Ĝ u(‖f̂(λ)‖) d(λ).

For finite groups B = 1 for functions that have full spectrum, and for bandlimited expansions on other groups B is finite.
Proof: Substituting α = 1 into (41) and using the Plancherel formula (28) yields

S(f) ≥ −log ∫_G f^2(g) dg = −log ∫_Ĝ ‖f̂(λ)‖^2 d(λ) = D(f).

The fact that −log x is a decreasing function and ‖A‖_2 ≤ ‖A‖ for all A ∈ C^{n×n} gives the second inequality in (43). The convolution theorem together with the facts that both norms are submultiplicative, −log(x) is a decreasing function, and the log of the product is the sum of the logs gives the corresponding bound for D. An identical calculation follows for D_2. The statement in (45) follows from the Plancherel formula (28) and using Jensen's inequality (40) in the dual space Ĝ rather than on G. Recognizing that when B is finite ρ(λ) = u(‖f̂(λ)‖)/B becomes a probability measure on this dual space, the stated inequality follows. This completes the proof.
Properties of dispersion measures similar to D(f) and D_2(f) were studied in [35], but no connections to entropy were provided previously. By definition, bandlimited expansions have B finite. On the other hand, it is a classical result that for a finite group, Γ, the Plancherel formula holds with Ĝ consisting of α irreducible representations (see, for example, [24]), where α is the number of conjugacy classes of Γ and d_k is the dimension of f̂_k. And by Burnside's formula Σ_{k=1}^{α} d_k^2 = |Γ| it follows that B = 1 when all f̂_k ≠ 0.

The Entropy Produced by Convolution on a Finite Group is Bounded
Let Γ be a finite group with |Γ| elements {g_1, ..., g_{|Γ|}}, and let ρ_Γ(g_i) ≥ 0 with Σ_{i=1}^{|Γ|} ρ_Γ(g_i) = 1 define a probability density/distribution on Γ. In analogy with how convolution and entropy are defined on a Lie group, G, they can also be defined on a finite group, Γ, by using the Dirac delta function for G, denoted here as δ(g). If Γ < G (i.e., if Γ is a subgroup of G), then letting

f(g) = Σ_{i=1}^{|Γ|} ρ_Γ(g_i) δ(g_i^{-1} ∘ g)

can be used to define a pdf on G that is equivalent to a pdf on Γ, in the sense that the convolution of two such pdfs on G corresponds to the convolution of two pdfs on Γ:

(ρ_1 * ρ_2)(g_i) = Σ_{j=1}^{|Γ|} ρ_1(g_j) ρ_2(g_j^{-1} ∘ g_i).

Given a finite group, Γ, the discrete entropy is

S(ρ) = −Σ_{i=1}^{|Γ|} ρ(g_i) log ρ(g_i).

Unlike the case of differential/continuous entropy on a Lie group, 0 ≤ S(ρ).
The following theorem describes how the discrete entropy of pdfs on Γ behaves under convolution. Since only finite groups are addressed here, the superscript Γ on the discrete values ρ(g_i) is dropped.
Proof: The lower bound follows in the same way as in the proof of Theorem *.1, with summation in place of integration. The entropy of convolved distributions on a finite group can be bounded from above in the following way.
Since the convolution sum contains products of all pairs, and each product is nonnegative, it follows that ρ_1(g_k) ρ_2(g_k^{-1} • g_i) ≤ (ρ_1 * ρ_2)(g_i) for all k ∈ {1, ..., |Γ|}. Therefore, since log is a strictly increasing function, the corresponding logarithmic inequality follows. Since this is true for all values of k, we can bring the log term inside the summation sign and choose k = j. Then, multiplying by −1, using the properties of the log function, and rearranging the order of summation gives (50).

But summation of a function over a group is invariant under shifts; that is, Σ_{i=1}^{|Γ|} F(g_i) = Σ_{i=1}^{|Γ|} F(h • g_i) for any h ∈ Γ. Hence, replacing g_j^{-1} • g_i with g_i in the terms in parentheses in (50) gives (49).
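The finite-group definitions above are easy to exercise numerically. The following minimal sketch (standard library only; the helper names mul, inv, convolve, and rand_pdf are illustrative, not from the paper) builds random pdfs on the symmetric group S3, forms the convolution sum, and checks normalization together with the entropy bounds just discussed:

```python
import itertools, math, random

# Elements of S3 as permutation tuples; composition (p.q)(i) = p[q[i]].
S3 = list(itertools.permutations(range(3)))
def mul(p, q): return tuple(p[q[i]] for i in range(3))
def inv(p):    return tuple(sorted(range(3), key=lambda i: p[i]))

def convolve(r1, r2):
    # (r1 * r2)(g) = sum_h r1(h) r2(h^{-1} . g)
    return {g: sum(r1[h] * r2[mul(inv(h), g)] for h in S3) for g in S3}

def entropy(r):
    return -sum(p * math.log(p) for p in r.values() if p > 0)

random.seed(0)
def rand_pdf():
    w = [random.random() for _ in S3]
    return {g: x / sum(w) for g, x in zip(S3, w)}

r1, r2 = rand_pdf(), rand_pdf()
r12 = convolve(r1, r2)
assert abs(sum(r12.values()) - 1) < 1e-12                      # still a pdf
assert entropy(r12) >= max(entropy(r1), entropy(r2)) - 1e-12   # lower bound
assert entropy(r12) <= math.log(len(S3)) + 1e-12               # at most log|G|
```

The last two assertions reflect the two sides of the convolution bounds: entropy cannot decrease under convolution with another pdf, and discrete entropy never exceeds log |Γ|.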

Entropy and Decompositions
Aside from sustaining the concept of convolution, one of the fundamental ways in which groups resemble Euclidean space is the way they can be decomposed. In analogy with the way that the integral of a function of x ∈ R^n can be decomposed into iterated integrals over each coordinate, integrals over Lie groups can also be decomposed in natural ways. This has implications for inequalities involving the entropy of pdfs on Lie groups. Analogous expressions hold for finite groups, with volume replaced by the number of group elements.
As in the case of pdfs on Euclidean space, (51) follows from the fact that the Kullback-Leibler divergence in (32) has the property that D_KL(f ‖ f_1 f_2) ≥ 0.

Coset Decompositions
Given a subgroup H ≤ G and any element g ∈ G, the left coset gH is defined as gH = {g • h | h ∈ H}. Similarly, the right coset Hg is defined as Hg = {h • g | h ∈ H}. In the special case when g ∈ H, the corresponding left and right cosets are equal to H. More generally, for all g ∈ G, g ∈ gH, and g_1 H = g_2 H if and only if g_2^{-1} • g_1 ∈ H. Likewise, for right cosets, Hg_1 = Hg_2 if and only if g_1 • g_2^{-1} ∈ H. Any group is divided into disjoint left (right) cosets, and the statement "g_1 and g_2 are in the same left (right) coset" is an equivalence relation.
An important property of gH and Hg is that they have the same number of elements as H. Since the group is divided into disjoint cosets, each with the same number of elements, it follows that the number of cosets must divide the number of elements in the group without remainder. The set of all left (or right) cosets is called the left (or right) coset space, denoted G/H (or H\G). For finite groups one writes |G/H| = |H\G| = |G|/|H|. This result is called Lagrange's theorem. Similar expressions can be written for Lie groups and Lie subgroups after the appropriate concept of volume is introduced. We will use the following well-known decomposition of the integral [37]:

∫_G f(g) dg = ∫_{G/H} ∫_H f(g • h) dh d(gH),

where g ∈ gH is taken to be the coset representative. In the special case when f(g) is a left-coset function (i.e., a function that is constant on left cosets), (52) reduces to an integral over G/H alone, where it is assumed that d(h) is normalized so that Vol(H) = ∫_H dh = 1, and where the value of the function f(g) on each coset representative is the same as that which results from averaging over the coset gH.
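For a finite group these coset facts can be verified directly by enumeration. The sketch below (standard library only; H is an illustrative choice of subgroup) lists the left cosets of a two-element subgroup of S3 and checks Lagrange's theorem and the disjoint-partition property:

```python
import itertools

S3 = list(itertools.permutations(range(3)))
def mul(p, q): return tuple(p[q[i]] for i in range(3))

H = [(0, 1, 2), (1, 0, 2)]   # subgroup {identity, swap of the first two symbols}
left_cosets = {frozenset(mul(g, h) for h in H) for g in S3}

assert len(left_cosets) == len(S3) // len(H)       # Lagrange: |G/H| = |G|/|H|
assert all(len(c) == len(H) for c in left_cosets)  # each coset has |H| elements
assert set().union(*left_cosets) == set(S3)        # cosets are disjoint and cover G
assert sum(len(c) for c in left_cosets) == len(S3)
```

Here |S3| = 6 and |H| = 2, so exactly three left cosets of size two partition the group.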

Theorem 3.4:
The entropy of a pdf on a unimodular Lie group is no greater than the sum of the marginal entropies on a subgroup and the corresponding coset space. Proof: For the moment it will be convenient to denote a function on G as f_G(g) (rather than f(g)) and to write it in terms of a function f_{G/H×H} on G/H × H. That is, a function on G evaluated at g can be equally described by a function on a coset together with a rule for extracting a specific coset representative, which in this case is the identity. This means that given gH, g is recovered from g ∈ gH as g • e^{-1} = g. By enforcing the constraint in the definition of f_{G/H×H}, g can be recovered from g • h ∈ gH as g • h • h^{-1} = g. Using this construction, we can define marginal densities on H and on G/H. For example, if G = SE(n) is a Euclidean motion group and H = SO(n) is the subgroup of pure rotations in n-dimensional Euclidean space, then G/H ≅ R^n. It follows from the classical information inequality for the entropy of marginal distributions, obtained by letting F(g) = −f(g) log f(g) and using the nonnegativity of the Kullback-Leibler divergence together with the shift-invariance of integrals on unimodular Lie groups, that (53) holds.
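A finite-group analogue of Theorem 3.4 can be checked numerically. In the sketch below (standard library only; the choice G = Z6 with H = {0, 3} is illustrative), the bijection between G and G/H × H sends each element to (coset representative, subgroup element), and the theorem reduces to subadditivity of discrete entropy over the two marginals:

```python
import math, random

random.seed(1)
n, H = 6, [0, 3]            # G = Z6 (group product: addition mod 6), subgroup H
reps = [0, 1, 2]            # coset representatives: cosets {0,3}, {1,4}, {2,5}

w = [random.random() for _ in range(n)]
f = [x / sum(w) for x in w]  # an arbitrary pdf on Z6

def S(p):
    return -sum(x * math.log(x) for x in p if x > 0)

f_coset = [sum(f[(r + h) % n] for h in H) for r in reps]   # marginal over G/H
f_sub   = [sum(f[(r + h) % n] for r in reps) for h in H]   # marginal over H

# Finite-group analogue of (53): S(f) <= S(marginal on G/H) + S(marginal on H)
assert S(f) <= S(f_coset) + S(f_sub) + 1e-12
```

The assertion holds for every pdf f because (r, h) ↦ r + h is a bijection from reps × H onto Z6, so S(f) is a joint entropy bounded by the sum of its marginal entropies.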

Double Coset Decompositions
Let H < G and K < G. Then for any g ∈ G, the set HgK = {h • g • k | h ∈ H, k ∈ K} is called the double coset of H and K, and any g′ ∈ HgK (including g′ = g) is called a representative of the double coset. Though a double coset representative can often be described by two or more different pairs (h_1, k_1) and (h_2, k_2) such that g′ = h_1 • g • k_1 = h_2 • g • k_2, we only count g′ once in HgK. Hence |HgK| ≤ |G|, and in general |HgK| ≠ |H| · |K|. The set of all double cosets of H and K is denoted H\G/K, so we have the hierarchy g ∈ HgK ∈ H\G/K. It can be shown that membership in a double coset is an equivalence relation. That is, G is partitioned into disjoint double cosets, and for H < G and K < G, either Hg_1K ∩ Hg_2K = ∅ or Hg_1K = Hg_2K.
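The double-coset facts above can be seen concretely in S3. The following sketch (standard library only; the subgroups H and K are illustrative two-element choices) enumerates H\G/K and shows that the double cosets partition G but, unlike ordinary cosets, need not all have the same size:

```python
import itertools

S3 = list(itertools.permutations(range(3)))
def mul(p, q): return tuple(p[q[i]] for i in range(3))

H = [(0, 1, 2), (1, 0, 2)]   # {e, transposition of symbols 0,1}
K = [(0, 1, 2), (0, 2, 1)]   # {e, transposition of symbols 1,2}

double_cosets = {frozenset(mul(mul(h, g), k) for h in H for k in K) for g in S3}

# Double cosets partition G ...
assert set().union(*double_cosets) == set(S3)
assert sum(len(c) for c in double_cosets) == len(S3)
# ... but need not share a common size, and |HgK| need not equal |H|*|K|:
assert sorted(len(c) for c in double_cosets) == [2, 4]
```

Here one double coset has four elements (equal to |H|·|K|) while the other has only two, because some products h • g • k coincide.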
Another interesting observation (when certain conditions are met) is the decomposition of the integral of a function on a group in terms of two subgroups and a double-coset space. A particular example is the integral over SO(3), which can be written in terms of Euler angles. Theorem 3.5: The entropy of a pdf on a group is no greater than the sum of marginal entropies over any two subgroups and the corresponding double-coset space. Proof: Consistent with (55), it is possible to decompose a function f_G(g) accordingly; then letting F(g) = −f(g) log f(g) and using the nonnegativity of the Kullback-Leibler divergence together with the shift-invariance of integrals on unimodular Lie groups gives (56).

Nested Coset Decompositions
Theorem 3.6: The entropy of a pdf is no greater than the sum of entropies of its marginals over coset spaces defined by nested subgroups. Proof: Given a subgroup K of H, which is itself a subgroup of G (that is, K < H < G), it is possible to write the integral ∫_{G/H} F(gH) d(gH) in terms of nested coset spaces [37]. Again letting F(g) = −f_G(g) log f_G(g), it follows from the properties of the Kullback-Leibler divergence and the unimodularity of G that (57) holds.

Class Functions and Normal Subgroups
In analogy with the way a coset is defined, the conjugate of a subgroup H for a given g ∈ G is defined as gHg^{-1} = {g • h • g^{-1} | h ∈ H}. Recall that a subgroup N ≤ G is called normal if and only if gNg^{-1} ⊆ N for all g ∈ G. This is equivalent to the condition g^{-1}Ng ⊆ N, and so we also write gNg^{-1} = N and gN = Ng for all g ∈ G.
A function, χ(g), that is constant on each conjugacy class has the property that χ(h • g • h^{-1}) = χ(g) for any g, h ∈ G. Though convolution of functions on a noncommutative group is generally noncommutative, the special nature of class functions means that (χ * f)(g) = (f * χ)(g), where the change of variables k = g • h^{-1} is used together with the unimodularity of G.
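The commutation property of class functions can be checked by direct computation on S3, whose conjugacy classes are determined by cycle type (equivalently, by the number of fixed points). The sketch below (standard library only; the class weights are an arbitrary illustrative choice summing to 1) confirms that a class-function pdf commutes under convolution with an arbitrary pdf:

```python
import itertools, random

random.seed(2)
S3 = list(itertools.permutations(range(3)))
def mul(p, q): return tuple(p[q[i]] for i in range(3))
def inv(p):    return tuple(sorted(range(3), key=lambda i: p[i]))

def convolve(r1, r2):
    return {g: sum(r1[h] * r2[mul(inv(h), g)] for h in S3) for g in S3}

# A class function is constant on conjugacy classes; on S3 these are the
# identity (3 fixed points), transpositions (1), and 3-cycles (0).
def fixed(p): return sum(p[i] == i for i in range(3))
weights = {3: 0.25, 1: 0.15, 0: 0.15}     # 0.25 + 3*0.15 + 2*0.15 = 1
chi = {g: weights[fixed(g)] for g in S3}
assert abs(sum(chi.values()) - 1) < 1e-12

w = [random.random() for _ in S3]
rho = {g: x / sum(w) for g, x in zip(S3, w)}   # an arbitrary pdf on S3

left, right = convolve(chi, rho), convolve(rho, chi)
assert all(abs(left[g] - right[g]) < 1e-12 for g in S3)   # chi*rho = rho*chi
```

By contrast, two generic (non-class) pdfs on S3 will generally fail this test, reflecting the noncommutativity of the group.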
Theorem 3.7: For arbitrary pdfs on a unimodular Lie group G and arbitrary g_1, g_2 ∈ G, entropy satisfies the following equalities. Proof: Each equality is proven by changing variables and using the unimodularity property in (21).
Proof: Statement (a) follows from the fact that if either ρ_1 or ρ_2 is a class function, then convolutions commute. Statement (b) follows from the first equality in (59) and the definition of a symmetric function.
Theorem 3.9: Given class functions χ_1(g) and χ_2(g) that are pdfs, the stated equalities hold for general g_1, g_2 ∈ G. Proof: Here the first and final equalities will be proven; the middle one follows in the same way.

Fisher Information and Diffusions on Lie Groups
The natural extension of the Fisher information matrix to the case when f(g, θ) is a parametric distribution on a Lie group replaces partial derivatives with Lie derivatives. In the case when θ parameterizes G as g(θ) = exp(Σ_i θ_i X_i) and f(g, θ) = f(g • exp(Σ_i θ_i X_i)), this yields the right Fisher information matrix; in a similar way, the left version can be defined. Theorem 4.1: The matrices (62) and (63) have the properties (64) and (65), where the inversion operator is (I f)(g) = f(g^{-1}).
Proof: The operators X̃_i^l and R(h) commute, and likewise X̃_i^r and L(h) commute. This, together with the invariance of integration under shifts, proves (64). From the definitions of X̃_i^l and X̃_i^r in (22), the relationship between the left and right derivatives of f and its inversion follows.
Using the invariance of integration under shifts then gives (65). As a special case, when f (g) is a symmetric function, the left and right Fisher information matrices will be the same.
Note that the entries of the Fisher matrices F_{ij}^r(f) and F_{ij}^l(f) implicitly depend on the choice of orthonormal Lie algebra basis {X_i}, and so it would be more descriptive to use the notation F_{ij}^r(f, X) and F_{ij}^l(f, X). If a different orthonormal basis {Y_i} is used, such that X_i = Σ_k a_{ik} Y_k, then the orthonormality of both {X_i} and {Y_i} forces A = [a_{ij}] to be an orthogonal matrix. Furthermore, by the linearity of the Lie derivative, F_{ij}^r transforms accordingly under this change of basis, and the same holds for F_{ij}^l. Summarizing these results in matrix form gives (66).

Fisher Information and Convolution on Groups
The decrease of Fisher information as a result of convolution can be studied in much the same way as for pdfs on Euclidean space. Two approaches are taken here. First, a straightforward application of the Cauchy-Bunyakovsky-Schwarz (CBS) inequality is used together with the bi-invariance of the integral over a unimodular Lie group to produce a bound on the Fisher information of the convolution of two probability densities. Then, a tighter bound is obtained using the concept of conditional expectation in the special case when the pdfs commute under convolution.

Theorem 4.2:
The following inequalities hold for the diagonal entries of the left and right Fisher information matrices. Proof: The CBS inequality holds for groups. If a(g) ≥ 0 for all values of g, then it is possible to define j(g) = [a(g)]^{1/2} and k(g) = [a(g)]^{1/2} b(g), and since j(g)k(g) = a(g)b(g), the CBS inequality bounds the square of the integral of a(g)b(g). Using this version of the CBS inequality, and letting a(g) = f_1(h) f_2(h^{-1} • g), essentially the same manipulations as in [16] can be used, with the roles of f_1 and f_2 interchanged due to the fact that in general (f_1 * f_2)(g) ≠ (f_2 * f_1)(g) for convolution on a Lie group. Since for a unimodular Lie group it is possible to perform changes of variables and inversion of the variable of integration without affecting the value of an integral, the convolution can be written in several equivalent ways. It then follows, using (70) and the bi-invariance of integration, that (67) holds.
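On the abelian group SO(2) (the circle), left and right Fisher information coincide and the convolution inequality reads F(f_1 * f_2) ≤ min(F(f_1), F(f_2)). The following numerical sketch (assuming only NumPy; the von Mises-type densities are illustrative) checks this with spectral differentiation and FFT-based circular convolution:

```python
import numpy as np

N = 512
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
k = np.fft.fftfreq(N, d=1.0 / N)       # integer frequencies on the circle
dtheta = 2 * np.pi / N

def normalize(f):
    return f / (f.sum() * dtheta)

def fisher(f):                          # F(f) = int (f')^2 / f dtheta
    df = np.real(np.fft.ifft(1j * k * np.fft.fft(f)))
    return np.sum(df**2 / f) * dtheta

def convolve(f1, f2):                   # (f1*f2)(g) = int f1(h) f2(g-h) dh
    return np.real(np.fft.ifft(np.fft.fft(f1) * np.fft.fft(f2))) * dtheta

f1 = normalize(np.exp(np.cos(theta)))             # von Mises-type pdfs
f2 = normalize(np.exp(2.0 * np.cos(theta - 1.0)))
f12 = convolve(f1, f2)

assert abs(f12.sum() * dtheta - 1) < 1e-10        # convolution preserves mass
assert fisher(f12) <= min(fisher(f1), fisher(f2)) + 1e-8
```

The inequality is tight in spirit: the score of the convolution is a conditional expectation of the score of either factor, so Jensen's inequality forces the Fisher information to drop below both inputs.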

A Tighter Bound Using Conditional Expectation for Commuting PDFs
In this subsection a tighter inequality is derived using conditional expectation.

Theorem 4.3:
The following inequality holds for the right and left Fisher information matrices, where i = 1, 2 and P is an arbitrary symmetric positive definite matrix with the same dimensions as F. Proof: By the change of variables k = h^{-1} • g, the score of the convolution can be written as a conditional expectation of the score of one factor, and therefore Jensen's inequality bounds the Fisher information of the convolution. An analogous argument using f_{12}(h, g) = ρ_1(g • h^{-1}) ρ_2(h) and f_2(g) = (ρ_1 * ρ_2)(g) shows the corresponding bound, and F_{ii}^l(f_2) ≤ F_{ii}^l(ρ_1). The above results can be written concisely by introducing an arbitrary positive definite diagonal matrix Λ. If this is true in one basis, then using (66) the more general statement in (73) must follow in another basis, where P = P^T > 0. Since the initial choice of basis is arbitrary, (73) must hold in every basis for an arbitrary positive definite matrix P. This completes the proof. In some instances, even though the group is not commutative, the functions ρ_1 and ρ_2 will commute under convolution. For example, if ρ(g • h) = ρ(h • g) for all h, g ∈ G, then (ρ * ρ_i)(g) = (ρ_i * ρ)(g) for any reasonable choice of ρ_i(g). Or if ρ_2 = ρ_1 * ρ_1 * · · · * ρ_1, it will clearly be the case that ρ_1 * ρ_2 = ρ_2 * ρ_1. If, for whatever reason, ρ_1 * ρ_2 = ρ_2 * ρ_1, then (73) can be rewritten in the form (77) for any P = P^T > 0, and likewise for F^l.
Proof: Returning to (74) and (75), in the case when ρ_1 * ρ_2 = ρ_2 * ρ_1, both conditional-expectation representations hold simultaneously. Since the following calculation works the same way for both the 'l' and 'r' cases, consider only the 'r' case for now. Multiplying the first equality in (78) by 1 − β and the second by β and adding them together gives a convex combination valid for arbitrary β ∈ [0, 1]. Squaring both sides, taking the (unconditional) expectation, and using Jensen's inequality yields (80). The value of β ∈ [0, 1] that gives the tightest bound results in the stated inequality. Alternatively, if before computing the optimal β we first multiply both sides of (80) by λ_i and sum over i, the bound takes matrix form. Again, since the basis is arbitrary, Λ can be replaced with P. The optimal value of β then gives (77).

A Special Case: SO(3)
Consider the group of 3×3 orthogonal matrices with determinant +1. Let X̃^r = [X̃_1^r, X̃_2^r, X̃_3^r]^T and X̃^l = [X̃_1^l, X̃_2^l, X̃_3^l]^T. These two gradient vectors are related to each other by an adjoint matrix, which for this group is a rotation matrix [24]. Therefore, in the case when G = SO(3), the inequalities in (76) will hold for pdfs on SO(3) regardless of whether or not the functions commute under convolution, but restricted to the condition P = I.

Generalizing the de Bruijn Identity to Lie Groups
This section generalizes the de Bruijn identity, in which entropy rates are related to Fisher information.
The solution subject to the initial condition ρ(g, 0) = α(g) is simply ρ(g, t) = (α * f_{D,h,t})(g). This follows because all derivatives "pass through" the convolution integral for ρ(g, t) and act on f_{D,h,t}(g). Taking the time derivative of S(ρ(g, t)) and using (83), the partial derivative with respect to time can be replaced with Lie derivatives. Applying this with f_1 = log ρ and f_2 = ρ then gives the result; the implication is the Lie-group analogue of the de Bruijn identity.
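On the circle group SO(2) the diffusion reduces to the classical heat equation, and the de Bruijn identity takes the scalar form dS/dt = (D/2) F(ρ), with F(ρ) = ∫ (ρ')²/ρ dθ. The following sketch (assuming only NumPy; the initial density and the coefficient D are illustrative) solves the heat equation spectrally and verifies the identity by a central-difference estimate of dS/dt:

```python
import numpy as np

N, D = 512, 1.0
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
k = np.fft.fftfreq(N, d=1.0 / N)              # integer frequencies

rho0 = 1.0 + 0.8 * np.cos(theta) + 0.3 * np.sin(2 * theta)
rho0 /= rho0.sum() * (2 * np.pi / N)          # normalize to a pdf on the circle

def heat(rho, t):                             # heat semigroup exp((D/2) t Laplacian)
    return np.real(np.fft.ifft(np.fft.fft(rho) * np.exp(-0.5 * D * k**2 * t)))

def entropy(rho):
    return -np.sum(rho * np.log(rho)) * (2 * np.pi / N)

def fisher(rho):                              # spectral derivative, then quadrature
    drho = np.real(np.fft.ifft(1j * k * np.fft.fft(rho)))
    return np.sum(drho**2 / rho) * (2 * np.pi / N)

t, dt = 0.05, 1e-5
dS = (entropy(heat(rho0, t + dt)) - entropy(heat(rho0, t - dt))) / (2 * dt)
assert abs(dS - 0.5 * D * fisher(heat(rho0, t))) < 1e-4   # de Bruijn on SO(2)
```

The agreement is to quadrature precision because the evolved density stays smooth and strictly positive, so both sides of the identity are computed on the same grid without numerical difficulty.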

Information-Theoretic Inequalities from Log-Sobolev Inequalities
In this section information-theoretic identities are derived from Log-Sobolev inequalities. Subsection 6.1 provides a brief review of Log-Sobolev inequalities. Subsection 6.2 then uses these to write information-theoretic inequalities.
In (87) the scalar function c_G(t) depends on the particular group. For G = (R^n, +) we have c_{R^n}(t) = t, and likewise c_{SO(n)}(t) = t.
In analogy with the way that (85) evolved from (86), a descendant of (87) for noncompact unimodular Lie groups is [4,7,8]

∫_G |ψ(g)|^2 log |ψ(g)|^2 dg ≤ (n/2) log( (2 C_G)/(π e n) ∫_G ‖X̃ψ‖^2 dg ).

The only difference is that, to the author's knowledge, the sharp constant C_G in this expression is not known for most Lie groups. The information-theoretic interpretation of these inequalities is provided in the following subsection.

Information-Theoretic Inequalities
For our purposes the form in (85) will be most useful. It is interesting to note in passing that Beckner has extended this inequality to the case where the domain, rather than being R n , is the hyperbolic space H 2 ∼ = SL(2, R)/SO(2) and the Heisenberg groups H(n), including H(1) [7,8]. Our goal here is to provide an information-theoretic interpretation of the inequalities from the previous section.
Proof: We begin by proving (89) for G = (R^n, +). Making the substitution f(x) = |ψ(x)|^2 into (85) and requiring that f(x) be a pdf gives (90). Here S(f) is the Boltzmann-Shannon entropy of f and F is the Fisher information matrix. As is customary in information theory, the entropy power can be defined as N(f) in (89) with C_G = 1. Then the log-Sobolev inequality in the form (90) is written as (89).
For the more general case, starting with (91) and letting f(g) = |ψ(g)|^2 gives the analogous statement; the rest is the same as for the case of R^n. Starting with Gross's original form of the log-Sobolev inequality involving the heat kernel, the following information-theoretic inequality results. Theorem 6.2: The Kullback-Leibler divergence and Fisher-information distance of an arbitrary pdf and the heat kernel are related as stated, where in general, given f_1(g) and f_2(g), D_{FI}(f_1 ‖ f_2) denotes the Fisher-information distance between them. One of the fundamental inequalities of information theory is the entropy power inequality N(f_1 * f_2) ≥ N(f_1) + N(f_2) for any pdfs f_1 and f_2 on R^n, with N(f_i) defined as in (89) for C_{R^n} = 1. This was first stated by Shannon [63] together with a verification of the necessary conditions for it to be true, and was followed up with proofs of sufficiency by Stam and Blachman [12,66]. Without going into too many details, the key technical points of their proofs require two properties. First, f_1 * ρ_{t_1} * f_2 * ρ_{t_2} = f_1 * f_2 * ρ_{t_1} * ρ_{t_2} (which is not a problem in R^n since convolution is commutative). Second, they use a scaling argument requiring that any pdf f(x) scaled as f_s(x) = s^n f(s · x) becomes the Dirac delta function as s → ∞. That is not to say that these two properties are essential to proving the entropy power inequality, but only that they are the properties used in the most familiar proofs. However, there is somewhat of a conundrum: for compact Lie groups, the heat kernel ρ_t(g) is a class function, and therefore satisfies the first condition, but there is no natural way to rescale on a compact Lie group (not even on the circle group, SO(2)). In fact, it is easy to see that on compact Lie groups the entropy power inequality does not hold. For example, the limiting distribution on a compact Lie group is ρ_∞ = 1, with entropy S(ρ_∞) = 0 and entropy power N(ρ_∞) = 1.
Since ρ_∞ * f = ρ_∞ for any pdf f, we get N(ρ_∞ * f) = 1 < 1 + N(f), since N(f) > 0 always; hence the entropy power inequality fails.
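This counterexample can be made concrete on the circle. In the sketch below (assuming only NumPy), pdfs are taken relative to the normalized Haar measure dg = dθ/(2π), so the uniform limiting distribution is ρ_∞ = 1 with S(ρ_∞) = 0, and the entropy power is taken as N(f) = exp(2S(f)), consistent with N(ρ_∞) = 1 and C_G = 1:

```python
import numpy as np

N = 360
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)

rho_inf = np.ones(N)                 # uniform pdf wrt normalized Haar measure
f = 1.0 + 0.9 * np.cos(theta)        # a non-uniform pdf (integrates to 1 vs dg)

def entropy(p):                      # S(p) = -int p log p dg  (dg = dtheta/2pi)
    return -np.mean(p * np.log(p))

def power(p):                        # entropy power with C_G = 1, n = 1
    return np.exp(2 * entropy(p))

# Convolving with the uniform distribution returns the uniform distribution:
conv = np.real(np.fft.ifft(np.fft.fft(rho_inf) * np.fft.fft(f))) / N
assert np.allclose(conv, rho_inf)
# The Euclidean EPI would demand N(rho_inf * f) >= N(rho_inf) + N(f); it fails:
assert power(conv) < power(rho_inf) + power(f)
```

Since N(f) > 0 for every pdf f, the left side equals 1 while the right side strictly exceeds 1, exhibiting the failure of the entropy power inequality on a compact group.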
On the other hand, it is possible for some groups to introduce a concept of scaling. For example, it is possible to do this in the Heisenberg group, roughly speaking, because all coordinate directions extend to infinity. Groups that admit a scaling property have been studied extensively [31]. However, whether the heat equations on such groups yield solutions that are class functions then becomes an issue. Regardless, for the groups of primary interest in engineering applications, i.e., the rotation and rigid-body motion groups, the possibilities for an entropy power inequality appear to be pretty slim.

Conclusions
By collecting and reinterpreting results relating to the study of diffusion processes, harmonic analysis, and log-Sobolev inequalities on Lie groups, and merging these results with new definitions of covariance and Fisher information, many inequalities of information theory were extended here to the context of probability densities on unimodular Lie groups. In addition, the natural decomposition of groups into cosets and double cosets, and the nesting of subgroups, provide some inequalities that result from the Kullback-Leibler divergence of probability densities on Lie groups. Some special inequalities related to finite groups were also provided.
While the emphasis of this paper was on the discovery of fundamental inequalities and the introduction of Lie group concepts to the information theory audience, the motivation for this study originated with applications in robotics and other areas. Though these applications were not explored here, references to the literature pertaining to robot motion and image reconstruction were provided.