Information distance estimation between mixtures of multivariate Gaussians

There are efficient software programs for extracting from image sequences certain mixtures of distributions, such as multivariate Gaussians, to represent the important features needed for accurate document retrieval from databases. This note describes how information geometric methods can be used to measure distances between distributions in mixtures of multivariate Gaussians. There is no general analytic solution for the information geodesic distance between two k-variate Gaussians, but for many purposes the absolute information distance is not essential and comparative values suffice for proximity testing. For two mixtures of multivariate Gaussians we must resort to approximations in order to incorporate the weightings. In practice, the relation between a reasonable approximation and the true geodesic distance is likely to be monotonic, which is adequate for many applications. Here we compare several choices for incorporating the weightings in distance estimation and provide illustrative results from simulations of differently weighted mixtures of multivariate Gaussians.


1 Introduction
A recent review of techniques for extracting local features for automatic object recognition in images has been given by Cao et al. [4]; implicit in such techniques is the use of computer vision to elicit features that are invariant under image transformations for object classification. In a number of important areas of application the representation of local features (think of smiley, neutral or sad faces in video sequences) can be achieved through mixtures of multivariate Gaussian distributions. The Riemannian manifold of the family of k-variate Gaussians, for a given k, is well understood through information geometric study using the Fisher metric. For an introduction to information geometry and a range of applications see [1].
Here we consider a mixture distribution consisting of a linear combination of k-variate Gaussians with an increasing sequence of k = 2, 3, ..., N variables:

\[ f_k(x) = \frac{1}{\sqrt{(2\pi)^k\,|\Sigma_k|}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}, \qquad x \in \mathbb{R}^k, \qquad (1) \]

where \(\mu_k \in \mathbb{R}^k\) is the k-vector of means and \(\Sigma_k \in \mathbb{R}^{(k^2+k)/2}\) is the positive definite symmetric (k × k) covariance matrix with components \((\sigma_{ij})\), i ≤ j = 1, 2, ..., k. The standard basis for the space of covariance matrices consists of the symmetric matrices \(E_{ij}\), i ≤ j, having 1 in the (i, j) and (j, i) entries and 0 elsewhere. We presume that the parameters and relative weights \(w_k\) of these component probability density functions (1) have been obtained empirically, giving a mixture density

\[ f = \sum_{k=2}^{N} w_k\, f_k, \qquad w_k \ge 0, \qquad \sum_{k=2}^{N} w_k = 1. \qquad (2) \]

We wish to be able to estimate the information distance \(D(f_A, f_B)\) between two such mixtures \(f_A\) and \(f_B\). What we have analytically are natural norms, on the space of means and on the space of covariances, giving the information distance between two multivariate Gaussians \(f_A = (k, \mu_A, \Sigma_A)\), \(f_B = (k, \mu_B, \Sigma_B)\) of the same number k of variables in two particular cases.

Case 1: \(\Sigma_A = \Sigma_B = \Sigma\), \(\mu_A \neq \mu_B\). Here we have the positive definite symmetric quadratic form \(\Sigma\) to give a norm on the difference vector of means:

\[ D_\mu(f_A, f_B) = \sqrt{(\mu_A - \mu_B)^T\, \Sigma^{-1}\, (\mu_A - \mu_B)}. \qquad (3) \]

Case 2: \(\mu_A = \mu_B\), \(\Sigma_A \neq \Sigma_B\). Here we need a positive definite symmetric matrix constructed from \(\Sigma_A\) and \(\Sigma_B\) to give a norm on the space of differences between covariances; the information metric is given by Atkinson and Mitchell [2] from a result attributed to S.T. Jensen, using \(S_{AB} = \Sigma_A^{-1/2}\, \Sigma_B\, \Sigma_A^{-1/2}\) with eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_k\):

\[ D_\Sigma(f_A, f_B) = \sqrt{\tfrac{1}{2} \sum_{j=1}^{k} \log^2 \lambda_j}. \qquad (4) \]

In principle, (4) yields all of the geodesic distances since the information metric is invariant under affine transformations of the mean ([2], Appendix 1); see also the article of P.S. Eriksen [3].
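The two special cases (3) and (4) are simple to evaluate numerically. The following is a minimal sketch, assuming NumPy; the names d_mu, d_sigma and _inv_sqrt are illustrative and not taken from the paper.

```python
import numpy as np

def _inv_sqrt(sigma):
    """Inverse square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(sigma)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def d_mu(mu_a, mu_b, sigma):
    """Distance (3): Gaussians with common covariance `sigma` and differing means."""
    d = np.asarray(mu_a) - np.asarray(mu_b)
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

def d_sigma(sigma_a, sigma_b):
    """Distance (4): common-mean Gaussians with differing covariances (Atkinson-Mitchell)."""
    r = _inv_sqrt(np.asarray(sigma_a))
    lam = np.linalg.eigvalsh(r @ np.asarray(sigma_b) @ r)  # eigenvalues of S_AB
    return float(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))
```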
Also, we know analytically the Kullback-Leibler divergence, or relative entropy, between two multivariate Gaussians \(f_A = (k, \mu_A, \Sigma_A)\), \(f_B = (k, \mu_B, \Sigma_B)\) with the same number k of variables, its square root giving a separation measurement [5]:

\[ KL(f_A \,\|\, f_B) = \tfrac{1}{2}\Big( \log\frac{|\Sigma_B|}{|\Sigma_A|} + \mathrm{tr}\big(\Sigma_B^{-1}\Sigma_A\big) + (\mu_B - \mu_A)^T \Sigma_B^{-1} (\mu_B - \mu_A) - k \Big). \qquad (5) \]

This is not symmetric, so to obtain a distance we could take the average KL-distance in both directions:

\[ DKL(f_A, f_B) = \tfrac{1}{2}\Big( \sqrt{KL(f_A \,\|\, f_B)} + \sqrt{KL(f_B \,\|\, f_A)} \Big). \qquad (6) \]

The Kullback-Leibler distance tends to the information distance as two distributions become closer together; conversely, it becomes less accurate as they move apart. Explicitly, for the covariance part (\(\mu_A = \mu_B\)) we have

\[ DKL_\Sigma(f_A, f_B) = \tfrac{1}{2}\Bigg( \sqrt{\tfrac{1}{2}\Big(\log\frac{|\Sigma_B|}{|\Sigma_A|} + \mathrm{tr}\big(\Sigma_B^{-1}\Sigma_A\big) - k\Big)} + \sqrt{\tfrac{1}{2}\Big(\log\frac{|\Sigma_A|}{|\Sigma_B|} + \mathrm{tr}\big(\Sigma_A^{-1}\Sigma_B\big) - k\Big)} \Bigg). \qquad (7) \]

The true geodesic distance (4) is plotted against \(DKL_\Sigma(f_A, f_B)\) from (7) in Figure 1 for 600 bivariate Gaussian covariance matrices.
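A corresponding sketch of (5) and (6), again assuming NumPy; the covariance-only separation (7) is obtained by passing equal means. Names are illustrative.

```python
import numpy as np

def kl_gauss(mu_a, sigma_a, mu_b, sigma_b):
    """Kullback-Leibler divergence (5) of f_A from f_B for k-variate Gaussians."""
    k = len(mu_a)
    d = np.asarray(mu_b) - np.asarray(mu_a)
    sb_inv = np.linalg.inv(sigma_b)
    (_, logdet_a), (_, logdet_b) = np.linalg.slogdet(sigma_a), np.linalg.slogdet(sigma_b)
    return 0.5 * (logdet_b - logdet_a + np.trace(sb_inv @ sigma_a) + d @ sb_inv @ d - k)

def dkl(mu_a, sigma_a, mu_b, sigma_b):
    """Symmetrised separation (6): average of the square-rooted divergences."""
    return 0.5 * (np.sqrt(kl_gauss(mu_a, sigma_a, mu_b, sigma_b))
                  + np.sqrt(kl_gauss(mu_b, sigma_b, mu_a, sigma_a)))
```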
1.1 Example: Bivariate Gaussians
The analytic expression for the distance (4) between two covariance matrices is cumbersome, so a numerical example illustrates it below.
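For instance, the following self-contained snippet evaluates (4) for a bivariate pair; the two 2×2 covariance matrices are arbitrarily chosen illustrative values, not taken from any particular application.

```python
import numpy as np

sigma_a = np.array([[1.0, 0.3],
                    [0.3, 2.0]])
sigma_b = np.array([[1.5, -0.2],
                    [-0.2, 0.5]])

w, v = np.linalg.eigh(sigma_a)
inv_root = v @ np.diag(1.0 / np.sqrt(w)) @ v.T           # Sigma_A^{-1/2}
lam = np.linalg.eigvalsh(inv_root @ sigma_b @ inv_root)  # eigenvalues of S_AB
print(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))           # geodesic distance (4) for this pair
```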

2 Approximating distances between arbitrary mixtures
There is no general analytic solution for the geodesic distance between two k-variate Gaussians, but for many purposes the absolute information distance is not essential and comparative values suffice for proximity testing; in such cases the sum D = D_μ + D_Σ from (3) and (4) is a sufficient approximation. Indeed, (4) gives the geodesic distance between f_A with Σ_A = I and f_B when μ_A = μ_B = 0, and the information metric is invariant under affine transformations of the mean [2, 3].
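As a small sketch of this combined approximation, reusing the d_mu and d_sigma functions from the sketch in §1: the choice of covariance passed to (3) when Σ_A ≠ Σ_B is not fixed by the text, so the pooled average used below is an assumption.

```python
def d_total(mu_a, sigma_a, mu_b, sigma_b):
    # D = D_mu + D_Sigma, with the mean term evaluated against the pooled
    # covariance 0.5*(Sigma_A + Sigma_B) -- an illustrative assumption.
    return d_mu(mu_a, mu_b, 0.5 * (sigma_a + sigma_b)) + d_sigma(sigma_a, sigma_b)
```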
So, a fortiori, we also do not have the distance between two mixtures of multivariate Gaussians; for this we must resort to approximations that incorporate the weightings of the component Gaussians. In practice, this may not matter greatly, since the relation between a reasonable approximation and the true geodesic distance is likely to be monotonic, which may be adequate for many applications.
One method is to combine equations (3) and (4) through the linear combination (2), obtaining an approximation as a corresponding linear combination of distances. To achieve this there are several choices of how to combine weighted sets of D_μ and D_Σ, and here we mention two. The natural choice (§2.1) incorporates the Gaussian component weights w_k inside the matrix operations; a simpler choice (§2.2) just takes the average weighted values. Figure 2 and Figure 3 illustrate their results on different sequences of weight vectors. However, both of these approaches suffer from the disadvantage of assuming that the k-variate components of the two mixtures come from the same space, whereas in fact there may be no connection between the contributing features they represent.
The new implementation described in §2.3 uses the incorporated weights, the information geometric norm on the mean vectors and the Frobenius norm on the covariance matrices to project the mixture distributions onto the complex plane. This allows the direct calculation of a distance between two mixture distributions using moduli, without assuming any connection between the mixtures.

2.1 Incorporated weights
Given two mixture distributions f_A = (μ_A, Σ_A), f_B = (μ_B, Σ_B), we split the distance estimate function D* into D*_μ and D*_Σ, incorporating the component weights w_k inside the matrix operations. Figure 2 shows the resulting distances D*(f_A, f_B), D*(f_B, f_C), D*(f_A, f_C) for ten different random sequences of k-variate Gaussians.
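Purely as an illustration of the general idea, the sketch below scales each component's mean and covariance by its weight before applying (3) and (4), and sums the contributions over matched components; this weighting scheme is an assumption made for the sketch, not the paper's stated formula. Mixtures are represented as lists of (weight, mean, covariance) triples, again an illustrative layout.

```python
import numpy as np

def _inv_sqrt(m):
    w, v = np.linalg.eigh(m)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def d_star(mix_a, mix_b):
    """Illustrative 'incorporated weights' estimate: the weights are folded into
    the matrix operations of (3) and (4), component by component (an assumption)."""
    dmu_sq, dsig_sq = 0.0, 0.0
    for (wa, mu_a, sa), (wb, mu_b, sb) in zip(mix_a, mix_b):
        sa_w, sb_w = wa * np.asarray(sa), wb * np.asarray(sb)
        dmu = wa * np.asarray(mu_a) - wb * np.asarray(mu_b)
        pooled = 0.5 * (sa_w + sb_w)                  # covariance used in (3), an assumption
        dmu_sq += dmu @ np.linalg.solve(pooled, dmu)
        r = _inv_sqrt(sa_w)
        lam = np.linalg.eigvalsh(r @ sb_w @ r)
        dsig_sq += 0.5 * np.sum(np.log(lam) ** 2)
    return np.sqrt(dmu_sq), np.sqrt(dsig_sq)          # (D*_mu, D*_Sigma)
```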

2.2 Averaged weights
Given two mixture distributions f_A = (μ_A, Σ_A), f_B = (μ_B, Σ_B), we could instead split the distance estimate function D# into D#_μ and D#_Σ using the averaged weights, with δμ = (μ_A − μ_B). In this case, if f_A = (μ_A, Σ_A) and f_B = (μ_B, Σ_B) arise as differently weighted sums of the same sequence of covariances, then the matrix S_AB in (4) is the identity matrix for each component and D#_Σ(f_A, f_B) = 0. Figure 3 shows the effect on D# of differing averaged weighting sequences using (13), (14).
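Again as an illustration only, the sketch below reads "averaged weights" as weighting the component-wise distances (3) and (4) by the averaged weight (w^A_k + w^B_k)/2; this interpretation is an assumption, chosen because it reproduces the property noted above that D#_Σ vanishes when the two mixtures share the same sequence of covariances.

```python
import numpy as np

def _inv_sqrt(m):
    w, v = np.linalg.eigh(m)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def d_hash(mix_a, mix_b):
    """Illustrative 'averaged weights' estimate: averaged weights multiply the
    component-wise distances (3) and (4) (an assumption, not equations (13), (14))."""
    dmu_tot, dsig_tot = 0.0, 0.0
    for (wa, mu_a, sa), (wb, mu_b, sb) in zip(mix_a, mix_b):
        w_bar = 0.5 * (wa + wb)
        dmu = np.asarray(mu_a) - np.asarray(mu_b)
        pooled = 0.5 * (np.asarray(sa) + np.asarray(sb))  # covariance used in (3), an assumption
        dmu_tot += w_bar * np.sqrt(dmu @ np.linalg.solve(pooled, dmu))
        r = _inv_sqrt(np.asarray(sa))
        lam = np.linalg.eigvalsh(r @ np.asarray(sb) @ r)  # all ones when the covariances coincide
        dsig_tot += w_bar * np.sqrt(0.5 * np.sum(np.log(lam) ** 2))
    return dmu_tot, dsig_tot                              # (D#_mu, D#_Sigma)
```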

2.3 Mixtures projected onto the complex plane
The idea here is simple: for each mixture distribution f_A given by a weighted sum (2) we obtain two numbers ||μ_A|| and ||Σ_A||, being the weighted sums of norms of means and covariances. The norm on mean vectors is given by (3), and for the covariance matrices we need a matrix norm, the Frobenius norm

\[ ||\Sigma|| = \sqrt{\sum_{i,j} \sigma_{ij}^2} = \sqrt{\mathrm{tr}(\Sigma^T \Sigma)}. \]

Note that if \(M_{\alpha\beta}\) has eigenvalues \(\{\lambda_\alpha\}\) and is represented on a basis of eigenvectors then \(||M|| = \sqrt{\sum_\alpha \lambda_\alpha^2}\). Given a mixture distribution f_A consisting of M different multivariate Gaussians with weights \(w^A_k\),

\[ ||\mu_A|| = \sum_{k=1}^{M} w^A_k\, ||\mu^A_k||, \qquad ||\Sigma_A|| = \sum_{k=1}^{M} w^A_k\, ||\Sigma^A_k||. \]

Now we can represent f_A by the complex number \(\phi_A = ||\mu_A|| + i\,||\Sigma_A||\), and its difference from another such complex number \(\phi_B\) for f_B gives us a distance measure in our reduced space of mixtures:

\[ \Delta(f_A, f_B) = |\phi_A - \phi_B| = \sqrt{\big(||\mu_A|| - ||\mu_B||\big)^2 + \big(||\Sigma_A|| - ||\Sigma_B||\big)^2}. \qquad (17) \]

The result of using (17) to project mixtures onto the complex plane is shown in Figure 4. The three bars give Δ(f_A, f_B), Δ(f_B, f_C), Δ(f_A, f_C) respectively, for ten different random sequences of k-variate Gaussians. The three bar charts, in Figure 2, Figure 3 and Figure 4, use the same mixtures of multivariate Gaussians. It appears that the projection of mixtures onto the complex plane, Figure 4, as described in the present section gives a wider range of differences and shows the intuitively expected largest differences mostly between the increasing and decreasing weight sequences, Δ(f_A, f_C), in the third column of each replication.
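A minimal sketch of this projection, assuming NumPy and representing a mixture as a list of (weight, mean, covariance) triples (an illustrative layout); taking the mean norm as \(\sqrt{\mu^T \Sigma^{-1} \mu}\), consistent with (3), is an assumption of the sketch.

```python
import numpy as np

def project_mixture(mix):
    """Map a mixture [(w, mu, sigma), ...] to phi = ||mu|| + i ||Sigma||,
    the weighted sums of component norms."""
    norm_mu = sum(w * np.sqrt(np.asarray(mu) @ np.linalg.solve(sigma, np.asarray(mu)))
                  for w, mu, sigma in mix)
    norm_sigma = sum(w * np.linalg.norm(sigma, 'fro') for w, _, sigma in mix)
    return complex(norm_mu, norm_sigma)

def delta(mix_a, mix_b):
    """Distance (17): modulus of the difference of the two projections in C."""
    return abs(project_mixture(mix_a) - project_mixture(mix_b))
```

Note that, unlike the sketches for §2.1 and §2.2, no pairing of components between the two mixtures is needed, since each mixture is first reduced to a single point of C.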
Figure 5 shows a plot of the points (||μ||, ||Σ||) ∈ C for the ten mixtures of random k-variate Gaussians having k = 2, 3, 4, 5 variables, with increasing weights f_A, uniform weights f_B, and decreasing weights f_C. The g_A, g_B, g_C are for the same mixtures except that Σ^C_2 has been replaced by Σ^C_2/5, and h_A, h_B, h_C are for the same mixtures except that Σ^C_5 has been replaced by Σ^C_5/5, to show the effect of a change in one covariance component. In each case the mean over the ten replications is shown as a large point.

Acknowledgement
The methods described in §2.1 and §2.2 were developed with J. Scharcanski and J. Soldera during a visit to UFRGS, Brazil, with a grant from The London Mathematical Society in 2013, and the author is grateful to CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil) for partially funding this project. An application of the results to face recognition will be reported elsewhere in a joint paper.

Figure 2: Effect of incorporated weights §2.1: Distances between pairs of mixtures of random k-variate Gaussians having k = 2, 3, 4, 5 variables, with increasing weights A, uniform weights B, and decreasing weights C. The three bars give D*(f_A, f_B), D*(f_B, f_C), D*(f_A, f_C) respectively, for ten different random sequences of k-variate Gaussians.

Figure 5: Mixture projection onto C §2.3: Mixtures are shown plotted in (||μ||, ||Σ||)-space for the ten mixtures of random k-variate Gaussians having k = 2, 3, 4, 5 variables, with increasing weights f_A, uniform weights f_B, and decreasing weights f_C. The g_A, g_B, g_C are for the same mixtures except that Σ^C_2 has been replaced by Σ^C_2/5, and h_A, h_B, h_C are for the same mixtures except that Σ^C_5 has been replaced by Σ^C_5/5, to show the effect of a change in one covariance component. In each case the mean over the ten replications is shown as a large point.