Small time asymptotics of the entropy of the heat kernel on a Riemannian manifold

We give an asymptotic expansion of the relative entropy between the heat kernel $q_Z(t,z,w)$ of a compact Riemannian manifold $Z$ and the normalized Riemannian volume for small values of $t$ and for a fixed element $z\in Z$. We prove that the coefficients in the expansion can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at $z$, when these are expressed in normal coordinates. We describe a method to compute the coefficients, and we use the method to compute the first three coefficients. The asymptotic expansion is needed in an unsupervised machine-learning algorithm called the Diffusion Variational Autoencoder.


Introduction
In this article, we give a small-time asymptotic expansion of the relative entropy (also called the Kullback–Leibler divergence)
$$D_{\mathrm{KL}}(q_Z(t,z,\cdot)\,\|\,\vartheta) := \int_Z q_Z(t,z,w)\log(q_Z(t,z,w))\,d\mathrm{Vol}(w) + \log[\mathrm{Vol}(Z)] \qquad (1)$$
of the heat kernel q_Z(t,z,·) of a d-dimensional closed Riemannian manifold Z with respect to the normalized Riemannian volume ϑ. More precisely, our main result (Theorem 3.1) is that for every n ∈ ℕ and z ∈ Z we have the following nth-order asymptotic expansion for small t:
$$D_{\mathrm{KL}}(q_Z(t,z,\cdot)\,\|\,\vartheta) = -\frac{d}{2}\log(2\pi t) + \log[\mathrm{Vol}(Z)] + \sum_{i=0}^{n} c_i(z)\,t^i + o(t^n),$$
where the coefficients c_i(z) (i = 1, 2, ..., n) can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at z, provided these components are written down in normal coordinates at z. Moreover, we describe a method to compute the coefficients c_i(z) for arbitrary i, and we compute c_0(z), c_1(z) and c_2(z).
Our main theorem, together with the computation of c_0(z) and c_1(z), proves Theorem 4.4, which essentially states that
$$D_{\mathrm{KL}}(q_Z(t,z,\cdot)\,\|\,\vartheta) = -\frac{d}{2}\log(2\pi t) + \log[\mathrm{Vol}(Z)] - \frac{d}{2} + \frac{\mathrm{Sc}(z)}{4}\,t + O(t^2) \qquad (t \downarrow 0),$$
where Sc(z) is the scalar curvature of the manifold at the point z ∈ Z. We mentioned Theorem 4.4 in earlier work [17, Proposition 1], but here we follow up with a rigorous proof.
In order to prove these results, we use the so-called parametrix expansion, which is a classical approach to constructing the fundamental solution of a partial differential equation. Hadamard [7] used a parametrix expansion to construct a fundamental solution for the wave equation. Analogously to this construction, Minakshisundaram and Pleijel [16] derived the following parametrix expansion of the heat kernel for a closed Riemannian manifold. (This work was supported by NWO GROOT project UNRAVEL, OCENW.GROOT.2019.044.)
$$q_Z(t,z,w) \approx \frac{1}{(2\pi t)^{d/2}}\, e^{-\mathrm{dist}_Z(z,w)^2/(2t)} \sum_{i=0}^{n} u_i(z,w)\, t^i \qquad (2)$$
for small t and smooth functions u_i defined on a neighborhood of the diagonal of Z × Z. The expansion in Equation (2) was derived to study the analytic continuation in the s-plane of the Dirichlet series that corresponds to the spectrum of the Laplace–Beltrami operator. More generally, the coefficients of a parametrix expansion are Riemannian invariants and encode the relationship between the topology, the geometry and the spectrum of a differential operator on the Riemannian manifold [20], [2].
For our results we need rather strong rigorous error estimates for the approximation (2). In Theorem 2.2, we record one such estimate that essentially goes back to [11]. Similar estimates were obtained by Ben Arous [1] for the more general case of fundamental solutions of hypoelliptic heat equations.

Motivation: The Diffusion Variational Autoencoder
Our mathematical result is needed in a machine-learning algorithm called the Diffusion Variational Autoencoder (∆VAE), which is a variant of a Variational Autoencoder (VAE) that allows for a closed manifold as latent space, rather than the usual Euclidean latent space.
A Variational Autoencoder [13, 14, 19] is an unsupervised machine-learning algorithm that aims to find a representation of a dataset in a lower-dimensional space. The lower-dimensional space is called the latent space. A Variational Autoencoder is built up of two parts: an encoder, which maps data points to points in latent space, and a decoder, which maps points in latent space back to points in data space. Both parts are usually implemented as neural networks and are optimized according to some loss function, which stimulates the composition of encoder and decoder to approximate the identity map.
Because the latent space usually has a much lower dimension than the data space, a VAE is often thought to compress the data. The heuristic is that if a VAE finds a good compression, then the position of an encoding in latent space contains useful, relevant or even interpretable information about the dataset. Sometimes this is expressed by saying that a VAE manages to disentangle latent factors [8, 9]. Although it is still unclear what is precisely meant by that, there are attempts to define disentanglement by Higgins et al [9] and to measure disentanglement by Pérez Rey et al [18].
There are several variants of the VAE that are specifically meant to improve this disentanglement of latent factors. Higgins et al [8] and Burgess et al [3] introduce the β-VAE, which adds an adjustable hyperparameter β to the loss function of the VAE in order to balance latent channel capacity and independence constraints with reconstruction accuracy. The FactorVAE [12] is a modification of the β-VAE that forces the distribution of the representation to be factorial, so as to have independence in each dimension.
Yet sometimes, a VAE with a standard Euclidean latent space is fundamentally unable to capture aspects of the dataset that a human observer might view as important latent factors. This is for instance the case in the idealized example where the dataset consists of pictures of a single object, rotated over different angles around one axis. To a human observer, the angle of rotation may be viewed as an important factor. However, when a VAE with a one-dimensional latent space is trained on this dataset, either pictures of the object that are very similar end up in very different parts of the latent space, or two nearby sampled points in the latent space decode to two far-apart, or even meaningless, points in the data space [6]. This phenomenon was called manifold mismatch by Davidson et al [5].
To overcome the problem of manifold mismatch, one could choose a latent space that is homeomorphic to the topological structure of the data [6]. In the case of the example above, the latent space could be chosen to be a circle.
Several variants of the VAE have been developed that allow for various topological spaces as latent space. Xu and Durrett [22] and Davidson et al [5] developed VAEs that allow for spherical latent spaces, and Falorsi et al [6] designed a VAE that allows for Lie groups as latent space. In order to allow for general closed manifolds as latent space, Pérez Rey et al [17] developed the Diffusion Variational Autoencoder (∆VAE). There are also approaches that focus on learning a different metric on a space otherwise homeomorphic to R^n, such as the ones by Kalatzis et al [10] and by Chadebec et al [4].
The asymptotic expansion derived in this article is important for the computational efficiency of the ∆VAE: the relative entropy is one term in the loss function of the ∆VAE, and the asymptotic expansion provides a means to approximate it quickly. This article provides a rigorous argument for the correctness of this approximation.

Organization of the work
This work is organized as follows. In Section 2, we fix notation and review all preliminary results that will be needed in the proof of the main result and in the computation of the coefficients c_i(z). In Section 3, we prove the existence of an expansion up to any order for the entropy, and we show that the coefficients of this expansion are universal polynomials in the components of the curvature tensor and its covariant derivatives at the fixed point z ∈ Z. Moreover, we express the coefficients c_i(z) in this asymptotic expansion in terms of Gaussian integrals. In Section 4, we outline a general procedure to compute the Taylor series at 0 of the square root of the determinant det(g(y)) of the Riemannian metric g, and the Taylor series at 0 of the functions u_i(0, y) (i = 0, 1, 2, ..., n) that appear in the parametrix expansion. We use the lowest-order terms of these Taylor series, which are already recorded in [21], to compute the coefficients c_0(z), c_1(z) and c_2(z) of Theorem 3.1. The computation of c_0(z) and c_1(z) rigorously recovers the first-order expansion mentioned in [17, Proposition 1].

Notations and preliminary results
We first introduce notation for the heat kernel on a Riemannian manifold. We denote by (Z, g) a d-dimensional closed and connected Riemannian manifold equipped with a Riemannian metric g. We let q_Z : (0, ∞) × Z × Z → R denote the heat kernel of Z, i.e. the fundamental solution of the heat equation [20, Equation (1.24)]
$$\frac{\partial}{\partial t}\, q_Z(t,z,w) = \frac{1}{2}\,\Delta_w\, q_Z(t,z,w), \qquad \lim_{t \downarrow 0} q_Z(t,z,w)\, d\mathrm{Vol}(w) = \delta_z,$$
where Vol denotes the standard Riemannian volume measure on Z. For fixed t and z, the function w ∈ Z ↦ q_Z(t,z,w) is the density with respect to the standard Riemannian volume measure of a probability measure on Z.
Next, we give an expression for the KL-divergence between the kernel q_Z(t,z,·) and the normalized Riemannian volume measure ϑ. The measure ϑ has density 1/Vol(Z) with respect to the standard Riemannian volume measure Vol(·). The KL-divergence can therefore be written as
$$D_{\mathrm{KL}}(q_Z(t,z,\cdot)\,\|\,\vartheta) = \int_Z q_Z(t,z,w)\log(q_Z(t,z,w))\,d\mathrm{Vol}(w) + \log[\mathrm{Vol}(Z)]. \qquad (4)$$
In this article, we are going to give asymptotic expansions of this expression for small values of t.
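The identity above is elementary but easy to check numerically; the following sketch (an illustration, not the paper's code, with an arbitrary test density on an interval of length L = Vol) compares the direct KL-divergence against a uniform density with the formula "∫ p log p + log Vol":

```python
import numpy as np

# Illustration: on an interval of length L = Vol, the KL-divergence of a density p
# against the uniform density 1/L equals  (integral of p*log p)  +  log(L).
L = 2.0
x = np.linspace(0.0, L, 20001)
trap = lambda f: np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))   # trapezoid rule

p = np.exp(-(x - 1.0)**2 / 0.02)      # arbitrary positive test function
p = p / trap(p)                        # normalize to a probability density

kl_direct  = trap(p * np.log(p * L))   # KL against the density 1/L
kl_formula = trap(p * np.log(p)) + np.log(L)
print(kl_direct, kl_formula)           # agree up to rounding
```

The two quantities agree identically, because log(p·L) = log p + log L and p integrates to 1.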
The following lemma allows us to reduce the domain of integration.
Lemma 2.1. Fix z ∈ Z and let ε > 0. Then for any positive integer n,
$$\int_{Z \setminus B_\epsilon(z)} q_Z(t,z,w)\log\,q_Z(t,z,w)\,d\mathrm{Vol}(w) = o(t^n) \qquad (t \downarrow 0),$$
where B_ε(z) is the ball of radius ε centered at z with respect to the distance dist_Z induced by the Riemannian metric g.
Proof. Let z be an element of Z, and let ε > 0. Since we consider only small values of t, the heat kernel satisfies a Gaussian upper bound with some constant C > 0 [15, Theorem 3.5 and Remark 3.6 with j = m = l = 0]. As a result, there exists a positive constant c = c(z, ε), independent of w, such that
$$q_Z(t,z,w) \le e^{-c/t}$$
for all w ∈ Z \ B_ε(z) and for small values of t.
Since the function x ↦ x log x is decreasing on (0, e^{-1}) and converges to 0 as x converges to 0, we have for small and positive values of t
$$-\frac{c}{t}\, e^{-c/t} \le q_Z(t,z,w)\log[q_Z(t,z,w)] \le 0.$$
Since c does not depend on w, we have that
$$-\mathrm{Vol}(Z)\,\frac{c}{t}\,e^{-c/t} \le \int_{Z \setminus B_\epsilon(z)} q_Z(t,z,w)\log\,q_Z(t,z,w)\,d\mathrm{Vol}(w) \le 0.$$
Since the volume of Z is finite, the lower bound converges to 0 faster than t^n for any fixed positive integer n as t goes to 0, which proves the lemma. It is then enough to compute the integral in Equation (4) on a small neighborhood of z in order to obtain an expansion of the entropy to any order in t.
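The last step uses that an exponential e^{-c/t} decays faster than any power of t; explicitly, substituting x = c/t:

```latex
% For c > 0 and any fixed n, with x = c/t (so x -> infinity as t -> 0+):
\lim_{t\downarrow 0} \frac{e^{-c/t}}{t^{n}}
  = \lim_{x\to\infty} \frac{x^{n} e^{-x}}{c^{n}} = 0,
\qquad\text{hence}\qquad e^{-c/t} = o(t^{n}) \quad (t \downarrow 0).
```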
Throughout the rest of this work, we fix a point z ∈ (Z, g) and a ball U := B_ε(z) of radius ε centered at z such that the closure Ū of U does not intersect the cut locus of z (we also identify U with its image under the exponential map based at z). We denote by y the corresponding normal coordinates of w ∈ U, i.e. exp_z^{-1}(w) = y ∈ R^d ≅ T_z Z and dist_Z(z,w)^2 = ||y||^2 := g(y,y). Moreover, let Ric(z) and Sc(z) := Trace(Ric(z)) denote respectively the Ricci curvature and the scalar curvature of Z at z. By Lemma 2.1, we have for any positive integer n:
$$\int_Z q_Z(t,z,w)\log q_Z(t,z,w)\,d\mathrm{Vol}(w) = \int_U q_Z(t,z,w)\log q_Z(t,z,w)\,d\mathrm{Vol}(w) + o(t^n). \qquad (5)$$
We are going to use the following nth-order (n ∈ {0, 1, 2, ...}) Minakshisundaram–Pleijel parametrix expansion of q_Z(t,z,w), which is valid for w in the closure Ū of U.

Theorem 2.2 (Minakshisundaram–Pleijel parametrix expansion). Let K ⊂ Z be compact and such that K × K does not intersect the cut locus. Then there exist smooth functions u_i on K × K (i = 0, 1, ..., n) such that
$$q_Z(t,z,w) = \frac{1}{(2\pi t)^{d/2}}\, e^{-\mathrm{dist}_Z(z,w)^2/(2t)} \left( \sum_{i=0}^{n} u_i(z,w)\, t^i + O(t^{n+1}) \right) \qquad (6)$$
uniformly for (z,w) ∈ K × K as t ↓ 0. This theorem was basically proved by Kannai [11], although his formulation of the theorem was slightly different.
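As a sanity check of the normalization (an illustration, not from the paper): with the convention ∂_t q = ½Δq implied by the factor (2πt)^{-d/2}, the leading parametrix term is the exact heat kernel in flat R^1, which can be verified by finite differences:

```python
import numpy as np

# The leading parametrix term in flat R^1, with the convention dq/dt = (1/2) q''.
def q(t, y):
    return np.exp(-y**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

t, y, h = 0.5, 0.3, 1e-4
dq_dt = (q(t + h, y) - q(t - h, y)) / (2 * h)              # central difference in t
lap_q = (q(t, y + h) - 2 * q(t, y) + q(t, y - h)) / h**2   # second difference in y

print(abs(dq_dt - 0.5 * lap_q))   # ~0 up to discretization error
```

With the other common convention ∂_t q = Δq, the prefactor would instead be (4πt)^{-d/2}.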
Furthermore, we have the following common property of the functions u_i (i ∈ {0, 1, 2, ...}) and the square root of the determinant of the metric tensor at w.
Proposition 2.3. Let R_z, ∇R_z, ∇²R_z, ... denote the curvature tensor and its covariant derivatives at z. For i ∈ ℕ, let u_i(0, y) be the expression of u_i(z, w) of Equation (6) in terms of the normal coordinates (the normal coordinates of z and w are respectively 0 and y). Then the coefficients of the Taylor series at y = 0 of the function y ↦ u_i(0, y) can be expressed as universal polynomials¹ in the components of R_z, ∇R_z, ∇²R_z, ..., ∇ⁿR_z, ..., expressed in normal coordinates. The same result holds for the function y ↦ det(g(y)), where g(y) is the matrix of the Riemannian metric at w expressed in terms of the normal coordinates y.
Proof. See the proof of [20, Lemma 3.26]. □

Before going to the next section, let us prove some elementary results and some properties of Gaussian integrals.
Definition 2.4. Let f be a function defined on a ball B of center 0 (and of any radius) of R^d. We say that for any X ∈ B. (ii) Assume that σ = 1. For i = 1, ..., d, we have. Then we have for small values of t. (iv)

¹ These are polynomials that depend only on the dimension of Z.

Proof. (i) This follows from the probability density function of N(0, σ).

(iii) Since any polynomial is a linear combination of monomials, it is enough to assume that P is a monomial. The term E[P(y)] of the upper bound in Equation (7) then cancels out after expanding the finite product.
The case n = 0 follows by bounding e^{-x²/2} x^k from above by 1/x² (this is always possible for small values of t). This implies the required bound, and the result follows by Equation (7).
Then the Jacobian of s_0 is 1, and the result follows.
(iv) This follows by using the moments of order p ∈ {0, 1, 2, ...} of a Gaussian random variable.

As in [21], we denote Y = (Y^1, ..., Y^d) ∈ R^d, i.e. we put the index up for the components of Y ∈ R^d, and we use the Einstein summation convention. Furthermore, for k ∈ ℕ and for any finite sequence of integers i_1, ..., i_k ∈ {1, ..., d}, we denote the products Y^{i_1} ⋯ Y^{i_k}. We will need the following contraction lemma.
(ii) By using the moments of standard Gaussian random variables, we have
$$E[Y^i Y^j Y^k Y^l] = \delta^{ij}\delta^{kl} + \delta^{ik}\delta^{jl} + \delta^{il}\delta^{jk};$$
in particular the expectation vanishes when some index appears an odd number of times, for instance when #{i, j, k, l} = 2 and exactly three of i, j, k and l are equal. (iii) By using Lemma 2.6 (ii), we have E[T_{ij} Y^i Y^j] = T_{ij} δ^{ij}, where the last expression is understood with the Einstein summation convention. Hence E[T_{ij} Y^i Y^j] = Trace(T).
(iv) By using the moments of standard Gaussian random variables, we obtain the remaining identity. □
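The Gaussian-moment identities used in these lemmas can be checked numerically (a sketch, not part of the paper; `T` is an arbitrary test matrix) with probabilists' Gauss–Hermite quadrature, which is exact for polynomials:

```python
import numpy as np

# Probabilists' Gauss-Hermite rule: exact for polynomial integrands against N(0, 1).
nodes, weights = np.polynomial.hermite_e.hermegauss(10)
weights = weights / np.sqrt(2.0 * np.pi)          # normalize so that E[1] = 1

def E1(f):                                        # E[f(y)] for y ~ N(0, 1)
    return np.sum(weights * f(nodes))

# One-dimensional moments: E[y^2] = 1 and E[y^4] = 3.
print(E1(lambda y: y**2), E1(lambda y: y**4))

# Contraction identity E[T_ij Y^i Y^j] = Trace(T) for Y ~ N(0, I_2),
# checked with a 2-D tensor-product quadrature rule.
T = np.array([[2.0, 0.7], [0.7, -1.0]])
X, Y = np.meshgrid(nodes, nodes)
W = np.outer(weights, weights)
quad = np.sum(W * (T[0, 0]*X*X + (T[0, 1] + T[1, 0])*X*Y + T[1, 1]*Y*Y))
print(quad, np.trace(T))
```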

Existence and property of an expansion of any order
In this section we are going to prove, for any given positive integer n, the existence of a higher-order (in t) expansion of the form
$$\int_Z q_Z(t,z,w)\log q_Z(t,z,w)\,d\mathrm{Vol}(w) = -\frac{d}{2}\log(2\pi t) + \sum_{i=0}^{n} c_i(z)\,t^i + o(t^n),$$
where the c_i(z)'s (i = 0, 1, ..., n) are universal polynomials in the components of the curvature tensor and its covariant derivatives at z. The main result can be summarized in the following theorem.
Theorem 3.1. Let z ∈ (Z, g) and let n ≥ 0. Then (1) there exist c_0(z), c_1(z), ..., c_n(z) ∈ R such that
$$\int_Z q_Z(t,z,w)\log q_Z(t,z,w)\,d\mathrm{Vol}(w) = -\frac{d}{2}\log(2\pi t) + \sum_{i=0}^{n} c_i(z)\,t^i + o(t^n) \qquad (t \downarrow 0).$$
Furthermore, the coefficients c_i(z) (i = 0, 1, ..., n) can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at z, when these are expressed in normal coordinates.
(2) More precisely, the coefficient c_i(z) (i = 1, 2, ..., n) is given by a Gaussian integral against the standard Gaussian density on R^d, determined by two polynomials P_i(Y) and Q_i(Y) in Y. The coefficients of P_i(Y) and Q_i(Y) can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at z. The polynomials P_i(Y) and Q_i(Y) are defined as follows.
Let u_i(0, y) (i = 0, ..., n) be the ith-order parametrix of Z (cf. Equation (6)) expressed in terms of the normal coordinates y at z. Then the two polynomials P_i(Y) and Q_i(Y) (i = 0, 1, ..., n) can be computed by the following two steps.
Step 1. Sum up all the terms T of the Taylor series at (t = 0, y = 0) of the function F(t, y) (defined in Equation (15)) such that
• T is of degree p in t and of degree q in the components of y (with q even),
• p and q are integers such that i = p + q/2,
and let F_i(t, y) be the obtained quantity. Do the same procedure for the function G(t, y) and let G_i(t, y) be the obtained quantity.
Step 2. Replace y with √t Y in the expressions of F_i(t, y) and G_i(t, y), so that we obtain two polynomials in Y.

Proof. We are going to prove (1) and (2) of Theorem 3.1 at the same time. Recall from Lemma 2.1 that it is enough to compute the entropy on a normal neighborhood U := B_ε(z) of z, so that we can use the parametrix expansion in (6). By using the normal coordinates y ∈ R^d that correspond to the normal neighborhood U of z, we express the integral over U of q_Z(t,z,w) log q_Z(t,z,w) with respect to dVol(w) in these coordinates. By the fact that ∫_Z q_Z(t,z,w) dVol(w) = 1 and by the estimate in Equation (5), the error terms are controlled for small values of t. Therefore, we have the following equality, in which the terms K_1(t), K_2(t) and K_3(t) are given by Equations (11), (12) and (13). Throughout this proof, we use the same notation U or B_ε for both the neighborhood of z in Z and the corresponding neighborhood of 0 in R^d. In order to compute the terms K_1, K_2 and K_3 of Equation (10), we are going to use the change of variables
$$y = \sqrt{t}\, Y. \qquad (14)$$
Since y ∈ R^d, the Jacobian of this change of variables is just t^{d/2}.

(i) Computation of K_1(t) (11) and P_i(Y) (i = 0, 1, 2, ..., n). Consider the function F : R × U → R defined for (t, y) ∈ R × U by Equation (15), which is C^∞ around (t, y) = (0, 0) because g is smooth and non-singular on U and the u_i's are C^∞. Then F has the Taylor expansion at (0, 0) with derivatives F^{(i)}_{(0,0)} of F at (0, 0) for i = 0, 1, ..., N. If we plug this Taylor expansion into Equation (11) and use the change of variables in (14), then, by taking N large enough, we obtain Equation (16), since the moments of any order of a Gaussian random variable are finite. Since the derivatives F^{(i)}_{(0,0)} (i = 0, ..., N) are multi-linear forms, by Lemma 2.6 (iii) we can compute the integral in Equation (16) over R^d for small values of t.
Again, since the derivatives are multi-linear forms, the resulting quantity is a finite sum of monomials of the form
$$c_{k,\alpha}\; t^{\,k + (\alpha_1 + \cdots + \alpha_d)/2}\, (Y^1)^{\alpha_1} \cdots (Y^d)^{\alpha_d},$$
where α is a multi-index, i.e. a tuple of non-negative integers, and c_{k,α} := c_{k,α}(z) ∈ R. By Proposition 2.3, the coefficients of the Taylor series at 0 of the functions y ∈ U ↦ det[g(y)] and (t, y) ∈ R × U ↦ Σ_{i=0}^n u_i(0, y) t^i are universal polynomials in the components of the curvature tensor and its covariant derivatives at z. Consequently, the coefficients c_{k,α} are also universal polynomials in the components of the curvature tensor and its covariant derivatives at z.
Assume now that α_1 + ... + α_d is odd, so that we have a term with a non-integer power of t. In this case, there exists at least one index i_0 ∈ {1, ..., d} such that α_{i_0} is odd, and the number of such i_0's must also be odd. Hence, for fixed t, the corresponding function is an odd function in Y on the ball B_{ε/√t}, and its integral over that ball is then 0 by Lemma 2.5. Now, for i = 0, 1, 2, ..., n, let F_i(t, √t Y) be the sum of all terms T of the quantity above that satisfy the following two conditions:
• T is of degree p in t and of degree q in the components of Y (with q even),
• p and q are such that i = p + q/2.
Then there exists a polynomial P_i(Y) with the stated properties, and we obtain the corresponding contribution to the expansion. Furthermore, the coefficients of P_i(Y) (i = 0, 1, ..., n) can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at z (cf. Proposition 2.3).
(ii) Computation of K_2(t) (12) and Q_i(Y) (i = 0, 1, 2, ..., n). We use exactly the same method as in the computation of K_1(t). Consider the function G : R × U → R defined for (t, y) ∈ R × U in terms of the function F(t, y) from Equation (15); it is C^∞ around (0, 0) because the u_i's are C^∞ and g is smooth and non-singular.
Then, at (0, 0), the function G has a Taylor expansion with derivatives G^{(i)}_{(0,0)} of G at (0, 0) for i = 0, ..., N. If we plug this Taylor expansion into Equation (12) and use the change of variables in (14), then, since the moments of any order of a Gaussian vector are finite, we obtain Equation (20) by taking N large enough. Since the derivatives are multi-linear forms, by Lemma 2.6 (iii) we can integrate the right-hand side of Equation (20) over R^d.
The resulting quantity is a finite sum of monomials of the form c_{k,α} t^{k + (α_1 + ⋯ + α_d)/2} (Y^1)^{α_1} ⋯ (Y^d)^{α_d}, where c_{k,α} := c_{k,α}(z) ∈ R and α = (α_1, ..., α_d) is a multi-index, i.e. a tuple of non-negative integers. Since u_0(0, 0) = 1, the Taylor series at (0, 0) of the relevant function can be written as a sum of terms P_i, where P_i is a homogeneous polynomial of degree i in t, y^1, ..., y^d for i > 0, and the coefficients of P_i (i ∈ ℕ) are universal polynomials in the components of the curvature tensor and its covariant derivatives at z (by Proposition 2.3). Hence the coefficients of the Taylor series at (0, 0) of the function G are also universal polynomials in the components of the curvature tensor and its covariant derivatives at z, and, by Proposition 2.3 again, all the coefficients c_{k,α}(z) are universal polynomials in the components of the curvature tensor and its covariant derivatives at z. Assume now that α_1 + ... + α_d is odd, so that we have a term with a non-integer power of t in the integral (20). In this case, there exists at least one index i_0 ∈ {1, ..., d} such that α_{i_0} is odd, and the number of such i_0's must also be odd. Hence, for fixed t, the corresponding function is an odd function in Y on the ball B_{ε/√t}, and its integral over that ball is then 0 by Lemma 2.5.
Now, let G_i(t, √t Y) be the sum of all terms T of the quantity above that satisfy the following two conditions:
1. T is of degree p in t and of degree q in the components of Y (with q even),
2. p and q are such that i = p + q/2.
Then there exists a polynomial Q_i(Y) with the stated properties, and we obtain the corresponding contribution to the expansion. Furthermore, the coefficients of Q_i(Y) (i = 0, 1, ..., n) can be expressed as universal polynomials in the components of the curvature tensor and its covariant derivatives at z.
Since u_0(0, 0) = 1, the required estimate for K_3(t) follows, because the function that we integrate against the Gaussian distribution in (13) is bounded for small values of t. □
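A minimal numerical illustration (not from the paper) of the change of variables y = √t Y from Equation (14): integrals against the variance-t Gaussian become expectations over a standard Gaussian Y, which is what reduces the coefficients to t-independent Gaussian integrals.

```python
import numpy as np

# The substitution y = sqrt(t)*Y maps integrals against the density
# (2*pi*t)^(-1/2) * exp(-y^2/(2t)) to expectations over Y ~ N(0, 1).
# Test with f(y) = y^2 + 3, whose integral against that density is t + 3.
t = 0.2
nodes, weights = np.polynomial.hermite_e.hermegauss(20)   # probabilists' rule
weights = weights / np.sqrt(2.0 * np.pi)                  # normalize: E[1] = 1

f = lambda y: y**2 + 3.0
lhs = np.sum(weights * f(np.sqrt(t) * nodes))   # E[f(sqrt(t) * Y)]
print(lhs, t + 3.0)
```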

Computation of the coefficients c_n(z)
In this section, we describe how the coefficients c_n(z) appearing in Theorem 3.1 can actually be computed. For this, one needs the Taylor series at 0 of det(g(y)) and of the functions u_i(0, y) (i = 0, 1, ...) that appear in the parametrix expansion. The lowest-order terms of these expansions have for instance been recorded by Sakai in [21], and we will use these results. However, we do think it is insightful to see the general outline for computing these expansions, and we present a general procedure in Subsection 4.1. Afterwards, we compute the coefficients c_n(z) of the asymptotic expansion in Theorem 3.1 for n = 0, 1, 2.

Computation of the Taylor series of relevant quantities
In this subsection, we describe how to compute the Taylor series at y = 0 of det(g(y)) and of the functions u_i(0, y) (i = 0, 1, ...) appearing in the parametrix expansion. These are needed to compute the coefficients c_i(z) (i = 0, 1, 2, ...).

Taylor series of the metric coefficients g_{ij}
We follow [21]. We let y ∈ R^d, ℓ ∈ (0, ∞), and let γ : R → Z be the geodesic from z such that for k = 1, ..., d, its kth normal coordinate is γ^k(t) = t y^k/ℓ. Then γ^k(ℓ) = y^k. We define J_i as the Jacobi field along the geodesic γ with the initial conditions J_i(0) = 0 and ∇_{γ'(t)} J_i(0) = e_i/ℓ. The Jacobi field is in fact given by J_i(t) = (d exp_z)_{t γ'(0)}(t e_i/ℓ). We set f(t) := g(J_i(t), J_j(t)). Then g_{ij}(y) equals f(ℓ). Therefore, to get the Taylor expansion of g_{ij}, it suffices to consider the Taylor expansion of f. To obtain this Taylor expansion, we use the Jacobi equation
$$\nabla_{\gamma'(t)}^2\, J_i + R(J_i, \gamma'(t))\,\gamma'(t) = 0,$$
where the left-hand side should be interpreted via the second covariant derivative in the direction γ'(t). By taking covariant derivatives of both sides, and (repeatedly) using that ∇_{γ'(t)} γ'(t) = 0, we obtain the derivatives f^{(k)}(0) for all k = 0, 1, ....
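For reference, the lowest-order outcome of this procedure is the classical expansion recorded in [21] (conventions for the sign of R vary between sources):

```latex
g_{ij}(y) = \delta_{ij}
  \;-\; \tfrac{1}{3}\, R_{ikjl}(z)\, y^{k} y^{l}
  \;-\; \tfrac{1}{6}\, \nabla_{m} R_{ikjl}(z)\, y^{k} y^{l} y^{m}
  \;+\; O(\lVert y\rVert^{4}).
```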

Taylor series of g^{ij}
To get the Taylor series of g^{ij} we use that, for g close enough to I,
$$g^{-1} = \sum_{k=0}^{\infty} (I - g)^k.$$
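A quick numerical sketch (hypothetical 3×3 example, not from the paper) of this Neumann series:

```python
import numpy as np

# For g close enough to the identity, g^(-1) = sum_k (I - g)^k.
# Compare a truncated series against numpy's direct inverse.
g = np.eye(3) + 0.05 * np.array([[0.0, 1.0, 2.0],
                                 [1.0, 0.0, 1.0],
                                 [2.0, 1.0, 0.0]])
A = np.eye(3) - g                      # small perturbation, spectral radius < 1
series = sum(np.linalg.matrix_power(A, k) for k in range(30))
print(np.max(np.abs(series - np.linalg.inv(g))))   # ~ machine precision
```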

Taylor series of det(g(y))
To obtain the Taylor series of det(g(y)), we use the Jacobi formula to first get a Taylor series for log(det(g(γ(t)))), after which we exponentiate this series.
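Carried out to lowest order, and consistent with the expansions recorded in [21], this gives (using log det g = tr log g):

```latex
\log\det(g(y)) = -\tfrac{1}{3}\,\mathrm{Ric}_{kl}(z)\, y^{k} y^{l} + O(\lVert y\rVert^{3}),
\qquad
\sqrt{\det(g(y))} = 1 - \tfrac{1}{6}\,\mathrm{Ric}_{kl}(z)\, y^{k} y^{l} + O(\lVert y\rVert^{3}).
```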
Taylor series of u_0(0, y)
We know that u_0(0, y) = (det(g(y)))^{-1/4}, and therefore to get the Taylor expansion of u_0 we use the Taylor expansion of the function x ↦ (1 + x)^{-1/4} at x = 0. The values of E_1, (E_2)_{ij} and (E_4)_{ijkl} in Lemma 4.3 can be obtained by combining this Taylor series with the series for det(g(y)).

For the computation of c_0(z), we only need the constant terms of the Taylor series of F and G, i.e. F_0(t, y) = F(0, 0) = 1 and G_0(t, y) = G(0, 0) = 0.
• Step 2. Since F_0(t, y) and G_0(t, y) are constants, there is no variable to replace in the second step of part (2) of Theorem 3.1.
• Step 3. Since P_0(Y) = 1 and Q_0(Y) = 0, we can evaluate the corresponding Gaussian integral by (2) of Theorem 3.1. Therefore c_0(z) = -d/2.

Step 1. By (2) of Theorem 3.1, we need the functions F and G. We need the terms of the Taylor series at (t = 0, y = 0) of F and G that are of degree zero in t and of degree 2 in y, or of degree one in t and of degree 0 in y. By using the Taylor series in (24), we find these terms; the remaining terms are either of degree strictly higher than two in y or of degree more than one in both t and y. By using the Taylor expansion of the logarithm, we obtain the coefficient c_1(z). We have therefore proven rigorously the following theorem, which is Proposition 1 of [17].
Theorem 4.4. Let z ∈ Z and let Sc(z) denote the scalar curvature at z. Then
$$D_{\mathrm{KL}}(q_Z(t,z,\cdot)\,\|\,\vartheta) = -\frac{d}{2}\log(2\pi t) + \log[\mathrm{Vol}(Z)] - \frac{d}{2} + \frac{\mathrm{Sc}(z)}{4}\,t + O(t^2) \qquad (t \downarrow 0),$$
where ϑ is the normalized Riemannian volume on Z.
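A numerical sanity check (an illustration, not part of the proof): on the unit circle we have d = 1, Vol = 2π and Sc = 0, and with the Brownian-motion convention the heat kernel is the wrapped Gaussian below, so the expansion predicts D_KL ≈ -½ log(2πt) + log(2π) - ½.

```python
import numpy as np

# Heat kernel on the unit circle for dq/dt = (1/2) q'': the wrapped Gaussian.
t = 0.05
theta = np.linspace(0.0, 2*np.pi, 4096, endpoint=False)
q = sum(np.exp(-(theta + 2*np.pi*k)**2 / (2*t)) for k in range(-6, 7))
q /= np.sqrt(2*np.pi*t)

# KL against the normalized volume: integral of q*log(q) plus log(Vol).
kl_numeric   = np.mean(q * np.log(q)) * 2*np.pi + np.log(2*np.pi)
kl_predicted = -0.5*np.log(2*np.pi*t) - 0.5 + np.log(2*np.pi)   # Sc = 0 here
print(kl_numeric, kl_predicted)
```

Since Sc = 0 on the flat circle, this only probes the -d/2·log(2πt), log Vol and -d/2 terms; a curved example (e.g. the round sphere) would be needed to test the scalar-curvature term.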

Computation of c_2(z)
In this last subsection, we are going to prove the following theorem.
Theorem 4.5. With the notations of Theorem 3.1, we have:

Proof. We are going to use the method outlined in Theorem 3.1.
Step 1. We have the two functions F and G. We can get the required terms in the Taylor series of F and G by using the Taylor expansions in Equation (24): terms without t should be of degree at most 4 in the components of y, terms with t should be of degree at most 2 in the components of y, and terms with t² should be of degree 0 in the components of y. We compute these terms using that E_1 = 0, as stated in Lemma 4.3 (i). By using the Taylor series at 0 of log(1 + x), we get the expansion of log(u_0(0, y) + t u_1(0, y) + t² u_2(0, y)). By combining this with the expression of G, we obtain the required expansion.

Step 2. By using the change of variables y = √t Y and the notations in Equation (25), we obtain the two polynomials P_2(Y) and Q_2(Y).
Step 3. All we need to do now is to compute Gaussian integrals. By using Lemma 2.7, we find the value of c_2(z). Let us compute each term of Equation (27) by using the results of Lemma 4.3.

Lemma 2.5. Let f be an odd and integrable function on a ball B of center 0 of R^d. Then ∫_B f(x) dx = 0.

Proof. Consider the change of variables y = -x. The Jacobian of this change of variables has absolute value 1. Therefore ∫_B f(x) dx = ∫_B f(-y) dy = -∫_B f(y) dy, and we have the result. □

Lemma 2.6. Let σ be a positive real number and let y = (y^1, ..., y^d) ∼ N(0_d, σ I_d) be a d-dimensional Gaussian random variable.

In addition, we have Y ∈ B_{ε/√t} because y ∈ B_ε.