Regularization lemmas and convergence in total variation

We provide a simple abstract formalism of integration by parts under which we obtain several regularization lemmas. These lemmas apply to any sequence of random variables $(F_n)$ which is smooth and non-degenerate in a suitable sense, and they enable one to upgrade the distance of convergence from smooth Wasserstein distances to total variation in a quantitative way. This is a well studied topic; one can consult for instance Bally and Caramellino [Electron. J. Probab. 2014], Bogachev, Kosov and Zelenov [Trans. Amer. Math. Soc. 2018], Hu, Lu and Nualart [J. Funct. Anal. 2014], Nourdin and Poly [Stoch. Proc. Appl. 2013] and the references therein for an overview of this issue. All of the aforementioned references share the fact that some non-degeneracy is required along the whole sequence. We provide here the first result removing this costly assumption: we require non-degeneracy only at the limit. The price to pay is to control the smooth Wasserstein distance between the Malliavin matrix of the sequence and the Malliavin matrix of the limit, which is particularly easy in the context of a Gaussian limit, since its Malliavin matrix is deterministic. We then recover, in a slightly weaker form, the main findings of Nourdin, Peccati and Swan [J. Funct. Anal. 2014]. Another application concerns the quantitative approximation of the semi-group of a diffusion process by the Euler scheme under the Hörmander condition.


Introduction
The main historical application of Malliavin calculus, introduced in 1975 by Paul Malliavin, was a probabilistic proof of the Hörmander regularity criterion. But in the last 40 years it has given rise to a wealth of applications; in particular it has been developed as a branch of stochastic analysis on the Wiener space, see the classical book of Nualart [21], as well as the more recent area of research pioneered by Nourdin and Peccati, see [16]. There is a major philosophical difference between the two aforementioned views of Malliavin calculus: the so-called Malliavin-Stein method (of Nourdin and Peccati), which has been intensively studied in the recent past, mixes the formalism of integration by parts provided by the Malliavin calculus operators with Stein's method. Let us recall that the quintessence of Stein's method consists in identifying a suitable functional operator which characterizes a specific target and using it to prove convergence towards this target in a quantitative way. The most emblematic example is certainly the univariate standard Gaussian distribution γ, which is characterized by the equation ∫(f′(x) − x f(x)) γ(dx) = 0 for every test function f. The link with Malliavin calculus appears in the identity E[X f(X)] = E[f′(X) Γ[X, −L^{-1}X]], where Γ is the square field (carré du champ) operator on the Wiener space and L^{-1} is the pseudo-inverse of the Ornstein-Uhlenbeck operator. The quantity of interest is then Γ[X, −L^{-1}X], which is different from the quantity Γ[X, X] which is standard in Malliavin calculus. Hence, although these two points of view are rather close, as they both employ Malliavin calculus to compute distances between distributions, they go in different directions: Malliavin-Stein methods focus on specific targets with specific operators in order to provide rates of convergence, whereas regularization lemmas focus on smoothness of distributions and on upgrading distances of convergence.
The present article explores this second direction: we do not aim at proving limit theorems but rather, given a limit theorem, we investigate the strongest probabilistic distances in which it holds and the smoothness of the laws involved. In this sense, the two approaches are complementary.
To do so, we introduce an abstract framework built on Dirichlet form theory in which such properties may be obtained by integration by parts techniques. These techniques are very similar to standard Malliavin calculus but are presented in a more general framework which goes far beyond the sole case of the Wiener space. In particular, we aim at providing a minimalist setting leading to our regularization lemma. Our unified framework includes standard Malliavin calculus and several known variants: the "lent particle" approach for Poisson point measures (developed by Bouleau and Denis [11]), the calculus based on the splitting method developed and used in [2,3,5], as well as the Γ-calculus in [9].
The first aim of this paper is to present, in this unified framework, a regularization lemma (see Theorem 3.1) which bounds |E(f(F)) − E(f_δ(F))| in terms of δ, η, P(det σ_F ≤ η) and the Sobolev norms of F. Here f_δ = f ∗ φ_δ is the regularization of f by convolution with a super kernel φ_δ (see (3.1), (3.2) and (3.3)). We use the abstract version of Malliavin calculus for F: then σ_F is the Malliavin covariance matrix and C_q(F) is a quantity which involves the Malliavin-Sobolev norms of F up to order q. The inequality holds for every δ > 0, η > 0 and every q ∈ N, so one may play on these parameters according to the problem at hand. A more powerful variant of the above lemma involves derivatives of the test function f and bounds |E(∂_γ f(F)) − E(∂_γ f_δ(F))|, where ‖f‖_{m,∞} = Σ_{|β|=m} ‖∂_β f‖_∞ and m = |γ|. Such an inequality makes it possible to handle convergence in distribution norms for the law of F_n to the law of F. Applications of such convergence results are given in [3,5].
One important application of the regularization lemma consists in proving that, if a sequence F_n → F in a distance involving smooth test functions (such as the Wasserstein distance), then it converges also in total variation distance. Of course, in order to get such a result, we need F_n to be smooth, in order to control C_q(F_n), and (more or less) non-degenerate, in order to control P(det σ_{F_n} ≤ η). Actually, according to the non-degeneracy properties at hand, several variants of the convergence result are obtained.
Let us give an informal version of these results. Assume first that we have the uniform non-degeneracy condition Q_p := sup_n E((det σ_{F_n})^{-p}) < ∞ for every p. Then we prove (see Lemma 3.8 for a precise statement) that, for every given ε > 0,

    d_TV(F, F_n) ≤ C × d_W(F, F_n)^{1−ε},    (1.2)

where d_TV is the total variation distance and d_W is the Wasserstein distance. Here C is a constant which depends on the Sobolev norms and on the "non-degeneracy" constant Q_p for some p large enough. Notice that we lose something, because we get the power 1 − ε instead of 1 for d_W(F, F_n). This is a technical drawback of our method, which is based on an optimization procedure. A more careful examination of this optimization procedure is likely to provide only logarithmic losses, but this would result in highly technical computations which fall beyond the scope of this paper. Let us emphasize that the previous estimate requires non-degeneracy assumptions along the whole sequence (F_n), which may in general be rather hard to check. Non-degeneracy of the sequence (F_n) may sometimes be provided by classical anti-concentration estimates. For instance, when the underlying Gaussian functionals are polynomials, the Carbery-Wright estimate gives a kind of non-degeneracy, but in a much weaker form. The reader can consult [10,20] for results in this direction. Another reference of interest is [13], where convergence of densities is explored when the limit is Gaussian, under the same non-degeneracy assumption. Finally, let us mention [22], which shows that in the particular setting of quadratic forms of Gaussian vectors, central convergence automatically implies the required non-degeneracy assumptions, so that the previous results apply under the sole assumption of Gaussian convergence.
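The optimization mechanism behind the exponent 1 − ε can be sketched as follows. This is a schematic computation under the simplifying (and hypothetical) assumption that the regularization error reduces to a single term Cδ^q; the actual statements also carry the terms in η and P(det σ_F ≤ η).

```latex
% Sketch: why optimizing over delta yields the power 1 - epsilon.
% Assumption (for the sketch only): |E f(G) - E f_delta(G)| <= C delta^q
% for G = F, F_n, uniformly in n, for every q.
\begin{align*}
|E f(F) - E f(F_n)|
 &\le |E f_\delta(F) - E f_\delta(F_n)|
    + |E f(F) - E f_\delta(F)| + |E f(F_n) - E f_\delta(F_n)| \\
 &\le \|\nabla f_\delta\|_\infty \, d_W(F,F_n) + 2C\delta^q
  \le C\big(\delta^{-1} d_W(F,F_n) + \delta^q\big),
\end{align*}
since $\|\nabla f_\delta\|_\infty \le \|f\|_\infty \|\nabla\varphi_\delta\|_{L^1}
\le C\delta^{-1}$ for $\|f\|_\infty \le 1$. Optimizing with
$\delta = d_W(F,F_n)^{1/(q+1)}$ yields
\[
d_{TV}(F,F_n) \le C\, d_W(F,F_n)^{\,q/(q+1)},
\]
and since $q$ is arbitrary, $q/(q+1) \ge 1-\varepsilon$ for any $\varepsilon>0$,
at the price of a constant $C = C(\varepsilon)$.
```

This also explains why a sharper handling of the same optimization could at best replace the ε-loss by a logarithmic one.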
In order to bypass this major issue, we obtain a variant of the above estimate without assuming anything on the non-degeneracy of F_n (so Q_p may be infinite). In Lemma 3.10 we prove that, for every ε > 0,

    d_TV(F, F_n) ≤ C × (d_W(F, F_n) + d_W(σ_F, σ_{F_n}))^{1−ε},    (1.3)

where C depends on the Sobolev norms and on ε only (and not on the non-degeneracy constant Q_p).
In concrete examples it may be difficult to estimate d_W(σ_F, σ_{F_n}) precisely, but then one may use the standard upper bound d_W(σ_F, σ_{F_n}) ≤ C‖DF − DF_n‖_1. Doing this is not innocent, because one replaces a "weak" distance with a "strong" one, and this may induce a serious loss of accuracy: for example, the weak distance may be of order 1/n while the strong distance is of order 1/√n. However, the aforementioned result completely covers the case of central convergence since, in this case, σ_F is a deterministic matrix and the quantity d_W(F, F_n) is easy to estimate. Using this strategy we recover a central result of the Nourdin-Peccati theory [18] establishing multivariate total variation estimates for suitable sequences converging to a Gaussian limit. The proofs are completely different: the proof in the aforementioned article employs tools of information theory and provides stronger results, such as convergence in entropy. On the other hand, our result is more general and requires much less structural information on the sequence approximating the Gaussian law.
We also illustrate the above results in the framework of the approximation of the semi-group of a diffusion process by the Euler scheme: if one assumes uniform ellipticity, then one has uniform non-degeneracy for the Euler scheme and may use (1.2). But if one works under the Hörmander condition, then the Euler scheme is degenerate, so Q_p = ∞. In [8] this problem was discussed and the authors were obliged to work with a slightly regularized Euler scheme in order to bypass this difficulty. Now one may use (1.3) to get the convergence for the true Euler scheme (without regularization). But one loses accuracy: we pass from 1/n to 1/√n, so the result is not optimal. A last type of result concerns the distance between density functions. This issue has already been discussed in [2]. Here, in Theorem 3.13, we prove the following: if F and G are smooth and non-degenerate, then the density functions p_F and p_G exist and are smooth; moreover, for every multi-index α and every ε > 0, the distance between ∂_α p_F and ∂_α p_G is controlled by d_k(F, G)^{1−ε}. This is a striking improvement with respect to the estimate obtained in [2], see (2.53) there.

Abstract framework
In this section we present an abstract framework which covers most of the known variants of Malliavin calculus and which allows us to obtain the integration by parts formulas that we need. We consider a probability space (Ω, F, P) and a subset E ⊂ ∩_{p>1} L^p(Ω; R). The guiding examples are E = S and E = D^∞ (respectively the space of simple functionals and the space of smooth functionals in classical Malliavin calculus). We assume that for every φ ∈ C_p^∞(R^d) (smooth functions with polynomial growth) and every F ∈ E^d one has φ(F) ∈ E. In particular, E is an algebra. In the sequel we will also use the following consequence: for η > 0 we denote by Ψ_η : (0, ∞) → R a smooth function which is equal to zero on (0, η/2) and to one on (η, ∞); then, for every η > 0 and every F ∈ E, Ψ_η(F) ∈ E. Moreover, we consider the following two operators.
♣ Γ : E × E → E, a symmetric bilinear operator. In the language of Dirichlet forms, Γ is the carré du champ operator. Notice that, since Γ(F, G) ∈ E and E is an algebra, if F, G, H ∈ E then Γ(F, G)H ∈ E. We may also define Γ(F, Γ(G, H)).
♣ L : E → E, a linear operator.
We assume the chain rule: for every φ ∈ C_p^∞(R^d), F ∈ E^d and G ∈ E,

    Γ(φ(F), G) = Σ_{i=1}^d ∂_iφ(F) Γ(F_i, G),

and the duality relation

    E(Γ(F, G)) = −E(F LG) = −E(G LF).

In particular, taking φ(x, y) = xy, we obtain Γ(FG, H) = F Γ(G, H) + G Γ(F, H). Notice that we also have the following extension of the duality formula: using the duality first and the chain rule for the function φ(x, y) = xy, we get, for every F, G, H ∈ E,

    E(H Γ(F, G)) = −E(GH LF) − E(G Γ(F, H)).

This gives the standard integration by parts formula that we present now.
Let F ∈ E^d and let σ_F = (σ_F^{i,j})_{i,j=1,...,d}, with σ_F^{i,j} = Γ(F_i, F_j), be its Malliavin covariance matrix. We suppose that σ_F is invertible, we denote γ_F = σ_F^{-1}, and we assume that γ_F^{i,j} ∈ E for every i, j. Then, for every f ∈ C_p^∞(R^d) and every G ∈ E, an integration by parts formula E(∂_i f(F) G) = E(f(F) H_i(F, G)) holds, with a weight H_i(F, G) ∈ E built from γ_F, LF and Γ. Moreover, iterating this relation, we get, for every multi-index α, E(∂_α f(F) G) = E(f(F) H_α(F, G)).

Remark 2.2
If σ_F is invertible almost surely and (det σ_F)^{-1} ∈ E, then (2.6) is verified.
Proof. We use the chain rule to differentiate (det σ_F)^{-1} and the entries of σ_F, and we conclude by (2.5).
The non-degeneracy hypothesis considered in the previous lemma is sometimes too strong (this is the case in our framework). So we now present a localized version of the previous integration by parts formula. We recall that Ψ_η(x) is a smooth function which vanishes for x ≤ η/2 and is equal to one for x ≥ η. The localized inverse γ_{F,η}^{i,j} is built by means of Ψ_η(det σ_F), and by (2.1) we know that γ_{F,η}^{i,j} ∈ E.
Moreover, iterating this relation, we get the higher order localized integration by parts formula. Proof. The proof is almost the same as above. The only change is that in the first step we multiply by Ψ_η(det σ_F). On the set {Ψ_η(det σ_F) > 0} the matrix σ_F is invertible, so the computation goes through, and we conclude by (2.5).

Norms
In order to give estimates of H_α(F, G), we need to assume that Γ is given by a derivative operator, as follows (this is actually always the case, see Mokobodzki [15]):
♣ We assume that there exists a separable Hilbert space H and a linear map D : E → ∩_{p>1} L^p(Ω; H) such that Γ(F, G) = ⟨DF, DG⟩_H. We also assume the chain rule: Dφ(F) = Σ_{i=1}^d ∂_iφ(F) DF_i for φ ∈ C_p^∞(R^d) and F ∈ E^d. Then we may define higher order derivatives in the following way.
In a similar way we define, by recurrence, the higher order derivatives D^{(k)}F. We now introduce the norms |F|_q, ‖F‖_q and the constants C_q(F). Notice that, since H is separable, we may take an orthonormal basis (e_i)_{i∈N} and denote the corresponding coordinates of DF. Then the following lemma is proved in the Appendix of [3]: for every k and n there exists a constant C such that, for every multi-index ρ with |ρ| ≤ n, one has an estimate of H_ρ(F, G) in terms of these norms. As an immediate consequence of (2.18) and of (2.12) we get
Corollary 2.5 Let F ∈ E^d and η > 0. Then the analogous estimate holds for the localized weights built with Ψ_η(det σ_F).

Regularization Lemma
We now proceed to the regularization lemma. We recall that a super kernel φ : R^d → R is a function which belongs to the Schwartz space S (infinitely differentiable functions with rapid decrease) and such that

    ∫ φ(x) dx = 1,   ∫ y^α φ(y) dy = 0 for |α| ≥ 1,    (3.1)

and, for every multi-indices α and β, ∫ |y^α ∂_β φ(y)| dy < ∞ (3.2). As usual, if α = (α_1, ..., α_m) then y^α = Π_{i=1}^m y_{α_i}. Since super kernels play a crucial role in our approach, we give here the construction of such an object. We do it in dimension d = 1 and then take tensor products. We take ψ ∈ S which is symmetric and equal to one in a neighborhood of zero, and we define φ = F^{-1}ψ, the inverse Fourier transform of ψ. Since F^{-1} sends S into S, the property (3.2) is verified. And we also have, for m ≥ 1, 0 = ψ^{(m)}(0) = i^{-m} ∫ x^m φ(x) dx, so (3.1) holds as well. We finally normalize in order to obtain ∫ φ = 1.
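The role of the vanishing-moment condition (3.1) can be illustrated numerically. The sketch below is a toy computation, not from the paper: it convolves f = sin with a Gaussian kernel of width δ. A Gaussian is not a super kernel, since only its odd moments vanish, so the smoothing bias is of order δ², whereas a super kernel would push it to δ^q for every q. Here the bias at x = π/2 is known in closed form: 1 − e^{−δ²/2}.

```python
import math

# Regularization by convolution: f_delta(x) = \int f(x - y) phi_delta(y) dy.
# Here phi_delta is a centered Gaussian density of standard deviation delta --
# NOT a super kernel (its second moment does not vanish), so the bias is O(delta^2).
def convolve_at(f, x, delta, n=4001, width=10.0):
    # trapezoidal quadrature on [-width*delta, width*delta]
    h = 2 * width * delta / (n - 1)
    total = 0.0
    for i in range(n):
        y = -width * delta + i * h
        w = math.exp(-y * y / (2 * delta * delta)) / (delta * math.sqrt(2 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * f(x - y) * w * h
    return total

delta = 0.1
# closed form: (sin * phi_delta)(x) = exp(-delta^2/2) * sin(x)
bias = math.sin(math.pi / 2) - convolve_at(math.sin, math.pi / 2, delta)
print(bias)  # ~ 1 - exp(-delta^2/2) ~ delta^2/2 = 0.005
```

Replacing the Gaussian by a kernel with q vanishing moments would make this bias of order δ^q, which is exactly what the optimization over δ in the lemmas below exploits.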
We fix a super kernel φ. For δ ∈ (0, 1) and for a function f we define f_δ = f ∗ φ_δ, where φ_δ(y) = δ^{-d} φ(y/δ). For every q, m ∈ N there exists a universal constant C (depending on q and m only) such that, for every multi-index γ with |γ| = m, every δ > 0 and every η > 0, the difference |E(∂_γ f(F)) − E(∂_γ f_δ(F))| is bounded in terms of ‖f‖_∞, P(det σ_F ≤ η), δ, η and C_{q+m}(F), with C_{q+m}(F) given in (2.15). In particular, one may take m = 0.
Remark 3.2 A similar estimate holds for |E(G ∂_γ f(F)) − E(G ∂_γ f_δ(F))| with G ∈ E. But in this case one has to replace P(det σ_F ≤ η) by ‖G‖_2 P^{1/2}(det σ_F ≤ η), and C_{q+m,1}(F) by |G|_{q+m+1,2} C_{q+m,2}(F), in the right hand side of (3.5). The proof is the same.
Proof. The proof is given in [3] in a particular framework but, for the convenience of the reader, we recall it here. Using a Taylor expansion of order q (with integral remainder) and (3.1), we obtain the expression of the regularization error; by a change of variables we rewrite it, and as a consequence of (2.19) we bound the resulting terms, treating separately the remainder terms with |α| = q. A similar inequality holds for ∂_γ f_δ, and so we conclude.
Example. Let F = Δ² with Δ a standard normal random variable. We use standard Malliavin calculus to see what (3.10) gives in this case. We have DF = 2Δ, so that σ_F = 4Δ², and (3.9) reads

    P(σ_F ≤ η) = P(4Δ² ≤ η) ≤ C η^{1/2}.

So κ = 1/2 here, and (3.10) gives (taking q → ∞) an estimate with exponent 1/4.

But some informal computations on the density of Δ² give the exponent 1/2.

So our calculus is not sharp: we get 1/4 instead of 1/2.
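As a sanity check on κ = 1/2 (a toy computation, not from the paper): P(4Δ² ≤ η) = P(|Δ| ≤ √η/2) can be evaluated exactly with the error function, and it behaves like √(η/(2π)) as η → 0.

```python
import math

# Small-ball probability of sigma_F = 4*Delta^2 for Delta ~ N(0,1):
# P(4*Delta^2 <= eta) = P(|Delta| <= sqrt(eta)/2) = erf(sqrt(eta) / (2*sqrt(2)))
def small_ball(eta):
    return math.erf(math.sqrt(eta) / (2.0 * math.sqrt(2.0)))

for eta in (1e-2, 1e-4, 1e-6):
    # the ratio tends to 1/sqrt(2*pi) ~ 0.3989, confirming kappa = 1/2
    print(eta, small_ball(eta) / math.sqrt(eta))
```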
In the previous lemma we have not assumed that σ_F is invertible, preferring to keep the term P(det σ_F ≤ η). We now give a variant of the above lemma under the non-degeneracy assumption (3.11) for F: for every p ≥ 1, Q_p(F) := E((det σ_F)^{-p}) < ∞, with C_l(F) given in (2.15).
Let φ be a super kernel and F ∈ E^d such that (3.11) holds. For every q, m ∈ N there exists a universal constant C (depending on q and m only) such that, for every multi-index γ with |γ| ≤ m, every δ > 0 and every η > 0, the estimate (3.13) holds.
Proof. We follow the same reasoning as in the previous proof and come back to (3.7), but we no longer multiply by Ψ_η(det σ_F). Using the standard integration by parts formula (2.9) we obtain, for some p ≥ 1, the corresponding bound, and by (3.8) we conclude.
In [3] one gives the following more sophisticated version of the regularization lemma for smooth functions with polynomial growth. More precisely, we denote by C_p^∞(R^d) the space of smooth functions f such that, for every q ∈ N, there exist L_q(f) and l_q(f) such that, for every multi-index α with |α| ≤ q and every x, |∂_α f(x)| ≤ L_q(f)(1 + |x|)^{l_q(f)}.

Then we have the following result (see [3], Lemma 5.3).
Lemma 3.5 Let F ∈ E^d and q, m ∈ N. There exists a constant C ≥ 1, depending on d, m and q only, such that for every f ∈ C_pol^{q+m}(R^d), every multi-index γ with |γ| = m and every η, δ > 0 and p > 1, the estimate (3.14) holds.

Distances
Let us introduce the following distances: for k ∈ N,

    d_k(F, G) = sup { |E(f(F)) − E(f(G))| : Σ_{|α|≤k} ‖∂_α f‖_∞ ≤ 1 }.

In the case k = 0 the test functions f are just measurable and bounded, and d_0 is the so-called "total variation distance", which we denote by d_TV. Another interesting distance is the "Wasserstein distance" d_W, built on Lipschitz test functions. In many problems the estimate of the error involves some Taylor type expansions, and then the test functions have to be differentiable and the norms of their derivatives come in; so one is able to estimate d_k for some k. One then asks about the possibility of obtaining estimates for measurable test functions, as in the total variation distance, and the regularization lemma presented before allows one to do it. We give several forms of such a result. But, as we will show, the regularization lemma can be applied to other distances as well. As an example, in the sequel we also consider the following distance between random vectors in R^d: we set

    d_CF(F, G) = sup_{ξ∈R^d} |E(e^{i⟨ξ,F⟩}) − E(e^{i⟨ξ,G⟩})|,

so d_CF is the supremum distance between the characteristic functions of F and G. There are many situations where it is easier to obtain bounds on the difference of characteristic functions, especially when the targets under consideration are infinitely divisible. One may for instance consult [1] for an introduction to the Stein's method theory for this kind of distribution. Again, the regularization lemma allows one to pass from such a distance to the distance in total variation.
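A toy numerical comparison of these distances (an illustrative pair of laws, not from the paper): for F ~ N(0, 1) and G ~ N(0, 1 + h), both d_TV and d_CF are computable by quadrature, and the general inequality d_CF ≤ 2 d_TV, which follows from |E(e^{i⟨ξ,F⟩}) − E(e^{i⟨ξ,G⟩})| ≤ 2 d_TV(F, G), can be checked directly.

```python
import math

h = 0.1  # variances 1 and 1 + h (illustrative choice of the pair F, G)

def tv_distance():
    # d_TV(F, G) = (1/2) * integral over R of |p_F - p_G|
    s, n, width = 0.0, 20001, 10.0
    dx = 2 * width / (n - 1)
    for i in range(n):
        x = -width + i * dx
        p = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        q = math.exp(-x * x / (2 * (1 + h))) / math.sqrt(2 * math.pi * (1 + h))
        s += abs(p - q) * dx
    return s / 2

def cf_distance():
    # d_CF(F, G) = sup_xi |E e^{i xi F} - E e^{i xi G}|; both characteristic
    # functions are real here: exp(-xi^2/2) and exp(-xi^2 (1+h)/2)
    return max(abs(math.exp(-xi * xi / 2) - math.exp(-xi * xi * (1 + h) / 2))
               for xi in (j * 0.001 for j in range(10001)))

tv, cf = tv_distance(), cf_distance()
print(tv, cf)  # here d_TV < d_CF <= 2 * d_TV
```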
The key remark which allows us to use the regularization lemma is the following.
Lemma 3.6 Let φ_δ be the super kernel introduced in the previous section and let F, G denote random vectors in R^d. We also fix a multi-index β with |β| = r (including the void multi-index, in which case r = 0). Then for every k ∈ N there exists a constant C depending on k only such that (3.17) holds. We also have, for a universal constant C > 0, the estimate (3.18). And moreover, for every ε > 0, one may find C (depending on ε) such that (3.19) holds.
Proof. The first estimate is an immediate consequence of the definition of d_k. Let us prove (3.18). Take first f ∈ S (the Schwartz space). Let Ff(ξ) = ∫ f(x) e^{−2πi⟨x,ξ⟩} dx denote the Fourier transform of f and let F_F(ξ) = E(e^{i⟨F,ξ⟩}) be the characteristic function of F; then we use the inverse Fourier transform. Take α such that ∂_α = ∂_1² ··· ∂_d² and notice that, by integration by parts, one controls Ff(ξ) by the derivatives of f. We use this formula with f ∗ φ_δ instead of f, and by inserting the resulting bound, (3.18) is proved. In order to prove (3.19) we use a truncation argument. Notice that for every q one has ∫ |y|^q |∂_β φ_δ(y)| dy = δ^{q−r} ∫ |y|^q |∂_β φ(y)| dy.
Moreover, since F has finite moments of any order, the contribution of large |y| is under control. So, for every q ≥ r, the corresponding bound holds; the same is true for G and, combining the two previous estimates, we obtain an inequality valid for every truncation level M. We optimize over M and we obtain (3.19). The proof of (3.20) is the same.
Lemma 3.7 Let F, G ∈ E^d be such that (3.9) holds true for some fixed κ > 0. Then (3.21) holds; moreover, for every ε > 0, (3.23) holds.
Proof. By (3.17) (with r = 0) and (3.10) we get an estimate depending on δ; we optimize over δ and we get (3.21). The proof of (3.23) is the same, but one employs (3.19).
We give now a result under the strong non degeneracy condition (det σ F ) −1 ∈ ∩ p≥1 L p .
Lemma 3.8 Let F, G ∈ E^d be such that Q_q(F) + Q_q(G) < ∞ for every q ∈ N (see (3.12)). Then, for every q, k ∈ N there exists a constant C (depending on q and k only) such that (3.24) holds. In particular, for every ε > 0,

    d_TV(F, G) ≤ C d_k^{1−ε}(F, G),    (3.25)

and a similar estimate holds for d_CF.
Proof. By (3.13) we get the regularization estimate for F, and a similar estimate holds for G. Moreover, by (3.17), we control the difference of the regularized expectations by d_k(F, G). We optimize over δ and we get (3.24). Using (3.19) we obtain, in the same way, (3.26).
Remark 3.9 Compare with Corollary 2.8, pg 11/33 in [2]: there the estimate involves d_k^{1/(k+1)}(F, F_n), which is much weaker. The reason is that we now have a much stronger regularization lemma: compare the estimate (2.29), pg 8/33 in [2], with (3.6) here. Here we have "for every q", and this is what gives the much better result.
We finish this section with a variant of the previous lemma: we now assume the non-degeneracy condition (det σ_F)^{−1} ∈ ∩_{p≥1} L^p for F, but we assume no non-degeneracy condition on G.
Lemma 3.10 Let F, G ∈ E^d be such that Q_q(F) + C_{q,1}(G) < ∞ for every q ∈ N (see (3.12)). Then, for every p, p′ ∈ N and every ε > 0 there exists a constant C (depending on ε, p, p′) such that (3.27) holds.
Proof. Take η > 0 and take φ_η ∈ C_b^∞(R_+) such that 1_{(2η,∞)} ≤ φ_η ≤ 1_{(η,∞)} and ‖φ_η^{(k)}‖_∞ ≤ Cη^{−k}. We recall that σ_G is the Malliavin covariance matrix of G, and we write the corresponding decomposition, the last inequality being true for every ρ. Then, using the regularization lemma, we get a bound valid for every δ > 0 and every q ∈ N. We also have P(det σ_F ≤ η) ≤ Q_ρ(F) η^ρ for every ρ, so that the regularization lemma for F gives the analogous bound. Putting these together, we fix ε > 0 and choose δ and η: first take δ = d_{p,p′}^{2ε}, with d_{p,p′} denoting the quantity d_p(F, G) + d_{p′}(det σ_F, det σ_G); then choose q = 1/ε and ρ = 2/ε. The above estimate then yields (3.27). In order to prove (3.29) we proceed as before, but we use (3.20) and we use (3.19) instead of (3.17).

Remark 3.11
The above lemma essentially says that if (F_n, det σ_{F_n}) → (F, det σ_F) in some "smooth distance" d_p (for example in the Wasserstein distance d_W), then F_n → F in total variation distance, and one obtains an estimate of the speed of convergence: one loses a little because of the power 1 − ε. The striking fact is that we do not need the non-degeneracy condition for F_n, but only for F.

Remark 3.12
In concrete applications it may be difficult to compute d_{p′}(det σ_F, det σ_G). Then we are obliged to come back to "strong distances": we have

    d_1(det σ_F, det σ_G) ≤ C(‖DF‖^{d−1} + ‖DG‖^{d−1}) ‖DF − DG‖

(in suitable norms), so we take p′ = 1 and use this bound. Notice however that here we lose the right order of convergence. For example, in the case of the convergence of the Euler scheme X_t^n to the diffusion process X_t we have d_1(X_t^n, X_t) ≤ C/n, but ‖DX_t^n − DX_t‖ ∼ C/n^{1/2}. This is because we deal with the weak convergence in the first case and with the strong convergence in the second one. So we pass from 1/n to 1/n^{1/2}. If we have an ellipticity property, then we do not need to estimate ‖DX_t^n − DX_t‖, so we stay at the level 1/n.
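The weak/strong gap mentioned above can be observed on a toy example (an illustrative sketch, not the diffusion setting of Section 4): for the scalar equation dX = X dW with X_0 = 1, the exact solution is X_T = exp(W_T − T/2), the weak error for the test function g(x) = x² is available in closed form, and the strong error can be estimated by Monte Carlo using the same Brownian increments for the exact solution and the Euler scheme.

```python
import math
import random

T, trials = 1.0, 20000
random.seed(0)

def errors(n):
    # weak error for g(x) = x^2, in closed form:
    # E[X_T^2] = exp(T) while E[(X_T^n)^2] = (1 + T/n)^n, so the gap is ~ C/n
    weak = math.exp(T) - (1.0 + T / n) ** n
    # strong error E|X_T - X_T^n|, estimated by Monte Carlo with the SAME
    # Brownian increments for the exact solution and the scheme (~ C/sqrt(n))
    acc = 0.0
    for _ in range(trials):
        w, euler = 0.0, 1.0
        for _ in range(n):
            dw = random.gauss(0.0, math.sqrt(T / n))
            w += dw
            euler *= 1.0 + dw                    # Euler step for dX = X dW
        acc += abs(math.exp(w - T / 2) - euler)  # exact solution exp(W_T - T/2)
    return weak, acc / trials

w4, s4 = errors(4)
w16, s16 = errors(16)
print("weak:", w4, w16)    # shrinks roughly by 4 when n is multiplied by 4
print("strong:", s4, s16)  # shrinks roughly by 2 only
```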

The distance between density functions
In this section we estimate the distance between density functions, under the non-degeneracy condition.
Proposition 3.13 Let F, G ∈ E^d be such that Q_q(F) + Q_q(G) < ∞ for every q ∈ N (see (3.12)). Then for every k ∈ N, every multi-index α = (α_1, ..., α_m) and every ε > 0 there exist constants C and q (depending on ε, k and m) such that (3.31) holds. In particular, if p_F and p_G denote respectively the density functions of F and G, then both exist, are smooth, and for every ε > 0 the estimate (3.32) holds. The same estimates hold with d_k^{1−ε}(F, G) replaced by d_CF^{1−ε}(F, G).
Remark 3.14 Compare with the estimate (2.53), pg 14/33 in [2]: here the estimate is much better because, using d_1 for example, we have simply d_1^{1−ε}(F, G) ≤ (E(|F − G|))^{1−ε}, and the Sobolev norms are not involved (as they are in [2]).
Proof. We will prove just (3.31), because (3.32) follows by standard regularization arguments. To begin, we use (3.13) and get the regularization estimate for F, and a similar estimate for G. Moreover, using (3.17), we control the difference of the regularized quantities. Optimizing over δ we get (3.31).

Euler Scheme
In this section we consider the d-dimensional diffusion process

    X_t = x + Σ_{j=1}^m ∫_0^t σ_j(X_s) dW_s^j + ∫_0^t b(X_s) ds.

We are concerned with its Euler scheme, defined in the following way: we fix n ∈ N, we set τ(t) = k/n for k/n ≤ t < (k+1)/n, and then

    X_t^n = x + Σ_{j=1}^m ∫_0^t σ_j(X^n_{τ(s)}) dW_s^j + ∫_0^t b(X^n_{τ(s)}) ds.

Let us see what we may obtain using the results from the previous sections. We use the standard Malliavin calculus, so now E = D^∞ (see the notation in Nualart [21]). Under our assumptions on the coefficients (σ_j, b ∈ C_b^∞(R^d; R^d)), standard estimates yield, for every q ∈ N,

    C_q(X_t(x)) + sup_n C_q(X_t^n(x)) = C_q < ∞,

and standard weak error estimates yield d_6(X_t(x), X_t^n(x)) ≤ C/n (4.1).
So, using (3.25) first and (4.1) then, we obtain, for every ε > 0,

    d_TV(X_t, X_t^n) ≤ C d_6^{1−ε}(X_t, X_t^n) ≤ C n^{−(1−ε)}.    (4.4)

Comparing with the result of Guyon (4.2), we see that we have lost a little, because we have the power 1 − ε instead of 1. This is a structural drawback of our method, which is based on an optimization procedure. However, there is a slight gain, because we need ellipticity only at the starting point x and not uniform ellipticity.
Let us now see what we are able to say under Hörmander's condition. We stress that we do not need the uniform Hörmander condition, but only the condition at the starting point x: we just assume that Span(∪_{k∈N} A_k(x)) = R^d. This is sufficient in order to guarantee that det σ_{X_t(x)} > 0 almost surely, and this is all we need. As we mentioned above, we are no longer able to prove that σ_{X_t^n(x)} ≥ λ(x) > 0, so we have to use (3.27) (together with (4.1)):

    d_TV(X_t, X_t^n) ≤ C (d_6(X_t, X_t^n) + d_{p′}(det σ_{X_t}, det σ_{X_t^n}))^{1−ε} ≤ C (1/n + d_{p′}(det σ_{X_t}, det σ_{X_t^n}))^{1−ε}.
Now we have to estimate d_{p′}(det σ_{X_t}, det σ_{X_t^n}). If we were able to prove that, for some p′ ∈ N, one has d_{p′}(det σ_{X_t}, det σ_{X_t^n}) ≤ C/n, then we would come back to the same estimate as in the elliptic case. At first glance this seems reasonable but, on closer inspection, it is not so clear; we do not pursue this question here, and we just notice that easy standard arguments give ‖det σ_{X_t} − det σ_{X_t^n}‖_1 ≤ C/√n, which yields d_1(det σ_{X_t}, det σ_{X_t^n}) ≤ C n^{−1/2}. Finally we obtain

    d_TV(X_t(x), X_t^n(x)) ≤ C n^{−(1−ε)/2}.    (4.5)

We conclude that we are still able to prove that lim_n d_TV(X_t(x), X_t^n(x)) = 0, but we lose much of the speed of convergence.

Central Limit Theorem for Wiener chaoses
Let us fix a sequence of d positive integers 1 ≤ q_1 ≤ q_2 ≤ ··· ≤ q_d and consider a sequence F_n = (F_{1,n}, ..., F_{d,n}) of random vectors such that, for all i ∈ {1, ..., d} and all n ≥ 1, the random variable F_{i,n} belongs to the q_i-th Wiener chaos. We further assume that the covariance matrix of F_n is the identity matrix for every n ≥ 1. A central result of the Nourdin-Peccati theory, established in [18], provides an explicit bound (4.6) in total variation between the distribution of F_n and the distribution of a standard Gaussian vector, say N = (N_1, ..., N_d). Let us mention that an entropic result is actually proved in [18], and the previous bound is the corresponding total variation estimate derived from the Pinsker inequality.
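To fix ideas, here is a toy one-dimensional example (illustrative, not taken from [18]): for the second-chaos sequence F_n = n^{−1/2} Σ_{i=1}^n (ξ_i² − 1)/√2, with ξ_i i.i.d. standard Gaussian, the fourth moment theorem of Nualart and Peccati says that convergence to N(0, 1) is governed by the fourth-moment gap E(F_n⁴) − 3, which can be computed exactly:

```python
import math

def gauss_moment(k):
    # E[xi^k] for xi ~ N(0,1): (k-1)!! for even k, 0 for odd k
    return 0 if k % 2 else math.prod(range(1, k, 2))

# Y = (xi^2 - 1)/sqrt(2): a centered second-chaos variable with E[Y^2] = 1
EY2 = (gauss_moment(4) - 2 * gauss_moment(2) + 1) / 2   # = 1
EY4 = (gauss_moment(8) - 4 * gauss_moment(6) + 6 * gauss_moment(4)
       - 4 * gauss_moment(2) + 1) / 4                   # = 15

def fourth_moment(n):
    # F_n = n^{-1/2} (Y_1 + ... + Y_n) with Y_i i.i.d.:
    # E[F_n^4] = (n*E[Y^4] + 3n(n-1)*E[Y^2]^2) / n^2
    return (n * EY4 + 3 * n * (n - 1) * EY2 ** 2) / n ** 2

for n in (10, 100, 1000):
    print(n, fourth_moment(n) - 3)  # the gap is exactly 12/n here
```

Quantitative versions of the fourth moment theorem bound the distance to the Gaussian by (a power of) this gap, which is the kind of bound that Lemma 3.10 allows one to upgrade to total variation.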
The proof of (4.6) uses clever arguments from information theory which are nevertheless rather specific to Gaussian targets. Our goal here is to apply Lemma 3.10 to this situation and to compare the bounds. First, from [17] one has the following result regarding the 1-Wasserstein distance: