Stein's method for comparison of univariate distributions

We propose a new general version of Stein's method for univariate distributions. In particular we propose a canonical definition of the Stein operator of a probability distribution, which is based on a linear difference or differential-type operator. The resulting Stein identity highlights the unifying theme behind the literature on Stein's method (both for continuous and discrete distributions). Viewing the Stein operator as an operator acting on pairs of functions, we provide an extensive toolkit for distributional comparisons. Several abstract approximation theorems are provided. Our approach is illustrated by comparing several pairs of distributions: normal vs normal, sums of independent Rademacher vs normal, normal vs Student, and maximum of random variables vs exponential, Fréchet and Gumbel.


Introduction
Stein's method is an important tool in applied and theoretical probability, widely used for Gaussian and Poisson approximation problems. The principal aim of the method is to provide quantitative assessments in distributional approximation problems of the form W ≈ Z, where Z follows a known and well-understood probability law (normal, exponential, Gamma, Poisson, etc.) and W is the object of our interest. To this end, Charles Stein [79] in 1972 laid the foundation of what is now called "Stein's method"; his student Louis Chen [18] adapted the method to Poisson approximation, which is why in that setting it is often called the "Stein-Chen method" or "Chen-Stein method". We refer to the monographs [8,62,20] for an overview. Broadly speaking, Stein's method consists of two distinct components, namely

Part A: a framework allowing one to convert the problem of bounding the error in the approximation of W by Z into a problem of bounding the expectation of a certain functional of W.

Part B: a collection of techniques to bound the expectation appearing in Part A; the details of these techniques depend strongly on the properties of W as well as on the form of the functional.

A brief sketch of the method
This equivalence is called a Stein characterization of P. Next let H be a measure-determining class on I. Suppose that for each h ∈ H one can find a solution f = f_h ∈ F(A) of the Stein equation (2) where Z ∼ P. Then, if taking expectations is permitted, the approximation error Eh(W) − Eh(Z) can be expressed in terms of E[A f_h(W)]. There exist a number of probability distances (such as the Kolmogorov, the Wasserstein, and the total variation distance) which can be written in this form; see [62, Appendix E] or [34,72] for an overview. Hence the distance between the laws of W and Z is bounded by a supremum over F(H) = {f_h | h ∈ H}, the collection of solutions of (2) for functions h ∈ H. When only certain features of W are known, for example that it is a sum of weakly dependent random variables, then (3) is the usual starting point for Part B of Stein's method. Now suppose that, furthermore, a Stein operator A_W (and a class F(A_W)) is available for W. Suppose also that F(A_Z) ∩ F(A_W) ≠ ∅ and choose H such that all solutions f of the Stein equation (2) for A_Z and A_W belong to this intersection. Then a similar comparison identity (4) holds, and Stein [79] had the masterstroke of insight that the right-hand side of (3) or (4) provides a handle to assess the proximity between the laws of W and Z; this is precisely the object of Part B of Stein's method. In many cases, not only is the function f_h well defined, but it also possesses smoothness properties which render it particularly amenable to computations. There also exist many ways to evaluate E[Af(W)] or E[A_W f(W) − A_Z f(W)] (even under unfavorable assumptions on W), including exchangeable pairs (as for example in [80,45,74,17,15,23]), biasing mechanisms (as in [3,38,35,68,32]), and other couplings (as in [18,6]); see [73,76,14] for overviews. Nourdin and Peccati [60] paved the way for many elegant results in the context of Malliavin calculus; for an overview see [62]. See also [45,29,30,37,23,31] for examples where direct comparison (using the explicit distribution of W) via (4) is used.
Of course the devil is in the detail: one must find Stein operators which are tractable for the random variables in question, highlighting the need to identify the "right" Stein operator for the desired stochastic approximation.
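To fix ideas with the best-known instance of Part A, recall the classical normal Stein operator Af(x) = f′(x) − xf(x) (this is operator (5) for the standard Gaussian density): E[Af(Z)] = 0 when Z ∼ N(0, 1), while for W far from normal this expectation is typically nonzero. The following minimal Python sketch (the quadrature routine and the test function f = sin are our own illustrative choices, not part of the paper) computes this discrepancy for a Gaussian and for a Uniform(−1, 1) random variable.

```python
import math

def stein_discrepancy(density, f, fprime, lo, hi, n=200_000):
    """Midpoint-rule approximation of E[f'(W) - W f(W)] for a random
    variable W with the given density supported in [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += (fprime(x) - x * f(x)) * density(x) * step
    return total

def phi(x):  # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# E[f'(Z) - Z f(Z)] = 0 for Z ~ N(0,1); nonzero for W ~ Uniform(-1,1).
d_normal = stein_discrepancy(phi, math.sin, math.cos, -10.0, 10.0)
d_unif = stein_discrepancy(lambda x: 0.5, math.sin, math.cos, -1.0, 1.0)
```

With f = sin the Gaussian value is numerically 0 while the uniform one equals cos(1) ≈ 0.54, illustrating how E[Af(W)] quantifies the departure of W from normality.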

Finding Stein operators
There are three classical ways of constructing a Stein operator:
• Stein's density approach [80,81];
• Barbour and Götze's generator approach [4,39];
• Diaconis and Zabell's orthogonal polynomial approach [22].
Barbour and Götze provide probabilistic intuition by viewing the target distribution as the stationary distribution of a suitable Markov process. Some references detailing this approach for univariate distributions are [37,29,45,51,25,52]. Diaconis and Zabell use Rodrigues-type formulas for orthogonal polynomials associated with the target distribution, if these exist; see also [77,36]. Other approaches are possible, see for instance [23] where exchangeable pairs are used.
Stein's density approach relies on the target having a Lebesgue density p with mean ν and using integration by parts to show that the operators (see [81]) or (see [80,Lecture VI]) are valid Stein operators in the sense of (1) under different restrictions on p.
In this paper we propose a generalization of Stein's density approach, in the spirit of [55,56,57], which can be sketched as follows. Suppose that we are given a derivative-type operator D acting on a set of functions on some set X ⊆ R as well as an operator D ⋆ which we suppose to be skew-adjoint to D with respect to integration in a dominating measure µ, i.e.
for all suitable functions f and g. Now consider a random variable X which has density p with respect to µ. If we are allowed to replace f by fp in (7) then this equation becomes a skew-adjointness relation in the measure pdµ; hence the operator T_X defined by T_X(f) = D(fp)/p would be skew-adjoint to D⋆ with respect to integration in pdµ (this will be made precise in Proposition 1). To see the connection with Stein's density approach outlined above, note that if D is the classical derivative, then T_X(f) = f′ + (p′/p)f (provided that p and f are differentiable), hence we recover (5). If we instead replace f by x ↦ τ(x)f(x), with τ the Stein kernel introduced below, we recover (6).
Operator T_X is what we call the canonical Stein operator for X. Working out the classes of functions for which (8) holds will be carried out in Section 2. We will also see that, under natural assumptions, T_X satisfies a product rule from which we will infer an ensemble of Stein equations for X of the form (9), whose solutions are now pairs (f, g) of functions. A key observation is that different Stein operators can be obtained from (9) either by fixing f to be of a certain form and keeping g variable, or by fixing g to be of a certain form and keeping f variable. For example, a choice which has been widely studied is the Stein equation obtained by fixing f = 1, if permitted, which yields (10), where u(x) = T_X 1(x) is the so-called "score function" of X. In the continuous case u(x) = p′(x)/p(x), and thus we recover (5). An alternative choice is the equation obtained by fixing f = τ, the function for which T_X(τ(x)) = ν − x with ν the mean of X (see [60,62,63]); this yields an equation involving τ(x), the so-called "Stein kernel" of X. In the continuous case τ(x) = ∫_x^∞ (y − ν)p(y) dy / p(x) and we recover (6). More generally, as we shall see, one can use our theory of Stein differentiation to obtain all manner of Stein operators (even of higher order) by considering operators A acting on functions f ∈ F(A) of the form Af = T_X(F(f)), for F some transformation of f (e.g. scaling, centering, pre-multiplication by another function, differentiation). Such constructions provide a unifying framework for Part A of Stein's method. As for Part B of Stein's method, let X_1 and X_2 have canonical operators T_1 and T_2 associated with D_1 and D_2, two possibly different derivative-type operators. Using a similar argument as for (4) we will prove that, under reasonable assumptions on h, there exists a function g_0 = g_0(f_1, f_2, h) such that (11) holds for all x in the range of X_2 and for all f_1 and f_2 belonging to well-identified classes (see Theorem 3 for the precise statement).
Identity (11) is a powerful starting point for stochastic approximation problems, as one can handpick the functions f 1 and f 2 best suited to the problem under study.

Outline of the paper
In Section 2 we provide the set-up and we define our canonical Stein operator and its inverse. We derive Stein equations and consider some specifications of the general Stein operator. Thus this section provides a canonical Part A of Stein's method. In Section 3 we discuss different important standardizations (10), linking the canonical Stein equation to the classical Stein equations. In Section 4 we describe our new approach to Part B via (11), and provide abstract approximation theorems. In Section 5 we illustrate the power of our approach by tackling applications to specific approximation problems.

The setup
Let (X, B, µ) be a measure space with X ⊂ R (see Remark 2). Let X⋆ be the set of real-valued functions on X. We require the existence of a linear operator D : dom(D) ⊂ X⋆ → X⋆ such that dom(D) \ {0} is not empty. As is standard we define D^{-1} as the linear operator which sends any h = Df onto f. This operator is a right-inverse for D in the sense that D(D^{-1}h) = h for all h ∈ im(D) whereas, for f ∈ dom(D), D^{-1}(Df) is only defined up to addition of an element of ker(D). We impose the following assumption.
Assumption 1. There exist a linear operator D⋆ acting on X⋆ and a constant l := l_{X,D} such that the product rule (12) holds for all (f, g) ∈ dom(D) × dom(D⋆) and for all x ∈ X.
Example 1 (Lebesgue measure). Let µ be the Lebesgue measure on X = R and take D the usual strong derivative. Then is the usual antiderivative. Assumption 1 is satisfied with D ⋆ = D and l = 0.
Example 2 (Counting measure). Let µ be the counting measure on X = Z and take D = ∆+, the forward difference operator ∆+f(x) = f(x + 1) − f(x). Then D^{-1} is given by summation. Also we have the discrete product rule ∆+(fg)(x) = f(x)∆+g(x) + g(x + 1)∆+f(x) for all f, g ∈ Z⋆ and all x ∈ Z. Hence Assumption 1 is satisfied with D⋆ = ∆−, the backward difference operator ∆−f(x) = f(x) − f(x − 1), and l = −1.
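The two identities invoked in this example — the forward-difference product rule and the discrete summation-by-parts relation linking ∆+ and ∆− — can be checked mechanically. The sketch below does so for rapidly decaying functions on Z (the specific f and g are arbitrary choices of ours):

```python
import math

def dplus(f, x):   # forward difference: f(x+1) - f(x)
    return f(x + 1) - f(x)

def dminus(f, x):  # backward difference: f(x) - f(x-1)
    return f(x) - f(x - 1)

f = lambda x: x * math.exp(-abs(x))          # decays fast at +/- infinity
g = lambda x: math.sin(x) / (1.0 + x * x)

# Product rule: Delta+(fg)(x) = f(x)Delta+g(x) + g(x+1)Delta+f(x), pointwise.
prod_err = max(abs(dplus(lambda y: f(y) * g(y), x)
                   - (f(x) * dplus(g, x) + g(x + 1) * dplus(f, x)))
               for x in range(-5, 6))

# Summation by parts: sum_x f(x)Delta+g(x) = -sum_x g(x)Delta-f(x)
# (both sums are effectively finite because f decays exponentially).
R = range(-200, 201)
lhs = sum(f(x) * dplus(g, x) for x in R)
rhs = -sum(g(x) * dminus(f, x) for x in R)
```

The summation-by-parts identity is the discrete analogue of the integration by parts used throughout this section; the boundary terms vanish here because f decays exponentially.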
Example 3 (Standard normal). Let ϕ be the standard normal density function, so that ϕ′(x) = −xϕ(x). Let µ be the standard normal measure on R and take D = D_ϕ the differential operator defined by D_ϕ f(x) = (f(x)ϕ(x))′/ϕ(x) = f′(x) − xf(x); see e.g. [53]. The inverse D_ϕ^{-1} is the corresponding antiderivative with respect to the Gaussian measure. Also we have the product rule D_ϕ(fg)(x) = g(x)D_ϕ f(x) + f(x)g′(x). Hence Assumption 1 is satisfied with D⋆g = g′ and l = 0.
Example 4 (Poisson). Let γ_λ be the Poisson probability mass function with parameter λ. Let µ be the corresponding Poisson measure on Z+ and take D = ∆+_λ the associated difference operator. The corresponding inverse is, as is well known, ill-defined at x = 0 (see, e.g., [8,5]). We have a product rule analogous to that of Example 2; hence Assumption 1 is satisfied with D⋆g(x) = x∆−g(x) and l = −1.
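In the Poisson setting the resulting theory recovers the classical Chen-Stein identity E[λg(X + 1) − Xg(X)] = 0 for X ∼ Poisson(λ) and bounded g. A minimal numerical check follows (λ and g are arbitrary choices of ours, and the operator shown is the classical one of [18], used here as an independent sanity check rather than the canonical operator of this section):

```python
import math

def poisson_pmf(lam, x):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def chen_stein_mean(lam, g, upper=80):
    """Truncated computation of E[lam*g(X+1) - X*g(X)] for X ~ Poisson(lam).
    The Chen-Stein identity says this vanishes for every bounded g."""
    return sum((lam * g(x + 1) - x * g(x)) * poisson_pmf(lam, x)
               for x in range(upper))

val = chen_stein_mean(2.5, math.sin)   # vanishes up to truncation error
```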

Remark 1.
In all examples considered above the choice of D is, in a sense, arbitrary and other options are available. In the Lebesgue measure setting of Example 1 one could, for instance, use D the derivative in the sense of distributions, or even Df(x) = ∂/∂t f(P_t x) for x ↦ P_t x some transformation of X; see e.g. [56]. In the counting measure setting of Example 2 the roles of backward and forward difference operators can be exchanged; these operators can also be replaced by linear combinations as, e.g., in [44]. The discrete construction also extends easily to general spacing δ ≠ 1: if X = δZ, then the difference operators can be rescaled accordingly. In the Poisson example other difference operators are likewise available. In all cases less conventional choices of D can be envisaged (even forward differences in the continuous setting).

Remark 2.
Nowhere is the restriction to dimension 1 necessary in this subsection. The need for this assumption will become apparent when we use the setup to construct a general version of Stein's method. Indeed although our approach should in principle be able to provide useful insight into Stein's method for multivariate distributions, the method does not fare well in higher dimensions (this fact is well-known, see e.g. [16,63,74]) and we will not discuss multivariate extensions further in this paper.

Canonical Stein class and operator
Following [37] we say that a subset I ⊂ X is a finite interval if I = {a, b} ∩ X for a, b ∈ R with a ≤ b, and an infinite interval if either I = (−∞, b} ∩ X or I = {a, ∞) ∩ X or I = X (provided, of course, that X itself has infinite length).
Here { is used as shorthand for either ( or [, and similarly } is either ) or ]. In the sequel we consistently denote intervals by I = {a, b} where −∞ ≤ a ≤ b ≤ +∞ (we omit the intersection with X unless necessary). Now consider a real-valued random variable X on X such that P_X(A) = P(X ∈ A) for A ∈ B is absolutely continuous w.r.t. µ. Let p = dP_X/dµ be the Radon-Nikodym derivative of P_X; throughout we call p the density of X (even if X is a discrete random variable). In the sequel, we only consider random variables such that p ∈ dom(D) and whose support supp(p) = {x ∈ X | p(x) > 0} =: I is an interval of X. For any real-valued function h on X we write E_p[h] := ∫_I h p dµ; this expectation exists for all functions h : X → R such that E_p|h| < ∞; we denote this set of functions by L^1_µ(p) ≡ L^1_µ(X).

Definition 1. The canonical Stein class F(p) ≡ F(X) (= F_µ(p)) for X is the collection of functions f ∈ L^1_µ(p) such that (i) fp ∈ dom(D), (ii) D(fp) ∈ L^1(µ) and (iii) ∫_I D(fp) dµ = 0. The canonical Stein operator T_p ≡ T_X for p is the linear operator on F(X) defined as T_X f = D(fp)/p, with the convention that T_X f = 0 outside of I.
To avoid triviality we from here on assume that F(X) \ {0} ≠ ∅. Note that F(X) is closed under multiplication by constants. By definition, T_X f ∈ L^1_µ(p) for all f ∈ F(X), and E[T_X f(X)] = ∫_I D(fp) dµ = 0, so that T_X satisfies equation (1), qualifying it as a Stein operator.
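As a concrete instance of the identity E[T_X f(X)] = 0, take X ∼ Exp(1) in the setting of Example 1, so that p′/p = −1 and T_X f = f′ − f on (0, ∞); condition (iii) then amounts to the boundary requirement f(0+)p(0) = 0. The sketch below (quadrature step sizes and test functions are our own choices) confirms that the expectation vanishes for f(x) = sin x, which satisfies the boundary condition, and fails for f ≡ 1, which does not:

```python
import math

def expect_exp1(func, upper=60.0, n=400_000):
    """Midpoint-rule approximation of E[func(X)] for X ~ Exp(1)."""
    step = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * step
        total += func(x) * math.exp(-x) * step
    return total

# T_X f(x) = f'(x) - f(x) for the unit-rate exponential.
# f(x) = sin(x) has f(0) = 0, hence lies in the canonical class:
mean_in_class = expect_exp1(lambda x: math.cos(x) - math.sin(x))
# f = 1 violates the boundary condition: E[T_X f(X)] = E[-1] = -1, not 0.
mean_constant = expect_exp1(lambda x: -1.0)
```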

Remark 3.
The assumption for F(X) that ∫_I D(fp) dµ = 0 is made for convenience of calculation but it is not essential. Indeed sometimes it may be more natural not to impose this restriction. For example if µ is the continuous uniform measure on [0, 1] and p = 1, with D the usual derivative, then imposing that ∫_0^1 f′(x) dx = 0 (that is, f(0) = f(1)) may not be natural. The price to pay for relaxing the assumption is that in the definition of T_X f(X) we would have to subtract this integral, as in [81], to ensure that E[T_X f(X)] = 0.
The canonical Stein operator (14) bears an intuitive interpretation in terms of the linear operator D.

Proposition 1. We have the integration-by-parts identity (16): E[g(X) T_X f(X)] = −E[f(X) D⋆g(X)] for all f ∈ F(X) and all g ∈ dom(D, X, f).
Proof. Assumption 1 assures us that for all f ∈ F (X) and all g ∈ dom(D ⋆ ). If moreover g ∈ dom(D, X, f ) then with both integrals being finite. This yields (16).
As anticipated in the Introduction, relationship (16) shows that if D is skewadjoint with respect to D ⋆ under integration in µ then the canonical Stein operator is skew-adjoint to D ⋆ under integration in the measure P X . This motivates the use of the terminology "canonical" in Definition 1; we will further elaborate on this topic in Section 2.5.
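In the forward-difference setting of Example 2 the skew-adjointness expressed by (16) reads E[g(X)T_X f(X)] = −E[f(X)∆−g(X)], with T_X f(x) = ∆+(fp)(x)/p(x) and f vanishing at the left endpoint of the support. A direct numerical verification for a binomial target follows (the distribution and the test functions f, g are arbitrary choices of ours):

```python
import math

n, theta = 12, 0.3
p = [math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)
     for k in range(n + 1)]
p.append(0.0)                      # p(n+1) = 0, closing the forward difference

f = lambda k: float(k)             # f(0) = 0, so f is in the canonical class
g = lambda k: math.sin(k)          # arbitrary test function

# E[g(X) T_X f(X)], using T_X f(k) p(k) = Delta+(fp)(k):
lhs = sum(g(k) * (f(k + 1) * p[k + 1] - f(k) * p[k]) for k in range(n + 1))
# -E[f(X) Delta- g(X)]:
rhs = -sum(f(k) * p[k] * (g(k) - g(k - 1)) for k in range(n + 1))
```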
Example 5 (Example 1, continued). Let X be a random variable with absolutely continuous density p with support I = {a, b}. Then F(X) is the collection of functions f : R → R such that fp ∈ W^{1,1}, the Sobolev space of order 1 on L^1(dx), and lim_{x↓a} f(x)p(x) = lim_{x↑b} f(x)p(x); the canonical Stein operator is T_X f = (fp)′/p, which we set to 0 outside of I. Also, for f ∈ F(X), dom((·)′, X, f) is the class of differentiable functions g : R → R such that ∫(gfp)′ dx = 0 and either ∫|g′fp| dx < ∞ or ∫|g(fp)′| dx < ∞. (Note that the first requirement implicitly requires ∫|(gfp)′| dx < ∞.) In particular all constant functions are in dom((·)′, X, f). In the case that p itself is differentiable (and not only the function fp) the operator can be written T_X f(x) = (f′(x) + f(x)(p′(x)/p(x))) I(x ∈ I), with I(·) the usual indicator function. This is operator (5) from Stein's density approach. Note that, in many cases, the constant functions may not belong to F(X). Operator (17) was also discussed (under slightly different, more restrictive, assumptions) in [17]. See also [57] for a similar construction.
Example 6 (Example 2, continued). Recall D = ∆+ and consider X some discrete random variable whose density p has interval support I = [a, b] (with, for simplicity, a > −∞). The (forward) Stein operator is T_X f(x) = ∆+(fp)(x)/p(x), which we set to 0 outside of I. We divide the example into two parts.
In particular all bounded functions g are in dom(∆ + , X, f ).
Similarly it is straightforward to define a backward Stein class and operator.
Example 7 (Example 3, continued). Let X be a random variable with density p, with support I = {a, b}, with respect to ϕ(x)dx the Gaussian measure. Recall the operator D_ϕ from Example 3. In particular all constant functions are in dom(D_ϕ, X, f). We draw the reader's attention to the fact that the above construction is directly obtained by replacing p with pϕ in Example 5.
Example 8 (Example 4, continued). Recall D = ∆ + λ and consider X some discrete random variable whose density p has interval support I = [0, b]. The (forward) Stein operator is which we set to 0 outside of I. Then, as in the previous example, we simply recover the construction of Example 6 with f (x) replaced by xf (x) (and thus no condition on f (0)) and p(x) replaced by p(x)γ λ (x).

Remark 4.
As noted already in the classic paper [22], the abstract theory of Stein operators is closely connected to Sturm-Liouville theory. This connection is quite easy to see from our notation and framework; it is, however, outside the scope of the present paper and will be explored in future publications.

The canonical inverse Stein operator
The Stein operator being defined (in terms of D), we now define its inverse (in terms of D −1 ). To this end first note that if D(f p) = hp for f ∈ F (X) then T X (f ) = h. As D(f p + χ) = hp for any χ ∈ ker(D), to define a unique right-inverse of T X we make the following assumption.
This assumption ensures that the only µ-integrable χ ∈ ker(D) is 0, and thus that T_X (as an operator acting on F(X) ⊂ L^1(µ)) possesses a bona fide inverse. We will use the shorthand T_X^{-1}h = D^{-1}(hp)/p, with the convention that T_X^{-1}h = 0 outside of I. We state the counterpart of Proposition 1 for the inverse Stein operator.

Proposition 2. Define the class of functions
for all h ∈ F^(0)(X) and all g ∈ dom(D, X, T_X^{-1}h).

Example 9 (Example 1, continued). Let X have support I = {a, b} with Stein class F(X) and Stein operator T_X(f) = (fp)′/p. Then T_X^{-1}h(x) = (1/p(x)) ∫_a^x h(y)p(y) dy. The inverse operator and corresponding sets in Example 3 (resp., Example 4) are simply obtained by replacing p with ϕp (resp., with γ_λ p) in Example 9 (resp., in Example 10).

Stein differentiation and the product rule
Proof. Taking f ≡ 1 ensures the first claim. The second claim then follows immediately.
From here onwards we make the following assumption.
Starting from the product rule (12) we also obtain the following differentiation rules for D and D ⋆ .
Proof. Claim 1 is immediate. To see Claim 2, we expand D⋆ of the product using Assumption 1; applying Claim 1 to the second summand then gives the result.

The following result is the basis of our theory of "Stein differentiation". It is also the key to the standardizations leading to the different Stein operators that will be discussed in Section 3.
for f ∈ F (X) and g ∈ dom(D, X, f ).
Proof. Use Assumption 1 to deduce which is the claim.
To see how (19) can be put to use, let h ∈ L^1_µ(X) and consider the equation (20). As discussed in the Introduction, equation (20) is indeed a Stein equation for the target X in the sense of (2), although the solutions of (20) are now pairs of functions (f, g), with f ∈ F(X) and g ∈ dom(D, X, f), which satisfy the relationship (21). We stress that although fg is uniquely defined by (21), the individual f and g are not (just consider multiplication by constants).
Equation (20) and its solutions (21) are not equivalent to equation (2) and its solutions already available from the literature, but rather contain them as illustrated in the following example.
Example 11 (Example 1, continued). Taking g = 1 (this is always permitted by Lemma 1) and p differentiable we get the equation (22), whose solution is some function f ∈ F(X), as in e.g. [57]. If the constant function f ≡ 1 is in F(X) then keeping instead g variable but taking f ≡ 1 yields the equation (23), whose solution is any function in dom(D, X, 1), the collection of functions g ∈ F(X) such that gp′/p ∈ L^1_µ(X); this is a family of equations considered e.g. in [81]. Similar considerations hold in the settings of Examples 2, 3 and 4. We stress the fact that the difference between (22) and (23) lies in the space of solutions.

Remark 5. Viewing the solutions of (20) as pairs of functions (f, g) is, in a sense, a paradigm shift in the theory of Stein operators. We will discuss this in more detail in Section 3.

Stein characterizations
Pursuing the tradition in the literature on Stein's method, we provide a general family of Stein characterizations for X. Aside from Assumptions 1, 2 and 3 we will further need the following two assumptions to hold.
Both assumptions are simultaneously satisfied in all examples discussed in Section 2.
Theorem 2. Let Y be a random element with the same support as X and assume that the law of Y is absolutely continuous w.r.t. µ with Radon-Nikodym derivative q.
1. Suppose that q/p ∈ dom(D⋆). Take g ∈ dom(D, X) which is X-a.s. never 0 and assume that g q/p ∈ dom(D, X). Then Y has the same distribution as X if and only if (24) holds for all f ∈ F(X).
2. Let f ∈ F(X) be X-a.s. never zero and assume that g q(· − l)/p(· − l) ∈ dom(D, X, f) for all g ∈ dom(D, X, f). Then Y has the same distribution as X if and only if (25) holds for all g ∈ dom(D, X, f).
Remark 6. The assumptions leading to (24) and (25) can be relaxed by removing the assumption that Y and X share a support I but instead conditioning on the event that Y ∈ I and writing Y | Y ∈ I D = X to indicate that p = cq on I, for a constant c = P (Y ∈ I), see [57].
Proof. The sufficient conditions are immediate. Indeed, from (16), if Y has the same distribution as X then (24) and (25) hold true.
We now prove the necessity. We start with statement 1. Let g be such that gq/p ∈ dom(D, X). Then gq/p ∈ dom(D⋆) and, for all f ∈ F(X), we have D⋆(gq/p)fp ∈ L^1(µ) as well as ∫ D(g(· + l)(q/p)(· + l)f(·)p(·)) dµ = 0, and we can apply (13). Supposing (24) then yields an identity valid for all f ∈ F(X). On the one hand, as F(X) is assumed to be dense in L^1_µ(p), it follows that D⋆(gq/p) = (q/p)D⋆(g) p-a.e.; on the other hand, by Claim 2 of Lemma 2 we know that D⋆(gq/p) = (q/p)D⋆g + g(· + l)D⋆(q/p). Equating these two expressions gives g(· + l)D⋆(q/p) = 0 p-a.e. and, as g is p-a.e. never 0, we obtain that D⋆(q/p) = 0 p-a.e. Assumption 4 now gives that there is a constant c such that p = cq except on a set of p-measure 0. As both p and q integrate to 1, it must be the case that c = 1, and so p = q on supp(p), which gives the first assertion. We now tackle statement 2. If g q(· − l)/p(· − l) ∈ dom(D, X, f) then (13) applies once more. Supposing (25), simplifying and using the fact that f is never 0, we deduce an equivalent score-like condition; Assumption 5 then gives the conclusion.
Theorem 2 generalizes the literature on this topic in a subtle, yet fundamental, fashion. To see this, first take g ≡ 1 in (24) (recall that this is always permitted) to obtain the Stein characterization E[T_X f(Y)] = 0 for all f ∈ F(X), which is valid as soon as the densities of X and Y have the same support and q/p ∈ dom(D, X, ·). This is the characterization given in [57,55]. Second, take f ≡ 1 in (25), when permitted; here we assume that p and q share the same support. The condition g ∈ dom(D, X, 1) is equivalent to g(· + l) ∈ F(X) and E|g(X)T_X 1(X)| < ∞. This is the general characterization investigated in [81].
Remark 7. The hypothesis that the constant function 1 belongs to F(X) is not a small assumption. Indeed, by Definition 1 it requires that p ∈ dom(D), that Dp ∈ L^1(µ) and that ∫_I Dp dµ = 0. This condition is not satisfied e.g. by the exponential distribution p(x) = e^{−x}I(x ≥ 0) (because the integral of the derivative is not 0) nor by the arcsine distribution p(x) = (π√(x(1 − x)))^{−1} I(0 < x < 1) (because the derivative is not integrable).

Remark 8.
Our approach is reminiscent of Stein characterizations of birth-death processes, where one can choose the death rate and adjust the birth rate accordingly; see [45] and [29].

Connection with biasing
In [35] the notion of a zero-bias random variable was introduced. Let X be a mean-zero random variable with finite, nonzero variance σ². We say that X* has the X-zero-biased distribution if, for all differentiable f for which E[Xf(X)] exists, E[Xf(X)] = σ²E[f′(X*)].
Furthermore the mean-zero normal distribution with variance σ² is the unique fixed point of the zero-bias transformation. More generally, if X is a random variable with density p_X ∈ dom(D⋆) then, for all f ∈ dom(D), an analogous identity follows from (12). This identity could lead to the definition of a transformation which maps a random variable Y to Y^(X) such that, for all f ∈ dom(D) for which the expressions exist, the corresponding biasing equation holds. For some conditions which give the existence of such a Y^(X) see [36]. As an illustration, in the setting of Example 1, if the density p is log-concave (so that −p′/p is increasing) then the existence of the coupling Y^(X) is straightforward via the Riesz representation theorem, as in [35].
The assertion then follows from the density assumption and (24). Hence (24) can be used to establish distributional characterizations based on biasing equations.
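For a concrete instance of the zero-bias relation: if X is Rademacher (±1 with probability 1/2 each, so EX = 0 and σ² = 1), then X* is uniform on (−1, 1) — a standard example in the zero-bias literature. The sketch below checks E[Xf(X)] = σ²E[f′(X*)] for the arbitrarily chosen test function f(x) = x³:

```python
def lhs_rademacher(f):
    """E[X f(X)] for X Rademacher (+1/-1 with probability 1/2 each)."""
    return 0.5 * (f(1.0) - f(-1.0))

def rhs_uniform(fprime, n=100_000):
    """sigma^2 * E[f'(U)] with sigma^2 = 1 and U ~ Uniform(-1, 1),
    computed by a midpoint rule."""
    step = 2.0 / n
    total = 0.0
    for i in range(n):
        u = -1.0 + (i + 0.5) * step
        total += fprime(u) * 0.5 * step
    return total

a = lhs_rademacher(lambda x: x ** 3)   # E[X^4] = 1
b = rhs_uniform(lambda x: 3 * x * x)   # 3 E[U^2] = 1
```

Both sides equal 1, consistent with the defining equation of the zero-bias transformation.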

Stein operators
Let X be a random variable with support X, let D be a linear operator acting on X⋆ and satisfying Assumptions 1 and 2. There are now two seemingly antagonistic points of view:
- In the Introduction we mention the fact that Stein's method for X relies on a pair (A_X, F(A_X)) with A_X a differential operator acting on F(A_X), a class of functions. For any given X, the literature on Stein's method contains many different such (not necessarily first-order!) operators and classes.
- In Section 2, we claim to obtain "the" canonical operator associated to X, denoted T_X, acting on "the" canonical class F(X) (unique up to the choice of D) with unique inverse T_X^{-1}.
In this third section we merge these two points of view. Our general point of view is that a Stein operator for a random variable X is any operator that can be written in the form (26) and, given h ∈ L^1_µ(X), the corresponding Stein equation is h − Eh(X) = A_X(f, g), whose solutions are any functions f ∈ F(X) and g ∈ dom(D, X, f) such that fg = T_X^{-1}(h − Eh(X)). There are many ways to particularise (26), such as:
1. fix f ∈ F(X) and let g vary in dom(D, X, f);
2. fix g ∈ dom(D, X) and let f vary in F(X);
3. let f and g vary simultaneously.
We refer to these mechanisms as standardizations.
For the first approach pick a function c ∈ F(X) and define the operator (27), acting on functions g ∈ F(A_X) = dom(D, X, c). The corresponding Stein equation is h − Eh(X) = A_X g, whose solutions are g ∈ dom(D, X, c) given by g = T_X^{-1}(h − Eh(X))/c.

The second option is to fix a function g ∈ dom(D, X) and define the operator (28), acting on functions f ∈ F(X). In this case solutions of the Stein equation are f ∈ F(X) given by f = T_X^{-1}(h − Eh(X))/g.

The third option is to consider operators of the form (29), acting on functions (f, g) ∈ G_1 × G_2 where G_1, G_2 ⊆ X⋆ are such that f(·)g(· + l) ∈ F(X). For example we could consider G_i polynomial functions or exponentials and pick G_j with j ≠ i so as to satisfy the assumptions. Solutions of the Stein equation are pairs of functions such that f(·)g(· + l) = T_X^{-1}(h − Eh(X)).

Remark 9. The use of the notation c in (27) relates to the notation in [37], where the idea of using a c-function to generate a family of Stein operators (27) was first proposed (in a less general setting).
Remark 10. Although appearances might suggest otherwise, operators (27) and (28) are not necessarily first order differential/difference operators. One readily obtains higher order operators by considering, for example, classes F A (X) of functions of the form f = D kf forf appropriately chosen; see Section 3.6.
The difference between (27), (28) and (29) is subtle (the first two being particular cases of the third). The guiding principle is to find a form of Stein equation for which the solutions are smooth. The remainder of the Section is dedicated to illustrating standardizations under several general assumptions on the target density, hereby providing interesting and important families of Stein operators.

Stein operators via score functions
Suppose that X is such that the constant function 1 ∈ F(X) and define u(x) = T_X 1(x), the so-called score function of X. Then taking c = 1 in (27) we introduce the operator (31), acting on the class F(A_X) = dom(D, X, 1).

The corresponding Stein equation is
for h̄ any function with X-mean zero; solutions of this equation are the functions g = T_X^{-1}(h̄), and bounds on these functions (as well as on their derivatives) are crucial to the applicability of Part B of Stein's method through operator (31).
In the continuous setting of Example 1 we recover operator (5). In this case F (A X ) is the set of all differentiable functions g such that E |g ′ (X)| < ∞ and E |g(X)u(X)| < ∞.
These are the conditions (27) and (28).

Remark 11. The terminology "score function" for the function Dp(x)/p(x) is standard (at least in the continuous case); it is inherited from the statistical literature.
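For the standard normal target (score u(x) = −x) the solution of the Stein equation g′(x) − xg(x) = h(x) − Eh(Z) is classically given by g(x) = (1/ϕ(x)) ∫_{−∞}^x (h(t) − Eh(Z)) ϕ(t) dt. The sketch below evaluates this formula numerically; the benchmark h(x) = x², for which Eh(Z) = 1 and the solution collapses to g(x) = −x, is our own choice:

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def solve_normal_stein(h, h_mean, x, lo=-12.0, n=400_000):
    """g(x) = (1/phi(x)) * integral_{-inf}^{x} (h(t) - h_mean) phi(t) dt,
    the classical solution of g'(x) - x g(x) = h(x) - h_mean, with the
    lower tail truncated at lo and a midpoint rule on [lo, x]."""
    step = (x - lo) / n
    acc = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * step
        acc += (h(t) - h_mean) * phi(t) * step
    return acc / phi(x)

# Benchmark: h(t) = t^2 has Eh(Z) = 1 and exact solution g(x) = -x,
# since the integral of (t^2 - 1)phi(t) is -x phi(x).
g_val = solve_normal_stein(lambda t: t * t, 1.0, 0.8)
```

Bounding such solutions (and their derivatives) uniformly in h is exactly what the "Stein factor" literature cited above is concerned with.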

Stein operators via the Stein kernel
Suppose that X has finite mean ν and define τ(x) = T_X^{-1}(ν − Id)(x), (32), a function which we call the Stein kernel of X (see the forthcoming Remark 13). Next take c = τ in (27) (this is always permitted) and introduce the operator (33), acting on the class F(A_X) = dom(D, X, τ).

The corresponding Stein equation is
for h̄ any function with X-mean 0; solutions of this equation are the functions g = T_X^{-1}(h̄)/τ, and bounds on these functions (as well as on their derivatives) are crucial to the applicability of Part B of Stein's method via operator (33).
In the continuous setting of Example 1 we recover operator (6). In this case F(A_X) is the set of all differentiable functions g such that E|g(X)(X − ν)| < ∞ and E|g′(X)τ(X)| < ∞.
The Stein kernel (32) has a number of remarkable properties. In particular, it plays a pivotal role in the connection between information inequalities and Stein's method, see [53,64,63].
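In the continuous setting the Stein kernel can be computed directly from its integral form τ(x) = (1/p(x)) ∫_x^∞ (y − ν)p(y) dy. For X ∼ Exp(1) this gives τ(x) = x, and the covariance-type identity E[(X − ν)g(X)] = E[τ(X)g′(X)], which characterizes τ, can be verified numerically. The exponential target and the test function g(x) = x² below are our own illustrative choices:

```python
import math

NU = 1.0  # mean of Exp(1)

def exp_kernel(x, upper=60.0, n=200_000):
    """tau(x) = (1/p(x)) * integral_x^inf (y - nu) p(y) dy for X ~ Exp(1),
    computed by a midpoint rule with the upper tail truncated."""
    step = (upper - x) / n
    acc = 0.0
    for i in range(n):
        y = x + (i + 0.5) * step
        acc += (y - NU) * math.exp(-y) * step
    return acc / math.exp(-x)

def expect_exp1(func, upper=60.0, n=200_000):
    step = upper / n
    acc = 0.0
    for i in range(n):
        x = (i + 0.5) * step
        acc += func(x) * math.exp(-x) * step
    return acc

tau2 = exp_kernel(2.0)                          # tau(2) = 2
cov = expect_exp1(lambda x: (x - NU) * x * x)   # E[(X-1)X^2] = 4
ker = expect_exp1(lambda x: x * 2.0 * x)        # E[tau(X) g'(X)] = 4
```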
Remark 13. Although the function τ = T −1 X (ν − Id) has already been much used in the literature, it has been given various names all marked with some ambiguity. Indeed [60,62,63] (among others) refer to τ as the "Stein factor" despite the fact that this denomination also refers to the bounds on the solutions of the Stein equations, see [75,21,7]. Other authors, including [13,10,11], rather refer to this function as the "ω-function" or the "covariance kernel" of X. We prefer to unify the terminology by calling τ a Stein kernel.
Two particular cases of (33) have already been studied at length in the literature, in the following setting.

Definition 3 (Pearson's class of distributions).
A continuous distribution p with support supp(p) is a member of Pearson's family of distributions if it is a solution of the differential equation (34), for some constants λ, α, β_j, j = 0, 1, 2.
Properties of the differential operator T_X f = (fp)′/p have already been studied in quite some detail for distributions p which belong to Pearson's class of distributions, see e.g. [22,50,46,67,54,58,2]. If X ∼ p, a Pearson distribution, then by definition its derivative p′ exists and, using (17), its canonical Stein operator can be written explicitly for x ∈ supp(p). In general this operator is not easy to handle. It is shown in [50] that, in the setting of Example 1, a density p satisfies (34) if and only if its Stein kernel τ(x) is quadratic. This function can be calculated (using e.g. [22, equation (3.5)]); see also [67]. This observation leads us to considering distributions, discrete or continuous, which have Stein kernel of the form τ(x) = a + bx + cx² (35) for some constants a, b and c. For distributions satisfying (35) we deduce a natural family of Stein operators acting on the class F(A_X) of functions g such that gτ ∈ F(X) as well as E|g(X)(ν − X)| < ∞ and E|D⋆g(X)(a + bX + cX²)| < ∞. In the discrete setting an example is given by p(x) proportional to (a choose x)(b choose n−x)/(a+b choose n) for some constants a, b and n; in other words, X has a generalized hypergeometric distribution. See also [1] where distributions satisfying (35) are referred to as Cumulative Ord distributions; see in particular their Proposition 2.1 for a characterization.
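As a quick illustration that Pearson densities have quadratic Stein kernels, take X ∼ Beta(α, β), for which the kernel is known to take the form τ(x) = x(1 − x)/(α + β). The sketch below recomputes τ from the integral definition and compares against this closed form (parameter values and evaluation points are our own choices):

```python
import math

ALPHA, BETA = 2.0, 3.0
NORM = math.gamma(ALPHA) * math.gamma(BETA) / math.gamma(ALPHA + BETA)

def p(x):  # Beta(ALPHA, BETA) density on (0, 1)
    return x ** (ALPHA - 1) * (1 - x) ** (BETA - 1) / NORM

NU = ALPHA / (ALPHA + BETA)  # mean of the Beta distribution

def kernel(x, n=200_000):
    """tau(x) = (1/p(x)) * integral_x^1 (y - nu) p(y) dy, midpoint rule."""
    step = (1.0 - x) / n
    acc = 0.0
    for i in range(n):
        y = x + (i + 0.5) * step
        acc += (y - NU) * p(y) * step
    return acc / p(x)

# Compare with the quadratic closed form x(1 - x)/(ALPHA + BETA).
err = max(abs(kernel(x) - x * (1 - x) / (ALPHA + BETA))
          for x in (0.2, 0.5, 0.8))
```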
Example 12. Many "useful" densities satisfy (35) in which case operator (33) has a nice form as well. The following examples are easy to compute and will be useful in the sequel; for future reference we also provide the log derivative of the density.
1. Continuous setting, strong derivative:
2. Discrete setting, forward derivative:
We refer the reader to the handbook [24] for more examples and further details.
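The defining property of the Stein kernel, namely the identity E[τ(X) g'(X)] = E[(X − ν) g(X)] with the sign convention used here, is easy to check numerically for the classical continuous cases of Example 12. The following Monte Carlo sketch (ours, not from the paper) uses the standard kernels τ(x) = 1 for the standard normal and τ(x) = x for the rate-1 exponential.

```python
import random

random.seed(42)
N = 400_000

# Standard normal: Stein kernel tau(x) = 1, mean nu = 0.
# Identity to check: E[tau(X) g'(X)] = E[(X - nu) g(X)] with g(x) = x^3.
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
lhs_normal = sum(3 * x * x for x in xs) / N        # E[1 * g'(X)], g'(x) = 3x^2
rhs_normal = sum(x * x ** 3 for x in xs) / N       # E[X * g(X)]

# Rate-1 exponential: Stein kernel tau(x) = x, mean nu = 1.
# Same identity with g(x) = x^2.
ys = [random.expovariate(1.0) for _ in range(N)]
lhs_exp = sum(y * 2 * y for y in ys) / N           # E[tau(Y) g'(Y)], g'(y) = 2y
rhs_exp = sum((y - 1.0) * y ** 2 for y in ys) / N  # E[(Y - 1) g(Y)]

print(lhs_normal, rhs_normal)  # both close to E[X^4] = 3
print(lhs_exp, rhs_exp)        # both close to 4
```

Both pairs agree up to the Monte Carlo error, which is the behaviour the canonical construction predicts.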

Invariant measures of diffusions
Recent papers [25,51,52] provide a general framework for performing Stein's method with respect to densities p which are assumed to admit a variance, to be continuous (with respect to the Lebesgue measure) and bounded, and to have open interval support. Specifically, [51] suggest studying operators of the form (36), with γ ∈ L^1(µ) a continuous function with exactly one sign change on the support of X, negative on the right-most interval, such that γp is bounded and E[γ(X)] = 0, and with β(x) = (2/p(x)) ∫_a^x γ(y)p(y) dy, for g ∈ F(A_X) the class of functions such that g ∈ C^1, E|γ(X)g(X)| < +∞ and E|β(X)g'(X)| < +∞.
Then [51] (see as well [52] for an extension) use the general theory of diffusions to prove that A_X is indeed a Stein operator in the sense of the Introduction (their approach falls within the generator approach). In our framework, (36) is a particular case of (27), with c = β/2 = T_X^{-1}γ ∈ F(X) and γ = T_X c (which necessarily satisfies E[γ(X)] = 0), and F(A_X) = dom((·)', X, c).

Gibbs measures on non-negative integers
We can treat any discrete univariate distribution on the non-negative integers by writing it as a Gibbs measure, where N ∈ {0, 1, 2, . . .} ∪ {∞} and κ is a normalizing constant; here the choice of V and ω is not unique. In [29], Stein's method for discrete univariate Gibbs measures on the non-negative integers is developed, with an operator acting on the class of functions such that f(0) = 0 and, in case N is infinite, an additional summability condition holds; this yields (37) via (29), using the pair (f(x), g(x)) = (f(x), x + 1). In [29] other choices of birth and death rates were discussed; here the birth rate b_x is the pre-factor of g(x + 1), and the death rate d_x is the pre-factor of g(x). Indeed any choice of birth and death rates satisfying the detailed balance condition b_x p(x) = d_{x+1} p(x + 1) for all x is viable. Our canonical Stein operator can be written in this form as well. Choosing c(x) = d_x and applying (27) gives a general Stein operator, with ν the mean of the distribution. This expression can be simplified in special cases; for example in the Poisson case V is constant and we obtain τ(x) = ω, as before. Similar developments are also considered by [45].
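For the Poisson(λ) case everything above is fully explicit: with the usual choice of rates b_x = λ and d_x = x, detailed balance b_x p(x) = d_{x+1} p(x+1) holds, and the resulting operator A g(x) = λ g(x+1) − x g(x) is the classical Chen-Stein operator. A quick numerical check (our sketch, not code from the paper):

```python
import math

lam = 2.3
K = 80  # truncation point; Poisson(2.3) mass beyond 80 is negligible

pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(K)]

# Detailed balance for the birth-death rates b_x = lam, d_x = x:
#   b_x p(x) = d_{x+1} p(x+1)  for all x.
db_err = max(abs(lam * pmf[x] - (x + 1) * pmf[x + 1]) for x in range(K - 1))

# Chen-Stein characterization: E[lam g(X+1) - X g(X)] = 0 for bounded g.
g = lambda x: math.sin(x)
stein_err = abs(sum((lam * g(x + 1) - x * g(x)) * pmf[x] for x in range(K)))

print(db_err, stein_err)  # both vanish up to truncation/rounding error
```

Both quantities are zero up to floating point error, as the detailed balance discussion predicts.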

Higher order operators
So far, all the examples provided involve only first-order difference or differential operators. One way of constructing higher-order operators is to consider operators built from D^k, the kth iterate of D, for well-chosen c. This approach is strongly connected with Sturm-Liouville theory and will be the subject of a future publication. Here we merely give examples illustrating that our results are not restricted to first-order operators.
Example 13 (Kummer-U distribution). Let U(a, b, z) be the unique solution of the differential equation z d^2U/dz^2 + (b − z) dU/dz − aU = 0; then U(a, b, z) is the confluent hypergeometric function of the second kind (also known as the Kummer U function). A random variable X follows the Kummer-U distribution K_s if its density κ_s is defined in terms of V_s(x) = U(s − 1, 1/2, x^2/(2s)), with Γ(s) the Gamma function. The class F(X) contains all differentiable functions such that lim_{x→0 or ∞} f(x)κ_s(x) = 0. As noted in [69], the canonical Stein operator (given by (5)) is cumbersome. Applying (29) via the pair (f, g) = (f, g(f)), we get (after quite some computations) the operator

Similar considerations as in Example 13 provide tractable operators for other distributions involving special functions.
Example 14 (Variance Gamma distribution). Let K_ν be the modified Bessel function of the second kind, of index ν. A random variable has the variance gamma distribution VG(ν, α, β, η) if its density is given on R by the corresponding formula, where α > |β| > 0, ν > −1/2 and η ∈ R. For simplicity we take η = 0, α = 1 and ν > 0. A generator for this distribution is (38); see [33]. The canonical operator follows (with D the usual strong derivative). Applying (29) via the pair (f, g) = (f, g(f)), we retrieve (38).

Densities satisfying a differential equation
Lastly we consider the case where the density of interest p, with interval support I = (a, b), is defined as the solution of some differential equation, say L(p) = 0, along with some boundary conditions. Suppose that L admits an adjoint (w.r.t. Lebesgue integration), which we denote L^⋆, so that, for X ∼ p, we can apply integration by parts, with C_a^b(g, p) the constant arising through the integration by parts. We define A_X acting on the class F(A_X) of sufficiently smooth functions such that C_a^b(g, p) = 0. To qualify A_X as a Stein operator in the sense of (1), it still remains to identify conditions on g which ensure that this operator characterises the density.
This point of view blends smoothly into our canonical approach to Stein operators; moreover, conditions on g can be provided in a generic way. To see this, suppose that the integration by parts can be expressed through a functional F(g, p) for some g. Then, reversing the integration by parts argument provided above, we obtain an identity with F(g, p) and C_a^b(g, p) the quantities resulting from the integration by parts (and using the fact that L(p) = 0, by assumption). This leads to the standardization A_X(g) = T_X(F(g, p)), acting on the class of functions such that F(g, p) ∈ F(X). Note how, in particular, the assumption F(g, p) ∈ F(X) implies that C_a^b(g, p) = 0, as required at the beginning of this section.
Example 15. We illustrate this point of view in the case of the spectral density h_n on [−2, 2] of a GUE(n, 1/n) random matrix studied in [41,42]. This density is defined through a third-order differential equation along with a boundary condition. Letting X ∼ h_n, it is straightforward to derive the corresponding identity; integrating by parts we then obtain boundary terms such as h_n(−2)g(−2). Considering only functions g such that F(g, h_n) ∈ F(X) leads to a Stein operator for X.

Approximate computation of expectations
Building on our framework, in this section we derive a general "comparison of generators approach" (Theorem 3), which provides bounds on the probability distance between univariate distributions in terms of their Stein operators. This result is formal and abstract; it is our take on a general version of Part B of Stein's method. Specific applications to concrete distributions are deferred to Section 5.

Comparing Stein operators
Let (X_1, B_1, µ_1) and (X_2, B_2, µ_2) be two measure spaces as in Section 2.1. Let X_1 and X_2 be two random variables on X_1 and X_2, respectively, and suppose that their respective densities p_1 and p_2 have interval support. Let D_1 and D_2 be two linear operators acting on X_1 and X_2 and satisfying Assumption 1 (with l_1 and l_2, respectively) and Assumption 2. Denote by T_1 and T_2 the Stein operators associated with (X_1, D_1) and (X_2, D_2), acting on Stein classes F_1 = F(X_1) and F_2 = F(X_2), respectively. Finally let E_i h = Eh(X_i) denote the expectation of a function h under the measure p_i dµ_i, i = 1, 2.
The framework outlined in Section 2 (specifically Section 2.4) is tailored for the following result to hold.
Theorem 3. Let h be a function such that E i |h| < ∞ for i = 1, 2.
Remark 15. Our approach contains the classical "direct" approach described in the Introduction (see (3)). Indeed, if allowed, one can take f_1 = 1, with u_1 the score of X_1 (defined in (30)) and g_h now the usual solution of the Stein equation. This yields a bound for which one does not need to calculate T_2.
Proof. The starting point is the Stein equation (20) which, in the current context, becomes (42), with • ∈ {1, 2}. Solutions of this equation are pairs of functions (f, g) with f ∈ F(X_•) and g ∈ dom(D_•, X_•, f). Using • = 1, replacing x by X_2 and taking expectations gives (39). For (40), first fix f_1 ∈ F_1 and choose g = g_h the corresponding solution of (42) with • = 1. By construction we can then take expectations; finally, we know that for all f_2 ∈ F_2 such that g_h ∈ dom(D_2, X_2, f_2) we can again use (42). Taking differences we get (40). Equation (41) follows in a similar fashion, fixing this time f = f_h and letting g_1 and g_2 vary.
The power of Theorem 3, and of Stein's method in general, lies in the freedom of choice on the r.h.s. of the identities: all functions f_•, g_• (where • is to be replaced by h, 1 or 2 according to which of (40) or (41) is used) can be chosen so as to optimise the resulting bounds. We can even optimise the bounds over all suitable pairs (f, g). We will discuss two particular choices of functions in Section 4.2 which lead to well-known Stein bounds. We will also provide illustrations (discrete vs discrete, continuous vs continuous and discrete vs continuous) in Section 5.
In particular, (40) and (41) provide tractable (and still very general) versions of (4). Indeed, taking suprema over all h ∈ H, some suitably chosen class of functions, we get, in the notation of the Introduction, bounds on d_H(X_1, X_2). Different choices of functions f_1 and f_2 (resp. g_1 and g_2) will lead to different expressions bounding all distances d_H(X_1, X_2) in terms of properties of T_1 and T_2. It remains the mission of the practitioner to determine which choice will lead to the best possible bound.
Remark 16. The idea of comparing distributions by comparing their operators was pioneered in [30,37,31], wherein this idea is referred to as the "comparison of generators" approach.

Remark 17.
If there exist no functions f 1 , f 2 (resp. g 1 , g 2 ) such that the assumptions are satisfied, then the claims of Theorem 3 are void. Such is not the case whenever X 1 and X 2 are "reasonable".
Remark 18 (About the Stein factors). In view of (40) and (41), good bounds on E_1 h − E_2 h will depend on the control we have on the functions appearing in (43). Bounds on these functions and on their derivatives are called, in the dedicated literature, Stein (magic) factors (see for example [21,75]). There is an important connection between such constants and Poincaré/variance bounds/spectral gaps, as already noted for example in [19,49,48,54,61]. This connection is quite transparent in our framework and will be explored in future publications.
In the sequel we will not use the full freedom of choice provided by Theorem 3, but rather focus on applications of identity (40) only. Indeed, in this case much is known about g_h and Dg_h when f_1 = 1 and X_1 is Gaussian (see [20]), Binomial (see [26]), Poisson (see [8]), Gamma (see [15,62]), etc. See also [25,51,23,57] for computations under quite general assumptions on the density of X_1. We will make use of these results in Section 5. It seems in a sense hopeless to wish for useful bounds on (43) in all generality. Of course one could proceed as in [23,17] or [57] by imposing ad hoc assumptions on the target density which ensure that the functions in (43) have good properties. Such approaches are not pursued in this paper; specific bounds will therefore only be discussed in particular examples.

Comparing Stein kernels and score functions
There are two obvious ways to exploit (40): either by making the first summand vanish, or by making the second summand vanish. In the rest of this section we do just that, in the case X_1 = X_2 and D_1 = D_2 = D (and hence l_1 = l_2 = l); the extension of these results to mixtures is straightforward, albeit cumbersome, and will be performed for a specific example in Section 5.6.
Cancelling the first term in (40) and ensuring that all resulting assumptions are satisfied immediately leads to the following result.
1. If the constant function 1 ∈ F_1 ∩ F_2, then we can take f = 1 in (44) to deduce a bound with u_i = T_i(1) the score function of X_i (defined in (30)) and κ_{H,1} an explicit constant that can be computed in several important cases; see e.g. [57, Section 4] and [78,48] for applications in the Gaussian case. Note that J(X_1, X_2) = E_2(u_1 − u_2)^2 is the so-called generalized Fisher information distance (see e.g. [47,57]).
2. The assumption that f ∈ F_1 ∩ F_2 can be relaxed; if ∫_I D_2(f p_2) dµ_2 = 0 then this just adds terms which relate to the boundaries of I.
Cancelling the second term in (40), and ensuring that all resulting assumptions are satisfied, immediately leads to the following result, namely (45). If, moreover, X_1 and X_2 have a common finite mean ν, then one can choose ω(x) = ν − x in (45) to get (46), with τ_j, j = 1, 2, the Stein kernel of X_j (defined in (32)) and κ_{H,2} an explicit constant that can be computed in several cases. In [11], and in the references cited therein, consequences of (46) are explored in quite some detail. In particular, in the Gaussian and central Gamma cases, (46) has been exploited fruitfully in conjunction with Malliavin calculus, leading to an important new stream of research known as "Nourdin-Peccati analysis"; see [62,60]. See also the aforementioned references [52,51,25], where several extensions of the Nourdin-Peccati analysis are discussed. Note that in the Gaussian case X_1 ∼ N(0, 1) we readily obtain τ_1 = 1; the quantity E_2(τ_1 − τ_2)^2 is then the so-called Stein discrepancy from [63,53].

Sums of random variables and the Stein kernel
Let X_i, i = 1, . . . , n, be independent random variables with respective means ν_i, and put W = ∑_{i=1}^n X_i. Following [80, Lecture VI] and [64,63], we obtain an almost sure representation formula for the Stein kernel of sums of independent random variables.
Proof. For every g ∈ dom(D, W, τ_W), (18) yields the claimed chain of equalities, where the first equality follows by de-conditioning w.r.t. W_i and then conditioning w.r.t. W. The assertion follows by denseness.
Combining this representation lemma with Corollary 2 leads to the following general result.
Proposition 3. Suppose that the assumptions in Lemma 3 are satisfied. Let X be a random variable with finite mean ν = ∑_{i=1}^n ν_i. Then the stated bound holds for all h ∈ H, a class of functions as in Corollary 2.
Proof. Lemma 3 combined with Corollary 2 (whose conditions are satisfied) gives the first inequality; the assertion then follows by Jensen's inequality for conditional expectations.
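For sums of independent Bernoulli variables the representation of Lemma 3 can be checked by exact enumeration. In the sketch below (our illustration, not code from the paper) we take D = ∆^+ and use τ_i(x) = p_i(1 − x), the Stein kernel of a Bernoulli(p_i) variable for the forward difference; we then verify both the representation τ_W(w) = E[∑_i τ_i(X_i) | W = w] and the kernel identity E[τ_W(W) ∆^+g(W)] = E[(W − ν) g(W)].

```python
from itertools import product

p = [0.2, 0.5, 0.7]
n = len(p)
nu = sum(p)

# Enumerate all outcomes of (X_1, ..., X_n) with their probabilities.
outcomes = []
for xs in product([0, 1], repeat=n):
    prob = 1.0
    for xi, pi in zip(xs, p):
        prob *= pi if xi == 1 else 1 - pi
    outcomes.append((xs, prob))

# tau_W(w) = E[ sum_i tau_i(X_i) | W = w ],  with tau_i(x) = p_i (1 - x).
num = {w: 0.0 for w in range(n + 1)}
den = {w: 0.0 for w in range(n + 1)}
for xs, prob in outcomes:
    w = sum(xs)
    den[w] += prob
    num[w] += prob * sum(pi * (1 - xi) for xi, pi in zip(xs, p))
tau_W = {w: num[w] / den[w] for w in range(n + 1)}

# Check E[tau_W(W) Dg(W)] = E[(W - nu) g(W)] for a test function g.
g = lambda x: x * x
lhs = sum(den[w] * tau_W[w] * (g(w + 1) - g(w)) for w in range(n + 1))
rhs = sum(den[w] * (w - nu) * g(w) for w in range(n + 1))
print(lhs, rhs)  # equal up to rounding
```

The two sides coincide exactly (up to floating point rounding), as the representation formula requires.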

Stein bounds
As anticipated, in this section we discuss stochastic approximation via Stein differentiation in several concrete examples. The main purpose of this Section is illustrative and most of the examples we discuss lead to well-known situations. Relevant references are given in the text.

Binomial approximation to the Poisson-binomial distribution
An immediate application of Proposition 3 can be found in binomial approximation for a sum of independent Bernoulli random variables. Writing X for a Bin(n, p) random variable and W = ∑_{i=1}^n X_i with X_i ∼ Bin(1, p_i), i = 1, . . . , n, and np = ∑_{i=1}^n p_i (the distribution of W is called a Poisson-binomial distribution, see e.g. [26]), we readily compute the relevant Stein kernels; here we use D = ∆^+, the forward difference. Thus a first bound holds for any measurable function h such that E|h(X)| < ∞ and E|h(W)| < ∞. An alternative angle on this problem is to use the score function approach, although here with T(Id) instead of T(1). It is easy to derive the relevant identity (see e.g. Example 12.2.(b)), so that the operators simplify for f = Id, the identity function. By Example 6 we find that f = Id ∈ F(X) ∩ F(W) because Id(0) = 0. Now let h be such that E|h(X)| < ∞ and E|h(W)| < ∞, and let g_h = T_X^{-1}(h − Eh(X))/Id; then g_h ∈ dom(∆^+, W, Id) ∩ dom(∆^+, X, Id), and (40) applies. By (16), using the notation g_a(x) = g(x + a) for a function in x, a second bound follows. The fact that we obtain two different bounds for the same problem illustrates the freedom of choice in specifying f and g in the Stein equation. In [26], bounds for sup_x |∆g_h(x)/(x + 1)| are calculated, and in [29] a bound for sup_x |g_h(x)/(x + 1)| is given.
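For moderate n the total variation distance in this problem is exactly computable, which makes it easy to see what the two Stein bounds are competing against. The following sketch (ours, not from the paper) evaluates d_TV(W, X) for a Poisson-binomial W and the matching Bin(n, p) with np = ∑ p_i.

```python
from math import comb

p_list = [0.1, 0.2, 0.3, 0.4, 0.5]
n = len(p_list)
p_bar = sum(p_list) / n  # matching binomial parameter: n * p_bar = sum p_i

# Poisson-binomial pmf of W = X_1 + ... + X_n by sequential convolution.
pmf_W = [1.0]
for p in p_list:
    new = [0.0] * (len(pmf_W) + 1)
    for k, q in enumerate(pmf_W):
        new[k] += q * (1 - p)
        new[k + 1] += q * p
    pmf_W = new

pmf_X = [comb(n, k) * p_bar ** k * (1 - p_bar) ** (n - k) for k in range(n + 1)]

tv = 0.5 * sum(abs(a - b) for a, b in zip(pmf_W, pmf_X))
print(tv)  # small, since the p_i are close to their mean
```

The distance is driven by the spread of the p_i around p_bar, which is exactly the quantity controlled by the kernel comparison above.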

Distance between Gaussians
Consider two centered Gaussian random variables X_1 and X_2 with respective variances σ_1^2 ≤ σ_2^2, say. Denote by φ the density of Z, a standard normal random variable. The canonical Stein operators are then of the usual form, acting on the classes F_1(X_1) = F_2(X_2) = F(Z) of Z-integrable differentiable functions such that (fφ)' ∈ L^1(dx). In this simple toy setting it is possible to write out (40) in full generality. Indeed we have the corresponding identity, with h̃(u) = h(σ_1 u) and g_h̃,0 the solution of the classical Stein equation. In the particular case where one is interested in the total variation distance, one only considers h : R → [0, 1], and (40) becomes an identity valid
for any f_1, f_2 ∈ F(Z). There are many directions that can be taken from here, of which we illustrate three (to simplify notation we write g_h for g_h̃,0).
• Taking f_1 = 1 and f_2 = 0 (see Remark 15) leads to an explicit identity; using the classical Gaussian integration by parts E[ζ'(Z)] = E[Zζ(Z)], valid for any differentiable function ζ, and also noting that one can interchange the roles of X_1 and X_2, we deduce a bound already obtained e.g. in [62, Proposition 3.6.1]. • Taking f_1 = σ_1^2 and f_2 = σ_2^2 (thus a particular case of the comparison of kernels from Corollary 2) yields exactly the same result as above.
• Taking f_1 = f_2 = 1 (thus a particular case of the comparison of scores from Corollary 1) yields an identity whose resulting bound is better than the previous one whenever σ_2/σ_1 < 2.
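For centered Gaussians the total variation distance is in fact available in closed form (the two densities cross at ±x* with x*^2 = 2σ_1^2σ_2^2 log(σ_2/σ_1)/(σ_2^2 − σ_1^2)), which makes it easy to test the sharpness of bounds such as the ones above. A small numerical sketch (ours, not from the paper):

```python
from math import erf, exp, log, pi, sqrt

s1, s2 = 1.0, 2.0  # sigma_1 < sigma_2

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))                 # standard normal CDF
phi = lambda x, s: exp(-x * x / (2 * s * s)) / (s * sqrt(2 * pi))

# Densities of N(0, s1^2) and N(0, s2^2) cross at +/- x_star.
x_star = sqrt(2 * s1**2 * s2**2 * log(s2 / s1) / (s2**2 - s1**2))
tv_closed = 2 * (Phi(x_star / s1) - Phi(x_star / s2))

# Cross-check by numerical integration of 0.5 * |p1 - p2|.
h, L = 1e-3, 10 * s2
grid = [-L + i * h for i in range(int(2 * L / h) + 1)]
tv_numeric = 0.5 * h * sum(abs(phi(x, s1) - phi(x, s2)) for x in grid)

print(tv_closed, tv_numeric)  # both approximately 0.32
```

The closed form and the quadrature agree, giving an exact benchmark against which the three Stein bounds can be compared.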

From Student to Gauss
Set X_1 = Z, a standard Gaussian, and X_2 = W_ν, a Student t random variable with ν > 2 degrees of freedom. In this case the Stein kernels of both distributions are well defined and given, respectively, by τ_1 = 1 and τ_2(x) = (x^2 + ν)/(ν − 1); see Example 12. All assumptions in Corollary 2 are satisfied, so that we can plug in these functions, with H the class of Borel functions with values in [0, 1], to get (47), where, as in the previous example, we make use of our knowledge of the solutions of the Gaussian Stein equation. It is straightforward to compute (47) explicitly (under the assumption ν > 2, otherwise the expectation does not exist), which yields (48). A similar result is obtained with Corollary 1; it is of the same order as (48), with a better constant, but arguably much less elegant.
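As a sanity check on the kernels used here (ours, not from the paper): plugging g(x) = x into the kernel identity E[τ_2(W)g'(W)] = E[W g(W)] predicts E[(W^2 + ν)/(ν − 1)] = E[W^2] = ν/(ν − 2) for a Student t with ν > 2. Numerical quadrature confirms this.

```python
from math import gamma, pi, sqrt

nu = 5.0
c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
f = lambda x: c * (1 + x * x / nu) ** (-(nu + 1) / 2)  # Student-t density
tau2 = lambda x: (x * x + nu) / (nu - 1)               # Stein kernel of Example 12

# Riemann sum on a wide symmetric window (the integrand decays like |x|^-4).
h, L = 0.01, 400.0
grid = [-L + i * h for i in range(int(2 * L / h) + 1)]
mass = h * sum(f(x) for x in grid)
mean_tau = h * sum(tau2(x) * f(x) for x in grid)

print(mass, mean_tau)  # ~1 and ~nu/(nu - 2) = 5/3
```

The computed expectation matches ν/(ν − 2), consistent with the kernel comparison behind (47) and (48).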
Remark 20. It is of course possible to exchange the roles of the Student and the Gaussian in the above computations.

Exponential approximation
Let X_(n) be the maximum of n i.i.d. uniform random variables on [0, 1]. It is known that M_n = n(1 − X_(n)) converges in distribution to X_1, a rate-1 exponential random variable. Note that E[M_n] = n/(n + 1) ≠ 1. In order to apply Corollary 2 most easily we are led to consider the slightly transformed random variable X_2 = ((n + 1)/n) M_n = (n + 1)(1 − X_(n)). The canonical operator for X_1 is T_1 f = f' − f, acting on the class of differentiable f such that f(0) = 0, and the Stein equation (20) becomes (49). The solution pairs (f, g) = (f_h, g_h) are such that (fg)(x) = T_exp^{-1}(h) for x > 0. If h(x) = I(x ≤ t) we need to understand the properties of the function T_exp^{-1}(h). This function is bounded and differentiable on R, with limit 0 at the left boundary, and constant with value 1 − e^{−t} for all x ≥ t (see also [15, Lemma 3.2]). Taking g(x) = x^ǫ in (49), the corresponding function f from (49) is explicit, and lim_{x→0} f_{t,1}(x) = e^{−t} (see [15] for details on the cases ǫ = 0 and ǫ = 1). We now turn our attention to the problem of approximating the law of X_2, whose density is p(x) = (n/(n + 1))(1 − x/(n + 1))^{n−1} with support [0, n + 1]. Taking derivatives we get the canonical operator, acting, as above, on the class of differentiable functions such that f(0) = 0. Clearly f_{t,ǫ}(0)g_ǫ(0) = 0 for all 0 < ǫ < 1, and therefore we obtain the non-uniform bound (50). The quantity on the r.h.s. of (50) can be optimised numerically in (ǫ, t). For example, for n = 100 and t = 1/2, the upper bound equals 0.00497143 at ǫ = 0 and 0.00852033 at ǫ = 1; the optimal choice of ǫ in this case is ǫ ≈ 0.138, for which the bound is 0.00488718. Obviously, in this simple situation it is also easy to evaluate the discrepancies ∆(t) = |P(X_2 ≤ t) − P(X_1 ≤ t)| numerically; first explorations show that there is some interesting optimization (depending on the magnitude of t) to be performed in order to obtain good bounds.
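Since X_(n) has an explicit law, the pointwise discrepancy can be evaluated exactly: P(X_2 ≤ t) = 1 − (1 − t/(n + 1))^n for t ∈ [0, n + 1]. The sketch below (ours) computes ∆(1/2) for n = 100 and checks that it indeed lies below the optimized bound 0.00488718 reported above.

```python
from math import exp

n, t = 100, 0.5

cdf_X2 = 1 - (1 - t / (n + 1)) ** n  # exact law of (n+1)(1 - X_(n))
cdf_exp = 1 - exp(-t)                # rate-1 exponential CDF
delta = abs(cdf_X2 - cdf_exp)

print(delta)  # about 0.00226, below the optimized Stein bound 0.00489
```

The true discrepancy is roughly half the optimized Stein bound, which is the kind of gap the optimization in (ǫ, t) is meant to narrow.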

Gumbel approximation
Let X_(n) be the maximum of n i.i.d. exponential random variables. It is known that M_n = X_(n) − log n converges in distribution to X_1, a Gumbel random variable with density p(x) = e^{−x} e^{−e^{−x}} on R. The Stein kernel of the Gumbel distribution does not take a tractable form, hence we shall here rather use Corollary 2 with another choice of function ω. A natural choice for ω is the score function, here u_Gumbel(x) = e^{−x} − 1, since in this case T_Gumbel^{-1}(u_Gumbel) = 1. As in the exponential example, we here also run into the difficulty that E[e^{−M_n} − 1] = n/(n + 1) − 1 ≠ 0, leading us to consider the transformed random variable X_2 = M_n + log(n/(n + 1)). Simple calculations give T_2^{-1}(e^{−x} − 1) = 1 − e^{−x}/(n + 1), and we can use Corollary 2 to obtain the corresponding bound, with g_h(x) = T_Gumbel^{-1}(h). Since, furthermore, E e^{−X_2} = 1, we deduce a simplified bound. Again it is easy to express g_h explicitly in most cases. For example, taking h(x) = I(x ≤ t) we readily compute g_h(x) = e^x (e^{−(e^{−t} − e^{−x})_+} − e^{−e^{−t}}), which can be shown to satisfy ‖g_h‖ ≤ e^t(1 − e^{−e^{−t}}) ≤ 1 and ‖g_h'‖ ≤ 1. This provides a uniform bound which is of comparable order (though with a worse constant) to, e.g., [43]. See [9] for a similar application towards Fréchet convergence.
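Here too everything is explicit: P(X_2 ≤ t) = (1 − e^{−t}/(n + 1))^n, to be compared with the Gumbel CDF e^{−e^{−t}}. A quick numerical evaluation of the Kolmogorov distance (our sketch, not from the paper):

```python
from math import exp

n = 100

cdf_X2 = lambda t: (1 - exp(-t) / (n + 1)) ** n  # law of M_n + log(n/(n+1))
cdf_gumbel = lambda t: exp(-exp(-t))

# Kolmogorov distance evaluated on a fine grid of t in [-3, 10].
grid = [-3 + 0.001 * i for i in range(13001)]
sup_diff = max(abs(cdf_X2(t) - cdf_gumbel(t)) for t in grid)

print(sup_diff)  # roughly 2e-3 for n = 100
```

The observed distance is a few thousandths for n = 100, in line with the rate suggested by the uniform bound above.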
Beta approximation
A classical example of convergence towards the Beta distribution is the Pólya-Eggenberger urn, which contains, at time zero, α ≥ 1 white and β ≥ 1 black balls; at every positive integer time a ball is chosen uniformly from the urn, independently of the past, and replaced along with m ≥ 1 additional balls of the same color. Let S_n^{α,β,m} be the number of white balls drawn from a Pólya-Eggenberger urn by time n. Then the law of S_n^{α,β,m} is given explicitly in terms of rising factorials, where (x)_0 = 1 and otherwise (x)_k = x(x + 1) · · · (x + k − 1) is the rising factorial (this distribution is also known as the Beta-binomial and the negative hypergeometric distribution), and it is known that W_n = S_n^{α,β,m}/n converges in distribution to Z, the Beta distribution with parameters α/m, β/m. Now let ∆_{1/n} f(x) = f(x + 1/n) − f(x) be the forward derivative on the set X_n = {0, 1/n, . . . , (n − 1)/n, 1} and let T_{α,β,n} be the canonical Stein operator for W_n (acting on functions with support X_n). It can be seen from [37] that the Stein kernel of the distribution of W_n is the quantity τ_n(x) = x(β/(nm) + 1 − x), and thus, from operator (33), we obtain (52) for all bounded f for which the expectations exist. We can therefore apply Corollary 2, which boils down to comparing relationships (52) and (51). This is exactly what has been done in [37] to obtain a bound of the correct order for the Wasserstein distance d_W(S_n/n, Z) between S_n/n and Z, with Z ∼ B(α/m, β/m); hence we do not repeat their calculations here. As in the previous examples, the quality of the bounds will depend on the properties of the solutions of the Stein equations, here given by pairs (f, g) such that f(x)g(x) = x^{−(α−1)}(1 − x)^{−(β−1)} ∫_0^x (h(u) − Eh(Z)) p(u; α, β) du, where Z ∼ p(·; α, β); in [37] the case g(x) = x^{−α}(1 − x)^{−β} is considered.
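The convergence W_n → Z can be illustrated directly from the rising-factorial formula for the law of S_n. In the sketch below (ours, not code from [37]) we parametrize the Beta-binomial by a = α/m and b = β/m, matching the limit Z ∼ B(α/m, β/m), and work in log scale to avoid overflow of the rising factorials.

```python
from math import exp, lgamma

alpha, beta_, m, n = 2, 1, 1, 500
a, b = alpha / m, beta_ / m  # Beta-binomial parameters matching B(alpha/m, beta/m)

# log of the rising factorial (x)_k = x(x+1)...(x+k-1), via the Gamma function.
log_rising = lambda x, k: lgamma(x + k) - lgamma(x)
log_comb = lambda n, k: lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# P(S_n = x) = C(n, x) (a)_x (b)_{n-x} / (a+b)_n, computed in log scale.
pmf = [exp(log_comb(n, x) + log_rising(a, x) + log_rising(b, n - x)
           - log_rising(a + b, n)) for x in range(n + 1)]

mean_W = sum(x / n * q for x, q in enumerate(pmf))
var_W = sum((x / n - mean_W) ** 2 * q for x, q in enumerate(pmf))

beta_mean = a / (a + b)                          # mean of B(a, b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))  # variance of B(a, b)
print(mean_W, beta_mean)  # equal: E[S_n/n] = a/(a+b) for every n
print(var_W, beta_var)    # close for large n
```

The mean of W_n matches the Beta mean exactly for every n, while the variance approaches the Beta variance at rate 1/n, consistent with the Wasserstein rate obtained in [37].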
Acknowledgements
We are grateful for many discussions about Stein's method which have helped shape part of this work. In particular, we thank Larry for his input on Section 2.6, Christian for the idea behind Section 3.6 and Oliver for the impetus behind the computations shown in Section 5.1.