Remarks on kernel Bayes' rule

Kernel Bayes' rule has been proposed as a nonparametric kernel-based method to realize Bayesian inference in reproducing kernel Hilbert spaces. However, we demonstrate both theoretically and experimentally that the predictions obtained by kernel Bayes' rule are in some cases unnatural. We consider that this phenomenon is partly due to the fact that the assumptions underlying kernel Bayes' rule do not hold in general.


1. Introduction
The kernel Bayes' rule has recently emerged as a novel framework for Bayesian inference (Fukumizu, Song, & Gretton, 2013; Song, Fukumizu, & Gretton, 2014; Song, Huang, Smola, & Fukumizu, 2009). It is generally agreed that, in this framework, we can estimate the kernel mean of the posterior distribution, given kernel mean expressions of the prior and likelihood distributions. Since the distributions are mapped into, and nonparametrically manipulated in, infinite-dimensional feature spaces called reproducing kernel Hilbert spaces (RKHS), it is believed that the kernel Bayes' rule can accurately evaluate the statistical features of high-dimensional data and enable Bayesian inference even when no appropriate parametric model is available. To date, several applications of the kernel Bayes' rule have been reported (Fukumizu et al., 2013; Kanagawa et al., 2014). However, the basic theory and the algorithm of the kernel Bayes' rule might need to be modified for the following reasons: (1) the posterior kernel mean computed by the kernel Bayes' rule can be completely unaffected by the prior distribution; (2) certain covariance operators appearing in the kernel Bayes' rule are not surjective in general, and the result of the regularized inversion introduced to compensate for this depends heavily on the regularization parameters; and (3) the assumption that conditional expectation functions are contained in the RKHS, on which most of the theorems rely, does not hold in general.

PUBLIC INTEREST STATEMENT
This paper examines the validity of the kernel Bayes' rule, a recently proposed nonparametric framework for Bayesian inference. Researchers working on the kernel Bayes' rule aim to apply it to a wide range of Bayesian inference problems. However, as we demonstrate in this paper, the way the prior is incorporated in the kernel Bayes' rule appears problematic in the context of Bayesian inference. Moreover, several theorems on the kernel Bayes' rule rely on a strong assumption which does not hold in general.
These problems seem to be nontrivial and difficult, and we currently have no solution for them. We hope that this study will trigger a reexamination and correction of the basic framework of the kernel Bayes' rule.
Due to the reproducing property of ℋ_𝒳 and ℋ_𝒴, the kernel means satisfy ⟨f, m_Π⟩_{ℋ_𝒳} = E[f(U)] and ⟨g, m_Q⟩_{ℋ_𝒴} = E[g(W)] for any f ∈ ℋ_𝒳 and g ∈ ℋ_𝒴.
THEOREM 2 (Fukumizu et al., 2013, Theorem 2) If C_XX is injective, m_Π ∈ Ran(C_XX), and E[g(Y) | X = ⋅] ∈ ℋ_𝒳 for any g ∈ ℋ_𝒴, then m^(ZW) and m^(WW) can be expressed by applying certain cross-covariance operators of the augmented variables to C_XX^{-1} m_Π. From Theorem 2, Fukumizu et al. (2013) proposed expressions for m^(ZW) and m^(WW) based on these identities. In case m_Π is not included in Ran(C_XX), they suggested that m^(ZW) and m^(WW) could be approximated by replacing C_XX^{-1} with the regularized inverse (C_XX + εI)^{-1}, where ε is a regularization parameter and I is the identity operator.
Remark 1 (Fukumizu et al., 2013, p. 3760) m^(ZW) and m^(WW) can respectively be identified with C_ZW and C_WW.
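For the reader's convenience, the following display records the form of these expressions as we recall them from Fukumizu et al. (2013); the notation for the augmented cross-covariance operators (and in particular the ordering of the factors) may differ from the original, so this should be read as a sketch rather than a verbatim restatement.

```latex
% Population expressions (Theorem 2) and their regularized approximations,
% as we recall them from Fukumizu et al. (2013).
\begin{gather*}
m^{(ZW)} = C_{(XY)X}\, C_{XX}^{-1}\, m_\Pi,
\qquad
m^{(WW)} = C_{(YY)X}\, C_{XX}^{-1}\, m_\Pi,\\
m^{(ZW)} \approx C_{(XY)X}\, (C_{XX} + \varepsilon I)^{-1} m_\Pi,
\qquad
m^{(WW)} \approx C_{(YY)X}\, (C_{XX} + \varepsilon I)^{-1} m_\Pi.
\end{gather*}
```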
Here, we introduce the empirical method for estimating the posterior kernel mean m_Q|y, following Fukumizu et al. (2013).
Definition 4 Suppose we have an independent and identically distributed (i.i.d.) sample {(X_i, Y_i)}_{i=1}^n from the observed distribution P on 𝒳 × 𝒴 and a sample {U_j}_{j=1}^l from the prior distribution Π on 𝒳. The prior kernel mean m_Π is estimated by m̂_Π = Σ_{j=1}^l γ_j k_𝒳(⋅, U_j), where γ_1, …, γ_l are weights. Let us put m̂_Π = (m̂_Π(X_1), …, m̂_Π(X_n))^T, G_X = (k_𝒳(X_i, X_j))_{1≤i,j≤n}, and G_Y = (k_𝒴(Y_i, Y_j))_{1≤i,j≤n}.
Proposition 1 (Fukumizu et al., 2013, Proposition 3, revised) Let I_n denote the identity matrix of size n. The estimates of C_ZW and C_WW are weighted empirical covariance operators whose weights are given by the vector μ = (μ_1, …, μ_n)^T = (G_X + nε I_n)^{-1} m̂_Π. The proof of this revised proposition is given in Section 6.1.
It is suggested in Fukumizu et al. (2013) that Equation (4) can be empirically estimated as follows.
THEOREM 3 (Fukumizu et al., 2013, Proposition 4) Given an observation y ∈ 𝒴, m_Q|y can be estimated by m̂_Q|y = k_X^T R_X|Y k_Y(y), where Λ = diag(μ_1, …, μ_n), R_X|Y = ΛG_Y((ΛG_Y)^2 + δ I_n)^{-1}Λ, k_X = (k_𝒳(⋅, X_1), …, k_𝒳(⋅, X_n))^T, and k_Y(y) = (k_𝒴(y, Y_1), …, k_𝒴(y, Y_n))^T. If we want to know the posterior expectation of a function f ∈ ℋ_𝒳 given an observation y ∈ 𝒴, it is estimated by (f(X_1), …, f(X_n)) R_X|Y k_Y(y).
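To make these matrix computations concrete, the following minimal sketch implements the estimator of Theorem 3 for one-dimensional data with Gaussian kernels; the kernel choice, the bandwidth sigma, and the toy data are placeholder assumptions of ours, not prescriptions from Fukumizu et al. (2013).

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gram matrix (k(a_i, b_j))_ij for the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2))."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kbr_posterior_weights(X, Y, U, gamma, y, eps, delta, sigma=1.0):
    """Weight vector w such that the estimated posterior kernel mean is sum_i w_i k(., X_i)."""
    n = len(X)
    G_X = gauss_gram(X, X, sigma)
    G_Y = gauss_gram(Y, Y, sigma)
    m_Pi = gauss_gram(X, U, sigma).dot(gamma)                  # (m_Pi(X_1), ..., m_Pi(X_n))^T
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_Pi)      # mu = (G_X + n*eps*I_n)^{-1} m_Pi
    Lam = np.diag(mu)
    LG = Lam.dot(G_Y)
    R = LG.dot(np.linalg.solve(LG.dot(LG) + delta * np.eye(n), Lam))  # R_{X|Y}
    return R.dot(gauss_gram(Y, np.array([y]), sigma)[:, 0])           # w = R_{X|Y} k_Y(y)

# Toy usage: the posterior expectation of f given y is approximated by (f(X_1), ..., f(X_n)) w.
np.random.seed(0)
X = np.random.randn(50); Y = X + 0.3 * np.random.randn(50)     # joint sample from P
U = np.random.randn(30); gamma = np.ones(30) / 30.0            # prior sample with uniform weights
w = kbr_posterior_weights(X, Y, U, gamma, y=0.5, eps=0.01, delta=0.01)
print((X * w).sum())   # estimated posterior expectation of f(x) = x given y = 0.5
```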

3. Theoretical arguments
In this section, we theoretically support the three arguments raised in Section 1. First, we show in Section 3.1 that the posterior kernel mean m_Q|y is completely unaffected by the prior distribution Π under the condition that Λ and G_Y are non-singular. This implies that, at least in some cases, Π does not properly affect m_Q|y. Second, we mention in Section 3.2 that the linear operators C_XX and C_WW are not always surjective, and address the problems associated with the setting of the regularization parameters ε and δ. Third, we demonstrate in Section 3.3 that conditional expectation functions are not generally contained in the RKHS, which means that Theorems 1, 2, and 5-8 in Fukumizu et al. (2013) do not work in some situations.

3.1. Relations between the posterior m_Q|y and the prior Π
Let us review Theorem 3. Assume that G_Y and Λ are non-singular matrices. (This assumption is not so strange, as shown in Section 6.2.) The matrix R_X|Y = ΛG_Y((ΛG_Y)^2 + δI)^{-1}Λ tends to G_Y^{-1} as δ tends to 0. Furthermore, if we set δ = 0 from the beginning, we obtain R_X|Y = G_Y^{-1}. This implies that the posterior kernel mean m̂_Q|y = k_X^T R_X|Y k_Y(y) does not depend at all on the prior distribution Π on 𝒳, which seems to contradict the nature of Bayes' rule.
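This can be checked numerically: with δ = 0, the matrix R_X|Y coincides with G_Y^{-1} regardless of which non-singular Λ, and hence which prior, is used. A minimal sketch with arbitrary placeholder matrices:

```python
import numpy as np

np.random.seed(0)
n = 5
A = np.random.randn(n, n)
G_Y = A.dot(A.T) + n * np.eye(n)                  # symmetric positive definite, hence non-singular
Lam1 = np.diag(np.random.uniform(0.5, 1.5, n))    # diagonal weight matrix for "prior" 1
Lam2 = np.diag(np.random.uniform(0.5, 1.5, n))    # diagonal weight matrix for "prior" 2

def R(Lam, G, delta):
    LG = Lam.dot(G)
    return LG.dot(np.linalg.solve(LG.dot(LG) + delta * np.eye(len(G)), Lam))

# With delta = 0, both choices of Lambda give the same matrix, namely G_Y^{-1}.
print(np.allclose(R(Lam1, G_Y, 0.0), np.linalg.inv(G_Y)))   # True
print(np.allclose(R(Lam2, G_Y, 0.0), np.linalg.inv(G_Y)))   # True
```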
Some readers may argue that, even in this case, we should not set δ = 0. Then, however, there is ambiguity about why and how the regularization parameters are introduced in the kernel Bayes' rule, since Fukumizu et al. originally used the regularization parameters just to solve inverse problems, as an analog of ridge regression (Fukumizu et al., 2013, p. 3758). They seem to support the validity of the regularization parameters by Theorems 5, 6, 7, and 8 in Fukumizu et al. (2013); however, these theorems do not work without the strong assumption that conditional expectation functions are included in the RKHS, as will be discussed in Section 3.3. In addition, since the theorems work only when ε_n and the other regularization constants decay to zero sufficiently slowly, it seems that we have no principled way to choose values for the regularization parameters, except for cross-validation or similar techniques. It is worth mentioning that, in our simple experiments in Section 4.2, we could not obtain a reasonable result with the kernel Bayes' rule using any combination of values for the regularization parameters.

3.2. The inverse of the operators C_XX and C_WW
As noted by Fukumizu et al. (2013), the linear operators C_XX and C_WW are not surjective in some usual cases; the proof is given in Section 6.3. Therefore, they proposed an alternative way of obtaining a solution f ∈ ℋ_𝒳 of the equation C_XX f = m_Π, that is, a regularized inversion f = (C_XX + εI)^{-1} m_Π as an analog of ridge regression, where ε is a regularization parameter and I is the identity operator. One of the disadvantages of this method is that the solution f = (C_XX + εI)^{-1} m_Π depends upon the choice of ε. In Section 4.2, we numerically show that the prediction using the kernel Bayes' rule depends considerably on the regularization parameters ε and δ. Theorems 5-8 in Fukumizu et al. (2013) seem to support the appropriateness of the regularized inversion. However, these theorems work under the condition that conditional expectation functions are contained in the RKHS, which does not hold in some cases, as proved in Section 3.3. Furthermore, since we need to assume sufficiently slow decay of the regularization constants ε and δ in these theorems, it is practically difficult to set appropriate values for ε and δ. A cross-validation procedure seems to be useful for tuning the parameters and may yield good experimental results; however, it seems to lack theoretical justification.
Instead of the regularized inversion method, we can compute generalized inverse matrices of G_X and ΛG_Y, given a sample {(X_i, Y_i)}_{i=1}^n. Below, we briefly introduce the generalized matrix inverse. For more details, see Horn and Johnson (2013).
Definition 5 Let A be a matrix of size m × n over the complex number field ℂ. We say that a matrix B of size n × m is a Moore-Penrose generalized inverse of A if ABA = A, BAB = B, (AB)* = AB, and (BA)* = BA, where * denotes the conjugate transpose.
Remark 2 In fact, any matrix A has the Moore-Penrose generalized inverse matrix A^†. Note that A^† is uniquely determined by A. If A is square and non-singular, then A^† = A^{-1}.
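As an illustration (our own, not part of the original text), numpy computes the Moore-Penrose generalized inverse via np.linalg.pinv, and the defining conditions above can be verified directly:

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(4, 6)          # a rectangular (hence non-invertible) matrix
A_pinv = np.linalg.pinv(A)         # Moore-Penrose generalized inverse A^+

# The four Penrose conditions.
print(np.allclose(A.dot(A_pinv).dot(A), A))                 # A A^+ A = A
print(np.allclose(A_pinv.dot(A).dot(A_pinv), A_pinv))       # A^+ A A^+ = A^+
print(np.allclose(A.dot(A_pinv), A.dot(A_pinv).T.conj()))   # (A A^+)^* = A A^+
print(np.allclose(A_pinv.dot(A), A_pinv.dot(A).T.conj()))   # (A^+ A)^* = A^+ A

# For a square non-singular matrix, A^+ coincides with the usual inverse.
B = np.random.randn(4, 4) + 4 * np.eye(4)
print(np.allclose(np.linalg.pinv(B), np.linalg.inv(B)))
```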

3.3. Conditional expectation functions and RKHS
In this subsection, we show that conditional expectation functions are in some cases not contained in the RKHS.

Definition 7
For a function f ∈ L^1(ℝ, ℂ) ∩ L^2(ℝ, ℂ), we define its Fourier transform f̂ (one standard convention is recorded below). We can uniquely extend the Fourier transform to an isometry ̂: L^2(ℝ, ℂ) → L^2(ℝ, ℂ), and we define the inverse Fourier transform ̌: L^2(ℝ, ℂ) → L^2(ℝ, ℂ) as the isometry uniquely determined by the inversion formula.
Definition 8 Let us define a Gaussian kernel k_G on ℝ. As described in Fukumizu (2014), the RKHS of real-valued functions and the RKHS of complex-valued functions corresponding to the positive definite kernel k_G, denoted ℋ_G and ℋ_G(ℝ, ℂ) respectively, are characterized in terms of the Fourier transform, and the inner product of f, g ∈ ℋ_G or f, g ∈ ℋ_G(ℝ, ℂ) on the RKHS is calculated by an integral involving f̂(ω) and the conjugate of ĝ(ω), where the overline denotes the complex conjugate. Remark that ℋ_G is a real Hilbert subspace contained in the complex Hilbert space ℋ_G(ℝ, ℂ).
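For concreteness, the following display records one standard version of these definitions, under the unitary convention for the Fourier transform and the parameterization k_G(x, y) = exp(-(x - y)^2/(2σ^2)); the constants change with the chosen convention, so this is a sketch of the standard characterization rather than a verbatim copy of the omitted displays.

```latex
% One standard convention (constants depend on the chosen normalization).
\begin{gather*}
\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} f(x)\, e^{-\sqrt{-1}\,\omega x}\, dx,
\qquad
k_G(x, y) = \exp\!\Bigl(-\frac{(x - y)^2}{2\sigma^2}\Bigr),\\
\mathcal{H}_G(\mathbb{R}, \mathbb{C})
  = \Bigl\{\, f \in L^2(\mathbb{R}, \mathbb{C}) :
      \int_{\mathbb{R}} |\hat{f}(\omega)|^2\, e^{\sigma^2 \omega^2/2}\, d\omega < \infty \,\Bigr\},\\
\langle f, g \rangle_{\mathcal{H}_G}
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \int_{\mathbb{R}} \hat{f}(\omega)\, \overline{\hat{g}(\omega)}\, e^{\sigma^2 \omega^2/2}\, d\omega.
\end{gather*}
```

Here ℋ_G consists of the real-valued members of ℋ_G(ℝ, ℂ). In particular, under this characterization a non-zero constant function is not even in L^2(ℝ), and hence not in ℋ_G, which is the fact used in the following paragraphs.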

Fukumizu et al. (2013) mentioned that the conditional expectation function E[g(Y) | X = ⋅] is assumed to be contained in ℋ_𝒳 for any g ∈ ℋ_𝒴 (see Theorem 2). However, if X and Y are independent, E[g(Y) | X = ⋅] becomes a constant function on 𝒳, the value of which might be non-zero. In the case that 𝒳 = ℝ and k_𝒳 = k_G, a constant function with a non-zero value is not contained in ℋ_𝒳 = ℋ_G.
Additionally, in order to prove Theorems 5 and 8 in Fukumizu et al. (2013), they made the assumption that conditional expectation functions of the form E[k_𝒴(Y, Ỹ) | X = ⋅, X̃ = ⋅] belong to the tensor product RKHS ℋ_𝒳 ⊗ ℋ_𝒳, where (X̃, Ỹ) and (Z̃, W̃) are independent copies of the random variables (X, Y) and (Z, W) on 𝒳 × 𝒴, respectively. We also see that this assumption does not hold in general. Suppose that X and Y are independent and that so are X̃ and Ỹ. Then E[k_𝒴(Y, Ỹ) | X = x, X̃ = x̃] is a constant function of (x, x̃), the value of which might be non-zero. In the case that 𝒳 = ℝ and k_𝒳 = k_G, a constant function having a non-zero value is not contained in ℋ_G ⊗ ℋ_G, which can be characterized, analogously to ℋ_G, in terms of the Fourier transform of functions f: ℝ^2 → ℝ. Thus, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since most of the theorems in Fukumizu et al. (2013) require this assumption, the kernel Bayes' rule may not work in several cases.

4. Numerical experiments
In this section, we perform numerical experiments to illustrate the theoretical results in Sections 3.1 and 3.2. In Section 4.1, we first introduce probabilistic classifiers based on conventional Bayes' rule assuming Gaussian distributions (BR), the original kernel Bayes' rule (KBR1), and the kernel Bayes' rule using Moore-Penrose generalized inverse matrices (KBR2). In Section 4.2, we apply the three classifiers to a binary classification problem with computer-simulated data-sets. The numerical experiments are implemented in version 2.7.6 of Python (Python Software Foundation, Wolfeboro Falls, NH, USA).

4.1. Algorithms of the three classifiers, BR, KBR1, and KBR2
Let (X, Y) be a random variable with a distribution P on 𝒳 × 𝒴, where 𝒳 = {C_1, …, C_g} is a family of classes and 𝒴 = ℝ^d. Let Π and Q be the prior and the joint distributions on 𝒳 and 𝒳 × 𝒴, respectively. Suppose we have an i.i.d. sample {(X_i, Y_i)}_{i=1}^n from the distribution P. The aim of this subsection is to derive the algorithms of the three classifiers, BR, KBR1, and KBR2, which respectively calculate the posterior probability of each class given an observation y ∈ 𝒴, that is, Q|y(C_1), …, Q|y(C_g).

4.1.1. The algorithm of BR
In BR, we estimate the posterior probability of the j-th class (j = 1, …, g) given a test value y ∈ 𝒴 by Q̂|y(C_j) = Π(C_j) P̂|C_j(y) / Σ_{k=1}^g Π(C_k) P̂|C_k(y), where P̂|C_j(⋅) is the density function of the d-dimensional normal distribution N(M̂_j, Ŝ_j), defined by P̂|C_j(y) = (2π)^{-d/2} det(Ŝ_j)^{-1/2} exp(-(y - M̂_j)^T Ŝ_j^{-1}(y - M̂_j)/2). The mean vector M̂_j ∈ ℝ^d and the covariance matrix Ŝ_j ∈ ℝ^{d×d} are calculated from the training data of the class C_j.
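A minimal sketch of BR in Python (the class labels, data layout, and the use of scipy's multivariate normal density are our own placeholder choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def br_posterior(y, train_by_class, prior):
    """Posterior class probabilities by conventional Bayes' rule with Gaussian class densities.

    train_by_class: dict mapping class label -> array of shape (n_j, d) of training points.
    prior:          dict mapping class label -> prior probability Pi(C_j).
    """
    scores = {}
    for c, data in train_by_class.items():
        M = data.mean(axis=0)                  # estimated mean vector M_j
        S = np.cov(data, rowvar=False)         # estimated covariance matrix S_j
        scores[c] = prior[c] * multivariate_normal.pdf(y, mean=M, cov=S)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```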

4.1.2. The algorithm of KBR1
Let us define positive definite kernels k_𝒳 on 𝒳 and k_𝒴 on 𝒴 for X, X′ ∈ 𝒳 and Y, Y′ ∈ 𝒴, and the corresponding RKHS as ℋ_𝒳 and ℋ_𝒴, respectively. Here we set ‖Y‖ = √(Σ_{i=1}^d y_i^2) for Y = (y_1, y_2, …, y_d)^T ∈ 𝒴 = ℝ^d. The prior kernel mean is then estimated as in Definition 4, and the weight matrix Λ and the matrix R_X|Y are computed as in Theorem 3, where I_n is the identity matrix of size n and ε, δ ∈ ℝ are heuristically set regularization parameters. Note that 1_A stands for the indicator function of a set A, that is, 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 otherwise. Following Theorem 3, the posterior kernel mean given a test value y ∈ 𝒴 is estimated by m̂_Q|y = k_X^T R_X|Y k_Y(y). Here, we estimate the posterior probability of each class C_j given a test value y ∈ 𝒴 by applying the posterior expectation formula of Theorem 3 to the indicator function 1_{{C_j}}, that is, by Σ_{i=1}^n [R_X|Y k_Y(y)]_i 1_{{C_j}}(X_i).
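The following self-contained sketch spells out one way to instantiate KBR1 for this classification setting; the delta kernel on the class labels, the Gaussian kernel on ℝ^d, and the bandwidth sigma are placeholder assumptions of ours, since the exact kernel definitions are not reproduced above.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """(exp(-|a_i - b_j|^2 / (2 sigma^2)))_ij for rows a_i of A and b_j of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kbr1_posterior(y, X_lab, Y_feat, prior, eps, delta, sigma=1.0):
    """KBR1-style class probabilities with a delta kernel on labels and a Gaussian kernel
    on features (placeholder choices).
    y: length-d test point, X_lab: length-n array of class labels,
    Y_feat: (n, d) feature matrix, prior: dict label -> Pi(C_j)."""
    n = len(X_lab)
    classes = sorted(prior)
    G_X = (X_lab[:, None] == X_lab[None, :]).astype(float)   # k_X(X_i, X_j) = 1{X_i = X_j}
    G_Y = gauss_gram(Y_feat, Y_feat, sigma)
    # Prior kernel mean at the training labels: m_Pi(X_i) = sum_j Pi(C_j) k_X(X_i, C_j) = Pi(X_i).
    m_Pi = np.array([prior[c] for c in X_lab])
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), m_Pi)
    Lam = np.diag(mu)
    LG = Lam.dot(G_Y)
    R = LG.dot(np.linalg.solve(LG.dot(LG) + delta * np.eye(n), Lam))   # R_{X|Y}
    w = R.dot(gauss_gram(Y_feat, y[None, :], sigma)[:, 0])             # weights R_{X|Y} k_Y(y)
    return {c: w[X_lab == c].sum() for c in classes}   # posterior expectation of 1_{C_j}
```

Note that, unlike the BR output, these values need not sum to one or even be non-negative.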

4.1.3. The algorithm of KBR2
Let G_X^† denote the Moore-Penrose generalized inverse matrix of G_X. Let us define μ′, Λ′, and R′_X|Y by replacing the regularized inverse matrices in KBR1 with the corresponding Moore-Penrose generalized inverse matrices. Replacing R_X|Y in Section 4.1.2 with R′_X|Y, the posterior probabilities of the classes given a test value y ∈ 𝒴 are estimated in the same way as in KBR1.
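One plausible reading of this construction (ours; the exact definitions are not reproduced above) is the following sketch, in which the two regularized inversions of KBR1 are replaced with numpy's Moore-Penrose pseudoinverse:

```python
import numpy as np

def kbr2_R(G_X, G_Y, m_Pi):
    """R'_{X|Y} with Moore-Penrose generalized inverses replacing the regularized inverses.
    This is our own reading of KBR2; the definition in the original text may differ in detail."""
    mu = np.linalg.pinv(G_X).dot(m_Pi)             # mu' = G_X^+ m_Pi, instead of (G_X + n*eps*I_n)^{-1} m_Pi
    Lam = np.diag(mu)
    return np.linalg.pinv(Lam.dot(G_Y)).dot(Lam)   # (Lam' G_Y)^+ Lam', instead of the delta-regularized R_{X|Y}
```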

4.2. Probabilistic predictions by the three classifiers
Here, we apply the three classifiers defined in Section 4.1 to a binary classification problem using computer-simulated data-sets, where 𝒳 = {C_1, C_2} and 𝒴 = ℝ^2. In the first step, we independently generate 100 sets of training samples from the simulated distribution. We then compute Q|y(C_1) with the three classifiers while varying the prior probability Π(C_1), for several settings of the regularization parameters, from small values (Figures 2 and 3) to larger values (Figure 4), including ε = δ = 10^{-1} (Figure 5). In Figures 2-5, BR_th represents the theoretical value of BR, which coincides with BR if the parameters M̂_1, M̂_2, Ŝ_1, and Ŝ_2 are set to be M_1, M_2, S_1, and S_2, respectively. Consistent with Section 3.1, Q|y(C_1) calculated by KBR1 is barely influenced by Π(C_1) compared with that by BR when ε and δ are set to be small (see Figures 2 and 3). In addition, Q|y(C_1) calculated by KBR2 also seems to be uninfluenced by Π(C_1). When ε and δ are set to be larger, the effect of Π(C_1) on Q|y(C_1) becomes apparent in KBR1; however, the value of Q|y(C_1) becomes too small (see Figures 4 and 5). These results suggest that, in the kernel Bayes' rule, the posterior does not depend on the prior if ε and δ are negligible, which might contradict the nature of Bayes' theorem. Moreover, even though the prior affects the posterior when ε and δ become larger, the posterior seems to depend too heavily on ε and δ, which are initially introduced just for the regularization of matrices.

5. Conclusions
One of the important features of Bayesian inference is that it provides a reasonable way of updating the probability of a hypothesis as additional evidence is acquired. The kernel Bayes' rule has been expected to enable Bayesian inference in RKHS. In other words, the posterior kernel mean has been considered to be reasonably estimated by the kernel Bayes' rule, given kernel mean expressions of the prior and likelihood. What is "reasonable" depends on circumstances; however, some of the results in this paper seem to show clearly unreasonable aspects of the kernel Bayes' rule, at least in the context of Bayesian inference.
First, as shown in Section 3.1, when Λ and G_Y are non-singular matrices and we accordingly set δ = 0, the posterior kernel mean m_Q|y is entirely unaffected by the prior distribution Π on 𝒳. This means that, in Bayesian inference with the kernel Bayes' rule, prior beliefs are in some cases completely neglected in calculating the kernel mean of the posterior distribution. Numerical evidence is also presented in Section 4.2. When the regularization parameters ε and δ are set to be small, the posterior probability calculated by the kernel Bayes' rule (KBR1) is almost unaffected by the prior probability, in comparison with that by conventional Bayes' rule (BR). Consistently, when the regularized inverse matrices in KBR1 are replaced with the Moore-Penrose generalized inverse matrices (KBR2), the posterior probability is also uninfluenced by the prior probability, which seems to be unsuitable in the context of Bayesian updating of a probability distribution. Second, as discussed in Sections 3.2 and 4.2, the posterior estimated by the kernel Bayes' rule depends considerably upon the regularization parameters ε and δ, which are originally introduced just for the regularization of matrices. A cross-validation approach is proposed in Fukumizu et al. (2013) to search for the optimal values of the parameters. However, the theoretical foundations seem to be insufficient for the correct tuning of the parameters. Furthermore, in our experimental settings, we were not able to obtain a reasonable result using any combination of the parameter values, suggesting the possibility that there are no appropriate values for the parameters in general. Thus, we consider it difficult to solve the problem that C_XX and C_WW are not surjective by just adding regularization parameters.
Third, as shown in Section 3.3, the assumption that conditional expectation functions are included in the RKHS does not hold in general. Since this assumption is necessary for most of the theorems in Fukumizu et al. (2013), we believe that the assumption itself may need to be reconsidered.
In summary, even though current research efforts are focused on applications of the kernel Bayes' rule (Fukumizu et al., 2013; Kanagawa et al., 2014), it might be necessary to reexamine its basic framework for combining new evidence with prior beliefs.

6. Proofs
In this section, we provide some proofs for Sections 2 and 3.

6.1. Estimation of C_ZW and C_WW
Here we give the proof of Proposition 1.

6.2. Non-singularity of G_Y and Λ
Here we show that the assumption in Section 3.1 holds under reasonable conditions.
Definition 9 Let f be a real-valued function defined on a non-empty open domain Dom(f) ⊆ ℝ^d. We say that f is analytic if f can be described by a Taylor expansion on a neighborhood of each point of Dom(f).
Proposition 2 Let k be a positive definite kernel on ℝ^d and let ν be a probability measure on ℝ^d which is absolutely continuous with respect to the Lebesgue measure. Assume that k is an analytic function on ℝ^d × ℝ^d and that the RKHS corresponding to k is infinite dimensional. Then for any i.i.d. random variables X_1, X_2, …, X_n with the same distribution ν, the Gram matrix G_X = (k(X_i, X_j))_{1≤i,j≤n} is non-singular almost surely with respect to ν^n = ν × ν × ⋯ × ν (n times).
Proof Let us put f(x_1, x_2, …, x_n) := det(k(x_i, x_j))_{1≤i,j≤n}. Since the RKHS corresponding to k is infinite dimensional, there are a_1, a_2, …, a_n ∈ ℝ^d such that {k(⋅, a_i)}_{1≤i≤n} are linearly independent. Then f(a_1, a_2, …, a_n) ≠ 0, and hence f is a non-zero analytic function. Note that any non-trivial subvariety of a Euclidean space defined by an analytic function has Lebesgue measure zero. By this fact, the subvariety V(f) := {(x_1, …, x_n) ∈ (ℝ^d)^n : f(x_1, …, x_n) = 0} has Lebesgue measure zero. Since ν is absolutely continuous, ν^n(V(f)) = 0. This completes the proof. ✷ From Proposition 2, we easily obtain the following corollary.
Corollary 1 Let k be a Gaussian kernel on ℝ^d and let X_1, X_2, …, X_n be i.i.d. random variables with the same normal distribution on ℝ^d. Then the Gram matrix G_X = (k(X_i, X_j))_{1≤i,j≤n} is non-singular almost surely.
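Corollary 1 can be illustrated numerically (this is our own illustration with arbitrary parameter choices): a Gaussian Gram matrix built from i.i.d. normal samples is of full rank, i.e. non-singular.

```python
import numpy as np

np.random.seed(2)
n, d, sigma = 20, 3, 1.0
X = np.random.randn(n, d)                                   # i.i.d. N(0, I_d) samples
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-d2 / (2.0 * sigma ** 2))                        # Gaussian Gram matrix

print(np.linalg.matrix_rank(G))        # 20: full rank, hence non-singular
print(np.linalg.eigvalsh(G).min())     # strictly positive smallest eigenvalue
```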
Proposition 3 Let k be a positive definite kernel on 𝒳 = ℝ^d and let ν be a probability measure on 𝒳 which is absolutely continuous with respect to the Lebesgue measure. Assume that k is an analytic function on 𝒳 × 𝒳 and that the RKHS ℋ corresponding to k is infinite dimensional. Then for any (ε, γ_1, γ_2, …, γ_l, U_1, U_2, …, U_l) ∈ ℝ_+ × ℝ^l × (ℝ^d)^l except for a set of Lebesgue measure zero, and for any i.i.d. random variables X_1, X_2, …, X_n with the same distribution ν, each μ_i for i = 1, 2, …, n is (defined almost surely and) non-zero almost surely, where (μ_1, μ_2, …, μ_n)^T = (G_X + nε I_n)^{-1} m̂_Π, m̂_Π = (m̂_Π(X_1), m̂_Π(X_2), …, m̂_Π(X_n))^T, and m̂_Π(⋅) = Σ_{j=1}^l γ_j k(⋅, U_j). Here ℝ_+ denotes the set of positive real numbers.
Corollary 2 Let k be a Gaussian kernel on ℝ^d and let X_1, X_2, …, X_n be i.i.d. random variables with the same normal distribution on ℝ^d. All other notations are as in Proposition 3. Then Λ := diag(μ_1, μ_2, …, μ_n) is non-singular almost surely for any (ε, γ_1, γ_2, …, γ_l, U_1, U_2, …, U_l) ∈ ℝ_+ × ℝ^l × (ℝ^d)^l except for those in a set of Lebesgue measure zero.

6.3. Non-surjectivity of C_XX and C_WW
The covariance operators C_XX and C_WW are not surjective in general. This can be verified by the fact that they are compact operators. (If the operators were surjective on the corresponding RKHS, which is infinite-dimensional, then they could not be compact, because of the open mapping theorem.) Here we present some easy examples where C_XX and C_WW are not surjective. Let us consider for simplicity the case 𝒳 = ℝ. Let X be a random variable on ℝ with a normal distribution N(μ_0, σ_0^2). We prove that C_XX is not surjective under the usual assumption that the positive definite kernel on ℝ is Gaussian. In order to demonstrate this, we use the symbols defined in Section 3.3 and several proven results on function spaces and Fourier transforms (see Rudin, 1987, for example). Note that the following three propositions are introduced without proofs.
Definition 10 Let p(⋅) denote the density function of the normal distribution N(μ_0, σ_0^2) on ℝ, that is, p(x) = (1/√(2πσ_0^2)) exp(-(x - μ_0)^2/(2σ_0^2)). Let X be a random variable on ℝ with distribution N(μ_0, σ_0^2). The linear operator C_XX: ℋ_G → ℋ_G is defined by ⟨g, C_XX f⟩_{ℋ_G} = E[f(X)g(X)] for any f, g ∈ ℋ_G, which is also described as (C_XX f)(⋅) = ∫_ℝ k_G(⋅, x) f(x) p(x) dx for any f ∈ ℋ_G.
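As a rough numerical illustration of the compactness argument above (entirely our own sketch with arbitrary parameters), the empirical counterpart of C_XX, namely the Gram matrix divided by n, has eigenvalues that decay rapidly towards zero; a compact positive operator with such a spectrum cannot map the infinite-dimensional space ℋ_G onto itself.

```python
import numpy as np

np.random.seed(3)
n, mu0, sigma0, sigma = 200, 0.0, 1.0, 1.0
X = mu0 + sigma0 * np.random.randn(n)                # sample from N(mu_0, sigma_0^2)
d2 = (X[:, None] - X[None, :]) ** 2
G = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian Gram matrix

eigvals = np.sort(np.linalg.eigvalsh(G / n))[::-1]   # spectrum of the empirical covariance operator
print(eigvals[:10])                                  # rapid decay towards zero
```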