Nonparametric Goodness of Fit via Cross-Validation Bayes Factors

A nonparametric Bayes procedure is proposed for testing the fit of a parametric model for a distribution. Alternatives to the parametric model are kernel density estimates. Data splitting makes it possible to use kernel estimates for this purpose in a Bayesian setting. A kernel estimate indexed by bandwidth is computed from one part of the data, a training set, and then used as a model for the rest of the data, a validation set. A Bayes factor is calculated from the validation set by comparing the marginal for the kernel model with the marginal for the parametric model of interest. A simulation study is used to investigate how large the training set should be, and examples involving astronomy and wind data are provided. A proof of Bayes consistency of the proposed test is also provided. MSC 2010 subject classifications: primary 62G10, 62F15; secondary 62G05.


Introduction
Nonparametric testing of the fit of a parametric model for a distribution has a long and rich history in frequentist statistics; see, e.g., Rayner et al. (2009). However, the literature on Bayesian goodness-of-fit tests is much smaller. Müller and Quintana (2004) and Tokdar et al. (2010) review some Bayesian approaches to goodness of fit, an important one being that of Berger and Guglielmi (2001) based on Pólya trees. More recently Tokdar and Martin (2013) have proposed a Bayesian test of normality versus a Dirichlet process mixture alternative. The purpose of the current paper is to introduce a Bayesian approach to goodness of fit that has the virtues of (i) simplicity, and (ii) transparency to users unfamiliar with the somewhat daunting notions of Dirichlet processes and Pólya trees.
Given data X ≡ (X_1, . . . , X_n) from a density f, a kernel estimator of f(x) has the form

f̂(x | X, h) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h),    (1)

where K is an appropriate kernel, typically a finite-variance, unimodal density that is symmetric about 0, and h is a positive bandwidth. Kernel estimators have reached the point of being a familiar means of describing the distribution of a data set. They are no more intimidating than their primitive cousin the histogram, and have the appeal of being smooth, like one envisions the density from which the data are drawn.
Kernel estimators are attractive in the nonparametric goodness-of-fit problem since they are nonparametric estimators of the underlying density. Given a fitted parametric density, it seems natural to test the fit of this density by seeing how close it is to a kernel estimate. Such an approach in the frequentist realm dates at least to Bickel and Rosenblatt (1973). From a Bayesian point of view, however, it is not immediately clear how kernel estimates could be used in the goodness-of-fit problem. The Bayesian approach requires models for the underlying density that are well defined prior to data collection. By its nature, though, a kernel estimate only becomes a model after the data are observed. Kernel estimates do, nonetheless, have a connection with Bayesian methodology. Ferguson (1983) showed that when n is large and K is a Gaussian density, estimate (1) approximates a posterior predictive density in a model where the densities are mixtures of normals and the prior for the parameters of the mixtures is a Dirichlet process.
In this paper we use data splitting to sidestep the problem that kernel estimates are not a priori models. Given a random sample X_1, . . . , X_n, suppose one randomly splits the data set into two parts, X_1 and X_2. Let f̂( · | X_1, h) be a kernel estimate computed from the data X_1. The key idea of this paper is that the collection of densities {f̂( · | X_1, h) : h > 0} comprises a parametric model (with parameter h) for the data X_2. To test the fit of a parametric model, such as normality, one may compute a Bayes factor based on marginal distributions of X_2 corresponding to {f̂( · | X_1, h) : h > 0} and the parametric model of interest. The kernel model for X_2 should be a reasonably good one, since X_1 and X_2 come from the same distribution. So, if the parametric model is wrong, the Bayes factor should favor the kernel model. If the parametric model is correct, then the parametric estimate is more efficient than the best kernel estimate, and the Bayes factor should favor the null model.
An appealing aspect of our approach is that it does not require specification of alternative parametric models. In a traditional Bayesian nonparametric test of a parametric hypothesis, the alternative would be a rich collection of parametric models. The prior distributions for such rich collections do not always reflect sensible beliefs about the alternative, and hence the resulting Bayes factor does not necessarily have the desirable properties that Bayes factors have when testing one parametric model versus another. In contrast, under the alternative our approach involves only one parameter, the bandwidth h of the kernel estimate, and so choosing a prior is relatively straightforward. If the null hypothesis is rejected in our approach, one may examine the kernel density estimate that led to this conclusion for guidance as to an appropriate parametric model for the underlying density. Such an approach has long been promoted by advocates of kernel estimates.
Our method based on data splitting has an obvious connection to cross-validation. Indeed, we term the Bayes factor computed from the partitioned data a cross-validation Bayes factor, or CVBF. Our idea may be regarded as a Bayesian analog of the method proposed by van der Laan et al. (2004) for choosing the bandwidth of a kernel estimator. The most common version of cross-validation in density estimation is leave-one-out, in which an estimate using all but one data value is evaluated at the deleted observation. In contrast, and using the notation of the previous paragraph, van der Laan et al. (2004) advocate computing a likelihood by evaluating f̂( · | X_1, h) at the observations in X_2, and then choosing the bandwidth that maximizes this likelihood. The use of cross-validation in Bayesian model selection has been previously considered by Alqallaf and Gustafson (2001).
The purpose of this paper is to explore the use of CVBFs for testing goodness of fit. The paper is by no means a comprehensive study of this idea. We do, however, establish conditions under which our test is Bayes consistent, and we present simulations and real data examples that indicate considerable promise for the method. The rest of the paper proceeds as follows. The method is described in detail in Section 2, and choosing a prior for the bandwidth of the kernel estimator is discussed in Section 3. The effect of the size of the training set X_1 is investigated by simulation in Section 4, and real data examples are the subject of Section 5. A theorem on Bayes consistency of the proposed goodness-of-fit test is provided in Section 6, and concluding remarks are given in Section 7.

The method
We describe the method in the context of checking the fit of a parametric model for a probability distribution. The method could also be applied to other testing problems, such as lack-of-fit in regression, although we do not pursue that possibility here. For n > k, suppose that X_1 = (X_{1,1}, . . . , X_{1,k}) and X_2 = (X_{2,1}, . . . , X_{2,n−k}) are independent random samples from density f, and define f̂( · | X_1, h) to be a kernel estimate as follows:

f̂(x | X_1, h) = (1/(kh)) Σ_{i=1}^k K((x − X_{1,i})/h),

where K is an appropriate kernel, typically a finite-variance, unimodal density that is symmetric about 0, and h is a positive bandwidth. Of interest is testing the fit of M_0, a parametric model for f. Given X_1, M_1(X_1) = {f̂( · | X_1, h) : h > 0} is another parametric family of densities for the data X_2. Assuming only smoothness of f, one can expect a member of M_1(X_1) to be relatively close to f, especially when k is large. So, we may test the fit of M_0 by computing a Bayes factor from the data set X_2 that compares M_0 and M_1(X_1). The posterior of h given X_2 is

π(h | X_2) = π(h) ∏_{i=1}^{n−k} f̂(X_{2,i} | X_1, h) / m(X_2 | M_1(X_1)),

where π( · ) is the prior for h and

m(X_2 | M_1(X_1)) = ∫_0^∞ π(h) ∏_{i=1}^{n−k} f̂(X_{2,i} | X_1, h) dh.

If m(X_2 | M_0) is the marginal of X_2 assuming that M_0 is the correct model, then a Bayes factor would be m(X_2 | M_1(X_1))/m(X_2 | M_0).
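To make the computation concrete, the following sketch evaluates log CVBF for a single random split, approximating the integral over h on a grid. This is an illustration of the recipe rather than any official implementation: the function names are our own, and the bandwidth prior and null-model marginal are assumed to be supplied by the user.

    import numpy as np
    from scipy import stats, integrate

    def kernel_loglik(x_valid, x_train, h):
        # Log-likelihood of the validation data under the kernel model
        # f_hat(. | X1, h) computed from the training data (Gaussian kernel K).
        dens = stats.norm.pdf((x_valid[:, None] - x_train[None, :]) / h).mean(axis=1) / h
        return np.sum(np.log(dens))

    def log_marginal_kernel(x_valid, x_train, prior_pdf, h_grid):
        # log m(X2 | M1(X1)): the kernel likelihood integrated against the
        # bandwidth prior, approximated by the trapezoidal rule on h_grid.
        logs = np.array([kernel_loglik(x_valid, x_train, h) + np.log(prior_pdf(h))
                         for h in h_grid])
        c = logs.max()  # log-sum-exp trick for numerical stability
        return c + np.log(integrate.trapezoid(np.exp(logs - c), h_grid))

    def log_cvbf(x, k, log_marginal_null, prior_pdf, h_grid, rng):
        # log Bayes factor for one random split: kernel model versus null model.
        idx = rng.permutation(len(x))
        x_train, x_valid = x[idx[:k]], x[idx[k:]]
        return (log_marginal_kernel(x_valid, x_train, prior_pdf, h_grid)
                - log_marginal_null(x_valid))

Averaging log_cvbf over many random splits gives the log of the geometric-mean Bayes factor discussed in the next paragraph.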
Given a single data set X = (X_1, . . . , X_n), the previous idea may be applied by randomly splitting the data set into two parts. Using the parlance of cross-validation, these two sets are called the training and validation sets. The kernel estimate is computed from the training set and the Bayes factor from the validation set. We shall refer to a Bayes factor computed in this way as a cross-validation Bayes factor, or CVBF. In principle one could obtain Bayes factors for all splits corresponding to a given k, and then compute either an arithmetic or geometric mean of all the Bayes factors. The idea of averaging Bayes factors has been used by Berger and Pericchi (1996) in defining intrinsic Bayes factors. Unless n is very small, it would be prohibitive to consider all (n choose k) splits for a given k. Instead, it seems to be sufficient to randomly choose a large number (say 1000 to 10,000) of random splits, compute a Bayes factor for each and average the results. We also believe it is worthwhile to consider the distribution of the Bayes factors.
What should one use for k? Ideally one would choose k so that, with high probability, the resulting Bayes factor is less than 1 under H_0 and considerably larger than 1 under alternatives. Calibration of CVBF may be performed by investigating how it behaves when data are generated from the null model. Doing so has somewhat the same flavor as methodology proposed by Xu et al. (2011), who suggest that data splitting be used to calibrate Bayes factors. Their approach uses a training sample to construct posteriors which are used as priors for models that are fitted to the validation data. The calibration is aimed at finding priors that contain a given amount of information. Calibration in our setting consists of determining a value of k such that CVBF has a very small probability of exceeding, say, 0.50 when the null hypothesis is true. Having determined a k that produces the desired CVBF behavior, one would then have to accept the resulting behavior in the event that H_0 is false. This is similar to what a frequentist does in choosing a test to have a small type I error probability, and then accepting the fact that test power will not be good unless the alternative is sufficiently discrepant from the null.
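Numerically, this calibration can be carried out as in the following sketch, which continues the code above; simulate_null (which draws a sample from the fitted null model) and log_cvbf_fn (a wrapper around log_cvbf with the priors fixed) are hypothetical user-supplied helpers.

    import numpy as np

    def calibrate_k(simulate_null, log_cvbf_fn, n, k_candidates,
                    n_sims=200, log_threshold=np.log(0.5), level=0.95, seed=0):
        # Keep the candidate training-set sizes k for which log(CVBF) stays
        # below the threshold with high probability under null-model data.
        rng = np.random.default_rng(seed)
        good = []
        for k in k_candidates:
            vals = [log_cvbf_fn(simulate_null(n, rng), k, rng)
                    for _ in range(n_sims)]
            if np.mean(np.array(vals) < log_threshold) >= level:
                good.append(k)
        return good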
Another consideration in choosing k is Bayes consistency of the test. In Section 6 we establish conditions under which a test based on CVBF for a single random split is Bayes consistent. A condition required to prove our result is that k tend to ∞, but at a much slower rate than n.

Prior for the bandwidth
In the sequel, we will always use a Gaussian kernel for K. The density estimation literature (e.g., Silverman, 1986) tells us that the Gaussian kernel is nearly optimal in a mean squared error sense. We do not claim that a similar property of the Gaussian kernel is true in the current context. However, the point of this paper is to introduce ideas and to demonstrate the potential of CVBFs, and hence we defer an investigation of kernel effect to future work.
Our experience with CVBFs indicates that choosing a good prior for the bandwidth h can be quite important. A good prior will produce stable Bayes factors that tend to be much less than 1 when the null hypothesis is true and much larger than 1 when the null hypothesis is false. For this purpose we have found that a prior of the following form works quite well:

π(h | β) = √(2/π) β h^{−2} exp(−β²/(2h²)) I_{(0,∞)}(h),    (2)

where β > 0. When π(h | β) is used as the prior for the bandwidth of kernel estimate f̂( · | X_1, h), we take β to be R/1.35, where R is the interquartile range of the validation data X_2. Using the validation data to tune the prior for h is consistent with the practice of using the same data set to compute parameters of reference priors and a Bayes factor.
Two aspects of prior (2) make it appealing. First of all, it is proper, which is a necessary condition for a Bayes factor to be well-defined. Secondly, the prior tends to 0 as h tends to 0. As a prior for the bandwidth of an observed kernel estimate, it seems that this should almost be a requirement, since one can plot the estimate for different values of h and identify an h below which the corresponding empirical kernel models are clearly unsatisfactory. On the other hand, if one uses prior (2) with the same β for all n, then for all n sufficiently large the prior is giving very low probability to values of h that are a priori the most likely under H_0. (This is because an asymptotically optimal choice of h tends to 0 as n tends to ∞.) Such a prior may seem odd, but is in the same spirit as the idea of a non-local prior espoused by Johnson and Rossell (2010). Suppose Θ_0 and Θ_1 are mutually exclusive and exhaustive subsets of a parameter space Θ. When testing H_0 : θ ∈ Θ_0 vs. H_1 : θ ∈ Θ_1, a local alternative prior, in the terminology of Johnson and Rossell (2010), is one that assigns nonzero probability to Θ_0, and a non-local alternative prior assigns probability 0 to Θ_0. Johnson and Rossell (2010) advocate the use of non-local alternative priors that are quite small for values of θ that are in Θ_1 but close to Θ_0. They prove that certain priors of this type can induce an exponential rate of convergence of a Bayes factor to 0 when H_0 is true, while not disturbing the exponential convergence rate of the Bayes factor under alternatives. We will show in Section 6 that a Bayes factor using prior (2) also converges to 0 at an exponential rate under H_0.
Another motivation for (2) comes from an idea as in Berger and Pericchi (1996) for circumventing the problem of using noninformative priors in model selection. Berger and Pericchi (1996) suggest using a small part of the data to compute a posterior based on a noninformative prior, and to then use this posterior as a prior for the remainder of the data. One uses the smallest number of observations such that the posterior for those observations is proper. Arguably, a reasonable noninformative prior for h would be the improper prior h^{−1} I_{(0,∞)}(h), which results from an invariance-under-scale-transformations argument. Suppose one computes a kernel density estimate using a Gaussian kernel and a single observation X_i. A posterior can be formed by then evaluating that estimate at an independent data value X_j, and multiplying by h^{−1}. This posterior is proportional to

h^{−2} exp(−(X_i − X_j)²/(2h²)),

which obviously has the same form as (2). Our data-driven choice of β² targets E[(X_i − X_j)²]/2 when the observations are Gaussian.
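In code, prior (2) and the data-driven choice of β are immediate; a minimal sketch, with function names of our own choosing (the normalizing constant √(2/π) β follows from the substitution u = β/h):

    import numpy as np
    from scipy.stats import iqr

    def bandwidth_prior_pdf(h, beta):
        # Prior (2): pi(h | beta) = sqrt(2/pi) * beta * h^(-2) * exp(-beta^2 / (2 h^2)).
        # It is proper and tends to 0 as h -> 0.
        return np.sqrt(2.0 / np.pi) * beta * h**-2 * np.exp(-beta**2 / (2.0 * h**2))

    def beta_from_validation(x_valid):
        # beta = R / 1.35, with R the interquartile range of the validation data.
        return iqr(x_valid) / 1.35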
Though our subsequent results will only use the prior (2), we have experimented extensively with other priors, for example gamma distributions with shape parameters less than 1. Our finding is that priors that do not tend to 0 as h tends to 0 produce CVBFs that are somewhat unstable. In particular, the CVBFs are sensitive to the choice of k. A choice of k that is acceptable under the null hypothesis is not necessarily a good choice under alternative hypotheses. In contrast, we show numerically in the next section that CVBFs based on (2) seem to perform well for any k under the null hypothesis, which means that one is free to choose k so that it is optimal for alternatives.

Choice of training set size
An important consideration in using CVBFs is the size k of the training set. Ideally one would consider this problem theoretically, but at this early stage of our investigation we study it using simulation. We consider testing the fit of two parametric models: normal and Laplace. For each data set generated, values of k equal to 0.20n, 0.30n, . . . , 0.80n were considered, and a CVBF for each k was computed by averaging the log(CVBF) values from 100 random data splits. Whether testing for a normal or Laplace density, data were generated from three different distributions. These were normal, Laplace and skew-normal distributions when testing for normality, and Laplace, normal and skew-Laplace when testing the fit of a Laplace density.
Two sample sizes, n = 200 and 500, were considered, and five hundred independent data sets were generated for each of the twelve combinations of n and testing/distribution scenario. The skew-normal density considered was

f(x) = 2φ(x)Φ(λx),

where φ and Φ are the pdf and cdf of a standard normal distribution, respectively, and λ is a fixed shape parameter, and the skew-Laplace density was

f(x) = (α/(α + β)) f_E(−x | α) I_{(−∞,0)}(x) + (β/(α + β)) f_E(x | β) I_{[0,∞)}(x),

where f_E( · | θ) is an exponential density with mean θ, and we take α = 1/4 and β = 1.
Whether testing the fit of a normal or Laplace distribution, our prior for the location and scale parameters, (μ, σ), has the form

π(μ, σ) = γ^{−2} g_1((μ − m)/γ) g_2(σ/γ),

where g_1 and g_2 are densities with respective supports the real line and the positive reals, m is a location estimate and γ a scale estimate. When testing normality, we take g_1 to be standard normal, m to be the sample mean and γ = σ̂/√2, where σ̂ is the sample standard deviation. The resulting prior is a unit-information reference prior, in that it is centered at the observed data and contains an amount of information equivalent to that in a single observation.
When testing the fit of a Laplace density, we take g_1 to be a standard Laplace distribution, g_1(x) = exp(−|x|)/2, and g_2 an inverse gamma distribution: g_2(s) = s^{−2} e^{−1/s} I_{(0,∞)}(s). For γ we use R/log 2, and for m the sample median. As in the normal case, this prior is a unit-information reference prior. Finally, as prior for h we use (2) with β = R/1.35. The data used in computing the parameters of the priors for both null and kernel models were always the validation data X_2, i.e., the data from which the Bayes factor is computed. It is important to note that defining the priors as we do makes our Bayes factors invariant to the location and scale of the density from which the data were generated.
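For concreteness, the data-driven prior inputs just described can be computed as in the following small helper (names of our own choosing):

    import numpy as np
    from scipy.stats import iqr

    def prior_inputs(x_valid, model):
        # Location m and scale gamma for the unit-information prior on (mu, sigma),
        # computed from the validation data.
        if model == "normal":
            return np.mean(x_valid), np.std(x_valid, ddof=1) / np.sqrt(2.0)
        if model == "laplace":
            return np.median(x_valid), iqr(x_valid) / np.log(2.0)
        raise ValueError("model must be 'normal' or 'laplace'")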
The results of our simulations are summarized in Figures 1-4. We desire two things from a Bayes factor: that it be smaller than 1 when the null hypothesis is true, and quite a bit larger than 1 when the null is not true. Of course, on the log scale, 0 is the value representing indifference between the two models. With this in mind, Figures 1-4 show that the average log(CVBF) always increased with k when the null hypothesis was true, but never exceeded 0. Jeffreys (1961) considered Bayes factors larger than √10 = 3.16 to be "substantial evidence" against H_0. When the null hypothesis was false, the only time the geometric mean of the CVBF was smaller than 3.16 was when testing normality at n = 200. This occurred with Laplace data at k = 40 and skew-normal data at k = 160, both of which are extreme choices for k. Under alternatives, the average log(CVBF) was either approximately flat or increasing between k = 0.2n and k = 0.4n, and then decreasing from k = 0.4n to 0.8n. This suggests that 0.4n is a good choice for k in the 8 scenarios where H_0 was false. At k = 0.4n, the smallest geometric mean of CVBF over these 8 scenarios was 16.81. On the other hand, at k = 0.4n the geometric mean of CVBF was no larger than 0.089 for any of the four scenarios where H_0 was true.

Figure 2: Averages of log-CVBF as a function of the training set size. The caption for this case is identical to that in Figure 1 except that n = 500.

Figure 3: Averages of log-CVBF as a function of the training set size. The null hypothesis tested is that the data are Laplace, and the sample size is n = 200. The solid, dashed and dotted lines correspond to Laplace, normal and skew-Laplace data, respectively. The horizontal line indicates Jeffreys' cutoff between substantial and insubstantial evidence against the null hypothesis.
On the basis of our simulation results, it seems reasonable to suggest using a value of k between 0.3n and 0.4n, at least for sample sizes between 200 and 500. Of course, much more work needs to be done on the problem of choosing the training set size. In order for CVBF to be fully efficient, it may be necessary that k/n tend to 0 as n tends to ∞. Only then is CVBF computed on a data set whose size is asymptotic to n. Our simulation results suggest that choosing k very small is not a problem when the null hypothesis is true. However, if H_0 is false, it is conjectured that k should tend to ∞, albeit at a rate slower than n.

Figure 4: Averages of log-CVBF as a function of the training set size. The caption for this case is identical to that in Figure 3 except that n = 500.
It would be interesting to develop a theory that would indicate an optimal rate at which k should grow with n. A possible definition of an optimal k is as follows. First, determine a set of k-values, call it K, such that for each k ∈ K the expected value of log(CVBF) under H_0 is no larger than a given threshold (which would certainly be less than 0). For a specific alternative, define the optimal k to be a member of K that maximizes the expected value of log(CVBF) when the data come from that alternative.

Analysis of planetary nebula luminosity data
Here we analyze observations of planetary nebula luminosity in the Messier 31, or Andromeda, galaxy. For background on the data analyzed, the reader is referred to Ciardullo et al. (1989) and Ciardullo et al. (2002). The frequency distribution of planetary nebula luminosities is referred to as a planetary nebula luminosity function, or PNLF. From a visual inspection of a histogram of Messier 31 data, Ciardullo et al. (1989) claimed that the PNLF of Messier 31 does not follow a power law. We wish to validate this claim using our CVBF.
The available data from Messier 31 are brightness readings from 238 planetary nebulae, which can be downloaded from http://astrostatistics.psu.edu/datasets/plan_neb.html. These readings are measured on a scale such that smaller values correspond to brighter objects. Ciardullo et al. (1989) note that dimmer readings are subject to more measurement error. Therefore, they used only observations smaller than 22 to fit the PNLF curve. In statistical parlance, their data are censored. Let X_1, . . . , X_n be the actual observations and Y_1, . . . , Y_n be the underlying data that are free of errors. When Y_i is at least 22, the reading is unreliable, and we record only the fact that the observation is 22 or larger.

Figure 5: Estimates of log-density of luminosity. The solid line is a fitted power law curve that maximizes the posterior density. The dashed line is the log of a kernel density estimate. Each estimate is computed from 61 luminosity readings smaller than 22.
We wish to test whether the PNLF curve follows a power law at brightnesses less than 22. Plots of kernel and power law estimates of log-density are shown in Figure 5. Of the 238 observed luminosities, only 61 were smaller than 22.
The null density of Y_i is such that P(Y_i ≥ 22) = p and, conditionally on Y_i < 22, Y_i has density

f_0(y | α) = (α/22) ((44 − y)/22)^{−α−1} I_{(−∞,22)}(y).    (3)

The parameters p and α are unknown, with α > 0. Let X_1 be an arbitrary subset of k of the 238 observations, and let X_2 be the other 238 − k observations. Under the null hypothesis, the likelihood function for X_2 is

L_0(p, α) = p^{n_T} (1 − p)^{238−k−n_T} ∏_{x_i ∈ S} f_0(x_i | α),

where n_T is the number of observations among X_2 that are at least 22 and S contains the observations in X_2 that are smaller than 22. Now let f̂( · | X_1, h) be a kernel estimate computed from the observations in X_1 that are smaller than 22. Treating such kernel estimates as a model for X_2, the kernel likelihood has the form

L_1(p, h) = p^{n_T} (1 − p)^{238−k−n_T} ∏_{x_i ∈ S} f̂(x_i | X_1, h).

If p is assumed a priori independent of h and α, then a Bayes factor is simply

BF = [∫_0^∞ ∏_{x_i ∈ S} f̂(x_i | X_1, h) π(h) dh] / [∫_0^∞ ∏_{x_i ∈ S} f_0(x_i | α) g(α) dα],

where g is the prior for α described below. In other words, the Bayes factor has exactly the same form it would have if the uncensored data were a random sample of size 238 − k − n_T from density (1 − p)^{−1} f(y) I_{(−∞,22)}(y), where f is the true density of Y_i. To compute the Bayes factor we assumed that the prior for h is π( · | β), as defined in Section 3. We used a gamma prior for α, which is a conjugate prior in this case, taking the shape parameter to be 2 and the rate parameter equal to M = Σ_{x_i ∈ S} log((44 − x_i)/22)/(238 − k − n_T). This prior has an amount of information equivalent to one observation from the power law density.
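Under density (3) and the conjugate gamma prior just described, the null-model marginal for the validation readings below 22 has a closed form; a minimal sketch (function name ours):

    import numpy as np
    from scipy.special import gammaln

    def log_marginal_power_law(x_uncensored):
        # Closed-form log marginal of the validation readings below 22 under the
        # power-law model, with the conjugate gamma(shape 2, rate M) prior for alpha:
        # integral of alpha^(m+1) exp(-alpha (M + T)) gives Gamma(m+2)/(M+T)^(m+2).
        u = (44.0 - x_uncensored) / 22.0   # u > 1 for readings smaller than 22
        m = len(x_uncensored)
        T = np.sum(np.log(u))
        M = T / m                          # unit-information rate parameter
        return (-m * np.log(22.0) - T + 2.0 * np.log(M)
                + gammaln(m + 2) - (m + 2) * np.log(M + T))

The kernel marginal in the numerator is computed exactly as in Section 2, using the reflected kernel estimate described next.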
As is well known, kernel estimates are subject to edge effects when the underlying density does not tend to 0 at the endpoints of its support. Such is the case here at x = 22 because of the data censoring. We thus used a reflection technique (Silverman, 1986, p. 30) to compute f̂( · | X_1, h). Suppose there are m observations in X_1 that are less than 22. If x_i is any one of these m, then the reflection technique involves defining a new data value larger than 22 that is the same distance from 22 as is x_i. One then computes a standard kernel estimate based on the new data set of 2m observations, and multiplies it by 2, to ensure that it integrates to 1 over (−∞, 22).
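A sketch of the reflected estimate just described, with a Gaussian kernel and reflection about 22 (names ours):

    import numpy as np
    from scipy import stats

    def kde_reflected(x_eval, x_train_below, h, c=22.0):
        # Reflect the m training observations below c about c, compute a standard
        # kernel estimate from the resulting 2m points, and multiply by 2 so that
        # the estimate integrates to 1 over (-inf, c).
        aug = np.concatenate([x_train_below, 2.0 * c - x_train_below])
        d = stats.norm.pdf((x_eval[:, None] - aug[None, :]) / h).mean(axis=1) / h
        return 2.0 * d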
Initially we considered the effect of using different choices for k, the size of the training set. At each of 18 choices for k ranging from 30 to 200, we considered 1000 random data splits, computed CVBF for each split, and then found the geometric mean of the 1000 Bayes factors. The results are summarized in Figure 6. The strongest evidence against H_0 occurred when k was 50, which is 21% of n = 238. The flat line is at √10, which Jeffreys (1961) deemed to be the smallest value indicating substantial evidence against the null hypothesis.
Using k = 50, we did a more detailed analysis. First we computed a Bayes factor for each of 10,000 randomly selected data splits. The geometric mean of these 10,000 Bayes factors was 164.23, a value generally regarded as strong evidence against the null hypothesis. An empirical quantile function for the logarithm of the 10,000 Bayes factors is shown in Figure 7. It is advisable to ensure that CVBF behaves appropriately when the null hypothesis is true. If it does, then one is justified in regarding a Bayes factor of 164.23 as evidence against the null hypothesis. We thus generated 1000 data sets, each of size 61, from density (3) with α = 31.48, which is the maximum likelihood estimate of α for the luminosity data. Each data set was augmented with 177 = 238 − 61 values larger than 22, and then 100 random splits with k = 50 were considered for each data set of size 238. (The values ascribed to data larger than 22 are inconsequential since only values less than 22 are used in computing CVBF.) The geometric mean of the 100 values of CVBF was computed for each data set. These 1000 means had a geometric mean of 0.0176, only 0.6% were larger than √10, and the largest was 17.95.

Wind direction data
Here we consider daily peak wind directions from Dallas, TX, USA for the year 2014. The available data consist of n = 364 directions in degrees, with 0 (or 360) degrees indicating a peak wind from due north. Defining W to be the reading in degrees, the data were transformed to radians as follows:

X = π/2 − πW/180, with 2π added whenever this quantity is less than −π, so that X ∈ (−π, π].

With this transformation northerly winds correspond to x = π/2 and southerly winds to −π/2.
The circularity of the data should be taken into account when estimating their distribution. One anticipates that the density of W is continuous at 270 degrees, which entails that the density, f, of X is such that f(−π) = f(π). This condition is ensured by using a kernel estimate of the following form:

f̂(x | X, h) = (1/n) Σ_{i=1}^n K_h(x − X_i),    (4)

where

K_h(θ) = Σ_{j=−∞}^{∞} (1/h) φ((θ + 2πj)/h),    (5)

with φ again denoting the standard normal pdf. Kernel (5) is a wrapped normal kernel (Mardia and Jupp, 2000), and the estimate (4) integrates to 1 over the interval (−π, π).
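In computation the infinite sum in (5) must be truncated; a minimal sketch (names ours, with the truncation level n_wrap an assumption that is more than ample once h is small relative to 2π):

    import numpy as np
    from scipy import stats

    def wrapped_normal_kernel(theta, h, n_wrap=10):
        # Kernel (5): the N(0, h^2) density summed over shifts of 2*pi*j.
        j = np.arange(-n_wrap, n_wrap + 1)
        return stats.norm.pdf((theta[..., None] + 2.0 * np.pi * j) / h).sum(axis=-1) / h

    def kde_circular(x_eval, x_train, h):
        # Estimate (4): average of wrapped kernels centered at the data; it
        # integrates to 1 over (-pi, pi) and satisfies f(-pi) = f(pi).
        return wrapped_normal_kernel(x_eval[:, None] - x_train[None, :], h).mean(axis=1)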
For descriptive purposes the bandwidth h of the kernel estimate was selected using the likelihood cross-validation method of van der Laan et al. (2004). The resulting kernel estimate is shown in Figure 8. The estimate shows that peak winds are usually from the north or south, with southerly peak winds being more frequent than northerly ones. A commonly used model in circular data applications is the von Mises distribution, which is a circular analog of the normal distribution. The von Mises density is

f(x | μ, κ) = exp(κ cos(x − μ)) / (2π I_0(κ)), −π ≤ x < π,

where −π ≤ μ ≤ π, κ > 0 and I_0 is the modified Bessel function of order 0. The maximum likelihood estimates (MLEs) of μ and κ using all of the wind data are −1.245 and 0.3982, respectively. Figure 8 shows the fitted von Mises density along with the kernel estimate.
We shall compute CVBFs for comparing kernel estimates with the von Mises model. The prior for h was taken to be (2) with β equal to a circular-data scale estimate computed from the validation data. The parameters μ and κ of the von Mises distribution were taken to be a priori independent, with μ being uniform on (−π, π) and κ having an exponential density with mean equal to the MLE of κ from the validation data. The size k of each training set was taken to be 146. To ensure that this choice of k produces Bayes factors that behave appropriately under the null hypothesis, we generated 200 random samples of size 364 from a von Mises distribution with μ = −1.245 and κ = 0.3982. For each data set, 50 random splits, each with k = 146, were considered. The geometric mean of the 50 CVBF values was computed for each of the 200 data sets. These 200 geometric means ranged between 0.080 and 0.738 and had a geometric mean of 0.237. We then considered 1000 random splits, each with k = 146, of the wind data and computed CVBF for each split. The smallest value of CVBF among the 1000 data splits was 3.48 · 10^11 and the geometric mean of the CVBF values was 7.51 · 10^20, providing overwhelming evidence against the von Mises model.

The von Mises model may seem like something of a straw man since the nonparametric estimate is strikingly bimodal. To make the goodness-of-fit problem more challenging, we thus consider testing the fit of a mixture of two von Mises distributions. Taking μ_1 ≤ μ_2, the model considered is

f(x | μ_1, μ_2, κ_1, κ_2, w) = w f_vM(x | μ_1, κ_1) + (1 − w) f_vM(x | μ_2, κ_2),

where f_vM( · | μ, κ) denotes the von Mises density. We use a prior in which κ_1, κ_2 and w are mutually independent, the pair (μ_1, μ_2) is independent of (κ_1, κ_2, w), w is uniform on (0, 1), each κ has a standard exponential density, and μ_1 and μ_2 are order statistics for a random sample of size 2 from the uniform density on (−π, π). Initially we used the Markov chain Monte Carlo (MCMC) method to determine a good point estimate of the mixture model. A multivariate normal proposal distribution was used. Several iterations of MCMC were used to tweak the parameters of the proposal, and ultimately excellent mixing was obtained. The mixture density corresponding to the average of 10,000 values generated from the posterior is shown in Figure 9. The mixture density matches the kernel estimate reasonably well.
We computed values of CVBF corresponding to training set sizes (k) of 73, 109, 146, 182, 218, 255 and 291. At each k we computed CVBF for 200 random splits. The marginal for the mixture model was determined (for each data split) by using importance sampling. Denoting the prior by π and letting θ = (μ_1, μ_2, κ_1, κ_2, w), the marginal for data x_1, . . . , x_m is approximated by

m̂(x_1, . . . , x_m) = (1/N) Σ_{i=1}^N [∏_{j=1}^m f(x_j | θ_i)] π(θ_i) / g(θ_i),

where θ_1, . . . , θ_N is a random sample from g, and g is a multivariate normal distribution whose mean and covariance matrix are determined from our initial MCMC analysis. The geometric mean of the 200 CVBF values at each k is shown in Figure 10. These values show that the von Mises mixture model is strongly favored over the kernel model.
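A sketch of this importance-sampling estimator follows; log_lik and log_prior stand for the mixture log-likelihood and log-prior and are assumed supplied, and working on the log scale guards against underflow:

    import numpy as np
    from scipy import stats

    def log_marginal_importance(x, log_lik, log_prior, mean, cov, N=10_000, seed=0):
        # Draw theta_1, ..., theta_N from the multivariate normal proposal g and
        # average likelihood * prior / proposal, as in the displayed estimator.
        # Draws outside the parameter space should get log-prior of -inf,
        # which contributes 0 to the average.
        rng = np.random.default_rng(seed)
        thetas = rng.multivariate_normal(mean, cov, size=N)
        g = stats.multivariate_normal(mean, cov)
        logs = np.array([log_lik(x, t) + log_prior(t) - g.logpdf(t) for t in thetas])
        c = logs.max()
        return c + np.log(np.mean(np.exp(logs - c)))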

Bayes consistency
In our goodness-of-fit context, Bayes consistency occurs if the Bayes factor tends to 0 in probability when H 0 is true and to ∞ when H 0 is false. Here we provide a theorem on consistency in a case where the underlying density is defined on a known, finite interval. While the finite support assumption is less than ideal, we at least provide a first rigorous step towards a theoretical justification of our methodology.
• The kernel K is a standard normal density.
• The priors for h and θ are π and p, respectively.

Assumptions.
1. X_1, . . . , X_n is a random sample from density f_0 that has known, finite support (a, b).

4. Under both null and alternative hypotheses the marginal for the parametric model is asymptotic to the following Laplace approximation:

m(X_2 | M_0) ∼ exp(ℓ(θ̂)) p(θ̂) (2π/(n − k))^{q/2} |I(θ̂)|^{−1/2},

where ℓ is the parametric log-likelihood for the validation data, θ̂ is the MLE, q is the dimension of θ, and I(θ) is the information matrix.

5. The MLE θ̂ converges in probability to θ_0 under both null and alternative hypotheses as n → ∞. Also, I and p are continuous at θ_0, and p(θ) is positive for all θ in a neighborhood of θ_0.
6. Under both null and alternative hypotheses, ℓ(θ̂) − ℓ(θ_0) converges in distribution to a positive random variable Y as n → ∞.
Theorem 1. Suppose that Assumptions 1-7 hold. If the null hypothesis is true, then BF_kn is bounded above by a random variable B_n satisfying

log B_n = −P n^{1−c(2α+4/5)} (1 + o_p(1))

as n → ∞, where P is a positive constant and 1 − c(2α + 4/5) > 1/2. If the alternative is true, then

log BF_kn = Qn + o_p(n)

as n → ∞, where Q is a positive constant.
Some remarks are in order concerning our theorem.
• Note that our proof is for the case where only one random split is used. Clearly, though, a Bayes factor based on the geometric mean of many random splits will perform even better.
• The Bayes factor analyzed actually uses a Riemann sum approximation to the marginal of the kernel model. This seems perfectly reasonable, however, since in practice numerical methods (such as a Riemann sum) must be used to approximate this marginal.
• It is noteworthy that under the null hypothesis our Bayes factor converges to 0 at a rate of exp(−P n^η) for some 1/2 < η < 1. This is in contrast to typical results when testing a parametric null hypothesis against a nonparametric alternative, where under the null hypothesis the Bayes factor converges to 0 at slower than an exponential rate. (See, for example, McVinish et al. (2009).) The reason for exponential convergence in our case is the inefficiency of the kernel estimate relative to the parametric model under H_0. This creates a sharper contrast between null and alternative models when using our methodology. This is a tangible benefit of using our method as opposed to one in which the null is a special case of a larger alternative model.
• Exponential convergence of Bayes factors under alternatives is typical, and we note that our method also has this desirable property.
• Our proof requires that k = n^c, where c must be less than 5/18.
• Finally, it is worth noting that our proof of consistency does not require using any of the optimality properties of likelihood cross-validation as proven by van der Laan et al. (2004). In particular, we do not need to assume (as do van der Laan et al. (2004)) that the density is bounded away from 0.
Proof. We begin with a lemma and its proof.
Lemma 1. The statistic I_k = min_{h ∈ H_M} I_{kh}² converges in probability to 1 as n → ∞.
Proof. We have

where Φ is the standard normal cdf, and hence I_{kh} is between 0 and 1 almost surely.
For an arbitrarily small positive number a, letting E_j denote E(I_{kh_j}), consider

We have

and hence

Using Assumption 2, it follows from the last expression that

where γ_1, . . . , γ_M are all bounded by the same number. Assumption 7 then implies that, for some positive constant C_1, |E_j − 1| ≤ C_1 k^{−β} for all j, and so for all k sufficiently large P(E_j − 1 < a) = 0 for j = 1, . . . , M.

Theorem 1 of Hoeffding (1963) implies that for 0 < t < E_j,

But for all k sufficiently large −a < E_j for j = 1, . . . , M, and therefore

Combining previous results and using (6) now implies that, for all k sufficiently large,

from which Lemma 1 follows since M = O(k) by Assumption 7.
Null case. Using Assumptions 4-6, it is enough to prove that

is bounded by a quantity that converges in probability to 0 at an exponential rate. To this end, for some positive constant C_2, the last quantity is bounded by

where f̄_0(x) is between f̂(x | h, X_T) and f_0(x).
If D is a bound on f_0, then for all x ∈ (a, b),

and hence

where ISE(h) = ∫_a^b (f̂(x | h, X_T) − f_0(x))² dx for all h, and ĥ_0 is the minimizer of ISE(h) over all h in an appropriately large set of which H_M is a subset. It follows that

We may use results of Hall and Marron (1987) to argue that either k^{3/4} ISE(ĥ_0) or k^{4/5} ISE(ĥ_0) converges in probability to a positive constant, depending on how f_0 behaves at the boundary of its support. We take the more conservative approach and assume that k^{4/5} ISE(ĥ_0) → A_1 > 0 in probability. From Lemma 1, C_k ∼ k^{−2α}/K(0)² in probability as k → ∞, and hence we may write

Combining previous results,

Since k = n^c, to guarantee consistency at an exponential rate we need 1 − c(2α + 4/5) > 1/2, or c < (1/2)(2α + 4/5)^{−1}, which holds by the condition on c in Assumption 7.
It is easy to establish that for all x ∈ (a, b)

and hence

We may now use Theorem 1 of Hoeffding (1963) to assert that

Importantly, the bound is free of X_T, and hence

Since d > 1/2, n^{2d}/(n − k) ∼ n^{2d−1} → ∞. Furthermore,

with the last inequality holding for all n sufficiently large. Therefore,

We need to choose c and d to satisfy 2d − 4cα > 1 and 1/2 < d < 1 − c(2α + 4/5). These two inequalities imply that 1/2 + 2cα < d < 1 − c(2α + 4/5), which requires 1/2 + 2cα < 1 − c(2α + 4/5). The last inequality is true under the condition imposed on c in Assumption 7, thus establishing the existence of an appropriate d and completing the proof of consistency in the null case.
Alternative case. Because of Assumptions 4 and 5, the result will be established by showing that

tends to ∞ at the stated exponential rate. The last quantity is at least

By Assumptions 2 and 3 and the result of Ahmad and Lin (1976),

Therefore, ℓ_k(h_1) − ℓ(θ_0) + log π(h_1) equals

which is equal to

Concluding remarks
We have proposed a Bayesian goodness-of-fit test based on ordinary kernel estimators. The idea that makes this possible is that kernel estimates indexed by bandwidth comprise a parametric model for the distribution of data that are not used in calculating the kernel estimates. This idea in conjunction with data splitting leads to straightforward calculation of Bayes factors that compare a specific parametric model with the kernel model. Prior specification for alternative models is straightforward since the only prior required is for the bandwidth of the kernel estimate. A particular prior for the bandwidth of a Gaussian-kernel estimator was proposed and shown by examples and simulations to perform very well. We also applied our method to a wrapped Gaussian-kernel estimator in a setting with circular data.
The ideas proposed herein for testing model fit easily extend to other settings. For example, the same method could be used to test the fit of a multivariate distribution, to test the fit of a parametric regression model, or to compare multiple curves in either k-sample goodness-of-fit or testing the equality of regression functions. A challenging but undoubtedly fascinating problem for future research would be to determine, at least asymptotically, optimal sizes for the training and validation sets.