Low Information Omnibus (LIO) Priors for Dirichlet Process Mixture Models

Dirichlet process mixture (DPM) models provide flexible modeling for distributions of data as an infinite mixture of distributions from a chosen collection. Specifying priors for these models in individual data contexts can be challenging. In this paper, we introduce a scheme which requires the investigator to specify only simple scaling information. This is used to transform the data to a fixed scale on which a low information prior is constructed. Samples from the posterior with the rescaled data are transformed back for inference on the original scale. The low information prior is selected to provide a wide variety of components for the DPM to generate flexible distributions for the data on the fixed scale. The method can be applied to all DPM models with kernel functions closed under a suitable scaling transformation. Construction of the low information prior, however, is kernel dependent. Using DPM-of-Gaussians and DPM-of-Weibulls models as examples, we show that the method provides accurate estimates of a diverse collection of distributions that includes skewed, multimodal, and highly dispersed members. With the recommended priors, repeated data simulations show performance comparable to that of standard empirical estimates. Finally, we show weak convergence of posteriors with the proposed priors for both kernels considered.


Introduction
The Dirichlet process mixture (DPM) model was first proposed by Lo (1984). The marginal distribution of a DPM is a convolution of a kernel density function and a Dirichlet process, $g(y) = \int f(y \mid G)\, DP(dG)$. This model uses the Dirichlet process (DP) of Ferguson (1973) effectively to estimate density functions even though the DP almost surely generates discrete distributions. The DPM model can also be written hierarchically as

$$y_i \mid \theta_i \sim f(\cdot \mid \theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim DP(\nu, G_0).$$

Here each observation $y_i$ arises from a density function $f(\cdot \mid \theta_i)$ with corresponding parameter $\theta_i$, which in turn arises from a discrete distribution $G$. The distribution $G$ is randomly generated from a DP with baseline distribution $G_0$ and concentration parameter $\nu$. The choice of kernel density $f(\cdot \mid \theta)$ determines the mixture components to use in a DPM; for example, if $f(\cdot \mid \theta)$ is a normal kernel, then this DPM is a mixture of Gaussians.
The Gaussian kernel was employed and computationally implemented by Escobar and West (1995). Kottas (2006) considered a mixture of Weibulls model for positive valued survival data. In contrast with much development of the DPM model itself in various directions, the prior specification for it is often undertaken in an ad-hoc fashion with little formal guidance available in the literature. The method proposed here attempts to address this gap in cases where prior information is scant or intentionally avoided in the analysis.
Using as input some simple scaling information for the data to be analyzed, we transform the data to a common axis on which we construct the prior. This transformation and axis depend on the kernel chosen for the DPM, as does the method of construction of the prior. Once constructed, the LIO prior is fully specified and can be used as a black box for the DPM with this kernel. Inference on the original data scale is recovered by a back transformation of the posterior samples. In the case of the Gaussian DPM, we carry out this construction for univariate as well as multivariate data.
The paper is organized as follows. Section 2 introduces general guiding objectives in constructing low information priors. Sections 3 and 4 apply these notions to the construction of particular prior specifications for Gaussian and Weibull DPMs, illustrating the priors' use through implementation on real and simulated data sets. Section 5 conducts sensitivity analysis and compares repeated data simulation results from the Gaussian and Weibull DPMs using the proposed priors with those from empirical methods. The comparison is intended primarily to demonstrate the low information nature of the prior, not to claim performance superiority. Any advantages over empirical methods accrue from well-recognized aspects of Bayesian nonparametric models, such as distributional flexibility and easy estimation of functionals of the underlying distribution (e.g., density and hazard rate) along with attendant uncertainty quantification for each. Finally, Section 6 establishes posterior weak convergence properties with the priors, while Section 7 concludes the paper with a brief discussion.

Rationale and Construction Outline for Low-information Omnibus (LIO) Priors
When applying a DPM model to data, the base distribution G 0 should be specified with care, as G 0 represents prior knowledge about the distribution of the data in an intricate combination with ν. One's first instinct might be to use a vague G 0 . However, it is well known that this is not advisable. For example, the authors of Chapter 23 of Gelman et al. (2014) point out that using such a choice of G 0 places "a heavy penalty on the introduction of new clusters". In effect, a highly dispersed choice of G 0 is highly informative, as it implies that all data points belong to a common cluster in the posterior predictive distribution. They recommend standardizing the data and using "an informative G 0 that places high probability on introducing clusters near the support of the data". Similar uses of data scaling and low information prior can be seen in parametric Bayesian data analysis. Gelman et al. (2008) suggested specific scaling and a low information prior that is "vague enough to be used as a default in routine applied work" instead of aiming for a no-information prior. The latter pursuit can be challenging both theoretically and computationally.
With this rationale, we propose a specific data-scaling that depends on the DPM kernel and a particular hierarchical specification of the prior on G 0 for the scaled observations, which jointly serve as a "black box" for various data contexts. The prior elicitation requires minimal scale-related information (such as a high percentile of the population distribution for the mixture-of-Weibulls model; and the median and 95th percentile for the mixture-of-Gaussians model) from the investigator knowledgeable in the subject matter. While fitting a particular DPM with an already constructed LIO prior might be seen as a black box method, deriving the prior for the kernel chosen requires much care, and is certainly not a black box. For some kernels, even a trial and error process with visual inspection might be needed (as for the Weibull kernel below) but only once per kernel.
In using the prior, there are three simple steps:
1. With the scaling information provided by the investigator, transform the data to a suitable fixed scale.
2. Apply the recommended LIO prior to the fixed-scale data and obtain posterior samples using established computational methods. The prior specification is aimed at providing a variety of mixture components rich enough to allow flexible modeling of observations on the fixed scale. We thus find a set of hyperparameters capable of generating such components. The process of finding these hyperparameters is discussed for two specific DPMs in the sequel.
3. Transform back the samples representing posterior inference to obtain originally targeted inference.
Currently, we have two such distinct black-box implementations. One is for the mixture-of-Gaussians model, which is well suited to modeling real-valued and vector-valued data. The other is for the mixture-of-Weibulls model, which is more appropriate for time-to-event data, as the Weibull distribution has a positive domain and convenient mathematical forms for interpretable functions such as the survival and hazard functions.
When considering the kernel density components needed for fixed-scale data, we keep a modest goal in sight: give a reasonable and rich variety of components a fair chance to be selected by the data. The DPM model itself is robust in that the information in the data will be dominant when the prior is sufficiently flexible. Specifics of prior construction are given in the next two sections. For all computational results reported here, we implemented inference with the proposed priors using Algorithm 8 of Neal (2000).

LIO Prior for DPM of Gaussian Distributions
A Dirichlet process mixture of Gaussian distributions is versatile in that it applies straightforwardly to univariate as well as multivariate data. Below, after establishing notation, we develop LIO prior specifications; first for univariate and then for multivariate data.

Model Specification
We use a Gaussian DPM model similar to that employed by the DPdensity function in the R package DPpackage (Jara et al., 2011). Assume $y_1, \ldots, y_n$ are conditionally iid vector observations, each of length $p$. Our approach is to make a location-scale transformation of the data, apply the DPM model to estimate the transformed data's distribution, and then estimate the original data's distribution by transforming back to the original scale. More specifically, we choose a vector $a \in \mathbb{R}^p$ and a positive definite $p \times p$ matrix $B$ to rescale the data as $z_i = B^{-1}(y_i - a)$. Then, the following model is fitted to the transformed data:

$$
\begin{aligned}
z_i \mid \mu_i, T_i &\sim No(\mu_i, T_i), \quad i = 1, \ldots, n,\\
(\mu_i, T_i) \mid G &\sim G,\\
G \mid \nu, G_0 &\sim DP(\nu, G_0),\\
G_0 \mid \lambda, \Psi &= NoWi(m_\mu, \lambda, k_T, \Psi),\\
\lambda &\sim Ga(a_\lambda, b_\lambda),\\
\Psi &\sim Wi(k_\psi, W_\psi),\\
\nu &\sim Ga(a, b).
\end{aligned}
$$

Here $No(m, U)$ denotes a normal distribution with mean $m$ and precision matrix $U$, and $Ga(a, b)$ denotes a Gamma distribution with shape parameter $a$ and rate parameter $b$. With $Wi(k, W)$ denoting a Wishart distribution with degrees of freedom $k$ and inverse scale matrix $W$ (expectation $kW^{-1}$), $G_0$ has a hierarchical specification, the first level being a normal-Wishart distribution with parameters $m_\mu$, $\lambda$, $k_T$, and $\Psi$ and the second level having independent Gamma and Wishart distributions for $\lambda$ and $\Psi$, respectively. To be specific, $(\mu, T) \sim NoWi(m, \lambda, k, \Psi)$ means $\mu \mid T, \lambda \sim No(m, \lambda T)$ and $T \mid \Psi, k \sim Wi(k, \Psi)$. Because the support of the Wishart distribution is the set of $p \times p$ positive definite matrices, all $T_i$ obtained from this model are positive definite. The concentration parameter $\nu$ is set to have a $Ga(a, b)$ prior with $a = 1$ and $b = 1$ (Escobar and West, 1995).
As the $z_i$ arise from an infinite mixture of normal distributions, their cumulative distribution function (CDF) is

$$F_z(z) = \sum_{i=1}^{\infty} p_i\, \Phi_p\{\Lambda_i^{\top}(z - \mu_i)\},$$

where $\Phi_p$ is the CDF of a $p$-variate normal distribution $No(0, I)$, $\Lambda_i$ results from the unique Cholesky decomposition $T_i = \Lambda_i \Lambda_i^{\top}$, and $\sum_{i=1}^{\infty} p_i = 1$. By the correspondence between the $y_i$ and $z_i$, this implies that the original data's distribution is an infinite mixture of normal distributions with CDF

$$F_y(y) = \sum_{i=1}^{\infty} p_i\, \Phi_p\{\Lambda_i^{\top}(B^{-1}(y - a) - \mu_i)\}.$$

Thus, fitting this model to the transformed data induces a DPM model on the original data and provides an estimate of its CDF. Through this, one can estimate any functional of the distribution of the original data through posterior sampling of $\theta_i = (\mu_i, T_i)$.

We want to transform the data to a common location and dispersion; this will justify applying the same model to all transformed data sets. In our location-scale transformation, $a$ and $B$ are measures of the location and scale of the original data that need specification. We employ contextual choices of some quantiles of the data's underlying distribution. The investigator supplies values $c_k$ and $d_k$ that are reasonable pre-data estimates of the median and the 95th percentile of the $k$th component $y_{1k}$ of the data vector. These percentiles are natural quantities to consider and should facilitate elicitation based on existing results or expert opinion. The scale (or standard deviation) of the $k$th component can be estimated roughly by $(d_k - c_k)/2$, so we set $a = c$ and $B = \mathrm{Diag}\{(d - c)/2\}$. The transformation $z_i = B^{-1}(y_i - a)$, then, is a standardization of the data based on the investigator's input. Finally, we state a theorem, a corollary of Theorem 3 in Ferguson (1973), that is of use in the next two sections.
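The standardization just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper name `make_scaler` and the numerical medians and 95th percentiles are hypothetical, chosen only to show the forward map $z = B^{-1}(y - a)$ with $a = c$, $B = \mathrm{Diag}\{(d - c)/2\}$, and its inverse.

```python
def make_scaler(medians, p95s):
    """Return (to_fixed, to_original) for z = B^{-1}(y - a).

    a = c (elicited medians) and B = Diag{(d - c)/2} (d = elicited 95th
    percentiles), as in the text. The function name is hypothetical.
    """
    if len(medians) != len(p95s):
        raise ValueError("need one median and one 95th percentile per component")
    scales = [(d - c) / 2.0 for c, d in zip(medians, p95s)]
    if any(s <= 0 for s in scales):
        raise ValueError("each 95th percentile must exceed the median")

    def to_fixed(y):
        # Componentwise standardization of one data vector.
        return [(yk - c) / s for yk, c, s in zip(y, medians, scales)]

    def to_original(z):
        # Inverse map, used to carry posterior draws back to the data scale.
        return [c + s * zk for zk, c, s in zip(z, medians, scales)]

    return to_fixed, to_original


# Hypothetical elicited values for a bivariate example.
to_fixed, to_original = make_scaler(medians=[40.0, 200.0], p95s=[110.0, 320.0])
y = [75.0, 260.0]
z = to_fixed(y)
y_back = to_original(z)
```

Because $B$ is diagonal, both maps act componentwise, so posterior draws of means and precisions on the fixed scale can be carried back one coordinate at a time.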

Hyperparameter Selection for Scalar Data
We first consider the scalar data case, where $p = 1$. The DPM model requires choosing 6 scalar hyperparameters of the distribution of $G_0$: $m_\mu$, $k_T$, $a_\lambda$, $b_\lambda$, $k_\psi$, and $W_\psi$. We consider the standardization of the data in choosing values for the prior moments of $G$.
Having specified these moments, which are functions of the hyperparameters, we can solve for the hyperparameters themselves. Given its parameters $\theta_i$, the distribution of a data point $z_i$ is normal with mean $\mu_i$, precision $T_i$, and variance $T_i^{-1}$. Because the data are standardized, we expect that, on average, these means are near 0 and variances are near 1. Thus, we set the expectations of $\mu_i$ and $T_i^{-1}$ equal to these values:

$$E(\mu_i) = m_\mu = 0 \qquad (1)$$

and

$$E(T_i^{-1}) = \frac{k_\psi}{(k_T - 2)\, W_\psi} = 1. \qquad (2)$$

Next, we desire for the $\mu_i$ drawn from the prior distribution to lie near any of the standardized data points. That is, we choose the prior variance of $\mu_i$ to be large enough so that the spread of the $\mu_i$'s matches, a priori, the spread of the standardized data:

$$Var(\mu_i) = E(\lambda^{-1})\, E(T_i^{-1}) = \frac{b_\lambda}{a_\lambda - 1} = v^2. \qquad (3)$$
To choose $v$, we appeal to Chebyshev's inequality. Since we are concerned with the spread of the standardized data, we apply this inequality to its empirical distribution, which has mean 0 and variance $(n - 1)/n$. This gives

$$\frac{1}{n} \sum_{i=1}^{n} 1\{|z_i| \le c\} \ge 1 - \frac{n - 1}{n c^2}.$$

Suppose we require that the left hand side is at least a proportion $\pi \in (0, 1)$. Chebyshev's inequality implies that choosing $c = \sqrt{(n - 1)/[n(1 - \pi)]}$ satisfies this condition that the proportion $\pi$ of the $z_i$ will fall in $[-c, c]$. Now, $\mu_0 \mid T_0, \lambda$ has a normal distribution, so we expect that $\pi$ of its density lies within $d = z_{1-(1-\pi)/2}$ standard deviations of its mean, $m_\mu = 0$. From (3), we have $E\{Var(\mu_0 \mid T_0, \lambda)\} = v^2$, so we expect $\pi$ of the area under the probability density of $\mu_0 \mid T_0, \lambda$ to lie in $[-dv, dv]$. Matching this interval to $[-c, c]$ by setting $dv = c$ gives $v = c/d$.
To capture most of the data in this range and to ensure that newly sampled $\mu_i$'s lie near these data points, we would choose $\pi$ to be large, say 95% or 99%. Experimentation suggests that $\pi = 99\%$ works well for a wide range of data distributions, so we recommend this value. Also, the factor $(n - 1)/n$ can be replaced by 1, as this inflates $v$ by only a small amount for most practical sample sizes. Now, we have three equations (1)-(3) and two constraints, namely $k_T > 2$ and $a_\lambda > 1$, for the six hyperparameters $m_\mu$, $k_\psi$, $k_T$, $W_\psi$, $a_\lambda$, $b_\lambda$. While (1) yields $m_\mu = 0$, it is unclear how to choose the others exactly. However, smaller values of $k_\psi$, $k_T$ and $a_\lambda$ give less informative priors for the corresponding Gamma and Wishart distributed parameters. A choice of $a_\lambda = 3/2$ implies $\lambda$ has a scaled $\chi^2$ distribution with 3 degrees of freedom, the minimal integer degrees of freedom that give $a_\lambda > 1$. Similarly, in the case $p = 1$, $Wi(k, W)$ is a scaled $\chi^2$ distribution with $k$ degrees of freedom. Then 3 is the minimal integer degrees of freedom that will satisfy the constraint $k_T > 2$, so we set $k_T = 3$ and $k_\psi = 1$. With these choices and $v$ as chosen above, (1)-(3) give unique values for $m_\mu$, $b_\lambda$, and $W_\psi$, completing the prior specification.
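As a rough numerical check of the scalar recipe, the sketch below computes $c$, $d$ and $v$ for a given $\pi$ and then solves the moment conditions for $W_\psi$ and $b_\lambda$. The closed forms used for the last two, $W_\psi = k_\psi/(k_T - 2)$ and $b_\lambda = (a_\lambda - 1)v^2$, are our own derivation from the stated parameterizations (Gamma with rate parameter, Wishart with inverse scale matrix) and should be read as an assumption, not as values quoted from the text. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def scalar_lio_hyperparameters(pi=0.99, n=None, a_lambda=1.5, k_T=3, k_psi=1):
    """Solve the scalar moment conditions for the remaining hyperparameters.

    n=None replaces the factor (n - 1)/n by 1, as suggested in the text.
    """
    shrink = 1.0 if n is None else (n - 1) / n
    c = sqrt(shrink / (1.0 - pi))                     # Chebyshev half-width
    d = NormalDist().inv_cdf(1.0 - (1.0 - pi) / 2.0)  # normal quantile
    v = c / d                                         # from d * v = c
    # Assumed closed forms (derived here, not quoted from the paper):
    W_psi = k_psi / (k_T - 2)          # E(T^{-1}) = k_psi / ((k_T - 2) W_psi) = 1
    b_lambda = (a_lambda - 1.0) * v ** 2  # Var(mu) = b_lambda / (a_lambda - 1) = v^2
    return {"m_mu": 0.0, "c": c, "d": d, "v": v,
            "a_lambda": a_lambda, "b_lambda": b_lambda,
            "k_T": k_T, "k_psi": k_psi, "W_psi": W_psi}


hyper = scalar_lio_hyperparameters()
```

With $\pi = 0.99$ and the factor $(n-1)/n$ replaced by 1, this gives $c = 10$ and $v$ roughly 3.88, which agrees with the interval $[-10, 10]$ for component centers mentioned later in this section.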

Hyperparameter Selection for Vector Data
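Here again, we need to specify 6 hyperparameters; the only changes are that $m_\mu$ is a vector and $W_\psi$ is a matrix. Similar to the univariate case, on the average we expect $z_i \mid \mu_i, T_i$ to have mean close to 0 and covariance matrix close to $I$, since the data are standardized. This implies

$$E(\mu_i) = m_\mu = 0, \qquad (4)$$

$$E(T_i^{-1}) = \frac{k_\psi}{k_T - p - 1}\, W_\psi^{-1} = I, \qquad (5)$$

and

$$Var(\mu_i) = E(\lambda^{-1})\, E(T_i^{-1}) = \frac{b_\lambda}{a_\lambda - 1}\, I, \qquad (6)$$

provided $a_\lambda > 1$ and using (4), (5), and prior independence of $\lambda$ and $T_0$.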
The empirical distribution of the standardized data has mean 0 and covariance matrix $I(n - 1)/n$. Applying a multivariate version of Chebyshev's inequality (Chen, 2007) to the empirical distribution, we get

$$\frac{1}{n} \sum_{i=1}^{n} 1\{\|z_i\| \le c\} \ge 1 - \frac{p(n - 1)}{n c^2}.$$

To ensure that the Euclidean length of the $z_i$ is within $c$ units of the origin for a proportion $\pi$ of the data, we set $c = \sqrt{p(n - 1)/[n(1 - \pi)]}$. We also set $Var(\mu_0) = v^2 I$, so we expect $\pi$ of the volume under the density of $\mu_0$ to lie within Euclidean distance $dv$ of the origin for some $d > 0$. Then, on the average, $Var(\mu_0 \mid T_0, \lambda) = v^2 I$. Therefore, we set $d = \sqrt{\chi^2_{p,\pi}}$, where $\chi^2_{p,\pi}$ is the $\pi$ quantile of the chi-square distribution with $p$ degrees of freedom. Setting $dv = c$ as before, we get

$$v = \sqrt{\frac{p(n - 1)}{n\, \chi^2_{p,\pi}\, (1 - \pi)}}.$$

Similar to the univariate case, we set $a_\lambda = 3/2$, $k_T = p + 2$ and $k_\psi = p$, the minimal integer degrees of freedom that satisfy $k_T > p + 1$ as required. Then we can obtain $m_\mu$, $b_\lambda$, and $W_\psi$ from equations (4)-(6). Using the fact that $\chi^2_{1,\pi} = z^2_{1-(1-\pi)/2}$ for any $\pi$, it is easy to see that the choice of hyperparameters for the vector data case reduces to the scalar case when $p = 1$.
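The dependence of $v$ on the dimension $p$ can be explored with the following standard-library sketch. It is only an illustration: the chi-square quantile is obtained by bisection on the closed-form CDF for integer degrees of freedom (a convenience chosen here, not something prescribed in the text), and the function names are hypothetical.

```python
from math import erf, exp, gamma, sqrt


def chi2_cdf(x, p):
    """Chi-square CDF for integer degrees of freedom p >= 1 (closed form)."""
    if x <= 0:
        return 0.0
    h = x / 2.0
    if p % 2 == 0:
        s = sum(h ** j / gamma(j + 1) for j in range(p // 2))
        return 1.0 - exp(-h) * s
    s = sum(h ** (j - 0.5) / gamma(j + 0.5) for j in range(1, (p - 1) // 2 + 1))
    return erf(sqrt(h)) - exp(-h) * s


def chi2_quantile(prob, p):
    """Invert chi2_cdf by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, p) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, p) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lio_v(p, pi=0.99, n=None):
    """v = c / d with c from multivariate Chebyshev and d = sqrt(chi2 quantile)."""
    shrink = 1.0 if n is None else (n - 1) / n
    c = sqrt(p * shrink / (1.0 - pi))
    d = sqrt(chi2_quantile(pi, p))
    return c / d


v_by_dim = {p: lio_v(p) for p in (1, 2, 3, 5)}
```

For $p = 1$ the result coincides with the scalar value $c/d$ built from the normal quantile, which is the reduction noted at the end of the preceding paragraph.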

A Different View: Prior Specification on Mixture Components
In the preceding, we derived a prior for $G$ by placing constraints on moments of its distribution. This, in turn, places a prior on the $\theta_i$, since $\theta_i \mid G \sim G$. From another viewpoint, we have specified a prior for the normal mixture components $f(\cdot \mid \theta_i)$. We wish to have mixture components that are suitable for density estimation of the standardized data. Because the majority of data points will lie near 0, we set $E(\mu_i) = 0$ and $Var(\mu_i) = v^2 I$ in order to ensure that, a priori, most mixture components are centered near 0. Setting $E(T_i^{-1}) = I$ places a constraint on how dispersed the components are, providing mixture components that are, on the average, neither extremely dispersed nor extremely concentrated. Figure 1 shows two sets of randomly generated mixture components from our prior in the scalar data case; each plot contains 50 components. To obtain the components, we generated a sample of $\theta_i$ from $G$ using the stick-breaking procedure in Sethuraman (1994). The black line shows the height of a standard normal density at 0 and is included as a benchmark. We see many mixture components centered near the origin, in the range $[-5, 5]$; this includes both sharply peaked and more diffuse curves. By Chebyshev's inequality, at least 96% of standardized data points will lie in $[-5, 5]$, so this set of components will be useful for estimating density at points near the origin. A few curves, including sharply peaked ones, are centered outside of the range $[-5, 5]$ and help to estimate the density at outliers. Our specification intends for 99% of mixture components to be centered in $[-10, 10]$; in these plots, 98% of the components are centered there.
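A truncated stick-breaking draw of scalar mixture components, in the spirit of the procedure used for Figure 1, might look as follows. This is a sketch under assumptions: the hyperparameter values ($v \approx 3.88$, $b_\lambda = (a_\lambda - 1)v^2$, $W_\psi = 1$) follow our reading of the moment conditions, the one-dimensional Wishart $Wi(k, W)$ is sampled as a Gamma with shape $k/2$ and rate $W/2$, and the function name and truncation at 50 atoms are hypothetical choices.

```python
import random


def draw_prior_components(n_components=50, v=3.882, a_lambda=1.5,
                          k_T=3, k_psi=1, W_psi=1.0, seed=0):
    """Draw (weight, mean, precision) triples from a truncated stick-breaking prior."""
    rng = random.Random(seed)
    b_lambda = (a_lambda - 1.0) * v ** 2           # assumed solution of Var(mu) = v^2
    nu = rng.gammavariate(1.0, 1.0)                # concentration, Ga(1, 1)
    # random.gammavariate takes (shape, scale); scale = 1 / rate.
    lam = rng.gammavariate(a_lambda, 1.0 / b_lambda)
    psi = rng.gammavariate(k_psi / 2.0, 2.0 / W_psi)  # Wi(k_psi, W_psi) for p = 1

    components = []
    remaining = 1.0                                # stick length not yet assigned
    for _ in range(n_components):
        stick = rng.betavariate(1.0, nu)
        weight = remaining * stick
        remaining *= 1.0 - stick
        T = rng.gammavariate(k_T / 2.0, 2.0 / psi)   # precision, Wi(k_T, psi) for p = 1
        mu = rng.gauss(0.0, (lam * T) ** -0.5)       # mean, precision lam * T
        components.append((weight, mu, T))
    return components


components = draw_prior_components()
total_weight = sum(w for w, _, _ in components)
```

Plotting the normal densities with these means and precisions gives a rough picture of the kind of component collection described above; the truncated weights sum to slightly less than one because the remaining stick length is discarded.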
As a result of specifying a prior on the mixture components, we have also specified a prior for the infinite mixture of these components. In Figure 2, we show 20 prior predictive densities, 10 in each plot. Though the majority of these curves are centered near 0, we do see densities centered outside of [−1, 1]. Moreover, the sample includes skewed, multimodal, heavy-tailed, and sharply peaked densities. This permits the model to accommodate many data distributions and shows that, though we expect the transformed data to be centered at 0 with unit scale, the LIO prior does not strictly enforce these conditions.

Examples
In the first example, we test this prior with 200 points generated from a univariate standard Cauchy distribution. In Figure 3, we see the estimates and 95% pointwise credible intervals (CI) for the density of this distribution along with the true density curve. A rug plot is included; 25 points fell outside the range $[-6, 6]$ and are not shown. The credible intervals contain all of the true density, showing that this model performs well even with such "badly behaved" data. The plot also shows the density of a $t$ distribution with 2 degrees of freedom. The credible intervals exclude the $t_2$ density for the range $[-0.5, 1.0]$. This demonstrates that the Gaussian DPM with our LIO prior can adeptly estimate a Cauchy distribution and, furthermore, is sensitive enough to discriminate between Cauchy/$t_1$ and $t_2$ distributions. In this simulated example, we used the true median and 95th percentile of the Cauchy distribution as scaling input. Sensitivity to such choices is considered in Section 5.
The next example uses data from air quality measurements in New York, from May to September 1973, contained in the R dataset "airquality". We estimate the bivariate distribution of ozone and solar radiation levels from 111 pairs of measurements in this set. Figure 4 has a scatter plot of the data and the density estimate. The estimate appears to fit the data quite well. Because the ozone and radiation levels only take on positive values, however, some density is placed outside the possible range of values. Using a log transformation of the levels before fitting might give even better estimation while ensuring that all density is placed within the possible range of values. In the absence of external information, for illustrative purposes, we took the required scaling percentiles from the data.
Example 3 illustrates density estimation using 400 data vectors from a bivariate mixture distribution, $F = 0.5F_1 + 0.5F_2$. Here $F_1$ is the bivariate $t$ distribution with 5 degrees of freedom and an identity covariance matrix, while $F_2$ is a bivariate normal with mean $(2, 0)^{\top}$ and covariance matrix $\begin{pmatrix} 1/3 & 1/3 \\ 1/3 & 4/3 \end{pmatrix}$. Figure 5 shows four plots: a scatter plot of the data, contour plots of the true and estimated density of the mixture distribution, and a coverage plot. The density was estimated on a 127x127 grid of points. The coverage plot shows whether the true density falls within the 95% pointwise credible interval at each point in the grid, with white squares indicating coverage and red indicating noncoverage. The density estimate is quite similar to the true density. This is impressive, considering that the data's distribution is a mixture of a bivariate normal distribution with positive correlation of 0.5 and a more dispersed, uncorrelated $t$ distribution. Furthermore, the 95% CIs contain the true density at approximately 98% of the grid points.

Prior's Effect on Number of Clusters
Although the main focus of this article is to construct a widely usable low information prior for the purpose of density estimation, the DPM model has also been used for clustering observations in many applications. See, for example, Dorazio et al. (2008), Canale and Prünster (2017) and the references therein. It is well known (Antoniak, 1974) that the so-called total mass parameter ν in the DPM of Section 3.1 strongly controls the prior distribution of the number of components in a DPM. This prior distribution also depends on the sample size n. To mitigate the issue Escobar and West (1995) first recommended a prior on ν, a gamma prior. It has been standard practice since to use such a prior, and we have chosen this to be Ga(1, 1). Here we describe simulations intended to shed some light on how the posterior distribution of the number of components, using the LIO prior, responds to the data generating distribution and the sample size.
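The joint influence of $\nu$ and $n$ on the prior number of clusters can be made concrete with the standard result for the Dirichlet process, $E(K_n \mid \nu) = \sum_{i=1}^{n} \nu/(\nu + i - 1)$. The short sketch below evaluates this expectation for fixed $\nu$; it conditions on $\nu$ rather than integrating over the $Ga(1, 1)$ prior, and the function name is hypothetical.

```python
def expected_clusters(nu, n):
    """Prior mean number of distinct clusters among n draws, given nu."""
    return sum(nu / (nu + i) for i in range(n))


# Prior expected cluster counts for several concentrations and sample sizes.
table = {(nu, n): expected_clusters(nu, n)
         for nu in (0.1, 1.0, 10.0)
         for n in (100, 1000, 10000)}
```

For $\nu = 1$ the expectation is the harmonic number $H_n$, which grows only logarithmically in $n$, while larger $\nu$ shifts the whole prior toward more clusters.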
We explored dependence of the posterior distribution of the number of clusters under three choices of prior for $\nu$: $Ga(0.1, 0.1)$, $Ga(1, 1)$, and $Ga(10, 10)$, all having mean 1 and with variances 10, 1, and 0.1, respectively. Samples of sizes 100, 1000, and 10000 were simulated, 200 times each, from three possible distributions: $No(5, 1/4)$, a mixture of two normals ($0.8No(0, 4) + 0.2No(2, 25)$), and a standard Cauchy distribution, which can be viewed as an infinite mixture of normals. We expect to see larger numbers of clusters appear more frequently as the number of true normal mixture components of the data generating distribution increases. Figure 6 displays averaged distributions of the posterior number of clusters, obtained from histograms produced by retaining 10000 MCMC iterations after burn-in under each sample size, data distribution, and prior for $\nu$ considered. Under the $Ga(0.1, 0.1)$ and $Ga(1, 1)$ priors, this posterior distribution is quite responsive to the number of true mixture components of the data, with the bulk of the mass placed on 5 or fewer clusters for the normal and normal mixture, and much larger numbers of clusters being prevalent for the Cauchy distribution. Under the $Ga(10, 10)$ prior for $\nu$, the posterior distribution of the number of components is much more similar across data distributions.
While the results seem to indicate reasonable behavior with the recommended prior, we caution that for posterior number of clusters these are early investigations, and more work is warranted. Other possibilities, using different models, may have better promise as mentioned in the last paragraph of the Discussion section. On the other hand, we are confident in our recommendation of the prior for inference on functions such as density, cumulative distribution and hazard (Figure 1 in Supplementary Material (Shi et al., 2018)) with the LIO prior parameter settings.

DPM of Weibull Distributions
The proposed prior here is designed for the model of Kottas (2006). When both parameters of the Weibull distribution are given a flexible DP prior, this model approximates arbitrarily closely any distribution on the positive real line. The model is especially convenient for time-to-event data as the Weibull distribution offers simple mathematical expressions for the survival, hazard, cumulative hazard and density functions. Moreover, likelihood expressions for right, left and interval censored data remain tractable. After establishing notation for the model, we construct a LIO prior for it. Although the details apply only to the DPM of Weibulls, we note that the method of construction can be adapted to any DPM model with kernel family closed under scale change; for example, the Gamma family.

Model Specification
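We begin with $y_1, \ldots, y_n$ denoting conditionally iid observations modeled with a DPM of Weibulls. As in the Gaussian case, the first step is to rescale the data to a convenient fixed scale. Using a contextually specified value $c$ for the 95th percentile of the data's underlying distribution, we make the transformation $z_i = 10y_i/c$. Then, generally following Kottas (2006), we fit this model:

$$
\begin{aligned}
z_i \mid \alpha_i, \lambda_i &\sim Weibull(\alpha_i, \lambda_i), \quad i = 1, \ldots, n,\\
(\alpha_i, \lambda_i) \mid G &\sim G,\\
G \mid \nu &\sim DP(\nu, G_0),\\
\nu &\sim Ga(a, b),
\end{aligned}
$$

with the base distribution $G_0$ as described next.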
The model here differs slightly from that in Kottas (2006) in one aspect: the form of $G_0$. The original model of Kottas (2006) uses $\lambda \sim Ga(\cdot, \cdot)$ and an independent Uniform-Pareto distributed $\alpha$, denoted $\alpha \sim UPar(a, b)$ and defined by $\alpha \mid \phi \sim U(0, \phi)$, $\phi \sim Pareto(a, b)$, with the density of $\phi$ given by $b a^b \phi^{-(b+1)} I_{(a,\infty)}(\phi)$, $a > 0$, $b > 0$. We use instead a bivariate prior for $(\alpha, \lambda)$ employing a product of two gammas with a restriction that keeps $G_0$'s support away from the origin through a choice of $f(\lambda)$ made in Section 4.2 below.
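As in Section 3, inference for quantities related to the original data $y_1, \ldots, y_n$ can be recovered from fitting the above model to $z_1, \ldots, z_n$ since $y_i = c z_i/10$; for example, the survival functions on the two scales satisfy $S_y(t) = S_z(10t/c)$.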
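The rescaling and the back-transformation of a survival curve can be sketched as follows. The Weibull survival function is taken here to be $S(t \mid \alpha, \lambda) = \exp(-\lambda t^{\alpha})$; this parameterization is an assumption inferred from the form of $f(\lambda)$ used in the next subsection. The function names and the two-component mixture in the usage lines are hypothetical.

```python
from math import exp


def rescale(y, c):
    """Map observations to the fixed scale, z = 10 y / c (c = elicited 95th percentile)."""
    return [10.0 * yi / c for yi in y]


def weibull_survival(t, alpha, lam):
    """Assumed parameterization: S(t) = exp(-lam * t**alpha)."""
    return exp(-lam * t ** alpha)


def survival_original_scale(t, c, components):
    """Evaluate a fixed-scale Weibull mixture at original-scale time t.

    components is a list of (weight, alpha, lam) on the z scale; the
    identity S_y(t) = S_z(10 t / c) carries the curve back.
    """
    z = 10.0 * t / c
    return sum(w * weibull_survival(z, a, lam) for w, a, lam in components)


# Hypothetical posterior draw of a two-component mixture on the fixed scale.
draw = [(0.7, 1.5, 0.05), (0.3, 3.0, 0.001)]
z_obs = rescale([5.0, 12.5], c=50.0)
s_at_20 = survival_original_scale(20.0, c=50.0, components=draw)
```

Applying `survival_original_scale` to each posterior draw and summarizing pointwise would give estimates and credible bands on the original time axis.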

Hyperparameter selection
The approach here is distinct from that for a mixture of Gaussians, where we used Chebyshev's inequality and some expectation arguments. Here we work more directly with Weibull distributed mixture components that are deemed desirable given our low-information goals on the pre-fixed data scale. We generate $(\alpha, \lambda)$ pairs corresponding to such components, inspect these visually, and use heuristics to find parameter specifications that generate similar collections. Details of the process follow.
As two distinct percentiles determine $(\alpha, \lambda)$ for a Weibull distribution, we began by working with the 5th and 95th percentiles, denoted $t_1$ and $t_2$, respectively. We let $t_1$ range from 0.1 to 24.5 and $t_2$ from $t_1 + 0.5$ to 25, both by increments of 0.1. We also added a restriction, $t_1/t_2 < 0.95$, to avoid very spiky distributions. This generated the 29487 pairs $(\alpha, \lambda)$ plotted in the left half of Figure 8.
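The percentile grid described above can be enumerated directly. In the sketch below the grid is indexed in integer tenths so that the restriction $t_1/t_2 < 0.95$ is checked exactly, and each percentile pair is converted to $(\alpha, \lambda)$ under the assumed survival function $S(t) = \exp(-\lambda t^{\alpha})$; that parameterization is inferred from the form of $f(\lambda)$ in this section and is not stated explicitly in the text. The function name is hypothetical.

```python
from math import log


def percentile_grid():
    """Return (alpha, lam) for every admissible (5th, 95th) percentile pair."""
    pairs = []
    for k in range(1, 246):              # t1 = 0.1, ..., 24.5 (in tenths)
        for j in range(k + 5, 251):      # t2 = t1 + 0.5, ..., 25.0 (in tenths)
            if 20 * k >= 19 * j:         # keep only t1 / t2 < 0.95
                continue
            t1, t2 = k / 10.0, j / 10.0
            # Solve lam * t1**alpha = -log(0.95) and lam * t2**alpha = log(20).
            alpha = log(log(20.0) / -log(0.95)) / log(t2 / t1)
            lam = log(20.0) / t2 ** alpha
            pairs.append((alpha, lam))
    return pairs


pairs = percentile_grid()
n_pairs = len(pairs)
```

The number of grid points does not depend on the parameterization, so `n_pairs` can be compared directly with the count reported in the text.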
With the marginal of $\lambda$ in hand, the next task was to specify $\alpha_\alpha$ and $\lambda_\alpha$ in the prior for $\alpha$. In the model specification, the lower limit $f(\lambda)$ is intended to avoid near-zero values for both $\alpha$ and $\lambda$, as such values correspond to distributions that have an infinite spike at 0 yet assign substantial probabilities to large values. Since $z_1, \ldots, z_n$ are on a pre-fixed scale not greatly exceeding 10, restricting the 95th percentile to 25 or less is a reasonable specification. This leads to $f(\lambda) = \max(0, \log\{\log(20)/\lambda\}/\log(25))$. A trial and error process, with visual inspection of scatter plots of data generated under various combinations of $(\alpha_\alpha, \lambda_\alpha)$, resulted in the right half of Figure 8 with $\alpha_\alpha = 0.2$ and $\lambda_\alpha = 0.1$. This completes the hyperparameter selection we recommend for the LIO prior.
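The role of the lower limit $f(\lambda)$ can be checked numerically. The sketch below assumes the Weibull survival function $S(t) = \exp(-\lambda t^{\alpha})$, under which the 95th percentile is $\{\log(20)/\lambda\}^{1/\alpha}$; this parameterization is our inference from the formula for $f(\lambda)$, and the function names are hypothetical.

```python
from math import log


def alpha_lower_limit(lam):
    """f(lam) = max(0, log{log(20)/lam} / log(25)), as given in the text."""
    return max(0.0, log(log(20.0) / lam) / log(25.0))


def weibull_p95(alpha, lam):
    """95th percentile under the assumed survival function exp(-lam * t**alpha)."""
    return (log(20.0) / lam) ** (1.0 / alpha)


# At the boundary alpha = f(lam) the 95th percentile sits exactly at 25;
# any larger alpha pulls it below 25 whenever log(20)/lam exceeds 1.
lam = 0.01
boundary = alpha_lower_limit(lam)
p95_at_boundary = weibull_p95(boundary, lam)
p95_above = weibull_p95(boundary + 0.5, lam)
```

For $\lambda \ge \log(20)$ the limit is 0, since every positive $\alpha$ then already gives a 95th percentile of at most 1.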

Examples
In this section we present inference demonstrations using the LIO prior for survival, density and hazard functions with single datasets of 200 observations each, with 10% right censoring and 10% interval censoring, generated from a mixture of lognormal distributions, 0.8LN (0, 0.25) + 0.2LN (1.2, 0.02), which was used in Kottas (2006) (Figure 10). Figure 11 demonstrates the case of heavy right censoring as often occurs at end of study.

The Supplementary Material includes additional examples.
In all examples the specified 95th percentile equaled the true value. Blue lines are the estimates (solid lines) and 95% pointwise credible intervals (dashed lines) provided by the DPM of Weibulls model with the LIO prior. Red lines show survival, density and hazard functions from which observations were generated. Black lines in the survival plots are the NPMLE (nonparametric maximum likelihood estimate) (Turnbull, 1974) estimates and 95% pointwise confidence intervals for them. Figure 11 data generation consisted of 2000 observations, 95% right censored at 0.5, generated from the same mixture of log-normals as in the previous figure. It is interesting to see the credible intervals beyond 0.5 immediately reflecting the lack of information there. In practice, elicitation of the 95th percentile may be challenging in the presence of heavy right censoring at the largest observed time $t_{max}$. We recommend eliciting the survival probability $q$ at $t_{max}$ and using the prior mean survival (solid black line in Figure 9) to find $t_q$ such that $E(S(t_q)) = q$. Then specify the 95th percentile as $10t_{max}/t_q$.

Sensitivity Analysis and Comparison with Empirical Methods
The only information that the LIO prior requires from the investigator is a specification of the scale of the data's underlying distribution, obtained from the median and 95th percentile for the mixture-of-Gaussians model and the 95th percentile for the mixture-of-Weibulls model. A question of interest is how much any misspecification of the scale would affect the results. We address this question through simulations. In addition, we compare the performance of the two DPM models under their respective LIO priors with empirical methods.

Figure 12: Sensitivity to median misspecification, Gaussian DPM: bias in top row, rmse in bottom row for CDF at 9 deciles; Section 5.1.

Sensitivity Analysis
To evaluate sensitivity to specification of the median (95th percentile) we varied this input, setting it to the true 30th, 40th, 50th, 60th, 70th (75th, 90th, 95th, 99th and 99.9th) percentiles of the underlying distribution. We then studied the performance of the posterior mean CDF at 9 deciles of the underlying distribution. Thus the true values of the estimation targets are 0.1 to 0.9 by increments of 0.1. We randomly generated 200 datasets of 100 observations each from the following three distributions:
1. $t_2$: the standard $t$ distribution with 2 degrees of freedom, representing a distribution with tails heavier than those of the Gaussian;
2. lnorm: the lognormal distribution, $\exp[\mathrm{Normal}(2, 1)]$, representing a skewed distribution;
3. mixnorm: a mixture of two Gaussians, $0.5\,\mathrm{Normal}(0, 1^2) + 0.5\,\mathrm{Normal}(4, 1.5^2)$, representing a multimodal distribution.
In each figure, the columns correspond to the three data generating distributions. Horizontal axis markings in each plot indicate the percentile at which posterior CDF means were calculated. Different colors represent scaling input percentiles. An empirical estimate in black is also included as a benchmark. Bias is slightly worse and rmse slightly higher with misspecification of the median in the normal mixture case. Misspecification of the 95th percentile appears to be even less consequential. Overall, agreement with empirical estimates is reasonable.
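The empirical benchmark in this simulation design can be sketched for the lnorm case as follows. The code covers only the ECDF side of the comparison (no DPM fit), and the function name and seed are hypothetical; it repeats the stated design of 200 datasets of 100 observations and records bias and rmse of the ECDF at the 9 true deciles.

```python
import random
from math import exp, sqrt
from statistics import NormalDist


def ecdf_bias_rmse(n_datasets=200, n_obs=100, seed=1):
    """Bias and rmse of the ECDF at the true deciles of exp[Normal(2, 1)]."""
    rng = random.Random(seed)
    deciles = [k / 10.0 for k in range(1, 10)]
    # True deciles of the lognormal: exp(2 + z_q).
    cut_points = [exp(2.0 + NormalDist().inv_cdf(q)) for q in deciles]
    errors = [[] for _ in deciles]
    for _ in range(n_datasets):
        data = [rng.lognormvariate(2.0, 1.0) for _ in range(n_obs)]
        for idx, (q, t) in enumerate(zip(deciles, cut_points)):
            ecdf_at_t = sum(x <= t for x in data) / n_obs
            errors[idx].append(ecdf_at_t - q)
    bias = [sum(e) / len(e) for e in errors]
    rmse = [sqrt(sum(x * x for x in e) / len(e)) for e in errors]
    return deciles, bias, rmse


deciles, bias, rmse = ecdf_bias_rmse()
```

Since the ECDF at a fixed point is a binomial proportion, its rmse at the decile $q$ should be close to $\sqrt{q(1-q)/n}$, which gives a reference level against which the DPM results in the figures can be read.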

Comparison with Empirical Methods
Using the median and three specifications (90th, 95th, 99.9th) for the 95th percentile of the data generating distribution as input to the LIO prior, we compared the performance of the Gaussian DPM and the ECDF (empirical cumulative distribution function) for the three specified distributions. Here, we used 200 simulated datasets of 100 or 1000 observations each. Figure 14 shows bias and root mean squared error (rmse) at the deciles of the data's underlying distribution. For each distribution and 95th percentile specification, the plotted performance measures are averages of the corresponding quantities at the 9 deciles. On the horizontal axis we use "100D" and "100E" to denote respective results from the Gaussian DPM and ECDF on datasets of size 100; similarly, "1000D" and "1000E" show these for sets of size 1000. Unlike the previous figure, colors here represent data generating distributions. Plot symbol shapes indicate prior specifications.  The DPM with the LIO prior and the ECDF perform very similarly with respect to bias and rmse.
For the mixture-of-Weibulls model, we used three specifications (90th, 95th, 99.9th) for the 95th percentile of the data generating distribution and compared performance with an empirical method, again using the same 4 data generating distributions as in the examples of the previous section. To see the impact of censoring rate and sample size, we added scenarios with 50% censoring (25% right censoring, 25% interval censoring) and 1000 observations. In Figure 15, "S" on the x-axis denotes the NPMLE from the R package "survival", while "D" denotes the DPM-of-Weibulls model with the LIO prior. The numerals preceding these letters indicate the censoring rate, 20% or 50% (the figure for the latter is in the Supplementary Material). In each plot, the first 4 estimates are based on datasets with 100 observations, while the rest are based on datasets with 1000 observations. Again, the performance of the DPM is quite similar to that of the frequentist estimates in terms of bias and rmse.
A bivariate example is included in the Supplementary Material.

Convergence Considerations
Consistency of a Bayesian procedure is in a sense a frequentist validation of the procedure: for a nonparametric or semi-parametric Bayesian procedure, consistency implies convergence to a true unknown density as the number of observations goes to ∞. We use results from Ghosal et al. (1999), Tokdar (2006), Wu and Ghosal (2008) and Wu and Ghosal (2010) to show convergence properties under the two LIO priors of the previous sections.
Convergence of a density estimation procedure is measured in terms of the concentration of posterior probability in neighborhoods of the true unknown density. Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be the observed data for some integer $p \ge 1$. Let $\mathcal{F}$ denote the space of all densities on $\mathbb{R}^p$, and let $f_0$ be some density on $\mathbb{R}^p$. For any $\epsilon > 0$, denote by $N^w_\epsilon(f_0)$ and $N^s_\epsilon(f_0)$ the neighborhoods of $f_0$ under the weak and strong topologies, respectively. Let $P_f$ denote the probability measure corresponding to a density $f$, and let $P^\infty_f$ denote the probability measure of the infinite-dimensional random vector $(X_1, X_2, \ldots)$ whose components $X_i$ are iid with density $f$. We begin with the formal definitions of posterior consistency.

Definition 1.
A prior $\Pi$ is said to be weakly consistent at a density $f_0$ if, for any $\epsilon > 0$, the random variable $\Pi(N^w_\epsilon(f_0) \mid X_1, \ldots, X_n) \to 1$ as $n \to \infty$ almost surely with respect to the measure $P^\infty_{f_0}$.
Replacing the neighborhood under the weak topology with the neighborhood under the strong topology (also called the $L_1$ topology) gives the definition of strong consistency.

Definition 2.
A prior $\Pi$ is said to be strongly consistent at a density $f_0$ if, for any $\epsilon > 0$, the random variable $\Pi(N^s_\epsilon(f_0) \mid X_1, \ldots, X_n) \to 1$ as $n \to \infty$ almost surely with respect to the measure $P^\infty_{f_0}$.

Equivalence Results
To establish posterior consistency properties of the LIO prior, we first show that it suffices to study consistency on the scaled data.
Proof. In Supplementary Material.
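As an informal illustration of why consistency can be studied on the scaled data, suppose the rescaling takes the form $Y_i = (X_i - a)/b$ for fixed constants $a$ and $b > 0$. These symbols, and the starred notation for induced densities, are hypothetical and introduced here only for illustration; they need not match the notation of the Supplementary Material. A change of variables $x = a + by$ then shows that $L_1$ distances are preserved and that integrals of bounded continuous test functions $\varphi$ map to integrals of bounded continuous test functions, which is the intuition behind the equivalence of strong and weak neighborhoods on the two scales.

```latex
% Hypothetical notation: a (location shift) and b > 0 (scale) are illustrative constants;
% g is a generic density on the observed scale, and g^*, f_0^* are the induced densities of Y.
\begin{align*}
  g^{*}(y) &= b\, g(a + b y), \qquad f_0^{*}(y) = b\, f_0(a + b y), \\
  \int \lvert g^{*}(y) - f_0^{*}(y) \rvert \, dy
    &= \int b\, \lvert g(a + b y) - f_0(a + b y) \rvert \, dy
     = \int \lvert g(x) - f_0(x) \rvert \, dx, \\
  \int \varphi(y)\, g^{*}(y)\, dy
    &= \int \varphi\!\left(\frac{x - a}{b}\right) g(x)\, dx .
\end{align*}
```

This sketch is not a substitute for the proof; it only indicates why neighborhoods of $f_0$ and of the induced density correspond to one another under a fixed linear rescaling.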
Next we consider the class of densities at which consistency is shown. In the next lemma, we show that in addition to equivalence for posterior consistency, the regularity conditions and the density classes are also equivalent between the observed data and the scaled data.

Lemma 2. Let
be a linear rescaling of the observed data $\{X_i\}_{i=1}^n$ as previously stated, with induced densities and priors between them. The following conditions on the induced density of the rescaled data imply equivalent conditions on the density $f_0(x)$ of the observed data.
Proof. In Supplementary Material.
Earlier work in the literature (Walker, 2004; Choi and Schervish, 2007) contains slightly different regularity conditions on the true density $f_0$, for all of which equivalence can be shown. We omit a detailed description here for the sake of brevity.

Consistency Results on the Scaled Data
The LIO prior in this article is used for the following three scenarios:  Ghosal et al. (1999) and Wu and Ghosal (2008). The work in Wu and Ghosal (2008) is restricted to showing consistency at true densities with a finite second moment, which excludes some commonly used densities, e.g., the Cauchy. Tokdar (2006) significantly weakens the second moment condition, while adding regularity conditions on the base measure. For our item (1), Theorem 3.3 of Tokdar (2006) applies directly. This implies weak consistency of our procedure on a wide class of true densities, including the Cauchy density.
In the next lemma we show briefly that a similar weakening of the conditions is also possible for our item (2), since our base measure satisfies similar regularity conditions.
Proof. In Supplementary Material.
The proof of weak consistency for the multivariate case, our item (3), follows from Theorem 2 in Wu and Ghosal (2010). These results also do not permit true densities for which the second moment is not finite. It is possible to impose further conditions on the base measure, implying conditions on the eigenvalues of the covariance matrix, but this is fairly involved and does not follow from earlier results; a discussion is omitted here.
Strong consistency (also referred to as L 1 consistency) on a restricted class of densities as given by Theorem 3 in Wu and Ghosal (2010) applies directly to our scaled data procedure, and by virtue of our equivalence results, to the induced procedure on the observed data. Some weakening of the conditions of Theorem 3 is possible for admitting a broader class of true densities, once again by imposing strict decay conditions on the tails of the base measure, but further details are omitted here.

Discussion
We offer a technique for an omnibus low information prior specification that can handle data of various scales in a mixture-of-Gaussians model and a mixture-of-Weibulls model. Using data simulated from a variety of distributions, we demonstrated the effectiveness of these prior specifications. To implement the Gaussian DPM model with our prior, we have developed a wrapper for the DPdensity function of the R package DPpackage (Jara et al., 2011) that provides density estimation for scalar and vector-valued random samples. This is included in the R package DPWeibull (https://cran.r-project.org/web/packages/DPWeibull), which also includes functions for DPMs of Weibulls.
We illustrated this method of prior specification for DPMs of Gaussian and Weibull distributions. A similar approach can be used to obtain a LIO prior for a DPM of any location-scale family, such as t distributions. Additionally, a similar application could be used for mixtures of distributions from a family that, like the Weibulls, are closed under a change of scale; Gamma distributions are one such family.
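To make the closure property concrete, the following short derivation shows that a rescaled Weibull variable is again Weibull, and likewise for the Gamma family. The shape-scale parameterizations and the symbols $\alpha$, $\lambda$, $k$, $\theta$ and $c$ are assumptions made here for illustration and may differ from the parameterization used elsewhere in the paper.

```latex
% Assumed parameterization: T ~ Weibull(shape \alpha, scale \lambda) has survival
% function S_T(t) = \exp\{-(t/\lambda)^{\alpha}\} for t > 0; c > 0 is a scale change.
\begin{align*}
  P(cT > t) &= P\!\left(T > \tfrac{t}{c}\right)
             = \exp\!\left\{-\left(\frac{t}{c\lambda}\right)^{\alpha}\right\},
  \quad\text{so } cT \sim \mathrm{Weibull}(\alpha,\, c\lambda), \\
  f_{cT}(t) &= \frac{1}{c}\, f_T\!\left(\tfrac{t}{c}\right)
             = \frac{t^{k-1} e^{-t/(c\theta)}}{\Gamma(k)\,(c\theta)^{k}},
  \quad\text{so } cT \sim \mathrm{Gamma}(k,\, c\theta)
  \text{ when } T \sim \mathrm{Gamma}(k,\, \theta).
\end{align*}
```

In both cases only the scale parameter changes, so a posterior sample on the fixed scale can be mapped back to the original scale by multiplying the scale parameters by the same constant.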
The process of obtaining a low information prior only needs to be done once. It is designed to generate a vague but robust prior. The construction process could be based on moment arguments, as in the mixture-of-Gaussians case, or may require a more constructive effort with trial and error, as we described in the mixture-of-Weibulls case.
Similar to the dependent Dirichlet process (DDP) of Gaussian mixtures model of De Iorio et al. (2004), we have extended the DPM-of-Weibulls model to a DDP regression model for survival data that can directly model event times and address censoring as well as competing risks. This work is contained in the first author's recently completed dissertation at the Medical College of Wisconsin and will be published elsewhere.