Addressing extrema and censoring in pollutant and exposure data using mixture of normal distributions.

BACKGROUND
Volatile organic compounds (VOC), which include many hazardous chemicals, have been used extensively in personal, commercial and industrial products. Due to the variation in source emissions, differences in the settings and environmental conditions where exposures occur, and measurement issues, distributions of VOC concentrations can have multiple modes, heavy tails, and significant portions of data below the method detection limit (MDL). These issues challenge standard parametric distribution models needed to estimate the exposures, even after log-transformation of the data.


METHODS
This paper considers mixture of distributions that can be directly applied to concentration and exposure data. Two types of mixture distributions are considered: the traditional finite mixture of normal distributions, and a semi-parametric Dirichlet process mixture (DPM) of normal distributions. Both methods are implemented for a sample data set obtained from the Relationship between Indoor, Outdoor and Personal Air (RIOPA) study. Performance is assessed based on goodness-of-fit criteria that compare the closeness of the density estimates with the empirical density based on data. The goodness-of-fit for the proposed density estimation methods are evaluated by a comprehensive simulation study.


RESULTS
The finite mixture of normals and DPM of normals have superior performance when compared to the single normal distribution fitted to log-transformed exposure data. The advantages of using these mixture distributions are more pronounced when exposure data have heavy tails or a large fraction of data below the MDL. Distributions from the DPM provided slightly better fits than the finite mixture of normals. Additionally, the DPM method avoids certain convergence issues associated with the finite mixture of normals, and adaptively selects the number of components.


CONCLUSIONS
Compared to the finite mixture of normals, DPM of normals has advantages by characterizing uncertainty around the number of components, and by providing a formal assessment of uncertainty for all model parameters through the posterior distribution. The method adapts to a spectrum of departures from standard model assumptions and provides robust estimates of the exposure density even under censoring due to MDL.


INTRODUCTION
Volatile organic compounds (VOCs) have been used extensively in personal, commercial and industrial products (MDE, 2010;Linget al., 2011;Weschler, 2011;USEPA, 2012b), and these chemicals are widely found in air in indoor, outdoor and occupational settings. Many VOCs are hazardous, and exposure through inhalation has been associated with a variety of acute and chronic health effects, such as respiratory disease and cancer (Kim and Bernstein, 2009;USEPA, 2012a;b). While concentrations of VOCs in environmental settings are generally much lower than those in occupational settings (Rappaport and Kupper, 2004), moderate and sometimes high concentrations and exposures can be encountered among the general population during certain activities, such as filling vehicles with gasoline and home renovations, in hobbies such as furniture restoration, small engine repair and gun cleaning, and using cleaners, pesticides, pest repellants and air fresheners in poorly ventilated spaces (Battermanet al., 2006;Jiaet al., 2008a;D'Souzaet al., 2009;Jia and Batterman, 2010;USEPA, 2012b).
The high concentrations found for a portion of the population, along with the much lower concentrations for the bulk of the population, typically results in highly right-skewed concentration distributions (Jiaet al., 2008b). Extreme value theory and other techniques can model the upper percentiles of VOC concentration distributions, and generalized extreme value (GEV) distributions have been shown to fit VOC data much more closely than lognormal or other types of distributions (Jia, et al., 2008b;Battermanet al., 2011;Suet al., 2012). Most data sets also contain many low observations, often including measurements that fall below the method detection limit (MDL). These "non-detects," which represent leftcensored data, can be treated by substitution, single or multiple imputation, regression on order statistics (modeling using probability plots of known distributions to estimate summary statistics), and laboratory-generated data (using the original data without replacement) (Antweiler and Taylor, 2008). The extent of data below MDLs can significantly affect the quality of the results (Lubinet al., 2004;Antweiler and Taylor, 2008). The statistical issues associated with the analysis of data with MDL issues are well known (Tayloret al., 2001;Krishnamoorthyet al., 2009).
Due to the variation in source emissions, differences in the settings and environmental factors where exposures occur, and the measurement issues just noted, distributions of VOC concentrations can have multiple modes, heavy tails, and significant portions of data falling below the MDL that are replaced by a single value. These issues, which can be encountered in exposure and as well as other types of data sets, challenge standard parametric distribution models. While GEV distributions can fit the upper portions of distributions, they do not represent the full distribution of the data. Information on the full distributions of exposure levels is needed to establish exposure/risk guidelines, to estimate health risks and uncertainty estimates across a population (Su, et al., 2012), and to facilitate probabilistic analyses (Hammondset al., 1994).
Mixture distributions, which extend parametric families of distributions to fit datasets that are not adequately fit by a single common distribution, provide a flexible and powerful approach of representing the distribution of a random variable (Titteringtonet al., 1985;McLachlan and Basford, 1988;McLachlan and Peel, 2000). As examples, a finite mixture of normals applies a set of 'mixing weights' to a specified and finite number of component distributions, while a nonparametric Dirichlet process mixture (DPM) of normals relaxes the need to pre-specify the number of component distributions and is potentially advantageous in terms of handling smoothing, modality and uncertainty (Escobar, 1994;Mueller and Quintana, 2004). Mixture of normals have been extensively used in a variety of important and practical situations, although environmental applications have been very limited (Burmaster and Wilson, 2000;Razzaghi and Kodell, 2000;Taylor, et al., 2001;Chuet al., 2005). This paper evaluates the applicability of mixture of normal distribution method to environmental data, specifically, air pollution concentration and exposure data. Both the traditional finite mixture of normal and the nonparametric DPM of normals are evaluated using a VOC exposure dataset that includes seasonal measurements for approximately 300 individuals, which was collected as part of the Relationship between Indoor, Outdoor and Personal Air (RIOPA) study. Goodness-of-fit for the density estimation methods are evaluated by a comprehensive simulation study.

VOC measurements
The RIOPA study was designed to evaluate contributions of outdoor and indoor sources to personal exposures of air pollutants, including VOCs and PM 2.5 , among residents of three cities (Elizabeth, NJ, Houston, TX and Los Angeles, CA) selected to reflect potential differences in emissions and other factors likely to influence exposures (Weiselet al., 2005a). Sampling was conducted in two seasons for approximately 100 adults (and a smaller number of children) in each city from summer 1999 through spring 2001. Indoor, outdoor and personal (worn by participants) measurements were obtained using passive samplers for 48 hour periods, and 18 VOCs were measured using gas chromatography and mass spectrometry. Analytical work was performed by two laboratories. The RIOPA study represents one of the larger VOC studies in the USA that collected personal samples, which are generally considered to provide exposure estimates that are more accurate than indoor or outdoor samples.
Three VOCs (chloroform, 1,4-DCB and styrene) were selected to evaluate mixture distributions. These VOCs differ in terms of their distributions, detection frequencies and other properties. Personal samples for adults were selected, primarily because the sample size for the adult cohort (n = 544 for each VOC) was largest, and because the personal samples should best reflect exposure. The two laboratories used to analyze samples had different MDLs. Since the use of two laboratories is somewhat unusual, all data under MDLs were replaced with a single value using 0.5 × the higher MDL. Because the VOC data in RIOPA had many extreme values (Su, et al., 2012), the density estimation methods were implemented using logarithms of the concentration value, as described next.

Finite mixture of normal distributions
Finite mixture distributions are commonly used to identify and model sub-populations within an overall population. Rather than identifying the sub-population that an individual observation belongs to, these models assume that the observed data randomly arise from distributions with certain probabilities. Let Y = (Y 1 , …, Y n ) be a random sample of size n from the overall population with the probability density function of Y i given as f(y i ). Y is assumed to have arisen from a mixture of an initially specified number of distributions. A Kcomponent mixture of distributions supposes that the density of Y i can be written as (1) where f k is the component density of the k-th cluster, and λ k is the corresponding weight with the constraint that 0 ≤ λ k ≤ 1 and . In many applications, component densities f k are assumed to be standard parametric families, such as normal distribution , then (2) The finite mixture of normals represented by (2) is a potential choice for handling concentration and exposure data that can have multiple modes and heavy tails. Such normal mixtures are popular choices with attractive properties (Titterington, et al., 1985). Since the mixtures are constructed as a linear combination of normal distributions, they are computationally and analytically tractable, well behaved in the limiting case, and scalable to higher dimensions.
Mixture distributions can be fitted using many techniques, e.g., graphical methods, the method of moments, maximum likelihood estimation (MLE) and Bayesian approaches (Redner and Walker, 1984;Titterington, et al., 1985;McLachlan and Peel, 2000). Since closed forms of MLEs of (1) are not available, mixture distributions are commonly fitted using expectation maximization (EM) type algorithms (Dempsteret al., 1977;Meng and Pedlow, 1992;McLachlan and Krishnan, 1997). We used the EM algorithm and considered a constrained maximum likelihood method to estimate (2) with a further constraint that the location of the first cluster (generally the lowest) is under the MDL, i.e., μ 1 ≤ MDL. This constraint ensures that a fitted cluster covers the MDL, which allows it to be interpreted as the subpopulation of the data below the MDL.
An important issue in fitting finite mixture distributions is selection of the number of components K. Criteria based on penalized likelihood, such as the Akaike information criterion (AIC), have been applied successfully to mixture distributions (McLachlan and Peel, 2000). While this criterion generally favors larger K, there is considerable practical support for its use due to simplicity (Fraley and Raftery, 1998). The Bayesian information criterion (BIC) appears attractive due to their statistical properties as well as the simplicity of implementation. Though the BIC always leads to a smaller (or equal) number of components than AIC, the BIC can also lead to an overestimate of the number of clusters regardless the clusters' separation (Biernackiet al., 2000). In general, with limited amount of data, a corrected version of AIC such as AICc (Hurvich and Tsai, 1989) may be preferable. For these finite mixture distributions, we fitted model (2) with K=2 to 5 clusters, and selected the optimal model based on AICc. This analysis was conducted for each of the three VOCs.
As a benchmark for comparison, we also fitted the traditional normal distribution, which is a special case of mixture of normals with K=1. (As noted earlier, log-transformed VOC data were used in all cases.) The finite mixture of normals were implemented using the 'mixtools' package (Benagliaet al., 2009) in R (R Foundation for Statistical Computing, Vienna, Austria). This package fits the finite mixture of normals using EM algorithms through the function normalmixEM.

Dirichlet process mixture of normal distributions
Bayesian density estimation methods using Dirichlet process mixture (DPM) of normal densities have several practical advantages, including optimally trading off local versus global smoothing, assessing modality, and propagating uncertainty on inferences regarding the number of components and thus uncertainty about the density estimate (Ferguson, 1983;Escobar, 1994;Mueller and Quintana, 2004). Instead of pre-specifying the number of clusters, these models allow the number of clusters to be chosen in a data-adaptive way. Let and let . The DPM of normal distributions assumes that these normal parameters θ i follow a random distribution G generated from Dirichlet process (Ferguson, 1973), which can be represented as: (3) DP (α G 0 ) is a Dirichlet process with concentration parameter α and base distribution G 0 , which is also known as the prior expectation of G. The precision parameter α determines the concentration of the prior for G around G 0 . Blackwell and Macqueen provided the following representation for the leave-one-out conditional distributions (Blackwell and MacQueen, 1973): (4) In this approach, θ = (θ 1 ,…,θ n ) will be reduced to certain K distinct values (K < n) with positive probability. From (4), two well-known extreme cases of the DPM can be derived. As α → ∞, the DPM reduces to a parametric model, namely θ i ~ G 0 independent and identically distributed (n clusters), whereas α → 0 implies a common parametric model, namely θ 1 = ⋯ = θ n =θ* with θ* ~ G 0 (1 cluster). The baseline distribution G 0 is chosen to be the conjugate normal-inverse-gamma distribution. Hyperpriors could be used on this normal-inverse-gamma distribution to complete the model specification.
The DPM of normals does not require specification of the number of clusters as needed for parametric mixture distributions, such as the finite mixture of normals discussed previously. In practice, suitable values of K will typically be small relative to the sample size n. The implicit prior distribution on K is stochastically increasing with and is related to the prior distribution on α (Antoniak, 1974). For moderately large n, E(K|α, n) ≈ α log (1 + n/α) (Antoniak, 1974). A formal assessment of uncertainty regarding the number of components K can be obtained through generated draws from the posterior distribution of K as a part of the Bayesian computation scheme.
For the VOC data, the precision parameter α was chosen to follow a Gamma prior distribution, and a sensitivity analysis was conducted with respect to choice of the Gamma parameters. Given the sample size in the test dataset (n=544), for prior information, α G amma(0.3, 0.4) favors K=1-3 clusters; α ~ Gamma(1.2, 2.5) favors 1-5 clusters; α G amma(2, 1.5) favors 2-10 clusters; and α ~ Gamma(5, 2) favors 5-20 clusters. A sensitivity analysis was conducted on these prior specifications.
Computational methods were followed that allowed the evaluation of posterior distributions for all model parameters and the number of components, and also the resulting predictive distributions (Escobar and West, 1995). Density estimation using DPM can be directly implemented using the DP package (Jara, 2007;Jaraet al., 2011) in R (R Foundation for Statistical Computing, Vienna, Austria), which provides posterior draws of all model parameters under a DPM using Markov chain Monte Carlo methods. A sample code is presented in the Appendix.

Goodness of fit Criteria
Goodness of fit for the density estimation methods was determined by comparing the estimated cumulative distribution function (CDF) to the empirical CDF based on the observed data. Although all observed/generated data were used to estimate the CDF by each method, goodness of fit was evaluated using only the data above the MDL. Both the mean squared error ( ), and the mean absolute error were considered. The estimated proportion of observations above the MDL, which is often termed the detection frequency, for empirical and estimated distributions was compared.

Simulation study
For further evaluation of the mixture distributions, several forms of underlying true distributions and varying amounts of left censored data (below MDL) were considered as true generation models. Three methods were compared: a single normal distribution; a finite mixture of normals; and DPM of normals. Two underlying distributions with features similar to the three VOC samples from the RIOPA study were selected: a normal(0,2 2 ) and a mixture specified as . The former is symmetric and the latter is right-skewed with heavy tails, and both have multiple modes when data under MDL were replaced by 0.5 MDL. The proportion of data below the MDL, P 0 , was set to 15%, 30% and 50% in separate simulations. Goodness of fit measures (MSE and MAE described above) were calculated for each method, target distribution, and choice of P 0 . A dataset of size n=1000 was generated for each simulation under each setting. The average values of MSE and MAE across 500 simulations are reported.
For the finite mixture of normals, the number of components K was based on the smallest AICc. A convergence problem was encountered when P 0 was high (in the range of 30 to 50%), possibly because the censored data were set to a single value (0.5 MDL), which resulted in a very small variance of the first (lowest) cluster. Additionally, the MLE method for finite mixture models is susceptible to other problems, e.g., nonunique solutions (Redner and Walker, 1984;Titterington, et al., 1985;McLachlan and Peel, 2000). Thus, data below the MDL was replaced by uniformly generated pseudo-data from U(0, MDL) if the finite mixture of normals did not converge. In contrast, all of the single normal and DPM method simulations converged.

Descriptive analysis
The distributions of the sample data sets are shown as histograms in panel A of Figures 1, 2 and 3 for chloroform, 1,4-DCB and styrene, respectively. (Fitted density plots for the three density estimation methods are also shown in these figures.) These VOCs show distinct features. For chloroform, 17% of observations fell below the MDL, and observations exceeding the MDL were approximately lognormally distributed. For 1,4-DCB, a larger amount of data, 34%, was below the MDL and the remainder was highly right skewed even after log transformation and contained a number of extreme values. For styrene, most of the data, 66%, fell below the MDL and, like 1,4-DCB, was highly right skewed after log transformation with extreme values. These three VOCs were selected to demonstrate the sensitivity and performance of the mixture models across a broad range of levels of censoring.
The selected VOCs differ with respect to sources, exposures, and health effects. Chloroform is considered a probable human carcinogen (causing renal tumors) based on an adequate data of animal studies (USEPA, 2012a). Most inhalation, ingestion and dermal exposures result indoors through contact with chlorinated water, e.g., showering, drinking, and swimming, from which chloroform is released as a by-product of chlorine disinfection (ATSDR, 1997). The median chloroform indoor (0.94 μg m −3 ) and personal (1.04 μg m −3 ) concentrations in RIOPA considerably exceeded outdoors levels, which were mostly at the MDL (replaced with 0.21 μg m −3 ), thus outdoor levels of chloroform provided only a negligible contribution to personal exposure. (MDLs in the two laboratories used in RIOPA were 0.28 and 0.42 μg m −3 .) The unit risk factor (URF) for chloroform in air, 2.3 × 10 −5 (μg m −3 ) −1 , corresponds to a one-in-a-million cancer risk level for long term exposure (USEPA, 2012a). Of the personal samples collected in RIOPA, 9.6% exceeded a cancer risk level of 10 −4 , and the maximum chloroform level (46.5 μg m −3 ) corresponds to of 1.1 × 10 −3 . Notably, RIOPA participants did not wear their samplers during showering and swimming activities, and thus the highest RIOPA measurements are likely biased downwards from true exposures. The RIOPA measurements may be compared to personal exposure measurements in the National Health and Nutrition Examination Survey (NHANES), a nationally representative sample conducted about the same time period. NHANES showed a similar median chloroform exposure (1.1 μg m −3 ) as RIOPA (Jia, et al., 2008b;CDC, 2012). However, NHANES had slightly higher concentrations at the upper percentiles than RIOPA, e.g., 16% of individuals had estimated cancer risk exceeding 10 −4 , and the maximum exposure was 53.9 μg m −3 . VOC exposures in RIOPA and NHANES can differ for multiple reasons, including the use of different sampling strategies (convenience sampling in three cities for RIOPA versus national probability-based sampling for NHANES), demographics (RIOPA participants were older and more likely to be females), occupations (RIOPA participants had fewer VOC-related occupations), and employment (fewer employed in RIOPA).
Like 1,4-DCB, styrene is considered as a possible carcinogen (lymphatic, hematopoietic and pulmonary tumors) based on limited human and animal evidence (IARC, 2012). Styrene is widely used in industry as a raw material for plastic and rubber products, and is a component of cigarette smoke, vehicle exhaust and other combustion processes, and thus exposure occurs via inhalation from styrene-contained products in many settings, near certain industrial facilities, in traffic and near smokers (ATSDR, 2010). Because most data fell below the MDL (0.34 and 0.84 μg m −3 ), personal, indoor and outdoor measurements in RIOPA had the same median (0.42 μg m −3 ); 75th percentile values were 1.10, 1.07 and 0.42 μg m −3 , respectively. The URF for styrene in air is 2.0 × 10 −6 (μg m −3 ) −1 (Caldwellet al., 1998), and 0.2% of personal samples in RIOPA exceeded the risk level of 10 −4 . The maximum styrene concentration (59.5 μg m −3 ) gives a risk of 1.2 × 10 −4 . RIOPA excluded smokers and smoking households, and participants were predominantly women (75%) (Weiselet al., 2005b), thus, styrene levels may be lower than national norms. Personal styrene exposure was not collected in NHANES. Table 1 contrasts goodness of fit statistics for the three density estimation methods (normal, finite mixture of normals, DPM of normals) and the three VOCs from RIOPA (chloroform, 1,4-DCB and styrene). The table helps to identify situations where a particular method may be advantageous, as discussed below.

Single normal and GEV distributions
For chloroform, which is roughly lognormally distributed except that 17% of the data is under the MDL, the single normal distribution model fits about as well as the finite mixture of normals and DPM of normals (described below) on the basis of MSE and MAE values, and gives a 21% probability of being below the MDL, similar to that observed (Table 1). However, for 1,4-DCB and styrene, which have more data under the MDL as well as heavy tails, the fit of the single normal distribution model is inferior compared to those of the mixture models. For example, the predicted probability of being below MDL is 28% and 56% for 1,4-DCB and styrene, respectively, compared to 34% and 66% observed, and 33% and 64% estimated by the mixture models. The single normal distribution overestimated the mean of these VOCs since it underestimated the non-detection frequency.
Using the top 5 and 10% of concentrations as extrema, the RIOPA VOC data previously has been fitted to GEV and Gumbel distributions (Su, et al., 2012). However, distributional characteristics such as multimodality and left censoring are not captured by such analyses. For example, the finite mixture of normals discussed next show that at least two (chloroform) to four (1,4-DCB) components are needed to reflect the multiple modes, skewness and extreme values in the observed distributions.

Mixture of normals
Fitted density plots (and component clusters) are shown in Figures 1B, 2B and 3B for chloroform, 1,4-DCB and styrene, respectively. The fitted parameters (weight λ k , location μ k and dispersion σ k 2 ) of each cluster k for the mixture of normals are given in Table 2. The optimal Ks (based on the AICc) were 2, 4 and 3 for chloroform, 1,4-DCB and styrene, respectively. These choices of K clearly reflected the multi-modality and right-skewness of the VOC data, and the resulting mixture of normals closely fitted the observed distributions. For example, Figure 2B represents the four clusters that fitted the 1,4-DCB data: the first (red) cluster captured the left censoring due to the MDL, the second and third (green and blue) clusters reflected the majority of the data and the skewness, and the fourth (blue) cluster modeled the heavy tail.

DPM of normals
Fitted densities using DPM of normals for the three VOCs are shown in Figures 1C, 2C and 3C. This method clearly captures the censoring, right-skewness, and potential multimodality of the exposure data. In terms of MSE and MAE, the DPM approach attained slightly lower values than the finite mixture of normals (Table 1). Figures 1 to 3 show results of the sensitivity analysis with the four different gamma distributions used as priors for precision parameter α. As noted before, K stochastically increases with α as E(K|α, n) ≈ α log(1 + n/α) for moderately large n (Antoniak, 1974). The four prior distributions were informative and formed up to 20 clusters that reflected more specific subject matter information. Estimated densities obtained using the four priors nearly overlapped and showed very similar MSE and MAE for each of the VOCs, although the corresponding posterior distribution of the number of clusters K varied ( Table 3). The posterior mean of K under all prior settings of α (Table 3) slightly exceeded the K selected using the AICc ( Table 2). The higher K in the DPM is due to the prior information of α, and does not introduce any additional complexity or more model parameters. The initial prior variance of critically influences the extent of smoothing (Escobar and West, 1995). Given K distinct values among the elements of θ, a larger variance leads to increased dispersion among the K group means, which increases the likelihood of multiple modes and decreased smoothness in the resulting predictive distribution (Escobar and West, 1995). The general goals in selecting α and K to partition the data is to avoid over-smoothing and also excessive jaggedness. The prior distributions of α regarding the number of clusters K reflect a subjective assessment that balances these competing goals. Prior distributions might also reflect normative and objective representations of parameter values, although there is no consensus on the "best" way to elicit a subjective prior (Dey and Liu, 2007).

Panel D on
No convergence issues using the DPM method were encountered, and density estimation results were robust given the moderate sample size (n=544). Another advantage of the DPM method is that a constraint to ensure a cluster below MDL is not required since the sampling scheme (4) is data driven. As shown in (4), the DPM can handle values under the MDL that are represented as a point mass, because a newly sampled value has equal probability 1/(n -1 + α) to be drawn from the observed set of values.

Simulations
Simulation results, summarized in Table 4, show similar patterns for the MSE and MAE criteria. Both finite mixture and DPM of normals provided much better fits than a single normal distribution, except that the former two methods are only slightly better under distribution 1 with P 0 = 0.015. For both distributions, as the fraction P 0 of data below the MDL increased, there is evidence of increasing trend of lack of fit for a single normal distribution, while the finite mixture and DPM of normals fitted considerable better and without such trend. The DPM of normals shows advantage of robustness regarding P 0 . It fits equally well, or even better, as P 0 increased. For distribution 1, the finite mixture of normals provided a slightly better fit than the DPM of normals, but this trend can be offset since the prior variance of α can be decreased to promote smoothness. In this regard, DPM is much more flexible than the finite mixture of normal. Here, we have used α ~ Gamma(1.2, 2.5) which favors 1-5 clusters given our sample size (as the prior information of K). For distribution 2 which is right skewed and with a heavy tail, the DPM of normals provided a much better fit than finite mixture of normals under all settings.

DISCUSSION
Finite mixture of distributions have been used to address problems of classification, density estimation and pattern recognition problems across a wide range of areas (Titterington, et al., 1985;McLachlan and Basford, 1988;McLachlan and Peel, 2000). However, applications in the environmental literature have been limited. For human exposure and risk assessment, mixtures of lognormal distributions have been used to model concentrations of radon 222 in drinking water, and the model yielded a high-fidelity fit that was not achievable with any single parametric distribution (Burmaster and Wilson, 2000). In another risk assessment application, a mixture dose-response model was used to derive the upper confidence limits on risk and benchmark doses (Razzaghi and Kodell, 2000). A mixture of distributions of true zero exposures and log-normal distributions with left censoring was used to account for non-detects (Taylor et al. 2001). In a medical intervention that examined biomarkers of aflatoxin, finite mixture of distributions were selected using bootstrap-and cross-validation-based information criterion to portray bimodal biomarker distributions that had a significant fraction of measurements below MDL, and that also varied due to differences in exposure, metabolism, intervention effect and other factors (Chu, et al., 2005). To quantify variability and uncertainty in emission inventories, the use of a mixture of lognormal distributions showed a better fit and more efficient estimates of uncertainty to NO x emission data than the use of a single distribution (Zheng and Frey, 2004). This paper is one of the first demonstrations of mixture distribution models for environmental exposure data. No application of the DPM method specifically for pollutant concentration and exposure data was identified. Our analysis compared both finite mixture of normals and DPM of normals methods to parametric full distribution models and extreme value models, and included a sensitivity analysis of smoothing parameters. None of this information is available in the literature.
Our intent was to develop and characterize the performance of mixture models for fitting environmental exposures. The methods can be applied to other types of data, e.g., airborne concentrations, biomarkers, and pollutants other than VOCs. In exposure and risk applications, such models can improve the accuracy and realism in estimating cumulative exposure, cumulative risk, population attributable risk (PAR), and uncertainty (of exposure, risk or PAR) based on Monte Carlo simulations or other methods. For example, it would be straightforward and potentially informative to evaluate the differences in risk or numbers of individuals affected at a specific risk level using the cumulative distributions presented in the paper.
Finite mixture of distribution models continued to receive attention from both theoretical and practical points of view. In environmental applications, such as the description and analysis of air pollutant concentration and exposure data, the key advantage of this class of parametric mixture of distributions over a single parametric distribution model is their ability to simultaneously portray the multimodal nature of exposure data, the heavy tails, and observations around MDLs. These models also have advantages over extreme value distributions (e.g., Gumbel or GEV) since the entire distribution is fitted (not just the tail), and it is not necessary to select a cut-off to define extrema. Information on the full distributions of exposure levels can be used to estimate health risks and uncertainty estimates across a population (Su, et al., 2012) and facilitate probabilistic analyses. Even if the goal is only to predict extreme values, DPM methods can provide closer fits to any empirical distribution than GEV models, although GEV models often, but not always, perform well in such applications, as demonstrated by Su et al. (2012) for NHANES data. Further, the use of a cluster to represent observations below the MDL is appealing for datasets where a fair fraction of data is believed to be at or near this level. Finally, the parameter estimates corresponding to this lowest cluster may be heuristically interpretable as the parameters corresponding to MDL and its uncertainty. Importantly, we have found that a constrained MLE was required in the presence of data censoring due to the MDL, otherwise the estimated clusters may not contain the mode below the MDL.
Further work may be needed to develop consistent rules and guidance for finite mixture of distributions that address the number of components K, the selection of performance criteria, the effect of the estimation algorithm, use of constraints in the presence of data censoring, and the minimum sample size needed for analysis (Mendellet al., 1991). Such decisions can depend on the nature of the original data and the purpose of the analysis, e.g., focus on extrema or the entire distribution. We assessed performance in terms of the accuracy of point predictions above the MDL, but also compared the proportion of data points predicted to be below the MDL, thus reflecting datasets with censored data. Other techniques to select appropriate models include resampling information criteria (McLachlan, 1987;Chu, et al., 2005), likelihood methods (McLachlan, 1987;Chenet al., 2001), and Bayesian approaches that treat K as a parameter (Richardson and Green, 1997). Chu (2005) also suggests that the D-test via the L2 distance between competing models (Charnigo and Sun, 2004), which has a closed-form expression, is advantageous for selecting K as compared to likelihood statistics, which are nontrivial functions of both the parameter estimators and the full dataset.
We note that estimating the "true" numbers of clusters is not a necessary goal for practical applications of mixture of distributions. However, marginalizing over the distribution of K is critical for capturing the uncertainty in the density estimates. Rather than using some prespecified number of clusters, which is always a concern of the traditional mixture distribution problem, the DPM of normals provides a semi-parametric Bayesian alternative to non-parametric histograms and provides greater flexibility and precision in modeling the underlying characteristics of the sample data. This was clearly demonstrated in simulation results, where the performance of the DPM models was not substantially altered (and sometimes even improved) as the fraction of data below the MDL increased.
The nonparametric DPM of normal distributions assume that observed data randomly arise from sub-distributions with certain probabilities as the finite mixture of distribution models. (Again, sub-populations that an individual observation belongs are not identified.) Compared to the finite mixture models, DPM distributions have advantages in providing a formal assessment of uncertainty for all model parameters, including the number of components K, through generated draws from the posterior distribution. With a suitable Dirichlet process prior structure (Escobar and West, 1995), these models produce predictive distributions qualitatively similar to kernel techniques, and they allow for differing degrees of smoothing by the choice on priors for precision parameter α. The density estimation results were robust given a moderate sample size (n=544) without any convergence issues noted.
Both types of mixture models explored in this study are well suited to VOC data containing a large fraction of censored data due to MDLs, fat tails, and multiple modes. They offer clear advantages over parametric full distribution models and extreme value models, and also appear appropriate for many other types of environmental data, such as concentrations or doses of persistent and/or emerging compounds and biomarkers. The use of mixture models has the potential to improve the accuracy and realism of models used in a variety of exposure and risk applications, and further environmental applications are warranted.

CONCLUSIONS
Compared to the finite mixture of normals, DPM of normals has advantages by characterizing uncertainty around the number of components, and by providing a formal assessment of uncertainty for all model parameters through the posterior distribution. The method adapts to a spectrum of departures from standard model assumptions and provides robust estimates of the exposure density even under censoring due to MDL.

Highlights
Distributions of Volatile organic compounds concentrations can have multiple modes, heavy tails, and significant portions of data below the method detection limit.
The finite mixture of normals and Dirichlet process mixture of normals have superior performance when compared to the traditional single normal distribution.
Methods are implemented through the Relationship between Indoor, Outdoor and Personal Air study and through a simulation study.
Dirichlet process mixture of normals has advantages of robustness and of characterizing uncertainty for model parameters. Fitted density plots for chloroform (log scale) for three models: normal, finite mixture of normals (with the smallest AIC and components indicated), and Dirichlet process mixture (DPM) of normals. Vertical line shows MDL. Sensitivity of priors for DPM used, settings 1: α ~ Gamma(0.3, 0.4); settings 2: α ~ Gamma(1.25, 2.5); settings 3: α ~ Gamma(2, 1.5); settings 4: α~ Gamma(5, 2)..  Fitted density plots for styrene (log scale). Otherwise as Figure 1. Table 1 Goodness of fit statistics of each density estimation method for chloroform, 1,4-DCB and styrene sample data from the RIOPA study.  Atmos Environ (1994). Author manuscript; available in PMC 2014 October 01. Posterior distribution of the number of clusters K based on various prior settings of α as a sensitivity analysis.  Table 4 Summary of goodness of fit statistics of each density estimation method in the simulation study Distribution 1:Normal(0,2 2 ); Distribution 2: . Prior distribution on α is Gamma(1.2, 2.5).