Abstract
In risk evaluation, the effect of mixtures of environmental chemicals on a common adverse outcome is of interest. However, because of the high dimensionality and inherent correlations among chemicals that occur together, traditional methods (e.g., ordinary or logistic regression) suffer from collinearity and variance inflation, and shrinkage methods have limitations in selecting among correlated components. We propose a weighted quantile sum (WQS) approach to estimating a body burden index, which identifies “bad actors” in a set of highly correlated environmental chemicals. Through extensive simulation studies, we evaluate and characterize the accuracy of WQS regression in variable selection in terms of sensitivity and specificity (i.e., the ability of the WQS method to select the bad actors and to exclude the remaining components). We demonstrate the improvement in accuracy that this method provides over traditional ordinary regression and shrinkage methods (lasso, adaptive lasso, and elastic net). The simulation results show that WQS regression is accurate under some environmentally relevant conditions, but that, for a fixed correlation pattern, its accuracy decreases as the association with the response variable diminishes. Nonzero weights (i.e., weights exceeding a selection threshold parameter) may be used to identify bad actors; however, components within a cluster of highly correlated active components tend to have lower weights, with the sum of their weights representative of the set.
Supplementary materials accompanying this paper appear on-line.
References
Billionnet C, Sherrill D, Annesi-Maesano I; GERIE Study (2012). Estimating the health effects of exposure to multi-pollutant mixture. Annals of Epidemiology 22(2): 126-141.
Breiman L (1996). Stacked regressions. Machine Learning 24:49-64.
Brunekreef B. Exposure science, the exposome, and public health. Environmental and molecular mutagenesis. Feb 26 2013.
Buck Louis GM, Yeung E, Sundaram R, Laughon SK, Zhang C. The exposome–exciting opportunities for discoveries in reproductive and perinatal epidemiology. Paediatr Perinat Epidemiol. May 2013;27(3):229-236.
Center for Disease Control. National Health and Nutrition Examination Study. http://www.cdc.gov/nchs/nhanes.htm.
Christensen, KLY, Carrico, CK, Sanyal, AJ, Gennings, C (2013). Multiple classes of environmental chemicals are associated with liver disease: NHANES 2003-04. International Journal of Hygiene and Environmental Health. [epub March 8, 2013].
Colt J, Severson R, Lubin J, Rothman N, Camann D, Davis S, Cerhan JR, Cozen W, Hartge P. (2005). Organochlorines in carpet dust and non-Hodgkin lymphoma. Epidemiology 16(4): 516-525.
Dominici F, Peng RD, Barr CD, Bell ML. (2010) Protecting human health from air pollution: shifting from a single-pollutant to a multipollutant approach. Epidemiology. 21(2):187-194.
Ferguson KK, Loch-Caruso R, Meeker JD (2011) Exploration of oxidative stress and inflammatory markers in relation to urinary phthalate metabolites: NHANES 1999-2006. Environ Sci Technol. 2012 Jan 3;46(1):477-85. Epub 2011 Dec 1.
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J of Statistical Software, 33(1),1–22. [http://www.jstatsoft.org/v33/i01/]
Gennings C, Sabo RT, Carney E. (2010). Identifying subsets of complex mixtures most associated with complex diseases polychlorinated biphenyls and endometriosis as a case study. Epidemiology, 21, S77-S84.
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction, 2nd edn, Springer Series in Statistics.
Harville DA (1997). Matrix algebra from a statistician’s perspective. New York: Springer-Verlag.
Hoerl AE and Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1): 55-67.
Kim S, Kang S, Lee G, Lee S, Jo A, Kwak K, Kim D, Koh D, Kho YL, Kim S, Choi K (2014). Urinary phthalate metabolites among elementary school children of Korea: sources, risks, and their association with oxidative stress marker. Sci Total Environment, 472:49-55.
Leblanc M and Tibshirani R (1993). Combining estimates in regression and classification. J American Statistical Association, 91:1641-1650.
Mustapha BA, Blangiardo M, Briggs DJ, Hansell AL (2011). Traffic air pollution and other risk factors for respiratory illness in schoolchildren in the Niger-Delta region of Nigeria. Environ Health Perspect. 119:1478-1482.
Meinshausen N and Buhlmann P (2010) Stability selection. Journal of the Royal Statistical Society, 72, 417-473.
Nocedal J and Wright S (2006). Numerical optimization. New York: Springer.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. [http://www.R-project.org]
Rappaport SM, Smith MT (2010) Epidemiology. Environment and disease risks. Science. 330(6003):460-461.
Roberts S and Martin MA (2006) Investigating the mixture of air pollutants associated with adverse health outcomes. Atmospheric Environment 40(5):984-991.
SAS Institute Inc (2008). SAS 9.2 Help and Documentation. Cary, NC: SAS Institute Inc.
Schecter A, Lorber M, Guo Y, Wu Q, Yun SH, Kannan K, Hommel M, Imran N, Hynan LS, Cheng D, Colacino JA, Birnbaum LS (2013) Phthalate concentrations and dietary exposure from food purchased in New York State. Environ Health Perspect. 121(4):473-494.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Vol. 58, No. 1, pages 267-288.
Tu YK, Gunnell D, Gilthorpe MS (2008). Simpson’s Paradox, Lord’s Paradox, and Suppression Effects are the same phenomenon – the Reversal Paradox. Emerging Themes in Epidemiology, 5:2 [http://www.ete-online.com/content/5/1/2].
Vedal S, Kaufman JD (2011). What does multi-pollutant air pollution research mean? Am J Respir Crit Care Med. 183(1):4-6.
Wild CP (2005). Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 14(8):1847-1850.
——— (2012). The exposome: from concept to utility. International journal of epidemiology. 41(1):24-32.
Wormuth M, Scheringer M, Vollenweider M, Hungerbuhler K (2006) What are the sources of exposure to eight frequently used phthalic acid esters in Europeans? Risk Analysis, 26(3):803-824.
Zou H (2006). The adaptive lasso and its oracle properties. J American Statistical Association. 101:1418-1429.
Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2), 301-320.
Acknowledgments
The authors gratefully acknowledge support from #T32 ES0007334 and #UL1TR000058.
Appendix
1.1 Simulating Correlated Data
Our objective is to simulate normally distributed data \(N(\mathbf{M},\Sigma )\) with a given correlation structure for an outcome y and predictors \(x_{1}, x_{2}, \ldots , x_{c}\). Let \(\varvec{\rho }\) be the correlation matrix between and among y and the components in X, and let \(\Sigma \) be the corresponding covariance matrix, with diagonal values (variances) in the vector S and sample means in the vector m. To impose the correlation structure, we first use the standard relationship between the correlation and the variances, which yields:
\[\Sigma _{ij}=\rho _{ij}\sqrt{S_{i}S_{j}},\quad i,j=1,\ldots ,c+1.\]
Then follow the simulation steps below, where \(p=c+1\):
1) Calculate the Cholesky decomposition of \(\Sigma \) (of dimension \(p\times p\)), such that \(\Sigma =\mathbf{U}^{\prime }\mathbf{U}\), where \(\mathbf{U}\) is \(p\times p\) (see Harville (1997)).
2) Simulate \(\mathbf{Z}_{i}\sim N(\mathbf{0}_{p\times 1},\mathbf{I}_{p})\) for \(i=1,\ldots ,n\), and let \(\mathbf{Z}^{\prime }=[\mathbf{Z}_{1}\ \mathbf{Z}_{2}\ \ldots \ \mathbf{Z}_{n}]\), i.e., \(\mathbf{Z}\) is \(n\times p\) and each row is a draw from a \(p\)-variate standard normal distribution.
3) Let \(\mathbf{M}=(\mathbf{m}\mathbf{1}_{1\times n})^{\prime }\) and \(\mathbf{Y}_{n\times p}=\mathbf{M}_{n\times p}+\mathbf{Z}_{n\times p}\mathbf{U}_{p\times p}\). Then:
   a) \(\hbox {E}(\mathbf{Y})=\hbox {E}(\mathbf{M}+\mathbf{Z}\mathbf{U})=\mathbf{M}+\hbox {E}(\mathbf{Z})\mathbf{U}=\mathbf{M}\)
   b) \(\hbox {Var}(\mathbf{Y})=\hbox {Var}(\mathbf{M}+\mathbf{Z}\mathbf{U})=\hbox {Var}(\mathbf{M})+\hbox {Var}(\mathbf{Z}\mathbf{U})=\mathbf{0}+\mathbf{U}^{\prime }\hbox {Var}(\mathbf{Z})\mathbf{U}=\mathbf{U}^{\prime }\mathbf{U}=\Sigma \)
4) So, \(\mathbf{Y}\) is \(n\times p\) and has the distribution \(N_{p}(\mathbf{M},\Sigma )\).
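The simulation steps can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name `simulate_correlated`, its arguments, and the seed handling are assumptions introduced here. The covariance matrix is assembled from the correlation matrix and the variances in S as \(\Sigma _{ij}=\rho _{ij}\sqrt{S_{i}S_{j}}\). Note that `numpy.linalg.cholesky` returns the lower-triangular factor \(\mathbf{L}\) with \(\Sigma =\mathbf{L}\mathbf{L}^{\prime }\), so the upper factor used in the steps is \(\mathbf{U}=\mathbf{L}^{\prime }\).

```python
import numpy as np


def simulate_correlated(rho, s, m, n, seed=0):
    """Draw n rows whose distribution is N_p(m, Sigma).

    rho : (p, p) correlation matrix (hypothetical argument name)
    s   : length-p vector of variances (the diagonal of Sigma)
    m   : length-p vector of means
    """
    rho = np.asarray(rho, dtype=float)
    sd = np.sqrt(np.asarray(s, dtype=float))      # standard deviations
    sigma = rho * np.outer(sd, sd)                # Sigma_ij = rho_ij * sd_i * sd_j

    # Step 1: Cholesky factor. NumPy gives lower L with Sigma = L L',
    # so U = L' satisfies Sigma = U'U.
    U = np.linalg.cholesky(sigma).T

    # Step 2: Z is n x p with independent standard normal entries.
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, len(sd)))

    # Step 3: M repeats the mean vector in every row; Y = M + Z U.
    M = np.tile(np.asarray(m, dtype=float), (n, 1))
    Y = M + Z @ U
    return Y, sigma


# Example: an outcome plus two highly correlated predictors (made-up values).
rho = np.array([[1.0, 0.3, 0.3],
                [0.3, 1.0, 0.8],
                [0.3, 0.8, 1.0]])
Y, sigma = simulate_correlated(rho, s=[1.0, 4.0, 4.0], m=[0.0, 1.0, 2.0], n=20000)
print(np.round(np.corrcoef(Y, rowvar=False), 2))  # sample correlations, close to rho
```

With a large n, the sample correlation matrix of the simulated rows should be close to \(\varvec{\rho }\), which gives a quick check that the factor was applied on the correct side.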
In the first step, \(\Sigma \) must be positive definite in order to calculate \(\mathbf{U}\). In the relevant cases with highly correlated data, however, \(\Sigma \) may be nearly singular. In this case, we use matrix ridging to stabilize the matrix.
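The text does not state which ridging scheme is used, so the following is only one plausible sketch: add a small constant to the diagonal of \(\Sigma \) and increase it until the Cholesky factorization succeeds. The function name `ridged_cholesky` and the parameters `start`, `factor`, and `max_tries` are assumptions made for illustration.

```python
import numpy as np


def ridged_cholesky(sigma, start=1e-10, factor=10.0, max_tries=12):
    """Return (U, ridge) such that sigma + ridge * I = U'U.

    The ridge starts at 0 (no change) and is increased geometrically
    only if the factorization fails. This is an assumed scheme, not
    necessarily the one used in the paper.
    """
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    ridge = 0.0
    for _ in range(max_tries):
        try:
            # NumPy returns lower L with A = L L'; transpose to get U.
            L = np.linalg.cholesky(sigma + ridge * np.eye(p))
            return L.T, ridge
        except np.linalg.LinAlgError:
            ridge = start if ridge == 0.0 else ridge * factor
    raise np.linalg.LinAlgError("matrix is not positive definite even after ridging")


# Example: rounding has pushed an off-diagonal entry slightly past 1,
# so the matrix is no longer positive definite (made-up values).
near_singular = np.array([[1.0, 1.0 + 1e-6],
                          [1.0 + 1e-6, 1.0]])
U, ridge = ridged_cholesky(near_singular)
print(ridge)  # the smallest ridge tried that made the factorization succeed
```

Because the ridge inflates every variance by the same small amount, the implied correlations shrink slightly toward zero; keeping the ridge as small as possible limits that distortion.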
Cite this article
Carrico, C., Gennings, C., Wheeler, D.C. et al. Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting. JABES 20, 100–120 (2015). https://doi.org/10.1007/s13253-014-0180-3