Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting

Journal of Agricultural, Biological, and Environmental Statistics

Abstract

In risk evaluation, the effect of mixtures of environmental chemicals on a common adverse outcome is of interest. However, due to the high dimensionality and inherent correlations among chemicals that occur together, traditional methods (e.g., ordinary or logistic regression) suffer from collinearity and variance inflation, and shrinkage methods have limitations in selecting among correlated components. We propose a weighted quantile sum (WQS) approach to estimating a body burden index, which identifies “bad actors” in a set of highly correlated environmental chemicals. We evaluate and characterize the accuracy of WQS regression in variable selection through extensive simulation studies, measuring sensitivity and specificity (i.e., the ability of the WQS method to select the bad actors correctly and to exclude the other components). We demonstrate the improvement in accuracy this method provides over traditional ordinary regression and shrinkage methods (lasso, adaptive lasso, and elastic net). Results from simulations demonstrate that WQS regression is accurate under some environmentally relevant conditions, but its accuracy decreases for a fixed correlation pattern as the association with a response variable diminishes. Nonzero weights (i.e., weights exceeding a selection threshold parameter) may be used to identify bad actors; however, components within a cluster of highly correlated active components tend to have lower weights, with the sum of their weights representative of the set.

Supplementary materials accompanying this paper appear on-line.

References

  • Billionnet C, Sherrill D, Annesi-Maesano I; GERIE Study (2012). Estimating the health effects of exposure to multi-pollutant mixture. Annals of Epidemiology 22(2): 126-141.

  • Breiman L (1996). Stacked regressions. Machine Learning 24:49-64.

  • Brunekreef B. Exposure science, the exposome, and public health. Environmental and molecular mutagenesis. Feb 26 2013.

  • Buck Louis GM, Yeung E, Sundaram R, Laughon SK, Zhang C. The exposome–exciting opportunities for discoveries in reproductive and perinatal epidemiology. Paediatr Perinat Epidemiol. May 2013;27(3):229-236.

  • Centers for Disease Control and Prevention. National Health and Nutrition Examination Survey. http://www.cdc.gov/nchs/nhanes.htm.

  • Christensen, KLY, Carrico, CK, Sanyal, AJ, Gennings, C (2013). Multiple classes of environmental chemicals are associated with liver disease: NHANES 2003-04. International Journal of Hygiene and Environmental Health. [epub March 8, 2013].

  • Colt J, Severson R, Lubin J, Rothman N, Camann D, Davis S, Cerhan JR, Cozen W, Hartge P. (2005). Organochlorines in carpet dust and non-Hodgkin lymphoma. Epidemiology 16(4): 516-525.

  • Dominici F, Peng RD, Barr CD, Bell ML. (2010) Protecting human health from air pollution: shifting from a single-pollutant to a multipollutant approach. Epidemiology. 21(2):187-194.

  • Ferguson KK, Loch-Caruso R, Meeker JD (2011) Exploration of oxidative stress and inflammatory markers in relation to urinary phthalate metabolites: NHANES 1999-2006. Environ Sci Technol. 2012 Jan 3;46(1):477-85. Epub 2011 Dec 1.

  • Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J of Statistical Software, 33(1),1–22. [http://www.jstatsoft.org/v33/i01/]

  • Gennings C, Sabo RT, Carney E. (2010). Identifying subsets of complex mixtures most associated with complex diseases: polychlorinated biphenyls and endometriosis as a case study. Epidemiology, 21, S77-S84.

  • Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction, 2nd edn, Springer Series in Statistics.

  • Harville, DA (1997). Matrix algebra from a statistician’s perspective. New York: Springer-Verlag New York Inc.

  • Hoerl AE and Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1): 55-67.

  • Kim S, Kang S, Lee G, Lee S, Jo A, Kwak K, Kim D, Koh D, Kho YL, Kim S, Choi K (2014). Urinary phthalate metabolites among elementary school children of Korea: sources, risks, and their association with oxidative stress marker. Sci Total Environment, 472:49-55.

  • Leblanc M and Tibshirani R (1993). Combining estimates in regression and classification. J American Statistical Association, 91:1641-1650.

  • Mustapha BA, Blangiardo M, Briggs DJ, Hansell AL (2011). Traffic air pollution and other risk factors for respiratory illness in schoolchildren in the Niger-Delta region of Nigeria. Environ Health Perspect. 119:1478-1482.

  • Meinshausen N and Buhlmann P (2010) Stability selection. Journal of the Royal Statistical Society, 72, 417-473.

  • Nocedal J and Wright S (2006). Numerical optimization. New York: Springer.

  • R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. [http://www.R-project.org]

  • Rappaport SM, Smith MT (2010) Epidemiology. Environment and disease risks. Science. 330(6003):460-461.

  • Roberts S and Martin MA (2006) Investigating the mixture of air pollutants associated with adverse health outcomes. Atmospheric Environment 40(5):984-991.

  • SAS Institute Inc (2008). SAS 9.2 Help and Documentation. Cary, NC: SAS Institute Inc.

  • Schecter A, Lorber M, Guo Y, Wu Q, Yun SH, Kannan K, Hommel M, Imran N, Hynan LS, Cheng D, Colacino JA, Birnbaum LS (2013) Phthalate concentrations and dietary exposure from food purchased in New York State. Environ Health Perspect. 121(4):473-494.

  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Vol. 58, No. 1, pages 267-288.

  • Tu YK, Gunnell D, Gilthorpe MS (2008). Simpson’s Paradox, Lord’s Paradox, and Suppression Effects are the same phenomenon – the Reversal Paradox. Emerging Themes in Epidemiology, 5:2 [http://www.ete-online.com/content/5/1/2].

  • Vedal S, Kaufman JD (2011). What does multi-pollutant air pollution research mean? Am J Respir Crit Care Med. 183(1):4-6.

  • Wild CP (2005). Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 14(8):1847-1850.

  • ———   (2012). The exposome: from concept to utility. International journal of epidemiology. 41(1):24-32.

  • Wormuth M, Scheringer M, Vollenweider M, Hungerbuhler K (2006) What are the sources of exposure to eight frequently used phthalic acid esters in Europeans? Risk Analysis, 26(3):803-824.

  • Zou H (2006). The adaptive lasso and its oracle properties. J American Statistical Association. 101:1418-1429.

  • Zou H, Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

Acknowledgments

The authors gratefully acknowledge support from #T32 ES0007334 and #UL1TR000058.

Author information

Corresponding author

Correspondence to Chris Gennings.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (docx 52 KB)

Appendix

1.1 Simulating Correlated Data

Our objective is to simulate normally distributed data \(N(\mathbf{M}, \varvec{\Sigma })\) with a given correlation structure for an outcome \(y\) and predictors \(x_{1}, x_{2}, \ldots , x_{c}\). Let \(\varvec{\rho }\) be the correlation matrix between and among \(y\) and the components in \(\mathbf{X}\), and let \(\varvec{\Sigma }\) be the corresponding covariance matrix, with standard deviations in vector \(S\) (so that the diagonal of \(\varvec{\Sigma }\) holds the squared entries of \(S\)) and sample means in vector \(\mathbf{m}\). To impose the correlation structure, we first use the relationship between the correlation and the covariance, which yields:

$$\begin{aligned} \varvec{\Sigma }=\mathrm{diag}(S)*\varvec{\rho }*\mathrm{diag}(S) \end{aligned}$$
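
As a small numerical illustration of this relationship, the sketch below builds a covariance matrix from a correlation matrix and a vector of scales. It assumes that the entries of S act as standard deviations, which is what the formula requires for the diagonal of the result to hold variances. The function name and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np


def covariance_from_correlation(rho, sd):
    """Return diag(sd) @ rho @ diag(sd) for a correlation matrix rho."""
    rho = np.asarray(rho, dtype=float)
    sd = np.asarray(sd, dtype=float)
    d = np.diag(sd)  # diagonal matrix of standard deviations (assumed meaning of S)
    return d @ rho @ d


# Hypothetical 3 x 3 correlation matrix and standard deviations.
rho = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
sd = np.array([1.0, 2.0, 0.5])
sigma = covariance_from_correlation(rho, sd)
print(sigma)
```

Each off-diagonal entry of the result is the correlation scaled by the two corresponding standard deviations, and each diagonal entry is a variance.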

Then follow the simulation steps below, where \(p=c+1\):

  1. Calculate the Cholesky decomposition of \(\varvec{\Sigma }\) (dimension \(p\times p\)), such that \(\varvec{\Sigma }=\mathbf{U}^{\prime }\mathbf{U}\), where \(\mathbf{U}\) is a \(p\times p\) upper-triangular matrix (see Harville (1997)).

  2. Simulate \(\mathbf{Z}_{i}\sim N(\mathbf{0}_{p\times 1}, \mathbf{I}_{p})\) for \(i=1,\ldots ,n\), and let \(\mathbf{Z}^{\prime }=[\mathbf{Z}_{1}\; \mathbf{Z}_{2}\; \ldots \; \mathbf{Z}_{n}]\), i.e., \(\mathbf{Z}\) is \(n\times p\) and each row is a draw from the \(p\)-variate standard normal distribution.

  3. Let \(\mathbf{M}=(\mathbf{m}\mathbf{1}_{1\times n})^{\prime }\) and \(\mathbf{Y}_{n\times p}=\mathbf{M}_{n\times p}+\mathbf{Z}_{n\times p}\mathbf{U}_{p\times p}\). Then:

    a) \(E(\mathbf{Y})=E(\mathbf{M}+\mathbf{Z}\mathbf{U})=\mathbf{M}+E(\mathbf{Z})\mathbf{U}=\mathbf{M}\), since \(E(\mathbf{Z})=\mathbf{0}\).

    b) \(\mathrm{Var}(\mathbf{Y})=\mathrm{Var}(\mathbf{M}+\mathbf{Z}\mathbf{U})=\mathrm{Var}(\mathbf{M})+\mathrm{Var}(\mathbf{Z}\mathbf{U})=\mathbf{0}+\mathbf{U}^{\prime }\mathrm{Var}(\mathbf{Z})\mathbf{U}=\mathbf{U}^{\prime }\mathbf{U}=\varvec{\Sigma }\), where the variance is taken for each row of \(\mathbf{Y}\).

  4. So, \(\mathbf{Y}\) is \(n\times p\) and has the distribution \(N_{p}(\mathbf{M}, \varvec{\Sigma })\), i.e., each row is \(p\)-variate normal with mean \(\mathbf{m}\) and covariance \(\varvec{\Sigma }\).
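
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the seed, the sample size, and the example mean and covariance are all hypothetical. NumPy's `np.linalg.cholesky` returns a lower-triangular factor L with Σ = L L′, so the sketch transposes it to obtain the U with Σ = U′U used in step 1.

```python
import numpy as np


def simulate_correlated_normal(mean, sigma, n, rng):
    """Draw n rows from N_p(mean, sigma) via the Cholesky factor of sigma."""
    mean = np.asarray(mean, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = mean.shape[0]
    # Step 1: sigma = U'U, with U the transpose of NumPy's lower factor.
    upper = np.linalg.cholesky(sigma).T
    # Step 2: Z is n x p, each row a p-variate standard normal draw.
    z = rng.standard_normal((n, p))
    # Step 3: M repeats the mean vector in every row; Y = M + Z U.
    m = np.tile(mean, (n, 1))
    return m + z @ upper


# Hypothetical example: two variables with correlation 0.9 and unit variances.
rng = np.random.default_rng(0)
example_mean = np.array([0.0, 5.0])
example_sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
y = simulate_correlated_normal(example_mean, example_sigma, 50_000, rng)

print(y.mean(axis=0))           # should be close to example_mean
print(np.cov(y, rowvar=False))  # should be close to example_sigma
```

With a large n, the sample mean and sample covariance of the rows of `y` should approach the requested mean vector and covariance matrix, which mirrors the expectation and variance checks in step 3.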

In the first step, \(\varvec{\Sigma }\) must be positive definite for \(\mathbf{U}\) to be computed. In the relevant cases with highly correlated data, \(\varvec{\Sigma }\) may be nearly singular; in such cases, we use matrix ridging to stabilize the matrix.
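
The text does not specify how the ridging is carried out. One common form, shown here purely as an assumption, adds a small constant to the diagonal and increases it until the Cholesky decomposition succeeds. The function name, starting value, growth factor, and retry limit below are hypothetical choices for illustration.

```python
import numpy as np


def ridge_until_positive_definite(sigma, start=1e-8, factor=10.0, max_tries=12):
    """Add a growing constant to the diagonal until Cholesky succeeds.

    Returns the stabilized matrix and the ridge constant that was added.
    """
    sigma = np.asarray(sigma, dtype=float)
    identity = np.eye(sigma.shape[0])
    ridge = 0.0  # the first attempt uses the matrix unchanged
    for _ in range(max_tries + 1):
        candidate = sigma + ridge * identity
        try:
            # Raises LinAlgError when candidate is not positive definite.
            np.linalg.cholesky(candidate)
            return candidate, ridge
        except np.linalg.LinAlgError:
            ridge = start if ridge == 0.0 else ridge * factor
    raise ValueError("matrix could not be stabilized within max_tries")


# Hypothetical example: two perfectly correlated components (singular matrix).
singular = np.array([[1.0, 1.0], [1.0, 1.0]])
stabilized, ridge = ridge_until_positive_definite(singular)
print(ridge)
```

Because the ridge inflates the diagonal slightly, the implied correlations shrink a little toward zero; keeping the constant as small as possible limits that distortion.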

Cite this article

Carrico, C., Gennings, C., Wheeler, D.C. et al. Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting. JABES 20, 100–120 (2015). https://doi.org/10.1007/s13253-014-0180-3

  • DOI: https://doi.org/10.1007/s13253-014-0180-3
