A database of human exposomes and phenomes from the US National Health and Nutrition Examination Survey

The National Health and Nutrition Examination Survey (NHANES) is a population survey implemented by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States population; its data are publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file, merging across 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically (1) demographic information, (2) physical exam results (e.g., height, body mass index), (3) laboratory results (e.g., cholesterol, glucose, and environmental exposures), and (4) questionnaire items. Second, the Data Descriptor describes a dictionary to enable analysts to find variables by category and human-readable description. The datasets are available on DataDryad and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a new way to browse the dataset via a web browser (https://nhanes.hms.harvard.edu) and provide an application programming interface for programmatic access.

with the NHANES data, and (3) introduce the PIC-SURE enabled web application to browse and download the data through an 'application programming interface' (API). Further, we have provided a web video tutorial on the web application located here: https://vimeo.com/182576739. We emphasize our data descriptor is an introduction for use of the NHANES dataset and that all analyses must be verified with data from CDC/NHANES directly. Furthermore, we also emphasize that the derived variables we include were suitable for our own analyses of NHANES and may not be suitable for hypotheses specific to other investigators. Therefore, we include all raw variables in our integrated dataset for investigators.

Methods
National Health and Nutrition Examination Surveys (NHANES) data
NHANES datasets are publicly accessible through the United States Centers for Disease Control and Prevention (US CDC) [23][24][25][26]. All NHANES participants have consented for their information to be used in research. We downloaded the data files (Fig. 1a,b), which are hyperlinked to a CDC website, in January 2014. We chose to focus on these surveys as they had the greatest number of variables available at the time of download. We will make future instances of merged NHANES available via DataDryad with additional Data Descriptors.
Each participant of the NHANES has a unique identifier; in other words, there is no overlap in participants in the 1999-2000, 2001-2002, 2003-2004, and 2005-2006 surveys. In total, these 255 files contain information on 41,474 distinct individuals representative of the United States population and 1,191 unique variables.
Each .xpt-formatted data file consists of information structured in an 'N × M' form, in which each of the N rows is an individual and each of the M columns is a variable measured on that individual (Fig. 1a,b), together with a participant identifier (called 'SEQN'), the primary key that joins the data files together (shown as a gray column, Fig. 1a).
After downloading all 255 .xpt files, we executed a number of data processing steps. First, all .xpt files were converted into .csv files using the 'foreign' R package 27, preserving the original 'N × M' form of the data. Next, we created derived variables to ease potential downstream analyses, including (1) occupation (1 variable), (2) chronic disease (40 variables), and (3) pharmaceutical drug use (100 variables) (Fig. 1c).
We coded occupation as a variable whose levels correspond to (1) white-collar and professional jobs, (2) white-collar and semi-routine jobs (e.g., technicians), (3) blue-collar and high-skill jobs (e.g., mechanics, construction trades, and military), and (4) blue-collar and semi-routine jobs (e.g., personal services, farm workers), as described in our previous EWAS 28. Labor force participation was defined as working at a job or business or having a job or business within the last two weeks, not including work around the house.
We defined presence of 6 types of chronic diseases, including diabetes (1 variable), coronary disease (1 variable), hypertension (1 variable), asthma (1 variable), rheumatoid arthritis, osteoarthritis, and 30 site-specific cancers. We coded diabetes as present (as an integer 1) if the participant had a fasting blood glucose greater than 125 mg/dl (the American Diabetes Association [ADA] threshold for diabetes diagnosis) or if the participant answered 'yes' to the question, 'Other than during pregnancy, have you ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?'. If the participant had neither of those characteristics, he/she was coded as 0 (ref. 29). Similarly, we defined presence of hypertension as 1 if the participant had a systolic over diastolic blood pressure greater than 130 over 90 or answered 'yes' to the question, 'Have you ever been told by a doctor or other health professional that you had hypertension, also called high blood pressure' and 0 otherwise. We defined presence of coronary disease as 1 if the participant answered 'yes' to the question, 'Has a doctor or other health professional ever told you that you had coronary (kor-o-nare-ee) heart disease?' and 0 otherwise. The NHANES also contains coding for site-specific cancers. First, participants were asked whether a doctor has 'ever told you you have cancer?'. If the participant replied 'yes' to this question, a follow-up question was administered, 'what type of cancer do you have', and the participant could answer from a set of 27 cancers, such as breast, skin, lung, colon, bladder, kidney, and other types of cancer. We turned these into 27 separate variables that are coded 1 if the site-specific cancer is present and 0 otherwise.
Third, we extracted pharmaceutical drug use for each participant.
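The diabetes rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processing code: the function name and argument names are hypothetical, and the treatment of missing values (a missing measurement or answer is taken as "criterion not met") is an assumption that the text does not specify.

```python
from typing import Optional

# Fasting glucose threshold stated in the text (mg/dl).
ADA_FASTING_GLUCOSE_MG_DL = 125.0


def code_diabetes(fasting_glucose_mg_dl: Optional[float],
                  told_by_doctor: Optional[bool]) -> int:
    """Return 1 if either criterion is met, 0 otherwise.

    Assumption (not from the source): a missing glucose value or a
    missing questionnaire answer counts as 'criterion not met'.
    """
    high_glucose = (fasting_glucose_mg_dl is not None
                    and fasting_glucose_mg_dl > ADA_FASTING_GLUCOSE_MG_DL)
    self_report = told_by_doctor is True
    return 1 if (high_glucose or self_report) else 0


if __name__ == "__main__":
    print(code_diabetes(140.0, False))  # glucose above the threshold
    print(code_diabetes(100.0, True))   # self-reported diagnosis
    print(code_diabetes(100.0, False))  # neither criterion
```

Analysts who prefer to propagate missingness (for example, coding the variable as missing when both inputs are missing) would need to adapt the rule.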
The CDC used a Master Drug Database (MDDB), a proprietary but comprehensive database of all prescription and some nonprescription drug products available in the U.S. drug market. First, the CDC NHANES interviewer asked participants whether they had taken a drug in the past month, and if so, which drugs they were taking. The interviewer matched each drug to an MDDB identifier and drug description (e.g., METFORMIN or ALBUTEROL). Second, if the interview was occurring at the participant's home, the interviewer verified possession of the prescription drug container. Each participant could report taking more than one drug. There were 626, 668, 667, and 692 unique drugs found by the CDC interviewers in the 1999-2000, 2001-2002, 2003-2004, and 2005-2006 cohorts, respectively. To keep the merged data table (Fig. 1d) of tractable size, we chose to focus on the 100 drugs that were most prevalent in the population. We coded a participant as being on a drug if (1) they reported use of the drug and (2) the interviewer verified that the container was present.
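As a rough sketch of the drug-use derivation, the Python below selects the most prevalent drugs and builds 0/1 indicators per participant. The record layout (participant identifier, drug name, reported flag, container-verified flag) and both function names are hypothetical. Prevalence is computed here as an unweighted count of distinct participants meeting both criteria, which is an assumption; the text does not say how prevalence was ranked.

```python
from collections import Counter


def top_drugs(records, k=100):
    """Return the k drugs used by the most participants.

    records: iterable of (seqn, drug, reported, container_verified).
    Assumption: prevalence = unweighted number of distinct participants
    who reported the drug and had the container verified.
    """
    users = Counter()
    seen = set()
    for seqn, drug, reported, verified in records:
        if reported and verified and (seqn, drug) not in seen:
            seen.add((seqn, drug))
            users[drug] += 1
    return [drug for drug, _ in users.most_common(k)]


def drug_indicators(records, drugs):
    """Map each participant to {drug: 0 or 1} for the selected drugs."""
    on_drug = {}
    for seqn, drug, reported, verified in records:
        row = on_drug.setdefault(seqn, {d: 0 for d in drugs})
        if drug in row and reported and verified:
            row[drug] = 1
    return on_drug


if __name__ == "__main__":
    # Illustrative records only; not actual NHANES data.
    demo = [
        (1, "METFORMIN", True, True),
        (2, "METFORMIN", True, True),
        (2, "ALBUTEROL", True, True),
        (3, "ALBUTEROL", True, False),  # container not verified
    ]
    selected = top_drugs(demo, k=2)
    print(selected)
    print(drug_indicators(demo, selected))
```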
The CDC also ascertained cause and time of death (mortality) information for a subset of the participants in 2006 by linking eligible participants to the National Death Index. We incorporated these data into our data merge (n = 11,429 participants). The variables that describe the mortality information include ELIGSTAT.
Finally, we combined the 255 files into a single data file by merging on the participant identifier ('SEQN') (Fig. 1d). This merge resulted in one consolidated and analysis-ready data file representing a grand total of 1,191 variables on 41,474 participants.
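Because SEQN is unique across the four surveys, the merge amounts to a full outer join of all files on that key. The sketch below illustrates the idea with plain Python dictionaries; the function name and the in-memory table layout (SEQN mapped to a dictionary of variables) are assumptions for illustration and do not reproduce the R code used to build the dataset.

```python
def merge_by_seqn(tables):
    """Outer-join tables keyed by SEQN into one wide table.

    tables: iterable of dicts mapping SEQN -> {variable: value}.
    Variables absent for a participant are filled with None.
    """
    merged = {}
    columns = []          # preserves first-seen column order
    known = set()
    for table in tables:
        for seqn, row in table.items():
            merged.setdefault(seqn, {}).update(row)
            for col in row:
                if col not in known:
                    known.add(col)
                    columns.append(col)
    return {seqn: {col: row.get(col) for col in columns}
            for seqn, row in merged.items()}


if __name__ == "__main__":
    # Illustrative values only; not actual NHANES data.
    demographics = {1: {"RIDAGEYR": 34}, 2: {"RIDAGEYR": 61}}
    laboratory = {2: {"LBXGLU": 98.0}, 3: {"LBXGLU": 131.0}}
    for seqn, row in sorted(merge_by_seqn([demographics, laboratory]).items()):
        print(seqn, row)
```

An outer join keeps participants who appear in only some files, which matters here because not every participant was measured for every variable.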

Creation of a digital handbook: annotating and categorizing the NHANES datasets
The CDC NHANES have provided an .html-formatted codebook (e.g., https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Laboratory&CycleBeginYear=1999) that consists of the variable name (the column in the .xpt file) and a human-readable description of each variable. For example, the variables named RIDAGEYR and LBXGLU are described as 'Age in Years' and 'fasting serum glucose [mg dl−1]', respectively. These descriptions include the units of each variable, such as 'ug/mL' (inferred as a continuous variable) or 'positive'/'negative' (a binary variable).
We have extended the CDC NHANES data description methodology in the following ways (Fig. 1e) to facilitate analysis and data browsing. First, we have created a data dictionary that contains the name of the variable, a human-readable description of the variable, the 'module' a variable belongs to, and the survey in which the variable was measured (e.g., 1999-2000). Second, we have binned each variable into categories that offer more specificity than the CDC NHANES 'module' characterization. We make available the data dictionary (Fig. 1e) along with the data set (Data Citation 1). A summary of the number of variables per category, the median sample size for the variables in the category, and the demographic representation (percent female and race/ethnicity available for each variable) is given in Table 2. The entire data dictionary is available as Table 3 (available online only) and in Data Citation 1.
These categories aid in the filtering and querying of variables with common types, such as 'nutrients', 'body measures', 'pharmaceutical drug', 'viral infection', and 'pesticides'. Third, we have created a column that denotes the categorical levels for variables that are categorical or binary. For example, 'Are you a past, current, or never smoker?' is a variable that has three levels, 'never smoker', 'current smoker', and 'past smoker'; these levels are captured in a column called 'categorical levels'.
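To illustrate how such a dictionary supports filtering by category, here is a small Python sketch. The column names ('var', 'var_desc', 'module', 'category') and the sample rows are assumptions made for illustration; they are not copied from the released dictionary file, whose actual headers should be checked before use.

```python
import csv
import io

# Illustrative rows only; column names and values are assumptions.
SAMPLE_DICTIONARY = """var,var_desc,module,category
RIDAGEYR,Age in Years,Demographics,demographics
BMXBMI,Body mass index,Examination,body measures
BMXHT,Standing height,Examination,body measures
LBXGLU,Fasting glucose,Laboratory,biochemistry
"""


def variables_in_category(dictionary_file, category):
    """Return (variable name, description) pairs for one category."""
    reader = csv.DictReader(dictionary_file)
    return [(row["var"], row["var_desc"])
            for row in reader
            if row["category"] == category]


if __name__ == "__main__":
    handle = io.StringIO(SAMPLE_DICTIONARY)
    for name, description in variables_in_category(handle, "body measures"):
        print(name, "-", description)
```

With the real file, the `io.StringIO` handle would be replaced by an opened .csv file.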

Browsing and accessing the data through BD2K Patient-Centered Information Commons (PIC)
We leveraged the Patient-Centered Information Commons (PIC; for an overview, see http://pic-sure.org) platform to (1) enable interactive web browsing of the NHANES data (see https://nhanes.hms.harvard.edu) and (2) provide access to the data through an application programming interface (API). PICs are built using the i2b2/tranSMART software stack. Data are organized into a hierarchy resembling a directory structure to facilitate browsing (Figs 2 and 3). Raw data can also be queried using a drag-and-drop interface (Fig. 3). With the NHANES, we organized each of the 1,171 variables into a multi-level hierarchy that was ordered by module (i.e., 'Laboratory', 'Examination', 'Demographics', and 'Questionnaire') and category (e.g., 'pesticides', 'body measures'; see Table 2). To display this NHANES data hierarchy in our user interface, we created a metadata mapping file (https://github.com/hms-dbmi/public-data-deployments/blob/master/NHANES/nhanes_9906.map) and used this mapping file to integrate the data file.
The merged dataset ('MainTable') and data dictionary ('VarDescription') (Fig. 1d,e) are made available in DataDryad (Fig. 1f). A Usage Guide and .Rdata files are provided for download on GitHub (Fig. 1f). Finally, all data are browsable at https://nhanes.hms.harvard.edu. We have provided two additional resources for individuals to learn about the resource. The first is a tutorial of the web application, located at Vimeo (https://vimeo.com/182576739). This tutorial shows users how to count the number of variables and the number of participants (by age, sex, and race/ethnicity), which we believe will aid in planning analyses of the data. Second, we have built an online course (http://www.chiragjpgroup.org/exposome-analytics-course/) to guide users step-by-step through an investigation our group recently published (Patel et al., 2016). We plan to assess how frequently our data descriptor and data resources are being utilized by the scientific community through traditional means (e.g., number of citations to this descriptor), but also by counting the number of unique visitors to the Vimeo video website and the web application (http://nhanes.hms.harvard.edu), and through feedback on course materials.

Data Records
Data record 1: Integrated NHANES dataset and data dictionary in .csv format.
The integrated NHANES dataset and a data dictionary are available online at Dryad (Data Citation 1) as a .zip file that includes 3 .csv-formatted files. The first file ('data file') contains each individual (as rows) surveyed in 1999-2006 with all of their measurements (as columns) ('MainTable', Fig. 1d). The second file is a data dictionary that contains the name of each variable as represented in the data file, a human-readable description of the variable, the categories that the variable belongs to, and the levels of the variable (if it is categorical) (Fig. 1e). The third file is a dictionary specifically for demographic information, describing the columns for age, sex, race/ethnicity, whether the participant was born in the US, education level, income level, and mortality information. Also, to facilitate analyses using the R programming language, we have provided a 4th file that contains all the files described above as an R data object in .Rdata format.

Technical Validation
The raw data contained herein are from the CDC NHANES. The CDC NHANES have performed extensive technical validation of their data described elsewhere (e.g., refs 30,31).

Usage Notes
The NHANES utilizes a 'multistage survey sampled' study design to ensure that subgroups of the population (e.g., Blacks, Mexican-Americans, the elderly, pre-adolescents) are appropriately represented in the dataset 32 and to optimize sampling resources. Therefore, statistical analyses need to take the structure of the sampling into account to provide accurate population estimates, such as means, standard errors, and correlations 33.
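The effect of sampling weights on a point estimate can be illustrated with a weighted mean. The Python sketch below is a simplified illustration with a hypothetical function name and made-up numbers: it handles the weights only, and does not compute design-based standard errors, which additionally require the strata and primary sampling unit information and are best obtained from dedicated survey software.

```python
def weighted_mean(values, weights):
    """Weighted mean, skipping records with a missing value or weight."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight <= 0:
        raise ValueError("sum of weights must be positive")
    return sum(v * w for v, w in pairs) / total_weight


if __name__ == "__main__":
    # Illustrative numbers only: the second participant represents
    # three times as many people in the population as the first.
    measurements = [1.0, 3.0]
    sampling_weights = [1.0, 3.0]
    print("unweighted:", sum(measurements) / len(measurements))
    print("weighted:  ", weighted_mean(measurements, sampling_weights))
```

When oversampled subgroups differ systematically from the rest of the population, the unweighted and weighted estimates can diverge, which is why the weights should not be ignored.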
To demonstrate how to properly analyse NHANES data, we provide R markdown files in our GitHub repository (https://github.com/chiragjp/nhanes_scidata) to re-create several relevant analyses.

Conducting an 'environment-wide association analysis' in all-cause mortality in NHANES
Previously, we conducted a data-driven search of environmental exposure factors associated with all-cause mortality, known as an 'environment-wide association study' 28. In the guide (https://github.com/chiragjp/nhanes_scidata/blob/master/User_Guide.Rmd), we describe how to associate one of the top findings, serum cadmium, with all-cause mortality using survey-weighted Cox proportional hazards regression.

Distribution of serum lead in children: Accessing the NHANES in PIC-SURE API
In this guide (https://github.com/chiragjp/nhanes_scidata/blob/master/User_Guide_PIC.Rmd), we demonstrate how to access the NHANES data programmatically through the PIC-SURE API. In our example, we show how to query the API to estimate the quartiles of serum lead in the US population, both across all ages and among those aged under 18.
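Once the relevant columns have been retrieved, quartiles that respect the sampling weights can be estimated by inverting the weighted empirical distribution. The Python sketch below shows one such definition; it does not call the PIC-SURE API, the function name is hypothetical, the numbers are made up, and survey software may use a different interpolation rule and so return slightly different values.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    pairs = sorted((v, w) for v, w in zip(values, weights)
                   if v is not None and w is not None and w > 0)
    if not pairs:
        raise ValueError("no usable observations")
    total_weight = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total_weight:
            return value
    return pairs[-1][0]


if __name__ == "__main__":
    # Illustrative records only: (age in years, serum lead, weight).
    records = [(8, 1.0, 2.0), (15, 2.0, 1.0), (40, 3.0, 4.0), (67, 4.0, 1.0)]
    children = [r for r in records if r[0] < 18]
    for label, subset in (("all ages", records), ("under 18", children)):
        lead = [r[1] for r in subset]
        wts = [r[2] for r in subset]
        quartiles = [weighted_quantile(lead, wts, q) for q in (0.25, 0.5, 0.75)]
        print(label, quartiles)
```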

Redistributable analytics environment in Docker
The issue of reproducibility, replicability, and scalability in computational scientific research has been raised on multiple occasions 34,35. We promote reproducible practice by packaging the curated NHANES data (Data Citation 1) in a Docker container 38 together with an analytics environment comprising R-3.3.0 (ref. 36), the Rstudio-0.99.902 (ref. 37) web interface, and a custom R library for regression studies. The packaged environment is publicly available on Docker Hub (https://hub.docker.com/r/chiragjp/nhanes_scidata/) and can be consistently deployed across local or cloud-based environments. We have provided these materials as a hands-on short course available here: http://www.chiragjpgroup.org/exposome-analytics-course/