Doing Data Science: A Framework and Case Study

Today's data revolution is not just about big data; it is about data of all sizes and types. While the issues of volume and velocity presented by the ingestion of massive amounts of data remain prevalent, it is the rapidly developing challenges presented by the third v, variety, that demand more attention. The need for a comprehensive approach to discover, access, repurpose, and statistically integrate all the varieties of data is what led us to develop a data science framework that forms our foundation for doing data science. Unique features of this framework include problem identification, data discovery, data governance and ingestion, and ethics. A case study is used to illustrate the framework in action. We close with a discussion of the important role of data acumen.


Introduction
Data science is the quintessential translational research field that starts at the point of translation: the real problem to be solved. It involves many stakeholders and fields of practice and lends itself to team science. Data science has evolved into a powerful transdisciplinary endeavor. This article shares our development of a framework to build an understanding of what it means to just do data science.
We have learned how to do data science in a rather unique research environment within the University of Virginia's Biocomplexity Institute, one that is an intentional collection of statisticians and social and behavioral scientists with a common interest in channeling data science to improve the impact of decision making for the public good. Our data science approach to research is based on addressing real, very applied public policy problems. It is a research model that starts with translation by working directly with the communities or stakeholders and focusing on their problems. This results in a 'research pull' versus a 'research push' to lay the research foundation for data science. Research push is the traditional research paradigm; for example, research in biology and the life sciences moves from basic bench science to bedside practice. For data science, it is through working several problems in multiple domains that the synergies and overarching research needs emerge, hence a research pull.
Through our execution of multiple and diverse policy-focused case studies, synergies and research needs across the problem domains have surfaced. A data science framework has emerged and is presented in the remainder of this article, along with a case study to illustrate the steps. This data science framework warrants refining scientific practices around data ethics and data acumen (literacy). A short discussion of these topics concludes the article.

Data Science Framework
Conceptual models are being proposed for capturing the life cycle of data science, for example, Berkeley School of Information (2019) and Berman et al. (2018). A simple Google search of 'data science' brings forward pages and pages of images. These figures have overlapping features and nicely summarize several components of the data science process. We find it critical to go beyond the conceptual framing, and we have created a framework that can be operationalized for the actual practice of data science.
Our data science framework (see Figure 1) provides a comprehensive approach to data science problem solving and forms the foundation of our research (Keller, Korkmaz, Robbins, & Shipp, 2018; Keller, Lancaster, & Shipp, 2017). The process is rigorous, flexible, and iterative in that learning at each stage informs prior and subsequent stages. Four features of our framework deviate from other frameworks and will be described in some detail. First, we specify the problem to be addressed and keep it ever-present in the framework, hence grounding the data science research in a problem to be solved. Second, we undertake data discovery, the search for existing data sources, as a primary activity and not an afterthought. Third, governance and data ingestion play a critical role in building trust and establishing data-sharing protocols. Fourth, we actively connect data science ethics to all components of the framework.
In the following, we describe the components of the data science framework. Although the framework is described in a linear fashion, it is far from a linear process, as represented by the circular arrow that integrates the process. We also provide a case study example on youth obesity and physical activity in Fairfax County, Virginia, that walks through the components of the framework to demonstrate how a disciplined implementation of the steps taken to do data science ensures transparency and reproducibility of the research.

Problem Identification
Data science brings together disciplines and communities to conduct transdisciplinary research that provides new insights into current and future societal challenges (Berman et al., 2018). Data becomes a common language for communication across disciplines (Keller, 2007; Keller et al., 2017). The data science process starts with the identification of the problem. Identifying relevant theories and framing hypotheses are accomplished through traditional literature reviews, including review of the grey literature (e.g., government, industry, and nonprofit organization reports) to find best practices. Subject matter (domain) expertise also plays a role in translating the information acquired into an understanding of the underlying phenomena in the data (Box, Hunter, & Hunter, 1978). Domain knowledge provides the context to define, evaluate, and interpret the findings at each stage of the research (Leonelli, 2019; Snee, DeVeaux, & Hoerl, 2014).
Domain knowledge is critical to bringing data to bear on real problems. It can take many forms, from understanding the theory, the modeling, or the underlying changes observed in the data. For example, when we repurpose local administrative data for analyses, community leaders can explain underlying factors and trends in the data that may not be apparent without contextual knowledge.

Case Study Application-Problem Identification
The Health and Human Services (HHS) department of Fairfax County, Virginia, is interested in developing capacity for data-driven approaches to gain insights on current issues, such as youth obesity, by characterizing social and economic factors at the county and subcounty level and creating statistical models to inform policy options. Fairfax County is a large county (406 square miles) with a population of more than 1 million residents.

Data Discovery
Data discovery is the identification of potential data sources that could be related to the specific topic of interest. Data pipelines and associated tools typically start at the point of acquisition or ingestion of the data (Weber, 2018). A unique feature of our data science framework is to start the data pipeline with data discovery.
The goal of the data discovery process is to think broadly and imaginatively about all data, capturing the full potential variety of data (the third v of the data revolution) that could be useful for the problem at hand and literally assemble a list of these data sources.
An important component of doing data science is to first focus on massive repurposing of existing data in the conceptual development work. Data science methods provide opportunities to wrangle these data and bring them to bear on the research questions. In contrast to traditional research approaches, data science research allows researchers to explore all existing data sources before considering the design of new data collection.
The advantage of this approach is that data collection can be directly targeted at current gaps in knowledge and information.
Khan, Uddin, and Gupta (2014) address the importance of variety in data science sources. Even within the same type of data, for example, administrative data, the problem (research question) drives its use and the applicability of the information content to the issue being addressed. This level of variety drives what domain discoveries can be made ("Data Diversity," 2019). Borgman (2019) notes that data are human constructs.
Researchers and subject matter experts decide "what are data for a given purpose, how those data are to be interpreted, and what constitutes appropriate evidence." A similar perspective is that data are "relational," and their meaning relies on their history (how the data are born and evolve), their characteristics, and the interpretation of the data when analyzed (Leonelli, 2019).
Integrating data from disparate sources involves creating methods based on statistical principles that assess the usability of the data (United Nations Economic Commission for Europe, 2014, 2015). These integrated data sources provide the opportunity to observe the social condition and to answer questions that have been challenging to solve in the past. This highlights that the usefulness and applicability of the data vary depending on their use and domain. There are barriers to using repurposed data, which are often incomplete, challenging to access, not clean, and nonrepresentative. There may also exist restrictions on data access, data linkage, and redistribution that stem from the necessity of governance across multiple agencies and organizations. Finally, repurposed data may pose methodological issues in terms of inference, or creating knowledge from data, often in the form of statistical, computational, and theoretical models (Japec et al., 2015; Keller, Shipp, & Schroeder, 2016).
When confronted over and over with data discovery and repurposing tasks, it becomes imperative to understand how data are born. To do this, we have found it useful to define data in four categories: designed, administrative, opportunity, and procedural. These definitions are given in Table 1 (Keller et al., 2017; Keller et al., 2018). The expected benefits of data discovery and repurposing are the use of timely and frequently low-cost (existing) data, large samples, and geographic granularity. The outcomes are a richer source of data to support the problem solving and a better informed research plan. A caveat is the need to also weigh the costs of repurposing existing data against those of new data collection, questioning whether new experiments would provide faster and less biased results than finding and repurposing data. In our experience, the benefits of repurposing existing data sources often outweigh these costs and, more importantly, provide guidance on data gaps for cost-effective development of new data collection.
The typology of data (designed, administrative, opportunity, and procedural) provides a systematic way to think about possible data sources and a foundation for the data discovery steps. Data inventory is the process by which the data sources are first identified through brainstorming, searching, and snowballing processes (see Figure 2).
A short set of data inventory questions is used to assess the usefulness of the data sources for supporting the research objectives of a specific problem. The process is iterative, starting with the data inventory questions to assess whether a data source meets the basic criteria for the project with respect to the type of data, the recurring nature of the data, data availability for the time period needed, geographic granularity, and the unit of analysis required. If the data meet the basic criteria, then they undergo additional screening to document the provenance, purpose, frequency, gaps, use in research, and other uses of the data. We employ a 'data map' to help drive our data discovery process (see Figure 3). Throughout the course of the project, as new ideas and data sources are discovered, they are inventoried and screened for consideration.
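The basic-criteria screen described above can be sketched as a simple filter over an inventory of candidate sources. The fields and criteria below are illustrative assumptions, not the project's actual screening instrument:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One candidate source in the data inventory (fields are illustrative)."""
    name: str
    data_type: str          # 'designed', 'administrative', 'opportunity', 'procedural'
    years: set              # years of coverage
    geography: str          # finest geographic unit available, e.g., 'tract'
    unit_of_analysis: str   # e.g., 'person', 'household', 'establishment'

def meets_basic_criteria(src, needed_years, needed_geos, needed_units):
    """First-pass screen: time coverage, geographic granularity, unit of analysis.

    Sources that pass move on to deeper screening (provenance, purpose,
    frequency, gaps, prior research use).
    """
    return (needed_years <= src.years
            and src.geography in needed_geos
            and src.unit_of_analysis in needed_units)

# Example: screen one candidate source against hypothetical project needs.
acs = DataSource("American Community Survey", "designed",
                 years={2014, 2015, 2016}, geography="tract",
                 unit_of_analysis="household")
print(meets_basic_criteria(acs, {2015, 2016}, {"tract", "block"}, {"household"}))
```

Sources failing the screen are set aside rather than discarded, since later project iterations may revisit them.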
The acquisition process for existing data sources depends on the type and source of the data being accessed and includes downloading data, scraping the Web, acquiring data directly from a sponsor, or purchasing data from aggregators or other sources. It also includes the development and initiation of data-sharing agreements, as necessary.
Designed Data involve statistically designed data collections, such as surveys and experiments, and intentional data collections such as astronomical observations, remote sensing, and health registries.
Administrative Data are collected for the administration of an organization or program by entities, such as government agencies as they provide services, companies to track orders, and universities to record registered students.
Opportunity Data are derived from Internet-based information, such as websites and social media and captured through application programming interfaces (APIs) and Web scraping.
Procedural Data focus on processes and policies, such as a change in health care coverage or a data repository policy that outlines procedures and metadata required to store data.

Case Study Application-Data Discovery
The creation of a data map highlights the types of data we want to 'discover' for this project (see Figure 3). This is guided by a literature review and the Fairfax County subject matter experts who are part of the Community Learning Data Driven Discovery (CLD3) team for this project (Keller, Lancaster, & Shipp, 2017). The data map immediately captures the multiple units of analysis that will need to be integrated in the analysis. It also helps the team identify potential implicit biases and ethical considerations.
Data Inventory, Screening, and Acquisition.
The data map then guides our approach to identify, inventory, and screen the data. The data map highlights the types of data desired for the study and is used as a guide for data discovery. The lists comprise social determinants and physical infrastructure that could affect teen behaviors. The map also highlights the various units of analysis that will need to be captured and linked in the analyses: individuals, groups and networks of individuals, and geographic areas. We screened each data source to assess its relevance to this project, as follows.

For surveys and administrative data:
Are the data at the county or subcounty level? (Note: This question screened out several national sources of data that are not available at the geographic granularity needed for the study.)
What years are the data available, i.e., are they for the same years as the American Community Survey (ACS) and Fairfax Youth Survey?
Can we acquire and use the data in the timeframe of the project, e.g., March to September?

For place-based data:
Is an address provided?
Can the type of establishment be identified?
Can we acquire and use the data in the timeframe of the project?

Following the data discovery step, we identified and acquired survey, administrative, and place-based (opportunity) data to be used in this study. These are summarized in Table 2. The baseline data are from the ACS, which provides demographic and economic data at the census block and census tract levels. We characterize the housing and rental stock in Fairfax County through the use of property tax assessment administrative records. Geocoded place-based data are scraped from the Web and include locations of grocery stores, convenience stores, restaurants (full-service and fast food), recreation centers, and other opportunities for physical activity. We also acquired Fairfax County Youth Survey aggregates (at the high school boundary level) and Fairfax County Park Authority administrative data.

Table 2. Selected data sources.

Data Governance and Ingestion
Data governance is the establishment of, and adherence to, rules and procedures regarding data access, dissemination, and destruction. In our data science framework, access to and management of data sources is defined in consultation with the stakeholders and the university's institutional review board (IRB). Data ingestion is the process of bringing data into the data management platform(s).
Combining disparate data sources can raise issues around privacy and confidentiality, frequently arising from conflicting interests among researchers and sponsors working together. For clarity, privacy refers to the amount of personal information individuals allow others to access about themselves, and confidentiality is the process that data producers and researchers follow to keep individuals' data private (National Research Council, 2007).
For some, it becomes intoxicating to think about the massive amounts of individual data records that can be linked and integrated, leading to ideas about following behavioral patterns of specific individuals, such as what a social worker might want to do. This has led us to a data science guideline distinguishing between ensuring confidentiality of the data for research and policy analyses versus real-time activities such as casework (Keller et al., 2016). Casework requires identification of individuals and families for the data to be useful; policy analysis does not. For casework, information systems must be set up to ensure that only social workers have access to these private data and that approvals are granted for access. Our focus is policy analysis.
Data governance requires tools to identify, manage, interpret, and disseminate data (Leonelli, 2019). These tools are needed to facilitate decision making about different ways to handle and value data and to articulate conflicts among the data sources, shifting research priorities to consider not only publications but also data infrastructures and curation of data. Our best practices around data governance and ingestion are included in the training of all research team members and are also captured in formal data management plans.
Modified read-write data, or code that can generate the modified data, produced from the original data sources are stored back on a secure server and are accessible only via secured remote access. For projects involving protected information, unless special authorization is given, researchers do not have direct access to data files. For those projects, data access is mediated by the use of different data analysis tools hosted on our own secure servers that connect to the data server via authenticated protocols (Keller, Shipp, & Schroeder, 2016).

Case Study Application-Data Governance and Ingestion
Selected variables from data sources in Table 2 were profiled and cleaned (indicated by the asterisks).
Two unique sets of data requiring careful governance were discovered and included in the study. First is the Fairfax County Youth Survey, administered to 8th, 10th, and 12th graders every year. Access to these data requires adhering to specific governance requirements, which resulted in aggregate data being provided for each school. These data include information about time spent on activities (e.g., homework, physical activity, screen time); varieties of food eaten each week; family structure and other support; and information about risky behaviors, such as use of alcohol and drugs. Second, the Fairfax County Park Authority data include usage data at its nine recreation centers, including classes taken, services used, and location of the recreation center.

Data Wrangling
These next phases of executing the data science framework, the activities of data profiling to assess quality, preparation, linkage, and exploration, can easily consume the majority of a project's time and resources (Dasu & Johnson, 2003). Details of data wrangling are now readily available from many authors and are not repeated here (e.g., DeVeaux, Hoerl, & Snee, 2016; Wickham, 2014; Wing, 2019). Assessing the quality and representativeness of the data is an iterative and important part of data wrangling (Keller, Shipp, & Schroeder, 2016).

Fitness-for-Use Assessment
Fitness-for-use of data was introduced in the 1990s from a management and industry perspective (Wang & Strong, 1996) and was then extended to official statistics by Brackstone (1999). Fitness-for-use starts with assessing the constraints imposed on the data by the particular statistical methods that will be used and, if inferences are to be made, whether the data are representative of the population to which the inferences extend. This assessment extends from straightforward descriptive tabulations and visualizations to complex analyses.
Finally, fitness-for-use should characterize the information content in the results.

Case Study Application-Fitness-for-Use
After linking and exploring the data sources, a subset of data was selected for the fitness-for-use analyses to benchmark the data. We were unable to gain access to individual student-level data and to important health information (even in aggregate), such as body mass index (BMI, a combination of height and weight data). An implicit bias discussion across the team ensued, and given these limitations, decisions on which data would be carried forward into the analyses were guided by a refocusing of the project to characterize the social, economic, and behavioral features of the individual high schools, their attendance areas, and county political districts. These characterizations could be used to target new programming and policy development.

Statistical Modeling and Analyses
Statistics and statistical modeling are key for drawing robust conclusions using incomplete information (Adhikari & DeNero, 2019). Statistics provides consistent and clear-cut words and definitions for describing the relationship between observations and conclusions. The appropriate statistical analysis is a function of the research question, the intended use of the data to support the research hypothesis, and the assumptions required for a particular statistical method (Leek & Peng, 2015). Ethical dimensions include ensuring accountability, transparency, and lack of algorithmic bias.

Case Study Application-Statistical Modeling and Analyses
We used the place-based data to calculate and map distances between home and locations of interest by political districts and high school attendance areas.The data include the availability of physical activity opportunities and access to healthy and unhealthy food.Figure 4 gives an example of the distances from home to locations of fast food versus farmers markets within each political district.
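Distance computations of this kind can be sketched with the haversine (great-circle) formula; the coordinates below are hypothetical stand-ins, not the project's data:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical coordinates: a home and two food destinations in Fairfax County.
home = (38.846, -77.306)
fast_food = (38.851, -77.312)
farmers_market = (38.882, -77.171)

print(round(haversine_miles(*home, *fast_food), 2))       # nearby option
print(round(haversine_miles(*home, *farmers_market), 2))  # farther option
```

In practice, distances would be computed from geocoded addresses for every housing unit and destination, then aggregated to districts and attendance areas for mapping.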

Synthetic information methods.
Unlike the place-based data, the survey data do not directly align with the geographies of interest, e.g., 9 Supervisor Districts and 24 School Attendance Areas. To realign the data and the subsequent composite indicators to the relevant geographies, we used synthetic information technology to impute social and economic characteristics and attach them to housing and rental units across the county. Multiple sets of representative synthetic information about the Fairfax population were constructed based on iterative proportional fitting, allowing for estimation of margins of error (Beckman, Baggerly, & McKay, 1996). Some features of the synthetic data are an exact match to the ACS marginal tabulations, while others are generated statistically using survey data collected at varying levels of aggregation. Synthetic estimates across these multiple data sources can then be used to make inferences at resolutions not available in any single data source alone.
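A minimal sketch of iterative proportional fitting, the core step in this kind of synthetic population construction, is shown below. The toy seed table and margins are invented for illustration; the production method of Beckman et al. operates on full ACS tabulations:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, iters=100, tol=1e-10):
    """Iterative proportional fitting: rescale a seed cross-tabulation so its
    row and column sums match target margins (e.g., ACS tabulations)."""
    table = seed.astype(float).copy()
    for _ in range(iters):
        table *= (row_targets / table.sum(axis=1))[:, None]  # match row margins
        table *= (col_targets / table.sum(axis=0))[None, :]  # match column margins
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Toy example: household type (rows) by income bracket (columns).
seed = np.array([[40, 30], [20, 10]])   # sample cross-tabulation
row_targets = np.array([120.0, 80.0])   # known margin: household type
col_targets = np.array([90.0, 110.0])   # known margin: income bracket
fitted = ipf(seed, row_targets, col_targets)
print(fitted.sum(axis=1), fitted.sum(axis=0))
```

The fitted cell counts preserve the association structure of the seed while matching both sets of known margins; synthetic households are then drawn consistent with the fitted table.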

Creation of composite indicators.
Composite indicators are useful for combining data to create a proxy for a concept of interest, such as the relative economic position of vulnerable populations across the county (Atkinson, Cantillon, Marlier, & Nolan, 2002). Two composite indicators were created: the first to represent economically vulnerable populations and the second to represent schools that have a larger percentage of vulnerable students (see Figure 5). We defined the indicators as follows. Economic vulnerability is the statistical combination of four factors: the percentage of households with a housing burden greater than 50% of household income, with no vehicle, receiving Supplemental Nutrition Assistance Program (SNAP) benefits, and in poverty.
The high school vulnerability indicator is a statistical combination of the percentages of students enrolled in Limited English Proficiency programs, receiving free and reduced-price meals, on Medicaid, receiving Temporary Assistance for Needy Families, and with migrant or homelessness experiences.
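One common way to form such an indicator is to standardize each component rate across areas and average with equal weights. The sketch below illustrates this approach; the weighting scheme is an assumption (the study's actual statistical combination may differ) and the tract-level rates are invented:

```python
import statistics

def composite_indicator(rows, components):
    """Combine component rates into a composite score by z-scoring each
    component across areas and averaging (equal weights assumed)."""
    scores = {}
    for c in components:
        values = [r[c] for r in rows]
        mu, sd = statistics.mean(values), statistics.pstdev(values)
        for r in rows:
            scores.setdefault(r["area"], []).append((r[c] - mu) / sd)
    return {area: statistics.mean(zs) for area, zs in scores.items()}

# Hypothetical tract-level rates for the four economic-vulnerability components.
tracts = [
    {"area": "Tract A", "housing_burden": 0.30, "no_vehicle": 0.12,
     "snap": 0.18, "poverty": 0.15},
    {"area": "Tract B", "housing_burden": 0.10, "no_vehicle": 0.03,
     "snap": 0.05, "poverty": 0.04},
    {"area": "Tract C", "housing_burden": 0.22, "no_vehicle": 0.07,
     "snap": 0.11, "poverty": 0.09},
]
vulnerability = composite_indicator(
    tracts, ["housing_burden", "no_vehicle", "snap", "poverty"])
print(sorted(vulnerability, key=vulnerability.get, reverse=True))
```

Standardizing first keeps a component with large raw percentages (e.g., housing burden) from dominating components with small ones (e.g., no vehicle).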
Figure 6 presents correlations between factors that may affect obesity. The next phase of the analyses was to build statistical models that would give insights into the relationships between physical activity and healthy eating based on information from the Youth Surveys.
Based on the full suite of data, several machine learning models were used.

Fitness-for-use assessment revisited.
While we were asked to examine youth obesity, we did not have access to obesity data at the subcounty or student level. Nevertheless, we decided to move from descriptive analysis to more complex statistical modeling to assess whether the existing data could still provide useful results. First, we used random forests, a supervised machine learning method that builds multiple decision trees and merges them to obtain a more accurate and robust prediction. Our random forest models did not produce reasonable or statistically significant predictions. Next, we used LASSO (least absolute shrinkage and selection operator), a regression analysis method that performs both variable selection and regularization (the process of adding information) to enhance the prediction accuracy and interpretability of the resulting statistical model. However, the LASSO method consistently selected the model with zero predictors, suggesting that none are useful. Finally, we tried partial least squares regression, which, instead of using the original data, reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components; it performed best when no components were used, mirroring the LASSO result.
Our conclusion is that more complex statistical modeling does not provide additional information beyond the (still clearly useful) descriptive analysis. As noted below, BMI data and stakeholder input to identify the relative importance of composite indicator components are needed to extend the modeling.
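The zero-predictor finding can be illustrated with a simple cross-validation check: when predictors carry no information about the outcome, an intercept-only baseline typically matches or beats a model that uses them. Below is a sketch with simulated data (not the Fairfax data), comparing leave-one-out error for ordinary least squares against the intercept-only model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated school-level data: predictors unrelated to the outcome, mimicking
# the case where variable selection methods retain zero predictors.
n, p = 24, 5                                   # e.g., 24 school attendance areas
X = rng.normal(size=(n, p))
y = rng.normal(loc=50.0, scale=5.0, size=n)    # outcome independent of X

def loo_mse(X, y, use_predictors):
    """Leave-one-out mean squared error for OLS, with or without predictors."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        if use_predictors:
            A = np.column_stack([np.ones(mask.sum()), X[mask]])
            beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
            pred = np.concatenate([[1.0], X[i]]) @ beta
        else:
            pred = y[mask].mean()  # intercept-only baseline
        errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

print(round(loo_mse(X, y, use_predictors=False), 2),
      round(loo_mse(X, y, use_predictors=True), 2))
```

With pure-noise predictors, the baseline error is usually the smaller of the two, which is exactly the signal that led LASSO and partial least squares to discard all predictors and components.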

Communication and Dissemination
Communication involves sharing data, well-documented code, and working papers, and disseminating work through conference presentations, publications, and social media. These steps are critical to ensure that processes and findings are transparent, replicable, and reproducible (Berman et al., 2018). An important facet of this step is to tell the story of the analysis by conveying the context, purpose, and implications of the research and findings (Berinato, 2019; Wing, 2019). Visuals, case studies, and other supporting evidence reinforce the findings.
Communication and dissemination are also important for building and maintaining a community of practice. They can include dissemination through portals, databases, repositories, workshops, and conferences, and the creation of new journals (e.g., Harvard Data Science Review). Underlying communication and dissemination is preserving the privacy and ethical dimensions of the research.

Case Study Application-Communication and Dissemination
We summarized and presented our findings at each stage of the data science lifecycle, starting with the problem asked, through data discovery, profiling, exploratory analysis, fitness-for-use, and the statistical analysis. We provided new information to county officials about potential policy options and are continuing to explore how we might obtain data-sharing agreements for sensitive data, such as BMI.
The data used in this study are valuable for descriptive analyses, but the fitness-for-use assessment demonstrated that the statistical models require a finer level of resolution, that is, student-level data such as body mass index (BMI) or height and weight, to obtain better predictive measures. The exploratory analysis described earlier provided many useful insights for Fairfax County Health and Human Services about proximity to physical activity and healthy food options in each political district and high school attendance area. We encourage Fairfax County Health and Human Services to develop new data governance policies that allow researchers to access sensitive data while ensuring that the privacy and confidentiality of the data are maintained.
Until we can access BMI or height and weight data, we propose to seek stakeholder input to develop composite indicators, such as the economic vulnerability indicator described in this example.These composite indicators would inform stakeholders and decision makers about where at-risk populations live, and changes over time in how those populations are faring from various perspectives such as economic self-sufficiency, health, access to healthy food, and access to opportunities for physical activity.

Ethics Review
The ethics review provides a set of guiding principles to ensure dialogue on this topic throughout the lifecycle of the project. Because data science involves interdisciplinary teams, conversations around ethics can be challenging. Each discipline has its own set of research integrity norms and practices. To harmonize across these fields, data science ethics touches every component and step in the practice of data science, as shown in Figure 1. This is illustrated throughout the case study.
When acquiring and integrating data sources, ethical issues include considerations of mass surveillance, privacy, data sovereignty, and other potential consequences. Research integrity includes improving day-to-day research practices and ongoing training of all scientists to achieve "better record keeping, vetting experimental designs, techniques to reduce bias, rewards for rigorous research, and incentives for sharing data, code, and protocols-rather than narrow efforts to find and punish a few bad actors" ("Editorial: Nature Research Integrity," 2019, p. 5). Research integrity is advanced by implementing these practices throughout the entire research process, not just through the IRB process.
Salganik (2017) proposes a principles-based approach to ethics that includes standards and norms around the uses of data, analysis, and interpretation, similar to the steps associated with implementing a data science framework. Similarly, the "Community Principles on Ethical Data Sharing," formulated at a Bloomberg conference in 2017, are based on four principles: fairness, benefit, openness, and reliability (Data for Democracy, 2018). A systematic approach to implementing these principles is ensuring that scientific data are FAIR: 'Findable' using common search tools; 'Accessible' so that the data and metadata can be explored; 'Interoperable' to compare, integrate, and analyze; and 'Reusable' by other researchers or the public through the availability of metadata, code, and usage licenses (Stall et al., 2019).

Underlying the FAIR principles is the idea of giving credit for curating and sharing data and counting it as important as journal publication citations (Pierce, Dev, Statham, & Bierer, 2019). The FAIR movement has taken hold in some scientific disciplines where issues surrounding confidentiality or privacy are not as prevalent. The social sciences, on the other hand, face challenges in that data access is often restricted for these reasons. However, the aim should be to develop FAIR principles across all disciplines and adapt them as necessary. This requires creating repositories, infrastructures, and tools that make FAIR practices the norm rather than the exception at both the national and international levels (Stall et al., 2019).
Building on these principles, we have developed a Data Science Project Ethics Checklist (see the Appendix for an example).We find two things useful to do to instantiate ethics in every step of 'doing data science.'First, we require our to take IRB) and the Responsible Conduct of Research training classes.Second, for each project, we develop a checklist to implement an ethical review at each stage of research to address the following criteria: Creating the checklist is the first step for researchers to agree on a set of principles and serves as a reminder to have conversations throughout the project.This helps address the challenge of working with researchers from different disciplines and allow them to approach ethics through a variety of lenses.The Data Science Ethics Checklist given in the Appendix can be adapted to specific data science projects, with a focus on social science research.Responsible data science involves using a set of guiding principles and addressing the consequences across the data lifecycle.

Case Study Application-Ethics
Aspects of the ethics review, a continuous process, have been touched on in the earlier steps of the case study, specifically in the ethics examination of the methods used, including the choice of variables, the creation of synthetic populations, and the models used. In addition, our findings were scrutinized, vetted, and refined through internal discussions within the team and with our sponsors, Fairfax County officials, and external experts. The primary question asked throughout was whether we were introducing implicit bias into our research. We concurred that some of the findings had the potential to appear biased, such as the finding about levels of physical activity by race and ethnicity.
However, in this case, these findings would be important to school officials and political representatives.
Our ethics review is guided by a few working principles: balance simplicity with criteria sufficient to ensure ethical behavior and decisions; make ethical considerations and the discussion of implicit biases an active and continuous part of the project at each stage of the research; and seek expert help when ethical questions cannot be satisfactorily answered by the research team.

Being data literate is important for understanding why our intuition may often be wrong (Kahneman, 2011). We believe that building the data capacity and acumen of decision makers is an important facet of data science.

Conclusion
Without applications (problems), doing data science would not exist. Our data science framework and research processes are fundamentally tied to practical problem solving and can be used in diverse settings. We provide a case study of using local data to address questions raised by county officials. Contrasting examples that make formal use of the data science framework include applications to industry supply chain synchronization and to measuring the value and impact of open source software (Keller et al., 2018; Pires et al., 2017).
We have highlighted data discovery as a critical but often overlooked step in most data science frameworks.
Without data discovery, we would fall back on data sources that are merely convenient. Data discovery expands the power of data science by considering many new data sources, not only designed sources. We are also fostering new behaviors by adopting a principles-based approach to ethical considerations as a critical underlying feature throughout the data science lifecycle. Each step of the data science framework involves documentation of the decisions made, methods used, and findings, creating opportunities for data repurposing and reuse, sharing, and reproducibility.
Our data science framework provides a rigorous and repeatable, yet flexible, foundation for doing data science.
The framework can serve as a continually evolving roadmap for the field of data science as we work together to embrace the ever-changing data environment. It also highlights the need to support the development of data acumen among stakeholders, subject matter experts, and decision makers.
Acknowledgments: This work was supported in part by agreements with the US Department of Agriculture, National Agricultural Statistics Service; the U.S. Army Research Institute for the Behavioral and Social Sciences; and Fairfax County, Virginia.

Figure 1.
Figure 1. Data science framework. The data science framework starts with the research question, or problem identification, and continues through the following steps: data discovery (inventory, screening, and acquisition); data ingestion and governance; data wrangling (data profiling, data preparation and linkage, and data exploration); fitness-for-use assessment; statistical modeling and analyses; communication and dissemination of results; and ethics review.

Figure 2.
Figure 2. Data discovery filter. Data discovery is the open-ended and continuous process whereby candidate data sources are identified. Data inventory refers to the broadest, most far-reaching 'wish list' of information pertaining to the research questions. Data screening is the evaluative process by which eligible data sets are sifted from the larger pool of candidate data sets. Data acquisition is the process of obtaining the data from a sponsor, purchasing it, downloading it through an application programming interface (API), or scraping the web.
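The three filter stages in Figure 2 can be read as successively narrowing functions over a pool of candidate sources. The sketch below is schematic; the source records and eligibility fields are hypothetical stand-ins for the richer screening criteria a real project would use:

```python
# Schematic of the data discovery filter: inventory -> screening -> acquisition.
# Candidate records and their attributes are fabricated for illustration.
candidates = [
    {"name": "county_zoning", "relevant": True, "accessible": True},
    {"name": "school_survey", "relevant": True, "accessible": False},
    {"name": "old_census", "relevant": False, "accessible": True},
]


def inventory(sources):
    """Broad 'wish list': keep anything pertaining to the research questions."""
    return [s for s in sources if s["relevant"]]


def screen(sources):
    """Evaluative sift: keep only sources that pass eligibility checks."""
    return [s for s in sources if s["accessible"]]


def acquire(sources):
    """Acquisition: in practice, obtain from a sponsor, purchase, API, or scraping."""
    return [s["name"] for s in sources]


acquired = acquire(screen(inventory(candidates)))
```

Composing the stages as functions mirrors the funnel shape of the figure: each step consumes the output of the previous one and can be rerun as new candidate sources are discovered.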

Figure 3.
Figure 3. Data map. The data map highlights the types of data desired for the study and is used as a guide for data discovery. The lists cover the social determinants and physical infrastructure that could affect teen behaviors. The map highlights the various units of analysis that will need to be captured and linked in the analyses: individuals, groups and networks of individuals, and geographic areas.

Figure 4.
Figure 4. Exploratory analysis: direct aggregation of place-based data based on the location of housing units. The box plots show the distance from each housing unit to the nearest farmers market or fast food restaurant for each of the nine Fairfax County political districts. The takeaway is that people live closer to fast food restaurants than to farmers markets.
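The aggregation behind a figure like this amounts to computing, for each housing unit, the distance to its nearest amenity of each type and grouping those distances by district. The toy sketch below uses made-up planar coordinates and a single hypothetical district; a real analysis would use geodesic distances and the county's actual point data:

```python
# Toy version of the place-based aggregation: nearest-amenity distance per
# housing unit, grouped by district. All coordinates are fabricated.
import math


def nearest_distance(unit, amenities):
    """Euclidean distance from a housing unit to its closest amenity."""
    return min(math.dist(unit, a) for a in amenities)


housing_units = {"District A": [(0.0, 0.0), (1.0, 1.0)]}
fast_food = [(0.0, 1.0)]
farmers_markets = [(5.0, 5.0)]

by_district = {
    district: {
        "fast_food": [nearest_distance(u, fast_food) for u in units],
        "farmers_market": [nearest_distance(u, farmers_markets) for u in units],
    }
    for district, units in housing_units.items()
}
# In this toy data every unit is closer to fast food than to a farmers market,
# mirroring the takeaway reported for the actual county data.
```

The per-district distance lists produced here are exactly the kind of grouped samples that a box plot summarizes, one box per district and amenity type.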

Figure 5.
Figure 5. School and economic vulnerability indicators for Fairfax County, Virginia. Economic vulnerability indicators are mapped by the 24 high school attendance areas and shaded by color; the darker the color, the more vulnerable the area. The overlaid circles are high school vulnerability indicators geolocated at the high school locations; the larger the circle, the higher the vulnerability of the high school population.

Figure 6.
Figure 6. Correlations of factors that may affect obesity. The factors are level of physical activity (none, or 5+ days per week), food and drink consumed during the past week, unhealthy weight loss, and food insecurity. As an example, the bottom left-hand corner shows a positive correlation between no physical activity and food insecurity.
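The pairwise values in such a correlation matrix are ordinary Pearson correlations between indicator variables. The sketch below computes one such correlation from scratch on fabricated 0/1 survey responses (not the study's data), illustrating how a positive value arises when the two indicators tend to co-occur:

```python
# Pearson correlation between two binary indicators, on fabricated responses.
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# 1 = respondent reports no physical activity / food insecurity, 0 = otherwise.
# These eight responses are invented purely for illustration.
no_activity = [1, 1, 0, 0, 1, 0, 1, 0]
food_insecure = [1, 1, 0, 0, 1, 0, 0, 0]

r = pearson(no_activity, food_insecure)  # positive: indicators mostly co-occur
```

Because the two indicators agree on seven of the eight fabricated respondents, the coefficient comes out strongly positive, which is the qualitative pattern the figure reports for no physical activity and food insecurity.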
Ensure documentation, transparency, ongoing discussion, questioning, and constructive criticism throughout the project. Incorporate ethical guidelines from relevant professional societies (for examples, see ACM Committee on Professional Ethics (2018), American Physical Society (2019), and the Committee on Professional Ethics of the American Statistical Association (2018)).