Using large, publicly available data sets to study adolescent development: opportunities and challenges

Adolescence is a period of rapid change, with cognitive, mental wellbeing, environmental, and biological factors interacting to shape lifelong outcomes. Large, longitudinal, phenotypically rich data sets available for reuse (secondary data) have revolutionized the way we study adolescence, allowing the field to examine these unfolding processes across hundreds or even thousands of individuals. Here, we outline the opportunities and challenges associated with such secondary data sets, provide an overview of particularly valuable resources available to the field, and recommend best practices to improve the rigor and transparency of analyses conducted on large, secondary data.

Adolescence is a period of profound and prolonged change in cognitive ability, mental wellbeing, environmental factors, social abilities, hormonal fluctuations, and brain structure and function [10]. Large, deeply phenotyped, longitudinal data sets present a unique opportunity to understand the interplay between these multiple explanatory levels, capturing rich individual differences in processes that unfold during adolescence [24].
Large data sets offer numerous important benefits for understanding adolescent development. Such data sets expand our ability to ask research questions in ways often neglected by experimental studies, by improving our ability to study cross-cultural effects [13], diverse demographics [19], and historical and cohort differences [37]. They increase statistical power, safeguarding against misestimation of effect direction or magnitude (e.g. [20]), and allow us to examine differences in degree (e.g. factor models) or in kind (e.g. by identifying latent clusters or phenotypic profiles [33]) in outcomes of interest. Crucially, large longitudinal designs allow us to study antecedents of relatively rare but impactful phenotypes before their emergence, promising unique avenues toward identifying novel interventions. Although randomized controlled trials are often considered the gold standard for establishing intervention efficacy, they may not always be informative of developmental processes 'in the wild'. For many developmental questions, carefully considered observational studies may in fact be the most sensitive and suitable tool to achieve such inferential aims [50]. Although such large, rich data sets are often too costly in terms of time and logistics to collect at the individual or laboratory level, they are increasingly made available as secondary data sets [26]: data sets collected by other individuals, with other primary purposes in mind. Although secondary data sets may also be small (e.g. focused experimental data), and large data sets are not necessarily made available for secondary analysis, our focus here is on data sets that are both large (larger than typical in the field) and, in principle, publicly available.
Here, we outline different types of large secondary data sets, provide an overview of resources, and illustrate how we can test developmental theories and inform policy and practice, using the example of sensitive periods for adversity (Textbox 1). In addition, we highlight some methodological challenges associated with large, secondary data sets, and provide an overview of best practices that can help improve the robustness and rigor of secondary data analysis.

Secondary data sets and where to find them
Funder mandates toward open data (e.g. the European Research Council (ERC) and the Wellcome Trust), the open science movement (e.g. [32]), and incentives such as 'open data' badges (e.g. [14]) have all encouraged increased data sharing and thus the availability of secondary data sets. These data sets differ in their scope and design but commonly include measures of demographics, mental health, mood, cognitive skills, personality, and/or environmental factors relevant for understanding adolescent development. Increasingly, richer biological data are also collected, such as brain structure and function, genotyping, twin designs, hormone levels, and puberty status (e.g. [44]). The inclusion of these biological factors is especially impactful, as they often require high levels of expertise to acquire and process, as well as large sample sizes to allow their proper integration as explanatory factors, placing them out of reach for many smaller laboratories.
Cohort designs follow participants across time, across one or multiple geographic locations, commonly at fixed intervals. Examples include ALSPAC (Avon Longitudinal Study of Parents and Children), which tracks children born in 1991–1992; the Millennium Cohort Study, which tracks 19,000 United Kingdom children born in 2000–2002; and the ABCD (Adolescent Brain Cognitive Development) data set [44], which will track 11,000 United States 10-year-olds up to adulthood. In contrast, there also exist household panel studies, which are routinely operated at a national level and provide questionnaires to members of specific households over long periods of time (e.g. US (Understanding Society), YRBSS (Youth Risk Behavior Surveillance System), SOEP (German Socio-Economic Panel), HILDA (Household, Income and Labour Dynamics in Australia), and PSID (Panel Study of Income Dynamics)). Increasingly, large-scale studies are collated into larger consortium data sets, for example, by combining diverse studies examining adolescents during the COVID-19 pandemic [5] or by harmonizing developmental cohorts that may include adolescent participants [46]. Data sets vary considerably in the richness and scope of their data, their ease of access (ranging from click-to-download to multimonth vetting processes), and their cost (from free-to-use to several thousand dollars). For the purpose of this article, we have created an Open Science Framework (OSF) resource (see Figure 1 for a screenshot) which collates a number of publicly available developmental data sets that include (parts of) adolescence, the domains and variables included, and details of their data access protocol. This document is dynamic and will be updated based on reader suggestions. The at-a-glance convenience of our spreadsheet is meant to complement, rather than substitute, existing online platforms with extensive search functionality (e.g. https://www.closer.ac.uk/).

Textbox 1
Leveraging secondary data analysis to study sensitive periods for adversity.

Methodological challenges in secondary data analysis
Above, we outline some of the substantive and methodological benefits of large secondary data sets in trying to understand the mechanisms of development during adolescence. Although these data sets confer many benefits, there are also potential methodological challenges associated with their use. We briefly outline some of the most important and less appreciated challenges.
Large sample sizes mean that almost every statistical association will be statistically, if not practically, significant. A move away from reporting statistical significance and toward effect sizes and model comparison is therefore desirable [16]. In addition, the analytic flexibility offered by large, phenotypically rich data sets can lead to many different, yet defensible, analytic pipelines [7,28,41], a problem amplified even further if (raw) neuroimaging data are involved [9]. This challenge extends beyond individual studies to the fact that hundreds of researchers access the same data set over time, each with their own, often partially overlapping, research questions. Although Type I errors are often discussed within a single study (or through the related idea of forking paths [21]), multiple research teams investigating the same data set may also lead to inflated error rates [43].
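To make the last point concrete, the following is a minimal sketch under a deliberately simplified assumption: each of k research teams runs a single, independent test of a true null hypothesis at the same alpha level. Under that assumption, the probability that at least one team obtains a false positive is 1 - (1 - alpha)^k. The function name and the independence assumption are ours for illustration; teams with overlapping questions on the same data run correlated tests, so the real rate will differ.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n_tests
    independent tests of true null hypotheses (illustrative assumption)."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


# How the rate grows as more teams test the same data set.
for k in (1, 10, 50, 100):
    rate = familywise_error_rate(k)
    print(f"{k:>3} independent tests: P(at least one false positive) = {rate:.3f}")
```

Even under these generous assumptions, the rate rises quickly with the number of tests, which is one reason to interpret isolated significant findings from heavily used data sets with caution.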
Furthermore, large-scale studies must find a compromise between the competing goals of phenotypic breadth and practicable ease and length of administration. This means that measurements of a given domain, especially when not the core interest of a study, are likely to use short or simplified proxies, potentially compromising various forms of validity (e.g. [27,42]).
One of the biggest challenges in many open data sets is that study design idiosyncrasies may limit our developmental inferences. Most large cohort studies attempt to measure individuals (bi)annually with roughly equal spacing between measurements. This may be for reasons of logistical convenience, or a (mistaken, e.g. see [11]) assumption that longitudinal models cannot accommodate unequally spaced observations. As detailed in [31], cohort studies often confound the number of assessments (which may induce practice
A more metascientific question concerns the scientific desirability of many research studies relying on the same cohort. The United Kingdom has several popular data sets covering adolescence, including ALSPAC, US (Understanding Society), and MCS (Millennium Cohort Study). Although these data sets have many strengths, it is likely not desirable that most of our scientific understanding of adolescence relies to a great degree on a small subset of the population (let alone the global population, [19]). Furthermore, in studies with higher participant burden, notably Magnetic Resonance Imaging (MRI), external validity may be additionally reduced given the differences between MRI-compatible individuals and the general population [30].

Improving secondary data analysis
Although not all pitfalls can be addressed by individual researchers, many can be mitigated. We now offer some recommendations for best practices. First, we recommend researchers preregister their analyses whenever possible [45]. Although preregistration is more commonly performed before data collection, it is also suitable for secondary analysis, for instance, by having a timestamped data access track record [47]. Weston et al. [47] offer multiple recommendations for preregistration tailored to secondary data analysis, and [49] offer a template to preregister multiple existing data sets into a single analysis within the domain of longitudinal studies of aging. An alternative to preregistering a single analytic plan is instead to run many or even all plausible analyses. This strategy, known as multiverse or specification curve analysis, allows a researcher to run a wide variety of defensible analysis pipelines (e.g. in- or exclusion of covariates, parametric vs nonparametric analyses, and so on) and report the full range of findings (e.g. [34]). Other forms of robustness checks include replicating the core analyses in an independent replication sample (which can be an independent subset of the full sample but ideally a fully independent data set), which allows one to compare the generalizability (or lack thereof) across similar or different adolescent cohorts (e.g. [17]).
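The logic of a specification curve can be sketched in a few lines. The example below is an illustration under our own assumptions, not the procedure of [34]: the data are simulated, the variable names (exposure, outcome, covariate) are hypothetical, and the eight specifications simply cross three analytic choices (covariate adjustment by residualising, Pearson vs rank-based correlation, and a 3 SD outlier rule). A real analysis would enumerate the choices that are defensible for the data set at hand.

```python
import itertools
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks 1..n; ties are not handled (not expected in continuous simulated data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        out[index] = float(rank)
    return out


def residualise(values, covariate):
    """Residuals after a simple linear regression of values on one covariate."""
    mv, mc = statistics.fmean(values), statistics.fmean(covariate)
    slope = sum((c - mc) * (v - mv) for c, v in zip(covariate, values)) / sum(
        (c - mc) ** 2 for c in covariate
    )
    return [v - mv - slope * (c - mc) for c, v in zip(covariate, values)]


def trim(x, y, sd_cutoff):
    """Keep pairs where both values lie within sd_cutoff SDs of their means."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    kept = [
        (a, b)
        for a, b in zip(x, y)
        if abs(a - mx) <= sd_cutoff * sx and abs(b - my) <= sd_cutoff * sy
    ]
    return [a for a, _ in kept], [b for _, b in kept]


# Simulated stand-in data (hypothetical variables, not from any real cohort).
rng = random.Random(1)
n = 2000
covariate = [rng.gauss(0, 1) for _ in range(n)]
exposure = [0.5 * c + rng.gauss(0, 1) for c in covariate]
outcome = [0.2 * e + 0.3 * c + rng.gauss(0, 1) for e, c in zip(exposure, covariate)]


def run_specification(adjust, method, sd_cutoff):
    """Estimate the exposure-outcome association under one analytic pipeline."""
    x, y = exposure, outcome
    if adjust:
        x, y = residualise(x, covariate), residualise(y, covariate)
    if sd_cutoff is not None:
        x, y = trim(x, y, sd_cutoff)
    if method == "spearman":
        x, y = ranks(x), ranks(y)
    return pearson(x, y)


# Cross all analytic choices and sort by estimate: this ordering is the "curve".
specifications = itertools.product([False, True], ["pearson", "spearman"], [None, 3.0])
results = sorted(
    (
        (run_specification(adjust, method, cutoff), adjust, method, cutoff)
        for adjust, method, cutoff in specifications
    ),
    key=lambda row: row[0],
)

for estimate, adjust, method, cutoff in results:
    print(f"r = {estimate:.3f} | adjusted: {adjust} | {method} | outlier cutoff: {cutoff}")
```

Reporting the whole sorted list, rather than one preferred row, lets readers see how much the estimate depends on each analytic choice.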
Second, we recommend researchers report effect sizes as many associations may be statistically, but not practically, significant when analyzing large sample sizes. In doing so, we may need to adjust our collective expectations of what effect sizes to expect, and which ones to treat as substantial (e.g. [22]). Whether an effect size is 'large enough' is a context-and research-specific decision that needs to be considered and debated by the community [4].
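As a hedged illustration of why the p-value alone is uninformative at scale, the sketch below holds a very small correlation fixed (r = 0.02, a value we chose for illustration) and computes an approximate two-sided p-value at different sample sizes. It uses the Fisher z approximation, which is our choice for a dependency-free example rather than a method prescribed by the sources cited above.

```python
import math
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 via the Fisher z transform."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


r = 0.02  # illustrative effect size: r squared is 0.0004, i.e. 0.04% shared variance
for n in (100, 1_000, 10_000, 100_000):
    p = correlation_p_value(r, n)
    print(f"n = {n:>7}: r = {r}, r^2 = {r * r:.4f}, p = {p:.2g}")
```

The effect size is identical in every row; only the sample size changes. Whether such an effect matters remains a substantive judgment that the p-value cannot make.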
Relatedly, we recommend that researchers always plot the raw data in their articles. Summary plots such as bar graphs, summary statistics such as the mean/SD (Standard Deviation), or estimates such as correlation coefficients in isolation obscure the richness of distributions, the presence of floor or ceiling effects, the likelihood of subgroup heterogeneity, the appropriateness of parametric techniques, and many more features offered by inspection of raw data [48]. Although visualizing raw high-dimensional data may be challenging, tailored approaches such as raincloud plots [2] and related techniques [8] can easily overcome practical challenges.
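A small simulated example shows how summary statistics can hide structure. In the sketch below, two hypothetical score distributions are rescaled to share the same mean and SD (50 and 10, values chosen for illustration), yet one is unimodal and the other consists of two separate subgroups. The text histogram is only a dependency-free stand-in for a proper raw-data plot such as a raincloud plot.

```python
import random
import statistics


def rescale(values, target_mean, target_sd):
    """Linearly rescale values to the requested mean and (population) SD."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [target_mean + target_sd * (v - mean) / sd for v in values]


def bin_counts(values, low, high, n_bins):
    """Count values per equal-width bin; out-of-range values go to the edge bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        counts[min(n_bins - 1, max(0, index))] += 1
    return counts


# Simulated scores (hypothetical, not from any real cohort).
rng = random.Random(7)
unimodal = rescale([rng.gauss(0, 1) for _ in range(500)], 50, 10)
two_groups = [rng.gauss(-2, 0.5) for _ in range(250)] + [
    rng.gauss(2, 0.5) for _ in range(250)
]
bimodal = rescale(two_groups, 50, 10)

for name, scores in (("unimodal", unimodal), ("bimodal", bimodal)):
    mean, sd = statistics.fmean(scores), statistics.pstdev(scores)
    print(f"{name}: mean = {mean:.1f}, SD = {sd:.1f}")
    # One row per 5-point bin; each '#' stands for five observations.
    for i, count in enumerate(bin_counts(scores, 20, 80, 12)):
        print(f"  {20 + 5 * i:>2}-{25 + 5 * i:<2} {'#' * (count // 5)}")
```

A bar graph of the two means with error bars would look identical for both samples, whereas the histograms make the subgroup structure obvious.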
Finally, the analyses conducted on secondary data are commonly more complex than those applied to simpler experimental designs [28]. Methods sections in high-impact journals are often highly condensed or hidden at the end of an article, which can make it difficult or even impossible to assess exactly which analyses were performed. To address this issue, we recommend authors always publish the full analytic code, even when the raw data cannot be directly shared (e.g. [27,35]). This allows reviewers to assess precisely which analyses were completed, replicate the findings by applying for the data, and ideally expand on the work in future analyses.

Discussion
Secondary data offer a powerful resource to study the mechanisms and outcomes associated with (individual differences in) adolescent development. These samples increase power, allow for a considerably wider range of analytic strategies, and can allow us to ask questions not accessible using classic, smaller-scale experimental designs. Crucially, they improve access to uniquely rich data sets across laboratories and research environments that differ in resources. We provide an easily accessible overview of secondary data sets researchers may want to explore. In addition, we outline some of the unique challenges that can come along with large secondary data sets and how they can be ameliorated to a considerable degree by increased awareness and better practices (see Table 1) that help improve robustness and transparency. Other metascientific challenges of secondary data sets remain to be resolved by the field, including the consequences of over-reliance on a small number of data sets, and the ethical challenges associated with (degrees of) anonymity offered by increasingly rich data sets. By tackling these challenges transparently, secondary data sets offer a unique, complementary source of scientific inquiry into the complex, but fascinating, processes that unfold during adolescence.

Table 1
Secondary data analysis: Checklist.
Item | Yes/no
Is the study preregistered? |
Does it report robustness analyses/specification curve? |
Does it report effect sizes? |
Is there a replication sample (internal or external)? |
Does the article plot raw data? |
Does the article share analytic code? |

Conflict of interest statement
No conflict of interest.