Challenges and Solutions to the Measurement of Neurocognitive Mechanisms in Developmental Settings

ABSTRACT
Identifying early neurocognitive mechanisms that confer risk for mental health problems is one important avenue as we seek to develop successful early interventions. Currently, however, we have limited understanding of the neurocognitive mechanisms involved in shaping mental health trajectories from childhood through young adulthood, and this constrains our ability to develop effective clinical interventions. In particular, there is an urgent need to develop more sensitive, reliable, and scalable measures of individual differences for use in developmental settings. In this review, we outline methodological shortcomings that explain why widely used task-based measures of neurocognition currently tell us little about mental health risk. We discuss specific challenges that arise when studying neurocognitive mechanisms in developmental settings, and we share suggestions for overcoming them. We also propose a novel experimental approach, which we refer to as “cognitive microscopy,” that involves adaptive design optimization, temporally sensitive task administration, and multilevel modeling. This approach addresses some of the methodological shortcomings outlined above and provides measures of stability, variability, and developmental change in neurocognitive mechanisms within a multivariate framework.

Multiple developmental theories implicate the role of specific neurocognitive mechanisms in the onset and maintenance of mental health symptoms (1,2). Research has indicated that neurocognitive difficulties in childhood and adolescence, such as poor self-control, may represent transdiagnostic risk factors for psychopathology (3,4). Therefore, a better understanding of the relationships between neurocognitive mechanisms and mental health across development is needed to develop effective preventative interventions that can reduce that risk for affected individuals, families, and societies.
Accurate measurement of neurocognitive mechanisms entails the administration of task-based measures wherein participants respond to stimuli that putatively engage the mechanisms of interest. For example, in a Go/NoGo task, participants must suppress responses to NoGo stimuli, and making fewer errors on NoGo trials indicates better self-control (5). Task-based measures designed to tap neurocognitive mechanisms are widely used to study clinical and at-risk groups, but recent concerns regarding the psychometric properties of these task-based measures and their sensitivity to individual differences (6-8) have forced the field to take stock of its methods. However, relatively little attention has been devoted to the implications of these concerns for developmental research. If we cannot measure neurocognitive mechanisms reliably and sensitively during development, then we are limited in our ability to discern the nature of mental health vulnerability and to develop early interventions that may prevent mental health symptoms from emerging or escalating.
In this article, we review the challenges associated with measuring neurocognitive mechanisms using task-based measures and their implications for developmental research. In addition to pointing out challenges, we suggest some potentially fruitful avenues to advance the field. More specifically, we propose triangulating methods that have been proven successful individually or in research with adult populations. We also explain how this approach may yield psychometrically valid measurements of individual differences in neurocognitive mechanisms relevant to mental health vulnerability. Here, we focus on behavioral measurements because the proposed approach readily lends itself to behavioral experimentation, including within large-scale data collection. Nevertheless, the proposed approach may also find application in neuroimaging research.

DIFFICULTIES IN USING CURRENT TASK-BASED MEASURES TO STUDY INDIVIDUAL DIFFERENCES IN NEUROCOGNITIVE MECHANISMS UNDERLYING MENTAL HEALTH RISK
Improving our understanding of individual differences in neurocognitive mechanisms during development is essential for individual-level prediction of mental health outcomes in clinical and evidentiary applied settings (9). For example, mental health professionals routinely develop personal treatment plans in the absence of objective markers and models to aid them in predicting mental health trajectories and clinical outcomes (10,11). Improved understanding of individual variability can also help shift views of what it means to be at risk for mental health problems, from a stable feature, or even a label, to a processing style that characterizes some young people more often than others, also depending on the context. Several collaborative efforts have been made to improve our understanding of neurocognitive development and its relation to mental health risk, such as the Adolescent Brain Cognitive Development (ABCD) Study (12) and the Healthy Brain and Child Development Study (13). These efforts have great potential to provide information about normative neurodevelopment as well as biological and environmental pathways to mental ill health (14). Nonetheless, existing large-scale datasets include task-based measures that vary in their psychometric qualities (15), often administered months or years apart. Therefore, the groundwork to develop task-based measures that are sensitive to individual differences in neurocognitive development is urgently needed and can help to improve mental health diagnosis, treatment tailoring, and outcome prediction (16).
Questionnaire measures (completed by parents, teachers, or children themselves) typically outperform tasks in predicting real-world outcomes (17,18). However, questionnaires are not designed to, and hence are not able to, discern potentially different underlying cognitive mechanisms that may lead to similar behavioral profiles but may require different interventions. For example, a questionnaire measuring conduct disorder symptoms cannot be used to discern whether the child displays aggression as an exaggerated response to perceived threat, as a result of low tolerance for frustration, or because the child does not respond to other people's expressed distress and is thus able to act aggressively to get what they want. Knowing what information-processing differences underlie mental health symptoms is relevant for locating the source of the child's difficulty and formulating personalized intervention targets (19,20). There are several reasons why questionnaires outperform tasks in individual-level prediction. Questionnaires are completed based on "priors" that stem from accumulated data relating to each questionnaire item (e.g., having "trouble relaxing") over a particular period of time (e.g., over the last 2 weeks) (21). This averaging over time, and the implied emphasis on traits that are stable in the face of diurnal, stress-induced, and other sources of variation, may partly explain the ability of questionnaires to capture individual differences reliably and sensitively (22,23). Moreover, the long tradition of careful psychometric development of questionnaire measures of mental health vulnerability has not been accompanied by comparable work on how to extract information on individual differences from task-based measures (24). This is understandable given that task-based measures have been developed precisely to minimize between-subject variability and capture aspects of cognitive function that are consistent across individuals (25).
While task-based measures perform well when examining experimental (within-subject) and between-group differences and have provided critical insight into the general principles of human brain and cognitive function, they are seldom optimized to sensitively discern individual differences in continuous trait analyses (26,27). For example, titrating task parameters (e.g., the stop-signal delay) is helpful to detect group-level effects (e.g., in response inhibition), but it hampers our ability to measure individual variation, which is often larger than the variance between groups (28). Task-based measures can provide information about individual differences when additional analytical steps are implemented. One approach that is frequently used is correlating task performance with variance in questionnaire measures, but these correlations tend to be modest (29-31). Furthermore, questionnaires and task-based measures of the same putative underlying cognitive mechanisms may, in fact, assess distinct constructs (26). This problem, which is widespread in cognitive research, is referred to as the jingle-jangle fallacy (30), namely, measures with the same name tapping different constructs (jingle fallacy) and measures with different names tapping the same construct (jangle fallacy). Low overlap between the same putative constructs across measurement types limits our ability to examine associations between neurocognitive mechanisms and observed behavior (32,33).

CHALLENGES AND WAYS FORWARD FOR RELIABLE TASK-BASED RESEARCH IN DEVELOPMENTAL SETTINGS
We argue that some of the main concerns with the use of current task-based measures for individual differences research represent opportunities for research into neurocognitive development.
First, a number of well-established task-based measures have been found to display suboptimal psychometric properties, including poor test-retest reliability (6,7,25) and low internal consistency (7,34). For example, suboptimal test-retest reliability has been observed in child and adolescent longitudinal functional magnetic resonance imaging studies using attentional, emotional, and cognitive control tasks, with lower reliability of blood oxygen level-dependent signal in brain regions subject to greater developmental change (15,35,36). The problem of poor psychometric properties represents an opportunity to conduct targeted research to establish when low reliability and internal consistency estimates reflect task properties versus change in the underlying cognitive mechanisms, including their state-dependent or dynamic nature. The reliability and stability of measures over time are most relevant in developmental settings because they may reflect important dynamics of neurocognitive development. For example, poor test-retest reliability of neuroimaging tasks administered during developmentally sensitive periods may result from brain activation patterns being less stable with increasing interscan intervals, when individual differences in brain development should be expected (37). Temporally sensitive methods, like the one proposed below, offer ways to disentangle different possible contributors to low reliability and may thus lead to novel insights into neurocognitive mechanisms across development.
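To make the notion of internal consistency concrete, the following minimal Python sketch computes an odd/even split-half reliability with the Spearman-Brown correction on simulated data. It is an illustration under stated assumptions rather than an analysis from any of the cited studies: the function names (`simulate_task`, `split_half_reliability`), the Gaussian data-generating model, and all parameter values are hypothetical. With trial-level noise held constant, a task that elicits little between-participant variance should yield a lower reliability estimate than a task that elicits more.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def split_half_reliability(trials_by_participant):
    """Odd/even split-half reliability with the Spearman-Brown correction."""
    odd = [statistics.fmean(t[0::2]) for t in trials_by_participant]
    even = [statistics.fmean(t[1::2]) for t in trials_by_participant]
    r_half = pearson(odd, even)
    # Spearman-Brown: step the half-test correlation up to full test length.
    return 2 * r_half / (1 + r_half)


def simulate_task(n_participants, n_trials, sd_between, sd_trial, seed=0):
    """Hypothetical data: a stable person-level score plus independent trial noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_participants):
        true_score = rng.gauss(0.0, sd_between)
        data.append([true_score + rng.gauss(0.0, sd_trial) for _ in range(n_trials)])
    return data


if __name__ == "__main__":
    # Same trial noise, different amounts of between-participant variance.
    wide = simulate_task(200, 40, sd_between=1.0, sd_trial=2.0)
    narrow = simulate_task(200, 40, sd_between=0.2, sd_trial=2.0)
    print("large between-person variance:", round(split_half_reliability(wide), 2))
    print("small between-person variance:", round(split_half_reliability(narrow), 2))
```

In this sketch a low estimate can only arise from the ratio of between-person variance to trial noise. In developmental data, genuine change in the underlying mechanism is a further possible contributor, which a single-session split-half estimate cannot separate out.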
Task-based measures also require that measurement noise is accounted for to display reasonable psychometric properties, but this is seldom the case (10,12-14). Sources of measurement error that can affect task reliability include, for example, habituation and fatigue (38). In developmental settings, sources of measurement error also need to be distinguished from change in the neurocognitive mechanisms of interest. Indeed, change should be considered as intrinsic and central to the scientific inquiry, rather than treated as a nuisance parameter to control for, thereby removing its effect on individual and group averages. Researchers measuring the same neurocognitive mechanism across different time points should adopt approaches that can quantify variability and change, as well as intra- and interindividual processes and mechanisms that affect the rate of change and variability (e.g., language skills, distractibility, mood, motivation). When studying neurocognitive mechanisms across multiple developmental stages, it should also be noted that the same task might tap different processes at different time points (39) and that inferring the process from the measure is a challenge in itself (40). For example, as children mature, they rely increasingly on sophisticated, goal-directed, model-based decision-making strategies rather than on habitual and computationally less demanding model-free strategies (41-43). Therefore, task-based measures of decision making administered at different time points may capture different processes. By using designs that maximize the signal-to-noise ratio, such as Bayesian optimization, and statistical methods that account for different sources of variance, such as latent variable modeling and multilevel modeling, researchers can distinguish meaningful variability in task performance from measurement error (25,44).
In addition, the emergence of mental health symptoms reflects a multitude of genetic and environmental risk factors, each of which affects multiple neurocognitive systems and contributes a small proportion of variance in total mental health risk (45). Multivariate tools capable of capturing contributing pathways that may contain the shared variance of multiple risk factors can provide predictive leverage (46). Multivariate techniques can be used to build predictive models of mental health risk, for example, by identifying groupings of youths who are experiencing mental health problems over time based on their performance on relevant task-based measures. These data-driven groupings may align better with underlying mechanisms than traditional diagnostic categories (47,48). Neurocognitive mechanisms also do not develop in isolation; they emerge in the context of other processes, some of which may act as gatekeepers. For example, phonological awareness may act as a gatekeeper to working memory during early development. This means that the construct validity of working memory tasks could be affected by phonological awareness during early childhood, but less so at later stages (49). Multivariate approaches also allow examining whether and why variability in task performance tends to decrease across development, while mean performance improves (50). One possibility is that neurocognitive mechanisms are more variable in and of themselves during early relative to late development. Alternatively, decreased task variability over time could be due to decreased gatekeeping by neurocognitive mechanisms other than the one under study. For example, verbal ability may contribute to variability in task performance, as well as to a child's ability to comprehend task instructions, more during early childhood than during late childhood.
Therefore, researchers who are interested in complex cognitive processes may need to measure and model multiple different processes to capture their codevelopment reliably and sensitively over time.
Neurocognitive development is also intrinsically interactive and situated in a particular context. Traditional experimental studies are rarely designed to account for contextual factors such as circadian rhythms, hormonal fluctuations, or changes in the social environment. Contextual factors are often controlled for, but they could instead be examined as factors contributing to variability in the neurocognitive mechanism of interest. This can be achieved by combining cognitive measurements with measures of biological or social environmental factors, particularly measures addressing social risk and protective factors that are critical in shaping development and mental health and that themselves evolve during development. For example, behavioral genetics research has indicated that the rearing environment influences individual differences in cognitive ability during early childhood, but does so to a lesser extent during adolescence, when nonshared environmental exposures (e.g., different peer groups) become relatively more influential (51). In addition, people are active cocreators of their environments, which partly explains why social risk factors are not distributed at random in the population (52,53). Consequently, studying the covariation between neurocognitive function and contextual factors in a temporally sensitive way can help to explain how individual differences in neurocognitive mechanisms relate to the generation and maintenance of social risk.

COGNITIVE MICROSCOPY: ADAPTIVE DESIGN OPTIMIZATION, TEMPORALLY SENSITIVE TASK ADMINISTRATION, AND MULTILEVEL MODELING
We propose one new approach for overcoming some of the methodological shortcomings outlined above, which we refer to as cognitive microscopy (Figure 1). This approach integrates 3 main methodologies (adaptive design optimization, temporally sensitive task administration, and multilevel modeling) to address the challenges of extracting metrics of variability and change within a multivariate framework.
Neurocognitive Measurements in Developmental Settings. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 2023; www.sobp.org/BPCNNI
Adaptive design optimization involves sampling parameters strategically to obtain a sensitive assessment of performance thresholds (54,55). This approach involves 2 main steps. A task is first developed and characterized in terms of a generative model, namely, a model that relates parameter variability to variability in task performance (56). This model is then used during administration of the adaptive optimized task version. On each task trial, a parameter estimate based on the data obtained thus far is inferred, and the following trial is chosen to maximize the amount of information gained. Because this approach maximizes the informative value of each trial, it can be especially beneficial when lengthy cognitive assessments are impractical or costly, such as in large-scale data collections or in functional magnetic resonance imaging research (54). For example, in a decision-making task, each participant would be shown choice options that are tailored to their response patterns rather than all possible options (55). Bayesian adaptive methods are the state-of-the-art for efficient adaptive design and allow an optimal tradeoff between minimal task length and efficient parameter estimation from task performance (34). Bayesian adaptive methods are particularly efficient because they minimize the number of steps required to identify the underlying cognitive model and its parameter values (57). Therefore, such methods can obviate the need to collect data from large samples or to administer long and demanding assessments, which is especially difficult in developmental settings (58).
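The two steps described above (specify a generative model, then choose each trial to maximize expected information) can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions and is not the procedure used in the cited work: the logistic psychometric function, its fixed slope, the parameter and design grids, and all function names are hypothetical choices made for the example. The posterior over a single threshold parameter is approximated on a grid, and the next stimulus is the one with the highest mutual information between the upcoming response and the parameter.

```python
import math
import random


def p_correct(x, theta, slope=1.5):
    """Assumed generative model: logistic psychometric function with threshold theta."""
    return 1.0 / (1.0 + math.exp(-slope * (x - theta)))


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def expected_information_gain(x, thetas, posterior):
    """Mutual information between the next response and theta for design x."""
    likes = [p_correct(x, t) for t in thetas]
    marginal = sum(w * l for w, l in zip(posterior, likes))
    conditional = sum(w * binary_entropy(l) for w, l in zip(posterior, likes))
    return binary_entropy(marginal) - conditional


def update(posterior, thetas, x, correct):
    """Bayes rule on the grid after observing one response."""
    likes = [p_correct(x, t) if correct else 1.0 - p_correct(x, t) for t in thetas]
    unnorm = [w * l for w, l in zip(posterior, likes)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def run_adaptive_session(true_theta, n_trials=30, seed=0):
    """Simulate one participant; return the posterior mean and SD of theta."""
    rng = random.Random(seed)
    thetas = [i * 0.1 - 3.0 for i in range(61)]    # parameter grid from -3 to 3
    designs = [i * 0.25 - 3.0 for i in range(25)]  # candidate stimulus levels
    posterior = [1.0 / len(thetas)] * len(thetas)  # flat prior
    for _ in range(n_trials):
        # Pick the stimulus expected to be most informative given current beliefs.
        x = max(designs, key=lambda d: expected_information_gain(d, thetas, posterior))
        correct = rng.random() < p_correct(x, true_theta)
        posterior = update(posterior, thetas, x, correct)
    mean = sum(w * t for w, t in zip(posterior, thetas))
    sd = math.sqrt(sum(w * (t - mean) ** 2 for w, t in zip(posterior, thetas)))
    return mean, sd


if __name__ == "__main__":
    mean, sd = run_adaptive_session(true_theta=0.8, n_trials=30)
    print(f"posterior mean = {mean:.2f}, posterior SD = {sd:.2f}")
```

A real task would typically involve several parameters and a richer model, in which case the grid would likely be replaced by sampling-based or other approximate inference. The sketch only conveys why each adaptively chosen trial narrows the posterior faster than a fixed stimulus schedule would be expected to.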
Temporally sensitive task administration means flexible sampling that can be tailored to the neurocognitive mechanism of interest to detect temporal variability within a given time frame, from minutes to years, depending on the research question. One example of this approach is repeated short task administration, which allows capturing snapshots of cognitive function and extracting within-person metrics without introducing some of the measurement artifacts associated with traditional single-shot full-length task administration in laboratory settings (10,59,60). This approach can detect developmental change even when considering a relatively small number of repeated task administrations in a limited time window. When increasing sample size is precluded, repeated task administration is also an alternative method to increase statistical power (61). Although traditional one-occasion snapshot measurements can capture the average performance in a given setting, they do not allow the investigation of stability and variability over time. Repeated short tasks could be delivered noninvasively in naturalistic settings using portable or wearable devices (62). In this modality, repeated short tasks may also be readily integrated into large-scale data collection, thereby overcoming power limitations of small-sample studies (63,64). This approach shares features with methods such as experience sampling, ambulatory assessment, ecological momentary assessment, and intensive longitudinal data collection (65). These methods have been increasingly employed to assess affect and mood but less so for neurocognitive mechanisms (66,67), likely because of the length of traditional task-based measures and concerns about their reliability. Bayesian adaptive optimization methods have recently been used to obtain metrics of stability and variability based on brief and internally valid individual task assessments (34,68).
However, only limited psychometric work has been conducted on extracting stable and variable properties of cognitive function from short tasks despite the possibility that these would offer something conceptually comparable with questionnaire ratings, which are based on a number of exemplars of a particular trait or behavior. In other words, repeated short task administration could enable extracting performance averages that represent stable information-processing characteristics comparable with the stable characteristics that are sampled by questionnaire ratings. Furthermore, variability in task performance over time has rarely been the object of study in and of itself (69), with some exceptions such as the study of reaction time variability in attention-deficit/hyperactivity disorder (70-72). This is an area that deserves more attention because the consistency of a particular information-processing style may itself indicate greater or lesser mental health risk. Advances have also been made in delivering brief task assessments at scale (73,74). However, the studies that have been conducted to date have typically been cross-sectional, with no explicit focus on the longitudinal characterization of neurocognitive mechanisms, which we argue is particularly beneficial when studying developmental samples. Metrics of stability and variability in cognitive function may represent markers of mental health risk and also, therefore, offer important clues regarding mechanisms that should be targeted in interventions.
Figure 1. Visual representation of the proposed cognitive microscopy approach, consisting of 3 steps. (A) Adaptive design optimization: designing a task that dynamically modifies some aspects (e.g., task difficulty) based on participant performance, i.e., using Bayesian adaptive methods to estimate parameter values best describing the data as they accumulate (after each trial or a number of trials) and optimizing trials accordingly. (B) Temporally sensitive task administration: administering task-based assessments in a way that accounts for temporal variability in the cognitive function of interest, e.g., by sampling it at different times of the day for short intervals through mobile technology. (C) Multilevel modeling: using statistical techniques to analyze average performance (stable between-person variance) but also individual variability (intertrial and interassessment variance) and developmental change (interassessment variance), net of measurement error. In the example, participant 1 has stable average but high variation in performance, participant 2 has low within-session variability but high between-session variability, and participant 3 is relatively stable over time.
Multilevel models (or hierarchical linear models) can be especially fruitful when analyzing data from repeated adaptive task administrations in developmental settings. Multilevel models are a family of statistical techniques that can be used to detect sources of variability in the presence of multiple sampling dimensions, such as clusters of participants within groups (e.g., students in the same class) or assessment visits in longitudinal designs (75,76). Because they separate variability within participants, differences between them, and measurement error, these models are better suited to studying development as a continuous process than traditional statistical methods (27,77-79). In this context, multilevel modeling strategies have been used to study the development of cognitive functions (80) and brain circuits supporting them (81). Although multilevel modeling requires considerable quantitative expertise, the benefits offered by this approach may motivate researchers to develop expertise in this area. A wide range of open-source, hands-on tutorials (e.g., R Boot camp: Introduction to Multilevel Model and Interactions, offered by Penn State University) and tools for classical (82) and Bayesian (83) estimation are freely available online. Moreover, large-scale collaborative efforts and consortia have created opportunities to aggregate datasets and increase power to conduct such sophisticated statistical analyses. A case could also be made for academic institutions to provide researchers, especially early-career researchers, with the time and training infrastructure needed to acquire expertise in relevant analytical tools. Ideally, institutions would invest in growing and retaining methodological expertise in this area by offering job security and career progression pathways to researchers who focus specifically on developing and applying such analytical tools.
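As a rough illustration of how repeated short assessments can be partitioned into stable between-person differences, session-to-session variability, and trial-level noise, the Python sketch below simulates trials nested in sessions nested in participants and recovers the three variance components. It is a simplified stand-in and not a full multilevel model: it uses method-of-moments (nested ANOVA) estimators that are valid only for a balanced design, it treats all trial-level residual variance as noise, and it omits developmental change (for example, a slope over sessions). The function names and parameter values are hypothetical.

```python
import random
import statistics


def simulate_nested(n_people, n_sessions, n_trials,
                    sd_person, sd_session, sd_trial, seed=0):
    """Hypothetical data: trials nested in sessions nested in participants."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        person_level = rng.gauss(0.0, sd_person)
        sessions = []
        for _ in range(n_sessions):
            session_level = person_level + rng.gauss(0.0, sd_session)
            sessions.append([session_level + rng.gauss(0.0, sd_trial)
                             for _ in range(n_trials)])
        data.append(sessions)
    return data


def variance_components(data):
    """Method-of-moments variance components for a balanced nested design."""
    n_people, n_sessions, n_trials = len(data), len(data[0]), len(data[0][0])
    session_means = [[statistics.fmean(trials) for trials in person] for person in data]
    person_means = [statistics.fmean(means) for means in session_means]
    grand_mean = statistics.fmean(person_means)

    # Sums of squares at each level of the hierarchy.
    ss_trial = sum((y - session_means[i][j]) ** 2
                   for i, person in enumerate(data)
                   for j, trials in enumerate(person)
                   for y in trials)
    ss_session = n_trials * sum((session_means[i][j] - person_means[i]) ** 2
                                for i in range(n_people)
                                for j in range(n_sessions))
    ss_person = n_sessions * n_trials * sum((m - grand_mean) ** 2 for m in person_means)

    ms_trial = ss_trial / (n_people * n_sessions * (n_trials - 1))
    ms_session = ss_session / (n_people * (n_sessions - 1))
    ms_person = ss_person / (n_people - 1)

    # Solve the expected-mean-square equations; truncate negative estimates at zero.
    return {
        "trial": ms_trial,
        "session": max(0.0, (ms_session - ms_trial) / n_trials),
        "person": max(0.0, (ms_person - ms_session) / (n_sessions * n_trials)),
    }


if __name__ == "__main__":
    data = simulate_nested(150, 8, 20, sd_person=1.0, sd_session=0.5, sd_trial=2.0)
    for level, value in variance_components(data).items():
        print(f"{level:>7} variance: {value:.2f}")
```

With unbalanced data, missing sessions, covariates, or person-specific variability (as in the three example participants of Figure 1), a likelihood-based or Bayesian multilevel model fitted with dedicated software would be the more appropriate choice.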

CONCLUSIONS
Efforts to develop more sensitive, reliable, and scalable neurocognitive measurements are required to identify developmental mechanisms of mental health problems. Our progress toward effective intervention will undoubtedly rely on our ability to build a proper mechanistic understanding of mental health conditions, for which improved cognitive phenotyping is vital. In developmental settings, this painstaking work needs to account for the stability, variability, and change in neurocognitive mechanisms that occur across development. Methodological approaches that aim to do so, such as adaptive design optimization, temporally sensitive task administration, and multilevel modeling, have the potential to improve our understanding of the neurocognitive mechanisms that underlie mental health risk across developmental stages.