Neurocognitive reorganization between crystallized intelligence, fluid intelligence and white matter microstructure in two age-heterogeneous developmental cohorts

Highlights • In childhood and adolescence, cognitive ability is ‘best’ explained as consisting of two factors, gc and gf.• White matter tracts provide independent contributions to cognitive ability.• Associations between white matter and intelligence differed from childhood to adolescence.


Introduction
Intelligence measures have repeatedly been shown to predict important life outcomes such as educational achievement (Deary et al., 2007) and mortality (Calvin et al., 2011). Modern investigations of intelligence began over 100 years ago, when Spearman first proposed g (for 'general intelligence') as the underlying factor behind his positive manifold of cognitive ability and established intelligence as a central theme of psychological research (Spearman, 1904). Cattell proposed a division of Spearman's g-factor into two separate yet related constructs, crystallized (gc) and fluid (gf) intelligence (Cattell, 1967). Cattell suggested that gc represents the capacity to effectively complete tasks based on acquired knowledge and experience (e.g. arithmetic, vocabulary) whereas gf refers to one's ability to solve novel problems without task-specific knowledge, relying on abstract thinking and pattern recognition (see also Deary et al., 2010).
Although other detailed conceptualizations of intelligence are available (e.g. Schneider and McGrew, 2012), fluid and crystallized intelligence have proven especially insightful regarding developmental changes in intelligence. For instance, current understanding of lifespan trajectories of gc and gf using cross-sectional (Horn and Cattell, 1967) and longitudinal (McArdle et al., 2000;Schaie, 1994) cohorts indicates that gc slowly improves until late age while gf increases into early adulthood before steadily decreasing. However, most of the literature on individual differences between gc and gf has focused on early to late adulthood. As a result, considerably less is known about the association between gc and gf in childhood and adolescence (but see Hülür et al., 2011).
There has, however, been a recent rise in interest in this topic in child and adolescent samples. For instance, research on age-related differentiation and its inverse, age dedifferentiation, in younger samples has greatly expanded since first being pioneered in the middle of the 20th century (Garrett, 1946). According to the age differentiation hypothesis, cognitive factors become less correlated (more differentiated) with increasing age. For example, the relationship (covariance) between gc and gf would decrease as children age into adolescence, suggesting that cognitive abilities increasingly specialize into adulthood. In contrast, the age dedifferentiation hypothesis predicts that cognitive abilities become more strongly related (less differentiated) throughout development. In this case, gc and gf covariance would increase between childhood and adolescence, potentially indicating a strengthening of the g-factor across age. However, despite its increased attention in the literature, the debate remains unsolved as evidence in support of both hypotheses has been found (Bickley et al., 1995;de Mooij et al., 2018;Gignac, 2014;Hülür et al., 2011;Juan-Espinosa et al., 2000;Tideman and Gustafsson, 2004). Together, this literature highlights the importance of a lifespan perspective on theories of cognitive development, as neither age differentiation nor dedifferentiation may be solely able to capture the dynamic changes that occur from childhood to adolescence and (late) adulthood (Hartung et al., 2018).
The introduction of non-invasive brain imaging technology has complemented conventional psychometric approaches by allowing for fine-grained probing of the neural bases of human cognition. A particular focus in developmental cognitive neuroscience has been the study of white matter using techniques such as diffusion-weighted imaging, which allows for the estimation of white matter microstructure (Wandell, 2016). Both cross-sectional and longitudinal research in children and adolescents using fractional anisotropy (FA), a commonly used estimate of white matter integrity, have consistently revealed strong correlations between FA and cognitive ability using tests of working memory, verbal and non-verbal performance (Koenis et al., 2015;Krogsrud et al., 2018;Peters et al., 2014;Tamnes et al., 2010;Urger et al., 2015). Moreover, recent research has also found associations between the corpus callosum (Navas-S� anchez et al., 2014;Westerhausen et al., 2018) association fibers (e.g. inferior longitudinal fasciculus, see Peters et al., 2014), the superior longitudinal fasciculus (Urger et al., 2015), and differences in cognitive ability, suggesting the importance of white matter integrity across large coordinated brain networks for high cognitive performance. However, interpretations of these studies are limited due to restricted cognitive batteries (e.g. small number of tests used) and a dearth of theory-driven statistical analyses (e.g. structural equation modelling). Interestingly, preliminary evidence suggests that the associations between brain structure and cognitive performance may not be static during development. For instance, Koenis et al., 2018 observed increased correlation between cognitive performance and FA-derived metric changes over adolescence.
For these reasons, several outstanding questions in the developmental cognitive neuroscience of intelligence remain: 1) Are the white matter substrates underlying intelligence in childhood and adolescence best understood as a single global factor or do individual tracts provide specific contributions to gc and gf?, 2) If they are specific, are the tract contributions identical between gc and gf?, and 3) Does this brainbehavior mapping change in development (e.g. age differentiation/ dedifferentiation or both)?
To examine these questions, our preregistered hypotheses are as follows: 1) gc and gf are separable constructs in childhood and adolescence.
More specifically, the covariance among scores on cognitive tests are more adequately captured by the two-factor (gc-gf) model as opposed to a single-factor (e.g. g) model.
2) The covariance between gc and gf differs (decreases) across childhood and adolescence.
3) White matter tracts make unique complementary contributions to gc and gf. 4) The contributions of these tracts to gc and gf differ (decrease) with age.
To address these questions, we examined the relationship between gc and gf in two large cross-sectional child and adolescent samples. The first is the Centre for Attention, Learning and Memory (CALM, see Holmes et al., 2019). This sample, included in our preregistration, was recruited atypically (see Methods for more detail) and generally includes children with slightly lower cognitive abilities than age-matched controls. To examine whether findings from CALM would generalize to other samples, we also conducted non-preregistered analyses on the Nathan Kline Institute (NKI)-Rockland Sample, a cohort with similar population demographics to the United States (e.g. race and socioeconomic status, see Table 1 of Nooner et al., 2012). All analyses were carried out using structural equation modelling (SEM), a multivariate statistical framework combining factor and path analysis to examine the extent to which causal hypotheses concerning latent (unobserved, e.g. g) and manifest (observed, e.g. cognitive tests scores) variables (Schreiber et al., 2006) are in line with the observed data. Taken together, this paper sought to investigate the relationship between measures of intelligence (gc and gf) and white matter connectivity in typically and atypically (struggling learners) developing children and adolescents.

Participants
For the CALM sample, we analyzed the most recent data release (N ¼ 551; 170 female, 381 male 2 ; age range ¼ 5.17-17.92 years) at the time of preregistration (see https://aspredicted.org/5pz52.pdf). Participants were recruited based on referrals made for possible attention, memory, language, reading and/or mathematics problems (Holmes et al., 2019). Participants with or without formal clinical diagnosis were referred to CALM. Exclusion criteria included known significant and uncorrected problems in vision or hearing and a native language other than English. A subset of participants completed MRI scanning (N ¼ 165; 56 female, 109 male; age range ¼ 5.92-17.92 years). For more information about CALM, see http://calm.mrc-cbu.cam.ac.uk/.
Next, to assess the generalizability of our findings in CALM, we used a non-preregistered subset of the data from the Nathan Kline Institute (NKI)-Rockland Sample (cognitive data: N ¼ 337; 149 female, 188 male; age range ¼ 6.12-17.94 years; neural data: N ¼ 65; 27 female, 38 male; age range ¼ 6.97-17.8 years). This multi-institutional initiative recruited a lifespan (aged between 6 and 85 years), communityascertained sample (Nooner et al., 2012). We chose this sample due to its representativeness (demographics resemble those of the United States population) and the fact that its cognitive battery assessments closely-matched CALM. For more information about the NKI-Rockland Sample and its procedures, see http://rocklandsample.org/. Also see Fig. 1 for age distributions of CALM and NKI-Rockland. These same two cohorts were used in a recent paper to address a distinct set of questions (Fuhrmann et al., 2019).

Cognitive assessments: gc, gf, and working memory
All cognitive data from the CALM sample were collected on a one-toone basis by an examiner in a dedicated child-friendly testing room. The test battery included a wide range of standardized assessments of cognition and learning (Holmes et al., 2019). Participants were given 2 Gender was coded as either female or male. However, it should be noted that participants might identify themselves as 'Other', which, to our knowledge, was not an option according to the biographical protocols used in either sample. regular breaks throughout the session. Testing was divided into two sessions for participants who struggled to complete the assessments in one sitting. For analyses of the NKI-Rockland Sample cohort, we matched tasks used in CALM except for the Peabody Picture Vocabulary Test, Dot Matrix, and Mr. X, which were only available for CALM. For the NKI-Rockland Sample, we included the N-Back task, which is not available in CALM (Nooner et al., 2012). In both samples, only raw scores obtained from assessments were included in analyses. Due to varying delays between recruitment and testing in the NKI-Rockland cohort, we only used cognitive test scores completed no later than six months after initial recruitment. The cognitive tasks are further described in Table 1; the raw scores are depicted in Fig. 2.
The forward and backward digit working memory tasks, while measuring the same cognitive abilities, used different test batteries and scoring protocols. For CALM, these scores indicate the total number of correctly recalled digits across all trials while the NKI-Rockland scores  were transformed into a span score. Due to this discrepancy (see References in Table 1 and Alloway et al., 2008 for statistical comparisons between the batteries), these tasks are plotted separately (see Fig. 2).

MRI acquisition
The CALM sample neuroimaging data were obtained at the MRC Cognition and Brain Sciences Unit, Cambridge, UK. Scans were acquired on the Siemens 3 T Tim Trio system (Siemens Healthcare, Erlangen, Germany) via 32-channel quadrature head coil. All T1-weighted volume scans were acquired using a whole brain coverage 3D magnetizationprepared rapid acquisition gradient echo (MPRAGE) sequence with 1 mm isotropic image resolution with the following parameters: Repetition Time (TR) ¼ 2250 ms; Echo Time (TE) ¼ 3.02 ms; Inversion Time (TI) ¼ 900 ms; flip angle ¼ 9 degrees; voxel dimensions ¼ 1 mm isotropic; GRAPPA acceleration factor ¼ 2. Diffusion-Weighted Images (DWI) were acquired using a Diffusion Tensor Imaging (DTI) sequence with 64 diffusion gradient directions with a b-value of 1000s/mm 2 , plus one image acquired with a b-value of 0. Other relevant parameters include: TR ¼ 8500 ms, TE ¼ 90 ms, voxel dimensions ¼ 2 mm isotropic.

White matter connectome construction
Note that part of the following pipeline is identical to that described in . Diffusion-weighted images were pre-processed to create a brain mask based on the b0-weighted image (FSL BET; Smith, 2002) and to correct for movement and eddy current-induced distortions (eddy; Graham et al., 2016). Subsequently, the diffusion tensor model was fitted and fractional anisotropy (FA) maps were calculated (dtifit). Images with a between-image displacement greater than 3 mm as indicated by FSL eddy were excluded from further analysis. All steps were carried out with FSL v5.0.9 and were implemented in a pipeline using NiPyPe v0.13.0 (Gorgolewski et al., 2011). To extract FA values for major white matter tracts, FA images were registered to the FMRIB58 FA template in MNI space using a sequence of rigid, affine, and symmetric diffeomorphic image registration (SyN) as implemented in ANTS v1.9 (Avants et al., 2008). Visual inspection indicated good image registration for all participants. Subsequently, binary masks from a probabilistic white matter atlas (threshold at >50 % probability) in the same space were applied to extract FA values for white matter tracts (see below). Participant movement, particularly in developmental samples, can significantly affect the quality, and, hence, statistical analyses of MRI data. Therefore, we undertook several procedures to ensure adequate MRI data quality and minimize potential biases due to subject movement. First, for the CALM sample, children were trained to lie still inside a realistic mock scanner prior to their actual scans. Secondly, for both samples, all T1-weighted images and FA maps were visually examined by a qualified researcher to remove low quality scans. Lastly, quality of the diffusion-weighted data were evaluated in both samples by calculating the framewise displacement between subsequent volumes in the sequence. Only data with a maximum between-volume displacement below 3 mm were included in the analyses. All steps were carried out with FMRIB Software Library v5.0.9 and implemented in the pipeline using NiPyPe v0.13.0 (see https://nipype.readthedocs.io/en/latest/).

Neural measures: white matter and fractional anisotropy
To approximate white matter contributions to fluid and crystallized ability, we analyzed fractional anisotropy (FA; see Wandell, 2016). We based our choice of FA on previous studies of white matter in developmental samples (de Mooij et al., 2018; Kievit et al., 2016). We used FA as a general summary metric of white matter microstructure as it cannot directly discern between specific cellular components (e.g. axonal diameter, myelin density, water fraction). Mean FA was computed for 10 bilateral tracts as defined by the Johns Hopkins University DTI-based white matter tractography atlas (see Fig. 1 of Hua et al., 2008): forceps minor (FMin), forceps major (FMaj), anterior thalamic radiations  Fig. 3 shows the cross-sectional trends of FA across the age range for both samples.

Statistical analyses
We used structural equation modelling (SEM), a multivariate approach that combines latent variables and path modelling to test causal hypotheses (Schreiber et al., 2006) as well as SEM trees, which combine SEM and decision tree paradigms to simultaneously permit exploratory and confirmatory data analysis (Brandmaier et al., 2013).
We performed structural equation modelling (SEM) using the lavaan package version 0.5-22 (Rosseel, 2012) in R (R Core Team, 2018) and versions 2.9.9 and 0.9.12 of the R packages OpenMx (Boker et al., 2011) and semtree (Brandmaier et al., 2013), respectively. To account for missing data and deviations from multivariate normality, we used robust full information maximum likelihood estimator (FIML) with a Yuan-Bentler scaled test statistic (MLR) and robust standard errors (Rosseel, 2012). We evaluated overall model fit via the (Satorra-Bentler scaled) chi-squared test, the comparative fit index (CFI), the standardized root mean squared residuals (SRMR), and the root mean square error of approximation (RMSEA) with its confidence interval (Schermelleh-Engel et al., 2003). Assessment of model fit was defined as: CFI (acceptable fit 0.95-0.97, good fit >0.97), SRMR (acceptable fit 0.05-.10, good fit <0.05), and RMSEA (acceptable fit 0.05-0.08, good fit <0.05). To determine whether gc and gf were separable constructs, we compared a two-factor (gc-gf) model to an single-factor (g) model. To investigate if the covariance between gc and gf differed across ages, we conducted multiple group comparisons between younger and older participants based on median splits (CALM split at 8.91 years yielding N ¼ 279 young and 272 old; NKI-Rockland split at 11.38 years into N ¼ 169 young and N ¼ 168 old). Doing so inevitably led to slightly unbalanced numbers of participants with white matter data (CALM: . To test measurement invariance across age groups (Putnick and Bornstein, 2016), we fit multigroup models (French and Finch, 2008), constraining key parameters across groups. Model comparisons and deviations from measurement invariance were determined using the likelihood ratio test and Akaike information criterion (AIC, see Bozdogan, 1987).
To examine whether white matter tracts made unique contributions to our latent variables we fit Multiple Indicator, Multiple Cause (MIMIC) models (J€ oreskog and Goldberger, 1975;Kievit et al., 2012). Lastly, we conducted a SEM tree analysis, a method that combines the confirmatory nature of SEM with the exploratory framework of decision trees (Brandmaier et al., 2013). SEM trees hierarchically and recursively partition datasets based on a covariate (in our case age). This creates data-driven age-groups which show differences in one or more paths of interest. The advantage of SEM trees is that they do not require a-priori decisions as to where potential categorical boundaries between age groups may lie (as was the case in the median split analysis). SEM trees also do not require a-priori knowledge as to the shape of developmental trajectories (as is usually the case when using age as a continuous covariate). Using this technique therefore allowed us to: 1) examine the robustness of findings based on the median age split, and 2) examine whether white matter contributions differed across age groups of younger and older participants in a data-driven way (Hypothesis 4). Therefore, for our SEM tree analyses in CALM and NKI-Rockland, we used age as a continuous covariate. Finally, we used Bonferroni-correction at alpha level .001 to correct for multiple comparisons, and the semtree cross-validation scheme, which "partitions the data for maximizing splits on each variable, then comparing maximum splits across each variable on the rest of the data" (see https://cran.r-pr oject.org/web/packages/semtree/semtree.pdf for more details)  Fig. 3. Scatterplots of FA values for all white matter tracts across age for CALM and NKI-Rockland samples. Note that the age trends are more pronounced in CALM than in the NKI sample, possibly due to lower sample size in the NKI-Rockland sample (N ¼ 65). Lines and shades reflect linear and polynomial fit and 95 % confidence intervals, respectively. Solid lines: CALM. Dashed lines: NKI-Rockland. Abbreviations: anterior thalamic radiations (ATR), corticospinal tract (CST), cingulate gyrus (CING), cingulum [hippocampus] (CINGh), inferior fronto-occipital fasciculus (IFOF), inferior longitudinal fasciculus (ILF), superior longitudinal fasciculus (SLF), uncinate fasciculus (UNC), forceps major (FMaj), and forceps minor (FMin).

Covariance among cognitive abilities cannot be captured by a singlefactor
In accordance with our preregistered analysis plan, we first describe model fit for the measurement models of the cognitive data only. First, we tested hypothesis 1: that gc and gf are separable constructs in childhood and adolescence. More specifically, we tested the hypothesis that the covariance among scores on cognitive tests would be better captured by a two-factor (gc-gf) model than a single-factor (e.g. g) model. In support of this prediction, the single-factor model fit the data poorly: χ 2 (27) ¼ 317.695, p < .001, CFI ¼ .908, SRMR ¼ .040, RMSEA ¼ .146 [.132 .161], Yuan-Bentler scaling factor ¼ 1.090, suggesting that cognitive performance was not well represented by a singlefactor. The two-factor (gc-gf) model also displayed poor model fit (χ 2 (24) ¼ 196.348, p < .001, CFI ¼ .946, SRMR ¼ .046, RMSEA ¼ .119 [.104 .135], Yuan-Bentler scaling factor ¼ 1.087), although it fit significantly better (χ 2 Δ ¼ 119.41, dfΔ ¼ 3, AICΔ ¼ 127, p < .001) than the single-factor model.
To investigate the source of poor fit, we examined modification indices (Schermelleh-Engel et al., 2003a), which quantify the expected improvement in model fit if a parameter is freed. Modification indices suggested that the Peabody Picture Vocabulary Test had a very strong cross-loading onto the fluid intelligence latent factor. The Peabody Picture Vocabulary Test (PPVT), often considered a crystallized measure in adult populations, asks participants to choose the picture (out of four multiple-choice options) corresponding to the meaning of the word spoken by an examiner. Including a cross-loading between gf and the PPVT drastically improved goodness of fit (χ 2 Δ ¼ 67.52, dfΔ ¼ 1, AICΔ ¼ 100, p < .001) to adequate (χ 2 (23) ¼ 104.533, p < .001, CFI ¼ .975, SRMR ¼ .025, RMSEA ¼ .083 [.067 .099], Yuan-Bentler scaling factor ¼ 1.069). A likely explanation of this result is that such tasks may draw considerably more on executive, gf-like abilities in younger, lower ability samples. For a more thorough investigation of the loading of PPVT across development, see Supplementary Material. Notably, fitting the PPVT as a solely fluid task (i.e. removing it as a measurement of gc entirely) did not significantly decrease model fit (χ 2 Δ ¼ 2.058, dfΔ ¼ 1, AICΔ ¼ 1, p ¼ .152). Therefore, we decided to proceed with the more parsimonious PPVT gf-only model (χ 2 (24) ¼ 106.382, p < .001, CFI ¼ .972, SRMR ¼ .025, RMSEA ¼ .082 [.066 .098], Yuan-Bentler scaling factor ¼ 1.073). We note that although this is a data-driven modification, we believe it would likely generalize to samples with similarly low ages and abilities.
Next, we examined whether the single or two-factor model fit best in the NKI-Rockland sample. The single-factor model fit the data adequately (χ 2 (14) ¼ 41.329, p < .001, CFI ¼ .983, SRMR ¼ .029, RMSEA ¼ .075 [.049 .102], Yuan-Bentler scaling factor ¼ .965). Still, the two-factor model showed considerably better fit (χ 2 (12) ¼ 19.732, p ¼ .072, CFI ¼ .995, SRMR ¼ .018, RMSEA ¼ .043 [.000 .075], Yuan-Bentler scaling factor ¼ .956) compared to the single-factor model (χ 2 Δ ¼ 20.661, dfΔ ¼ 2, AICΔ ¼ 17, p < .001). It should be noted that, given the differences in tasks measured between the samples, gf and working memory were assumed to be measurements of the same latent factor, rather than separable factors. A similar competing model where gf and working memory were modeled as separate constructs with working memory loaded onto gf, similarly to the best-fitting model for the CALM sample (see Fig. 4), showed comparable model fit and converging conclusions with further analyses. Overall, these findings suggested, that for both the CALM and NKI-Rockland samples, a twofactor model with separate gc and gf factors provided a better account of individual differences in intelligence than a single-factor model.

Evidence of age differentiation between crystallized and fluid ability
We investigated the relationship between gc and gf in development to see whether we could observe evidence for age differentiation as predicted by hypothesis 2. Age differentiation (e.g. Hülür et al., 2011) would predict decreasing covariance between gc and gf from childhood to adolescence. We fit a multigroup confirmatory factor analysis to assess fit on our younger (N ¼ 279) and older (N ¼ 272) participant cohorts. The model had acceptable fit (χ 2 (48) ¼ 142.214, p < .001, CFI ¼ .960, SRMR ¼ .037, RMSEA ¼ .085 [.069 .102], Yuan-Bentler scaling factor ¼ 1.019). However, a likelihood ratio test, showed that model fit did not decrease significantly when imposing equal covariance between gc and gf in the younger and older participant subgroups (χ 2 Δ ¼ 0.323, dfΔ ¼ 1 AICΔ ¼ 2, p ¼ .57). This suggested no evidence for age differentiation in the CALM sample. However, the lack of association could be due to limitations of using median splits to investigate age differences when independent (or latent in our case) variables are correlated (Iacobucci et al., 2015). For instance, if the age range of differences in behavioral associations between gc and gf lies elsewhere, the median split may not be sensitive enough to detect it. To test this explicitly, we next fit SEM trees (Brandmaier et al., 2013) to the cognitive data.
We estimated SEM trees in the CALM sample by specifying the cognitive model with age as a continuous covariate. We observed a SEM tree split at age 9.12, yielding two groups (younger participants ¼ 290, older participants ¼ 261). This split was accompanied by a decrease in the unstandardized parameter estimate between gc and gf (from .64 to .59, see Table 3 in Section 3.5), providing support for age differentiation using a more exploratory approach (SEM tree: 9.12 versus median split: 8.91). When we fit the two-factor model before and after the SEM tree age split, we found that the correlation between gc and gf increased slightly (from .90 to .92).
In contrast to the multigroup model outcome, the NKI-Rockland SEM tree model under identical specifications as in CALM failed to produce an age split. A possible explanation is that to penalize for multiple testing we relied on Bonferroni-corrected alpha thresholds for the SEM tree. If, as seems to be the case here, the true split lies (almost) exactly on the median split, then the SEM tree will have slightly less power than conventional multigroup models, as the SEM tree likelihood ratio test is penalized for the number of tests (splits). These differences between analyses methods suggested that the age differentiation observed here is likely modest in size. Taken together, we interpret our findings as evidence for a small, age-specific but suggestive decrease in gc-gf covariance in both cohorts, which is compatible with age differentiation such that, for younger participants, gc and gf factors are almost indistinguishable, whereas for older participants a clearer separation emerges.

Violation of metric invariance suggests differences in relationships among cognitive abilities in childhood and adolescence
Finally, we more closely examined age-related differences in cognitive architecture (e.g. factor loadings) by examining metric invariance (Putnick and Bornstein, 2016). Testing this in the CALM sample as a two-group model by imposing equality constraints on the factor loadings (fully constrained) showed that the freely-estimated model (no factor loading constraints) outperformed the fully-constrained model (χ 2 Δ ¼ 107.05, dfΔ ¼ 7, AICΔ ¼ 82, p < .001), indicating that metric invariance was violated. This violation of metric invariance suggested Fig. 4. MIMIC models displaying standardized parameter estimates and regression coefficients for all cognitive measures and white matter tracts for complete CALM and NKI-Rockland samples. Dotted, green, and red arrows indicate nonsignificant (>.05), positively significant, and negatively significant path estimates, respectively. Note standardized estimates exceeding 1 in NKI are likely the consequence of highly-correlated factors (J€ oreskog, 1999). that the relationship between the cognitive tests and latent variables was different in the two age groups. Closer inspection suggested that the differences in loadings were not uniform, but rather showed a more complex pattern of age-related differences (see Table 2 for more details). Some of the most pronounced differences include an increase of the loading of matrix reasoning onto gf as well as increased loading of digit recall and dot matrix onto working memory across age groups.
Similarly, in the NKI-Rockland cohort, the freely-estimated model outperformed the constrained model (χ 2 Δ ¼ 41.111, dfΔ ¼ 5, AICΔ ¼ 33, p < .001), indicating that metric invariance was again violated as in CALM. This suggests that the relationship between the cognitive tests and the latent factors differed across age groups. The pattern of factor loadings differed in some respects from CALM. For example, the loading of the N-back task onto gf showed the largest difference across age groups in the NKI-Rockland sample. However, as CALM did not include the N-back task, we cannot directly interpret this as a difference between the cohorts. For detailed comparisons among factor loadings between age groups in both samples, refer to Table 2. The overall pattern in both samples suggested small and varied differences in the relationship between the latent factors and observed scores. A plausible explanation is that the same task draws on a different balance of skills as children differ in age and ability. Our findings concerning the latent factors should be interpreted in this light as it seems likely that in addition to age differentiation (and possibly dedifferentiation) effects, the nature of the factors also differed slightly across the age range studied here.

The neural architecture of gc and gf indicates unique contributions of multiple white matter tracts to cognitive ability
We next focused on the white matter regression coefficients to inspect the neural underpinnings of gc and gf. In line with hypothesis 3, we wanted to explore whether individual white matter tracts made independent contributions to gc and gf. First, we examined whether a single-factor model could account for covariance in white matter microstructure across our ten tracts. If so, then scores on such a latent factor would represent a parsimonious summary for neural integrity. However, this model showed poor fit (χ 2 (35) ¼ 124.810, p < .001, CFI ¼ .938, SRMR ¼ .039, RMSEA ¼ .132 [.107 .157], Yuan-Bentler scaling factor ¼ 1.114), suggesting separate influences from white matter regions in supporting cognitive abilities. To examine whether the white matter tracts showed specific and complementary associations with cognitive performance, we fit a MIMIC model. Doing so, we observed that 5 out of the 10 tracts showed significant relations with gc and/or gf (Fig. 4). Specifically, the anterior thalamic radiations, forceps major, and forceps minor had moderate to strong associations with gc with similar relations seen for gf for the superior longitudinal fasciculus, forceps major, and the cingulate gyrus. Interestingly, the forceps minor exhibited a negative association with gf. This could be due to modeling several highly correlated paths simultaneously since this relationship was not found when only the forceps minor was modeled onto gc (standardized estimate ¼ .426) and gf (standardized estimate ¼ .386, see Tu et al., 2008). Together, individual differences in white matter microstructure explained 32.9 % in crystallized and 33.6 % in fluid ability.
As in the CALM sample, the single-factor white matter model produced poor fit (χ 2 (35) ¼ 131.637, p < .001, CFI ¼ .924, SRMR ¼ .023, RMSEA ¼ .201 [.165 .238], Yuan-Bentler scaling factor ¼ .950) in the NKI-Rockland sample. Therefore, we fit a multi-tract MIMIC model. The superior longitudinal fasciculus emerged as the only tract to significantly load onto gc or gf (Fig. 4). This result was likely due to lower power associated with a small subset of individuals with white matter data (see Discussion for further investigation). In NKI-Rockland, the same set of tracts explained 29.7 % and 26.7 % of the variance in gc and gf, respectively, yielding similar joint effect sizes as in the CALM sample. Together, these findings demonstrated generally similar associations between white matter microstructure and cognitive abilities in the CALM and NKI-Rockland samples. Therefore, it seems to be the case that, in both typically and atypically (struggling learners) developing children and adolescents, individual white matter tracts make distinct contributions to crystallized and fluid ability, as more than one tract explains variance in the outcomes (gc and gf) above and beyond all other tracts.

Support for neurocognitive reorganization of crystallized and fluid ability in childhood and adolescence
Lastly, to address our fourth and final preregistered hypothesis, we examined whether brain-behavior associations differed across the developmental age range. We hypothesized that the relationship between the white matter tracts and cognitive abilities would decrease across the age range, in support of the differentiation hypothesis. Using a multigroup model, we compared the strength of brain-behavior relationships between younger and older participants to test whether white matter contributions to gc and gf differed in development. Contrary to our prediction, we observed that, in the CALM sample, a freely estimated model, where the brain-behavior relationships were allowed to vary across age groups, did not outperform the constrained model (χ 2 Δ ¼ 12.16, dfΔ ¼ 10, AICΔ ¼ 9, p ¼ .27). This suggested that the contributions of white matter tracts did not vary significantly between age groups when examined using multigroup models.
As before, we estimated a SEM tree model. In contrast to the multigroup model, we observed that multiple white matter tracts did differ in their associations with gc and/or gf. These differences manifested in different ways for gc and gf. For example, the correlations between the cingulum, superior longitudinal fasciculus, and forceps major and gf decreased with increasing age, in line with age differentiation. On the other hand, the forceps major, forceps minor and anterior thalamic radiations demonstrated a more complicated pattern with each tract displaying two age splits. For the first split (around age 8), the regression strength decreased before spiking again around age 11 (Table 3, also see Fuhrmann et al., 2019). Given that all first splits showed a decrease between white matter and cognition, and all second splits revealed an increase compared to the first, this suggests a non-monotonic pattern of brain-behavior reorganization that cannot be fully captured by age differentiation or dedifferentiation (Hartung et al., 2018) but may be in line with theories such as Interactive Specialization (Johnson, 2011), which provides a range of mechanisms which may induce age-varying brain-behavior strengths. One hypothesis we have previously offered that may (partially) explain the nature of the age-varying associations between white matter and cognitive performance is the onset of puberty (Fuhrmann et al., 2019, p. 11) and the associated hormonal changes. Previous work has shown that pubertal processes, including differences and changes in hormones such as testosterone, affect diffusion measures in ways that cannot be explained away by (only) age (Menzies et al., 2015). More work in large samples such as ABCD (Volkow et al., 2018), ideally including longitudinal changes in hormone levels, is needed to establish the robustness of this explanation.
Lastly, we performed the same multigroup analysis for the NKI-Rockland MIMIC model, but it failed to converge or produce an age split, likely due to sparsity of the neural data. Therefore, this analysis could not be used to replicate the cutoff age used for multigroup analyses (11.38 years) based on the median split. Further inspection of the only significantly associated tract, the superior longitudinal fasciculus, revealed the same trend for gc and gf with decreased correlations with increasing age (Table 3). Overall, our findings suggest the need for a neurocognitive account of age differentiation-dedifferentiation/ reorganization from childhood into adolescence.

Summary of findings
In this preregistered analysis, we examined the cognitive architecture as well as the white matter substrates of fluid and crystallized intelligence in children and adolescents in two developmental samples (CALM and NKI-Rockland). Analyses in both samples indicated that individual differences in intelligence were better captured by two separate but highly correlated factors (gc and gf) of cognitive ability as opposed to a single global factor (g). Further analysis suggested that the covariance between these factors decreased slightly from childhood to adolescence, in line with the age differentiation hypothesis of cognitive abilities (Garrett, 1946;Hülür et al., 2011).
We observed multiple, partially independent contributions of specific tracts to individual differences in gc and gf. The clearest associations were observed for the anterior thalamic radiations, cingulum, forceps major, forceps minor, and superior longitudinal fasciculus, all of which have been implicated to play a role in cognitive functioning in childhood and adolescence Navas-S� anchez et al., 2014;Peters et al., 2014;Tamnes et al., 2010;Urger et al., 2015;Vollmer et al., 2017). However, except for the superior longitudinal fasciculus, these tracts were not significant in NKI Rockland sample. A possible explanation for this is the difference in imaging sample size between the cohorts (N ¼ 165 in the CALM sample versus N ¼ 65 in the NKI-Rockland sample). This difference implies sizeable differences in power (73.4 % in CALM versus 36.2 % in NKI, assuming a standardized effect size of 0.2) to identify weaker individual pathways. The most consistent association, observed in both samples, was between the superior longitudinal fasciculus, a region known to be important for language and cognition, which significantly contributed to cognitive ability in both CALM (gf only) and NKI-Rockland (gc and gf). The superior longitudinal fasciculus is a long myelinated bidirectional association fiber pathway that runs from anterior to posterior cortical regions and through the major lobes of each hemisphere (Kamali et al., 2014), and has been associated with memory, attention, language, and executive function in childhood and adolescence in both healthy and atypical populations (Frye et al., 2010;Urger et al., 2015). Therefore, given its widespread links throughout the brain, which include temporal and fronto-parietal regions, it is no surprise that it was found to be significantly related to both gc and gf in our samples.
Together, these results are in line with previous research relating fractional anisotropy (FA) and cognitive ability. For instance, Peters et al., 2014 found that age-related differences in cingulum FA mediated differences in executive functioning. Moreover, white matter changes in the forceps major have been linked to higher performance on working memory tasks . The remaining tracts (superior longitudinal fasciculus and anterior thalamic radiations) have also been positively correlated with verbal and non-verbal cognitive performance in childhood and adolescence (Tamnes et al., 2010;Urger et al., 2015). We also observed more surprising negative pathways, such as between gc and the forceps minor in the CALM sample. However, closer inspection showed that the simple association between forceps minor and gc was positive, suggesting the negative pathway is likely the consequence of the simultaneous inclusion of collinear predictors (see Tu et al., 2008).
Finally, using SEM trees (Brandmaier et al., 2013), we observed that white matter contributions to gc and gf differed between participants of different ages. In CALM, the contributions of the cingulum, superior longitudinal fasciculus, and forceps major weakened with increasing age for gf. For gc, however, the forceps major and forceps minor, and the anterior thalamic radiations exhibited a more complex pattern with each tract providing significantly different effects on crystallized intelligence at two distinct time points in development. In NKI-Rockland, the superior longitudinal fasciculus became less associated with both gc and gf. Considering that decreases in white matter relations to gc and gf occurred before covariance decreases found between gc and gf suggest that differences in white matter development may underlie subsequent individual differences in cognition. In a related project (Fuhrmann et al., 2019, Table 6) we observed age-related differences in associations despite focusing on different cognitive factors (processing speed and working memory).
Overall, our findings align with a neurocognitive interpretation of age differentiation-dedifferentiation hypothesis, which would predict that cognitive abilities and their neural substrates become more differentiated (less correlated) until the onset of maturity, followed by an increase (dedifferentiation) in relation to each other until late adulthood (Hartung et al., 2018). However, we note that the evidence for age differentiation-dedifferentiation was not always robust across analyses methods or samples, suggesting only small effect sizes.

Limitations of the present study
First and foremost, all findings here were observed in cross-sectional samples. To better understand effects such as age differentiation and dedifferentiation, future studies will need to model age-related changes within the same individual. The complexity and expense of collecting such longitudinal data has long precluded such investigations, but new cohorts such as the ABCD sample (Volkow et al., 2018) will allow us to model longitudinal changes in the future. Secondly, since the tasks modelled here were not identical between cohorts, detailed interpretations of similarities and differences between the CALM and NKI-Rockland samples should be treated with caution. Therefore, future research comparing cohorts may want to prioritize cohorts with matching tasks to maximize comparability. Thirdly, although the majority of our findings are similar across our cohorts, some differences were observed, particularly in white matter effects. Moreover, although our findings in the SEM tree analysis of age-related differences in white matter to cognition mapping are both cross-validated as well as corrected for multiple comparisons, they remain inherently exploratory. Although these findings largely generalize across the two cohorts studied here, further work in larger (such as ABCD, Volkow et al., 2018), more age-heterogeneous (e.g. the Developing Human Connectome Project, Makropoulos et al., 2018) is needed to assess the robustness of these findings. This may reflect statistical variability, differences in sample size and associated differences in power, or true differences between samples. Although the samples here are considerably larger than typical in the field (Poldrack et al., 2017), even larger samples are desirable to gain truly precise estimates of the key parameters, especially regarding measures such as DTI in the NKI-Rockland sample which have a non-trivial proportion of missing data. Moreover, the white matter differences observed could also be due to the scans being obtained at different scanner sites, although this is unlikely to have produced considerable differences for all raw images were processed using the same pipeline, and previous work suggests that FA is quite a robust measure in multi-site comparison (see Vollmar et al., 2010).
In terms of analytical frameworks, here we implement a relatively new analytical framework, called SEM trees (Brandmaier et al., 2013), to allow for recursive partitioning of our cohorts into age-demarcated subgroups, to capture developmental heterogeneity. SEM trees have a number of strengths, including considerable flexibility in model specification, implementation in open source software, and the ability to combine measurement and structural model components as well as multiple simultaneous predictors. However, they also have challenges, including potential vulnerability to small fluctuations and overfitting (which may cascade down affecting other partitions), and are certainly not the only choice available to examine model heterogeneity. Alternative analytical strategies, varying in the degree to which they presuppose known group membership or estimate it, include finite mixture models (Zadelaar et al., 2019), Gaussian process structural equation models (e.g. Silva and Gramacy, 2010), latent class and latent profile analysis (Oberski, 2016), general frameworks such as decision trees (McArdle, 2013) and model-based cluster analysis (Fraley and Raftery, 1999), as well as extensions of SEM trees such as SEM forests (Brandmaier et al., 2016). All of these techniques differ in their strengths and weaknesses, ease of implementation, degree of confirmation versus exploration and their flexibility (e.g. can they accommodate latent variables or not). One particularly fruitful avenue for future research is to combine both, using exploratory as well as confirmatory methods to balance discovery and robustness. Here we hopefully illustrate how SEM trees can be one such tool, but would urge the reader to tailor their analytical framework to the question at hand, and be mindful of potential drawbacks. Nonetheless, our view is that SEM trees offer at least one fruitful avenue to formalize hypotheses in developmental cognitive neuroscience which would otherwise often remain mostly verbal.
One concrete concern with SEM trees is the recursive partitioning into subgroups of more modest sample size. Although our two cohorts samples here are considerably larger than typical in the field in terms of their total sample size (Poldrack et al., 2017), the partitioning into subgroups means that some of our parameter estimates are nonetheless based on modest samples, especially regarding measures such as DTI in the NKI-Rockland sample, which have a non-trivial proportion of missing data. Such smaller samples are known to inflate effect sizes (Gelman and Carlin, 2014;Vul et al., 2009). As such, we urge the reader to weigh confidence in the point estimates reported here as a function of sample sizes.
CALM consists of children with referrals for any difficulties related to learning, attention or memory (Holmes et al., 2019). It should be noted that, since CALM is a sample of children and adolescents struggling to learn, and, therefore, 'atypical', a large percentage of this cohort had been assigned a diagnosis (36.12 %). However, controlling for this possible confound through constrained multigroup models showed this did not affect the results of our models, as was seen in previous work using CALM (Fuhrmann et al., 2019). The NKI-Rockland sample, in contrast, is a United States population representative sample (Nooner et al., 2012). Both samples are composed of large cohorts that underwent extensive phenotyping and population-specific representative sampling. Therefore, we argue that our results generalize to 'typical' and 'atypical' samples of neurocognitive development.

Conclusions
The present analyses revealed that crystallized and fluid intelligence factors explained a significant amount of variance in test performance in two large child and adolescent samples. These results were found in both typically and atypically (struggling learners) developing cohorts, demonstrating the generalized notion that cognitive ability is better understood as a two-factor rather than a single-factor phenomenon in childhood and adolescence. The addition of white matter microstructure indicated independent contributions from specific white matter tracts known to be involved in cognitive ability. Moreover, further analyses suggested that the associations between neural and behavioral measures differed during development.
Overall, these results support a neurocognitive age differentiationdedifferentiation hypothesis of cognitive abilities whereby the relation between white matter and cognition become more differentiated (less correlated) in pre-puberty and then dedifferentiate (become more correlated) during early puberty. However, modest subgroup sizes and an inherently exploratory approach such as SEM trees necessitate confirmation in additional, large-scale samples to further quantity the precise developmental differences and changes. Future studies should take this limitation into account when designing experiments attempting to clarify such statements.

Declarations of Competing Interest
None.

Acknowledgements
The Centre for Attention Learning and Memory (CALM) research clinic is based at and supported by funding from the MRC Cognition and Brain Sciences Unit, University of Cambridge. The Principal Investigators are Joni Holmes (Head of CALM), Susan Gathercole (Chair of CALM Management Committee), Duncan Astle, Tom Manly and Rogier Kievit. Data collection is assisted by a team of researchers and PhD students at the CBU that includes Sarah Bishop, Annie Bryant, Sally Butterfield, Fanchea Daily, Laura Forde, Erin Hawkins, Sinead O'Brien, Cliodhna O'Leary, Joseph Rennie, and Mengya Zhang. The authors wish to thank the many professionals working in children's services in the South-East and East of England for their support, and to the children and their families for giving up their time to visit the clinic. We would also like to thank all NKI-Rockland participants and researchers. The NKI-Rockland sample analysed here is the 'Neurodevelopmental Genomics: Trajectories of Complex Phenotypes Distribution Set 1' dataset made available through the dbGaP portal under code phs000607.v3.p2.c1. Data was collected using the GOASSESS and Penn Computerized Neurocognitive Battery (CNB) tools, based on the following publications: Calkins et al. (2014), Calkins et al. (2015) and Gur et al. (2014). I.L.S.-K. is supported by the Cambridge Trust. D.F., J.B. and G.S. Borgeest are supported by the UK Medical Research Council (MRC). J.A. is funded by the Studienstiftung des deutschen Volkes (German Academic Scholarship Foundation). This project also received funding from the European Union's Horizon 2020 research and innovation programme (grant agreement number 732592). R. A. Kievit is supported by the Wellcome Trust (Grant No. 107392/Z/15/Z) and the UK Medical Research Coun-cilSUAG/014 RG91365.