The neural determinants of age-related changes in fluid intelligence: a pre-registered, longitudinal analysis in UK Biobank

Background: Fluid intelligence declines with advancing age, starting in early adulthood. Within-subject declines in fluid intelligence are highly correlated with contemporaneous declines in the ability to live and function independently. To support healthy aging, the mechanisms underlying these declines need to be better understood.
Methods: In this pre-registered analysis, we applied latent growth curve modelling to investigate the neural determinants of longitudinal changes in fluid intelligence across three time points in 185,317 individuals (N=9,719 two waves, N=870 three waves) from the UK Biobank (age range: 39-73 years).
Results: We found a weak but significant effect of cross-sectional age on the mean fluid intelligence score, such that older individuals scored slightly lower. However, the mean longitudinal slope was positive, rather than negative, suggesting improvement across testing occasions. Despite the considerable sample size, the slope variance was non-significant, suggesting no reliable individual differences in change over time. This null-result is likely due to the nature of the cognitive test used. In a subset of individuals, we found that white matter microstructure (N=8839, as indexed by fractional anisotropy) and grey-matter volume (N=9931) in pre-defined regions-of-interest accounted for complementary and unique variance in mean fluid intelligence scores. The strongest effects were such that higher grey matter volume in the frontal pole and greater white matter microstructure in the posterior thalamic radiations were associated with higher fluid intelligence scores.
Conclusions: In a large preregistered analysis, we demonstrate a weak but significant negative association between age and fluid intelligence. However, we did not observe plausible longitudinal patterns, instead observing a weak increase across testing occasions, and no significant individual differences in rates of change, likely due to the suboptimal task design. Finally, we find support for our preregistered expectation that white- and grey matter make separate contributions to individual differences in fluid intelligence beyond age.


Introduction
Fluid intelligence refers to the ability to solve novel problems in the absence of task-specific knowledge, and predicts important outcomes including life expectancy, expected income and work performance (Gottfredson & Deary, 2004). Both cross-sectional (e.g. Hartshorne & Germine, 2015; Kievit et al., 2016) and longitudinal studies (e.g. Ghisletta et al., 2012; Salthouse, 2009; Schaie, 1994) have shown that advancing age is associated with a marked decrease in fluid intelligence performance. Although the starting point of decline is hard to estimate precisely due to cohort effects, selective attrition and enrolment, and retest effects in longitudinal cohorts (e.g. Salthouse et al., 2004), estimates for the onset of decline in fluid intelligence range between the third (e.g. Park et al., 2002; Salthouse, 2009) and sixth decade of life (e.g. Schaie, 1994). Moreover, recent findings have demonstrated that within-subject decline in fluid intelligence is highly correlated with within-subject declines in the ability to live and function independently (Tucker-Drob, 2011). The advent of large-scale neuroimaging studies has shown that neural measures can be strongly predictive of individual differences in fluid intelligence (e.g. Kievit et al., 2014; Ritchie et al., 2015). A better understanding of the neural determinants of changes in fluid intelligence is therefore necessary for improving our understanding of healthy cognitive aging, and may aid the development of early markers for individuals at risk of rapid decline. Recent innovations in multivariate models allow researchers to simultaneously estimate multiple determinants of current ability as well as changes in ability over time (Jacobucci et al., 2018). To estimate these models with precision, large datasets are required. The UK Biobank (Sudlow et al., 2015) is a unique resource for addressing such questions, as it includes both cognitive and neural measures on an unprecedented number of participants.
In our pre-registration, we proposed analyses of UK Biobank cognitive and brain data to a) examine the nature of age-related decline in fluid intelligence and b) model the neural determinants of this decline. The cognitive data consisted of the Biobank's fluid intelligence scores, which were acquired in N=185,317 people (aged 39-73 years) across up to three testing occasions 2-4 years apart (though note that the majority of individuals only completed one (174,728) or two (9,719) assessments). The brain data came from a subset of approximately 10,000 individuals (white matter data, grey matter data) who underwent an MRI scan, and consisted of pre-processed measures of the integrity of major white-matter tracts (N=8839) and volume of grey matter (N=9931) in key brain regions (Miller et al., 2016). Our preregistered analyses entailed two steps: first modelling cognitive data; second including neuroimaging predictors of cognitive abilities.
More specifically, our pre-registered analyses specified the use of latent growth models (Bauer, 2007) to model the mean and slope of age-related changes in fluid intelligence, in order to address the following questions:
1. What is the magnitude of change in fluid intelligence across occasions, as captured by the slope of fluid intelligence?
2. Is there significant variance associated with this slope (i.e. do people differ in their rate of change)?
3. Is the slope linear or non-linear (i.e. does a quadratic latent growth factor capture meaningful variance above a linear factor)?
4. Does the rate of decline (slope) depend on the level (intercept) (i.e. is age-related decline determined by current cognitive status)?
5. Is there evidence for subgroups (growth mixture models) (i.e. do we find evidence of subgroups of individuals, differing in their baseline score or rate of change)?
On the basis of prior studies, we predicted a decline in fluid intelligence across testing occasions. We expected that the decline in fluid intelligence would be more pronounced in older individuals (Kievit et al., 2014), and that there would be significant individual differences in the rate of change (Ghisletta et al., 2012). We had no strong expectations about slope-intercept covariance or the presence of subgroups.
Our second set of hypotheses concerned the neural determinants of individual differences in the slope and intercept of fluid intelligence. To examine this question, we preregistered a series of analyses using Multiple Indicator Multiple Causes (MIMIC) models (Jöreskog & Goldberger, 1975; Kievit et al., 2012) to relate the mean and slope estimates for fluid intelligence to the various brain measures, and asked:
6. What neural properties determine the intercept and slope of fluid intelligence?
7. Are the neural determinants of the mean (general ability) the same as those of the slope (rate of change)?
8. Do multiple region-specific markers of neural health predict unique variance in cognitive level and slope, or does a single global marker suffice?
Based on prior work, we predicted that the mean and/or slope estimates from the latent growth models would depend in particular on complementary effects of frontal grey and white matter (Kievit et al., 2014; Kievit et al., 2016). Moreover, we expected the slope and intercept to have similar, but non-identical, multiple brain determinants, as the mechanisms that govern individual differences need not be identical to those governing within-subject change (cf. Kievit et al., 2013). We also pre-registered exploratory analyses relating possible sub-groups to factors like physical health, but given the insufficient evidence for sub-groups, we did not explore these relationships further.

Participants
The present study sample consisted of a subset of healthy middle to older-aged adults (age range at time of recruitment: 39-73 years) from the UK Biobank cohort (for more information see the Biobank website; Sudlow et al., 2015).

Fluid ability measures
We here analysed the 'fluid intelligence test' included in the UK Biobank cognitive battery. The test is designed to measure "the capacity to solve problems that require logic and reasoning ability, independent of acquired knowledge" (for a complete overview of the 13 individual fluid intelligence items, please see the Biobank manual for the Fluid intelligence test). The test comprised thirteen logic and reasoning questions administered via a computer touchscreen interface, with a two-minute time limit for the test as a whole (see the discussion of the self-paced design under 'Quality of the fluid intelligence measure'). The maximum score was 13 (one point for each correct response). Overall, the test items have a reported Cronbach alpha coefficient of 0.62 (Hagenaars et al., 2016). No participants or observations were excluded from subsequent analyses. Raw data are shown for the fluid intelligence scores at T1 (Figure 1, top), and a random subset of 100 individuals with 3 timepoints (Figure 1, bottom).
Participants who took part in all three waves (N=870) were slightly older, and had slightly higher baseline scores, than those who took part in only one or two waves (See Figure 2,

Fluid intelligence latent growth curve model
To test our pre-registered behavioural analyses, we used a latent growth curve model (LGCM), as shown in Figure 2. We fit the model to the full sample (N=185,317) with three time points, using FIML estimation to account for missingness. The slope factor loadings were constrained to the mean intervals between timepoint 1 and 2 (4.3) and 1 and 3 (6.85). This model fit the data well: χ2(2) = 10.70, p = 0.005; RMSEA = 0.005 [0.002, 0.008]; CFI = 0.999; SRMR = 0.006. Raw parameter estimates are shown in Figure 2. The mean score at T1 was 6.706, with strong evidence of individual differences (intercept variance estimate=2.955, SE=0.116, z=25.39, with a significant decrease in model fit when constraining the intercept variance to 0: χ2(1) = 549.6, p<0.0001). Higher age was associated with slightly lower intercepts (estimate=-0.013, SE=0.001, z=-19.809, see also Figure 1A). However, this effect was very small (standardized path=-0.06), especially compared to previously reported effects (e.g. r=-0.7, Kievit et al., 2014). The pattern of results for the slopes was unexpected. First, the slope intercept (in this specification the mean change per measurement occasion) was strongly positive (estimate=0.208, SE=0.017, z=12.602), suggesting people, on average, improved over time. In other words, there was no evidence of our hypothesized within-subject age-related cognitive decline. There was a weak negative effect of age on slope (est=-0.002, SE=0.0001, z=-7.018), suggesting older individuals improved slightly less than younger adults. Most surprisingly, the slope variance was non-significant and negative (est=-0.001, SE=0.004), suggesting an improper solution.
A likelihood ratio test showed the slope variance could be constrained to 0 without adversely affecting model fit (χ2(1) = 0.63, p = .72). This indicates that there were no reliable indications of individual differences in change over time. Although non-significant slope variance has been reported previously for fluid intelligence over time (Yuan et al., 2018), and improper solutions are common in random effects models (Eager & Roy, 2017), it is nonetheless highly surprising in a sample of this magnitude. To achieve a proper solution we therefore constrained the slope variance and slope-intercept covariance to 0 (for this and future models), and refit the model, which yielded good model fit (χ2(4) = 10.88, p = 0.028; RMSEA = 0.003 [0.001, 0.005]; CFI = 0.999; SRMR = 0.006) and showed negligible changes to other parameter estimates compared to the model without constraints (final parameters shown in Figure 2). In line with our preregistered analysis 1c, we also fit a quadratic growth model by including a quadratic growth factor with linear factor loadings squared, and imposed constraints in order to render the model identifiable (residual variances equality constrained across occasions, and linear slope variance constrained to 0 based on the linear model). However, this model too yielded an improper solution (a negative quadratic slope variance), so it cannot be interpreted with confidence.
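The degrees of freedom reported for the linear models, and the need for extra constraints in the quadratic model, can be illustrated by counting observed moments against free parameters. The sketch below is not the analysis code: it assumes a specification with the three occasion scores plus age as an observed covariate, and the parameter inventory in `linear` is our reading of the model described above rather than something the text spells out.

```python
def n_moments(p):
    """Number of observed means plus unique (co)variances for p variables."""
    return p + p * (p + 1) // 2

# Three fluid intelligence scores plus age as an observed covariate.
moments = n_moments(4)  # 4 means + 10 (co)variances = 14

# Assumed free parameters of the linear growth model with age as predictor.
linear = {
    "intercepts of the two growth factors": 2,
    "regressions of intercept and slope on age": 2,
    "intercept variance, slope variance, covariance": 3,
    "occasion-specific residual variances": 3,
    "age mean and variance": 2,
}
n_linear = sum(linear.values())  # 12

df_linear = moments - n_linear
# Fixing the slope variance and the slope-intercept covariance to 0
# removes two free parameters.
df_constrained = moments - (n_linear - 2)
# An unconstrained quadratic factor would add an intercept, a regression
# on age, a variance and two covariances (five parameters).
df_quadratic = moments - (n_linear + 5)

print(df_linear, df_constrained, df_quadratic)  # 2 4 -3
```

Under these assumptions the counts reproduce the reported χ2(2) and χ2(4), and the negative value for the unconstrained quadratic model shows why it cannot be estimated from three occasions without additional constraints.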
To further examine the unexpected absence of a negative slope or reliable slope variance, we examined a set of alternative, exploratory, analytic approaches and model specifications. First, in the previous analysis we used full information maximum likelihood to analyze all individuals, despite considerable missing data. Comparable results were obtained when fitting the same models to reduced subsets of the data (e.g. only those with at least two (N=9,719), or all three measurements (N=870)). We attempted to address two further plausible explanations for the poor quality of the longitudinal data. Firstly, we fit a second-order latent growth curve model, where fluid intelligence was measured by 13 observed indicators at every time point, imposing equal factor loadings across occasions. Such a model could appropriately weigh individual items based on the degree to which they share variance, possibly improving the purity of the fluid intelligence estimates. Although this model yielded a significant slope variance, other aspects of the model were poor, including factor loadings (mean standardized factor loading for T1=0.14), and model indices such as the CFI (0.133) and SRMR (0.150) suggested poor fit. As substantive patterns were similar to the occasion sum scores (i.e. positive slope intercept), we will continue with the first-order growth model instead. In a final exploratory analysis, we reran the basic growth model with every individual item. This yielded qualitatively very similar results, with positive slopes for all items and non-significant slope variances for all but one item (item 5). Closer inspection of item 5 suggested only a marginal, uncorrected benefit of freely estimating the slope variance (χ2(1) = 8.1, p = .004), combined with a non-significant slope intercept, and a BIC favouring the constrained slope model, together suggesting insufficient evidence to proceed with this post hoc item selection instead of the sum score.
One likely explanation for the increase across testing occasions is the presence of practice effects (e.g. Salthouse, 2010). To address this explanation, we fit another exploratory model including an additional growth factor with factor loadings constrained to 0, 1 and 1 for the three time points. This so-called 'boost' factor (Hoffman et al., 2012) captures the hypothesis that test performance will show an improvement between the first and second testing occasions that is purely a practice effect. The inclusion of the boost factor rendered the slope intercept non-significant, which is compatible with the notion that the gains are most likely practice gains. However, like the quadratic model, such a more complex model is only identified by imposing a range of constraints (here including constraining the boost factor variance to 0). Moreover, despite these constraints this model yielded an improper solution and should thus be interpreted with caution. In a final exploratory analysis, we switched from an occasion-specific approach (T1, T2, T3) to an age-specific approach (scores at a given age). Although this approach yielded high proportions of missing data (as every individual will have missing data for most ages), it has been successfully applied to study cognitive aging (Ghisletta & Lindenberger, 2003) and can allow for more convenient decomposition of retest effects. However, this approach too failed to converge. In summary, we conclude that a meaningful longitudinal signal does not exist in the repeated measures fluid intelligence task, as currently implemented in Biobank.
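How a boost factor with loadings 0, 1 and 1 separates a one-off practice gain from gradual change can be sketched at the level of the occasion means. The example below is a simplified illustration, not the fitted model: it ignores the age covariate and all variance components, the function name `decompose_means` is hypothetical, and the occasion means are made-up values. Only the time loadings (0, 4.3, 6.85) come from the text.

```python
def decompose_means(means, times=(0.0, 4.3, 6.85)):
    """Solve mean_t = intercept + times[t] * slope + boost_loading[t] * boost
    for three occasions, with boost loadings fixed to (0, 1, 1).

    Three means and three unknowns: the mean structure is just-identified,
    so this shows the decomposition only and says nothing about model fit.
    """
    m1, m2, m3 = means
    t1, t2, t3 = times
    # The boost is constant from T2 onwards, so only T2 -> T3 informs the slope.
    slope = (m3 - m2) / (t3 - t2)
    intercept = m1 - t1 * slope
    boost = m2 - intercept - t2 * slope
    return intercept, slope, boost

# Hypothetical means: a jump after the first test, then a plateau.
print(decompose_means((6.0, 6.4, 6.4)))
# Hypothetical means: the same jump followed by a slight decline.
print(decompose_means((6.0, 6.4, 6.3)))
```

In the first case the whole gain is attributed to the boost and the slope is zero. In the second the slope is negative even though every later mean is above the first, which is the kind of pattern a pure linear model would misread as improvement.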
Finally, in line with our preregistered analyses (1e), we fit a series of growth mixture models to examine evidence for the presence of subgroups. For this analysis, we used Mplus (version 7.4; Muthén & Muthén, 2005). We fit 1 to 5 classes and examined the sample size adjusted BIC (SA-BIC) to decide on the best model. As shown in Figure 3, the SA-BIC was lowest for the four-group solution. However, further inspection of this solution suggested that evidence for subgroups was weak. Firstly, the 'best' solution of 4 subgroups had poor entropy (0.61, Figure 3 right panel), well below common guidelines of 0.8. This suggests subgroups were not well separated. More importantly, inspection of the slopes and intercepts showed that the four subgroups were effectively subdividing the normal distribution of the whole population into subgroups (i.e. two larger groups with an intercept/slope close to the population mean, two smaller groups with intercept/slopes closer to the upper and lower 'edges' of the population distribution). This pattern of results is common in growth mixture modelling (Bauer, 2007, p. 768, Figure 3). Therefore, we conclude that there is no compelling evidence for latent subgroups with different longitudinal patterns. We now turn to our examination of the neural determinants of fluid intelligence.
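For readers unfamiliar with the two criteria used above, the sketch below shows how they are commonly defined. It is an illustration under stated assumptions, not the Mplus implementation: the function names are hypothetical, the formulas are the standard relative entropy and sample-size adjusted BIC definitions, and the posterior probabilities are invented.

```python
import math

def relative_entropy(posteriors):
    """Relative entropy in [0, 1] from per-person posterior class probabilities.

    1 means every person is assigned to one class with certainty;
    0 means the classes are indistinguishable.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    uncertainty = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # treat 0 * log(0) as 0
                uncertainty -= p * math.log(p)
    return 1.0 - uncertainty / (n * math.log(k))

def sample_size_adjusted_bic(log_likelihood, n_params, n):
    """BIC with the sample size n replaced by (n + 2) / 24 in the penalty."""
    return -2.0 * log_likelihood + n_params * math.log((n + 2) / 24.0)

# Invented posteriors for three people and two classes.
well_separated = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03]]
poorly_separated = [[0.60, 0.40], [0.45, 0.55], [0.50, 0.50]]
print(relative_entropy(well_separated))
print(relative_entropy(poorly_separated))
```

A low value, as in the second example, indicates that most individuals have substantial probability of belonging to more than one class, which is the situation the entropy of 0.61 points to.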
White matter determinants of fluid intelligence
Next, in line with our second set of preregistered analyses, we fit a LGM-MIMIC model, where both the intercept and slope were regressed simultaneously on neural predictors. First, we focus on white matter. We started by testing our preregistered prediction whether the scores across tracts can be reduced to a single factor, which would suggest that a single global factor suffices (preregistration 2c), or whether individual ROIs are required. We observed that a model with a single white matter latent variable measured by all 15 tracts fit poorly (χ2(90) = 8023.57, p < 0.001; RMSEA = 0.100 [0.099, 0.101]; CFI = 0.957; SRMR = 0.061), replicating previous findings (Kievit et al., 2016; Lövdén et al., 2013), and suggesting further analyses should include individual tracts. In all further models, age was included as a covariate of both intercept and slope, estimation was conducted on the full sample using FIML, and all tracts were allowed to co-vary with each other, as well as with age (not shown in figures for visual clarity).
First, the full LGM-MIMIC model fit the data well (χ2(19) = 19.06, p = 0.453; RMSEA = 0.0001 [0.000, 0.002]; CFI = 1.000; SRMR = 0.004). In this model, the intercept of fluid ability was significantly associated with FA in five tracts, as shown in Figure 4. Jointly the tracts and age explained 2.1% of the variance in fluid intelligence, equivalent to a standardized effect of r=0.145, which is small by individual differences standards (Gignac & Szodorai, 2016). Higher FA predicted higher fluid ability in all significant tracts apart from the forceps major and the inferior fronto-occipital fasciculus. Contrary to our expectation and previous findings, the forceps minor was not the strongest predictor of the fluid intelligence intercept (Kievit et al., 2014, Figure 4). None of the white matter tracts predicted slope variance: a likelihood ratio test showed that the regression paths of the slope on the individual tracts could be constrained to 0 without adversely affecting model fit (χ2(15) = 17.97, p = .26). Next, we examined grey matter volume correlates of the fluid intelligence intercept.
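The fit index and effect size quoted in this paragraph can be checked with two short formulas. The sketch below assumes the common RMSEA definition based on N - 1 (some software uses N instead, which makes no visible difference at this sample size) and assumes that the full sample of 185,317 enters the calculation because FIML was used. The function name `rmsea` is ours.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of the RMSEA from a chi-square statistic."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# White matter LGM-MIMIC model: chi2(19) = 19.06, full sample N = 185,317.
print(round(rmsea(19.06, 19, 185317), 4))  # 0.0001

# Converting 2.1% explained variance into a correlation-sized effect.
print(round(math.sqrt(0.021), 3))  # 0.145
```

Because the chi-square statistic barely exceeds its degrees of freedom, the RMSEA is close to zero regardless of sample size. The square root of the explained variance recovers the r=0.145 reported above.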
Grey matter determinants of fluid intelligence
Next, we fit the same model using only estimates of grey matter volume. First, we again replicated the poor fit of a single factor model, suggesting that a global grey matter factor does not accurately reflect the population covariance structure (χ2(35) = 7208.61, p < 0.001; RMSEA = 0.144 [0.141, 0.146]; CFI = 0.783; SRMR = 0.071). Next, we estimated a joint LGM-MIMIC model as above, which showed good model fit (χ2(14) = 15.01, p = 0.377; RMSEA = 0.001 [0.000, 0.002]; CFI = 1.000; SRMR = 0.003). The joint effect size of 4.5% was considerably larger than for white matter (albeit still modest). Inspection of key parameters (see Figure 5) showed that the strongest determinant of the fluid intelligence intercept was the frontal pole (r=.16), replicating our previous finding in a separate cohort (Kievit et al., 2014, Figure 4). Two additional regions, namely the angular gyrus and the inferior frontal gyrus, explained further variance in the fluid intelligence intercept. No regions predicted slope variance: a likelihood ratio test showed the regression paths of the slope on the individual regions could be constrained to 0 without adversely affecting model fit (χ2(10) = 12.55, p = .24).
Joint grey matter and white matter determinants of fluid intelligence
Finally, we examined whether the grey and white matter provide complementary information about fluid intelligence, in line with our preregistered prediction. To do so, we refit the above MIMIC model, including only those white and grey matter regions that were nominally significant in the modality-specific analyses. Again, model fit was good (χ2(14) = 16.11, p = 0.186; RMSEA = 0.001 [0.000, 0.003]; CFI = 1.000; SRMR = 0.004), with a joint effect size of 5.2% (intercept) variance explained. Inspection of the parameter estimates supported our a priori hypothesis regarding the intercept: grey matter volume and white matter microstructure made largely complementary contributions to individual differences in fluid intelligence. The two strongest paths were (again) grey matter in the frontal pole (r=0.16) and white matter in the posterior thalamic radiations (r=0.12). Together, these findings support our preregistered hypotheses that white matter and grey matter would provide partly complementary effects. As before, no regions or tracts predicted slope variance (χ2(10) = 10.99, p = .35). As there was no meaningful slope variance, we could not address our preregistered expectation that neural determinants would be similar but distinct for intercept and slope. Contrary to our a priori hypothesis, frontal white matter was not the strongest determinant of individual differences in fluid intelligence. Instead, in the full model, the posterior thalamic radiations, a posterior tract linking the occipital lobe to the thalamus, proved most strongly predictive (Figure 6).

Summary of main findings
We conducted a preregistered examination of longitudinal changes in fluid intelligence in an N=185,317 subset of the Biobank cohort (Sudlow et al., 2015). We observed a negative effect of age on the fluid intelligence intercept, consistent with other cross-sectional studies, but smaller than normally found (cf. Kievit et al., 2014). However, contrary to our expectations, our analysis of the rate of change of fluid intelligence revealed a positive rather than negative slope. In other words, rather than show decline, performance on the Biobank fluid intelligence task improved across test occasions, likely due to retest and practice effects. We also found a small negative effect of initial age on the rate of change, i.e. older people showed less improvement across time points. Convergence problems (likely due to the limited number of waves) meant that we were unable to infer whether the rates of change were best captured by a linear or quadratic model. No compelling evidence was observed for the existence of subgroups.
In a second set of analyses, we examined the neural determinants of individual differences in fluid intelligence. The absence of slope variance precluded meaningful modelling of individual differences in rate of change. In line with our expectations, we observed seven distinct and complementary contributions from individual white matter tracts. However, the effect sizes were small, and contrary to our expectations and previous work (Kievit et al., 2014; Kievit et al., 2016), frontal white matter tracts were not among the strongest determinants of fluid abilities. The posterior thalamic radiations appeared as the strongest white matter predictor in both the white matter only model, as well as the combined grey matter/white matter model. The posterior thalamic radiations connect thalamic systems to both parietal and early visual systems. A tentative interpretation could be that parietal systems are often recruited in demanding tasks (e.g. Fedorenko et al., 2013). However, the small magnitude of the effect size, as well as the relative dearth of previous findings relating the PTR to fluid reasoning (although some weak effects have been reported, e.g. Navas-Sánchez et al., 2014), together suggest caution in interpreting this finding with confidence. Focusing on grey matter, we observed a strong, positive association between grey matter volume in the frontal pole and

Quality of the fluid intelligence measure
A plausible explanation for both the disparity in the size of cross-sectional age effects on fluid intelligence intercept (e.g. r=-0.04 in Figure 1, versus r=-0.55 in comparable samples), as well as the absence of expected slope effects, most likely lies in the fluid intelligence task itself. First and foremost, not all items are representative of classic fluid intelligence items. For instance, item two asks 'which number is the largest?'. This item might be best characterized as relying on crystallized knowledge, and would not usually be considered a component of fluid intelligence. It would perhaps be more appropriate in a dementia-screening task in elderly samples than in a fluid intelligence test administered in a population-representative sample. This interpretation is supported by a striking ceiling effect on this item (99.06% accuracy). Similar ceiling effects were observed for other items (94.9% for the first item). However, other items (e.g. item 3) rely on verbal analogies, which likely do require a measure of abstract reasoning abilities. Taken together, individual differences in the mean (intercept) scores likely reflect fluid abilities to some degree, but more weakly so than traditional, standardized tests. Previous work on the Biobank fluid intelligence task has characterized the nature of the test as 'verbal-numerical reasoning' (Lyall et al., 2016), which is a more apt description than 'fluid intelligence', although arguably does not cover items such as the example above. As for the longitudinal component, the relative memorability of certain items (such as the 'largest number' question) may help explain the absence of slope variance over time, as people are likely to provide the same answers on repeat testing occasions. Moreover, the self-paced nature of the task means that item 13 was only attempted by 4,350 out of 165,097 individuals at time point 1. Out of these participants, only 844 got the item correct, giving an overall accuracy rate of 0.5%. In short, the fluid intelligence task as currently implemented shows poor construct validity, and is vulnerable to ceiling and floor effects. Moreover, the self-paced nature (the total score reflects the number of correct items given within a 2-minute window) may exacerbate retest effects, given that remembering previous answers (right or wrong) and increased familiarity with the testing environment might lead to more items being attempted. Together, these properties may explain the absence of hypothesized longitudinal effects. Recently, Biobank has started acquiring a new fluid intelligence 'matrix pattern completion' task which more closely aligns with traditional psychometric tests of fluid intelligence. We expect that this novel subtest will show more robust age and neuroimaging effects.

Conclusion
Many studies, particularly in neuroimaging, are underpowered (Button et al., 2013). The field's effort to collect large, collaborative datasets is an important response to this scientific challenge. Biobank offers a uniquely rich, publicly-available dataset that has revolutionized the scope of large scale shared projects, and already led to numerous insights into the genetic, environmental and neural markers of healthy aging (e.g. Hagenaars et al., 2016;Miller et al., 2016;Muñoz et al., 2016). However, our current analyses of the Biobank cognitive data demonstrate that the size of the dataset cannot always overcome suboptimal data quality (Kolossa & Kopp, 2018).
Longitudinal measurements may be especially vulnerable to practical constraints in large cohorts (e.g. short administration time, ease of use of the test etc.). Further improvements in the quality of cognitive data and additional waves of longitudinal measures will likely allow for more conclusive answers about the neural determinants of age-related changes in fluid intelligence, and facilitate understanding of lifespan changes in cognitive function.

Data availability
Our analysis is based on data from the Biobank cohort, and as such cannot be attached in the raw form without violating contractual agreements. Our analyses can be reproduced (or improved) by the following three steps:

This is a well-conducted set of preregistered analyses, addressing important research questions, and using an impressive longitudinal data set. While the chosen latent growth model generally is standard for analyzing questions pertaining average longitudinal change (and individual differences therein) with few measurement occasions, there are some aspects that should or could be done somewhat differently, or additionally, in my view.
1. I would refrain from attempts to model quadratic change with (at maximum) only three measurement occasions. As this was part of the pre-registration, it should be mentioned, but maybe together with a qualification that a quadratic change model is not generally identified with 3 time points and identification could only be achieved using constraints that were not specified a priori (e.g., constraining the linear slope variance to zero).
2. Similarly, I find the use of a "boost" factor problematic, as the implied assumption of retest effects only taking place between T1 and T2 is difficult to defend. Also, this model has the same identification problem as a quadratic model.
3. Power to detect individual differences in change may be larger if individual differences in the true timing of the measurement occasions would be taken into account in the models (instead of using the mean intervals). In an SEM framework, this is possible using Mplus and the TSCORES option. I would encourage trying out this modeling option and would consider it a minor deviation from the pre-registration, as it would keep with the general modeling strategy and just mean using all available information to get most precise and powerful parameter estimates.
4. In an exploratory manner, I would encourage to pursue the attempt to use measurement models for the fluid intelligence construct a bit more and reduce the set of items to those that "are representative of classic fluid intelligence items". This may improve model fit and help model convergence, and may also increase power to detect variance in slopes. Based on a decent and time-invariant measurement model, latent change score models could also allow to model change from T1 to T2 and change from T2 to T3 separately (capturing potential quadratic change or differential retest effects). Related to this point, I would like to see a brief but complete description of all fluid intelligence items in the Methods section.
5. As the power to detect significant variance in slopes and the power to detect effects of certain moderator variables on the slope may differ, I do not think that it is precluded to test such moderation effects just because the variance in slopes turns out not to be significant. As the moderator effects pertain to pre-registered a-priori hypotheses, I would go ahead and test and report these effects (preferably using likelihood ratio tests based on model comparisons), even though the variance of slopes may not be significant.
6. Generally, the variance of the slope factor should not be evaluated with a z-test, but rather with likelihood ratio tests, using adjusted critical values (see Stoel et al., 2006). Implicitly, this is already done by reporting the chi2 values for models with and without the slope variance. I would fully replace the reported z and p values with the more appropriate LR test, however.

Minor points
The term "slope intercept" may be confusing. Maybe use "the regression intercept of a model with the slope factor as dependent variable" or some other explanation that helps to distinguish between "intercept" as growth factor and "intercept" as regression parameter in the MIMIC models.

If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 25 May 2018
Rogier Kievit, UK
'This is a well-conducted set of preregistered analyses, addressing important research questions, and using an impressive longitudinal data set. While the chosen latent growth model generally is standard for analyzing questions pertaining average longitudinal change (and individual differences therein) with few measurement occasions, there are some aspects that should or could be done somewhat differently, or additionally, in my view.'
We thank the reviewer for their comments, which have served to strengthen the paper.
1. I would refrain from attempts to model quadratic change with (at maximum) only three measurement occasions. As this was part of the pre-registration, it should be mentioned, but maybe together with a qualification that a quadratic change model is not generally identified with 3 time points and identification could only be achieved using constraints that were not specified a priori (e.g., constraining the linear slope variance to zero).
We agree that we should have mentioned the necessity of constraints in the pre-registration. We had hoped to be able to model decline as a function of age rather than testing occasion, which would have allowed more flexibility in this regard. We did mention this in passing in the original manuscript as follows: 'and imposed constraints in order to render the model identifiable (residual variances equality constrained across occasions, and linear slope variance constrained to 0 based on the linear model).' We now also include an explication of the need for constraints below, in the context of the boost factor.
2. Similarly, I find the use of a "boost" factor problematic, as the implied assumption of retest effects only taking place between T1 and T2 is difficult to defend. Also, this model has the same identification problem as a quadratic model.
We agree that we should have spelled out the challenges with estimating an additional growth factor with so few timepoints more clearly in the pre-registration, and have now explained these limitations in the revision. Conceptually, we find the boost factor intuitively plausible. The change between wave 1 and 2 would entail the familiarity with the setting, using the iPad, the time constraints etcetera. This would be much less strong between wave 2 and 3. This type of boost factor was included as a core retest mechanism in a simulation-based paper on modelling retest effects (Hoffman, L., Hofer, S. M., & Sliwinski, M. J, 2011) and was shown to have a less negative effect on the estimation of other model parameters than incremental practice effects.
This so-called 'boost' factor (Hoffman, Hofer, & Sliwinski, 2012) captures the hypothesis that test performance will show an improvement between the first and second testing occasions that is purely a practice effect. The inclusion of the boost factor rendered the slope intercept non-significant, which is compatible with the notion that the gains are most likely practice gains. However, like the quadratic model, such a more complex model is only identified by imposing a range of constraints (here including constraining the boost factor variance to 0). Moreover, despite these constraints this model yielded an improper solution and should thus be interpreted with caution.
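The identification problem raised for both the quadratic and the boost model can be illustrated with the standard counting rule. The following is a minimal sketch, not the authors' code: the function name is hypothetical, and it assumes a growth model with fixed loadings, freely estimated factor means and (co)variances, and one free residual variance per occasion. A non-negative count is necessary but not sufficient for identification.

```python
def growth_model_df(n_occasions: int, n_factors: int) -> int:
    """Counting-rule degrees of freedom for an unconstrained latent growth model.

    Hypothetical helper for illustration only.
    """
    # Observed moments: one mean per occasion plus the unique (co)variances.
    n_moments = n_occasions + n_occasions * (n_occasions + 1) // 2
    # Free parameters: factor means, factor (co)variances,
    # and one residual variance per occasion.
    n_params = n_factors + n_factors * (n_factors + 1) // 2 + n_occasions
    return n_moments - n_params


# Intercept + linear slope with three waves.
linear_df = growth_model_df(n_occasions=3, n_factors=2)
# Adding a third growth factor (quadratic or boost) with three waves.
three_factor_df = growth_model_df(n_occasions=3, n_factors=3)
print(linear_df, three_factor_df)
```

With three waves the linear model leaves one degree of freedom, whereas any third growth factor asks for more parameters than there are observed moments, which is why equality or zero constraints are needed before such a model can be estimated.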
3. Power to detect individual differences in change may be larger if individual differences in the true timing of the measurement occasions would be taken into account in the models (instead of using the mean intervals). In an SEM framework, this is possible using Mplus and the TSCORES option. I would encourage trying out this modeling option and would consider it a minor deviation from the pre-registration, as it would keep with the general modeling strategy and just mean using all available information to get most precise and powerful parameter estimates.

We agree this is a principled and elegant manner to model these effects. However, despite increasing the EM iterations well beyond the Mplus default, this model did not converge. Nonetheless we agree it is the more principled choice so have now included it in the manuscript as follows:
Here we use the mean age interval between waves to guide the fixed factor loadings in the growth model. A more precise modelling approach is to use the individual ages at each timepoint. This is known as a 'definition variable' approach (Mehta & Neale), which uses all the information present in the data in a richer manner and can be implemented in either Mplus or OpenMx (but not yet lavaan). However, in the present dataset this approach did not converge.
4. In an exploratory manner, I would encourage to pursue the attempt to use measurement models for the fluid intelligence construct a bit more and reduce the set of items to those that "are representative of classic fluid intelligence items". This may improve model fit and help model convergence, and may also increase power to detect variance in slopes.
In our previous manuscript we fit a full second order latent growth curve model which would allow individual items to contribute more, or less, to the latent factor. As reported this did not yield acceptable model fit nor meaningfully changed our findings. Moreover, we have now refit the models with each individual item, again yielding virtually identical results. In short we believe no meaningful signal can be extracted from these items without exhaustive data-driven subselection that may lead to overfitting.
In a final exploratory analysis, we reran the basic growth model with every individual item. This yielded qualitatively very similar results, with positive slopes for all items and non-significant slopes for all but one item (item 5). Closer inspection of item 5 suggested only a marginal, uncorrected benefit of freely estimating the slope variance, χ²(1) = 8.1, p = .004, combined with a non-significant slope intercept, and a BIC favouring the constrained slope model, together suggesting insufficient evidence to proceed with this post hoc item selection instead of the sumscore.
Based on a decent and time-invariant measurement model, latent change score models could also allow to model change from T1 to T2 and change from T2 to T3 separately (capturing potential quadratic change or differential retest effects).
We agree in principle, but in practice these desiderata of the model fit and item properties seem beyond the data quality present in Biobank.
Related to this point, I would like to see a brief but complete description of all fluid intelligence items in the Methods section.
We included a link to the complete set of questionnaire items which is available online here: https://biobank.ctsu.ox.ac.uk/crystal/docs/Fluidintelligence.pdf
We have modified the wording in the manuscript to be more explicit (as it currently only states 'the manual'):
for a complete overview of the 13 individual fluid intelligence items, please see http://biobank.ctsu.ox.ac.uk/crystal/docs/Fluidintelligence.pdf
5. As the power to detect significant variance in slopes and the power to detect effects of certain moderator variables on the slope may differ, I do not think that it is precluded to test such moderation effects just because the variance in slopes turns out not to be significant. As the moderator effects pertain to pre-registered a-priori hypotheses, I would go ahead and test and report these effects (preferably using likelihood ratio tests based on model comparisons), even though the variance of slopes may not be significant.
We agree that the significance of the slope variance in isolation need not be the guiding principle for the analysis of moderators. Note that all continuous predictors of slope variance in our models are still included even in the models where slope (residual) variance is constrained to 0; in other words, all continuous neural moderators of slope were included, but proved non-significant. We now include an LRT for each relevant model, comparing one where all neural predictors of slope are freely estimated versus a model where they are constrained to 0, for white matter, grey matter and the combined model. In all cases the constrained model is preferred.
None of the white matter tracts predicted slope variance. A likelihood ratio test showed that the regression paths of the slope on the individual tracts could be constrained to 0 without adversely affecting model fit, χ²(15) = 17.97, p = .26.
No regions predicted slope variance. A likelihood ratio test showed the regression paths of the slope on the individual regions could be constrained to 0 without adversely affecting model fit, χ²(10) = 12.55, p = .24.
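For readers who want to check how a likelihood ratio statistic maps onto a p-value, the sketch below evaluates the upper tail of a chi-square distribution using only the Python standard library. It is an illustration under stated assumptions rather than the authors' analysis code: `chi2_sf` is a hypothetical helper based on the textbook finite-series expressions for integer degrees of freedom, and small deviations from the reported p-values can arise from rounding of the reported test statistics.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series in sqrt(x).
    root = math.sqrt(x)
    normal_pdf = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * normal_pdf * total


# Test statistics and degrees of freedom as reported in the text above.
p_tracts = chi2_sf(17.97, 15)
p_regions = chi2_sf(12.55, 10)
print(round(p_tracts, 3), round(p_regions, 3))
```

Both values sit well above conventional significance thresholds, in line with the conclusion that the constrained models are preferred.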
6. Generally, the variance of the slope factor should not be evaluated with z-test, but also with likelihood ratio tests, using adjusted critical values (see Stoel et al., 2006). Implicitly, this is already done by reporting the chi2 values for models with and without the slope variance. I would fully replace the reported z and p values with the more appropriate LR test, however.
We agree entirely. We reported the Wald test for reasons of greater familiarity to most readers, as well as slightly fewer issues of model convergence due to variance constraints, but on reflection we agree that the LR test is more appropriate and have updated this for all variance parameters.
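The adjusted critical values referred to in point 6 arise because a variance cannot be negative, so the null value of 0 lies on the boundary of the parameter space and the LR statistic follows a mixture of chi-square distributions. The sketch below shows the commonly used 50:50 mixtures for a single variance; the function names are hypothetical, and whether the covariance with the intercept is freed together with the slope variance is an assumption the analyst has to match to the actual model comparison.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper tail of chi-square with 1 df."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0


def chi2_sf_df2(x: float) -> float:
    """Upper tail of chi-square with 2 df."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


def boundary_lrt_p(lr_stat: float, with_covariance: bool) -> float:
    """Mixture p-value for testing one variance against 0 (illustrative helper)."""
    if lr_stat <= 0:
        return 1.0
    if with_covariance:
        # Variance and its covariance with the intercept freed together:
        # 50:50 mixture of chi-square(1) and chi-square(2).
        return 0.5 * chi2_sf_df1(lr_stat) + 0.5 * chi2_sf_df2(lr_stat)
    # Variance only: 50:50 mixture of a point mass at 0 and chi-square(1).
    return 0.5 * chi2_sf_df1(lr_stat)


# Example with the 1-df statistic reported for item 5 in the response (8.1).
naive_p = chi2_sf_df1(8.1)
adjusted_p = boundary_lrt_p(8.1, with_covariance=False)
print(naive_p, adjusted_p)
```

In the variance-only case the adjusted p-value is simply half the naive one, so an unadjusted LR test is conservative with respect to slope variance.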

Minor points
The term "slope intercept" may be confusing. Maybe use "the regression intercept of a model with the slope factor as dependent variable" or some other explanation that helps to distinguish between "intercept" as growth factor and "intercept" as regression parameter in the MIMIC models.
We agree 'slope intercept' can be confusing. We have clarified the first mention of this term in parentheses:
First, the slope intercept (in this specification, the mean change per measurement occasion)
Competing Interests: No competing interests were disclosed.
The study involves a pre-registered analysis, is hypothesis-driven, and seems to involve sound analyses of the data and the text is clear. Nevertheless, we have a couple of concerns in regard to the presentation and data analyses.
Introduction: The authors state that "Both cross-sectional and longitudinal studies have shown that advancing age is associated with a marked decrease in fluid intelligence starting in the third or fourth decade of life", citing work by Hartshorne and Germine (2015) and Schaie (1994). However, longitudinal data in the latter article clearly indicate a higher age of onset of mean-level decline than this, for each of the Primary Mental Abilities (around age 60; see Figure 2), including Inductive reasoning, a core facet of fluid intelligence (narrowly defined). The study by Hartshorne and Germine involved cross-sectional data. Thus, whereas cross-sectional data typically indicate decline in the third or fourth decade of life (or earlier, see Park et al., 2002), actual decline at the mean level may appear later, at least as judged by the data in Schaie (1994). This is relevant to note as, from that perspective, quite a few participants in the UK biobank study (range 39-73 years) might be expected to be rather stationary in regard to mean-level fluid intelligence over a relatively short test-retest interval.
Results: Regarding attrition, did the participants who participated in 2 or 3 test waves differ from those who dropped out after the first occasion? Describing drop-out mechanisms with regards to age, gender, fluid intelligence at baseline, and potentially socio-economic factors (if such are available) may help to clarify why the average slope was positive.
Discussion: It would be informative if the authors could comment on the validity of their findings regarding gray and white matter predictors of level of fluid intelligence. Despite challenges with the task validity and psychometric properties, are the significant relationships that were observed plausible (albeit smaller in magnitude than expected)? For instance, is it reasonable that the posterior thalamic radiations had the strongest association with fluid intelligence (despite contradicting the authors' own previous work)? Were the relative contributions of gray and white matter variables in line with previous literature?
The direction of the effect between two of the white matter tracts and fluid intelligence appears opposite to expectations (the forceps major and the inferior fronto-occipital fasciculus).

Minor comments:
Methods section: Please specify if the 27 white matter tracts were the total number of tracts available for this data set. Methods section: please state whether the gray matter volumes were raw values or corrected for intracranial volume (which could have made sense given the aging-related original hypotheses)?

Referee Expertise: cognitive aging research
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 25 May 2018
Rogier Kievit, UK
The study involves a pre-registered analysis, is hypothesis-driven, and seems to involve sound analyses of the data and the text is clear. Nevertheless, we have a couple of concerns in regard to the presentation and data analyses.
We thank the reviewers for their comments, which have served to strengthen the paper.
Introduction: The authors state that "Both cross-sectional and longitudinal studies have shown that advancing age is associated with a marked decrease in fluid intelligence starting in the third or fourth decade of life", citing work by Hartshorne and Germine (2015) and Schaie (1994). However, longitudinal data in the latter article clearly indicate a higher age of onset of mean-level decline than this, for each of the Primary Mental Abilities (around age 60; see Figure 2), including Inductive reasoning, a core facet of fluid intelligence (narrowly defined). The study by Hartshorne and Germine involved cross-sectional data. Thus, whereas cross-sectional data typically indicate decline in the third or fourth decade of life (or earlier, see Park et al., 2002), actual decline at the mean level may appear later, at least as judged by the data in Schaie (1994). This is relevant to note as, from that perspective, quite a few participants in the UK biobank study (range 39-73 years) might be expected to be rather stationary in regard to mean-level fluid intelligence over a relatively short test-retest interval.
We agree that we oversimplified the state of knowledge, and did not use the optimal references to support our claim. However, it is also likely the case that longitudinal data might underestimate within-subject decline to some degree due to retest or practice effects (e.g. Salthouse, Schroeder, & Ferrer, 2004), with Salthouse (2009) estimating decline to begin in the third or fourth decade. Regardless of the precise decade, we would suggest that one would not expect a slope increase or a non-significant slope variance in a sample of this age range. The rephrased section reads as follows:
Results: Regarding attrition, did the participants who participated in 2 or 3 test waves differ from those who dropped out after the first occasion? Describing drop-out mechanisms with regards to age, gender, fluid intelligence at baseline, and potentially socio-economic factors (if such are available) may help to clarify why the average slope was positive.
These participants differed slightly: those who participated in all three waves were about 6 months older on average, and had slightly higher fluid intelligence scores at T1 (see new plots in Figure 2). However, to the extent that these characteristics explain attrition (i.e. Missing At Random), our approach of full information maximum likelihood should adjust appropriately. This is confirmed by the highly similar results (non-significant slope variance, marginally positive slope) when we run the model only in those individuals who have data in all three waves. The conjunction of the results from the boost model, the task characteristics, the slight negative effect of age on slope and previous work on retest effects in longitudinal aging together strongly suggests that the driving force behind the positive slope is small but significant retest effects due to increased familiarity with the task and setting, rather than a more substantively meaningful signal. We now include the below paragraphs as well as two new figures:
Participants who took part in all three waves (N=870) were slightly older, and had slightly higher baseline scores, than those who took part in only one or two waves (see Figure 2A and 2B).
The absence of Forceps Minor as a strong predictor and the presence of PTR as a predictor are contrary to previous findings. We now discuss this tentatively as follows:
The posterior thalamic radiations appeared as the strongest white matter predictor in both the white matter only model and the combined grey matter/white matter model. The posterior thalamic radiations connect thalamic systems to both parietal and early visual systems. A tentative interpretation could be that parietal systems are often recruited in demanding tasks (e.g. Fedorenko et al., 2013).
However, the small magnitude of the effect size, as well as the relative dearth of previous findings relating the PTR to fluid reasoning (although some weak effects have been reported, e.g. Navas-Sánchez et al. 2014), together suggest caution in interpreting this finding with confidence.
The direction of the effect between two of the white matter tracts and fluid intelligence appears opposite to expectations (the forceps major and the inferior fronto-occipital fasciculus).
Methods section: please state whether the gray matter volumes were raw values or corrected for intracranial volume (which could have made sense given the aging-related original hypotheses)?
These were the raw values. Given various lines of evidence that suggest that larger overall brain volume is associated with intelligence (e.g. Gignac & Bates, 2017), we did not want to adjust in this manner (this is consistent with our previous approach, e.g. Kievit et al., 2014). Various lines of evidence suggest that total brain volume change may be a leading indicator of declines in cognitive performance (e.g. Grimm, An, McArdle, Zonderman, & Resnick, 2012), which would suggest actual grey matter volume is a highly relevant measure in aging populations.
P. 8, the last sentence states that no regions explained significant variance in slope. But wasn't slope variance constrained to be zero in this model?
Perhaps counterintuitively, constraining the slope variance effectively constrains the residual, or conditional, variance to 0, not the absolute variance. In other words, any predictors may still exert influence and be estimated as normal (although the standardized effect sizes will be artificially high). For instance, age significantly (but weakly) predicts the positive slope with identical parameter estimates regardless of the slope constraint (this is so in Mplus and lavaan). An alternative, defensible approach would be to constrain all predictors of the slope to 0 whenever the slope variance is constrained. However, as this would gain a large number of degrees of freedom, thereby (arguably) artificially improving model fit based on purely data-driven considerations, we chose against doing so.
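The distinction between residual and total slope variance can be made concrete with a small calculation. This is only a sketch with made-up numbers: the coefficient, the predictor variance and the helper names are hypothetical, and the predictors are assumed to be uncorrelated so that the explained variance is a simple sum.

```python
import math


def implied_slope_variance(coefs, predictor_vars, residual_var):
    """Model-implied total variance of the slope factor (uncorrelated predictors)."""
    explained = sum(b * b * v for b, v in zip(coefs, predictor_vars))
    return explained + residual_var


def standardized_effect(coef, predictor_var, total_var):
    """Standardized regression weight of one predictor on the slope factor."""
    return coef * math.sqrt(predictor_var / total_var)


# Hypothetical values: one predictor (e.g. age, variance 64) with a weak
# negative unstandardized effect on the slope.
coef, age_var = -0.01, 64.0

free_total = implied_slope_variance([coef], [age_var], residual_var=0.05)
fixed_total = implied_slope_variance([coef], [age_var], residual_var=0.0)

std_free = standardized_effect(coef, age_var, free_total)
std_fixed = standardized_effect(coef, age_var, fixed_total)
print(fixed_total, std_free, std_fixed)
```

With the residual variance fixed to 0 the slope still has non-zero implied variance, all of it attributed to the predictor, so the standardized weight is pushed to its limit of -1 even though the unstandardized coefficient is unchanged.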

3.
This paper aimed to investigate the neural substrates of fluid intelligence, and its change across time, using participants with cognitive and brain magnetic resonance imaging (MRI) data in UK Biobank. The paper is thorough and well-written, and validly attempts to progress our understanding of this area of research. The authors found separate grey and white matter contributions to mean cognitive scores, however a major limitation related to the 'fluid reasoning' task itself, and its construct validity.
Is the work clearly and accurately presented and does it cite the current literature?

Yes.
Minor suggestions
You may be interested in our 2016 PLOS ONE paper where we highlighted many of the same issues discussed here with the fluid reasoning task, although at that point not including any of the participants who had completed it at MRI. We suggest an alternative title for the task, 'verbal-numerical reasoning' (which you may or may not agree with).
See Cox et al. where in n=3,513 UK Biobank participants it is suggested that 1) five specific white matter tracts are perhaps better off not included in a single FA factor -namely middle cerebellar peduncle, bilateral medial lemniscus and parahippocampal cingulum -because these had low factor loadings, 2) additional tract integrity metrics (e.g. NODDI; MD) could be informative beyond FA, and 3) there were some left vs. right hemispheric differences in FA with age, contrasting with here where values were averaged across left and right hemispheres.
Regarding the line: "We started with 27 tracts (Miller et al., 2016), and averaged bilateral hemispheric tracts, yielding mean FA estimates for a total of 15 tracts", please elaborate on why you took this approach, vs. including more tracts.

Is the study design appropriate and is the work technically sound?
Yes.

Minor suggestions
In the section regarding white/grey-matter contributions to fluid intelligence, have you considered looking specifically at contributions to scores in participants performing it for the first time at MRI? Did you consider people who may have developed neurological/neurodegenerative conditions across waves?
Please give more details on image quality control procedures -e.g. whether you performed anything beyond what UK Biobank have done centrally, and detail slightly more what UK Biobank have done.

Are sufficient details of methods and analysis provided to allow replication by others?
Yes. Although it is worth noting that the UK Biobank is such that firstly the number of participant scans is increasing in batches (see http://www.fmrib.ox.ac.uk/ukbiobank/), and secondly that there are sometimes participant withdrawals from the sample, so researchers who downloaded data tomorrow and ran the script might not find precisely the same results.
[...] specific strengths and weaknesses (see Cox et al., 2016, for a discussion of the merits of more novel metrics) that are beyond the remit of this manuscript.
3) Regarding the line: "We started with 27 tracts (Miller et al., 2016), and averaged bilateral hemispheric tracts, yielding mean FA estimates for a total of 15 tracts", please elaborate on why you took this approach, vs. including more tracts.
27 tracts are all the tracts in Biobank. We agree our phrasing was imprecise and could be read as suggesting a sub selection of even more tracts than 27, so we have adjusted our phrasing accordingly (another reviewer had the same query). We had no specific hypotheses regarding lateralization, and two pragmatic considerations in favour of bilateral averaging. First, inclusion of the individual tracts considerably increases the size of the covariance matrix, which can complicate estimation. Second, simultaneous inclusion of highly collinear predictors in e.g. a MIMIC model can lead to estimation problems. Moreover, in the case of highly collinear predictors, this can artificially increase the difference between the predictors (with one tract highly significant, the other non-significant) merely because they have very similar predictions.
In the section regarding white/grey-matter contributions to fluid intelligence, have you considered looking specifically at contributions to scores in participants performing it for the first time at MRI?
The prediction of the intercept score for each individual will be extremely similar to the current model which predicts fluid intelligence score intercepts.
Did you consider people who may have developed neurological/neurodegenerative conditions across waves?
We did not consider this. To the extent that a subset of participants in a cohort of this magnitude will inevitably display pre-clinical symptoms, we consider that part of the natural variation in a large sample, which should therefore be captured. Moreover, our attempt at fitting a growth mixture model did not yield a clearly identifiable subgroup of individuals with relatively rapid decline, above and beyond what would be expected as a function of a normal population distribution of slopes.
Please give more details on image quality control procedures -e.g. whether you performed anything beyond what UK Biobank have done centrally, and detail slightly more what UK Biobank have done.

We have expanded this section as follows:
We started with 27 tract averages as generated by Biobank, and quality control was conducted by both automated identification of e.g. outlier slices and SNR, as well as manual inspection. For more detail, see Miller et al. 2016, online methods.

We have now included the below
In a final exploratory analysis, we reran the basic growth model with every individual item. This yielded qualitatively very similar results, with positive (but largely non-significant) slopes for all items and non-significant slope variances for all but one item (item 5). Closer inspection of item 5 suggested only a marginal, uncorrected benefit of freely estimating the slope variance, χ²(1) = 8.1, p = .004, combined with a non-significant slope intercept, and a BIC favouring the constrained-to-0 slope model, together suggesting insufficient evidence to proceed with this post hoc item selection instead of the sumscore.
Competing Interests: No competing interests were disclosed.