General Functional Connectivity: shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks

Intrinsic connectivity, measured using resting-state fMRI, has emerged as a fundamental tool in the study of the human brain. However, due to practical limitations, many studies do not collect enough resting-state data to generate reliable measures of intrinsic connectivity necessary for studying individual differences. Here we present general functional connectivity (GFC) as a method for leveraging shared features across resting-state and task fMRI and demonstrate in the Human Connectome Project and the Dunedin Study that GFC offers better test-retest reliability than intrinsic connectivity estimated from the same amount of resting-state data alone. Furthermore, at equivalent scan lengths, GFC displays higher heritability on average than resting-state functional connectivity. We also show that predictions of cognitive ability from GFC generalize across datasets, performing as well or better than resting-state or task data alone. Collectively, our work suggests that GFC can improve the reliability of intrinsic connectivity estimates in existing datasets and, subsequently, the opportunity to identify meaningful correlates of individual differences in behavior. Given that task and resting-state data are often collected together, many researchers can immediately derive more reliable measures of intrinsic connectivity through the adoption of GFC rather than solely using resting-state data. Moreover, by better capturing heritable variation in intrinsic connectivity, GFC represents a novel endophenotype with broad applications in clinical neuroscience and biomarker discovery.


Introduction
Functional magnetic resonance imaging (fMRI) has proven invaluable for identifying the neural architecture of human behavior and cognition (Betti et al., 2013; Fox et al., 2007; Huth et al., 2016). Accordingly, fMRI has been widely adopted in myriad studies seeking to deepen our understanding of the human brain in both health and disease (Cole et al., 2011; Power et al., 2013; Satterthwaite et al., 2016). Recently, fMRI studies have expanded in both scale and scope, often collecting data in thousands of individuals in an effort to adequately power the search for neural correlates of complex human traits and predictive biomarkers for mental illness (Casey et al., 2018; Elliott et al., 2018; Miller et al., 2016; Swartz et al., 2015; Thompson et al., 2014). In this context, most studies have focused on the acquisition of resting-state functional MRI to map the intrinsic connectivity of neural networks. This choice is often prompted by three considerations. First, intrinsic connectivity networks appear to be more heritable than task-elicited activity (Elliott et al., 2017; Ge et al., 2017; Winkler et al., 2010). Second, resting-state data are relatively easy to collect from informative populations of children, the elderly, and the mentally ill (Fox, 2010; Greicius, 2008; Shehzad et al., 2009). Third, intrinsic connectivity plays a primary role in shaping task-based brain activity and associated behaviors (Cole et al., 2016, 2014; Krienen et al., 2014; Tavor et al., 2016).

While analyses of resting-state data have revealed insights about average, group-level features of the human brain (Buckner et al., 2008; Power et al., 2011; Yeo et al., 2011), progress has lagged in identifying individualized signatures, which are critical if ongoing large-scale studies are to be successful in the search for risk and disease biomarkers. This lack of progress partially reflects the generally poor test-retest reliability of intrinsic connectivity estimates in many resting-state

Materials and Methods
Datasets
Human Connectome Project. This is a publicly available data set that includes 1206 participants with extensive MRI and behavioral measurement. In addition, 45 of these subjects completed the entire scan protocol a second time (referred to hereafter as the "test-retest sample"). All participants were free of current psychiatric or neurologic illness and between 25 and 35 years of age. In all analyses, subjects were excluded if they had truncated scans, less than 40 minutes of combined task data, or less than 40 minutes of resting-state data after censoring. Much of the HCP dataset consists of twins and family members. We exploited this feature of the design for conducting heritability analyses. For these analyses, 943 participants from 420 families met inclusion criteria (144 monozygotic families, 85 dizygotic families, and 191 full sibling families). To avoid bias due to family confounding in the prediction analyses, only one individual per family was retained (the family member with the least motion during scanning). This resulted in 298 subjects left to be used in our prediction analyses. Eight subjects were removed from our test-retest analyses because they had at least one fMRI scan with truncated or missing data. Four subjects were removed because they did not have sufficient data (i.e., > 40 minutes of rest and > 40 minutes of task) after censoring. Therefore, the test-retest reliability analyses included data from 33 participants.

The acquisition parameters and minimal preprocessing of these data have been described extensively elsewhere. In our analyses, we used the minimally preprocessed data in volumetric Montreal Neurological Institute (MNI) space ("fMRIVolume" pipeline). Briefly, participants underwent extensive MRI measurement that included T1 and T2 weighted structural imaging, diffusion tensor imaging, and nearly two hours of resting-state and task fMRI. One hour of resting-state fMRI was collected on each participant in four 15-minute scans (1200 time-points each) split into two scanning sessions over two days. In each scan session the two resting-state scans were followed by task fMRI. Across the two sessions, each participant completed seven fMRI tasks described extensively elsewhere (Barch et al., 2013). Briefly, tasks were designed to identify functionally relevant "nodes" in the brain and included working memory (810 timepoints, 10:02 minutes), gambling (506 timepoints, 6:24 minutes), motor function (568 timepoints, 7:06 minutes), language (632 timepoints, 7:54 minutes), social cognition (548 timepoints, 6:54 minutes), relational processing (464 timepoints, 6:52 minutes) and emotional processing (352 timepoints, 4:32 minutes). Altogether, 4800 timepoints totaling 60 minutes of resting-state fMRI and 3880 timepoints totaling 48:30 minutes of task fMRI were collected on each participant.

Dunedin Study. This is a longitudinal investigation of health and behavior in a complete birth cohort of 1,037 individuals (91% of eligible births; 52% male) born between April 1972 and March 1973 in Dunedin, New Zealand (NZ), and eligible based on residence in the province and participation in the first assessment at age three. The cohort represents the full range of socioeconomic status on NZ's South Island and matches the NZ National Health and Nutrition Survey on key health indicators (e.g., BMI, smoking, GP visits) (Poulton et al., 2015). The cohort is primarily white; fewer than 7% self-identify as having non-Caucasian ancestry, matching the South Island (Poulton et al., 2015). Assessments were carried out at birth and ages 3, 5, 7, 9, 11, 13, 15, 18, 21, 26, 32, and 38 years, when 95% of the 1,007 study members still alive took part. Neuroimaging data collection is ongoing in participants who are now aged 45 years. We currently have collected completed neuroimaging data for 756 study members. Data were excluded if the study member did not complete the rest scan and all four task scans or if they did not have sufficient data after censoring and bandpass filtering were applied to the resting-state and each task scan separately (see below for censoring details). Data for 591 subjects survived these fMRI exclusion criteria and were included in analyses. Additionally, 20 study members completed the entire scan protocol a second time (mean days between scans = 79

Minimal preprocessing was first applied to all data. For the HCP dataset this was done with the HCP minimal preprocessing pipeline and included correction for B0 distortion, realignment to correct for motion, registration to the participant's structural scan, normalization to the 4D mean, brain masking, and non-linear warping to MNI space.

Minimal preprocessing steps were applied to the Dunedin Study dataset using custom processing scripts. Anatomical images were skull-stripped, intensity-normalized, and nonlinearly warped to a study-specific average template in the standard MNI stereotactic space using the ANTs SyN registration algorithm (Avants et al., 2008; Klein et al., 2009). Time-series images were despiked, slice-time-corrected, realigned to the first volume in the time-series to correct for head motion using AFNI tools (Cox, 1996), coregistered to the anatomical image using FSL's Boundary Based Registration (Greve and Fischl, 2009), spatially normalized into MNI space using the non-linear ANTs SyN warp from the anatomical image, and resampled to 2mm isotropic voxels. Dunedin Study images were additionally corrected for B0 distortions.

Time-series images for each dataset were further processed to limit the influence of motion and other artifacts. Voxel-wise signal intensities were scaled to yield a time-series mean of 100 for each voxel. Motion regressors were created using each participant's 6 motion correction parameters (3 rotation and 3 translation) and their first derivatives (Jo et al., 2013; Satterthwaite et
debated (Murphy and Fox, 2017), we also preprocessed our data with parallel preprocessing methods without GSR (see supplemental figure S2).

Functional Connectivity processing
We investigated whole brain intrinsic connectivity using two parcellation schemes. In the reliability and prediction analyses (described below), a 264 brain region parcellation was utilized. This parcellation was derived in a large independent dataset (Power et al., 2011). BOLD data were averaged within 5mm spheres surrounding each of the 264 coordinates in the parcellation. In the heritability analyses we used a smaller 44 brain region parcellation scheme due to computational constraints imposed by the time needed to fit heritability models at each edge independently (described below). This parcellation was derived from the 7 network parcellation scheme described in (Yeo et al., 2011)
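As a rough illustration of the step described above, the sketch below turns region-averaged BOLD time series into the vector of unique edges of a connectivity matrix. It assumes Pearson correlation between region time series (the text describes edges as correlations between brain regions but does not spell out further detail here), and the function name `connectivity_edges` and the random demo data are hypothetical.

```python
import numpy as np

def connectivity_edges(roi_timeseries):
    """Return the unique edges (upper triangle) of a region-by-region correlation matrix.

    roi_timeseries: array of shape (n_timepoints, n_regions), one column per
    parcellation region (e.g., the mean BOLD signal within each sphere).
    """
    ts = np.asarray(roi_timeseries, dtype=float)
    corr = np.corrcoef(ts, rowvar=False)       # (n_regions, n_regions) Pearson matrix
    iu = np.triu_indices(corr.shape[0], k=1)   # indices strictly above the diagonal
    return corr[iu]

# Demo with random data standing in for 264 region time series (illustrative only).
rng = np.random.default_rng(0)
demo = rng.standard_normal((200, 264))
edges = connectivity_edges(demo)
print(edges.shape)  # (34716,), i.e. 264*263/2 unique edges
```

With a 44-region parcellation the same function returns 44*43/2 = 946 edges, which is why fitting one heritability model per edge is far cheaper with the smaller scheme.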

Test-retest reliability
Intraclass correlation (ICC) was used to quantify the test-retest reliability of intrinsic connectivity measures in the HCP and Dunedin Study test-retest datasets. ICC(3,1) was used in all analyses (Chen et al., 2018). In the HCP dataset, the influence of the amount of data, as well as type of data (resting-state or task) on ICC was investigated. In the resting-state analyses, intrinsic connectivity matrices were calculated across a range of scan lengths (5, 10, 20, 30, and 40 minutes of post-censoring data) mirrored in the test and retest samples. ICCs were then calculated for each edge of this matrix, yielding 34716 (264*263/2) unique ICC values for each scan length. A more general definition of intrinsic connectivity was derived by including task data in the intrinsic connectivity matrices. GFC was formally investigated (Figures 1, 2 and 3) by constructing datasets from a combination of task and resting-state data over a variety of scan lengths (5, 10, 20, 30 and 40 minutes of post-censoring data). In scan lengths up to 25 minutes, an exactly equal amount of data from all scan types (1/8 of total scan length from each task and resting-state) was combined. After this point, equal amounts of each task could no longer be added together because the shortest task scan no longer had sufficient timepoints remaining in all participants. Above 25 minutes, timepoints were selected at random from the pool of remaining timepoints. HCP fMRI data were collected in two phase encoding directions (LR and RL). To minimize the effect of phase encoding direction on our analyses, timepoints selected from LR encoding scans were selected before timepoints from RL encoding. This ensured a balance of LR and RL phase encoding across test and retest correlation matrices when calculating ICCs. In the Dunedin Study, two ICC matrices were constructed. The first was built from each participant's single resting-state scan and the second from all task and resting-state scans to create the GFC matrix. Paired-sample t-tests were used to compare mean differences in reliability between ICC matrices constructed with rest and GFC.

Heritability
For each edge of the connectivity matrix, we ran an extended twin model (Posthuma and Boomsma, 2000), which is an extension of the classic twin design in that it allows for the modeling of additional family members of twins. In the HCP study we had data from the families of monozygotic and dizygotic twins as well as families without a twin pair. In this model the variance of the phenotype (in this case correlations between brain regions) is partitioned into four sources:
and had at least 40 minutes of task data and 40 minutes of rest data after censoring. Heritability analyses were performed using OpenMx (Boker et al., 2011; Neale et al., 2016) in the R statistical computing environment. Each model also contained parameters to adjust for the effects of sex and age on the mean. Confidence intervals were obtained on the A, C, T, and E estimates using the mxCI command in OpenMx, which obtains profile likelihood confidence intervals (Pek and Wu, 2015). For each edge, an estimate was determined to be significant if the lower bound of the confidence interval was greater than zero.
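The edge-wise ICC(3,1) used for the test-retest analyses can be sketched as follows. This is a minimal illustration under the standard two-way mixed, consistency, single-measure definition, ICC(3,1) = (MS_subjects - MS_error) / (MS_subjects + (k - 1) MS_error) with k = 2 sessions; it is not the authors' code, and the function name `icc_3_1` and the toy inputs are hypothetical.

```python
import numpy as np

def icc_3_1(test, retest):
    """Edge-wise ICC(3,1) for two sessions.

    test, retest: arrays of shape (n_subjects, n_edges) holding the same
    edges measured in the test and retest sessions.
    """
    x = np.stack([np.asarray(test, float), np.asarray(retest, float)], axis=-1)
    n, _, k = x.shape                              # subjects, edges, sessions
    grand = x.mean(axis=(0, 2), keepdims=True)     # per-edge grand mean
    subj_mean = x.mean(axis=2, keepdims=True)      # per-subject mean across sessions
    sess_mean = x.mean(axis=0, keepdims=True)      # per-session mean across subjects
    ss_subj = k * ((subj_mean - grand) ** 2).sum(axis=(0, 2))
    ss_sess = n * ((sess_mean - grand) ** 2).sum(axis=(0, 2))
    ss_total = ((x - grand) ** 2).sum(axis=(0, 2))
    ss_err = ss_total - ss_subj - ss_sess          # residual sum of squares
    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Toy check: a constant session offset leaves the consistency ICC at 1.
t1 = np.array([[1.0], [2.0], [3.0], [4.0]])
print(icc_3_1(t1, t1 + 5.0))
```

Because ICC(3,1) is a consistency measure, a uniform shift between sessions does not lower it; only changes in the rank-ordering or spacing of subjects do.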

Connectome predictive modeling
The predictive utility of increasing reliability through our GFC measure was evaluated by testing the ability of the different intrinsic connectivity matrices to predict cognitive ability in the HCP and Dunedin Study datasets. Cognitive ability was measured in the HCP using the Raven's Progressive Matrices (PMAT24_A_CR) (Raven, 1941). This measure was adopted because it is a proxy for cognitive ability that has been shown to be predicted by intrinsic connectivity in the HCP (Dubois et al., 2018; Finn et al., 2015; Noble et al., 2017). The WAIS-IV, a well-established and validated measure of individual differences in cognitive ability (Wechsler, 2008), was used in the Dunedin Study.

Cognitive ability was predicted from intrinsic connectivity data using Connectome Predictive Modeling (CPM). This framework provides a general method to predict any measure from intrinsic connectivity matrices. In this approach, predictors are filtered by selecting edges that have a p < .01 correlation with the measure of interest. Three linear regression predictive models are then built, one from the positive features, one from the negative features, and one from the combination of features. Here we report results from the combined model (see supplemental tables S1, S2 and S3 for results from all models). In our CPM analyses, we used both within-sample and out-of-sample prediction. Within-sample models were used to directly compare RSFC and GFC predictions of cognitive ability using a leave-one-out cross validation scheme. Models were trained with all participants except one and used to predict the measure in the left-out participant. This was repeated until all participants had been left out.
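The leave-one-out CPM procedure described above might look roughly like the sketch below. Several details are assumptions rather than facts from the text: the combined model is implemented here as a linear regression on two summary features (summed positive edges and summed negative edges), edge selection uses Pearson correlation, and the function names and synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def edge_behavior_correlations(X, y):
    """Pearson r and two-sided p-value between each edge (column of X) and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    df = len(y) - 2
    t = r * np.sqrt(df / (1.0 - r ** 2))
    p = 2.0 * stats.t.sf(np.abs(t), df)
    return r, p

def cpm_loocv(edges, behavior, p_thresh=0.01):
    """Leave-one-out predictions from a combined positive/negative edge model."""
    X = np.asarray(edges, dtype=float)
    y = np.asarray(behavior, dtype=float)
    n = len(y)
    preds = np.full(n, np.nan)
    for i in range(n):
        train = np.arange(n) != i
        # Feature selection uses training participants only, so the
        # left-out participant never influences which edges are kept.
        r, p = edge_behavior_correlations(X[train], y[train])
        pos = (p < p_thresh) & (r > 0)
        neg = (p < p_thresh) & (r < 0)
        feats = np.column_stack([X[:, pos].sum(axis=1),
                                 X[:, neg].sum(axis=1),
                                 np.ones(n)])
        beta, *_ = np.linalg.lstsq(feats[train], y[train], rcond=None)
        preds[i] = feats[i] @ beta
    return preds

# Synthetic demo: 60 "participants", 200 "edges", 3 of which carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
y = X[:, :3].sum(axis=1) + 0.3 * rng.standard_normal(60)
preds = cpm_loocv(X, y)
rho, _ = stats.spearmanr(preds, y)
print(round(float(rho), 2))
```

Repeating edge selection inside every fold is the important design choice: selecting edges once on the full sample would leak information about the left-out participant into the model.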
Out-of-sample prediction is the gold standard to test the unbiased predictive utility of models (Whelan and Garavan, 2014; Yarkoni and Westfall, 2017) and was therefore used to test the generalizability of RSFC, GFC, and task data. Models were trained with RSFC, GFC, and individual tasks in the Dunedin Study dataset and then used to predict cognitive ability from RSFC, GFC, and individual tasks in the HCP dataset. The Dunedin Study and HCP have three parallel tasks that tap into similar behavioral circuits. The gambling task in the HCP and the MID task in the Dunedin Study both involve reward processing. The emotional processing task in the HCP and the face matching task in the Dunedin Study both involve passive processing of emotional facial expressions. Lastly, the working memory task in the HCP and the Stroop task in the Dunedin Study both tap into executive control. We restricted our analyses to comparisons of RSFC, GFC, and these 3 parallel tasks. The ability of RSFC and GFC to predict cognitive ability was contrasted with that of models trained and tested within each of these parallel tasks. In both within-sample and out-of-sample tests the Spearman correlation between predicted and true scores was adopted as an unbiased effect size measure of predictive utility. Model predictions of cognitive ability were assessed for significance using a parametric test for significance of correlations. All p-values from correlations with cognitive ability were corrected for multiple comparisons using the false discovery rate (Benjamini and Hochberg, 1995). Differences in predictive utility between models were compared using Steiger's z (Steiger, 1980). All confidence intervals for CPM prediction estimates were generated with bootstrap resampling, using AFNI's 1dCorrelate tool.
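For readers unfamiliar with Steiger's z, the sketch below shows one common form of the test for two dependent correlations that share a variable (here, two sets of predictions each correlated with the same true scores). The text does not state which variant was used, so this particular formula (Fisher-transformed correlations with a pooled covariance term) is an assumption, and the input values in the demo are made up for illustration.

```python
import math

def steiger_z(r12, r13, r23, n):
    """Steiger-style z test for two dependent correlations sharing variable 1.

    r12: correlation of true scores with predictions from model A
    r13: correlation of true scores with predictions from model B
    r23: correlation between the two sets of predictions
    n:   number of participants
    Returns (z, two-sided p-value).
    """
    z12, z13 = math.atanh(r12), math.atanh(r13)   # Fisher r-to-z transforms
    r_bar = (r12 + r13) / 2.0                     # pooled correlation
    psi = (r23 * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r23 ** 2))
    c = psi / (1 - r_bar ** 2) ** 2               # covariance of the two z values
    z = (z12 - z13) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    p = math.erfc(abs(z) / math.sqrt(2))          # two-sided normal p-value
    return z, p

# Illustrative values only; these are not taken from the study.
z, p = steiger_z(r12=0.40, r13=0.20, r23=0.50, n=100)
print(round(z, 3), round(p, 3))
```

The test needs the correlation between the two prediction vectors (`r23`) in addition to the two accuracies, which is why the comparison cannot be reproduced from the reported r² values alone.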

Results
What is the test-retest reliability of GFC?
We used data from the HCP to evaluate the test-retest reliability of both intrinsic connectivity derived from resting-state scans alone and of GFC derived from combinations of task and resting-state scans. Test-retest reliability, as measured by ICC, ranges from 0-1, and is commonly classified according to the following cutoffs: 0-0.4 = poor, 0.4-0.6 = moderate, 0.6-0.8 = good and 0.8-1 = excellent (Cicchetti, 1994). In the HCP, when 5 minutes (post-censoring) of resting-state data were used to measure individual differences in intrinsic connectivity defined within a common atlas of 264 regions (Power et al., 2011), the reliability was generally poor (mean ICC = .28, 95% CI [.28, .28]; 71% of edges were poor, 23% moderate, 6% good, and less than 1% excellent). As more resting-state data were added, the test-retest reliability continued to increase up to the limits of the dataset at 40 minutes (mean ICC = .54, 95% CI [.54, .54]; 26% of edges poor, 32% moderate, 32% good, and 11% excellent) (Figure

GFC, defined as exactly equal parts of task and resting-state data in the HCP, followed a similar pattern of increasing reliability with increasing data: 5 minutes of data exhibited poor reliability (mean ICC = .28, 95% CI [.28, .28]; 70% of edges were poor, 22% moderate, 7% good, and less than 1% excellent) but 40 minutes of data showed good reliability (mean ICC = .56, 95% CI [.55, .56]; 25% of edges were poor, 27% moderate, 32% good, and 16% excellent) (see Figure 1B and the right panel of Figure 2). In addition, resting-state scans were not required to generate a reliable measure of GFC, as this pattern of increasing reliability held when task data alone were used to measure intrinsic connectivity (see supplemental Figure S1). A paired sample t-test comparing ICCs across all edges revealed that GFC is significantly more reliable than resting-state functional connectivity when both are estimated with 40 minutes of data (t(34715) = 140.01, p < .001). While significant, this difference is not likely meaningful as it represents only a 0.02 larger mean ICC in GFC. The low p-value is driven by the large number of degrees of freedom in the t-test comparing a mean ICC estimate from nearly 35,000 edges. Thus, we see the mean ICC of RSFC and GFC as statistically separable but practically equivalent.

This general improvement in reliability with increasing data was not unique to the HCP but was also observed in the population-representative Dunedin Study.
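The equal-parts definition of GFC used in these comparisons can be illustrated with a small sketch. It assumes each scan has already been preprocessed and censored into a region-by-time array, and that the first available timepoints of each scan are taken; the function name `build_gfc_timeseries`, the scan labels, and the demo dimensions are all hypothetical.

```python
import numpy as np

def build_gfc_timeseries(scans, total_timepoints):
    """Concatenate an equal number of timepoints from every scan type.

    scans: dict mapping scan name -> array (n_timepoints, n_regions) of
           preprocessed, post-censoring region time series (assumed input).
    total_timepoints: target length of the combined series.
    """
    per_scan = total_timepoints // len(scans)     # e.g. 1/8 each for 7 tasks + rest
    parts = []
    for name, ts in scans.items():
        ts = np.asarray(ts, dtype=float)
        if ts.shape[0] < per_scan:
            raise ValueError(f"scan '{name}' has too few timepoints for an equal split")
        parts.append(ts[:per_scan])               # assumption: take the earliest timepoints
    return np.vstack(parts)

# Demo: rest plus seven tasks, 10 regions, random data standing in for BOLD signal.
rng = np.random.default_rng(0)
names = ["rest", "wm", "gambling", "motor", "language", "social", "relational", "emotion"]
scans = {name: rng.standard_normal((100, 10)) for name in names}
combined = build_gfc_timeseries(scans, total_timepoints=400)
gfc = np.corrcoef(combined, rowvar=False)         # 10 x 10 GFC matrix
print(combined.shape, gfc.shape)
```

The explicit length check mirrors the constraint noted in the methods: once the shortest task runs out of timepoints, an exactly equal split is no longer possible and another sampling rule is needed.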

Influence of the global signal on reliability
GSR is still a controversial preprocessing step. Therefore, the main reliability analyses in the HCP were rerun without GSR to test the influence of GSR on ICCs. Mean ICCs for 5 minutes of data without GSR were poor for both RSFC (mean ICC = .31, 95% CI [.31, .31
51% good, and 11% excellent). While mean ICC improved with additional data both with and without GSR, ICCs were significantly higher when GSR was not applied: 40 minutes of GFC had an average ICC that was 0.07 higher without GSR than when GSR was applied. The same pattern was present in RSFC: 40 minutes of resting-state without GSR was more reliable on average than with GSR (rest without GSR = .61, rest with GSR = .54). While higher reliabilities without GSR could be a sign of improved detection of true individual differences in intrinsic connectivity, they may also be a sign of a stable propensity to move within the scanner, creating correlated motion artifact in test and retest data (Power et al., 2016; van Dijk et al., 2012). Because the global signal is highly susceptible to motion artifact and there is no widely accepted method to isolate and remove just the artifactual component of the global signal while retaining the neural component, we adopted a conservative approach by applying GSR in all further analyses to ensure removal of motion related artifact present in the global signal (Power et al., 2018).

Is there network specificity to improvements in reliability?
We next investigated network specificity in reliability by looking specifically at the reliability of edges within each functional network rather than the reliability of the entire correlation matrix in aggregate. As shown in Figure 3, there was clear network heterogeneity in reliability. While all networks improved in reliability from 10 minutes of data to 40 minutes, some networks improved more rapidly. With both RSFC and GFC, the limbic network had the lowest reliability at all scan lengths, improving from 10 minutes of data
reliabilities between RSFC and GFC (e.g., see dorsal attention network in Figure 3
This represents an increase in the amount of variance in RSFC attributable to additive genetics of 138% as a function of increased scan length. This increase in variance explained by additive genetics was also present with GFC from 5 minutes (mean A component = .14, 95% CI [.13, .14]) to 40 minutes (mean A component = .28, 95% CI [.27, .29]). This represented an increase in the amount of variance in GFC attributable to additive genetics of 107% as a function of increased scan length. In addition, at equivalent scan lengths, GFC had significantly higher mean heritability across edges than RSFC (e.g., 5 minutes: t(945) = 10.11, p < .001; 40 minutes: t(945) = 14.75, p < .001). Correspondingly, higher mean heritability in GFC than RSFC led to a greater percentage of heritable edges in GFC (see Figure 4D). In the ACE modeling, this pattern of increased heritability with increased scan time was associated with a simultaneous decrease in the amount of variance explained by the component that comprises both non-shared environment and measurement error (E) (Figure 4C)

A heritable edge is defined as an edge with a lower bound of the 95% confidence interval that is larger than 0 in the ACE model.

Exploring the connection between intrinsic connectivity and cognitive ability
CPM was used to predict cognitive ability from both RSFC and GFC in both the HCP and the Dunedin Study. In the first set of analyses, within-sample predictions were made using leave-one-out cross validation in the HCP and Dunedin Study datasets separately (Figure 5, left and center panels). In the HCP dataset, we found that individual differences in cognitive ability could be predicted from 40 minutes of RSFC (r² = 4.2%, 95% CI [1.0%, 9.7%], p = .002, q = .002) and from 40 minutes of GFC (r² = 7.6%, 95% CI [2.7%, 14.2%], p, q < .001) (see Figure 5). These results were replicated and extended in the Dunedin Study using scan lengths that are more representative of existing datasets than the HCP. Cognitive ability could be predicted from the 8-minute RSFC data (r² = 6.8%, 95% CI [3.5%, 11.3%], p, q < .001) as well as from 33 minutes of GFC (r² = 10.2%, 95% CI [5.9%, 15.0%], p, q < .001; all available data). Although GFC had 81% greater predictive utility than RSFC in the HCP dataset and 50% greater predictive utility in the Dunedin Study dataset, the difference between the correlations was not statistically significant in either the HCP (GFC r² = 7.6%, RSFC r² = 4.2%, Steiger's z = 1.317, p = .188) or the Dunedin Study (GFC r² = 10.2%, RSFC r² = 6.8%, Steiger's z = 1.72, p = 0.085).

While leave-one-out cross validation has been widely used with CPM (Finn et al., 2015; Noble et al., 2017), the gold standard test of a model's generalizable predictive utility is its ability to predict a phenotype in an independent sample (Bzdok and Yeo, 2017; Whelan and Garavan, 2014; Yarkoni and Westfall, 2017). This is particularly important for GFC, as it purports to measure a generalizable feature of intrinsic connectivity and thus correlations between individual differences in GFC and phenotypes of interest should be shared across independent samples with different batteries of fMRI tasks. Therefore, we used CPM to train two models (one with rest and one with GFC) to predict cognitive ability in the larger Dunedin Study dataset and tested generalizability by measuring the ability of these models to predict cognitive ability in the HCP dataset. The models built with RSFC (r² = 5.6%, 95% CI [1.8%, 11.9%], p, q < .001) and GFC (r² = 9.5%, 95% CI [4.0%, 16.3%], p, q < .001) from the Dunedin Study dataset both successfully predicted cognitive ability in the HCP dataset (Figure 5 right panel and Figure 6 left panel). However, GFC had greater out-of-sample predictive utility than RSFC (GFC r² = 9.5%, RSFC r² = 5.6%, Steiger's z = 2.57, p = 0.010). This provides evidence that GFC is a more generalizable measure across independent samples than RSFC.

Figure 5. GFC is better than RSFC at predicting cognitive ability both within and between samples. Results from CPM models predicting cognitive ability from RSFC and GFC. The x-axis displays predictions from leave-one-out cross validation within sample and out-of-sample models trained using the Dunedin Study dataset and tested using the HCP dataset. Predictive utility is displayed as % variance explained (r²).

Does GFC predict cognitive ability better than intrinsic connectivity measured from specific tasks?
Recent research has suggested that tasks are generally better than RSFC at predicting cognitive ability from intrinsic connectivity data (Greene et al., 2018). Our above finding that GFC outperforms RSFC at predicting cognitive ability is in line with this observation. However, given that most researchers do not collect the same task scans or resting-state scans with different "task" demands (e.g., eyes-closed, fixation, etc.), we next investigated if GFC could perform as well as or better than specific tasks at out-of-sample prediction of cognitive ability. We constructed five prediction models that were trained to predict cognitive ability using the Dunedin Study dataset (the larger sample) and tested for generalizability using the HCP dataset. Two models were carried over from the previous analyses (RSFC and GFC) and three additional task-specific models were added. These consisted of models trained and tested on three parallel tasks available in both the Dunedin Study and HCP (see methods for details). All models trained using the Dunedin Study dataset were able to significantly predict cognitive ability in the HCP dataset (all FDR q < .05; Figure 6). Prediction of cognitive ability from RSFC data (r² = 5.6%, 95% CI [1.8%, 11.9%], p, q < .001), GFC data (r² = 9.5%, 95% CI [4.0%, 16.3%], p, q < .001), emotional processing data (r² = 9.8%, 95% CI [4.3%, 17.1%], p < .001, q = .001), reward data (r² = 9.1%, 95% CI [3.6%, 16.5%], p < .001, q = .001) and executive function data (r² = 4.6%, 95% CI [1.0%, 10.8%], p < .001, q = .001) all successfully generalized from the Dunedin Study dataset to the HCP dataset.

Comparing the relative performance of models trained with each type of data revealed that GFC was generalizable and performed as well as or better than all other tasks at out-of-sample prediction of cognitive ability (Figure 6). Predictions of cognitive ability from GFC performed better than RSFC (GFC r² = 9.5%, RSFC r² = 5.6%, Steiger's z = 2.57, p = 0.01) and executive function data (GFC r² = 9.5%, executive function r² = 4.6%, Steiger's z = 2.19, p = .03), and performed as well as emotional processing data (GFC r² = 9.5%, emotion processing r² = 9.8%, Steiger's z = -.10, p = .92) and reward data (GFC r² = 9.5%, reward r² = 10.2%, Steiger's z = -.31, p = .76).

Figure 6. Out-of-sample prediction of cognitive ability for GFC is better than RSFC and as good as task-derived intrinsic connectivity. All models were trained on intrinsic connectivity data from the Dunedin Study and tested using data from the HCP. Models were trained and tested with the same type of data. With task data this meant that models were trained and tested with tasks that have a comparable (parallel) task in both the HCP and Dunedin study. Predictive utility is displayed as % variance explained (r²).

Discussion
Here we present GFC as a method for readily improving the reliability of estimating individual differences in intrinsic connectivity by combining resting-state and task data. Across two independent datasets we found that scan length has a significant impact on the reliability of intrinsic connectivity estimates regardless of whether the data come solely from resting-state scans (i.e., RSFC) or from a combination of resting-state and task scans (i.e., GFC). With GFC, the addition of task scans to short resting-state scans substantially increased the reliability of intrinsic connectivity over and above short resting-state scans alone. Moreover, we found that improved reliability was reflected in higher estimates of the heritable (i.e., additive genetic) variation in intrinsic connectivity, and that this gain was larger for GFC than RSFC. In addition, GFC consistently performed better than RSFC and performed as well as or better than intrinsic connectivity from individual task scans at predicting individual differences in a complex human trait, namely cognitive ability.
590 591 These findings have several implications for current and future neuroimaging research because 592 they demonstrate that 1) reliable and heritable measurement of individual differences in the 593 functional architecture of the brain is not only achievable in niche, specialty datasets with hours of 594 resting-state scans, but also in many existing datasets if GFC is adopted; 2) this gain in reliability 595 has consequences for the study of individual differences in behavior and cognition, as 596 demonstrated by a generalizable predictive utility when studying cognitive ability; 3) to the extent 597 that commensurate gains in reliability and heritability may be achieved in other datasets, our results 598 provide a reference that researchers can use to roughly estimate the gain in reliability they may 599 achieve by adopting GFC (Figures 1, 2, S4 and S5). We elaborate on each of these points, next. 600 We found that GFC, derived from the combination of resting-state and task scans, can be a reliable 603 measure of individual differences in intrinsic connectivity. Consistent with previous studies (Birn 604 et al., 2013;Finn et al., 2015;Laumann et al., 2015), we showed that the test-retest reliability of 605 intrinsic connectivity is highly dependent on the amount of data collected. This is true of RSFC as 606 well as GFC. With only 5-10 minutes of data, both RSFC and GFC show poor reliability in the 607 HCP and Dunedin Study datasets. Only with 30-40 minutes of data do RSFC and GFC begin to 608 broadly display good reliability. These findings follow directly from past studies that have shown 609 reliability depends on scan length (Anderson et al., 2011;Birn et al., 2013;Hacker et al., 2013), 610 that task and resting-state data share a large proportion of variance (Cole et al., 2014;Geerligs et 611 al., 2015), and that both task and resting-state data measure common individual differences in 612 intrinsic connectivity (Gratton et al., 2018). 
Some functional networks, however, can achieve good reliability with relatively little data. For example, the default mode and fronto-parietal networks achieved good reliability for within-network connections with just 20 minutes of RSFC or GFC. In contrast, the limbic network failed to achieve even fair levels of reliability with up to 40 minutes of data. These findings suggest that the amount of data needed to derive reliable estimates of intrinsic connectivity for individual differences research will vary depending on the network of interest.

Importantly, the current reality is that most neuroimaging studies only have 5-10 minutes of resting-state data, which are further limited by sampling variability, motion artifacts, and other sources of noise (Gordon et al., 2017; Gratton et al., 2018; Power et al., 2014). Given the reliability estimates reported here, it is likely that in many studies, individual differences in intrinsic connectivity will be unmeasurable with resting-state data alone (Anderson et al., 2011; Hacker et al., 2013). Fortunately, it is common practice for researchers to collect 10-40 minutes of task fMRI in addition to resting-state. Therefore, if researchers adopt a broader definition of intrinsic connectivity, such as GFC, they can combine task and resting-state data to achieve a reliable measure of individual differences in intrinsic connectivity.
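
The idea of combining task and resting-state data into a single connectivity estimate can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the processing pipeline used in this study: it assumes each scan has already been preprocessed and reduced to a time-by-region matrix, it standardizes each run before concatenation (an assumed choice), and it omits steps such as motion censoring. The function names and the synthetic arrays are hypothetical.

```python
import numpy as np

def zscore_run(ts):
    """Standardize each region's time series within one run (time x regions)."""
    return (ts - ts.mean(axis=0)) / ts.std(axis=0)

def connectivity(runs):
    """Pearson correlation matrix computed over z-scored, concatenated runs."""
    stacked = np.vstack([zscore_run(r) for r in runs])
    return np.corrcoef(stacked, rowvar=False)

rng = np.random.default_rng(0)
n_regions = 10

# Synthetic stand-ins for parcellated scans (hypothetical data, not real fMRI).
rest = rng.standard_normal((200, n_regions))
task_a = rng.standard_normal((150, n_regions))
task_b = rng.standard_normal((150, n_regions))

rsfc = connectivity([rest])                  # resting-state data only
gfc = connectivity([rest, task_a, task_b])   # rest and task data combined
```

In this sketch the only difference between the two estimates is how many time points enter the correlation, which is the property the reliability argument above relies on.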

Data aggregation across task and resting-state scans boosts heritability
We found measurable genetic influences on individual differences in intrinsic connectivity in both RSFC and GFC. With 40 minutes of data, 22% of the variance in RSFC and 28% of the variance in GFC was attributable to additive genetic effects. With equivalent amounts of data, GFC was significantly more heritable than RSFC, and the influence of additive genetics was detectable in a larger proportion of functional connections (Figure 4). In addition, we found that scan length had a large impact on the heritability of intrinsic connectivity in both RSFC and GFC. The average amount of variance in intrinsic connectivity that was attributable to additive genetic effects more than doubled in both RSFC (increase of 137%) and GFC (increase of 105%) as scan length increased from 5 to 40 minutes. While the heritability of intrinsic connectivity has been investigated before (Adhikari et al., 2018; Ge et al., 2017), the link between scan length and heritability has not been described. Broadly, these findings indicate that estimates of intrinsic connectivity from less fMRI data will have lower heritability than estimates from more data. More specifically, researchers should consider the amount of fMRI data available for estimation of intrinsic connectivity when planning and evaluating imaging genetics research.
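
One way to see why heritability estimates might track reliability is a short derivation under classical test theory. This is a sketch under assumptions that are not part of the analyses reported here: the measured connectivity X is taken to be a stable true score T plus measurement error E that is independent of T and carries no additive genetic variance. The symbols (X, T, E, the reliability \rho_{XX'}, and the additive genetic variance \sigma^2_A) are introduced only for this illustration.

```latex
\[
\begin{aligned}
X &= T + E, \qquad \operatorname{Var}(X) = \operatorname{Var}(T) + \operatorname{Var}(E),\\
\rho_{XX'} &= \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)},\\
h^2_{X} &= \frac{\sigma^2_A}{\operatorname{Var}(X)}
         = \frac{\sigma^2_A}{\operatorname{Var}(T)} \cdot \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
         = h^2_{T}\,\rho_{XX'} .
\end{aligned}
\]
```

Under these assumptions the heritability of the measured phenotype is the heritability of the true score scaled by reliability, so any gain in reliability from longer scans would raise the observed heritability proportionally.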

Our results have additional implications for genetically-informed fMRI research. Neuroimaging measures have been considered promising intermediate phenotypes that would bring researchers closer to the mechanisms through which genetics lead to heritable psychological traits (Braskie et al., 2011; Hariri and Weinberger, 2003; Hasler and Northoff, 2011; Meyer-Lindenberg and Weinberger, 2006). To uncover these links, large datasets with MRI and genetics, like ENIGMA and the UK Biobank (Sudlow et al., 2015; Thompson et al., 2014), have been used to conduct genome-wide association studies (GWAS). While these investigations have had success in finding genetic variants linked to structural MRI measures (Adams et al., 2016; Hibar et al., 2017, 2015; Stein et al., 2012), GWAS of measures of intrinsic connectivity from fMRI have largely failed to find significant genetic associations (Elliott et al., 2017). One reason for this failure may be the low reliability and heritability of the short resting-state scans used to derive individual differences measures of RSFC in these studies. Our results suggest that future GWAS may be better powered to find genetic variants linked to intrinsic connectivity based on GFC rather than RSFC from short resting-state scans.

How general is general functional connectivity?
Previous research has shown that the predictive utility of intrinsic connectivity can be higher when it is derived from task data than when derived from resting-state data (Greene et al., 2018). Some task states in particular may reveal individual differences in intrinsic connectivity that aid in predictive utility. While these findings provide insight into individual differences in intrinsic connectivity and a new use for task data, they pose a problem for replication and cumulative aggregation of findings across datasets.
While most fMRI studies collect task data, very few collect the same tasks. Our findings here suggest that GFC may provide a solution to this problem. First, we found that, like task-derived intrinsic connectivity, GFC can outperform RSFC when predicting within-sample and out-of-sample variability in cognitive ability (Figure 5). Second, we found that GFC performed as well or better than specific tasks alone at predicting cognitive ability out-of-sample. Critically, in these comparisons GFC was constructed with a combination of different tasks in the training and testing datasets, suggesting that GFC may measure a common, generalizable backbone of intrinsic connectivity that can be applied and aggregated across independent datasets with different task batteries (Tables S4 and S5). While further research is needed to replicate these findings, they nevertheless suggest that GFC may provide a generalizable, practical framework that researchers with resting-state and task data can use to drive a cumulative, predictive neuroscience of individual differences (Button et al., 2013; Szucs and Ioannidis, 2017).

Is general functional connectivity too heterogeneous?
An understandable reluctance in implementing GFC could be that it would introduce heterogeneity in the estimation of intrinsic connectivity. Whereas RSFC is thought to be a common framework with generalizability across datasets, the combination of different sets of tasks in different datasets may introduce additional variability. Below, we provide three reasons why we think this is less of a problem than may appear at first and point to features of our analyses that support the validity of GFC as a robust and reliable measure of individual differences in intrinsic connectivity.

First, resting-state data are by nature heterogeneous, and the resting state is its own type of task (Buckner et al., 2013).
Researchers collect resting-state data under different conditions in which participants close their eyes, have their eyes open, or fixate. These differences in resting-state conditions influence intrinsic connectivity (Patriat et al., 2013). Despite these differences, all approaches are collectively referred to as resting-state scans. In addition, factors like thought content (Christoff et al., 2009; Hurlburt et al., 2015), caffeine intake (Wong et al., 2012), recent cognitive tasks (Waites et al., 2005), and falling asleep (Deco et al., 2014) can introduce further heterogeneity biasing intrinsic connectivity estimates in many resting-state datasets. It has even been estimated that approximately 30% of subjects fall asleep within the first three minutes of a resting-state scan (Tagliazucchi and Laufs, 2014). Despite this heterogeneity, resting-state scans measure a common set of functional networks (Yeo et al., 2011), and display trait-like features driven by stable factors such as genetics (Glahn et al., 2010) and structural connectivity (Honey et al., 2009). For these reasons resting-state data represent substantial promise as an individual differences measure if enough data are collected to average out sampling variability (Gratton et al., 2018). But, as already stated, few researchers collect enough data to fulfill this promise. Connectivity estimates from task data are also shaped by many of these stable factors (Cole et al., 2014; Krienen et al., 2014), and data from different tasks predominantly measure overlapping individual differences that are only weakly influenced by task demands (Gratton et al., 2018). Admittedly, GFC will vary, to some extent, across samples because each study collects a unique combination of resting-state and task data. However, RSFC will vary too.
Despite this heterogeneity, we found that disparate task and resting-state scans collectively measure reliable, heritable individual differences with generalizable out-of-sample predictive utility in the form of GFC.

Second, the benefits of GFC converged across two different samples, with different scanners, scanning parameters, and demographics. The HCP represents a large healthy sample of highly educated individuals, while the Dunedin Study represents a population-representative birth cohort with a broad range of mental and physical health conditions, socioeconomic status, and full representation of variability in many complex traits (Poulton et al., 2015). Across samples, the data came from 11 different tasks (HCP = 7, Dunedin Study = 4), many with unrelated cognitive demands, different scan lengths, stimuli, behavioral requirements, and instructions. The data were processed with different preprocessing schemes and software ("minimally preprocessed" in the HCP and custom scripts in the Dunedin Study). While the amount of scan time was tightly controlled in the HCP analyses (i.e., equal time after censoring), in the Dunedin Study motion censoring led to unequal scan lengths across participants, as is the case in many analyses of intrinsic connectivity. Despite these many differences between datasets, the results demonstrate that GFC can achieve good test-retest reliability, inter-rater reliability (different preprocessing schemes), out-of-sample reliability (convergence across datasets and out-of-sample prediction), and parallel forms reliability (different tasks in each sample) (Dubois and Adolphs, 2016). While we present evidence in favor of the generalizability of GFC, future research should further investigate the heterogeneity in GFC to find combinations of tasks that most efficiently estimate individual differences in intrinsic connectivity (Shah et al., 2016).

Third, reliability fundamentally limits the ability to detect associations between any two measures (Nunnally Jr., 1970; Vul et al., 2009). Therefore, any investigation mapping intrinsic connectivity to individual differences in behavior, cognition, or disease is fundamentally limited by the reliability of the intrinsic connectivity measures. Relatedly, statistical power to detect true effects depends on the anticipated effect size, which is in turn dependent on the reliability of each measurement (Kanyongo et al., 2007; Williams and Zimmerman, 1989). Given that many studies only have 5-10 minutes of resting-state data and consequently poor reliability of intrinsic connectivity measures, they have drastically reduced statistical power (Button et al., 2013; Ioannidis, 2008, 2005), and likely yield tenuous correlates of individual differences unless they adopt a framework like GFC. Previous research has found that collectively a large number of unreliable edges can achieve multivariate reliability that results in predictive utility (Noble et al., 2017). While multivariate predictive utility may be achievable in studies with poor univariate edge-wise reliability (Figure S5), the interpretability and clinical utility of these studies will be diminished by low univariate reliability. This is because interpretability and clinical utility depend on isolating specific brain areas or functional connections that can be said to be "most" important to the phenotype of interest. This cannot be done without decent univariate reliability because accurate estimation of true parameters in a prediction model depends on reliability (Cremers et al., 2017; Vul et al., 2009; Yarkoni and Westfall, 2017).
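
The dependence of detectable effect size and power on reliability can be made concrete with the classical attenuation formula, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below is an illustration only: the true correlation and the reliability values are hypothetical and are not estimates from this study, and the sample-size function uses the standard Fisher z approximation for a two-sided test at alpha = .05 with 80% power.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the reliability of each measure."""
    return true_r * math.sqrt(rel_x * rel_y)

def n_for_power(r, z_alpha=1.96, z_beta=0.8416):
    """Approximate N needed to detect correlation r (Fisher z approximation)."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

# Hypothetical values chosen only to illustrate the argument.
true_r = 0.30        # assumed true brain-behavior correlation
rel_brain = 0.40     # assumed poor reliability of a connectivity edge
rel_behavior = 0.80  # assumed reliability of the behavioral measure

observed_r = attenuated_r(true_r, rel_brain, rel_behavior)
n_true = n_for_power(true_r)
n_observed = n_for_power(observed_r)
print(f"observed r = {observed_r:.3f}; N needed: {n_true} vs. {n_observed}")
```

Under these assumed values, the sample size required to detect the attenuated correlation is several times larger than the sample size required for the true correlation.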
For example, if a researcher wants to characterize biomarkers of mental illness that could be targeted with interventions like transcranial magnetic stimulation (Fox et al., 2014) or real-time neurofeedback (Caria et al., 2012), they will need to isolate specific circuits that with confidence can be said to be "altered" and relevant to the disease process in mental illness. Actionable identification of these biomarkers will only be possible with adequate univariate reliabilities. GFC provides a practical solution to this problem, while also addressing power and reliability issues that have been identified more generally in neuroscience (Button et al., 2013; Szucs and Ioannidis, 2017) and beyond (Aarts et al., 2015; Errington et al., 2014; Ioannidis, 2005; Ioannidis et al., 2001; Munafò et al., 2017). GFC provides a tool for researchers to repurpose existing and future datasets to better contribute to a cumulative science of individual differences (Dubois and Adolphs, 2016) with clinical relevance (Fox, 2010). For these reasons, we think that in many studies the gains in reliability offered by GFC will outweigh potential task-induced "heterogeneity."

Conclusion
Here we propose GFC as a method for improving the reliability of estimating individual differences in the intrinsic architecture of functional networks. We demonstrate that, when the amount of data available for analysis is held constant, the combination of resting-state and task data is at least as reliable as resting-state alone. Additionally, when the same amount of data is available, GFC is more heritable than RSFC. Many researchers who have less than 25 minutes of resting-state data but have additionally collected task data on the same subjects can immediately boost reliability, heritability, and power by adopting GFC as an individual differences measure of intrinsic connectivity.
Our findings also suggest that future data collection should consider naturalistic fMRI (Hasson et al., 2010, 2004; Huth et al., 2016; Lahnakoski et al., 2012; Vanderwal et al., 2017) and engaging tasks in addition to resting-state scans when generating data for estimating intrinsic connectivity. This may be especially true when studying children, the elderly, or the mentally ill, as these groups often cannot lie still for 25 minutes of resting-state scanning (Power et al., 2012; Satterthwaite et al., 2013; Yuan et al., 2009). The current findings also suggest that researchers need not think of collecting resting-state and task data as a zero-sum competition for scan time when designing studies. Instead, our findings demonstrate that task and resting-state data provide complementary measures that together can lead to reliable, heritable measurement of individual differences in intrinsic connectivity. We think that collecting adequate amounts (> 25 minutes) of high-quality fMRI data should be emphasized over the traditional task-rest dichotomy when individual differences are of interest. If fMRI is going to be a part of a cumulative neuroscience of individual differences with clinical relevance, reliability and measurement must be at the forefront of study design.

Ching, C.R.K., Cuellar-Partida, G., Braber, A. Den, Doan, N.T., Ehrlich, S., Filippi, I., Ge, T., Giddaluru, S., Goldman, A.L., Gottesman, R.F., Greven, C.U., Grimm, O., Griswold, M.E., Guadalupe, T., Hass, J., Haukvik, U.K., Hilal, S., Hofer, E., Hoehn, D., Holmes, A.J., Hoogman