Reliability modelling of resting-state functional connectivity

Resting-state functional magnetic resonance imaging (rs-fMRI) has an inherently low signal-to-noise ratio, largely due to thermal and physiological noise, which attenuates functional connectivity (FC) estimates. Such attenuation limits the reliability of FC and may bias its association with other traits. Low reliability also limits heritability estimates. Classical test theory can be used to obtain a true correlation estimate free of random measurement error from parallel tests, such as split-half sessions of an rs-fMRI scan. We applied a measurement model to split-half FC estimates from the resting-state fMRI data of 1003 participants from the Human Connectome Project (HCP) to examine the benefit of reliability modelling of FC in association with traits from various domains. We evaluated the efficiency of the measurement model in extracting a stable and reliable component of FC and in improving its association with several traits for various sample sizes and scan durations. In addition, we aimed to replicate our previous findings, obtained in a longitudinal adolescent twin cohort, of increased heritability estimates when using a measurement model. For the full HCP dataset, the split-half measurement model improved test-retest reliability of FC by 0.33 points on average (from +0.49 to +0.82), improved the strength of associations between FC and various traits 1.2-fold on average (range 1.09-1.35), and increased heritability estimates by 20 percentage points on average (from 39% to 59%). On average, about half of the variance in split-session FC estimates was attributed to the stable and reliable component of FC. Shorter scan durations showed greater benefit of reliability modelling (up to 1.6-fold improvement), with an additional gain for smaller sample sizes (up to 1.8-fold improvement).
Reliability modelling of FC based on a split-half using a measurement model can benefit genetic and behavioral studies by extracting a stable and reliable component of FC that is free from random measurement error and improves genetic and behavioral associations.


Introduction
Resting-state functional connectivity has become a popular method to study the functional organization of the human brain (Biswal et al., 1995; Greicius et al., 2002; van den Heuvel and Hulshoff Pol, 2010). Functional connectivity has shown promise as a potential biomarker because of its association with neuropsychiatric and neurological disorders (Hager and Keshavan, 2015; Hohenfeld et al., 2018; Whitfield-Gabrieli and Ford, 2012; Zhang and Raichle, 2010; Fox et al., 2014). In addition, it is associated with various cognitive and behavioural traits (Vaidya and Gordon, 2013; Basten et al., 2015; Shen et al., 2018; Smith et al., 2015; Toschi et al., 2018), and functional connectivity is heritable to a certain extent (Ge et al., 2017; Colclough et al., 2017; Adhikari et al., 2018; Teeuw et al., 2019). However, for resting-state functional connectivity to become a biomarker, it needs to be a reliable measure. Noise in the BOLD signal has a negative impact on the reliability of the signal and results in the attenuation of functional connectivity estimates based on temporal correlation between two BOLD signals (Spearman, 1904; Wang, 2010; Birn et al., 2014; Mueller et al., 2015). Even with state-of-the-art procedures, test-retest reliability of functional connectivity is only poor to moderate for short scans (i.e. less than 30 min) (Noble et al., 2019; Chen et al., 2015; Shah et al., 2016), with large variation in reliability for the different connections measured in a single individual over an extended period of time (Choe et al., 2015). Although reliability of functional connectivity can be improved by increasing the scan duration up to 1.5 hours (Laumann et al., 2015), this approach is not always feasible due to the burden on the subject or the cost and availability of MRI.
The limited reliability of functional connectivity for scan lengths that are typically used in resting-state fMRI studies puts an upper bound on the heritability estimates of functional connectivity and on its association with traits (Vul et al., 2009; Neale and Cardon, 1992; Ge et al., 2017), thereby making it difficult to reliably identify functional connections in the brain that are associated with particular traits (Geerligs et al., 2017; Kruschwitz et al., 2018).
In Classical Test Theory, the true score of a measure can be obtained from the observed score if the error term is known: observed score = true score + error term (Streiner, 2003; Miller, 1995). The error term can be approximated from 'parallel scores' (i.e. repeated measures). When it is not feasible to acquire two full measurements, two parallel half-score measures might be an option; e.g. an odd-even split in an event-related study design (van Baal et al., 1998; van Beijsterveldt et al., 2001). Resting-state fMRI data is uniquely suited for 'split-half' reliability modelling because the temporal nature of the BOLD signal makes it possible to create two parallel half-score measures by splitting scan session data into two or more parts (Brandmaier et al., 2018). For associations, such as functional connectivity based on the temporal correlation of two BOLD signals, the error term is defined by the reliability of the two measures (Spearman, 1904). The true association can be obtained by scaling the observed association with a factor inversely proportional to the reliability (Mueller et al., 2015; Golestani and Goodyear, 2011). However, classical disattenuation requires the correction factor to be known a priori and carries the risk of overcorrecting the association (Wang, 2010). Instead, a structural equation measurement model can be applied to the half-score measures to derive a latent variable representing the trait of interest that is "free" of measurement error, without the need for an a priori correction factor or the risk of overcorrection (Brandmaier et al., 2018; Cooper et al., 2019). Such measurement models have previously been applied in twin studies to separate the variance attributed to measurement error from genetic and environmental variance components to obtain a robust heritability estimate for the reliable part of the variation (van Baal et al., 1998; van Beijsterveldt et al., 2001; Ge et al., 2017; Teeuw et al., 2019).
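The classical disattenuation described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the procedure used in this study: the function name `disattenuate`, the simulated data, and the choice to treat the trait as measured without error are all illustrative.

```python
import numpy as np


def disattenuate(r_observed, reliability_x, reliability_y=1.0):
    """Spearman's correction: divide the observed correlation by the
    square root of the product of the two reliabilities."""
    return r_observed / np.sqrt(reliability_x * reliability_y)


# Simulated example (hypothetical data): a true score that correlates 0.5
# with a trait, observed twice with independent measurement error.
rng = np.random.default_rng(0)
n = 20000
true_score = rng.normal(size=n)
trait = 0.5 * true_score + np.sqrt(1 - 0.5 ** 2) * rng.normal(size=n)
half1 = true_score + rng.normal(size=n)  # parallel measure 1
half2 = true_score + rng.normal(size=n)  # parallel measure 2

# Reliability of one parallel measure, approximated by the correlation
# between the two parallel measures.
reliability = np.corrcoef(half1, half2)[0, 1]

# The observed association is attenuated; the corrected one is scaled up.
r_observed = np.corrcoef(half1, trait)[0, 1]
r_corrected = disattenuate(r_observed, reliability)
print(f"reliability={reliability:.2f} observed={r_observed:.2f} corrected={r_corrected:.2f}")
```

Because the correction divides by an estimated reliability, sampling error in that estimate can push the corrected value too high, which is the overcorrection risk noted above.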
The measurement model is, however, suited for many types of measures and study designs, as long as some form of parallel scores can be obtained. This implies that reliability modelling can be useful for the typical cross-sectional resting-state fMRI dataset of unrelated individuals. However, little is known about the effectiveness of reliability modelling of resting-state fMRI functional connectivity and its ability to uncover the true associations between functional connectivity and other traits.
Here, we examine the benefits of reliability modelling of functional connectivity in association with physiological, cognitive and behavioral traits. For that purpose, a measurement model was applied to functional connectivity estimates from the resting-state functional MRI scans of the Human Connectome Project Young Adult (HCP-YA) cohort. The efficiency of reliability modelling was evaluated for various sample sizes and scan durations on (i) the ability to extract a stable and reliable component of functional connectivity and (ii) the improvement in the associations of functional connectivity with various traits. We also aimed to replicate our previous findings of increased heritability estimates for the reliable component of functional connectivity (Teeuw et al., 2019). Finally, we investigate the use of the measurement model for a typical study that comprises a single resting-state fMRI session.

Human connectome project
We utilized data from the publicly available, extensively processed fMRI data package that is part of the Human Connectome Project Young Adult cohort. The package provides data for 1003 related individuals (siblings, including monozygotic and dizygotic twins; aged 22 to 37 years) from 429 families with four complete runs of resting-state fMRI scans (Table 1) and consists of precomputed, denoised and centered BOLD signal time series for nodes in the brain based on group-ICA decomposition of the data at various decomposition levels. The acquisition parameters and processing of this data have been described elsewhere (Glasser et al., 2013; see supplementary methods for summary). All analyses were performed using the group-ICA decomposition with 50 nodes (Supplementary Figure S1).

Functional connectivity
Functional connectivity estimates were obtained by calculating the temporal correlation coefficient between the BOLD time series of two nodes using Pearson correlation (Biswal et al., 1995; van den Heuvel and Hulshoff Pol, 2010). Functional connectivity was estimated for different temporal blocks of the time series data to provide full-, half-, and quarter-score estimates of functional connectivity for the purpose of reliability modelling (Fig. 1 A). All statistical and mathematical operations on functional connectivity were performed on Fisher's r-to-Z transformed functional connectivity estimates.
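As a rough illustration of the estimation step above, the following sketch computes Fisher-transformed Pearson correlations for the full time series and for its halves and quarters. The function names and the random stand-in data are assumptions made for illustration; the simple contiguous split shown here is not the multi-session block scheme used for the parameter sweep.

```python
import numpy as np


def fisher_z_connectivity(ts):
    """Pearson correlation between all node pairs, Fisher r-to-Z transformed.

    ts: array of shape (n_volumes, n_nodes).
    """
    r = np.corrcoef(ts, rowvar=False)
    np.fill_diagonal(r, 0.0)  # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(r)


def block_connectivity(ts, n_blocks):
    """Split the time series into contiguous temporal blocks and
    estimate functional connectivity within each block."""
    return [fisher_z_connectivity(b) for b in np.array_split(ts, n_blocks, axis=0)]


# Hypothetical stand-in data: 1200 volumes for 50 nodes.
rng = np.random.default_rng(1)
ts = rng.normal(size=(1200, 50))

full = block_connectivity(ts, 1)[0]   # full-score estimate
halves = block_connectivity(ts, 2)    # two half-score estimates
quarters = block_connectivity(ts, 4)  # four quarter-score estimates
```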

Reliability modelling of functional connectivity
A measurement model is applied to half-score measures of functional connectivity to extract a reliable component of functional connectivity represented by the latent variable $F_i$ (Fig. 1). Variance shared between the half-score measures that can be attributed to the latent variable is quantified by the path coefficients ($f_i$), which are constrained to be equal in proportion of variance for both half-score measures. Any residual variance of the half-score measures, which includes measurement error, is considered noise and is represented by the measurement-specific latent variables $Es_i$, quantified by the path coefficients $es_i$.
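To make the variance decomposition concrete, the sketch below uses a method-of-moments shortcut rather than the maximum-likelihood fit in OpenMx. It assumes standardized half-score measures with equal standardized loadings, so that the implied split-half correlation equals the squared loading. The function name and simulated data are illustrative only.

```python
import numpy as np


def split_half_decomposition(h1, h2):
    """Moment-based estimate of a one-factor model with equal standardized
    loadings on two half-score measures: corr(h1, h2) = f ** 2."""
    r = np.corrcoef(h1, h2)[0, 1]
    if r <= 0:
        raise ValueError("non-positive split-half correlation; no shared factor")
    return {
        "f": np.sqrt(r),         # standardized path from the latent factor
        "es": np.sqrt(1.0 - r),  # standardized residual path
        "reliable_variance": r,  # proportion of variance that is shared
    }


# Hypothetical data: a latent component with loading 0.7 on both halves.
rng = np.random.default_rng(2)
n = 20000
latent = rng.normal(size=n)
h1 = 0.7 * latent + np.sqrt(1 - 0.49) * rng.normal(size=n)
h2 = 0.7 * latent + np.sqrt(1 - 0.49) * rng.normal(size=n)

result = split_half_decomposition(h1, h2)
print({k: round(float(v), 2) for k, v in result.items()})
```

Unlike this shortcut, a full structural equation model also handles covariates on the means and provides fit statistics, which is why the analyses in this study rely on the fitted model.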
To demonstrate the ability of the measurement model to extract a stable and reliable component of functional connectivity, test-retest reliability is estimated between two latent variables $F_1$ and $F_2$ that each represent a reliable component of functional connectivity based on two half-score measures of functional connectivity from independent scan sessions acquired on different days (Fig. 1 B). The corrected test-retest reliability (Rphf) is estimated as the correlation between the two latent variables. Test-retest reliability from the measurement model is compared to the uncorrected test-retest reliability of functional connectivity estimated as the correlation between the two half-score measures of functional connectivity (Fig. 1 C). Improvement in corrected test-retest reliability is determined by the pointwise difference compared to the uncorrected test-retest reliability estimate (Rphm).

Fig. 1. [...] connectivity. The model is used to estimate corrected test-retest reliability between the two reliable components of functional connectivity at different scan sessions (Rphf) quantified by the correlation between the two latent variables. (C) A standard association model is used to estimate the uncorrected test-retest reliability between the observed half-score measures of functional connectivity (Rphm) quantified by the correlation between the two observed variables. (D) A measurement model with one latent variable representing the reliable component of functional connectivity of two half-score measures and another latent variable representing the trait. This model is used to estimate the corrected association between the reliable component of functional connectivity and the trait (Rphf) quantified by the correlation between the two latent factors. (E) A standard association model is used to estimate the uncorrected correlation between the observed full-score measure of functional connectivity and the trait.
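The comparison between corrected and uncorrected test-retest reliability can be sketched with a moment-based stand-in for the two-factor model. This is an assumption-laden illustration, not the fitted OpenMx model: it assumes equal standardized loadings within each day, so that the latent correlation equals the mean cross-day correlation divided by the geometric mean of the two within-day correlations. Half scores are approximated here by averaging two quarter scores, and all data are simulated.

```python
import numpy as np


def corrected_test_retest(q1, q2, q3, q4):
    """Latent correlation between day 1 (q1, q2) and day 2 (q3, q4)."""
    c = np.corrcoef(np.vstack([q1, q2, q3, q4]))
    within = np.sqrt(c[0, 1] * c[2, 3])  # assumes both are positive
    cross = np.mean([c[0, 2], c[0, 3], c[1, 2], c[1, 3]])
    return cross / within


# Hypothetical data: day-specific reliable components that correlate 0.8.
rng = np.random.default_rng(3)
n = 20000
stable = rng.normal(size=n)
f1 = np.sqrt(0.8) * stable + np.sqrt(0.2) * rng.normal(size=n)
f2 = np.sqrt(0.8) * stable + np.sqrt(0.2) * rng.normal(size=n)


def noisy(f):
    # Each quarter score loads 0.7 on its day's reliable component.
    return 0.7 * f + np.sqrt(0.51) * rng.normal(size=n)


q1, q2, q3, q4 = noisy(f1), noisy(f1), noisy(f2), noisy(f2)

uncorrected = np.corrcoef((q1 + q2) / 2, (q3 + q4) / 2)[0, 1]
corrected = corrected_test_retest(q1, q2, q3, q4)
print(f"uncorrected={uncorrected:.2f} corrected={corrected:.2f}")
```

The ratio is not bounded at 1, whereas the latent correlation in a fitted model is; values above 1 in this sketch would signal sampling error or model misfit.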
The measurement model can be adapted to estimate the corrected association between the reliable component of functional connectivity and a trait (Fig. 1 D). For comparison, the uncorrected association between the full-score measure of functional connectivity and the trait is estimated and compared over all connections (Fig. 1 E). The overall improvement in association strength (i.e. the average improvement factor) is determined by the slope coefficient of the linear regression of the association strengths from the measurement model onto the association strengths from the standard association model.
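The improvement factor defined above is the slope of an ordinary least-squares regression across connections. A minimal sketch, with hypothetical association strengths and an illustrative function name:

```python
import numpy as np


def improvement_factor(r_uncorrected, r_corrected):
    """Slope of the linear regression of corrected association strengths
    (measurement model) onto uncorrected ones (standard model)."""
    slope, _intercept = np.polyfit(r_uncorrected, r_corrected, deg=1)
    return slope


# Hypothetical association strengths for 25 connections, constructed so
# that every corrected association is exactly 1.2 times the uncorrected one.
r_unc = np.linspace(-0.2, 0.2, 25)
r_cor = 1.2 * r_unc
factor = improvement_factor(r_unc, r_cor)
print(round(float(factor), 3))
```

A slope above 1 indicates that associations are, on average, stronger after reliability modelling; the intercept is estimated but not used.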
All models included fixed effects of sex, age, head motion and HCP processing pipeline version as covariates on the means of the functional connectivity estimates, and fixed effects of sex and age as covariates on the means of the trait. Head motion was approximated by mean framewise displacement (Power et al., 2012; Supplementary Figure S2). All models were specified in the OpenMx structural equation modelling (SEM) software package (Neale et al., 2016; Boker et al., 2018) for R (R Core Team, 2018). The definition of the measurement model in OpenMx is provided as a supplementary data file (Supplementary Data File F1), and is available on GitHub with an example (Teeuw, GitHub 2020).
The suitability of applying a measurement model to the data was assessed with the goodness of fit metrics Comparative Fit Index (CFI) and root-mean-square error of approximation (RMSEA). Model fits with a CFI > 0.95 and RMSEA < 0.05 were deemed a good fit, model fits with a CFI > 0.90 and RMSEA < 0.08 were deemed an acceptable fit, and the remaining models (CFI < 0.90 or RMSEA > 0.08) were deemed an unsuitable fit (Hu and Bentler, 1999; Browne and Cudeck, 1992). All models with an unsuitable fit were excluded from statistical analyses.
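The fit criteria above translate directly into a small classification rule. The function below is an illustrative sketch; its name is hypothetical, and treating values exactly on a threshold as failing the stricter category is an assumption, since the text does not specify boundary handling.

```python
def classify_fit(cfi, rmsea):
    """Classify a model fit using the CFI and RMSEA thresholds given above."""
    if cfi > 0.95 and rmsea < 0.05:
        return "good"
    if cfi > 0.90 and rmsea < 0.08:
        return "acceptable"
    return "unsuitable"  # excluded from statistical analyses


for cfi, rmsea in [(0.97, 0.03), (0.93, 0.06), (0.85, 0.03), (0.97, 0.10)]:
    print(cfi, rmsea, classify_fit(cfi, rmsea))
```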

Table 2
Parameters and their values used in the evaluation of the minimal requirements on the input dataset for the measurement model.

Physiological, cognitive, and behavioral traits
The Human Connectome Project provides rich phenotypic information on the participants. Because of the computational complexity of the measurement models, we used the results from the standard association model to select seven traits for extensive testing (Supplementary Table S1): five traits (BPDiastolic: diastolic blood pressure levels; CogTotalComp_AgeAdj: total composite score on cognition adjusted for age; WM_Task_2bk_Acc: accuracy on all conditions in the 2-back working memory task; Emotion_Task_Median_RT: median response time for each condition in the emotion task; and PicVocab_AgeAdj: picture vocabulary test score adjusted for age) were among the measures most strongly associated with functional connectivity at any individual connection, and two traits (Gambling_Task_Reward_Perc_Larger: percentage of trials that received a 'larger' prediction in the gambling task; and Taste_AgeAdj: score on the taste intensity test adjusted for age) were chosen because they were only weakly associated with functional connectivity. For these seven traits, we estimated the association with all functional connectivity measures.
For the remaining 103 traits, measurement models were computed only for the 20 connections most strongly associated with the trait and the 5 connections most weakly associated with the trait (i.e. near-zero association), based on the results from the standard association model. The connections were selected independently for each trait (Supplementary Data File F2). This sampling scheme provides a good approximation of the actual improvement factor in association strength for the seven fully sampled traits, with a mean absolute difference in improvement factor of 2% (range from 0% to 5%; Supplementary Table S2).
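The sparse sampling scheme can be sketched as a ranking on absolute association strength. The function name and the random association values are illustrative assumptions; ranking on the absolute value of the correlation is also an assumption, as the text does not state how strength was ordered.

```python
import numpy as np


def select_connections(r_standard, n_strong=20, n_weak=5):
    """Pick the connections with the strongest and the weakest (near-zero)
    associations from the standard association model for one trait."""
    order = np.argsort(np.abs(r_standard))  # ascending absolute strength
    weak = order[:n_weak]
    strong = order[::-1][:n_strong]
    return strong, weak


# Hypothetical association strengths for the 1225 connections of one trait.
rng = np.random.default_rng(4)
r_standard = rng.normal(scale=0.05, size=1225)
strong, weak = select_connections(r_standard)
print(len(strong), len(weak))
```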

Heritability of functional connectivity
To emphasize that the measurement model can be applied to the typical dataset consisting of unrelated individuals, up to this point we assumed that the subjects were independent. However, the Human Connectome Project cohort includes families with monozygotic and dizygotic twins and their siblings, and families with non-twin siblings (Supplementary Table S3). We aimed to replicate our previous findings of increased heritability estimates for the reliable and stable component of functional connectivity in a longitudinal adolescent twin cohort (Teeuw et al., 2019). In brief, genetic modelling of data from twins and siblings allows for the decomposition of the variance of a trait ($V$) into genetic and environmental components. Often, three variance components representing additive genetic ($A$), common environmental ($C$) and unique environmental ($E$) influences are considered (Boomsma et al., 2002; Posthuma et al., 2000; Neale and Cardon, 1992). Heritability is the standardized additive genetic component: $h^2 = A / V = A / (A + C + E)$. The unique environmental influences are confounded by measurement error ($M$) that can be separated from the "true" unique environmental ($E'$) influences by the measurement model: $E = E' + M$. The variance of the reliable trait (i.e. the reliable component in the measurement model) becomes $V' = V - M$ or $V' = A + C + E'$. Heritability of the reliable trait is estimated as the standardized additive genetic component after excluding measurement error: $h^2 = A / V' = A / (A + C + E')$. The heritability of the reliable component of functional connectivity was estimated for a measurement model on the half-score measures of functional connectivity (Supplementary Figure S3). Heritability estimates of the reliable component of functional connectivity are compared to heritability estimates from the full (i.e. uncorrected) measure of functional connectivity (Supplementary Figure S3). Full details on the heritability analysis are provided in the Supplementary Materials.
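The effect of removing measurement error from the unique environmental component can be shown with simple arithmetic. The variance components below are hypothetical values chosen for illustration, not estimates from this study, and the function names are likewise illustrative.

```python
def heritability(a, c, e):
    """Standardized additive genetic variance: A / (A + C + E)."""
    return a / (a + c + e)


def reliable_heritability(a, c, e, m):
    """Heritability after removing measurement error M from E,
    i.e. A / (A + C + E') with E' = E - M."""
    if not 0 <= m <= e:
        raise ValueError("measurement error must lie between 0 and E")
    return a / (a + c + (e - m))


# Hypothetical variance components (illustration only).
a, c, e, m = 0.4, 0.1, 0.5, 0.3
h2_full = heritability(a, c, e)
h2_reliable = reliable_heritability(a, c, e, m)
print(f"full={h2_full:.2f} reliable={h2_reliable:.2f}")
```

Because the additive genetic variance is unchanged while the total variance shrinks, the heritability of the reliable component cannot be lower than that of the uncorrected measure in this simple arithmetic; fitted estimates can still differ because all components are re-estimated.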

Evaluation of the measurement model at different sample sizes and scan durations
We performed a parameter sweep to empirically determine the efficiency of the measurement model in improving the reliability of functional connectivity and the strength of the association between traits and functional connectivity for various sample sizes and total scan durations (Table 2). We repeated the analysis for four sessions across two different days, two sessions on the same day, and single-session scan data.
Each combination of total scan duration and sample size was sampled 100 times per functional connection, with a random set of the desired number of participants drawn from the full sample of participants at each iteration. To reduce the computational burden of evaluating the performance of the measurement model for all connections, the same 25 connections identified through the sparse sampling scheme described previously were used for each of the seven exemplar traits.
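The repeated random draws can be sketched as follows. Drawing without replacement within an iteration is an assumption (the text only specifies "a random set"), and the function name and the sample size of 50 are hypothetical.

```python
import numpy as np


def draw_subsamples(n_participants, sample_size, n_iter=100, seed=0):
    """Draw n_iter random participant subsets of the requested size.
    Sampling is without replacement within each iteration (an assumption)."""
    rng = np.random.default_rng(seed)
    return [
        rng.choice(n_participants, size=sample_size, replace=False)
        for _ in range(n_iter)
    ]


# Example: 100 random subsets of 50 participants from the 1003 available.
subsets = draw_subsamples(n_participants=1003, sample_size=50)
print(len(subsets), subsets[0][:5])
```

Each subset would then be used to refit the models for one combination of sample size and total scan duration.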
BOLD time series for the desired scan durations were extracted from the original full-length BOLD time series data by distributing four time blocks of equal length across all four scan sessions, with each block starting at the first volume of each scan session. The blocks were then concatenated to construct the time series of the full- and half-score measures of functional connectivity (Supplementary Figure S4). This distributed approach was adopted for multi-session scan data to prevent half-score measures from crossing scan boundaries; e.g. for a total time series length of 1600 volumes (20 min), the second half-score measure would otherwise have been computed across data from both scan #1 (volumes 801:1200) and scan #2 (volumes 1:400), which has a detrimental effect on the reliability of the second half-score. This distributed approach was also used when the full time series could have fit in the length of a single session (i.e. total scan duration < 15 min) to ensure time series are based on the same data. For single-session data, contiguous blocks were selected (Supplementary Figure S4).

Fig. 2.
Test-retest reliability of functional connectivity estimates. (A) Improvement in test-retest reliability between the standard association model (x-axis) and the measurement model (y-axis); data points are scaled by the average proportion of variance explained by the reliable component, thereby emphasizing the more reliable and stable connections, and color-coded by the Comparative Fit Index. (B) Point-wise improvement in test-retest reliability between the standard association model and the measurement model. (C) Proportion of variance of the quarter-score measures explained by the reliable components of functional connectivity. For all panels, uncorrected test-retest reliability was estimated as the association between half-score measures of functional connectivity. The corrected test-retest reliability from the measurement model was estimated as the association between the two reliable components of functional connectivity based on the quarter-score measures of functional connectivity. The red lines indicate the mean of the distributions.
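The distributed block extraction for multi-session data can be sketched as below. The function name and stand-in data are illustrative, and the assignment of the first two sessions to one half-score and the last two to the other is an assumption made here; Supplementary Figure S4 holds the actual layout.

```python
import numpy as np


def distributed_blocks(sessions, total_volumes):
    """Take an equal-length block from the start of every session and
    build full- and half-score time series that never cross a session
    boundary within a block.

    sessions: list of arrays, each of shape (n_volumes, n_nodes).
    """
    block_len = total_volumes // len(sessions)
    blocks = [s[:block_len] for s in sessions]
    mid = len(blocks) // 2
    half1 = np.concatenate(blocks[:mid], axis=0)  # assumed: first two sessions
    half2 = np.concatenate(blocks[mid:], axis=0)  # assumed: last two sessions
    full = np.concatenate(blocks, axis=0)
    return full, half1, half2


# Hypothetical stand-in: four sessions of 1200 volumes and 3 nodes.
rng = np.random.default_rng(5)
sessions = [rng.normal(size=(1200, 3)) for _ in range(4)]

# 1600 volumes in total gives 400 volumes from the start of each session.
full, half1, half2 = distributed_blocks(sessions, total_volumes=1600)
print(full.shape, half1.shape, half2.shape)
```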

Minimal requirements on the input dataset
The requirements, in terms of sample size and total scan duration, for a dataset suitable for reliability modelling were evaluated empirically to determine the threshold where the goodness of fit indices for the measurement model started to deteriorate. Since no universal absolute threshold exists, the proportion of bad fits over the hundred iterations for each combination of sample size and total scan duration is provided with a lower and upper boundary marked at 25% and 50% quantiles. The same goodness of fit indices, Comparative Fit Index (CFI) and root-mean-square error of approximation (RMSEA), and their judgement criteria were used as described before.
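Computing the proportion of bad fits for one combination of sample size and scan duration amounts to a simple tally. The function name and the fit values below are hypothetical.

```python
import numpy as np


def proportion_bad_fits(cfi, rmsea):
    """Fraction of model fits with CFI < 0.90 or RMSEA > 0.08."""
    cfi = np.asarray(cfi, dtype=float)
    rmsea = np.asarray(rmsea, dtype=float)
    bad = (cfi < 0.90) | (rmsea > 0.08)
    return float(bad.mean())


# Hypothetical fit indices from four iterations of one combination.
cfi_values = [0.97, 0.85, 0.92, 0.99]
rmsea_values = [0.03, 0.04, 0.09, 0.07]
p_bad = proportion_bad_fits(cfi_values, rmsea_values)
print(p_bad, "above 25%:", p_bad > 0.25, "above 50%:", p_bad > 0.50)
```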

High consistency of group-level mean functional connectivity
Group-level mean functional connectivity was highly consistent across the two half-score measures of functional connectivity (rho = +0.996; intraclass correlation ICC = 0.995, see supplementary information for details), with functional connectivity estimates ranging from -0.52 to +0.66 (mean FC = 0.003) (Supplementary Figure S5), and absolute differences between the two half-score measures of less than 0.06 for individual connections. However, there is high variation within individuals.

Reliability modelling improves test-retest reliability between scan sessions
At the level of individual connections, test-retest reliability of functional connectivity between scan sessions acquired on different days improved substantially for 760 connections (62% of all 1225 connections) with an acceptable or good fit of the measurement model (Fig. 2; Supplementary Figure S6). On average, the uncorrected test-retest reliability estimate of functional connectivity was +0.49 (range = +0.17 to +0.82; Fig. 2 A) and the test-retest reliability estimate of the reliable component of functional connectivity was +0.83 (range = +0.60 to +1.00; Fig. 2 A). The test-retest reliability estimates improved by 0.33 points on average (range = +0.13 to +0.82; Fig. 2 B). Connections with lower uncorrected test-retest reliability improved more than connections with already high test-retest reliability due to the ceiling effect of the upper bound of +1.00 on test-retest reliability estimates (Fig. 2 A). On average, 44% of the variance of the quarter-score measures of functional connectivity was explained by the reliable components of functional connectivity (range = 7% to 79%; Fig. 2 C).

Reliability modelling improves the association between functional connectivity and traits
The improvement in the strength of the associations between functional connectivity and the seven extensively tested traits ranged from +17.0% to +23.7% (Fig. 3 A; Supplementary Figure S7). On average, the improvement in association strength for all 110 traits was +20% (range = +12% to +30%; Fig. 3 B; Supplementary Data File F3), using a sparse sampling scheme to approximate the improvement factor that would have been obtained with the full set of connections (Supplementary Table S2). On average, 1035 connections (84% of all connections) passed the goodness of fit criteria, with highly similar distributions for the seven traits (Supplementary Figure S6). On average, 50% of the variance of the half-score measures of functional connectivity is explained by the reliable component, with a range from 17% to 81%.

Reliability modelling increases heritability estimates of functional connectivity
The heritability estimates of full-score functional connectivity were on average 39% (range = 0% to 75%; Supplementary Figure S8). The heritability estimates of the reliable component of functional connectivity were on average 59% (range = 0% to 93%; Supplementary Figure S8). On average, the heritability estimates increased by 20 percentage points (range = -3 to +54 percentage points; Supplementary Figure S8).

Efficiency and minimal requirements for using a measurement model
Test-retest reliability of functional connectivity depends on scan duration and the number of scan sessions (Fig. 4). With the standard association model, the average uncorrected test-retest reliability across connections in the brain for a single session outperformed the uncorrected test-retest reliability of multi-session scan data (dashed lines; Fig. 4). However, given the same number of scan sessions, the corrected test-retest reliability from the measurement model (solid lines; Fig. 4) exceeded the uncorrected test-retest reliability for all variants and across all scan durations (dashed lines; Fig. 4). Of special note is the dataset variant with two scan sessions. The current measurement model (Fig. 1 B) estimates the corrected test-retest reliability based on the reliable components of functional connectivity that are corrected for intra-session variation within each session only. Despite improvement in the test-retest reliability, there is still substantial variation between the two sessions. If the quarter scores Q2 and Q4 are interchanged, such that the reliable components of functional connectivity now account for inter-session variation instead of intra-session variation, the corrected test-retest reliability approaches near-perfect scores. Similar results were observed for single-session and four-session scan data after interchanging quarter scores Q2 and Q4. The sample size had little effect on the average test-retest reliability apart from increased variation due to sampling bias for the smaller sample sizes (Supplementary Figure S9). Note that for improbable connections (e.g. connection 47-49) model fits started to degrade (i.e. increasing number of bad fits out of the 100 random samples drawn).
All seven traits exhibit the same general pattern of improvement in the association strength, with greater benefit from reliability modelling for shorter scan durations and a slight increase for the smaller sample sizes, up to a 1.8-fold increase averaged over the 100 iterations per combination of total scan duration and sample size (Fig. 5; Supplementary Figure S10). Note that for the combinations of very low sample size (25 subjects) and short scan durations (≤ 1600 volumes, or ≤ 20 min), model fits started to degrade (i.e. increasing number of bad fits out of the 100 random samples drawn for a specific combination of sample size and total scan duration for each of the seven traits). Results for two scan sessions on the same day are nearly identical (Supplementary Figure S11). However, the improvement in association strength is slightly diminished for single-session scan data; up to a 1.5-fold increase for smaller sample sizes and short scan durations (Supplementary Figure S11).

Fig. 4. Test-retest reliability depends on scan duration and number of sessions. The average uncorrected test-retest reliability across all connections (y-axis) was estimated for the full-score estimate of functional connectivity for four variants of the full dataset (symbol- and colour-coded curves): a single scan session where the measurement model accounts for intra-session variation between half-scores of the same session (*), two scan sessions on the same day accounting for inter-session variation between half-scores of different sessions (▴), two scan sessions on the same day accounting for intra-session variation between half-scores of the same session (■), and four scan sessions across two days accounting for inter-session variation between half-scores of sessions on the same day (•). The total scan duration was varied from 5 min up to the maximum allowed by the dataset of 15 min per available scan session.
Comparing the different variants of the dataset (i.e. single session, two sessions on the same day, and four sessions across two days) revealed a consistent pattern for all seven measures; for brevity we present the results for the total composite score on cognition adjusted for age (CogTotalComp_AgeAdj) (Supplementary Figure S12). Keeping the total scan duration the same for a single session and two sessions on the same day, both datasets produce similar uncorrected associations (Supplementary Figure S12A). However, the multi-session variant produces stronger corrected associations after application of a measurement model (Supplementary Figure S12A). There is no additional benefit to using four sessions across two days compared to two sessions on the same day (Supplementary Figure S12B). However, it should be noted that the current model only accounts for measurement error between the two half-scores (i.e. Day 1 versus Day 2); more elaborate models that account for both inter- and intra-session measurement error might still benefit from multi-session data across different days (Brandmaier et al., 2018). Similarly, there is no benefit to splitting single-session data into more than two half-score measures of functional connectivity (Supplementary Figure S12C).
The goodness of fit assessment from the parameter sweep was used to evaluate the minimal requirements on the input dataset in terms of sample size and total scan duration for reliability modelling. Although no clear boundary can be defined when a measurement model is no longer suitable or practical to use, the chance that the measurement model does not describe the data well for a random sample of participants starts to increase with lower sample size or shorter scan durations (Fig. 6), with similar profiles for all seven measures (Supplementary Figure S13).

Discussion
We have shown that reliability modelling of functional connectivity using a measurement model on split-session half-score estimates of functional connectivity is able to extract a reliable component of functional connectivity with improved test-retest reliability between scan sessions acquired on separate days. Secondly, we found that the reliable component of functional connectivity is more strongly associated with traits than the full-score estimate of functional connectivity. Finally, we have empirically evaluated the minimal requirements of the dataset for reliability modelling of functional connectivity in terms of scan duration and sample size. We have previously reported increased heritability estimates for the stable and reliable component of functional connectivity in a longitudinal adolescent twin cohort (Teeuw et al., 2019). Here, we have replicated this finding in the Human Connectome Project.

Fig. 5. Improvement in association strength between functional connectivity and the traits diastolic blood pressure levels (BPDiastolic), age-adjusted total cognitive component (CogTotalComp_AgeAdj), and age-adjusted taste test score (Taste_AgeAdj) for various sample sizes (x-axis) and total scan duration (color-coding). Improvement factor (y-axis) is defined by the slope coefficient from the linear regression of the corrected association strength onto the uncorrected association strength over all sparsely sampled connections. Color-shaded bands represent the 95% confidence interval of the means. For all panels, a standard association model was used to estimate the uncorrected association between the full-score measure of functional connectivity and the traits. A measurement model applied to the half-score measures of functional connectivity was used to estimate the corrected association between the reliable component of functional connectivity and the traits. The remaining four extensively tested measures showed similar patterns (Supplementary Figure S10).

Fig. 6. Percentage of sampled connections for each combination of sample size and total scan duration for which the goodness of fit for the behavioral association measurement models deteriorated below acceptable levels (CFI < 0.90 or RMSEA > 0.08), averaged across all seven measures (see Supplementary Figure S13 for the profiles of the individual measures). Total scan duration is reported in minutes. Dotted lines mark the boundary where on average more than 25% and 50% of the model fits are considered bad.

Reliability modelling is able to extract a stable and more reliable component of functional connectivity
The moderate uncorrected test-retest reliability of functional connectivity (average rho = +0.49) that we found is comparable to other studies using the HCP Young Adult dataset (Shah et al., 2016; Ge et al., 2017; S. Noble et al., 2017a; Mejia et al., 2018; Elliot et al., 2019). Many factors influence test-retest reliability of functional connectivity, and as such, the test-retest reliability can vary substantially between datasets, ranging from poor (mean ICC ~0.15) to moderate (mean ICC ~0.65) (Noble et al., 2019; S. Noble et al., 2017b; Andoh et al., 2017). Despite the high quality of the HCP Young Adult dataset, we found a substantial improvement in test-retest reliability using a measurement model (average increase = +0.33), in some cases resulting in good to excellent scores (average rho = +0.82). Longer scan duration has a positive effect on the uncorrected test-retest reliability, as has previously been reported (Laumann et al., 2015; S. Noble et al., 2017a; Mejia et al., 2018; Elliot et al., 2019). However, there is a penalty to inter-session test-retest reliability for multi-session scan data that should be taken into consideration when setting up a new study design. Other methods that improve test-retest reliability, such as disattenuation (i.e. scaling a measure by its reliability to obtain a true estimate), shrinkage (i.e. gravitating unreliable measures towards a group-mean estimate), or combining multiple modalities (e.g. resting-state and task-based functional connectivity), are able to increase the reliability of functional connectivity by anywhere from +25% up to two-fold (Mueller et al., 2015; Shou et al., 2014; Mejia et al., 2018; Elliot et al., 2019). Our results show that the split-session measurement model is able to extract a component of functional connectivity from independent scan sessions acquired on separate days that is more reliable than the individual half-score estimates of functional connectivity.
This component represents a stable and reliable component of functional connectivity for a single link connection that explains on average about half (44%) of the variance in the split-session measurements. The remaining half is attributed to random variation between split-session measurements and is considered "measurement error". In a 7-Tesla resting-state functional MRI study, a similar proportion (50%) of the variance could be attributed to spontaneous neural activity, and the other half to nonthermal physiological noise (Bianciardi et al., 2009). Functional connectivity has a strong state-like nature and is influenced by intrinsic and extrinsic factors such as caffeine consumption, heart rate variability, circadian rhythm, daily mood, or attention (Wu et al., 2014; Choe et al., 2015; Hodkinson et al., 2014; Facer-Childs et al., 2019; Ismaylova et al., 2018; Geerligs et al., 2017). These short-term fluctuations in connectivity strength have been the topic of investigation for dynamic functional connectivity (Chang and Glover, 2010; Handwerker et al., 2012; Hutchison et al., 2013; Abrol et al., 2017). The residual variance that is specific to the individual half-score estimate of functional connectivity is considered "measurement error" in the measurement model, but is likely to represent both different sources of measurement error (Brandmaier et al., 2018) and relevant biological transients (Ge et al., 2017). For example, the increased variation between sessions (either same day or several days apart) that resulted in lower test-retest reliability will be treated as measurement-specific variation by the measurement model, but may reflect the "mental state" of the participants at the time of the scan.
Instead, the stable component of functional connectivity would be of particular interest for researchers in the pursuit of a reliable and state-independent (or "trait-like") endophenotype that could serve as a biomarker (Beauchaine and Constantino, 2017).
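The variance decomposition described above can be illustrated with a small simulation in the spirit of classical test theory. This is only a sketch under simplifying assumptions (two parallel half-scores, normally distributed and independent errors, unit total variance); it is not the OpenMx model used in the study. The stable share of 0.44 is the average reported above, whereas the sample size, seed, and function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_half_scores(n, stable_share, seed=0):
    """Simulate a stable component plus independent error for two half-scores.

    Each half-score has unit variance: `stable_share` comes from the shared
    stable component and the remainder from half-specific "measurement error".
    """
    rng = random.Random(seed)
    sd_stable = math.sqrt(stable_share)
    sd_error = math.sqrt(1.0 - stable_share)
    stable = [rng.gauss(0.0, sd_stable) for _ in range(n)]
    half1 = [s + rng.gauss(0.0, sd_error) for s in stable]
    half2 = [s + rng.gauss(0.0, sd_error) for s in stable]
    return stable, half1, half2


stable, half1, half2 = simulate_half_scores(n=20000, stable_share=0.44)

# For parallel half-scores, their correlation estimates the stable share.
r_halves = pearson(half1, half2)
# The correlation of one half-score with the stable component plays the role
# of a standardized factor loading, i.e. the square root of the stable share.
loading = pearson(stable, half1)
print(round(r_halves, 2), round(loading, 2))
```

Under these assumptions the half-score correlation should land close to 0.44 and the loading close to its square root (about 0.66), which is the sense in which the stable component "explains" roughly half of the split-session variance.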

Reliability modelling improves association of functional connectivity with various traits by revealing the true association in the absence of measurement error
The stable and reliable component of functional connectivity obtained by the measurement model on half-score estimates of functional connectivity is more strongly associated with all 110 traits compared to the association between full-score functional connectivity and the traits (average improvement factor 1.2) in the full-sized dataset of the Human Connectome Project (N = 1003 participants; ~1 hour of resting-state fMRI data). Overall, measures were only weakly associated with the full-score measure of functional connectivity (maximum absolute rho < 0.25 for the connection with the strongest association). These low associations are typical for resting-state functional connectivity (Vaidya and Gordon, 2013; Kruschwitz et al., 2018; Geerligs et al., 2017; S. Noble et al., 2017a; Basten et al., 2015; Toschi et al., 2018; Siegel et al., 2017). Even at connectome level, functional connectivity is only moderately associated with behavioural measures (Smith et al., 2015; Finn et al., 2015; Rosenberg et al., 2016). It was recently shown that individual variations in the spatial distribution of functional brain network organization are stable trait-like features that may be associated with behaviour (Seitzman et al., 2019; Kong et al., 2018), and that general cognitive ability is associated with the stability of the dynamic functional connectome (Hilger et al., 2020). This could indicate that although the purpose of a measurement model is to provide a more reliable measure from parallel test scores, it could be the temporally stable (i.e. "trait-like") component extracted from the split-session half-score measures of functional connectivity that provides the improved associations with behavioural measures. The short-term stability of within-session functional connectivity likely reflects the "mental state" of the participant at the time of the scan.
Changes in the mental state of the participant between scan sessions contribute to the inter-session variation and negatively impact the test-retest reliability. Using multi-session scan data will allow the measurement model to account for these session-specific contributions of the mental states and extract the stable "trait-like" component. Splitting single-session scan data into more sections (e.g. using quarter-scores instead of half-scores to represent the latent component of functional connectivity) yielded almost identical results, with a slightly lower overall improvement in the association strength between the reliable component of functional connectivity and behaviour for the quarter-score model than for the half-score model. In addition, the increased complexity of the quarter-score model and the associated increase in computational runtime suggest there is little to gain from splitting the scan data of a single session into more than two parts. Splitting multi-session scan data into more than two parts may be beneficial in combination with a model that accounts for both intra- and inter-session variation (Brandmaier et al., 2018). It must be noted, however, that in some cases applying the measurement model decreases (and should decrease) the observed association. For example, when a true association is absent but arises by chance for the full-score measure, when the connection studied is not between two functionally connected regions, or when the connection is unstable or highly dynamic, we expect the corrected association to go down.
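To give a feel for the size of improvement factor that attenuation alone could produce, the sketch below uses the classical correction for attenuation together with the Spearman-Brown formula. These closed-form expressions are substituted here for the structural equation model used in the study and are not the authors' method. The half-score reliability of 0.44 is an assumption borrowed from the average stable share reported earlier, the observed correlation of 0.10 is hypothetical, and only the reliability of functional connectivity (not of the trait) is corrected.

```python
import math


def spearman_brown(r_half: float) -> float:
    """Reliability of the full-score (two halves combined) from the half-score correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def disattenuate(r_observed: float, reliability_fc: float) -> float:
    """Correct an observed correlation for unreliability of functional connectivity only."""
    if not 0.0 < reliability_fc <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_fc)


r_half = 0.44        # assumed half-score reliability (average stable share)
r_observed = 0.10    # hypothetical uncorrected full-score association

reliability_full = spearman_brown(r_half)
r_corrected = disattenuate(r_observed, reliability_full)
improvement_factor = r_corrected / r_observed

print(round(reliability_full, 3), round(improvement_factor, 3))
```

Note that in this simplified setting the improvement factor depends only on the reliability of the full-score measure, not on the size of the observed association; a chance association in the full-score measure would be scaled up by the same factor, which is one reason why the goodness of fit and the uncorrected results remain important.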

Studies with short scan duration and small sample size will experience greater benefits from reliability modelling
The parameter sweep reveals that the improvement in association strength is dependent on scan duration and, to a lesser extent, on sample size. Datasets with a shorter total scan duration (down to five minutes of resting-state fMRI data) show much greater benefit from reliability modelling (on average up to 1.8-fold increase in association strength, with a similar profile for all seven extensively tested traits). A similar pattern was observed for variants of the dataset with a single session or two sessions on the same day, but with diminished results for single-session scan data. The diminished results are most likely due to the higher test-retest reliability of single-session scan data. Multi-session scan data provided better results for the corrected associations after application of a measurement model, but no additional benefit was seen for four sessions across different days. However, it should be noted that the current model only accounts for measurement error between the two half-scores (i.e. Day 1 versus Day 2); more elaborate models that account for both inter- and intra-session measurement error might still benefit from multi-session data across different days (Brandmaier et al., 2018). Previously, scan duration has been reported to influence reliability and reproducibility of functional connectivity estimates (Laumann et al., 2015; S. Noble et al., 2017a; Mejia et al., 2018; Elliot et al., 2019), but estimates of the recommended scan duration for maximal reliability vary from 5 to 90 min. With typical scan durations for resting-state fMRI studies anywhere between the minimum recommended 5 min to 8 min (Waheed et al., 2016), equivalent to about 400 to 667 volumes in the Human Connectome Project dataset, most datasets can expect a decent boost in association strength with reliability modelling, despite the fact that the corrected associations remain modest.
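The conversion between scan duration and number of volumes quoted above can be checked with a few lines of arithmetic. The repetition time of 0.72 s is an assumption based on the HCP resting-state acquisition protocol and is not stated in this section; the function name is hypothetical.

```python
HCP_TR_SECONDS = 0.72  # assumed repetition time of the HCP resting-state scans


def minutes_to_volumes(minutes: float, tr_seconds: float = HCP_TR_SECONDS) -> int:
    """Number of fMRI volumes acquired in `minutes` at the given repetition time."""
    if tr_seconds <= 0:
        raise ValueError("repetition time must be positive")
    return round(minutes * 60.0 / tr_seconds)


for duration in (5, 8):
    print(duration, "min ->", minutes_to_volumes(duration), "volumes")
```

With this repetition time, 8 min corresponds to 667 volumes and 5 min to 417 volumes, in line with the "about 400 to 667" range given in the text. For a dataset with a longer repetition time, the same duration yields proportionally fewer volumes, so the curves from the parameter sweep may not transfer directly.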
Datasets with smaller sample sizes show a slightly greater benefit of reliability modelling. However, smaller sample sizes are accompanied by increased variation in the improvement factor that is due to sampling bias, which a measurement model cannot account for. It is therefore possible that the larger improvements we observe for smaller sample sizes are directly caused by the fact that the initial estimates of the uncorrected association were worse. For datasets with small sample sizes (≤ 50 participants), the utility of the measurement model starts to drop, with 50% to 80% of the sampled connections resulting in a bad fit. Note that, even in the full-sized dataset, on average 15% of the connections show a bad fit of the measurement model (the baseline rejection rate). This baseline is also present in the parameter sweep because we did not exclude connections with a bad fit prior to selecting the connections for sparse sampling. Previous studies have examined the sample size required for structural equation modelling, with initial estimates suggesting that at least 200 participants are needed (Boomsma, 1985), or, as a rule of thumb, ten times the number of estimated parameters (Bentler and Chou, 1987; Wolf et al., 2013). Sample sizes as low as 50 participants might be enough to obtain satisfactory fits for task-based fMRI (Sideridis et al., 2014). Our parameter sweep shows that variation due to sampling bias diminishes at sample sizes of around 100 to 150 participants, suggesting that, combined with a typical scan length of around 8 min, reliability modelling would be feasible for most contemporary resting-state fMRI studies.

Stronger genetic signal for the reliable component of functional connectivity
Measurement models have previously been used in the context of twin studies to obtain heritability estimates for the reliable portion of the variation (van Baal et al., 1998; van Beijsterveldt et al., 2001; Ge et al., 2017). We have previously applied a measurement model to functional connectivity in a longitudinal adolescent twin cohort (Teeuw et al., 2019). Here, we replicated our earlier finding that heritability estimates of functional connectivity can be increased substantially (from average h² = 39% to h² = 59%) by using a measurement model on data from split-half scan sessions. Previous studies on the heritability of functional connectivity using the Human Connectome Project dataset have typically reported low heritability for single link connections (Ge et al., 2017; Colclough et al., 2017; Adhikari et al., 2018). One earlier study applied a custom linear mixed effects model to repeated measures of functional connectivity from the Human Connectome Project dataset (i.e. scans on Day 1 and Day 2 were considered repeated measures, similar to the two half-score measures for the full-sized dataset used in this study) and reported similar improvements in heritability estimates of around +20 percentage points when averaged across connections for the major functional networks (Ge et al., 2017).
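The link between reliability and heritability can be made explicit with a simple worked calculation. The sketch assumes that measurement error carries no genetic variance, so that the heritability of the observed measure equals the heritability of the reliable component multiplied by the proportion of reliable variance. This is a simplification for illustration and not the twin model fitted in the study; the two heritability values are the averages reported above, and the function names are hypothetical. Because the reported values are averages across connections, the implied proportion is indicative only.

```python
def implied_reliable_share(h2_observed: float, h2_reliable: float) -> float:
    """Proportion of reliable variance implied by the two heritability estimates,
    assuming measurement error is entirely non-genetic."""
    if not 0.0 < h2_reliable <= 1.0:
        raise ValueError("heritability of the reliable component must lie in (0, 1]")
    return h2_observed / h2_reliable


def corrected_heritability(h2_observed: float, reliable_share: float) -> float:
    """Heritability of the reliable component given the share of reliable variance."""
    if not 0.0 < reliable_share <= 1.0:
        raise ValueError("reliable share must lie in (0, 1]")
    return h2_observed / reliable_share


share = implied_reliable_share(h2_observed=0.39, h2_reliable=0.59)
print(round(share, 2))  # 0.66
print(round(corrected_heritability(0.39, share), 2))  # 0.59
```

Under this assumption, the reported increase from 39% to 59% would correspond to roughly two thirds of the observed variance being reliable, with the remainder behaving as non-heritable measurement error.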

Methodological considerations for the application of a measurement model
There are a few methodological considerations of measurement models in general that should be mentioned (Muchinsky, 1996). First, it is important to note that the ultimate purpose of the measurement model is to obtain estimates that are closer to the true value in the absence of error in the measurements. The measurement model will provide more accurate estimates that can guide future studies, but will not change the quality of the data. Secondly, while we discussed our results in terms of improvement factors, correlations obtained with the measurement model are associations between the stable components (i.e. reliable variation) rather than between the full traits (i.e. full variation), and the two are not directly comparable. In addition, there is no ground truth available for resting-state functional connectivity of the human brain that could verify the correctness of the measurement model outputs. We would like to emphasise again that an improvement of association is not informative about the existence of "true" correlated neuronal activity. It is therefore good practice to always report the uncorrected association results in addition to the corrected results from the measurement model, and to assess the goodness of fit of the measurement model. Given our finding that multi-session scan data seem to outperform a single scan session of the same length, we would recommend that researchers adopt an acquisition protocol that splits their resting-state fMRI scans into two parts; for example, 5 min at the beginning and 5 min at the end of a scan session or day, instead of a single 10 min scan. This approach would maximize the (unwanted) variation in measurement error and mental state of the participant between half-scores to better capture the stable core (or trait-like) component of functional connectivity with a measurement model.
For researchers who have already acquired single-session resting-state fMRI data, we would recommend splitting the single session into two parts. Splitting single-session scans into more than two parts did not improve the results in our data. We have put a practical example and a step-by-step "walkthrough" of the model on GitHub (Teeuw, 2020). The output of the measurement model and its goodness of fit should be interpreted within the context of what the latent component of functional connectivity represents; specifically, how much variation is explained by the latent component, which is determined by the factor loadings of the measurement model and is roughly proportional to the test-retest reliability between the half-score measures of functional connectivity. Models with a low proportion of variance explained likely involve nonsense connections, regions that are not functionally connected, or highly unstable or dynamic connections, and may result in a poor goodness of fit. For practical studies, we would recommend targeting specific connections that are known to be useful or relevant to the trait under investigation and that, based on prior knowledge, link functionally connected regions. Additionally, extending the model may improve the fit to the data; e.g. including a correlated error covariance structure may resolve the bad fit (e.g. Brandmaier et al., 2018).

Limitations to the current study
There are some limitations specific to the current study. First, there is considerable variation in quality between resting-state fMRI datasets (Noble et al., 2019). Other datasets might see a shift in the performance and requirements curves based on the quality of the dataset. Secondly, reliability modelling was not applied to the traits, and the reported corrected associations might still be limited by the reliability of the trait. If multiple or repeated measures of the trait or measurements from different modalities are available, a measurement model can be applied to both the brain measure (e.g. functional connectivity) and the trait to obtain a more accurate estimate of the association between the two in the absence of measurement error (Beaty et al., 2015; Cooper et al., 2019; Köhncke et al., 2020).

Conclusion
In conclusion, reliability modelling of functional connectivity using a measurement model on split-half session resting-state fMRI data is an effective method to compensate for attenuation of the temporal correlation coefficient due to noise in the BOLD signal. The measurement model is able to extract a stable and reliable component of functional connectivity that can reveal the true associations with traits and yields increased heritability estimates compared to the analysis with the full-score estimate of functional connectivity. The benefit of a measurement model is greater for studies with short scan duration or a limited number of participants.

Declaration of Competing Interest
The authors declare no conflicting interests.

Data availability
The original data used in this analysis are publicly available after registration at https://db.humanconnectome.org. The definitions of the OpenMx models are available at https://github.com/jalmar/openmx-models. Requests for access to the scripts used in this analysis should be directed to the corresponding author.