Behavioural relevance of spontaneous, transient brain network interactions in fMRI

How spontaneously fluctuating functional magnetic resonance imaging (fMRI) signals in different brain regions relate to behaviour has been an open question for decades. Correlations in these signals, known as functional connectivity, can be averaged over several minutes of data to provide a stable representation of the functional network architecture for an individual. However, associations between these stable features and behavioural traits have been shown to be dominated by individual differences in anatomy. Here, using kernel learning tools, we propose methods to assess and compare the relation between time-varying functional connectivity, time-averaged functional connectivity, structural brain data, and non-imaging subject behavioural traits. We applied these methods to Human Connectome Project resting-state fMRI data to show that time-varying fMRI functional connectivity, detected at time-scales of a few seconds, has associations with some behavioural traits that are not dominated by anatomy. Despite time-averaged functional connectivity accounting for the largest proportion of variability in the fMRI signal between individuals, we found that some aspects of intelligence could only be explained by time-varying functional connectivity. The finding that time-varying fMRI functional connectivity has a unique relationship to population behavioural variability suggests that it might reflect transient neuronal communication fluctuating around a stable neural architecture.


Introduction
The emergence of large-scale distributed networks in spontaneous brain activity as measured by functional magnetic resonance imaging (fMRI) is a widely studied phenomenon (Biswal et al., 1995; Fox and Raichle, 2007). These networks have been consistently identified using cross-regional temporal correlations, referred to as functional connectivity (FC). In the absence of task inputs or any ground truth, discerning whether time-varying FC carries biological meaning in the resting state is not straightforward (Lurie et al., 2020; Kucyi et al., 2018). One possibility is to use indirect behavioural correlates, for example by assessing the extent to which FC prior to task onset influences task performance (Sadaghiani et al., 2015), quantifying how the execution of a task induces differences in subsequent resting-state FC (Waites et al., 2005), or using a low-demand task with well-defined behavioural information as a surrogate of actual resting state. However, these are normally subtle effects, and other researchers have reported little or no difference in FC between task and rest (Hampson et al., 2006; Gratton et al., 2018).
Here, we take a different route, by relating time-varying FC to population variability in behavioural traits. For this purpose, we implemented a framework to predict subject behavioural traits from either time-varying FC, time-averaged FC, or structural data. Critically, this was done in such a way that the prediction could be abstracted from the very distinct nature of the features used to represent each of the three modalities, allowing us to compare their relative and unique contribution to the prediction in an unbiased manner. Using different groups of behavioural traits, we used this approach to explore the relationship between population behaviour, time-averaged FC and time-varying FC, after accounting for the explanatory power of the structural connectivity features. We reasoned that if fMRI time-varying FC represents biologically meaningful communication between neuronal populations, then it should be capable of accounting for aspects of the subjects' behavioural phenotypes not explained by the time-averaged FC or the structural information. We found that this was the case, particularly for the traits that are generally related to intelligence.
To measure time-varying fMRI FC, we used a state-based model where each state is associated with a specific pattern of FC (Vidaurre et al., 2017), such that instantaneous changes in FC manifest as a change of state. This approach is based on a version of the hidden Markov model (HMM) that, in comparison to previous versions of the HMM used on fMRI (Vidaurre et al., 2017; Stevner et al., 2019; Baldassano et al., 2017; Shappell et al., 2019; Zhang et al., 2019), emphasises changes in FC over changes in amplitude. To model each subject, the HMM uses a temporally-organised mixture of (quasi-)stable FC patterns in the form of region-by-region covariance matrices. This is an appropriate choice for comparing time-varying FC with time-averaged FC, since time-averaged FC is also based on region-by-region covariance matrices. To model structural variability, we used fractional anisotropy (FA; Basser and Pierpaoli, 1996), mean diffusivity (MD; Basser et al., 1994) and voxel-based morphometry (VBM; Ashburner and Friston, 2000).

Results
In this section, we first summarise the basic steps of the analysis, which are presented in more detail in Methods, and then go on to show that there are aspects of behaviour that are uniquely expressed in both time-averaged and time-varying FC. We also show how each of these representations relates to the other, and to the structural data, in terms of their relation to behaviour. Overall, these analyses suggest that time-averaged and time-varying FC can indeed reflect separate aspects of brain activity.

Functional representations of the data
We used 1003 subjects' resting-state fMRI data (TR = 720 ms) from the Human Connectome Project (HCP; Smith et al., 2013b), where each subject underwent four 15-min sessions (two per day). We used a data-driven parcellation obtained through spatial independent component analysis (ICA), and extracted 50 components (Beckmann et al., 2009). The time series of these ICA components were then standardised separately for each session.
We considered two different FC-related representations of the data. The first representation is a time-averaged FC model, where we represented each subject as one (50 by 50) Pearson's correlation matrix across all ICA component time series (Smith et al., 2013). Because the time series have unit variance, these correlation matrices are equivalent to the corresponding covariance matrices.
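As an illustration, the time-averaged FC representation for one subject can be sketched as follows; this is a minimal example in which random synthetic data (with made-up dimensions) stand in for the real ICA time series:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one subject's ICA time series:
# T time points x 50 components (real HCP sessions are much longer).
T, n_components = 1200, 50
ts = rng.standard_normal((T, n_components))

# Standardise each component, as done per session in the paper.
ts = (ts - ts.mean(axis=0)) / ts.std(axis=0)

# Time-averaged FC: the 50 x 50 Pearson correlation matrix.
fc = np.corrcoef(ts, rowvar=False)

# With unit-variance time series, the correlation matrix coincides
# numerically with the (biased) covariance matrix.
cov = (ts.T @ ts) / T
print(np.allclose(fc, cov, atol=1e-8))
```

This equivalence between correlation and covariance for standardised data is what makes the time-averaged FC matrices directly comparable to the state-specific covariance matrices of the HMM.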
The second representation corresponds to a time-varying FC model, where the ICA time series were fed to a hidden Markov model (HMM), which we first ran at the group level, i.e. on the concatenated time series of all subjects. The HMM represents the data as (i) a collection of states, each represented by a certain probability distribution; (ii) time series of state activation probabilities, one per state and time point, referred to as state time courses; and (iii) a transition probability matrix containing the probability of switching from one state to another within a session (Vidaurre et al., 2017; Vidaurre et al., 2018); see Methods and Fig SI-1 for an illustration. As opposed to, e.g., Vidaurre et al. (2017) or Baker et al. (2014), which represented states using ordinary Gaussian distributions, here we implemented an HMM designed to emphasise periods in time with distinct FC (see also Fig SI-2 for a graphical, exemplary comparison). Specifically, each HMM state is represented by a covariance matrix across ICA components, so changes of state activation within session (expressed by the state time courses) correspond to modulations of covariance above and beyond the average covariance or FC. In this model, the state-specific covariances and transition probability matrix are estimated at the group level, whereas the state time courses are subject-specific (Vidaurre et al., 2017). Akin to the procedure known as dual regression in ICA (Nickerson et al., 2017), we then performed a process of dual estimation to obtain subject-specific versions of the group-level HMM, in order to get a fuller subject-specific description of time-varying FC, where each subject has their own set of state-specific covariances (i.e., FC matrices), transition probability matrix, and state time courses.
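To make the model concrete, the following toy simulation sketches the generative process this FC-focused HMM assumes: zero-mean Gaussian states with state-specific covariances and a sticky transition matrix. All dimensions and parameter values are made up for illustration, and this is the forward model only, not the variational inference used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
K, p, T = 8, 50, 1000  # states, ICA components, time points

# Group-level parameters: one covariance per state (zero mean),
# plus a transition probability matrix with a sticky diagonal.
covs = []
for _ in range(K):
    A = rng.standard_normal((p, p)) / np.sqrt(p)
    covs.append(A @ A.T + 0.1 * np.eye(p))  # SPD state covariance
P = np.full((K, K), 0.01)
np.fill_diagonal(P, 1 - 0.01 * (K - 1))

# Sample a state sequence and zero-mean observations: a change of
# state changes the FC pattern, not the mean signal amplitude.
states = np.empty(T, dtype=int)
states[0] = rng.integers(K)
for t in range(1, T):
    states[t] = rng.choice(K, p=P[states[t - 1]])
X = np.stack([rng.multivariate_normal(np.zeros(p), covs[s]) for s in states])

print(X.shape)
```

In the paper, the state covariances and the transition matrix are inferred from the concatenated group data, and dual estimation then refines them per subject.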
We trained the models with eight states, without optimising for this number; previous work, however, suggested that the relations to behaviour are relatively robust across a reasonable range of numbers of states (Vidaurre et al., 2017). As often occurs with other models where the estimation depends on an optimisation process, the inference of the HMM can potentially produce different solutions depending on the initialisation. Thus, in order to ensure that our conclusions were robust, we conducted five repetitions of the inference.

Prediction of behavioural variability
From the two functional representations just described, and the three considered anatomical descriptors (FA, MD, and VBM), we went on to assess how each of these can predict the considered behavioural traits. Within a 10-fold cross-validation scheme that respected the family structure of the HCP data (never splitting families between folds), we predicted a number of behavioural traits organised into six groups, including demographic, cognitive, affective, personality and sleep-related variables (Table SI-1). The word "behavioural" is used here in a general sense, even though we included demographic and life-factor traits that are not purely behavioural.
Fig. 1. Prediction scheme using representations in terms of distance matrices (DM). On top, cross-validated prediction from the structural information; at the bottom, structure-deconfounded prediction from the dual-estimated HMM, which contains information about time-varying FC. The CV-deconfounded residuals represent the traits after discounting the influence of the structural information. An analogous procedure is used for the time-averaged FC.

For predicting behaviour, we used an approach based on distance matrices (DMs) and cross-validated, motion-corrected kernel ridge regression (KRR; Saunders et al., 1998; Schölkopf and Smola, 2001; He et al., 2020). Specifically, we computed (N by N) distance matrices, where N is the number of subjects and where the distances are calculated to capture how different a specific representation is between each pair of subjects. Overall, there is one representation for the time-averaged FC (i.e. the time-averaged FC network matrix), five representations for the HMM-based FC (one per repetition of the HMM inference), and one representation for each of the structural measures (FA, MD, VBM); this yielded one time-averaged-FC-DM, five HMM-DMs and three structural-DMs (FA-DM, MD-DM and VBM-DM). Therefore, whereas the approach to compute the distances is specific to each modality, all modalities end up being reduced to the same format (a DM); see Methods for details about how the pairwise distances for each modality were computed. The motivation for the KRR approach is two-fold. First, because the prediction is based exclusively on distances, we can decide on a sensible distance measure to use between different representations, instead of manually deciding which features to use to represent each modality.
This offers a clean solution to the problem of how to make predictions using a complex object like an HMM, which is not obviously converted into a vector of representative features; instead, using an appropriate measure to quantify how different two HMMs are is relatively straightforward. Second, having all types of representation (time-averaged FC, HMM or structural) in the same format (a DM) makes it easier to compare the explanatory power of each modality in predicting the subject traits, which would otherwise be heavily dependent on their specific parameterisations. See Methods for a mathematical description of KRR.
With the goal of exploring the influence of the structural information on the functional representations, we ran the predictions on the uncorrected behavioural traits, as well as on the behavioural traits after regressing out (deconfounding) the structural (FA, MD or VBM) information; see Methods for details. A scheme of the prediction procedure is illustrated in Fig. 1 for the dual-estimated HMMs: on top, the prediction from the structural information; at the bottom, the subsequent structure-deconfounded estimation from the dual-estimated HMM. The estimation for the time-averaged FC is analogous.
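The deconfounding logic can be sketched as follows; this toy example, with made-up latent "structural" and "functional" factors, predicts a trait from a structural DM within cross-validation, keeps the out-of-fold residuals, and then predicts those residuals from a functional DM (a simplified stand-in for the procedure detailed in Methods):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 90

# Toy stand-ins: a structural and a functional distance matrix,
# and a trait influenced by both "modalities".
s = rng.standard_normal(N)                  # latent structural factor
f = rng.standard_normal(N)                  # latent functional factor
y = s + f + 0.2 * rng.standard_normal(N)
D_struct = np.abs(s[:, None] - s[None, :])
D_func = np.abs(f[:, None] - f[None, :])

def krr_cv_predict(D, y, folds=10, lam=1.0):
    """Out-of-fold kernel ridge predictions from a distance matrix."""
    K = np.exp(-D**2 / (2 * np.median(D)**2))
    y_hat = np.empty(N)
    for k in range(folds):
        te = np.arange(k * N // folds, (k + 1) * N // folds)
        tr = np.setdiff1d(np.arange(N), te)
        a = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(len(tr)), y[tr])
        y_hat[te] = K[np.ix_(te, tr)] @ a
    return y_hat

# Step 1: predict the trait from structure (cross-validated) and
# keep the residuals: the trait with structure discounted.
resid = y - krr_cv_predict(D_struct, y)

# Step 2: predict the structure-deconfounded trait from function.
r = np.corrcoef(krr_cv_predict(D_func, resid), resid)[0, 1]
print(r > 0)  # functional info survives structural deconfounding
```

Computing the residuals out-of-fold, rather than on the full sample, avoids leaking test information into the deconfounding step.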

Time-varying FC explains distinct aspects of behaviour
Taking into account the structural information, we next show that time-varying FC contains information about some behavioural traits that is not contained in the time-averaged FC and, vice versa, that the time-averaged FC is a better predictor than the time-varying FC representation for other behavioural traits. Fig. 2 shows a comparison of the prediction performances between the HMM representation and the time-averaged FC representation for the six behavioural groups listed in Table SI-1. This is presented for both the structure-deconfounded (i.e. for FA, MD and VBM; see above for details about deconfounding) and the non-deconfounded case. The top panels present the cross-validated explained variance (r², computed from Pearson's correlation) for the HMM and time-averaged FC representations; statistical significance through Bonferroni-corrected Student's t-tests is indicated by colour. Note that, although the predictions are not very high (Pervaiz et al., 2020), several are still significant. The middle panels reflect the difference between the two, which is positive when the HMM representation is a better predictor and negative otherwise; and the bottom panels contain an average of these differences per behavioural group. Statistical significance of whether one representation has a higher r² than the other across traits is indicated within the bottom panels (*p < 0.05; **p < 0.01; permutation testing) for each behavioural group.
Fig. 2. In the top panels, r² values (upwards for the HMM and downwards for the time-averaged representation; lighter colours represent statistically significant predictions and darker non-significant; Bonferroni-corrected Student's t-tests); in the middle panels, differences between the HMM and the time-averaged FC representations; in the bottom panels, the average differences aggregated by behavioural group (* and ** reflect statistical significance at levels of 0.05 and 0.01; permutation testing).

Note that, as mentioned above, in the case of the HMM there are five different predictions per trait, one per run of the HMM inference; therefore, there are also five prediction differences between the HMM and the time-averaged representation per trait, and five dots per trait in the middle panels. As observed, there is considerable variability in which type of representation (HMM- or time-averaged-FC-based) represents the traits better. Also, structure-deconfounding affects the prediction accuracy considerably, confirming previous studies on the influence of the structural information on FC-based predictions (Bijsterbosch et al., 2018; Llera et al., 2019). In this regard, Fig SI-3 shows the percentage loss of accuracy after correcting for structure, for each modality, grouped by behavioural group. As observed, the time-averaged representation is more affected by the corrections than the HMM-based representation. For reference, Fig SI-4 shows the (uncorrected) explained variance for each structural representation.
From this analysis, two conclusions are apparent. First, and most importantly, the behavioural groups are well separated by which representation is more effective in predicting them, with intelligence being particularly well predicted by the HMM representations. Second, correcting for the structural information improves the relative performance of the HMM-DM compared to the time-averaged-FC-DM (see also Fig SI-3).
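The group-level comparison between representations can be illustrated with a sign-flip permutation test on the per-trait r² differences; the r² values below are made up for illustration, and this is one standard way to implement such a test rather than necessarily the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical r^2 values for the traits of one behavioural group,
# under two representations (e.g. HMM-based vs time-averaged FC).
r2_hmm = np.array([0.05, 0.08, 0.11, 0.04, 0.09, 0.07, 0.10, 0.06])
r2_avg = np.array([0.03, 0.05, 0.07, 0.05, 0.06, 0.04, 0.08, 0.05])
diff = r2_hmm - r2_avg

# Sign-flip permutation test of the mean difference across traits:
# under H0 (no systematic advantage of either representation), each
# trait's difference is equally likely to be positive or negative.
n_perm = 10_000
obs = diff.mean()
signs = rng.choice([-1, 1], size=(n_perm, diff.size))
null = (signs * diff).mean(axis=1)
p = (1 + np.sum(np.abs(null) >= abs(obs))) / (n_perm + 1)
print(p < 0.05)
```

The `(1 + …) / (n_perm + 1)` form keeps the permutation p-value strictly positive, which is the conventional finite-sample correction.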

Changes in variance and amplitude of the signal do not explain behaviour
In order to investigate the possibility that the predictions are primarily driven by within-session changes in the variance or amplitude of the signal instead of FC, we ran two additional varieties of the HMM. These are compared with the FC-based version of the HMM used throughout the paper, here referred to as FC-HMM, where each state is parametrised as a Gaussian distribution with zero mean and a full covariance matrix. In the first of the new varieties, referred to as mean-HMM, the states were characterised by Gaussian distributions with distinct patterns of signal amplitude (encoded in the mean parameter) and a common full covariance matrix shared across states. In the second, the var-HMM, the states were characterised by Gaussian distributions with a diagonal covariance matrix and zero mean, modelling just variance and not actual covariance between regions. In these models, the FC was not allowed to vary between states. Furthermore, while the mean-HMM takes into consideration the time-averaged FC through the shared covariance matrix, the var-HMM does not model FC at all. Fig. 3 presents the explained variance of the FC-HMM versus that of each of the other two HMM varieties. As observed, the explained variance for the FC-HMM is consistently superior, highlighting the importance of accounting for time-varying FC above and beyond changes in amplitude and variance. Note that, although these models differ in their number of parameters and complexity (which in principle could influence the quality of the predictions), here we abstract away from these differences by performing the predictions on the basis of DMs only. This means that the KRR models always have the same number of parameters regardless of the modality (see Methods).

Fig. 3. Behavioural explained variance r² (defined as squared Pearson's correlation) for the FC-based HMM (the type of HMM used throughout the paper; Y-axis) vs (i) the mean-HMM, a type of HMM with one shared covariance matrix and one mean parameter per state that models changes in amplitude (top row; X-axis), and (ii) the var-HMM, a type of HMM with state-specific variance parameters, i.e. with no cross-region covariances (bottom row; X-axis).

Time-varying FC is more dissimilar to the structural information than time-averaged FC
Through their differences in prediction accuracy of the traits, we have so far investigated the amount of behaviourally relevant information contained in either the time-averaged or the time-varying FC representations. A complementary question is to what extent, specifically, these brain representations contain similar or different information about behaviour. That is, if two representations are very similar with respect to a given behavioural group, they represent similar information about that specific aspect of behaviour; if they are very dissimilar, their information about the behavioural group is mostly non-overlapping.
For each group of behavioural traits (see Table SI-1), we correlated the trait predictions between each pair of brain representations: time-averaged FC, each of the three structural measures, and each of the five HMM runs for the three HMM configurations described in the previous section (i.e. FC-HMM, the main model used throughout this study; mean-HMM, which only models amplitude; and var-HMM, which models changes in variance). In the spirit of the Representational Similarity Analysis literature (Kriegeskorte et al., 2008), this procedure produced a (no. of brain representations by no. of brain representations) similarity matrix per behavioural group, capturing how correlated the predictions of the behavioural traits were between each pair of representations. Fig. 4A presents the corresponding similarity matrices for each behavioural group. The five matrices have some common patterns, but some differences are also apparent. The most relevant pattern is that, in all cases, the structural representations were much more related (in terms of explaining behaviour) to the time-averaged FC than to the FC-HMM representations, confirming that time-varying FC is less related to structure than time-averaged FC. This is further explored in Fig. 4B, where we show the probability density of the corresponding correlations, in terms of explaining behaviour, between the structural representations on the one hand and either the FC-HMM (blue) or the time-averaged FC (red) representations on the other. The probability densities of the correlations were estimated by bootstrapping (Efron and Tibshirani, 1986). As observed, the differences are large and significant (p < 0.001, bootstrap-based testing). We can also observe that the five FC-HMM representations are more consistent in explaining the demographic traits (i.e. the correlation between HMM runs is higher) than the other behavioural groups, and they are also more related to the time-averaged FC representation for this behavioural group than for the others. The latter point can be seen in Fig. 4C, which shows the probability density of correlations (estimated by bootstrapping) between the FC-HMM and the time-averaged FC in terms of how they explain each behavioural group. The larger similarity for the demographic group is apparent. The other two types of HMM estimation, having fewer parameters and not capturing any information about time-varying FC, are in general more similar across runs and quite different from the FC-HMM, indicating that the FC-HMM approach is unlikely to be purely driven by changes in amplitude. In terms of the structural information, MD and FA are fairly similar to each other for all behavioural groups, but their similarity to VBM varies according to the behavioural group (highest for intelligence and sleep; lowest for demography).

Fig. 4. A. How similar are the different representations in explaining behaviour? Similarity matrices (in terms of Pearson's correlation), capturing how similar the prediction of behavioural traits was between each pair of representations, are shown for each of the five behavioural groups. The representations are: time-averaged FC, HMM-based representations including time-varying FC (FC-HMM; used throughout the paper), HMM representations including only changes in amplitude (mean-HMM) or variance (var-HMM), and structural (FA, MD and VBM). B. Distribution densities (obtained via bootstrapping) of between-modality correlations (in terms of explaining behaviour) show that the time-averaged FC representation is more related to the structural representations than the time-varying FC (p < 0.001, bootstrap hypothesis testing). C. The correlations between the time-varying and the time-averaged FC representations are higher for the demographic traits than for the other behavioural groups (p < 0.001, bootstrap hypothesis testing).
Altogether, this analysis revealed clear differences and similarities between the different neuroimaging representations in terms of explaining behaviour, and provides further evidence that time-varying FC is less related to anatomy than time-averaged FC.
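The similarity analysis can be sketched as follows; the "predictions" here are synthetic stand-ins for the cross-validated trait predictions of each representation, with the shared component and noise levels made up so that two representations overlap more than the others, and the bootstrap follows the resampling idea described above:

```python
import numpy as np

rng = np.random.default_rng(5)
n_traits = 30

# Hypothetical predictions of the same traits from four
# representations (rows): the more information two representations
# share, the more correlated their predictions.
base = rng.standard_normal(n_traits)
preds = np.stack([
    base + 0.3 * rng.standard_normal(n_traits),   # time-averaged FC
    base + 0.3 * rng.standard_normal(n_traits),   # structural (e.g. FA)
    0.4 * base + rng.standard_normal(n_traits),   # FC-HMM run 1
    0.4 * base + rng.standard_normal(n_traits),   # FC-HMM run 2
])

# Representation-by-representation similarity matrix.
S = np.corrcoef(preds)

# Bootstrap the structure vs time-averaged-FC similarity by
# resampling with replacement.
boot = np.empty(2000)
for b in range(2000):
    idx = rng.integers(0, n_traits, n_traits)
    boot[b] = np.corrcoef(preds[0, idx], preds[1, idx])[0, 1]
print(S.shape)
```

Comparing the bootstrap distributions for different pairs of representations is then a direct way to test claims such as "structure is closer to time-averaged FC than to the FC-HMM".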

Reproducibility of DMs
The reproducibility of the estimation of both the representations and the behavioural predictions can be relevant, for example, for clinical applications. In the previous section, we considered how different estimations of the various representations differ in their relation to behaviour. Here, we analyse another aspect of the representations' reproducibility: how robust these representations are, per se, across scanning sessions.
The HCP data contains four sessions per subject, with the first two (1 and 2) acquired on one day and the last two (3 and 4) on the following day. In order to further quantify the reproducibility of the estimations, we estimated separate time-averaged FC and (FC-)HMM models for the first day and for the second day, i.e. for sessions 1 and 2, and then for sessions 3 and 4. We also estimated models for the first session of each day, and then separately for the second session of each day. For each of these half-split estimations (HMM- or time-averaged-FC-related), we then computed DMs. Fig. 5 presents a quantitative assessment of the reproducibility of the estimations in terms of how similar their respective DMs were across half-splits of the data; here, each dot represents the distance between one pair of subjects.
As expected, the within-day reproducibility is considerably larger than the between-day reproducibility for both types of representation. Importantly, the time-averaged FC description exhibits in general a higher between-session reproducibility than the HMM representation. This can be due not only to the time-averaged FC being a simpler quantity to estimate, but also to the HMM being better able to capture session-specific information thanks to its time-resolved nature.
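The split-half reproducibility measure can be sketched as follows; here, two noisy versions of the same latent subject geometry (with made-up dimensions and noise level) stand in for the day-1 and day-2 DM estimations, and reproducibility is the correlation of their unique (upper-triangular) pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50

# Hypothetical distance matrices for the same subjects estimated
# from two data splits (e.g. day-1 vs day-2 sessions): a shared
# subject geometry plus split-specific estimation noise.
latent = rng.standard_normal((N, 3))

def noisy_dm(noise):
    Z = latent + noise * rng.standard_normal((N, 3))
    return np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)

D_day1, D_day2 = noisy_dm(0.3), noisy_dm(0.3)

# Reproducibility: correlation of the unique pairwise distances
# (upper triangle, excluding the zero diagonal) between estimations.
iu = np.triu_indices(N, k=1)
r = np.corrcoef(D_day1[iu], D_day2[iu])[0, 1]
print(round(r, 2))
```

Restricting the comparison to the upper triangle avoids inflating the correlation with the trivially identical diagonal and with duplicated symmetric entries.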

Discussion
In resting-state fMRI, the quantification of time-varying functional connectivity (FC) has elicited considerable interest and controversy: that is, to what extent can we measure and interpret within-session changes in the patterns of FC between areas? Whereas many studies rely on the average magnitude of activation that is evoked by a task or stimulus, FC is a second-order statistic and therefore harder to estimate accurately. Similarly, it is unclear whether FC can reflect changing patterns of communication between distant neuronal populations, and therefore be meaningful for investigating cognition. Even though the total amount of between-subject variability attributed to stable subject FC features (i.e. those that do not change within session and are preserved for each subject across sessions) is considerably higher than the within-session variability (Gratton et al., 2018), here we show that fMRI-derived FC indeed contains both stable and time-varying behaviourally meaningful information, and that time-varying FC can explain behavioural variability that is less likely to be mediated by structural connectivity and other anatomical features. This suggests that time-varying FC may represent meaningful neuronal dynamics related to certain aspects of behaviour. As a consequence, the study of FC fluctuations remains promising for the understanding of transient cognition.
To answer this question, it is informative to disentangle the different mechanisms by which time-varying FC computed from fMRI data could be non-informative: first, the characterisation of time-varying FC may be limited by fundamentally technical issues; second, the actual amount of information contained in the time variations, when assessed unbiasedly, may be negligible; and third, even if we can prove that there is non-negligible information in time-varying FC that can be reliably quantified, it may still not be cognitively significant. We argue that certain technical limitations do not apply to all methods of estimating time-varying FC equally. In the case of the HMM, for example, the technical limitation of having a statistically unstable estimation due to limited amounts of data (e.g. when using sliding windows) is overcome by using large amounts of data in the estimation of each state, through the ability to pool over all the data from repeated visits to the state (on average, 125 h per state in the present data set).
It has been shown that the time-averaged (subject-specific) FC features represent most of the variance in fMRI data (Gratton et al., 2018). However, the fact that time-varying FC explains considerably less variance does not necessarily mean that time-varying FC is deficient in explaining behavioural traits. We consider that discussing the physiological relevance of a brain representation in terms of explained variance (of the data) is not appropriate, for two reasons: (i) "physiological relevance" must be connected to a specific scientific question, e.g. relevance to the study of attention; and (ii) given such a question, there is no prior evidence that the most informative aspect of the signal (for that question) is the one that explains the most variance in the data. For example, in the context of prediction, it is a well-known phenomenon that the first principal components of the predictor data are not necessarily the most explanatory of the target variable (Frank and Friedman, 1993). As an example closer to neuroscience, electrophysiological signals hold most of their variance at lower frequencies; only a small fraction of the variance is contained, for instance, in the gamma frequencies (> 40 Hz), which have nonetheless been demonstrated to be essential to behaviour (Jensen et al., 2007). In summary, the argument that there is considerably more variability in the between-subject than in the within-subject differences cannot be used to claim a lack of biological relevance of the latter features.
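The point that high variance does not imply high predictive relevance is easy to demonstrate; in the synthetic example below (with made-up variance scales), only the lowest-variance principal component of the predictor data carries information about the target:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 5

# Construct data whose high-variance directions are irrelevant to
# the target: only the lowest-variance column drives y.
scales = np.array([10.0, 5.0, 3.0, 2.0, 0.5])
X = rng.standard_normal((n, p)) * scales
y = X[:, -1] + 0.1 * rng.standard_normal(n)

# Principal component scores, ordered by explained variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

r_first = abs(np.corrcoef(scores[:, 0], y)[0, 1])   # top-variance PC
r_last = abs(np.corrcoef(scores[:, -1], y)[0, 1])   # bottom-variance PC
print(r_last > r_first)  # the low-variance direction predicts best
```

This is the same phenomenon discussed by Frank and Friedman (1993) for principal component regression: ranking directions by variance and ranking them by predictive value are different orderings.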

Relation to previous work
Some of the conclusions of this study relate to the recent work of Liégeois et al. (2019), who found, in fMRI, that the autoregressive model (which linearly represents how, on average, the signal at time point t depends across regions on the signal at time point t - 1) was often more explanatory of behavioural variability than the standard time-averaged FC estimation. Because the autoregressive model is known to describe the dynamics of the signal well (Liégeois et al., 2017), the conclusion of that study was that the dynamic aspects of the data can often explain behaviour better than (average) instantaneous fMRI correlations. Critically, there is a conceptual distinction between a model of the multivariate dynamics of the system (as captured by the autoregressive model) and time-varying FC (as captured by the HMM) that is important to the message of this study. Specifically, while both the HMM and the autoregressive model can capture time-varying FC, the autoregressive model also captures other elements such as spectral information, whereas the HMM captures aspects of the data that the autoregressive model does not explicitly account for, such as the identification of time-resolved transient events. Therefore, the autoregressive model is not able to answer our question, which is focussed specifically on FC: i.e., do variations over time in the fMRI FC have biological significance above and beyond the temporally averaged FC, and also the structural information? This question is important as it speaks to the extent to which FC can represent instantaneous neural communication. It requires a model that explicitly considers variations around the time-averaged FC in a way that is not mixed with these other elements. The version of the HMM used here is one way to achieve this, but not the only one.
Other data descriptions capturing related or different aspects could also be considered, such as those based on signal events ( Allan et al., 2014 ) or quasi-periodic patterns ( Thompson et al., 2014 ).
It is worth noting that, in accordance with the growing body of work on predicting behavioural traits from functional connectivity in the HCP data, the predictions were modest (Smith et al., 2014; Vidaurre et al., 2017); nevertheless, they were sufficient to disentangle the time-varying from the time-averaged FC behavioural relevance in terms of trait prediction. Future work will aim at replicating these results on the UK Biobank, where higher prediction accuracies have been observed (Pervaiz et al., 2020).

Limitations and open questions
It is also important to appreciate that neither the HMM nor other commonly used time-varying FC estimators are explicitly biophysical models. Decisions about the appropriate number of states and other parameters are useful insofar as they affect the extent to which we can address the specific question at hand. For example, estimating more states will offer a more fine-grained representation of the data, which might be necessary in certain applications, but cannot be interpreted as more or less faithful to the biology. In general, different parametrisations just offer different perspectives on the data and, assuming model identifiability, the HMM is not more or less valid than other models. We also acknowledge that, while the version of the HMM used in this work is designed to emphasise time-varying FC, it could also be sensitive to changes in amplitude (Duff et al., 2018). However, we have explicitly tested that a version of the HMM based only on changes in amplitude is unable to explain behaviour to the same extent, emphasising the importance of time-varying FC. Other aspects of the data that can influence the HMM estimation are long-range temporal dependencies, which are not explicitly modelled by the HMM (Shappell et al., 2019). A quantitative assessment of the long-range dependencies in the data, and of how they affect the HMM estimation, will be the subject of future work.
An important methodological consideration is that, even though all the representations are compared without bias at the level of prediction because of their common DM representation, our analysis still depends on the choice of how to compute the distances. For example, in this study we used a Kullback-Leibler divergence approximation to compute distances between HMM representations ( Do, 2003 ; see Methods). Alternatively, one could compute differences purely based on the temporal aspects of the model (e.g. the transition probability matrix) or its spatial properties. Related to this point, the merits of kernel-based approaches come at the expense of neuroanatomical interpretation: since we no longer have one regression coefficient per spatial area, but one regression coefficient per subject (see Methods), and given also that the distance matrices (on which the prediction is based) are computed in an unsupervised fashion, it is not straightforward to see which areas have actually driven the prediction. Future work will address these questions, including how to compute between-model distances as part of the prediction and in an interpretable manner, so that the most predictive features of the models are identified in a data-driven way.
One aspect to consider about models whose inversion does not have a mathematically closed formulation (as is the case of ICA and the HMM among many others -but not of Pearson's correlation or the autoregressive model) is the fact that, every time we estimate the model, we might get a slightly different description of the data insofar as the estimation depends on a random seed (see for example Fig. 4 ). Even though the HMM inference is relatively stable on this data set ( Vidaurre et al., 2017 ), that is not guaranteed to always be the case. Again, these are not biophysical models, so all estimations are theoretically valid as long as they are considered for what they are: data-driven descriptions. Although there exist statistical testing approaches to combine across estimations so that statistical power is boosted, in this work we have analysed each estimation separately to ensure the comparability of the results.
A further caveat is that the ICA maps are known to contain important subject-specific differences that can be relevant to behaviour ( Bijsterbosch et al., 2018 ). These differences were not considered in this paper, as we estimated both time-varying and time-averaged FC using only the ICA time series. In future work, we will study the combination of these analyses with techniques that are better suited to account for this information ( Harrison et al., 2015 ).
Finally, it is worth noting that, to be conservative, we have performed the (group-level) HMM estimation within the cross-validation loop. However, since the HMM estimation is completely unsupervised and does not make any use of the labels, it would also have been correct to obtain the (dual-estimated) HMMs prior to, and outside of, the prediction cross-validation loop. Whether or not this is acceptable depends on the practicalities of the application. For example, if we wish to predict whether a new subject is going to develop a disease in the future given their brain data, it would not be a problem to rerun, on the entire data set (i.e. including the new subject), an unsupervised dimensionality reduction algorithm (like the HMM) before doing the prediction, as long as such an algorithm is unsupervised. Doing this would not make a diagnosis any less valid -only perhaps slower. But sometimes we would not have access to the original data at the time of prediction, in which case a proper validation of the method would need to cross-validate the HMM estimation as we did here.
In summary, this study presents methods to use different sources of brain data and/or models for prediction, in a way that makes comparisons possible in terms of their explanatory power of behavioural or clinical variables. Using this method, we have shown that time-averaged and time-varying FC explain distinct aspects of behaviour, above and beyond the behavioural variability expressed on the considered structural brain data.

Methods
We provide details on the preprocessing; on the nature of the hidden Markov model estimation and its different varieties; on the computation of the distances between each pair of subjects for each of the considered measures or subject variables; on the use of kernel regression to test the relation between imaging and non-imaging variables; and on how we accounted for the structural information and the influence of motion.

Preprocessing
We used the "minimally preprocessed" data from the HCP. Since the preprocessing of this data has already been described in detail ( Smith et al., 2013b ), we provide only an overview here. Structured artefact removal using independent component analysis (ICA) and FIX ( Griffanti et al., 2014 ) removed more than 99% of the artefactual ICA components in the data set. No low-pass temporal filter was used, and only minimal high-pass filtering was applied (cutoff = 2000 s), essentially removing the linear trends of the data. Since ICA-based methods have been shown to better characterise the signal than other data-driven approaches such as k-means (in particular on the HCP data; Bzdok et al., 2016 ), we used group spatial-ICA to obtain a "parcellation" of 50 components covering both the cortical surfaces and the subcortical areas (without using global signal regression). This parcellation was then used to project the fMRI data into 50-dimensional time series. These time series were finally standardised separately for each scan, subject and ICA component.
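The per-scan standardisation described above amounts to z-scoring each ICA component's time series separately. A minimal sketch (array names and sizes are illustrative, not taken from the original pipeline):

```python
import numpy as np

rng = np.random.default_rng(6)
T, J = 1200, 50                      # time points per scan, ICA components
ts = rng.standard_normal((T, J)) * 3 + 5   # synthetic ICA time series

# Standardise each component separately within a scan: zero mean, unit sd.
ts_std = (ts - ts.mean(axis=0)) / ts.std(axis=0)

assert np.allclose(ts_std.mean(axis=0), 0, atol=1e-10)
assert np.allclose(ts_std.std(axis=0), 1.0)
```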

An FC-focused hidden Markov model
The hidden Markov model (HMM) is a probabilistic framework used to model time series using a finite number of recurring patterns that succeed each other in some order ( Rabiner, 1989 ). The key assumptions of this approach are: that the data can be reasonably represented using a discrete number of probabilistic models; that the occurrence of these models is exclusive -i.e. the state time courses sum to one across states at each time point; and that we can reasonably model the state dynamics by a Markovian process -i.e. that the probability of a state being active depends purely on the data and on which state was active at the previous time point.
Each of these patterns or states is an instantiation of a certain probability distribution. The HMM is generic in the sense that it can accommodate different state probability distributions, depending on the type of data we are processing and the features that we wish to model. A suitable state choice for fMRI data is the Gaussian distribution ( Vidaurre et al., 2017 ), where each state, indexed by k, is modelled by a certain "mean activity" parameter \mu_k and a covariance matrix \Sigma_k. Let x_t be the data at time point t -i.e. the value of the ICA time courses at t. The probability density function that describes x_t, assuming that state k is active at time t, is given by

p(x_t | s_t = k) = (2\pi)^{-J/2} |\Sigma_k|^{-1/2} \exp\left( -\tfrac{1}{2} (x_t - \mu_k)^\top \Sigma_k^{-1} (x_t - \mu_k) \right),

where J is the number of brain regions (here ICA components), |\Sigma_k| is the determinant of the state-specific covariance matrix \Sigma_k and \exp is the exponential function. Here, \Sigma_k represents the covariance of the residuals, i.e. after subtracting the mean parameter \mu_k from the signal. In this paper, each state is parametrised as a Gaussian distribution with no mean parameter. Note that the interpretation of \Sigma_k as FC is not analogous to what is typically referred to when time-varying FC is assessed using sliding-window analysis ( Thompson and Fransson, 2018 ). This is because, as opposed to sliding windows, in this type of HMM the mean \mu_k is also allowed to be time-varying. Therefore, in order to focus the HMM decomposition on the FC changes, and in order to make the HMM estimation more comparable to standard analyses of time-varying FC, we enforced \mu_k = 0, describing the probability density function for state k as

p(x_t | s_t = k) = (2\pi)^{-J/2} |\Sigma_k|^{-1/2} \exp\left( -\tfrac{1}{2} x_t^\top \Sigma_k^{-1} x_t \right),

where \Sigma_k represents state-specific FC. In this model, \Sigma_k is assumed to follow a Wishart distribution; since each state is fully characterised by \Sigma_k, this is equivalent to having each state be Wishart distributed. Importantly, this model carries information of both time-averaged FC and time-varying FC.
Another important element of the HMM, also estimated from the data, is the transition probability matrix (TPM), which encodes the probability of transitioning from one state to another at any time point. Practically speaking, the TPM serves two purposes: it identifies the transitions that are more probable, and it regularises the state switching to minimise the amount of spurious transitions. In particular, the more persistent (temporally regularised) the solution, the larger the diagonal elements of the TPM will be in comparison to the off-diagonal elements.
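The role of the TPM's diagonal in regularising state switching can be illustrated with a small simulation; the TPM values below are arbitrary, chosen only to show the effect:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 3

# A "persistent" TPM: large diagonal entries discourage spurious switching.
A = np.full((K, K), 0.05)
np.fill_diagonal(A, 0.90)
A /= A.sum(axis=1, keepdims=True)    # rows are probability distributions

# Sample a state sequence from the Markov chain defined by the TPM.
states = [0]
for _ in range(999):
    states.append(rng.choice(K, p=A[states[-1]]))
states = np.array(states)

# With a persistent TPM, consecutive time points usually share a state.
stay_rate = (states[1:] == states[:-1]).mean()
assert stay_rate > 0.7
```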
The estimation of the HMM, carried out through a procedure of Bayesian variational inference ( Wainwright and Jordan, 2008 ; Vidaurre et al., 2018 ), was first computed at the group level, such that the state probability distributions were shared across subjects, though the state time courses and the time spent in each state were subject-specific.
Next, we computed subject-specific HMMs using a process that we refer to as dual-estimation (in analogy to dual-regression in ICA; Beckmann et al., 2009 ). To do this, we simply used the subject-specific state time courses to compute a subject-specific estimation of the states; then, based on these state estimations, we recomputed the state time courses and the TPM for each subject. In order to make cross-validated behavioural predictions (see below), we followed the conservative procedure of estimating the group-level HMM only on the cross-validation training sets, so that, afterwards, we could obtain the dual-estimated HMMs on both the training and testing sets.

The HMM contains time-averaged FC information
The HMM contains information not just on time-varying FC (how FC changes within each session), but also regarding the time-averaged FC (the subject-specific FC information that remains stable across sessions for each subject). This is because it is possible to fully reconstruct the time-averaged FC estimation from the dual-estimated HMM simply by computing a weighted average of the states' covariance (FC) matrices for each subject, where the weights are given by the fractional occupancies and the fractional occupancies are defined as the total proportion of time spent at each state for every given subject ( Baker et al., 2014 ).
Given N subjects and K states, the group-level HMM estimation represents some of the subject-specific time-averaged (or static) FC (avFC) information, according to the following expression:

avFC_i \approx \sum_{k=1}^{K} FO_{ik} \, \Sigma_k ,

where i indexes subjects and FO_{ik} represents the fractional occupancy for subject i and state k (i.e. the total proportion of time spent in that state for that subject). Given that the number of states is lower than the number of subjects ( K < N ), this is an approximation, and therefore there is some time-averaged FC information that is not captured by the HMM. Likewise, the HMM has information (for example in the TPM) that is not captured by a standard time-averaged FC estimation; formally, we refer to this differential information as time-varying FC. As opposed to the group-level estimation, the dual-estimated HMM captures all the time-averaged FC information:

avFC_i = \sum_{k=1}^{K} FO_{ik} \, \Sigma^{d\text{-}e}_{ik} ,

where d-e denotes dual-estimated. This is because

\Sigma^{d\text{-}e}_{ik} = \frac{ \sum_t \gamma_{ikt} \, x_t x_t^\top }{ \sum_t \gamma_{ikt} } ,

where \gamma_{ikt} is the probability for subject i to be in state k at time point t, \sum_k \gamma_{ikt} = 1 by the definition of a probability, and FO_{ik} = \frac{1}{T} \sum_t \gamma_{ikt}; hence \sum_k FO_{ik} \Sigma^{d\text{-}e}_{ik} = \frac{1}{T} \sum_t x_t x_t^\top = avFC_i.
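The reconstruction of time-averaged FC from the dual-estimated HMM can be checked numerically. The sketch below uses synthetic data and random state probabilities as stand-ins for one subject's ICA time series and state time courses (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, K = 1000, 5, 3

# Synthetic stand-ins for one subject's ICA time series and the HMM state
# probabilities (each row of gamma sums to one).
X = rng.standard_normal((T, J))
gamma = rng.dirichlet(np.ones(K), size=T)          # T x K

# Dual-estimated state covariances: gamma-weighted second moments
# (zero-mean states, as in the FC-focused HMM described above).
fo = gamma.mean(axis=0)                            # fractional occupancies
Sigma = np.stack([
    (gamma[:, k, None] * X).T @ X / gamma[:, k].sum() for k in range(K)
])

# The FO-weighted average of the state covariances recovers the
# time-averaged covariance exactly.
avFC = (fo[:, None, None] * Sigma).sum(axis=0)
assert np.allclose(avFC, X.T @ X / T)
```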

The HMM contains time-varying FC information
We have established that the HMM contains some time-averaged FC information. But, to which extent does the HMM capture time-varying FC information, above and beyond the time-averaged FC?
As a sanity check, in Fig. SI-5 we show that the time-averaged FC contains information that is essentially uncorrelated to the FC temporal variability. To compute a measure of the extent to which there is time-varying FC for each pair of regions, we first constructed an instantaneous estimate of FC at each time point, using a weighted sum of the dual-estimated HMM states' FC, weighted by the assigned HMM state probabilities at that time point. We then took the variance of these instantaneous estimates of FC across time to produce a (regions-by-regions) matrix of estimated FC temporal variability for each given subject. We then compared this to the time-averaged FC, confirming that these are unrelated.
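The instantaneous FC estimate described above (a probability-weighted sum of the dual-estimated state FC matrices, whose variance over time gives the FC temporal variability) could be sketched as follows, with synthetic stand-ins for the state FC matrices and state probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
T, J, K = 200, 4, 3

# Stand-ins for dual-estimated state FC matrices and state probabilities.
Sigma = np.stack([np.cov(rng.standard_normal((J, 50))) for _ in range(K)])
gamma = rng.dirichlet(np.ones(K), size=T)          # T x K, rows sum to 1

# Instantaneous FC: probability-weighted sum of the state FC matrices.
inst_fc = np.einsum('tk,kij->tij', gamma, Sigma)   # T x J x J

# FC temporal variability: variance of the instantaneous estimates over time.
fc_var = inst_fc.var(axis=0)                       # J x J
assert fc_var.shape == (J, J) and (fc_var >= 0).all()
```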

Other HMMs with no time-varying FC information
Previously, we have shown that the dual-estimated HMMs contain all the information there is about time-averaged FC. Having K FC descriptions per subject instead of one, plus a TPM ruling the transitions between these states, it is apparent that the HMM contains additional information beyond time-averaged FC. An important question is then what that additional information represents. There are three possible sources of variability: actual within-session changes of FC (i.e. time-varying FC), within-session changes in the variance of the signal, and estimation noise. By meaningfully relating the HMM information to behaviour above and beyond the influence of time-averaged FC (see below, and Results) we can rule out the possibility that the HMM extra parameters are purely noise-driven. However, given that both variances and correlations (i.e. FC) are part of the state descriptions, there is no straightforward analytical way to disambiguate how much each of these two elements drove the inference of the HMM. In order to prove the relevance of pure time-varying FC in the HMM estimation, we obtained alternative HMM estimations where the states are purely derived from changes in the variance of the signal. The probability density function of this model is given by

p(x_t | s_t = k) = \prod_{j=1}^{J} (2\pi \sigma_{jk}^2)^{-1/2} \exp\left( - \frac{x_{jt}^2}{2 \sigma_{jk}^2} \right),

where \prod(\cdot) represents multiplication across regions, \sigma_{jk}^2 is the variance for region j and state k, and x_{jt} is the value of the signal for region j at time point t.
The fact that the HMM using full covariance matrices can explain aspects of behaviour that this model was unable to explain (see Fig. 4 ) suggests that there is relevant information in the HMM that is not related to changes in variance.
Even though the mean parameter of the Gaussian distribution (which reflects the amount of activity of each state with respect to the average signal) was not included in the model in the first place, we estimated a third HMM model where the states were solely defined by the mean \mu_k, i.e. without state-specific covariances -and with a shared, global covariance. The purpose of this analysis is to rule out the possibility that this type of information, though not explicitly included in the HMM description used here, permeated the state covariance matrices and determined the HMM inference. The probability density function of this model is given by

p(x_t | s_t = k) = (2\pi)^{-J/2} |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (x_t - \mu_k)^\top \Sigma^{-1} (x_t - \mu_k) \right),

with one single covariance matrix \Sigma shared across states. Note that this model holds important similarities with ICA, in the sense that each state or component is represented by a map of activation. Again, as reflected in Fig. 4 , this model is less predictive of behaviour.

Measures of structural variability
We applied independent component analysis (implemented by the Melodic tool in FSL; Jenkinson et al., 2012 ) on the fractional anisotropy (FA), mean diffusivity (MD), and voxel-based morphometry (VBM) values for each subject across the whole brain (2mm resolution), resulting in 50 independent components of FA, MD, and VBM variability across subjects.
In more detail, the structural T1-weighted data was preprocessed using the Computational Anatomy Toolbox (CAT-12; Nenadic et al., 2015 ), which extends SPM's VBM pipeline ( Ashburner and Friston, 2000 ). Before grey matter volume estimation, all participants' T1 images were affinely aligned, segmented, normalised, and bias-field-corrected, yielding images containing grey and white matter segments and CSF. DARTEL ( Ashburner, 2007 ) was then used to normalise all images to a standard grey matter template provided by CAT-12. Subsequently, all grey matter volumes were smoothed with a 9.4 mm FWHM Gaussian smoothing kernel (sigma = 4 mm). The diffusion-weighted data was preprocessed using the DTIFIT routine from FSL ( Jenkinson et al., 2012 ) in order to extract FA and MD. More details about structural preprocessing can be found in Llera et al. (2019) .

Measuring dissimilarities between subjects
The kernel-based prediction algorithm employed in this paper is based on distance matrices (DM) containing the dissimilarities between each pair of subjects within the geometrical space defined by each type of representation (see Fig. 1 ). As mentioned, the main purpose of this approach here is to abstract ourselves from the specifics of each representation (e.g. time-averaged or time-varying FC) and their complexity, so that the prediction is made in a comparable fashion. Furthermore, there is not a straightforward way to unwrap the parameters of an HMM model into a vector of predictive features, so that a standard regression model can be applied. Because it is possible to compute distances between HMM models more straightforwardly, a kernel-based approach is a more natural way to make predictions in this case.
We next detail how to compute DMs in the spaces defined by the different imaging-based modalities: time-varying FC, time-averaged FC, and structural.
We first discuss the HMM model, which, as discussed above, contains information about both the time-averaged FC and time-varying FC. In particular, we computed the symmetric Kullback-Leibler divergence between each pair of (dual-estimated) subject HMMs, denoted as M_1 and M_2:

KL_{sym}(M_1, M_2) = KL(M_1 || M_2) + KL(M_2 || M_1),

where KL( M_1 || M_2 ) represents the standard (non-symmetric) Kullback-Leibler divergence between probabilistic models M_1 and M_2. More specifically, the Kullback-Leibler divergence represents how much information a probability distribution contains in relation to a second reference probability distribution. Whereas the Kullback-Leibler divergence has a closed form for various well-known distributions (e.g. the Gaussian distribution), this is not the case for more complex probability distributions such as the one represented by the HMM. For this reason, we adapted the mathematical approximation proposed by Do (2003) for discrete state distributions to the Gaussian case:

KL(M_1 || M_2) \approx \sum_{k=1}^{K} \nu_k \left[ KL(\theta^1_k, \theta^2_k) + KL(N^1_k, N^2_k) \right],

where \theta^i_k represents the (Dirichlet-distributed) probabilities of transitioning from state k to any of the other states according to model i (i.e. the k-th row of the TPM); N^i_k is the state Gaussian distribution for state k and model i; and \nu_k is a factor representing the weight of state k in M_1. Given the initial probabilities \pi^1 of the HMM state time courses for model M_1 (which are computed from the data together with the TPM), \nu can be numerically computed such that it meets the necessary criteria detailed in Do (2003).

The expressions for KL(\theta^1_k, \theta^2_k) and KL(N^1_k, N^2_k) are standard and can be found elsewhere ( MacKay, 2003 ). The code to compute the symmetric Kullback-Leibler divergence between two HMM models is provided with the accompanying code (see Data and Code Availability Statement). Note that these expressions require the states to be matched between HMM models; i.e. the first state of M_1 must correspond to the first state of M_2. This is guaranteed here by the fact that the dual-estimated HMMs are derived from the same group-level HMM.

The second type of DM corresponds to the time-averaged FC. To keep the comparisons fair, and in line with the approach taken for the time-varying FC, we described the time-averaged FC by fitting a Gaussian distribution per subject. Given that the time series were standardised for each subject (i.e. they are demeaned and have variance equal to 1.0), the resulting Gaussian distributions only contain a covariance matrix, which is mathematically equivalent to a Pearson's correlation matrix. The time-averaged FC's DM was computed using the symmetric Kullback-Leibler divergence between each pair of the subject-specific Gaussian distributions. Note that, because this way we are taking into account the non-Euclidean geometry of the covariance matrices, this approach is mathematically more principled and therefore statistically more efficient than using correlations across the off-diagonal elements of the FC matrices (as is more commonly done in the literature).
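For the time-averaged FC, the symmetric Kullback-Leibler divergence between two subjects' zero-mean Gaussians has a closed form. A minimal sketch (the function name is ours):

```python
import numpy as np

def sym_kl_gauss(S1, S2):
    """Symmetric KL divergence between two zero-mean Gaussians with
    covariances S1 and S2 (e.g. two subjects' time-averaged FC matrices)."""
    J = S1.shape[0]
    def kl(A, B):
        # KL(N(0, A) || N(0, B)) = 0.5 * (tr(B^-1 A) - J + ln|B| - ln|A|)
        Binv = np.linalg.inv(B)
        _, ldA = np.linalg.slogdet(A)
        _, ldB = np.linalg.slogdet(B)
        return 0.5 * (np.trace(Binv @ A) - J + ldB - ldA)
    return kl(S1, S2) + kl(S2, S1)

# The distance is zero for identical covariances, positive otherwise.
S = np.eye(3)
assert np.isclose(sym_kl_gauss(S, S), 0.0)
assert sym_kl_gauss(S, 2 * np.eye(3)) > 0
```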
Finally, the third type of DM is computed from the structural information, using the three considered structural measures: FA, MD and VBM. As discussed earlier, we have 50 ICA components for each measure, so the data consist of 50 weights per subject in each case. Given that there is no specific geometrical constraint on these values, we simply used Euclidean distances between each pair of subjects in order to estimate the corresponding DMs.
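Computing the structural DM then reduces to pairwise Euclidean distances between the subjects' component weights; for illustration (names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, C = 6, 50                        # subjects, structural ICA components
W = rng.standard_normal((N, C))     # e.g. per-subject FA component weights

# Pairwise Euclidean distance matrix between subjects' component weights.
D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)

# A valid distance matrix: symmetric with a zero diagonal.
assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0)
```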

Predicting subject traits using kernel ridge regression
Our goal is to capture the relationship between the representations of the brain imaging data (time-varying FC, time-averaged FC, and structural information) and subject traits. One way to achieve this is by regressing a set of nonlinear mappings of the imaging-derived features onto the behavioural traits, e.g.:

y = \phi(V) \beta + \epsilon,

where y is the ( N by 1) vector containing the behavioural trait for each subject, and \phi(\cdot) is a nonlinear function that operates on a representation V of the brain imaging data. To implement this regression model, we need to specify the choice of the function \phi(\cdot) and also the imaging-derived features it should operate on -which is not a straightforward task.
An alternative and simpler approach to working with this regression model is to take advantage of the so-called kernel trick, whereby predictions of an out-of-sample subject's behavioural trait, \hat{y}_i, are made to depend on a kernel function without the need of manually defining \phi(\cdot) ( Schölkopf and Smola, 2001 ). Specifically, we use kernel ridge regression (KRR), which is formulated as

\hat{y}_i = h \alpha,

where \alpha is a ( N by 1) vector of KRR weights, and y represents the observed ( N by 1) vector of values of the behavioural trait from the training CV-fold. As shown in the mathematical derivations by Saunders et al. (1998) , we can make use of the kernel trick to estimate \alpha as

\alpha = (H + \lambda I)^{-1} y,

where \lambda is a regularisation parameter. As with other kernel-based approaches, such as the support vector machine or Gaussian processes, KRR works on a ( N by N ; where N is the number of subjects) kernel matrix H , which is computed by applying some kernel function on the corresponding DM. Here, we chose a Gaussian radial basis function kernel, parametrised by a radius parameter \tau ( Hastie et al., 2001 ):

H_{n_1, n_2} = \exp\left( - \mathrm{Dist}_{n_1, n_2}^2 / (2 \tau^2) \right),

where Dist_{n_1, n_2} is the distance between the representations for two different subjects within the training CV-fold, and H_{n_1, n_2} is the corresponding element of the kernel matrix. That is, once we have computed the corresponding DM, the KRR approach does not need to consider where these distances come from. The choice of the Gaussian kernel function is motivated by the fact that it generalises well to most domains, given its lack of strong assumptions ( Schölkopf and Smola, 2001 ). On these grounds, h is a (1 by N ) vector containing the result of applying the Gaussian kernel to the N distances between each of the subjects in the training set and subject i in the test CV-fold.
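A minimal sketch of KRR on a precomputed distance matrix, following the formulation above (the function name and toy data are ours; in the actual analysis, \tau and \lambda are selected by nested cross-validation):

```python
import numpy as np

def krr_predict(D_train, D_test, y_train, tau=1.0, lam=0.1):
    """Kernel ridge regression on precomputed distances (a minimal sketch).
    D_train: (N, N) distances among training subjects;
    D_test:  (M, N) distances from test subjects to training subjects."""
    H = np.exp(-D_train**2 / (2 * tau**2))        # Gaussian kernel matrix
    alpha = np.linalg.solve(H + lam * np.eye(len(y_train)), y_train)
    h = np.exp(-D_test**2 / (2 * tau**2))
    return h @ alpha

# Toy check: with near-zero regularisation, training targets are recovered.
pts = np.array([[0., 0.], [2., 0.], [0., 2.], [2., 2.], [4., 1.], [1., 4.]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
y = np.array([1.0, -0.5, 0.3, 2.0, -1.2, 0.7])
yhat = krr_predict(D, D, y, tau=1.0, lam=1e-8)
assert np.allclose(yhat, y, atol=1e-4)
```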
In summary, the KRR formulation (also benchmarked in a neuroimaging context against deep learning methods by He et al., 2020 ) is equivalent to making a nonlinear prediction using an explicit nonlinear function \phi(\cdot), but without having to directly design, use, or even know, such a function; instead, we only need to specify a distance measure between the representations (e.g. of the HMMs, the structural images, or the time-averaged FC matrices) and a valid kernel function. The code for KRR, which uses a nested cross-validation loop to select both \lambda and \tau, is provided with the accompanying code (see Data and Code Availability Statement).

Accounting for brain structure in the predictions
A central goal of this paper is to assess how the predictive power of the time-averaged and time-varying FC representations relates to the anatomy. For this purpose, we used cross-validated KRR to estimate FA-, MD- and VBM-based predictions for each behavioural trait, using their respective DMs. That is, we estimated regularised KRR coefficients on the training folds and applied them on each testing fold, in turn, to eventually produce an ( N by 1) vector of predicted traits for each behavioural variable and structural modality. Then, we computed the corresponding residuals as the difference between the predicted and the empirical traits, and used these as FA-, MD- or VBM-deconfounded behavioural traits in the subsequent time-averaged-FC-based and HMM-based predictions. Cross-validation-based deconfounding was chosen because it is less aggressive and biased than standard deconfounding ( Snoek et al., 2019 ).

Motion correction
Since motion is known to influence both the estimation of time-varying FC and the prediction of behavioural variables, we used FIX and confound regressors at the level of the individual subject time series ( Smith et al., 2013b ). Furthermore, we included the derived motion parameters as confounds in the KRR prediction in order to also control for between-subject differences in motion.

Data and Code Availability Statement
The data used in this manuscript pertain to a well-known, open-access repository: the Human Connectome Project dataset.
The code comprises our open-access code repository (https://github.com/OHBA-analysis/HMM-MAR), and custom scripts for the specific experiments performed in the manuscript; these scripts will be made publicly available at the time of publication.