Modelling subject variability in the spatial and temporal characteristics of functional modes

Highlights • We extend probabilistic functional modes with a hierarchical temporal model.• The implementation, PROFUMO, is available from git.fmrib.ox.ac.uk/samh/profumo.• Improved spatial sensitivity is demonstrated using rfMRI data from 1000 HCP subjects.• Correcting for spatial variability changes information captured by temporal features.


Introduction
One of the key changes to the landscape of the analysis of functional connectivity via rfMRI in recent years has been the proliferation of large population-level studies [Van Essen et al. 2012b;Breteler et al. 2014;Bamberg et al. 2015;Miller et al. 2016] and multi-site data-sharing initiatives [Biswal et al. 2010;Scott et al. 2011;Mennes et al. 2013;Poldrack et al. 2013; Thompson et al. 2014;Kennedy et al. 2016;Gorgolewski et al. 2017] 1 . This has allowed investigations into the population-level correlates of fine-grained changes in functional connectivity [E. A. Allen et al. 2011;Dubois and Adolphs 2016], with several studies already finding strong links with a variety of behavioural, genetic and lifestyle factors [Finn et al. 2015;Smith et al. 2015;Colclough et al. 2017;Elliott et al. 2018]; together, these findings augur well for the search for clinically relevant, personalised predictions from functional neuroimaging data [Insel and Cuthbert 2015;Dubois and Adolphs 2016;Abraham et al. 2017;Stephan et al. 2017]. In sum, there has been a shift in what is required of analysis techniques, namely that they must be both interpretable and sensitive to subject-level variability, and at the same time they need to scale to meet the computational demands posed by large data sets.

Implications of variability over subjects
In this paper, we are primarily interested in the interpretation of-and characterisation of the subject variability in-static functional connectivity 2 . Ultimately, static functional connectivity is encapsulated by the dense connectome-by which we mean the time-averaged voxels-by-voxels connectivity matrix, as defined by the statistical relationships between time courses as extracted from functional data [Friston et al. 1993;Friston 2011]. However, dense connectomes are cumbersome computationally, and the natural spatial scale of the functional data is likely to be much lower than the several hundred thousand voxels present in a typical fMRI acquisition [Van Essen et al. 2012a]. In practice, what we are seeking is a parsimonious summary of the static functional connectivity that is both readily interpretable and captures key forms of variability.
The canonical approach for analyses of static functional connectivity is to summarise the high-dimensional data in terms of a comparatively small number of either parcels or functional systems 3 . These are usually defined in terms of their spatial configuration, at which point it is possible to extract representative 1 For a more complete overview of data sharing initiatives, see the NeuroImage special issues [Eickhoff et al. 2016].
2 For the rest of this paper, when we use the term variability in relation to functional measures, it can be assumed to relate to variability in static functional connectivity over subjects or sessions. This does not consider, for example, the moment-to-moment fluctuations characterised as dynamic functional connectivity [Hutchison et al. 2013;Calhoun et al. 2014;Preti and Van De Ville 2017]. This encodes important within-subject state changes [Tagliazucchi and Laufs 2014], and there is growing evidence that this captures between-subject trait differences too [Vidaurre et al. 2017]. 3 Resting-state networks, intrinsic connectivity networks, etc.
1. Introduction bioRxiv time courses from functional data and analyse these. There will naturally be variability in functional connectivity in several domains, though based on the above framework we will focus on two key ones here: firstly, we will refer to variability in the size, shape and location of functional regions as subject variability in spatial organisation; secondly, we will use subject variability in temporal features to denote the changes in summary measures based on said time courses-in particular, the strength of functional connectivity between regions (i.e. functional connectomes). Finally, note that for clarity we will use the term functional coupling to specifically refer to the functional connectivity between regions as described by these lowdimensional connectomes 4 .
The assumption that is implicit in either the parcel or system-level analyses is that registration to a common space means that the time courses we extract based on group-level spatial descriptions are an accurate, or at least unbiased, description of each subject's data. However, given that it is by no means uncommon to observe three-fold variation in the areal extent of regions of primary visual cortex across subjects [Andrews et al. 1997;Dougherty et al. 2003]; or that non-homeomorphic morphological changes, such as subjects exhibiting different number of gyri and sulci, are prevalent [Shackman et al. 2011;Amiez and Petrides 2014] even in identical twins [Bartley et al. 1997;Hasan et al. 2011]; or that macroscale anatomical features are poor predictors of cytoarchitectonic borders [Amunts et al. 2007]; then we should expect there to be substantial disparities in the presentation of functionally homologous regions across subjects, even after nonlinear registration [Brett et al. 2002;Devlin and Poldrack 2007;Van Essen and Dierker 2007;Mueller et al. 2013]. Recent observations have confirmed this for functional data, where it has been shown that this subject variability in spatial organisation 'can give rise to divergent connectivity estimates from the same seed region in different subjects' [Gordon et al. 2017a]-with the results from several studies also suggesting that reorganisations of functionally homologous regions that cannot be represented by diffeomorphic warps seem to be commonplace [Hacker et al. 2013;Harrison et al. 2015;Laumann et al. 2015;Glasser et al. 2016a;Gordon et al. 2016;Braga and Buckner 2017;Gordon et al. 2017b;Kong et al. 2018]. Furthermore, these differences have a substantial impact on the data: cross-subject differences in static functional connectivity have been shown to be much larger than either cross-site effects [S. Noble et al. 2017] or cross-condition, within-subject changes [Gratton et al. 2018].
Loosely speaking, these spatial differences in functional connectivity after registration can arise for four reasons: there will naturally be some errors in the registration process, resulting in structural features that are not brought into correspondence; there will be locations where anatomical landmarks bear little relation to functional subdivisions, meaning structural similarity is not a sufficient condition for accurate registration; there will be genuine non-homotopic reorganisations, whereby the standard registration approaches based on diffeomorphic warps could 1. Introduction bioRxiv never succeed 5 ; and there will be dynamic-either moment-to-moment or statedependent-changes in the functional connectivity structure [Buckner et al. 2013;Krienen et al. 2014;Salehi et al. 2020]. If these different sources of variability in spatial organisation are not accounted for, then one expects the inferred mode time courses to be a farrago of contributions from the underlying 'true' set of modes E. A. Allen et al. 2012]. Worse still, if the structural differences capture meaningful cross-subject differences-which they almost certainly will do [Llera et al. 2019]-then the amount of misalignment, and hence the quality of the extracted time courses, will reflect information that is anatomical rather than functional in origin ]. This breaks the central tenet of investigations into subject variability in temporal features, as we can no longer assume that a group-level description of the functional architecture is a reliable description of individual subjects, or even that we can use these to extract unbiased estimates of functional coupling. How then, do we proceed from here?
The first approach we could take is to improve the registrations, and hope that better algorithms and utilising a richer feature set to drive the alignment will push individual subjects ever closer towards the group description [Robinson et al. 2014;Tong et al. 2017;Robinson et al. 2018]. Notably however, the multiple recent observations that single functional regions can be manifested as multiple disjoint regions in some subjects, is something that not even advanced functional registration algorithms reliant on diffeomorphic warps can correct for. The minimum requirement for this approach is therefore the use of advanced registration techniques that can non-homotopically reorganise the spatial layout of functional regions, as, for example, introduced by Conroy et al. [2013] and Guntupalli et al. [2016Guntupalli et al. [ , 2018, or Langs et al. [2010].
The alternative approach, and the one that we take in this paper, is to build algorithms that can extract estimates of subject variability in temporal features while simultaneously accounting for the variable presentation of functional regions at the subject level. Several methods have been proposed to do exactly this, using both hierarchical models of functional systems [Varoquaux et al. 2011;Abraham et al. 2013;Harrison et al. 2015;Li et al. 2017] and parcels [Liu et al. 2012;Langs et al. 2016;Kong et al. 2018]. We provide a more fulsome description of these, and their counterparts that extract subject-specific information given a fixed group template, in Appendix A. However, the majority of these methods have what is potentially a major limitation: the flow of information is almost exclusively from group to subject. In other words, there are only relatively rudimentary efforts to tap into what we might hope is a virtuous cycle: we should be able to use our group-level estimates to infer accurate subject-level information, but, crucially, we should also be able to utilise the observed variability at the subject level to refine our 1. Introduction bioRxiv group-level parameterisations. Furthermore, the same process should hold within subjects, such that accurate estimation of the individual spatial presentations should improve evaluation of the temporal information, and vice versa.
Finally, while we have tended to focus on connectomes as the principal temporal feature of interest in the above discussion, there are other types of variability we are interested in. Recent work has shown that, for example, amplitudes-by which we mean any metric which represents the amount of fluctuation in activity of a functional region over time-carry a substantial amount of information about subjects [Zang et al. 2007;Duff et al. 2008;Zou et al. 2008;Miller et al. 2016;Bijsterbosch et al. 2017], provided we are sufficiently careful in how we distinguish changes from those in functional coupling [Duff et al. 2018], and then how we interpret said changes [Qing and Gong 2016]. Amplitudes are therefore another type of subject-specific information that we would hope analysis methods could identify, and more importantly disambiguate from, the types of subject variability we have already discussed. This is an illustrative example of the complexity of the task of characterising functional connectivity: at every level of any perceptual hierarchy of features we impose (i.e. separation into spatial and temporal features, or subdivision of temporal features into amplitudes and coupling), we expect there to be multiple ways to identify the different features, and substantial cross-subject variability that is correlated across the different categories.

Outline
For the rest of this paper, we will outline our approach for simultaneously inferring group-and subject-level descriptions of functional systems. We use the term mode to describe our mathematical description of a given system.
To begin with, we present our probabilistic model for these modes, including the way we parameterise subject variability in both spatial and temporal features, and our approach for inference. This is a significant extension of the proofof-concept method [Harrison et al. 2015] in several key ways: we introduce a new hierarchical model to better capture the functional coupling between modes, incorporate a model for mode amplitudes to engender a cleaner separation between different types of functional variability, and we overhaul the entire implementation to help the inference scale to large data sets.
We then compare the performance of our method with existing approaches. We do this using both simulated and empirical data, namely the complete set of rfMRI data as released by the Human Connectome Project and 'active-state' fMRI data from a more conventionally sized study. Finally, we offer some brief discussions as to the significance of our results.

Model bioRxiv 2 Model
Our approach infers subject-level probabilistic functional modes (PFMs)-each of which can be thought of as being described by a subject-specific spatial map and a set of time courses-across the whole cohort simultaneously. Ensuring that there is correspondence between the inferred modes across the cohort is a challenge [Esposito et al. 2005], especially on resting-state data where we cannot assume any common temporal structure.
However, we can use the information at the group-level to inform the subjectspecific decompositions: both the subject-specific spatial maps and the lowdimensional, between-mode functional connectomes are constrained to vary around their group-level descriptions, and we can also leverage the expected properties of the haemodynamic response to further constrain the time courses. Moreover, we can use the subject-specific modes to learn about the variability of all these properties, thereby allowing us to not only describe typical patterns of activity, but to also quantify the extent to which observed patterns are atypical. We do this by building, and then inferring upon, a hierarchical probabilistic model for rfMRI data as described by a set of modes, and it is this that we outline in the following section.

Matrix factorisation models
Defining a mode in terms of a spatial map and time course means that it is fundamentally a matrix factorisation approach, a mathematical formulation which underpins principal component analysis, independent component analysis, nonnegative matrix factorisation, dictionary learning and several other of the well established methods for extracting modes from rfMRI data. For completeness, we briefly introduce our notation for this class of models before introducing our extensions.
Firstly, each subject, , from a cohort of subjects, is scanned times. Note that we do not assume that each of the runs for a given subject (i.e. ∈ {1, … , }) are identical from a modelling standpoint: they could, for example, represent different time points in a longitudinal study, or different conditions 6 , and we may therefore want to treat them differently. The fMRI data are acquired in voxels and at time points, which we reshape into a data matrix ( ) ∈ ℝ × . We do all our analyses after the data has been registered into a common space, so the number of voxels is constant across subjects. We do however allow the number of time points per run to vary (i.e. ( ) ∈ ℝ × ( ) ), but for notational simplicity we drop any superscripts on .
The problem we are faced with is defining an extension to the standard matrix factorisation approach to account for these multiple data. In the spatial domain, as discussed in the Introduction, we expect between-subject variability in the locations 2. Model bioRxiv of functional regions, even after registration, and we expect these effects to be amongst the dominant sources of functional variability. We make the pragmatic decision to focus on differences in the static configuration of functional systems specifically, and we target our spatial approach towards what are essentially misalignments.
Therefore, as in Harrison et al. [2015], we model subject and run variability within the matrix factorisation framework as follows. We are looking for a set of modes, and we assume that the subject variability in spatial organisation we observe across subjects, by virtue of it being driven primarily by cortical reorganisations, is consistent across all runs for a given subject. This gives a set of subject-specific spatial maps, ( ) ∈ ℝ × , that will potentially be observed multiple times. Furthermore, each run will have its own unique set of time courses, ( ) ∈ ℝ × , as well as a set of mode amplitudes, ( ) ∈ ℝ . For convenience we adopt the following convention: ( ) ∈ ℝ × ≡ diag( ( ) ). Finally, note that in general we infer a small number of PFMs relative to and , which gives a parsimonious description of the data. However, this means that the factorisation will not be exact, so we express the data as the contribution from the PFMs and a noise term, ( ) ∈ ℝ × . This set of assumptions allows us to describe the complete model for one run as In the following sections, we describe how we model the dependencies between these run-specific decompositions, as well as the key properties of rfMRI data that we are trying to capture. For reference, a full graphical model is provided in the Supplementary Material.

Spatial Model
The spatial model remains conceptually similar to the approach we used in Harrison et al. [ibid.]. For each mode, there is a rich group-level description capturing the mean group maps and typical subject variability around these; as Van Essen and Dierker [2007] discuss, in light of subject variability, it is essential that '[regions are] represented probabilistically whenever possible, in a way that reflects variability in cortical convolutions and in [their] size, location, and internal (e.g., topographic) organization'. Similarly, subject maps are parameterised such that they retain the key characteristics of the group maps, but allow for genuine variability while being robust to spurious correlations induced by noise.
A key modification we make to the previous model is to change the way we model the spatial map distribution, by relaxing the delta-Gaussian mixture model to a double-Gaussian mixture model. Previously, the weights in voxels which were inferred to be outside of a given mode were set to exactly zero. In reality however, essentially all voxels will exhibit a weak correlation with a given mode time course 7 , and, particularly in studies like the Human Connectome Project with thousands 7 Cf. the noisy estimates of beta values from a GLM fit.

Model
bioRxiv of time points per subject, there is often sufficient evidence a posteriori to model this noise as small, but nevertheless non-zero, weights 8 . The new model allows for exactly this type of 'spurious' (i.e. statistically but not biologically significant) correlation by including a noise distribution to capture small deviations from zero in the spatial map weights. While we are not interested in these small weights per se, if we do not include a more explicit noise model then the model will erroneously include them as signal thereby hindering our ability to detect genuine 'neural' signal.
This contamination by noise happens for three main reasons. Firstly, as Bright and Murphy [2015] recently showed, even well-characterised functional modes can be identified from noise processes like subject motion. Conversely, this implies that even accurately identified modes may well correlate with non-neural processes. Secondly, given the complex, long-range spatial autocorrelations present in fMRI data [Kriegeskorte et al. 2008], fMRI noise processes have a non-trivial structure. This is heightened by spatial smoothing, which is an often used pre-processing step for fMRI data (though less so for modern high spatial and temporal resolution data [Glasser et al. 2016b]). This is advantageous as it ameliorates the problem of residual spatial mis-alignment after registration, but induces heightened spatial correlations in the noise. While it would be possible to model this, estimating-and then correcting for-the true number of spatial degrees of freedom in the data is notoriously difficult [Worsley et al. 1996;Eklund et al. 2016], and would be computationally expensive over a large number of voxels. Finally, in the section on the noise model itself, we demonstrate how unstructured noise can have a stabilising effect on matrix factorisation models. Therefore, we make the pragmatic decision to account for these effects in the spatial model, rather than trying to incorporate a more complex mechanistic model for the noise.
The resulting model takes the following form. For voxel in mode , the subject-specific spatial weights are distributed as follows: Where ( ) is a binary indicator variable which represents whether a given voxel's weight is drawn from the signal or the noise component. This distribution is defined in terms of several group-level hyperparameters: the probability that a given weight is drawn from the signal rather than the noise distribution, ; the mean and standard deviation of the signal component, and respectively; and the new parameters, the standard deviation of the noise 2. Model bioRxiv component, which we parameterise as ( ) for reasons which we explain in detail later. Note how much richer this description is than the single set of group-level means that most currently used techniques infer. For example, the parameters can capture the types of spatial non-uniformity in subject variability observed by Mueller et al. [2013]. Therefore, when inferring subject maps, the inference will automatically be informed by the data more than the group mean in regions inferred to exhibit high functional heterogeneity over subjects, and vice versa for regions with low subject-to-subject variability.
The model also includes the set of distributions over the group-level hyperpriors (see the Supplementary Material for the way these, and all subsequent, hyperparameters are specified). Starting with the hyperpriors on the 'signal' component, we place a mixture model prior over the group means, which, as in the previous work, is inspired by the spike-slab distribution [Mitchell and Beauchamp 1988;George and McCulloch 1993;Ishwaran and Rao 2005;Titsias and Lázaro-Gredilla 2011]. This encourages sparsity in the group-level spatial maps, thereby encoding ideas about functional segregation, as well as allowing more flexibility when specifying the distribution of the non-zero weights. However, we introduce an extension and model the non-zero weights with a combination of two Gaussians with different variances. This allows the group-level distribution of non-zero spatial weights to have heavier tails than the single Gaussian used in the previous incarnation of the model.
Where is the probability that a voxel in the group map is drawn from each of the three distributions, and [ = ] is the Iverson bracket. The group signal standard deviations, , take an inverse-gamma hyperprior: Returning to the hyperpriors on the 'noise' component, in Equation 2, the standard deviation of the noise component of the subject-specific spatial map distribution is parameterised as ( ) . The parameter encodes spatial inhomogeneity in the noise variance: for example, we expect more structured noise due to motion around the edges of the brain; similarly, we expect more physiological noise in the brainstem. This group noise standard deviation, , also takes an inverse-gamma hyperprior: ( ) = Γ( −2 | , ) 2. Model bioRxiv However, we also expect different signal-to-noise ratios, both across subjects and modes. Therefore, we include an extra parameter, ( ) , which captures variations in the noise level 9 . We place a weak prior on ( ) , as we want the overall scale of each spatial map to be determined by the signal rather than the noise, as this makes cross-subject analyses more informative: Finally, the last hyperprior to specify is that on the group membership probabilities. This follows a beta distribution: In summary, the model has rich descriptions of the spatial maps, both at the group and subject level, and allows us to encode typical patterns of variability. Furthermore, while we have included a weak sparsity constraint at the group-level, there is no explicit constraint on, for example, orthogonality of the spatial maps. Therefore, the model can capture modes that are highly spatially overlapping in what is arguably a more natural way than independent component analysis-even despite a historic tendency to overstate those criticisms [Beckmann et al. 2005;Smith et al. 2012;Calhoun et al. 2013].
One last point to note is that when we present our results, the group maps we show are the marginal posterior means of the whole spatial distribution, rather than the parameters themselves. The group-level maps are therefore E[ | ], which has the nice property that it incorporates the uncertainty about whether each voxel is is drawn from the signal or the noise component.

Temporal Model
In the temporal domain, the unconstrained nature of rfMRI data means that we can say relatively little about the time courses from a given run, as there are no external events from which we can search for consistent time-locked patterns of mode activation. However, functional connectomics has shown that, as well as having a consistent group structure, both the interactions between modes and simple amplitude measures encode interesting information about subjects. Similarly, the haemodynamic processes lend neural processes a distinct temporal signature. That being the case, we wish to formulate a model that primarily captures these two phenomena.
However, we expect the inferred time courses to be corrupted by noise, even if we properly make allowances for the global noise process ( ) . As mentioned in the Spatial Model section, there are likely to be structured noise processes that violate our haemodynamic assumptions. This needs to be accounted for before we can introduce the targeted models of the BOLD signal.

Model bioRxiv
Analogously to the the spatial model, we extend the model from Harrison et al. [2015] by making the pragmatic decision to allow noisy time courses. Therefore, our time course model contains two terms: the first represents the clean BOLD time courses, ( ) , while the second represents the noise that corrupts these, ( ) . This gives: There is an additional benefit of this explicit parameterisation of the BOLD time courses. Recent work has claimed that the [fractional] Amplitude of Low Frequency Fluctuations ([f]ALFF) [Zang et al. 2007;Zou et al. 2008;Zuo et al. 2010a], as derived from rfMRI data, captures aspects of subject variability related to disease. Our parameterisation allows us to derive a related quantity, which we term the fractional amplitude of BOLD time courses (fABT). This is simply defined as the power in the clean BOLD time courses ( ) , relative to the power in the noise time courses ( ) , calculated for each mode and each run individually. Conceptually, this is very closely related to fALFF, but it has the clear advantage that it does not require defining 'low' frequencies in terms of an arbitrary threshold; rather, the signal of interest is based on an explicit model of the HRF. Secondly, the calculated fABT measures specifically relate to the activity in different functional systems which makes the measure more interpretable.

Haemodynamic model
We use the haemodynamic response function (HRF) based model that we introduced in Harrison et al. [2015]. This is a relatively simple, computationally efficient, linear model that captures the gross properties of the HRF via the temporal autocorrelations that it induces in the data. We assume a white noise 'neuronal' process convolved with a canonical HRF 10 , whose autocorrelation function we can capture using a full covariance matrix, ∈ ℝ × , for all the time points in a given run. As the overall variance of the time courses is arbitrary given the explicit amplitude parameters, we simply ensure that is scaled such that all entries on the main diagonal are unity.

Subject-level mode interactions
The major extension relative to the previous model is an explicit parameterisation of the functional coupling between modes. As discussed earlier, we expect to observe temporal interactions between modes, and this will lend some structure to the mode time courses. We define these interactions in terms of the precision matrix between the mode time courses. In other words, we combine the HRFderived autocorrelation structure with a prior on the between-mode precision matrix, ( ) ∈ ℝ × , in a matrix normal distribution.

Model bioRxiv
The combined prior on the haemodynamic time course for all the PFMs in a given run is then:

Group-level mode interactions
The temporal interactions between modes have been characterised as having a consistent structure across the group [Shehzad et al. 2009], so we introduce a hierarchical model to capture this. Subject-or run-level variability will manifest itself as deviations from this set of group interactions. This formulation we use is, in essence, the same model as that proposed by Marrelec et al. [2006], but where we have two principal advantages: firstly, inference is informed by the full posteriors on the rest of the model (i.e. rather than point estimates); and, secondly, that the regularisation that arises from these priors will inform the inference of the rest of the model parameters.
Starting at the subject level, we estimate the subject/run-specific temporal precision matrix ( ) to keep track of the functional connectivity between modes. These precision matrices follow a Wishart distribution, and we introduce a hyperparameter, ∈ ℝ × , that encourages the interactions to be consistent across subjects and/or runs. This takes the form of a hyperprior on the subject-specific scale matrices, and again this follows a Wishart distribution.
Furthermore, we can also place restrictions on the type of variability we want the model to capture. If, for example, subjects are scanned multiple times but always under the same conditions, then it may well be appropriate to generate a consensus set of interactions for that subject by pooling over runs. We can do this straightforwardly by setting ( ) ≡ ( ) . Alternatively, if the runs vary across the group in a consistent way (e.g. 'before' and 'after' scans) then we may want to explicitly model these conditions as separate entities. We can do this by introducing a family of group-level interactions, , and selectively using these as the hyperpriors on ( ) as appropriate. This gives us enormous flexibility and allows us to increase our statistical power by making targeted assumptions about the key modes of variation.

Time course specific noise model
The noise time course of mode at time , ( ) , is simply drawn from a Gaussian distribution with precision ( ) . This gives

Amplitude Model
Again, the amplitude model is an extension to our previous work. This has a straightforward formulation, with these parameters simply designed to account for the run-to-run variations in the overall activity of each mode. These are parameterised in terms of ( ) ≡ diag( ( ) ), and follow a Gaussian distribution: The group-level parameters, and capture any consistent cross-subject relationships between the mode amplitudes. For example, Bijsterbosch et al. [2017] recently reported that the amplitudes of sensorimotor modes are correlated with one another, as are the amplitudes of cognitive networks. It is exactly these types of effects that these hyperpriors are able to capture.
The hyperpriors are formulated as follows: Furthermore, we impose a post-hoc positivity constraint on these variables as part of the inference procedure. As there is a multiplicative ambiguity as to the signs of the components in a matrix factorisation model, we can do this without loss of generality.

Noise Model
The final part of the model left to specify is the noise process, ( ) , which we assume is zero-mean, white Gaussian noise, with an overall precision for each run, ( ) . This specifies the likelihood: This noise precision then takes a standard gamma hyperprior: This relatively simple structure assumes that the noise variance is the same in every voxel, which is particularly useful as it allows us to exploit the properties of the matrix normal distribution, leading to very computationally efficient inference 2. Model bioRxiv [Stegle et al. 2011]. We can preprocess the data in such a way that this is a reasonable assumption to make, and this is discussed in Appendix B.
What is perhaps more problematic is that this model does not acknowledge the spatial smoothness of fMRI data, which means that the noise is not truly independent over voxels. It would be possible to model this, for example by inferring a full spatial covariance matrix for the noise that acknowledged the dependencies between voxels that smoothing introduces. Again, we decide that the benefits of this more complex model are outweighed by the increased computational burden, and again we discuss a way in which we can mitigate the effects of this model misspecification via a relatively straightforward adjustment for the spatial degrees of freedom introduced by Groves et al. [2011], as discussed in Appendix D .

Spatially and temporally specific noise models
One of the key changes to the model as introduced here and its previous incarnation is the way we model noise on the spatial maps and timecourses, as well as the overall noise described above. Interestingly, these different sources of noise can be beneficial for matrix factorisation models even in the absence of the fMRI-specific effects we postulated.
To demonstrate this, we use a simple, single-run version of our generative model, = + , and we assume the maps and timecourses are full rank to simplify the derivations below. The ordinary-least-squares single-regression estimator for the spatial maps,̂[ ] , given the ground-truth timecourses is: If we instead run dual regression-using the Woodbury matrix identity for the key rearrangements-we find a different estimator for̂[ ] : What is surprising is that the dual regression estimator is closer to the ground truth, even though the intermediate timecourses,̂[ ] , are noisy. This unintuitive behaviour occurs because dual regression involves two regressions on the same noise, and this has concrete implications for the PFM model. When we fit the haemodynamic model to the timecourses, we exclude the temporally specific noise terms from the estimation of the functional coupling between modes. However, we need to include the temporal noise terms when using the timecourses to estimate the subject-specific spatial maps, as removing it could increase the variance of the 2. Model bioRxiv inferred maps. The situation is directly analagous with the model for spatial noise: while it is not a quantity of interest for cross-subject modelling, its inclusion can improve the stability of the overall estimation.
In sum, the PROFUMO approach uses the spatially and temporally specific noise where the stabilising effect on matrix factorisation models means that it is expedient to do so, but seeks to avoid letting it confound cross-subject analyses. By way of contrast, dual regression is not naturally able to separate these types of noise.

Inference Approach
We use a computationally efficient variational Bayesian approach to infer upon the probabilistic model outlined above. This technique is well established for graphical models that have a conjugate-exponential structure, as is the mean-field approximation that renders the inference tractable [Attias 2000;MacKay 2003;Winn and Bishop 2005;Blei et al. 2017]; as such, we will not cover the details of that here. In the Appendices, we outline several of the implementation details, including our data preprocessing pipeline, the way we handle large data sets, tweaks to the model and the initialisation procedure.
The combined inference and analysis package, PROFUMO (from PRObabilistic FUnctional MOdes) is available from git.fmrib.ox.ac.uk/samh/profumo and is compatible with FSL [Jenkinson et al. 2012]. All subsequent analyses were performed with version 0.11.1.
The model clearly has a large number of hyperparameters, but as described in the Supplementary Material we can drastically reduce the effective number given that the overall variance of the data is fixed by the internal preprocessing. Furthermore, the vast majority of the parameters that need setting govern the group-level hyperpriors and, as such, are several steps removed from the subject-level decompositions. This means that we can use the same default hyperpriors for all the analyses presented here, and that the inference generalises well across simulated, volumetric, and surface-based data, as well as datasets with very different numbers of subjects.

Model Summary
In summary, we explicitly model many of the properties of rfMRI data within the PROFUMO framework. In the spatial domain, we have a complex group-level model that captures both mean effects and typical patterns of variability, and use these to regularise the subject-specific spatial maps. The temporal model is based around the physiological properties of the BOLD signal, and includes another hierarchical model for the functional coupling between modes. Similarly, we capture differences in the overall activity levels of modes via the amplitude parameters. Finally, we can generate additional summaries by combining parameters as desired, 2. Model bioRxiv which includes, for example, the measures related to the fractional amplitudes of the BOLD signal.

Results
Here, we demonstrate the performance of PROFUMO using a set of simulated data and two empirical datasets. All comparisons are with spatial independent component analysis and dual regression (ICA-DR) [Calhoun et al. 2001;Beckmann et al. 2005;Zuo et al. 2010b;Nickerson et al. 2017], as this is what has been used in previous publications on the empirical data.

Simulations
The simulation framework is explicitly designed to be challenging, such that it tests the various ways in which the assumptions the different models make are most likely to be violated. This includes spatial and temporal correlations between components; spatial variability, including a model for misalignments; amplitude variability across subjects and components; a (weakly) nonlinear HRF that varies over both subjects and space; and spatial and temporal smoothness in the residuals. This extends previously published analyses [Harrison et al. 2015;Bijsterbosch et al. 2019], and all simulation code is available from git.fmrib.ox.ac.uk/samh/PFM_

Simulations.
Specifically, we simulate data containing 15 components in a group of 40 subjects, each with two runs containing 10,000 voxels and 500 timepoints at a TR of 2.0 s. A more detailed overview of the data generation procedure is provided in Appendix F. We then test how well PROFUMO and ICA-DR can recover the ground-truth parameters, pooling results across 10 different simulated datasets. Finally, to give more detailed insights into the performance of ICA-DR we include several intermediate steps: firstly, to separate the performance of ICA and dual regression, we include a dual regression analysis starting from the ground-truth spatial maps (GTg-DR); secondly, we include the thresholded variant of dual regression proposed by Bijsterbosch et al. [2019] which is designed to reduce the observed bias in functional coupling (ICA-DRt, GTg-DRt).
Four key performance metrics are shown in Figure 1, and a much more detailed set of comparisons is included in the Supplementary Material. PROFUMO is able to accurately recover spatial maps, amplitudes and functional coupling network matrices (netmats), and much more so than either ICA-DR or the improved thresholded variant (ICA-DRt).
Crucially, the inferred PFMs are also unbiased in the presence of spatiotemporal correlations between components, unlike ICA-DR. What Bijsterbosch et al. [ibid.] demonstrated was that inaccurate estimation of the group-level spatial correlation structure-an inevitable consequence of the orthogonality constraints of ICA-leads to biased estimates of functional coupling. What we show here is a stronger result: this effect is present even when starting from the correct group-level spatial maps (GTg-DR). In this case, the effect is driven by the mismatch between the true subject-level spatial correlations and those between the Cosine similarity Spatial maps Temporal netmats Amplitudes (cross-subject) Spatiotemporal netmat interactions (cross-subject) Figure 1: Performance of different algorithms on simulated data. For each metric, optimal performance is shown by the horizontal green line. The metrics are accuracy in recovery of the subject-specific spatial maps, recovery of the run-specific network matrices (netmats), recovery of cross-subject differences in amplitudes (as different approaches normalise the data differently, we look at relative changes in amplitudes across subjects), and any biases in the recovered temporal correlations towards the spatial correlation structure. As well as PROFUMO and ICA-DR, we test dual regression starting with the ground-truth spatial maps (GTg-DR) and thresholded dual regression (ICA-DRt, GTg-DRt).
group-level maps. In other words, this bias will be present for all dual regression analyses, however the group-level maps are generated.
Furthermore, in the Supplementary Material, we repeat the simulations but with the addition of structured noise, including subject-specific artefacts that can be either spatially specific or global. While the differences between methods are less pronounced, there are still clear benefits to using PROFUMO. However, performance does suffer, and, as such, we strongly recommend ICA-based artefact removal before running PROFUMO, as is the case for the two empirical datasets presented here.

Human Connectome Project data
To evaluate the ability of PROFUMO to detect subtle subject-specific variations in functional connectivity, we use data from the Human Connectome Project (HCP) [Van Essen et al. 2012b;. This is for two main reasons. Firstly, the most recent data release includes high-quality functional data from over 1,000 subjects and, as such, is an ideal test for methods that purport to be suitable for population-level studies as mentioned in the Introduction. Secondly, the functional pipeline has been published [Smith et al. 2013a] and the results are available to download-thereby offering a comparison that is independently verifiable. The pipeline uses spatial ICA and dual regression to characterise subject variability in both spatial and temporal features. While it would also be possible to examine the equivalent pipeline based on temporal ICA, this has not been used so extensively: for example, the HCP's MegaTrawl analyses are based on the spatial ICA pipeline 11 . Similarly, this pipeline does not make use of the new thresholded variant of dual regression. Based on the simulated data, this would improve the results slightly, though PROFUMO still outperforms this variant on essentially all of the metrics we tested. Again, the aim is to use the existing, publicly available results as a baseline.
A key aim of modern, large-scale studies of functional connectivity is to relate neurobiological changes to individual differences in genetic, lifestyle and behavioural factors. Using the HCP data also allows us to do this by comparing our results with a wide range of information about subjects. The data involves a battery of cognitive tests, and also records a range of metrics based on health and lifestyle: we will refer to differences in these as subject variability in behavioural measures. We can indirectly assess the effects of genetics and environement by calculating the heritability of key imaging metrics; we do this by utilising the fact that many twins and siblings were involved in the study. Finally, we can examine subject variability in structural measures by relating functional measures to the thicknesses, areas and volumes of key cortical and subcortical structures as derived from the structural MRI scans . In this way, we can quantify to what extent different methods are able to capture key aspects of functional 3. Results bioRxiv variability, and if there are meaningful relationships with other measures.
A more detailed overview of the data, and the tests we carry out here, can be found in Appendix G.

Analyses
Both PROFUMO and spatial ICA were run at a dimensionality of 50, at which point the modes were reordered for visualisation and noise components-or, in the case of PROFUMO, modes eliminated by the Bayesian model complexity penaltieswere removed. Even on the extensive and high-quality HCP data, PROFUMO does not identify more than 50 PFMs: when run at higher dimensionalities, more PFMs are simply eliminated from the model. We discuss why PROFUMO is likely to be conservative in this regard in more detail later.
For the full HCP data, PROFUMO therefore infers the posterior over approximately 25,000,000,000 parameters (1,000 subjects, 100,000 grayordinates, 50 modes, 5 parameters per grayordinate). In terms of computational requirements, this analysis took approximately 110 hours using 18 cores on a single compute node, and memory usage peaked at 350 GB.
Finally, note that subsequent figures display spatial maps on the cortical surface for simplicity and concision. However, all grayordinates (comprising approximately 60,000 cortical vertices and 30,000 subcortical voxels  were used in all analyses.

Overview of the PFM spatial model
To begin with, in Figures 2 and 3 we show examples of the group-and subject-level spatial maps for four PFMs in order to demonstrate the richness of information contained within the PFM model. We do this to emphasise that PROFUMO is able to identify PFMs with strong spatial relationships with one another (in terms of overlap and anti-correlations), while at the same time being able to identify complex, subject-specific reorganisations of the group templates.
The most striking feature of the subject maps in Figure 3 is simply how much variability relative to the group maps there is. These results are from data already aligned using surface-based registration driven by functional features, which arguably represent the current 'gold-standard' for warp-based registrations [Glasser et al. 2016b;Coalson et al. 2018]. Despite this, and as we and several others have demonstrated, there are pronounced differences between subjects, with both shifts in the relative location of functional regions over surprisingly large distances, and complex, non-homotopic splittings and reorganisations of the regions themselves. Furthermore, as highlighted in the figure, even though the PFM itself is large, there are several subject-specific features that are too small to be accurately represented at the typical spatial scale of parcellations applied to fMRI.
However, while the descriptions of modes in terms of the mean group-or subject-level spatial maps are familiar, a key advantage of the PFM framework is the more detailed group-level parameterisation. In other words, we can go beyond simply noting the degree of subject variability: we can now quantify it in detail on a per-mode level. In Figure 4 we again take the default mode network [ Figures 2 and 3] as an example and plot the four key group-level spatial parameters: the probability that a given voxel belongs to the DMN, the mean and variability over subjects of the signal component of the DMN's voxelwise weights, and the standard deviation of the spatial noise component. The information encoded by the mean weights is familiar, but the other parameters add novel and complementary information. For example, the memberships [Panel (b)] demonstrate that default mode activity is distributed over a surprisingly large area, with consistently detected activity across much of the lateral prefrontal cortex. This is an effect that has been captured by several recent, high-powered single-subject analyses [Gonzalez-Castillo et al. 2012;Laumann et al. 2015;Poldrack et al. 2015;Huth et al. 2016]. However, while the activity is widespread, it is also distinct: the areas of high and low probability are sharply delineated. Similarly, the standard deviations [Panel (c)] add extra information by telling us about variability in the size of the weights-that is, in the strength of the detected activity-and we can see that, in this instance, the activity in the inferior parietal lobule is much more variable in strength across subjects than that in the precuneus.
This detailed characterisation of non-homogeneous variability across the cortex is a key advantage of the more complex group-level model we have adopted, and we expand upon this in Figure 5. This summarises the membership probabilities and weight standard deviations across all modes. There is a clear pattern whereby association cortex contains more overlapping modes than sensory cortices [Panel (a)], and that the spatial weights are also more variable in association cortex [Panel (b)]-note how this is in agreement with the results of Mueller et al. [2013]. Finally, the uncertainty in the memberships themselves [Panel (c)] tells us about shifts in locations between subjects. For example, note the very clear area of variability in medial frontal cortex between SMA and pre-SMA [Johansen-Berg et al. 2004]. This metric is presumably particularly sensitive to this region because variability here tends to manifest itself as relatively simple anterior-posterior shifts of the SMA/pre-SMA boundary, whereas more complicated 2D rearrangements of overlapping PFMs are present elsewhere. In summary, the PFM spatial model captures familiar group-level modes, and exhibits many of the complex subject-specific rearrangements already described in the literature. However, the key advantage is the way in which we have parameterised this model. Crucially, the richness of the group description allows us to make specific claims about the patterns of variability across the population that are ordinarily hard to tease apart.

Comparison with spatial ICA
To begin with, we examine the performance of the different models in terms of their inference of the group-level spatial descriptions. In Figure 6 we plot the similarity between these group-level descriptions.
There are several key points to note. Firstly, there are strong spatial correlations between the PFM maps, especially within the different categories. By way of contrast, the independence assumptions in spatial ICA preclude this. Secondly, PROFUMO is relatively conservative: it only infers 36 signal modes compared to the 48 found by ICA, and the difference is particularly pronounced in the subcortical regions. This subcortical difference is predominantly driven by the different signal properties of the HCP data between cortical and subcortical grayordinates, and the different data normalisation strategies the two algorithms use. The result is that ICA tends to find subcortical regions appearing in components without much cortical involvement, whereas PROFUMO tends to find subcortical regions appearing in components with cortical involvement. Finally, despite the above differences, there is fundamentally a strong relationship between the two sets of maps. Most cortical modes appear in both decompositions, and often look fairly similar; this is encouraging, as we do not expect a radically different patterns of functional connectivity at the group level given how many published methods have converged on similar descriptions. Figure 6: Spatial similarity between the sets of group-level spatial maps as inferred by PROFUMO and ICA. Modes were split into five categories and reordered: visual (Vis); motor (Mot); auditory (Aud); cognitive (Cog); and subcortical (Sub). This ordering is used for all subsequent sections.

Properties of subject variability in spatial organisation
Given that the group-level descriptions are fairly similar between PFMs and sICA, the obvious question are to what extent does the extra group-level information in the PFM model regularise the subject-specific decompositions, and in what ways do the subject-specific maps diverge from the group-level representations? We deal with the former first, and in Figure 7 we look at that the consistency of the subject maps as inferred by PROFUMO and the ICA-DR pipeline. As expected given the regularisation from the more complex group-level priors, the PFM maps are much more consistent across subjects.  Figure 7: Similarity between the subject-specific spatial maps, for both PFMs and ICA-DR, as inferred from the HCP data. For each voxel and in every pair of subjects, we compute the Pearson correlation coefficient between the twodimensional vectors of mode weights. The maps shown here are the correlation coefficients averaged over every pair of subjects.

ICA-DR PFMs
However, this increase in consistency could also be explained if the subjectspecific PFM spatial maps were simply pushed closer to the group maps by the priors, thereby being less faithful to the 'true' patterns of functional connectivity at the subject level. While this does not appear to be the case for the exemplar subject maps, what we really want to quantify is whether they are capturing 'interesting' aspects of subject variability in spatial organisation. In other words, are the differences between the approaches meaningful, and do they make different predictions about the subjects themselves?
To investigate this, we use the fact that the HCP includes data from twins and siblings to investigate the influence of genetics and environment. We estimate the voxelwise broad-sense heritability of the subject-specific spatial maps we observe: in each voxel and each subject, we extract the vector of PFM or ICA map weights, and look to see if these weight vectors are more consistent in monozygotic than dizygotic twins (see Appendix G for full methodological details). The results of this analysis are shown in Figure 8.
The results show a clear increase in heritability for the PFM spatial maps, suggesting that they are more sensitive to subject variability that we can attribute to genetic factors. Furthermore, this is not simply attributable to a reduction in noise or as the result of the priors pushing the subject maps closer to the group. While the PFM maps are more consistent across subjects than ICA-DR [ Figure 7], the heritability relates to the difference in consistency between monozygotic and dizygotic twins and, as such, a global increase in consistency is not enough to explain the increased heritability. We can also gain further insights into this observation by utilising the HCP's retest data. 46 subjects underwent the full HCP imaging and behavioural testing protocol twice, of which there is full rfMRI data from 42. This allows us to examine how the algorithms perform on the hitherto unseen retest scans. The group-level representations from the full data (i.e. the ICA spatial maps, and the group-level PFM posteriors) were used to derive new subject maps from the independently acquired retest data.

ICA-DR
PFMs Group 149337 Retest Figure 9: Example spatial maps for the language mode, for both PFMs and ICA-DR, as inferred from the full HCP data and the HCP retest data for subject 149337.
Only the left lateral surface is shown.
In Figure 9 we compare subject-specific realisations of the language mode as derived by PROFUMO and the ICA-DR pipeline. This particular mode was chosen because a characteristic split in area 55b in some subjects was reported and examined in some detail by Glasser et al. [2016a]. In terms of a comparison between PROFUMO and ICA-DR, both are clearly sensitive to the same gross re-organisations that occur. For example, both can detect the rearrangement of area 55b in the original and retest data for the subject shown here. However, the most marked difference is in the noise-level and appearance of anticorrelations. Relative to ICA-DR, the PFMs show much reduced background noise in regions not associated with the networks, and do not exhibit anticorrelations (indicated by negative weights, shown in blue) tightly interposed between positive weights. This is presumably a simple consequence of dual regression's inability to separate signal from noise, as we discussed in the section on noise modelling. By way of contrast, the information encoded by the group-level parameters in the PFM model suppresses the background noise in regions that are not part of the language network, but in a way that does not preclude inferring complicated rearrangements of functional regions.
To assess the reliability of the different decompositions on the retest data more 3. Results bioRxiv quantitatively, we assess the specificity of the inferred spatial maps as 'fingerprints' that uniquely identify different subjects [Finn et al. 2015;Horien et al. 2019]. This is shown in Figure 10.

ICA-DR PFMs
(c) (d) (a) (b) Figure 10: Specificity of the subject-level spatial maps as inferred from both the original and retest HCP data by PROFUMO and ICA-DR. The group results from the full data are used to derive subject-specific spatial maps in the unseen retest data. In (a) we show the similarity of the inferred maps in the same subject, seperately for each mode. In (b) we calculate the fingerprint specificity, or how much more similar the maps in the same subject are as compared to maps from non-matching pairs of subjects, averaged over modes. This is equivalent to the difference between the diagonal and the off-diagonal elements (calculated for each column separately) in the full simmilarity matrices as shown in (c) and (d).
Firstly, we compute the spatial similarity between the new subject-specific spatial maps from the retest data, and the original set from the full data, for every pair of subjects. We pool these retest results over all modes and subjects, and this is shown in Panel (a). Again, the subject-specific PFM maps are much more consistent across the two acquisitions.
Secondly, we assess whether this leads to more specific fingerprints. In Panel (b) we show that the fingerprint specificity (i.e. the amount by which the two sets of maps from the same subject are more similar than paired maps from different subjects) is also higher for the PFMs. In other words, not only are the maps generally more consistent across subjects, but there is an increase in subject specificity above and beyond this effect.
In summary, the comparisons with ICA-DR have demonstrated that while 3. Results bioRxiv the group-level descriptions are similar, the more complex hierarchical modelling in PROFUMO allows us to infer spatial maps that are more consistent-on both the original data and the held-out retest data-as well as being more specific and capturing more informative aspects of cross-subject variability.

Overview of the PFM temporal model
Here, we briefly give a summary of the key temporal parameters-that is, the amplitudes and the functional coupling between modes-as inferred by PROFUMO on the HCP data. Note again that these are new parameters: in other words, it was only possible to investigate these in a post-hoc fashion based on the previous PFM model. Firstly, in Figure 11 we plot the cross-subject correlations between the mode amplitudes, as captured by the parameter. Encouragingly, we see a clear replication of the results of Bijsterbosch et al. [2017], who reported strong correlations between the amplitudes of sensorimotor modes, as well as between cognitive modes, but relatively weak correlations across the two categories. However, the crucial difference between this result and the original observation is that this behaviour was initially demonstrated from a purely post-hoc analysis of the ICA-DR results, whereas it is explicitly parameterised and inferred within the PFMs model. What this means is that this knowledge of the systematic relationship between mode amplitudes is available during inference, and it is therefore naturally incorporated as an extra factor regularising the subject-specific decompositions. Figure 11: Cross-subject relationships between the amplitudes of the PFMs, as inferred from the HCP data. For visualisation purposes, we display the posterior precision matrix, , after transforming it to both full and partial correlation coefficients.
Secondly, in Figure 12 we plot the PFM functional coupling parameters, and ( ) (these represent the group-and subject-level temporal network matrices respectively). What is striking is how weak the functional coupling is between modes in the group-level network matrix (netmat), especially given that we have an explicit hierarchical model to allow for just these interactions. This is not trivial to explain away as a spatial effect either: despite the fact that these interactions are more similar to what we would expect from temporal ICA, the PFM spatial maps are similar to those inferred by spatial ICA which typically infers strong functional coupling between modes. We quantify the implications of this different view on functional coupling from the PFM model in the following section. Figure 12: Group-and subject-level functional coupling between the PFMs, as inferred from the HCP data. For visualisation purposes, we display the posterior parameters (group-level) and ( ) (subject-level) as partial correlation coefficients. As in Figures 3 and 9, subject 149337 is chosen as the exemplar.

Multivariate relationships with behavioural variables
How then, are we to interpret the differences between the PFM and ICA-DR approaches? Do they simply represent a different trade-off between sensitivity and specificity in the spatial and temporal domains, or are they telling us something fundamentally different about brain activity?
To probe this further, we performed a series of multivariate analyses to investigate the different ways in which the two models encode cross-subject information. Like in Smith et al. [2015], canonical correlation analysis (CCA)-a multivariate analysis technique used to find the linear relationships between sets of variables [Hotelling 1936]-was used to summarise the key correspondences (see Appendix G for methodological details). Furthermore, as some sets exhibit more than one strong linear relationship, we use the RV coefficient [Robert and Escoufier 1976] to give a principled summary of the multivariate information reported by the CCA. In Figure 13, we examine the full set of pairwise relationships between the behavioural and structural variables from the HCP, and the spatial maps, amplitudes and network matrices from both PROFUMO and ICA-DR.
There are several key results we can glean from this analysis. Firstly, the cross-subject information captured by the different aspects of the PFM model is relatively distinct. Comparing the similarity between the PFM measures with those for the ICA-DR variables (i.e. the on-diagonal blocks), we can see that the scores are typically lower for the PFMs. In other words, the temporal measures derived from the PFMs carry relatively different information from the spatial measures about the subjects themselves, at least compared to their ICA-DR equivalents. Figure 13: Relationships between the cross-subject information encoded by different analyses. The non-functional variables (NFVs) have been separated into variables from the HCP's battery of behavioural tests, and variables derived from structural MRI relating to brain size and morphology. On the left we plot the log RV coefficient calculated between the subspaces of the top ten CCA components as calculated between every pair of sets of variables, and on the right we reproduce the relationships with the non-functional variables (i.e. the top two rows / two leftmost columns) as a bar chart for ease of visualisation. Higher values of the RV coefficient indicate that more similar cross-subject information is being captured.

ICA-DR PFMs
Secondly, if we examine the relationships with the behavioural and structural measures in the bar graph on the right, there are several striking differences between the methods. As we would expect from our earlier analyses, the PFM spatial maps are the best predictors of structural variables. They are also good predictors of the behavioural variables, though slightly less so than the ICA-DR netmats. However, the stories for the temporal information are very different. The PFM amplitudes, fABT and netmats are relatively poor predictors of both behavioural and structural variables, though, intriguingly, they are better predictors of behaviour than structure. By way of contrast, the ICA-DR amplitudes and netmats are better behavioural predictors, though surprisingly they are also good predictors of structure (e.g. one can predict the sizes and thicknesses of cortical areas better than behavioural measures from the ICA-DR amplitudes).
Given the simulation results, the interpretation is relatively straightforward: the ICA-DR pipeline contains inherent biases that conflate spatial and temporal information. Furthermore, even though we do not explicitly test it here, it is interesting to note that using the thresholded version of dual regression to correct this bias also reduces the correlation between temporal netmats and behaviour [Bijsterbosch et al. 2019]. In other words, and consistent with the results on simulated data, thresholded dual regression is an improvement on ICA-DR but is less accurate than the full PROFUMO model. The question that remains however, is what information, if any, is the PFM temporal model capturing if not the trait-like behavioural variables examined here? bioRxiv

Summary
Given the full set of results presented on the HCP data, the implication is that the PFMs, by virtue of the improved spatial modelling in particular, are better able to capture interesting information about cross-subject variability in spatial organisation. However, this does not address the relative lack of information encoded in the various temporal measures that PFMs capture. We address this point using another data set in the following section.

Active-state data
Given the way that subject variability in spatial and temporal features simultaneously co-varies with a wide range of non-imaging derived subject measures, it is very challenging to conclusively disambiguate them from studies like the HCP. However, if we manipulate the functional connectivity at the subject level, for example by changing the cognitive state [Shirer et al. 2011;Krienen et al. 2014;Vanderwal et al. 2017;Gratton et al. 2018;Kieliba et al. 2019;Salehi et al. 2020], then we can begin to examine temporal differences in more detail. Crucially, by looking at multiple conditions for the same subject we essentially eliminate the influence of structural variability from the functional data.
To do this, we use a dataset collected where subjects were scanned when in different active states-these are induced by performing simple, continuous tasks in the scanner, of which rest (i.e. eyes-open fixation) is just one [Duff et al. 2018;Kieliba et al. 2019;Sala-Llonch et al. 2019]. There are five runs for every subject, each collected under different steady-state conditions: a standard restingstate acquisition (Rest); a finger-tapping based motor task (Mot); a passive visual condition (Vis); an independent combination of the visual stimulus and motor task (V-M); and a condition where the specifics of the motor task changed based on the visual stimulus (V+M). A more detailed descriptions of the tasks and data itself can be found in Kieliba et al. [2019]. Furthermore, this dataset offers a validation of our method on data acquired using a more conventional sequence and scan duration than the HCP, with fewer subjects, shorter scan durations, and all analyses performed on volumetric rather than surface-based data.

Analyses
As per the modelling assumptions, PROFUMO infers one consensus spatial map per subject, but a separate set of time courses per run. We choose to infer run-specific temporal precision matrices, ( ) , with a consistent group-level hyperprior, , which is shared across all conditions. Note that we could have chosen to use condition-specific group-level priors, { ( ) } =1 , but this has the side-effect of invalidating the assumptions behind any subject-level statistics where we compare between conditions. In short, it reduces the cross-subject, within-condition variance which invalidates the typical null hypothesis we use. We leave the problem of performing statistical inference on these types of models for future investigations.

Results bioRxiv
We infer 30 modes for both PROFUMO and ICA-DR, which again seems to be close to the upper limit for PROFUMO on this relatively small dataset. Again, artefactual modes were eliminated and those remaining were reordered for visualisation. In terms of computational requirements, the PROFUMO analysis took approximately 12 hours using 15 cores on a single compute node, and memory usage peaked at 25 GB. Compared to the HCP analysis, the demands are higher than expected given the number of subjects for two reasons: firstly, the volumetric analysis contains over twice as many voxels as grayordinates; secondly, we do not do within-subject data reduction for this analysis.
For the ICA-DR pipeline, we use MELODIC [Beckmann and Smith 2004;Beckmann et al. 2005] to infer a set of group maps, followed by dual regression to generate the run-specific time courses.

Overview of the PFM model
In Figure 14 we show the group-level properties of the default mode as inferred from this data set. This is directly comparable with Figure 4, and simply demonstrates that we are able to infer similar summaries of the mode itself, and heterogeneous variability, from fourteen subjects rather than one thousand. In Figure 15, we demonstrate some of the properties of the inferred time courses from the PFMs. This data is more challenging than the HCP in that the runs are shorter, and the data has not benefited from resampling onto the cortical surface. Nevertheless, the HRF-based prior constraint results in a temporally smooth timecourse, which we are able to cleanly separate from the high-frequency noise which contaminates them. Furthermore, this is stable when we undo the temporal blurring that the HRF induces, with straightforward estimation of the underlying 'neural' process via whitening with respect to the autocorrelation induced by the HRF.
Finally, in Figure 16, we display examples of the network matrices to illustrate the typical patterns of, and subject variability in, the functional coupling between PFMs. Interestingly, in this data, PROFUMO infers PFMs with much stronger functional coupling between them than at the run level from the HCP data.

Comparison with ICA-DR
One would hope that the PFM model allows us to more accurately infer the true functional coupling between modes. To begin with, we look at the relationships between the condition-specific network matrices as inferred by PROFUMO and ICA-DR. These are shown in Figure 17. While the PFM network matrices are less consistent between conditions and subjects than their ICA-DR counterparts, there is some indication that there is condition-specific modulation across subjects (as indicated by the block diagonal). By way of contrast, the ICA-DR network matrices are dominated by the subjects themselves (i.e. the multiple strong off-diagonal lines in the ICA-DR plot), with no real indication of condition-specific modulations.
In summary, ICA-DR computes netmats that are more similar within subjects than they are within conditions across subjects. By way of contrast, PROFUMO infers netmats that are somewhat more similar within conditions than within subjects. Again, this suggests that the different models for subject variability in spatial organisation have a profound influence on downstream estimates of functional connectivity.
Next, we test whether the different conditions induce focal changes to the between-mode patterns of functional connectivity. The results of a statistical analysis that looks for modulations at the level of individual network matrix edges are shown in Figure 18. Both the PFM and ICA-DR pipelines detect changes in the coupling of visual regions induced by the visual stimulus, and it appears they both have similar sensitivity to the changes in coupling induced by the changes in cognitive state. There are some differences between the methods: for example, the visual changes detected by PROFUMO are more consistent across the three conditions with visual stimuli than for ICA-DR. Similarly, the types of changes for the combined visuo-motor condition are somewhat different, with ICA-DR finding changes in amplitude predominantly, whereas there are more changes in coupling for PROFUMO.
However, the results are fundamentally fairly similar and the numbers of edges Group s25; Rest s25; Mot s25; Vis Figure 16: Example PFM network matrices, capturing the functional coupling between the mode timeseries. We display the group network matrix alongside the network matrices from subject 25 in the rest, motor and visual conditions. As in Figure 12, we display the posterior precision matrices (i.e. for the group level and ( ) at the run level) as partial correlations. Modes were split into three categories and reordered for visualisation of the network matrices: visual (Vis); motor (Mot); and cognitive (Cog). Figure 17: Correlations between the network matrices, for both PFMs and ICA-DR, as inferred from the active-state data. The network matrices are grouped by condition, and the subjects have a consistent ordering within each block. Correlation is the Pearson correlation coefficient between the unwrapped uppertriangle of the network matrices.

Mot
Vis V-M V+M Figure 18: Changes in between-mode functional connectivity as induced by different active states relative to the rest condition. The raw difference between the group mean network matrix during the active condition and during rest is shown above the diagonal, and any significant changes (p<0.05) are highlighted by blue or red squares, for increases and decreases in coupling respectively, below the diagonal. Changes in amplitudes are shown on the diagonal. All tests were family-wise error corrected and computed using the accelerated permutation inference in PALM [Winkler et al. 2014;. The black dots denote elements that were significant under an f-test over all contrasts. As per Figure 16, modes were split into three categories and reordered for visualisation of the network matrices: visual (Vis); motor (Mot); and cognitive (Cog).
that exhibit significant changes is relatively low-and, perhaps, lower than we might expect given the strong manipulations of cognitive state 12 -suggesting that the statistical power might be the limiting factor here, especially given that there are only 14 subjects included in this analysis. Finally therefore, we do one further set of tests to probe whether the multivariate information in the network matrices and amplitudes captures condition-specific information. Repeating the analysis of Sala-Llonch et al.
[2019], we investigate whether a support vector machine (SVM) can be trained to distinguish between network matrices from different conditions. The accuracy of the SVM classification is tested using a leave-one-subject-out cross-validation framework , of which we provide more methodological details in Appendix H. As well as comparing PROFUMO and ICA-DR in this way, we additionally examine the effect that the haemodynamic model has on the temporal information that we infer. In other words, can the changes to estimates of functional connectivity be attributed to the advanced spatial modelling alone, or does the regularisation in the time domain improve our estimates too? As well as the explicitly inferred PFM network matrices, we do a post-hoc estimation of the temporal network matrices based on both the BOLD time courses and the combined time courses (i.e. ( ) , which includes both the BOLD and noise time courses) to assess what, if any, effect the modelling hierarchy has.
The results from the SVM analysis are presented in Figure 19. The SVM achieves a significantly better classification accuracy when trained on the PFM netmats, as opposed to those estimated by ICA-DR. Again, this suggests that by correcting for subject variability in spatial organisation the PFM framework allows us to estimate state-induced changes in functional coupling with greater fidelity. By way of contrast, the conflation of spatial and temporal information by ICA-DR masks these more subtle state-related changes in functional coupling. Finally, there appears to be a distinct performance improvement when using the inferred PFM network matrices, suggesting that the hierarchical temporal modelling is advantageous and that we are not discarding relevant information by focusing on the predominantly low-frequency HRF-derived time courses. Figure 19: Posterior classification rates for a multi-class SVM trained to distinguish between the different active-state conditions. The results on the left are when the off-diagonal elements of the network matrices are fed in, and the results on the right are when the amplitudes are used as features. Posterior densities are based on the number of correct and incorrect classifications out of the full set of 70 tests (14 subjects; 5 conditions), combined with Haldane's uninformative beta prior [Haldane 1932]. The modes of the distributions are shown by the black bars, and the chance level is shown by the dashed blue line. The two p-values are calculated via McNemar's test (mid-p variant) and Bonferroni corrected [Fagerland et al. 2013]. For the PFM netmats, the variants are: PFMs: network matrices inferred as part of the PFM model, ( ) . PFM (BT): network matrices estimated as the partial correlations between the PFM BOLD time courses ( ) . PFM (CT): network matrices estimated as the partial correlations between the combined time courses ( ) = ( ) + ( ) .

Discussion
In summary, the results presented above demonstrate three key attributes of PRO-FUMO. Firstly, the algorithm is applicable to modern, large-scale data, whereby it is exquisitely sensitive to cross-subject variability in spatial organisation. Secondly, the joint inference framework allows estimation of subject variability in temporal features that does not appear to be confounded by spatial differences, which at times leads to a radically different view of functional connectivity. Finally, the implication of these results is that after accounting for spatial variability, the functional coupling between modes is much more reflective of current cognitive state rather than trait-like qualities.
Furthermore, we have shown that there is significant value added in terms of interpretability from the practitioner's point of view in using models of this form.
To give a few concrete examples, we can only make the claims pertaining to the dissociation of non-homogeneous spatial variability-as illustrated in Figure 5-if we can both consistently identify equivalent functional systems across multiple subjects and model the different ways in which variability can arise. Similary, the ability to capture cross-subject amplitude effects (Figure 11) or use the model to define alternatives to, for example, fALFF-type measures ( Figure 13) means that many of what would have been post-hoc analyses can be simplified and made more interpretable.

Group-versus subject-level approaches
The comparisons in this paper have been with ICA-DR, as this is probably the most common method for finding functional modes from resting-state data and is a key part of the HCP's pipelines. However, while PROFUMO and ICA-DR try and infer on many of the same quantities, they make fundamentally-and not necessarily compatible-assumptions about the data itself.
The key difference between the two is the way PROFUMO entails a holistic model for group-and subject-level representations, whereas ICA-DR assumes they are separately estimable. The majority of group-level ICA methods assume all subjects are in a common space, and proceed to analyse the data without recourse to individual decompositions. This formulation gives much more flexibility for the group-level decomposition to utilise the extra statistical power that concatenating over subjects affords, which means that the ICA modes depart-at times fairly radically-away from what we can resolve at the subject level. As such, ICA seems to be able to identify up to several hundred plausible components, that ultimately begin to resemble a parcellation [Kiviniemi et al. 2009;Smith et al. 2013b].
However, what we show here is that group-level representations are not enough. In the simulated data, even if the ground truth is known at the group-level, the subject-level information inferred by dual regression will be biased and noisy.
What PROFUMO attempts to do is to model as many different facets of multisubject rfMRI data together as is plausible. Here, we expand on two concrete implications of this approach as compared to other methods.
Firstly, the implication of the joint subject-level modelling in PROFUMO is that for a mode to appear at the group-level it has to be resolvable in the majority of individual subjects. Therefore, this engenders a fundamentally different view on what the dimensionality of the data is. The Bayesian model complexity penalties seem to result in no more than thirty or forty PFMs being identified, essentially regardless of the pre-specified model dimensionality. While more subjects do offer increased regularisation of the subject-level modes, this can only do so much. This is why the inferred number of PFMs is on the same order as the number of signal components as inferred by ICA-FIX (23.3 ± 6.6 at the run-level for HCP data [Marcus et al. 2013]).
Secondly, we have demonstrated the importance of modelling different characteristics of the data together. In the simulated data, even if the ground truth is known and thresholded dual regression is used to reduce the inherent spatiotemporal biases [Bijsterbosch et al. 2019], PROFUMO is still more accurate than ICA-DR like approaches. Similarly, in the classification of the active-state data, there are clear performance benefits from modelling the netmats hierarchically, even after the spatial variability has been accounted for. This is not to say that the PROFUMO model is perfect, as it clearly contains many simplifying assumptions. However, it is at least an internally consistent framework within which one can begin to explore the implications of different modelling decisions.

Spatial representations
One of the key messages from this work, in line with other recent reports [Hacker et al. 2013;Harrison et al. 2015;Laumann et al. 2015;Glasser et al. 2016a;Gordon et al. 2016;Braga and Buckner 2017;Gordon et al. 2017a;b;Kong et al. 2018], is that complex rearrangements of functional regions in individual subjects are ubiquitous and of a surprisingly large spatial scale. Figures 3 and 9 provide reasonable examples of these effects. Even after the advanced multi-modal, surface-based registration employed by the HCP, one often observes spatial rearrangements where subject-specific features are shifted relative to the group by many millimetres.
The difficulty we face when working at the group-level is that the summary features we extract are not necessarily representative of those at the subjectlevel; they are, and should always be thought of as, probabilistic representations [Van Essen and Dierker 2007]. As discussed in the previous section, we cannot automatically expect that it will be straightforward to project group-level results back to meaningful characterisations of functional connectivity at the subject level. Furthermore, the characteristic size of misalignments probably represents a limit in terms of the size of functional features we can project from the group back to the subject-level; while the native resolution of the subject-level data may well be higher, methods that work on the functional data alone like ICA-DR or PROFUMO will always struggle in the absence of additional constraints if the misalignments are large enough to mean some regions do not overlap with their group-level homologues at all.
In other words, misalignments are now often larger in scale than the fundamental resolution limits imposed by the physics and physiology that governs the properties of the data itself. Subject-level representations are limited by the properties of the data itself: 2 mm isotropic voxels are now common, and the spatial characteristics of the HRF do not appear to blur much beyond this [Shmuel et al. 2007]; at the group-level, the effective resolution of the data relates to the characteristic size of these residual misalignments between subjects, and these are likely to be larger. What this means is that functional MRI currently occupies an interesting liminal space, where the spatial resolution of high-powered single-subject analyses can now surpass that of studies that employ multitudes of subjects. This probably explains the recent resurgence of exploratory studies based on small numbers of subjects [Gonzalez-Castillo et al. 2012;Raemaekers et al. 2014;Laumann et al. 2015;Poldrack et al. 2015;Huth et al. 2016;Braga and Buckner 2017;Gordon et al. 2017b;Salehi et al. 2020]. Fortunately, recent work has suggested that there is scope to further reduce the size of the residual misalignments [Guntupalli et al. 2018], and use multi-modal data to help identify regions at the subject-level [Glasser et al. 2016a], both of which will be essential parts of the push towards finer spatial scales.
Finally, these observed spatial differences also have implications for parcelbased analyses. Given the many fine-scale variations in the spatial maps and the amount of overlap between PFMs, it may be that we need multivariate analysis techniques that go beyond one summary time course per parcel to capture the richness of the functional data at sub-parcel spatial scales [Geerligs et al. 2016;Anzellotti and Coutanche 2018;Haak et al. 2018].

Interpreting spatiotemporal connectivity patterns
One of the striking differences between PROFUMO and ICA-DR is their inferred patterns of functional coupling between regions. Not only do these suggest fundamentally different group-level coupling strengths, but the predictive power at the subject and run level is also different. Whereas ICA-DR netmats primarily correlate with trait-like properties, PROFUMO netmats are more sensitive to changes in cognitive state. Here, we expand on these observations as a final discussion point.
Clearly, there is a complicated relationship linking spatial variability and the functional coupling between modes, and indeed concerns about the interpretability of functional connectivity in the presence of anatomical variability are far from new [Brett et al. 2002]. The effect that subject variability in spatial organisation might have on its temporal counterpart has been noted in simulation studies. For example, E. A. Allen et al. [2012] observed a sharp decrease in the ability of a variant of ICA-DR to detect subject-specific modes in the presence of subject variability in spatial organisation, an effect which was compounded by spatial overlap between 4. Discussion bioRxiv modes 13 . This links to functional coupling via the work of Smith et al. [2011], who noted that if ROIs were misspecified such that the time courses contained a range of contributions from the true underlying regions, then '[t]he results are extremely bad'. It is the latter result in particular which is particularly shocking: if we do not extract accurate subject-level estimates of functional regions then it is essentially impossible to characterise the functional coupling between them.
Furthermore, a key claim of the related recent work on subject variability in functional connectivity by Bijsterbosch et al. [2018] is that it is not possible to make meaningful claims about what drives cross-subject changes in functional coupling between regions if said regions are not properly delineated at the subject level. In other words, spatial variabiality does not simply make it harder to estimate functional coupling, it can also fundamentally bias our inferences. Again, the results here-particularly the simulations-extend these results, showing that the way in which dual regression biases functional connectivity estimates away from the spatial correlation structure [Bijsterbosch et al. 2019] is really an inherent property of mapping between group and subject levels in this way. While this bias can be reduced with the thresholded variant of dual regression, the simulation results, and short theoretical analysis on the role of noise, suggest that the PFM model will be much more performant than this variant.
What we show here with regards to the predictive power of the PROFUMO netmats is that, in line with other work Pervaiz et al. 2020], they are relatively poor predictors of trait-like quantities. Instead, we have shown that they are much more predictive of current cognitive state. However, for analyses that try to use functional coupling to make predictions about individual subjects [Abraham et al. 2017;Dadi et al. 2019;Pervaiz et al. 2020], the ICA-DR netmats are likely to produce more accurate predictions. In that case, one has to contend with the fact that the induced biases reduce the interpretability of the findings, which may or may not be desirable depending on the specifics of the problem at hand [Stephan et al. 2015;. Of course, the presence of confounds that are themselves behaviourally relevant-such as head motion [Power et al. 2012;Satterthwaite et al. 2012;Van Dijk et al. 2012;Couvy-Duchesne et al. 2014;Hodgson et al. 2017;Laumann et al. 2017], physiological noise [Power et al. 2017;Glasser et al. 2018] or brain volume [Bartley et al. 1997;McDaniel 2005;Qing and Gong 2016]-makes this problem of interpretability very challenging in practice for any method.
The results we have presented here suggest that the spatial information encoded by PROFUMO is likely to give much better predictive performance in this context. This is similar to other work which has demonstrated increased performance of spatial features such as, for example, task-based maps  or parcel topography [Kong et al. 2018], and, furthermore, that this has a close relationship with structural information [Llera et al. 2019]. The obvious questions are therefore why do spatial rearrangements of functional regions seem 13 Figures 4 and 5 in particular.

Conclusions
bioRxiv to be so predictive in cross-subject analyses, and how do we interpret them? One hypothesis is that this variability in spatial organisation of functional regions is simply reflecting variability in the brain's macroscale structure, for which there are already well established links between environmental, genetic and lifestyle factors [Reiss et al. 1996;Shaw et al. 2006;Stein et al. 2012;Douaud et al. 2014;K. G. Noble et al. 2015;Elliott et al. 2018].
However, it would be an enormous surprise if this reductionist reading of these functional changes as simply reflecting structural variability is the whole story, especially after the registration approaches used. Rather, it is vitally important to understand both what mechanisms give rise to these spatial changes, and, in particular, what unique information does the functional variability carry over and above what can be derived from other techniques and modalities.

Conclusions
All analyses of complex, multivariate functional data require us to make simplifying assumptions, and, as such, the results we see are inevitably coloured by the modelling choices we make. This might involve, for example, deciding decide whether to run a parcel-or mode-based analysis, or when choosing which specific method to use. As such, it is essentially impossible to conclusively determine whether one method more accurately characterises the general organisational principles or subject variability from the functional data alone. However, we feel that the above results demonstrate that PROFUMO and the PFMs model are providing a novel and worthwhile perspective on the analysis and interpretation of functional MRI data. We hope that this approach-by virtue of having a model tailored to the properties of fMRI data, the enhanced spatial sensitivity and specificity, and the way spatial variability is automatically accounted for when estimating functional coupling-proves useful.

Appendices A Alternative approaches
There are now several methods that characterise resting brain activity in terms of functional modes, both at the group and subject level. The standard pipeline is essentially a two-step process, where the group-level modes are estimated before some form of back-projection is used to extract subject-specific versions of these. Dual regression and related variants thereof, typically combined or integrated with a group-level spatial independent component analysis (ICA) [Calhoun et al. 2001;Beckmann et al. 2005], have been the de facto standard for analyses of the subject variability in spatial and, more recently, temporal features of modes for at least the past decade. Dual regression proceeds by regressing the group-level spatial maps into the data to get a set of time courses-from which subject variability in temporal features may be estimated via any number of functional connectivity metricsbefore regressing the time courses back into the data to get subject-specific spatial maps [Calhoun et al. 2001;Beckmann et al. 2009;Zuo et al. 2010b;Erhardt et al. 2011;Nickerson et al. 2017].
This approach has been extended over the years, with several proposed refinements to either the method for identifying group-level modes [Damoiseaux et al. 2006;Varoquaux et al. 2010;Lee et al. 2011;Smith et al. 2012;G. I. Allen et al. 2014;Hjelm et al. 2014;Karahanoğlu and Van De Ville 2015;Dohmatob et al. 2016], or to the way subject-specific information is extracted [Du and Fan 2013;Hacker et al. 2013;Zöller et al. 2019].
However, there have been several more extensive departures from the above framework that are more similar in spirit to the hierarchical PFMs model. For example, Varoquaux et al. [2011] and Abraham et al. [2013] proposed a more holistic model that finds a set of systems regularised by not only the group-level properties, but also by the consistency of both spatial and temporal information at the subject level. More recently, Li et al. [2017] introduced a model based on non-negative matrix factorisation (NMF) that jointly optimises subject-specific decompositions such that the spatial maps are both sparse and consistent over subjects, though without explicitly leveraging any information about temporal consistency.
As mentioned in the Introduction, these methods all have potential shortcomings in terms of the extent to which typical patterns of variability are learnt from the multiple subject-specific decompositions. These shortcomings are particularly apparent for dual regression type approaches, where the estimation of subject variability is completely post-hoc (and, moreover, the estimated subject variability in spatial organisation only indirectly informs the subject variability in temporal features), but it is also problematic for the more complex models which we have mentioned, for which no explicit parameterisation for the observed variability over subjects is inferred.
More recent methodological work has focused on deriving subject-specific parcellations, both based on a fixed group-level template [Dhillon et al. 2014;Wang et al. 2015;Glasser et al. 2016a;Chong et al. 2017;Gordon et al. 2017a;Salehi et al. 2017], and formulated as a hierarchical model [Liu et al. 2012;Langs et al. 2016;Kong et al. 2018]. However, while both mode-and parcel-based approaches have shown promise, our concern is that the subject variability in spatial organisation that has been reported often features reorganisations of a similar scale to our current best estimates of the sizes of distinct functional regions [Van Essen and Dierker 2007;Van Essen et al. 2012a], and as such, reliable identification at this scale is arguably beyond all but the most sophisticated, multimodal approaches utilising high quality data [Glasser et al. 2016a]. Therefore, in this work, we stick to a system-level description and base our method on a decomposition into a set of modes. Intuitively, a functional system is more protected from the deleterious effects of misalignment than a functional region in two key ways: firstly, functional systems have a greater spatial extent than parcels; secondly, a reorganisation of one region within a larger system can be straightforwardly corrected for if the other regions are relatively stable.

B Preprocessing
The aim of the preprocessing pipeline is to normalise the data such that it has a consistent scale across subjects, and that the properties of the unstructured noise follow the assumptions that are contained in the generative model. The approach is as follows.
• Voxelwise normalisation. For each voxel independently, the time course (i.e. ( ) ) is set to zero mean and unit variance. This ensures that each voxel has a roughly equal contribution to the SVD in the next step.
• Voxelwise normalisation of the noise subspace. Each voxel is independently normalised such that the variance of the unstructured noise is unity. This matches the assumption of isotropic noise in the generative model. The unstructured noise subspace is estimated via the SVD. The whole data matrix is decomposed and the components with the highest singular values are assumed to represent the structured signal subspace and are removed. The noise subspace is reconstructed from the remaining components, the variance is calculated in each voxel, and the data is renormalised on a voxelwise basis such that the variance becomes unity.
• Global normalisation of the signal subspace. There is one final degree of freedom remaining. The generative model assumes isotropic noise, but does not assume a fixed variance. Therefore we can apply a global renormalisation to set the overall variance of the modes we observe. As an approximation, if = and we assume independence over modes, then we can say that E . In other words, if the maps, amplitudes and time courses have unit variance then the signal variance will be equal to . Therefore we use another SVD decomposition and set the overall variance of the assumed signal subspace (i.e. the first components) to match the above by applying exactly the same normalisation to each voxel.

C Data reduction
The scale of modern rfMRI studies is now such that even manipulating all the data in its raw form simultaneously is impossible. For approaches that start by inferring group-level descriptions of the data, such as ICA, it is possible to use online algorithms that work by passing over the data sequentially Mensch et al. 2017], thereby removing the dependence between memory required and the number of subjects under study. However, our approach is explicitly designed to simultaneously extract group-and subject-level features, and as such we need the data from each subject to be available.
To facilitate analyses of large data-sets, we apply subject specific data reductions, but do not collapse these down further to group-level summaries. The approach we take is to approximate each run with a low-rank singular value decomposition (SVD). As our model is defined in terms of both spatial and temporal features, we have to retain both the spatial and temporal singular vectors. However, as the PFM model assumes that subject-specific spatial maps are conserved across all runs for a given subject, we make further savings by only maintaining a single set of spatial singular vectors per subject.
To do this, we calculate the SVD of the matrix formed by temporally concatenating all data from a given subject. This combined data matrix, ( ) ∈ ℝ × , is then represented by ( ) , ( ) and ( ) . To approximate this with a low rank SVD, we simply only retain the singular vectors associated with the top singular values. For example, assuming > and ignoring columns associated with singular values equal to zero, ( ) ∈ ℝ × is replaced bŷ( ) ∈ ℝ × . Finally, we can partition the temporal singular vectors, according to the order the individual runs were concatenated, in order to reconstruct the data from each run individually, or in other words,̂( ) ∈ ℝ × is decomposed into a set of̂( ) ∈ ℝ × . In summary, each data matrix, ( ) , has three approximating matrices, namelŷ ( ) ∈ ℝ × ,̂( ) ∈ ℝ × and̂( ) ∈ ℝ × . The last thing we do is to combine these three matrices into two matrices. This simply saves some computation each time we need to calculate any expectations involving the data. The final form for the approximate data is therefore We can simply substitute this approximate expression for ( ) any time we need access to the data in the inference procedure, and this has the added bonus of being computationally, as well as space, efficient. However, we explicitly calculate, and cache, the overall data variance from the full data, rather than ignoring the contribution from the subspace of discarded singular values 14 . This means that the estimate for the noise precision, ( ) , will be comparable whether or not we choose to utilise this low-rank approximation, or indeed across different values of .
We now have an explicit method for reducing large data-sets to a more manageable size. However, there is one final complication: computationally, calculating the SVD of every ( ) actually turns out to be prohibitively expensive in most cases. In order to circumvent this, we utilise the fact that we are explicitly looking for a low-rank approximation and implement an extremely efficient randomised algorithm to directly calculate the truncated SVD. This approach is described in the excellent review by Halko et al. [2011].

D Degrees of freedom correction
fMRI data has an inherent spatial smoothness-such that there are non-trivial spatial autocorrelations in the noise processes-which is amplified by the spatial smoothing that is a standard pre-processing step for most analyses. As discussed earlier, this is not acknowledged in our specification of our model of the noise process. In essence, this means that the model assumes that there are more independent spatial measurements than actually exist.
Fortunately, as Groves et al. [2011] discuss, there is a simple way to mitigate some of the effects of this within the Bayesian framework. Intuitively, if we have smoothed the data then we should be able to downsample it without loss of information. At some stage, this would result in the noise becoming genuinely spatially independent again, thereby satisfying the assumptions of the generative model. However, this presents several practical problems, so rather than actually downsample the data, we simply downweight the spatial information by a factor . This represents the proportion of voxels that would be retained if we were to optimally downsample. 'This is analogous to fixing that only a random fraction of the data points will be kept, but at each stage averaging over all possible choices of decimated voxels ' [ibid.]. While this approach still does not explicitly acknowledge the relationship between noise in nearby voxels, it does counter most of the deleterious effects of this model misspecification, especially when combined with the models for noise in the subject-specific spatial maps and time courses. The main advantage of this approach, compared to a more formal model for smoothness, is that it remains particularly computationally efficient.

bioRxiv E Initialisation
With a model of this complexity, it is important that the algorithm is appropriately initialised. By doing so, we can improve reliability and computational stability whilst reducing the computational time required for convergence. Our approach is to compute a consensus group-level set of modes, and use these to initialise the full model.
To do this, we mimic the temporal concatenation approach employed by most existing algorithms and compute a consensus set of spatial singular vectors (this can be done even more efficiently if we have already utilised the data reduction technique described previously [Calhoun et al. 2001]) using another randomised SVD algorithm that streams over the data. These singular vectors are reweighted via an adjusted set of singular values. More specifically, we use the properties of the Marchenko-Pastur distribution to find the noise level that ensures the SNR at the group-level decomposition is similar to the SNR at the subject level. We then run a Bayesian version of spatial ICA-with the spatial priors set to mimic the group-level priors of the full model-to generate the group-level modes. The SNR recalibration ensures we do not get over-splitting of the modes at this stage. We can then propagate this set of group-level modes through the rest of the algorithm, thereby ensuring all parameters are initialised with plausible values.

F Simulations
The spatial model consists of two levels: parcels and modes. We simulate 100 spatially contiguous parcels within a one-dimensional space comprising 10,000 voxels. We then apply a random diffeomorphic warp to each subject separately as a model for residual misalignments after registration. We then simulate a set of 15 modes consisting of blocks of spatially adjacent parcels. There is variability in the mode weights over subjects, and we introduce overlap such that, on average, each voxel is a member of 1.4 modes.
In the temporal domain we simulate a set of sparse, correlated 'neural' timecourses for each of the two runs per subject. There is variability in the betweenmode correlation structure at the run and subject level. These are then convolved with a random draw from the FLOBS basis of haemodynamic response functions [Woolrich et al. 2004], which introduces variability over subjects and space. This results in 500 timepoints at a TR of 2.0 s.
The spatial maps and timecourses are combined via the outer product model, and a nonlinear saturation is applied such that the highest amplitude moments of instantaneous voxelwise activity are reduced. Finally, random noise is added with a degree of spatiotemporal smoothness such that the overall SNR (expressed in terms of variance) is 0.1. For the simulations presented in the Supplementary Material, we also add some structured, subject-specific noise components, again using an outer-product model. These can either be spatially specific or global, and are designed to contribute a similar amount of variance per component as the individual signal modes.
To allow inference algorithms to model the aforementioned artefacts, 18 modes are inferred. After inference, modes are paired to the ground truth based on the similarity between both the spatial maps and timecourses, averaged over runs and subjects. The full set of performance metrics shown in the Supplementary Material is then calculated.

G Human Connectome Project data and analyses
For the HCP analyses, all data was from the 1,200 Subjects Data Release: humanconnectome. org/study/hcp-young-adult/document/1200-subjects-data-release. We used the 1,003 subjects for whom there was full behavioural, structural and rfMRI data (i.e. 4 runs, each of 1,200 volumes). All analyses are of the MSMAll and FIX cleaned data (i.e. rfMRI_REST1_LR_Atlas_MSMAll_hp2000_clean.dtseries.nii etc.).
The amplitudes and netmats were estimated from the time courses released by the HCP: the amplitudes were taken as the standard deviations, while the netmats were the partial correlation matrices. Tikhonov regularisation was used when calculating the inverse of the full correlation matrices, with = 0.1 . Heritability was estimated via Falconer's formula, 2 = 2( − ) [Falconer 1960]. We calculate the correlations, and , between the voxelwise spatial map weights. In other words, for each subject and each voxel we extract a length vector of weights: ( ) using the PFM notation, and compute the correlation between these for every pair of subjects.
For the CCA analyses, we used the full set of restricted information released by the HCP. We first removed all variables relating to study completion or quality control. The structural variables were all the remaining variables in the Free-Surfer category; all others were taken as behavioural. To preprocess the behavioural variables, we first removed any variables that were either more than 20 % NaN, or those for which more than 95 % of subjects had exactly the same entries. We then imputed any missing values using the SoftImpute method [Mazumder et al. 2010] as implemented in the fancyimpute Python package (github.com/ iskandr/fancyimpute). The following were regressed out as confounds (one-hot encoded where necessary) in all subsequent analyses: Release, Acquisition, fMRI_3T_ReconVrs, rfMRI_motion, Age, Gender, Race, Ethnicity, Handedness, Height, Weight, BMI, BPSystolic, BPDiastolic, Hematocrit_1, Hematocrit_2, FS_IntraCranial_Vol, FS_BrainSeg_Vol. A detailed description of all variables can be found at wiki.humanconnectome.org/display/PublicData/HCP+Data+Dictionary+ Public-+Updated+for+the+1200+Subject+Release.
For the CCA, all groups of variables were normalised and then reduced to their top 25 components via the SVD, before a CCA was run on every pair of variable groups. The RV coefficient was then calculated between the top 10 paired components from each CCA.