Accurate predictions of individual differences in task-evoked brain activity from resting-state fMRI using a sparse ensemble learner

Highlights
• We propose an approach to accurately predict individual variability of task-evoked brain activity.
• Prediction accuracy is on par with repeat task-fMRI (tfMRI) scans, on both the Human Connectome Project and UK Biobank datasets.
• The shape and overall intensity of individual tfMRI activations can be modelled separately and explicitly.
• Training on residuals improves prediction of individual-unique features in tfMRI activations.


Introduction
Studying individual differences in brain activity and how they relate to cognitive and genetic traits is an important area of research in basic and clinical neuroscience. Traditionally, functional Magnetic Resonance Imaging (fMRI) analysis has primarily been concerned with group-average inference. While averaging data across individuals substantially improves the signal-to-noise ratio (SNR) and has proved fruitful in identifying patterns common across subjects, this approach treats unexplained individual variation as noise, discarding attributes of brain activity unique to a particular subject. Individual variations in neuroanatomy or functional activity often carry valuable information. For example, if a small number of subjects in a large cohort have a rare disease, indiscriminate data reduction prior to analysis will very likely obscure this information.
The rapid development of cutting-edge neuroimaging techniques in recent decades has led to substantial improvements in the reliability and validity of blood-oxygen-level-dependent (BOLD) measurements, providing an unprecedented opportunity to investigate individualised patterns of brain activity. Moreover, emerging "big data" projects, such as the Human Connectome Project (Van Essen et al., 2013) and UK Biobank (Sudlow et al., 2015), have collected multi-modal neuroimaging data on very large samples, enabling researchers to examine individual variations in neuroanatomy and functional activity with enhanced statistical power. Among previous fMRI studies of individual variability, an active line of research focuses on understanding how individual brains vary in response to external cognitive tasks. Following the work of Cole et al. (2016) and Tavor et al. (2016), a number of studies (Cohen et al., 2020; Dohmatob et al., 2021; Ellis and Aizenberg, 2020; Jones et al., 2017) have since shown that spatial patterns of task-evoked activation form a stable trait marker, encoded in resting-state brain activity, i.e., in the absence of any explicit task. In contrast to previous studies that mostly rely on correlation analysis (in-sample inference) to investigate individual differences, these works adopt predictive frameworks that allow for out-of-sample inference, greatly improving the generalisability of such investigations.
Why is task-free fMRI predictive of task-evoked activation? Previous studies have suggested that resting-state networks and task networks may share the same intrinsic architecture (Cole et al., 2014; Elliott et al., 2019; Krienen et al., 2014; Smith et al., 2009). A reasonable corollary is therefore that resting-state heterogeneity should be informative about variability in task-evoked brain activity. Typically, resting-state data are summarised as spatially continuous parcels distributed across the brain (Bellec et al., 2010; Calhoun et al., 2008; Craddock et al., 2012; Van Den Heuvel et al., 2008; Yeo et al., 2011). These spatial maps are often referred to as "functional modes", characterising functionally unified sub-processes underlying human cognition. Among the approaches for finding functional modes to predict task-fMRI, dual-regression (Filippini et al., 2009) is a widely-used algorithm, shown to predict individual idiosyncrasies in response profiles (Cohen et al., 2020; Dohmatob et al., 2021; Ngo et al., 2021; Tavor et al., 2016). Although these previous attempts have successfully characterised individual-unique patterns of task-evoked brain activity, a few limitations remain. For example, these approaches either focused on cortical regions (Ngo et al., 2021) or relied on predetermined brain parcellations to make predictions within small patches of the brain and concatenate the results (Cohen et al., 2020; Dohmatob et al., 2021). Here, we are interested in activations across the whole brain, not limited to the cortex. Furthermore, our model uses global features, because parcellating the brain into small patches a priori may introduce more free parameters and increase the risk of over-fitting. More importantly, these approaches did not attempt to explicitly model cross-subject variability of the rest and task states per se, and thus may be sub-optimal for capturing cross-subject variations. In contrast, Ngo et al. (2021) introduced a contrastive loss in combination with the common loss to maximise inter-individual differences in cortical regions. However, in practice, such loss functions are often non-convex and may have complicated behaviours (e.g., multiple local minima) that render optimisation difficult (Boyd et al., 2004). To fully account for inter-individual variations, an alternative is to explicitly train on residualised data, i.e., residuals from which group-average information has been regressed out. Data obtained in this way have minimal shared variance with the group-level information, and thus serve as a cleaner description of individual-level differences.
Accurate prediction of individual differences in task-fMRI response using resting-state fMRI has a number of potential applications. In a clinical setting, considerable variability across patients makes it difficult for surgeons to rely on anatomy (or population-averaged task activation maps) when locating eloquent surgical targets. Task-fMRI can be used to localise surgical targets more accurately for individual patients, often by asking them to perform a specific cognitive or sensory-motor task. In practice, however, task-fMRI localisers are often limited by measurement noise and by patients' poor performance (or inability to perform the task at all). As a stable trait marker, resting-state fMRI has proved a good alternative to task-fMRI localisers, without the need for task execution prior to an operation. Accurate prediction of task-fMRI localisers from resting-state fMRI could therefore greatly benefit surgical targeting. Furthermore, a better rest-predict-task model may improve our understanding of the mechanisms underlying individual differences in resting-state and task-fMRI.
Here we propose a framework that explicitly models individual variations in task-evoked brain activity using resting-state variability, the latter profiled by a set of common spatial modes derived from a recently developed technique, Stochastic Probabilistic Functional Modes (sPROFUMO). We show that, consistent with previous studies (Farahibozorg et al., 2021; Harrison et al., 2020), sPROFUMO provides better sets of "bases" (later referred to as PFMs) for reconstructing the variations in task-evoked activation patterns than the widely used dual-regression. Additionally, we show that an ensemble learner combining global and local bases has improved capacity not only for reproducing typical activation patterns but also for preserving patterns unique to individuals. We demonstrate that modelling individual-level task contrast maps comprises modelling two separate sources of variability: the shape of the activations and the overall activation strength. Considering these two aspects separately in task prediction is at least as effective as, and arguably preferable to, simply modelling the original task contrast maps. Furthermore, the proposed model can recapitulate the spatial patterns of inter-individual variability, recovering regions that are more variable at the group level. The model achieves state-of-the-art prediction accuracy on both datasets, on par with task test-retest reliability. These results demonstrate the potential of resting-state features to reproduce task-fMRI features and to serve as a supplement to task localisers in pre-surgical planning.

UK Biobank data
UK Biobank (UKB) is a large national project collecting a wide range of health-related measures for over 500,000 subjects, initially aged between 40 and 69. We used the resting-state and task functional MRI data from a total of 17,560 subjects. The acquisition parameters and processing details can be found in Miller et al. (2016) and Alfaro-Almagro et al. (2018). Briefly, all resting-state fMRI scans were acquired on identical scanners (3T Siemens Skyra) with a TR of 735 ms, for a total of 490 time points per individual. After the initial preprocessing, the data were ICA-FIX cleaned to remove structured artefacts (Salimi-Khorshidi et al., 2014), and then registered to the standard MNI space. Next, each individual's resting-state 4D time series were further spatially smoothed with a Gaussian kernel of sigma 3 mm. The task used is the Hariri faces/shapes "emotion" task (Hariri et al., 2002), scanned and processed under the same protocols as the resting-state data (except that the task-fMRI data were not ICA-denoised). Individual as well as group-average activation z-statistic maps of three contrasts (faces, shapes, and faces-shapes) were estimated from the task fMRI scans using FEAT (Woolrich et al., 2001; 2004). Additionally, 473 subjects in this 17,560-subject subset received a second scan (mean test-retest interval 2.25 years, std 0.12). These repeat scans provided test-retest reliability scores as a benchmark for our model performance.
HCP data

Details on the acquisition protocols and processing pipelines can be found in Van Essen et al. (2013), Glasser et al. (2013), and Robinson et al. (2014). Resting-state and task fMRI data from 991 subjects, aged 22 to 35 years, were used in the analysis. Each individual had four runs of resting-state scans with a TR of 0.72 s, for a total of 1200 time points per run. The data were ICA-FIX denoised to automatically remove the effect of structured artefacts, then resampled onto the "32k_fs_LR" grayordinates space and minimally smoothed by 2 mm FWHM. All subjects were MSMAll-registered to improve functional and structural alignment (Robinson et al., 2014). To further increase the signal-to-noise ratio, an additional smoothing of 4 mm FWHM was applied to the MSMAll-registered data (with subcortical structures smoothed within parcel boundaries, and cortical data smoothed in 2D on the surface) using the Connectome Workbench ( https://www.humanconnectome.org/software/connectomeworkbench ). The task fMRI scans were acquired and pre-processed in the same way (though without FIX). We used the MSMAll-registered individual and group-average contrast maps with 4 mm FWHM smoothing in the analysis, including 47 contrasts across seven task domains.
As with the UKB dataset, we used the HCP repeat scans as the reliability benchmark for the predictions. Among the 991 subjects, 43 received a second scan under the same 3T imaging and behaviour protocols, with test-retest intervals ranging from 18 to 328 days (mean 134.78, std 62.49).

Generation of resting-state functional modes
We used resting-state functional modes to predict individual task-fMRI. Functional modes are typically modelled as parcel-like spatial configurations of unified functional processes distributed across the brain, each characterised by a summary time course that captures mode activity over time. Here we explored two approaches for generating individual resting-state modes: group-ICA followed by Dual Regression (DR-ICA), and Stochastic Probabilistic Functional Modes (sPROFUMO). DR-ICA is a conventional group-average algorithm that estimates individual "versions" of group-level spatial configurations, using a set of common spatial modes as templates (Nickerson et al., 2017). In DR-ICA, group-PCA was carried out on each dataset (UKB and HCP) by MELODIC's Incremental Group-PCA on the resting-state time series of all subjects (temporally demeaned and variance normalised), producing 1200 weighted spatial eigenmaps for UKB and 4500 for HCP. These eigenmaps were subsequently fed into ICA using FSL's MELODIC tool to generate group-ICA spatial maps at multiple ICA-dimensions (i.e., numbers of distinct ICA components). To obtain dual-regression maps for a specific subject at a given ICA-dimension $d$, we first regressed the corresponding $d$-dimensional group-ICA spatial maps into the individual 4D time-series data, yielding a set of $d$ time courses per subject. The resulting time courses were subsequently regressed into the same 4D time-series, generating $d$ dual-regression spatial maps for each subject. However, a major limitation of DR-ICA is that it only allows a unidirectional flow of information from group to individuals, i.e., the estimated individual modes cannot in turn drive the refinement of the group-average modes, and it may have limited ability to cope with individual deviations from the group average (Bijsterbosch et al., 2019).
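The two regression stages of dual regression can be sketched in a few lines of NumPy. This is an illustrative simplification (ordinary least squares on unstandardised data, hypothetical variable names), not FSL's exact implementation:

```python
import numpy as np

def dual_regression(group_maps, ts):
    """Two-stage dual regression (illustrative sketch, not FSL's implementation).

    group_maps : (V, M) group-level spatial maps (e.g., group-ICA components)
    ts         : (T, V) one subject's preprocessed resting-state time series
    Returns (T, M) subject time courses and (V, M) subject spatial maps.
    """
    # Stage 1: spatial regression of the group maps into each time point
    # of the 4D data gives one summary time course per mode.
    timecourses = ts @ np.linalg.pinv(group_maps).T        # (T, M)
    # Stage 2: temporal regression of those time courses back into the
    # same data gives subject-specific spatial maps.
    subject_maps = (np.linalg.pinv(timecourses) @ ts).T    # (V, M)
    return timecourses, subject_maps
```

With the subject maps in hand, residualisation against the group maps (described below) yields the resting-state variation maps used as prediction features.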
A recently developed technique, sPROFUMO, uses a Bayesian model that simultaneously estimates functional modes at both the group and individual level, and is scalable to large datasets (Farahibozorg et al., 2021). In sPROFUMO, individual resting-state time series are factorised into a set of spatial modes and their summary time courses (one per mode), together with the time-course amplitudes. The group-level parameters constrain the estimation of (the posteriors over) the individual-level parameters, and the posterior evidence is accumulated across individuals to, in turn, infer the group-level parameters. This bidirectional information flow between the group and individuals is designed to improve subject-specific spatial alignment. Below, we refer to the resting-state feature maps as either DR-ICA maps or PFMs, depending on the approach used to derive them.

Residualisation of the resting-state and task contrast maps
Our aim is to derive a model that can predict task activation in individuals given their resting-state modes. One of the innovations in this paper is to explicitly capture individual variations in the model. We propose that training and evaluating the model on residualised data (i.e., data and features from which the group-averaged maps have been regressed out) is of more value than training on the original resting-state and task contrast spatial maps.
To understand this, consider an individual task contrast map. It can be written as the sum of the group-average activation map (scaled by some factor) and a spatial residual map specific to the given individual. The resting-state feature maps can be similarly decomposed. Thus, the mapping between resting-state features and target maps contains both desirable and undesirable components, in the sense that we do not want to optimise, e.g., for resting-state residual features to map onto the group-average activation map. Residualisation separates the group-average components from the residual components, so that the two desirable mappings are learned separately rather than jointly (Fig. 1a). Furthermore, using the residualised data has an added benefit for model evaluation. The test-retest reliability of the variability (with the group-mean effects removed) between different scanning sessions is an estimate of the noise ceiling, which theoretically bounds the performance a model can achieve (Lage-Castellanos et al., 2019). Consider $g + s + \epsilon_1$ and $g + s + \epsilon_2$ as the activation maps measured at the first and the repeat session respectively, where $g$ is the group-average activation, $s$ the individual-unique features, and $\epsilon_1, \epsilon_2$ the measurement noise (and session variability) of the two sessions. The test-retest reliability of the residuals $s + \epsilon_1$ and $s + \epsilon_2$ is $\mathrm{corr}(s + \epsilon_1, s + \epsilon_2) \approx \sigma_s^2 / (\sigma_s^2 + \sigma_\epsilon^2)$, where $\sigma_s$ corresponds to the variability of the signal and $\sigma_\epsilon$ to the variability due to the measurement noise. Suppose the predicted activation is $\hat{g} + \hat{s}$; the prediction accuracy of the residuals, $\mathrm{corr}(s + \epsilon_1, \hat{s})$, can similarly be written as $\mathrm{cov}(s + \epsilon_1, \hat{s}) / \big(\mathrm{std}(\hat{s})\sqrt{\sigma_s^2 + \sigma_\epsilon^2}\big)$. It is easy to see that when the prediction of $s$ is perfect (i.e., $\hat{s} = s$), the accuracy, $\sigma_s / \sqrt{\sigma_s^2 + \sigma_\epsilon^2}$, can exceed the test-retest reliability due to this denoising effect.
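This denoising effect can be checked numerically. The toy simulation below uses assumed values $\sigma_s = \sigma_\epsilon = 1$ (so the analytic noise ceiling is 0.5) and shows that a perfect, noise-free prediction of $s$ scores above the test-retest reliability of the residuals:

```python
import numpy as np

rng = np.random.default_rng(42)
V = 100_000                       # number of "voxels"
sigma_s, sigma_e = 1.0, 1.0       # assumed signal / noise standard deviations

s = rng.normal(0.0, sigma_s, V)   # individual-unique features
e1 = rng.normal(0.0, sigma_e, V)  # session-1 noise
e2 = rng.normal(0.0, sigma_e, V)  # session-2 noise

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

retest = corr(s + e1, s + e2)     # test-retest reliability of the residuals
ceiling = sigma_s**2 / (sigma_s**2 + sigma_e**2)   # analytic approximation (0.5)
perfect = corr(s + e1, s)         # accuracy of a perfect, noise-free prediction
# perfect converges to sigma_s / sqrt(sigma_s**2 + sigma_e**2) ~ 0.71 > ceiling
```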
Therefore, by comparing prediction accuracy against this upper bound, we can assess whether the model fundamentally captures the individual differences in activation. If we instead used the un-residualised data to evaluate accuracy and reliability, the bias due to the group average would make a straightforward comparison between test-retest reliability and prediction accuracy difficult. Hence, we built the model entirely on residualised data. The residualised resting-state functional modes and the residualised task contrast maps are also referred to as "resting-state variation maps" and "task variation maps" in the remainder of the paper. To residualise the resting-state data, each of the group-average (across training subjects) ICA spatial maps was regressed out from the corresponding individual DR-ICA maps for all subjects (i.e., a one-variable linear spatial regression per subject per dual-regression map); similarly, each sPROFUMO group-level spatial map was regressed out from the same mode's individual-level sPROFUMO spatial maps (PFMs). These residualised spatial maps represent individual variations in resting-state activity, serving as features to predict individual variations in task-fMRI.
The task activation maps were residualised similarly for each individual, giving the task variation maps. For a given task contrast, the group-average activation map was regressed out from the individual contrast maps (i.e., a simple linear regression per subject per contrast, with the group-average activation map as the regressor). These task variation maps describe individual differences in task-evoked brain activity that deviate from the typical activation patterns. The mapping between the rest and task states is therefore based entirely on the variations rather than on the original resting-state and task-evoked activity. See Fig. 1b for an illustration of residualisation.
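A minimal sketch of this residualisation step, assuming maps are stored as subjects-by-voxels arrays; whether the regression includes an intercept (implemented here by demeaning) is our assumption rather than a detail stated in the text:

```python
import numpy as np

def residualise(maps, group_map):
    """Regress a group-average map out of each individual map.

    maps      : (N, V) individual maps, one row per subject
    group_map : (V,) group-average map (the single spatial regressor)
    Returns the (N, V) residual "variation maps" and the (N,) per-subject
    betas, which serve as each individual's overall "amplitude".
    """
    g = group_map - group_map.mean()               # demeaning = intercept term
    X = maps - maps.mean(axis=1, keepdims=True)
    betas = X @ g / (g @ g)                        # one beta per subject
    residuals = X - np.outer(betas, g)             # group effect removed
    return residuals, betas
```

The returned betas are exactly the "amplitude" parameters recorded during residualisation, which the amplitude model below seeks to predict.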
Finally, to compare this residualised model with the model trained on the original resting-state and task contrast maps (i.e., un-residualised), the task activations are predicted as a combination of the modelled variations and the average activation patterns. For both the rest and task data, we record the regression parameters obtained when regressing out the group-mean maps; these measures of overall "amplitude" are used later in the work, as described below, and can of course be multiplied by the group-mean maps to add the group-mean contribution back in where desired.

The ensemble learner
Our overall ensemble approach combines two separate models, "baseline " and "sparse ". We start by describing these two individual models, and then go on to describe the ensemble method.

The baseline model
The baseline model assumes that, for a given task contrast, the individual task variations (i.e., residualised activation maps) can be represented by a linear combination of the variations in the resting-state functional modes (i.e., residualised DR-ICA maps or PFMs). In this sense, the resting-state modes serve as a set of "bases" that span the task space. To obtain the reconstruction coefficients for each subject, we regressed the subject's resting-state bases into its spatial activation map (i.e., a multiple regression per subject per task contrast, with resting-state variation maps as regressors and the task variation map as response). More specifically, suppose the number of voxels is $V$ and each individual has $M$ bases (i.e., there are $M$ group-average ICA spatial maps); to find the reconstruction coefficients of a specific task contrast map $y_s$ (a $V \times 1$ vector) for a given subject $s$, we regress the given subject's resting-state variation maps, denoted by the $V \times M$ matrix $X_s = [x_{s,1}, x_{s,2}, \ldots, x_{s,M}]$, into the task variation map of this subject. As a standard linear regression problem, the reconstruction coefficients $\beta_s$ (an $M \times 1$ vector) of subject $s$ minimise the loss $\| y_s - X_s \beta_s \|_2^2$ for $s \in \mathcal{S}$, where $\mathcal{S}$ is the set of training subjects; the estimated reconstruction coefficients are given as $\hat{\beta}_s = (X_s^\top X_s)^{-1} X_s^\top y_s$. These coefficients were averaged across

Fig. 1. Model illustration. (a)
The residualisation separates the features and targets into two parts (the group-average maps and the subject-specific residuals). The mapping between rest and task for a subject contains desirable and undesirable mappings, in the sense that we do not want to optimise, e.g., for the individual subject's residual features to map onto the group maps. (b) Step 1: residualisation of the resting-state modes and task contrast maps. The residualised resting-state maps were further ICA-reduced as the input of the sparse model. (c) Step 2: training of the baseline and sparse models: per training subject, the baseline model yielded reconstruction coefficients (one coefficient per map), which were averaged across subjects as the final baseline coefficients (orange). Next, the resting-state residual maps and the task residual maps were concatenated across subjects and then reduced to lower dimensions via ICA. The sparse model was trained on the (ICA-reduced) across-subject residual matrices to give the sparse regression coefficients (green).
Step 3. the estimated baseline coefficients and sparse coefficients were applied to the training subjects to get the baseline-model-fitted (pink) and sparse-model-fitted (light blue) task residual maps. Next, for each voxel across subjects, we estimated the ensemble coefficients (yellow) by fitting another linear regression model with the baseline-model-fitted activations and the Lasso-fitted activations (in the given voxel) as the two regressors.
Step 4. These three sets of coefficients were finally applied to the test subjects to make new predictions (dark blue). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

the training subjects to give the final estimate of the reconstruction coefficients, i.e., $\hat{\beta} = \frac{1}{N} \sum_{s \in \mathcal{S}} \hat{\beta}_s$, where $N = |\mathcal{S}|$ is the number of training subjects.
To predict the activation map of an unseen subject $t$ in a test set $\mathcal{T}$, we applied the reconstruction coefficients averaged over the training set to the subject's own resting-state variation maps $X_t$, i.e., $\hat{y}_t = X_t \hat{\beta}$. Note that the baseline model differs from Tavor et al. (2016), Cohen et al. (2020), and Dohmatob et al. (2021) in two ways. First, their models were primarily local, i.e., one linear regression per brain region, rather than a single global linear regression for the whole brain. Second, with the group-average content regressed out from both the resting-state dual-regression maps and the task activations, our baseline model aims to establish linear relationships between the variations of the two states (relative to the group average) rather than between the original resting-state and task activity (which is possibly dominated by the group average).
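The baseline model amounts to a per-subject least-squares fit followed by coefficient averaging; a minimal sketch with hypothetical variable names (not the authors' code):

```python
import numpy as np

def fit_baseline(X_train, y_train):
    """Baseline model: one multiple regression per training subject,
    with the coefficients then averaged across subjects.

    X_train : list of (V, M) resting-state variation maps, one per subject
    y_train : list of (V,) task variation maps, one per subject
    Returns the (M,) averaged reconstruction coefficients.
    """
    betas = [np.linalg.lstsq(X, y, rcond=None)[0]
             for X, y in zip(X_train, y_train)]
    return np.mean(betas, axis=0)

def predict_baseline(beta, X_test):
    """Apply the averaged coefficients to an unseen subject's own bases."""
    return X_test @ beta
```

Note that although the coefficients are shared across subjects, the predictions differ between subjects because each prediction uses the subject's own resting-state variation maps.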

The sparse model
The baseline model has a few limitations. First, it has very few free parameters: one reconstruction coefficient per basis, pooled (averaged) across all subjects. Crucially, each feature (spatial map) is associated with a single regression coefficient, regardless of which part of the brain is being modelled. Second, the coefficients are estimated separately for each training subject, ignoring patterns of between-subject variation. We therefore re-formulated the problem in a more flexible, highly parameterised framework, referred to below as the sparse model, with appropriate regularisation to protect against excessive flexibility.
First, to create feature maps that contain information about cross-subject variability, each of the $M$ resting-state variation maps is concatenated across the set of training subjects $\mathcal{S}$, yielding one $V \times N$ resting-state matrix per group-average spatial map, for a total of $M$ such matrices. Denoting the $i$th matrix $\tilde{X}_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,N}]$, where $x_{i,s}$ is the $i$th resting-state variation map of subject $s$, we then dimensionality-reduce these matrices into a set of $K$ independent components using ICA: $\tilde{X}_i = S_i A_i$ for $i = 1, 2, \ldots, M$. In this decomposition, $A_i$ is a $K \times N$ mixing matrix and $S_i$ is a set of independent components representing common spatial variations across the training subjects. The mixing matrix of each ICA contains the "coordinates" of each individual in the resting-state space spanned by these common modes, providing a profile of the resting-state variability of each individual. The mixing matrices are concatenated to give a single reduced variation matrix $A_{\mathrm{rest}} = [A_1^\top, A_2^\top, \ldots, A_M^\top]^\top$ as the final predictors, where $A_{\mathrm{rest}}$ is an $MK \times N$ matrix. Likewise, the task variation maps (residualised activation maps) are concatenated across the training subjects, resulting in a $V \times N$ task variation matrix $Y = [y_1, y_2, \ldots, y_N]$ per contrast. The reduced resting-state variation matrix $A_{\mathrm{rest}}$ will be used to predict the concatenated task variation matrix. Under this formulation, the model has a large number of potential predictors. To prevent over-fitting, we enforce sparsity on the prediction regression coefficients, enabling selection of the subset of features most useful for prediction. In addition, given that predictions made on the original task matrix are not only computationally expensive but also involve many redundant and noisy features (which will likely compete with the "real" features during training), we also consider similarly decomposing the task matrix into a set of independent components, i.e., $Y = S_{\mathrm{task}} A_{\mathrm{task}}$, where $A_{\mathrm{task}}$ is the $K_{\mathrm{task}} \times N$ mixing matrix and $S_{\mathrm{task}}$ is the set of independent components.
Thus, both the feature matrix ($A_{\mathrm{rest}}$) and the regression target used in training ($A_{\mathrm{task}}$) are sparse, low-rank versions of their original counterparts (through ICA), and contain information on individual variations (through concatenation across subjects).
To find the sparse coefficients $W$, we solve the following regularised regression problem on the ICA-reduced task matrix:

$$\hat{W} = \arg\min_{W} \big\| A_{\mathrm{task}}^\top - A_{\mathrm{rest}}^\top W \big\|_F^2 + \sum_{k} \lambda_k \| w_k \|_1, \qquad (3)$$

or, alternatively, on the original task matrix:

$$\hat{W} = \arg\min_{W} \big\| Y^\top - A_{\mathrm{rest}}^\top W \big\|_F^2 + \sum_{k} \lambda_k \| w_k \|_1, \qquad (4)$$

where the Lasso penalty is applied univariately to the columns $w_k$ of $W$ (with different hyper-parameters $\lambda_k$ to allow differential amounts of regularisation), encouraging $W$ to be element-wise sparse. Note that an alternative way of introducing sparsity is to use an $\ell_{1,2}$ penalty on $W$ that enforces row-wise sparsity, as commonly applied in the grouped Lasso and the multivariate Lasso. That strategy would permit simultaneous use of all outputs to estimate a single regularisation parameter. However, it implicitly assumes that predictions of the different outputs (columns of $A_{\mathrm{task}}^\top$ or $Y^\top$) tend to require the same set of features. This assumption is not very appropriate here, as the different outputs tend to require heterogeneous feature selection. Other alternatives that use all outputs simultaneously include Partial Least Squares (PLS), Canonical Correlation Analysis (CCA), and their variants, as well as a range of multi-task learning approaches. Given that multi-task learning approaches with sparsity regularisation usually have more complex behaviour than the pure Lasso, we simply chose the Lasso penalty, which is also particularly easy to parallelise across the columns of $A_{\mathrm{task}}^\top$ or $Y^\top$ (i.e., across task voxels). To predict task variation maps for a set of unseen subjects, denoted by $\mathcal{T}$, we first need to translate the subjects' resting-state variations into the subspace spanned by the resting-state common modes (estimated from the training subjects). This is done by regressing each across-subject basis matrix onto the corresponding set of resting-state common modes. Again, suppose the $i$th (across-subject) resting-state variation matrix of the test subjects is denoted by the $V \times N'$ matrix $\tilde{X}_i^{\mathcal{T}}$, where $\mathcal{T}$ is the test set and $N' = |\mathcal{T}|$ is the number of test subjects.
We seek to solve the linear regression problem

$$\hat{A}_i^{\mathcal{T}} = \arg\min_{A} \big\| \tilde{X}_i^{\mathcal{T}} - S_i A \big\|_F^2,$$

where $\hat{A}_i^{\mathcal{T}}$ is the estimated "mixing matrix" of the $i$th resting-state variation matrix across the test subjects, and $S_i$ is the set of independent components calculated from the training subjects, for $i = 1, 2, \ldots, M$. Next, the sparse coefficients $\hat{W}$, estimated via (3) or (4), are applied to the concatenated variability profiles $\hat{A}_{\mathrm{rest}}^{\mathcal{T}}$ to give predictions for the set of unseen subjects: $\hat{Y}^{\mathcal{T}} = S_{\mathrm{task}} \hat{W}^\top \hat{A}_{\mathrm{rest}}^{\mathcal{T}}$ if $\hat{W}$ is solved via (3), or $\hat{Y}^{\mathcal{T}} = \hat{W}^\top \hat{A}_{\mathrm{rest}}^{\mathcal{T}}$ if $\hat{W}$ is solved via (4). This completes the specification of the sparse model. To summarise the approach: we concatenate training subjects to incorporate information on subject variability, we apply ICA to sparsify these data and aid fitting, and we employ further regularisation via the Lasso penalty on the regression coefficients. For UKB, we chose to reduce each across-subject resting-state matrix to 3000 independent components and the task matrix to 4000 independent components (reducing the resting-state and task matrices to 1000 independent components yields comparable results; see Figure S1). For HCP, in contrast, we chose to reduce each resting-state matrix to its full rank (i.e., the number of training subjects, around 900 in each fold) and kept the original spatial dimension of each task matrix (i.e., no ICA on the task matrix), which yielded the best performance on a left-out subset (see Figure S2).
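The core of the sparse model, one Lasso per output column, can be sketched with scikit-learn. The input matrices below stand for already-reduced profiles ($A_{\mathrm{rest}}^\top$ and $A_{\mathrm{task}}^\top$, subjects as rows), and a single fixed alpha stands in for the per-column hyper-parameters that are tuned by nested cross-validation in the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_sparse(A_rest, A_task, alpha=0.1):
    """Element-wise-sparse regression from resting-state "coordinates"
    to task "coordinates": one Lasso per output column (a plain loop
    here, though trivially parallelisable across columns).

    A_rest : (N, P) reduced resting-state profiles, subjects x features
    A_task : (N, Q) reduced task profiles, subjects x outputs
    alpha  : a single fixed penalty, standing in for per-column tuning
    Returns the (P, Q) sparse coefficient matrix W.
    """
    W = np.zeros((A_rest.shape[1], A_task.shape[1]))
    for q in range(A_task.shape[1]):
        model = Lasso(alpha=alpha, max_iter=5000).fit(A_rest, A_task[:, q])
        W[:, q] = model.coef_
    return W
```

Because each output column is fit independently, the per-column problems can be distributed across workers, which is what makes this formulation practical at whole-brain scale.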

The ensemble model
A single model usually represents a single hypothesis space for the prediction problem at hand. Although a single model may already contain a hypothesis space well-suited to a specific problem, combining multiple hypotheses allows for more flexible structures between predictors and response variables, and can potentially improve model performance (again, as long as over-fitting is avoided through correct use of, for example, cross-validation or left-out data). Here the two single models are tailored to different aspects of the underlying hypothesis that variations in resting-state activity can inform task variations. As mentioned above, the baseline model treats the resting-state variation maps as a set of "bases" that span the task variation map for each individual. The baseline model thus assumes that the mapping between resting-state and task space is within-subject, ignoring between-subject patterns of variation that might also be useful for prediction. The underlying hypothesis of the sparse model captures a different, though closely connected, aspect: with the variation maps reduced to the corresponding subspaces, the sparse model assumes that the "coordinates" of the subjects in resting-state space can be translated into their "coordinates" in task space.
Here we aggregate the predictions of the single models to give the final prediction for unseen subjects, using simple linear regression. Suppose $\hat{Y}^{\mathrm{baseline}}$ is the $N \times V$ matrix of baseline-model-fitted activations of the training subjects, and $\hat{Y}^{\mathrm{sparse}}$ the sparse-model-fitted maps. In particular, we use $\hat{y}^{\mathrm{baseline}}_{\cdot v}$ and $\hat{y}^{\mathrm{sparse}}_{\cdot v}$ to denote the fitted activations in voxel $v$ across subjects (i.e., each is an $N \times 1$ vector). At the ensemble stage, we find the coefficients for each constituent model by fitting, voxel by voxel, a simple linear regression of the "true" activations in voxel $v$ across the training subjects, denoted by $y_{\cdot v}$, on the two model fits:

$$y_{\cdot v} = b^{(1)}_v \hat{y}^{\mathrm{baseline}}_{\cdot v} + b^{(2)}_v \hat{y}^{\mathrm{sparse}}_{\cdot v} + \epsilon_v, \quad v = 1, 2, \ldots, V.$$

The two coefficients, $b^{(1)}_v$ and $b^{(2)}_v$, are then applied to the baseline-model-predicted and sparse-model-predicted maps to yield predictions of task variations for the unseen subjects. See Fig. 1 for an illustration of model training and Table S2 for a summary of the notation.
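The voxel-wise ensemble stage reduces to a two-regressor linear fit per voxel; a minimal sketch (fit without an intercept, which is our assumption):

```python
import numpy as np

def fit_ensemble(pred_baseline, pred_sparse, Y):
    """Per-voxel ensemble: regress the true activations on the two model
    fits, yielding two combination weights per voxel.

    pred_baseline, pred_sparse, Y : (N, V) subjects-by-voxels matrices
    Returns the (2, V) ensemble coefficients.
    """
    n_vox = Y.shape[1]
    B = np.zeros((2, n_vox))
    for v in range(n_vox):
        X = np.column_stack([pred_baseline[:, v], pred_sparse[:, v]])
        B[:, v] = np.linalg.lstsq(X, Y[:, v], rcond=None)[0]
    return B

def predict_ensemble(B, pred_baseline, pred_sparse):
    """Combine the two models' predictions for unseen subjects."""
    return B[0] * pred_baseline + B[1] * pred_sparse
```

Because the weights are estimated per voxel, the ensemble can lean on the baseline model in some regions and the sparse model in others, rather than committing to a single global mixing ratio.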
For UKB, the ensemble model and its constituent base models were trained and tested on 17,560 subjects (3-fold cross-validation); for HCP, the models were trained and tested on 991 subjects (10-fold cross-validation). The hyper-parameter of the ℓ1 penalty was optimised within each fold's training data via nested cross-validation (3-fold). The other free parameters (e.g., the number of resting-state bases and the number of independent components in the sparse model) were determined on a different subset of 4700 subjects for the UKB dataset (trained on 4000 and tested on 700). Due to the limited number of HCP subjects, we randomly selected 10% of the HCP subjects and investigated how the choice of these parameters would affect the model.
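The nested cross-validation of the ℓ1 penalty can be sketched roughly as follows. This is a simplified illustration only: we use a basic ISTA Lasso solver and a plain grid search, and the fold construction, candidate grid, and function names are our own assumptions rather than the authors' implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding operator (the Lasso proximal step)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=2000):
    """Lasso via iterative soft-thresholding (ISTA); alpha scales the l1 penalty."""
    n = X.shape[0]
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, alpha / L)
    return beta

def nested_cv_alpha(X, y, alphas, n_folds=3):
    """Pick the l1 penalty by K-fold cross-validation within the training data."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    errs = np.zeros(len(alphas))
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        for a, alpha in enumerate(alphas):
            beta = lasso_ista(X[train_idx], y[train_idx], alpha)
            errs[a] += np.mean((X[test_idx] @ beta - y[test_idx]) ** 2)
    return alphas[int(np.argmin(errs))]
```

In the nested scheme described above, this inner selection would run within each outer training fold, and the winning penalty would then be used to refit on that fold's full training data.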

The amplitude model
The amplitude model aims to predict the task activation amplitude for each individual (i.e., the beta coefficient from regression against the group-average activation map, as recorded during residualisation; one scalar value per subject per contrast) using the resting-state amplitudes (i.e., the beta coefficients from regression of resting maps against the group-mean dual-regression maps; one scalar value per subject per basis). There are a few reasons for incorporating a separate amplitude estimation. First, one important source of individual variability in task-evoked activity is the (overall) activation amplitude. Explicitly predicting this information may help capture a kind of individual variability that cannot be fully modelled by the aforementioned spatial models (indeed, we would not expect to capture this from the residualised predictions). Second, the final predictions for test subjects are ideally given as a combination of modelled residual variations and the typical activation patterns. In order to recover the activation maps from the variation maps for each individual, the group-average activations must be added back in appropriately, scaled by the activation amplitude of the specific individual. However, the activation maps of the test subjects are of course not seen during training. Therefore, we are not able to estimate the activation amplitude (i.e., betas from task residualisation) by simply regressing the group-average activations into the individual activations. As an alternative, the resting-state amplitudes may be predictive of the overall activation amplitude (Figures S3 and S4) and thus may serve as a substitute for this information. The surrogate activation amplitude was generated as follows.
Remembering that each subject has one resting-state amplitude value per group-average spatial map: for a given contrast, a multiple linear regression model with the activation amplitude as the response and the resting-state amplitudes as the predictors was trained across subjects (3-fold cross-validation on UKB; 10-fold cross-validation on HCP). These surrogate activation amplitudes are subsequently applied to the predicted variation maps as the new beta coefficients, such that the re-scaled group-average effects can be added back in accordingly.
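A minimal sketch of this surrogate-amplitude step, producing out-of-fold predictions from an ordinary least-squares fit. The fold splitting, intercept handling, and function name are illustrative assumptions, not the paper's code:

```python
import numpy as np

def surrogate_amplitudes(rest_amp, task_amp, n_folds=3):
    """Out-of-fold linear-regression predictions of per-subject task amplitude.

    rest_amp: (n_subjects, n_maps) resting-state amplitudes (one per basis map).
    task_amp: (n_subjects,) activation amplitudes for one contrast.
    """
    n = len(task_amp)
    folds = np.array_split(np.arange(n), n_folds)
    pred = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # OLS with intercept on the training fold
        X_tr = np.column_stack([np.ones(len(train_idx)), rest_amp[train_idx]])
        beta, *_ = np.linalg.lstsq(X_tr, task_amp[train_idx], rcond=None)
        X_te = np.column_stack([np.ones(len(test_idx)), rest_amp[test_idx]])
        pred[test_idx] = X_te @ beta
    return pred
```

The out-of-fold predictions can then scale the group-average map before it is added back to each subject's predicted variation map.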
The other hypothesis about the overall activation amplitude is that it serves as another important source of individual variability. To explore this possibility, we also consider incorporating the amplitude information into the ensemble stage, to test whether it can further improve model performance. Given that the resting-state amplitude values of the sets of dual-regression maps are correlated (across subjects), we reduce the amplitude features to a few principal components, the number of which is determined via cross-validation. These components are included in the ensemble model as additional predictors for each column (voxel) of the task matrix Y.
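The amplitude-reduction step could look roughly like the following SVD-based sketch. How the paper centres or scales the amplitude features is not specified, so the centring here, and the function name, are assumptions:

```python
import numpy as np

def amplitude_components(rest_amp, n_components):
    """Reduce correlated resting-state amplitudes to principal components.

    rest_amp: (n_subjects, n_maps) amplitude features.
    Returns (n_subjects, n_components) subject scores on the top components.
    """
    X = rest_amp - rest_amp.mean(axis=0)        # centre each amplitude feature
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T              # project onto leading PCs

# The component scores would then join the two base-model predictions as
# extra ensemble regressors for each voxel v (hypothetical variable names):
#   X_v = np.column_stack([y_base[:, v], y_sparse[:, v], amp_pcs])
```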

Measures of model performance
Assessment of model performance is primarily based on Pearson's correlations between predicted maps and the actual maps (in subjects left out of the training process). Apart from the standard MNI152 brain mask applied at the beginning of all analyses, we chose not to apply further thresholding to the resulting maps. Although further masking of the images may emphasise regions of greater interest, the choice of thresholds can have a complex impact on evaluation and requires caution.
For a given task contrast, the predicted maps are correlated with the actual maps for all subjects, yielding a "subject-by-subject" correlation matrix, where the entry in the ith row and jth column corresponds to the correlation between subject i's predicted map and subject j's actual map. The mean of the diagonal elements measures the overall prediction accuracy, i.e., how well the model can reproduce the spatial patterns of activation for each subject, averaged across subjects. However, this measure cannot fully quantify model performance, because the overall model accuracy can be boosted by simply reproducing the group-average activation, particularly when most subjects are "normal", having activation patterns close to the group average. Therefore, it is also important to make differentiated predictions, i.e., to capture atypical variations that deviate from the group-average activations. This necessitates measuring the extent to which, for a specific subject, the model can make predictions that are closer to the subject's own activation maps than to those of other subjects. This is of course particularly relevant when doing non-residualised prediction.
The new evaluation measure is calculated as follows: after the correlation matrix (between predicted maps and the actual maps for all subjects) is normalised via Fisher's transformation, for each subject we calculate the difference between two values: (i) the correlation between the subject's predicted map and the subject's actual map; and (ii) the mean of the correlations between the subject's predicted map and other subjects' actual maps. The difference between (i) and (ii) provides a quantitative evaluation of the model's capability to predict individual differences distinct from the group mean. In the following text, the first measure is referred to as "prediction accuracy", and the second one as "prediction discriminability".
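Both measures can be computed directly from the subject-by-subject correlation matrix. In this sketch, the array shapes, the clipping of perfect correlations before the Fisher transform, and the function name are our own assumptions:

```python
import numpy as np

def accuracy_and_discriminability(pred, actual):
    """Mean self-correlation and Fisher-z diagonal-minus-off-diagonal measure.

    pred, actual: (n_subjects, n_voxels) predicted and actual maps.
    """
    n = pred.shape[0]
    # rows are variables in corrcoef; the top-right block is pred-vs-actual
    R = np.corrcoef(pred, actual)[:n, n:]
    accuracy = np.diag(R).mean()                      # "prediction accuracy"
    Z = np.arctanh(np.clip(R, -0.999999, 0.999999))   # Fisher transformation
    diag = np.diag(Z)
    off = (Z.sum(axis=1) - diag) / (n - 1)            # mean corr. with others
    discriminability = (diag - off).mean()            # "prediction discriminability"
    return accuracy, discriminability
```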
Additionally, we calculated the between-subject standard deviation map of the actual task variations (as a measure of inter-individual voxelwise variability) and of the predicted variations (as a measure of predicted variability) for each contrast. We then correlated the predicted variability map against the actual variability map as a third measure of model performance. A higher correspondence between the two standard deviation maps indicates a better ability to reproduce the spatial pattern of between-subject variability.
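This third measure is simply a correlation between voxelwise standard-deviation maps; a small sketch (shapes and the function name are assumptions):

```python
import numpy as np

def variability_correspondence(pred, actual):
    """Correlate the between-subject std-dev map of predictions with that of the data.

    pred, actual: (n_subjects, n_voxels) maps for one contrast.
    """
    sd_pred = pred.std(axis=0)       # voxelwise std across subjects
    sd_actual = actual.std(axis=0)
    return np.corrcoef(sd_pred, sd_actual)[0, 1]
```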

The ensemble model outperforms its constituent single models
To compare DR-ICA maps with PFMs, we chose the optimal dimensionality for each method: DR-ICA25 and PFM50 for UKB, and DR-ICA50 and PFM150 for HCP. The fact that the optimal PFM dimensions were higher than those of DR-ICA suggests that the former yielded more reliable functional modes, particularly at higher dimensions (note, however, that PFMs consistently outperformed DR-ICA across all dimensions; see Figure S5). In the baseline model, overall, most variation maps contributed to the predictions (Figure S6), suggesting that these resting-state variation modes did capture a significant proportion of the variance in the task variation maps.
We found that the sPROFUMO modes had overall higher accuracy in predicting task variations than the DR-ICA maps, consistently across the baseline, sparse, and ensemble models (Fig. 2a and b; statistical tests in Table S3, all significant after Bonferroni correction). Compared with predictions based on DR-ICA, the biggest improvement introduced by sPROFUMO modes was evident in the baseline model, suggesting that sPROFUMO provides a fundamentally better set of resting-state bases for reconstructing task variations than DR-ICA. This corresponds with previous evidence that sPROFUMO better accounts for cross-subject misalignment and accommodates higher predictive power for population heterogeneity (Farahibozorg et al., 2021). Additionally, sPROFUMO modes also exhibited higher prediction accuracy for the sparse and ensemble models. Interestingly, the baseline and sparse models based on DR-ICA performed very differently on the two datasets. For HCP, the baseline model yielded higher prediction accuracy than the sparse model (Fig. 2b, blue and green), while for UKB, this relationship was entirely reversed (Fig. 2a, blue and green). However, introducing sPROFUMO modes as bases substantially enhanced the prediction accuracy of the baseline model for UKB, such that it tended to outperform the sparse model. Here we offer a possible explanation for this discrepancy. The two single models are tailored to different data scenarios. If the resting-state modes form a set of bases that fundamentally has the ability to predict the task maps, then the baseline model should suffice, i.e., we do not need the sparse model to emphasise specific spatial features. In practice, however, DR-ICA maps are not perfect individual "versions" of the group-average modes, containing many noisy voxels irrelevant to task-fMRI prediction. A major difference between the two datasets is that the UKB data we used to train the model is volumetric while the HCP data is in grayordinates. As a consequence, there is more functional spatial variability (misalignment) in the UKB data (Coalson et al., 2018) and hence more "errors" in its individual dual-regression maps. In addition, HCP data is MSMAll-aligned and UKB is not. On the other hand, sPROFUMO better accounts for cross-subject misalignment and allows a more fine-grained delineation of individual differences in resting-state data, and thus has an improved ability to capture variations in task data. Furthermore, due to the shorter scanning sessions, the resting-state and task-fMRI scans in UKB are noisier than in HCP, increasing the benefit of identifying which voxels/spatial features are more desirable in the modelling. Hence, UKB requires greater spatial modelling complexity as well as greater spatial smoothing, provided by the sparse model (note that conducting ICA on the resting-state and task matrices across subjects in the sparse model may serve as a kind of de-noising).

Fig. 2 (panels c–f): (c) Correlations between the predicted and the actual inter-individual variability maps, calculated across 17,560 UKB subjects; overall, the ensemble trained on PFMs yielded the highest correspondence with the inter-individual variability. (d) Equivalent plots across 991 HCP subjects; see Figure S7 for all HCP contrasts. The statistical tests for (c) and (d) can be found in Table S4. (e) The actual (first row) and the predicted (second row) inter-individual variability across 17,560 UKB subjects for the three contrasts, shown volumetrically; warmer colours indicate higher variability, with the maximum normalised to 1. (f) The actual (first row) and the predicted (second row) inter-individual variability across 991 HCP subjects for the seven representative contrasts, shown on the cortical surface. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
For both datasets, overall, the ensemble model outperformed its constituent single models. Remember that the task variations are the residuals of regressing the group-average activations into the individuals, and thus are orthogonal to the group mean by design. This also implies that these task variation maps have minimal overall cross-subject similarity, i.e., the spatial correlations between pairs of subjects fluctuate around zero. Therefore, the plots of prediction accuracy and of discriminability will look almost identical, because the predicted maps will have near-zero correlations with the maps of the other subjects, i.e., the off-diagonals of the (subject-by-subject) correlation matrices (between the predicted maps and the actual maps) are all close to zero (Figures S9 and S10). The discrimination metric for prediction accuracy on the residuals can be found in Figure S8.
In addition to predicting the individual variations in task activity, all three models could reproduce the spatial pattern of inter-individual variability (standard deviation maps across subjects) for both datasets (Fig. 2c and d). Similar to the previous scenario, using sPROFUMO modes as bases improved the prediction of inter-individual variability for the baseline model on both datasets (Fig. 2c and d, blue), corroborating the conclusion that sPROFUMO better aligns the subjects, refines the spatial details of cross-subject heterogeneity, and thus provides a better set of bases for reconstructing task variation space. In terms of the sparse and ensemble models, DR-ICA and sPROFUMO yielded comparable correspondence with the true inter-individual variability.
Finally, we also calculated the subject identification accuracy, i.e., the probability that a subject's predicted map had the highest correlation with that subject's own residual map, for each task contrast. The subject identification accuracies are all close to 100% (Figures S9 and S10), suggesting that the predicted variability for a given subject corresponds better with the subject's own variability patterns than with those of other subjects. These results demonstrate that resting-state variations can capture well the inter-subject differences in task-evoked brain activity. The actual and predicted (via the ensemble model) inter-individual variability maps are shown in Fig. 2e and f. Regions of higher variability across subjects are those more involved in the corresponding task execution. For example, somato-sensory and motor regions are more variable across subjects in the motor contrasts; fronto-parietal regions exhibit higher variability in more cognitive contrasts; and the visual areas tend to be more variable in general, for all contrasts. In summary, all three models are able to capture individual-unique activation patterns that deviate from the typical activation patterns, as well as recapitulating the spatial pattern of inter-individual variability. In the subsequent analyses, we used PFM50 for UKB and PFM150 for HCP.
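Subject identification from the correlation matrix amounts to checking whether each row's maximum falls on the diagonal; a small sketch (the shapes and function name are assumptions):

```python
import numpy as np

def identification_accuracy(pred, actual):
    """Fraction of subjects whose predicted map correlates best with their own map.

    pred, actual: (n_subjects, n_voxels) maps for one contrast.
    """
    n = pred.shape[0]
    R = np.corrcoef(pred, actual)[:n, n:]   # pred-vs-actual correlation block
    return np.mean(R.argmax(axis=1) == np.arange(n))
```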

Training on the un-residualised data is suboptimal for capturing individual differences
The next question we asked is whether the model can recover individual idiosyncrasies in task-fMRI if trained on the un-residualised resting-state spatial modes and task activations, as opposed to the residualised data (i.e., variation maps). Having close-to-zero shared variance with the group average, the residuals more accurately profile the individual differences by design; we posit that training on residuals avoids contamination by group-level information and thus may facilitate capturing individual-unique patterns. Fairly comparing the two options requires recovering the actual task-evoked responses (as opposed to the residuals) from the predicted variations for each individual. To explore this, we generated the surrogate activation amplitude using the PFMs' amplitudes for each individual, then added the group-average activation map (scaled by the resting-state-predicted amplitude) back to the predicted variation maps. These predictions, with the group average added back in, were correlated against the actual (un-residualised) activations for all subjects, again yielding a subject-by-subject correlation matrix per contrast. We calculated the prediction accuracy and discriminability from these correlation matrices, and compared them with those of the model trained on the un-residualised PFMs and task activations.
Overall, both options manifested considerable predictive power for individual activations, as suggested by the overall accuracy and discriminability (Fig. 3, red and orange). Additionally, we found that although training on variations exhibited little improvement in the actual prediction accuracy (Fig. 3a and b), it tended to improve prediction discriminability (Fig. 3c and d). This suggests that it is more desirable to establish a mapping between the variations in rest and task data per se than to simply use the original data with group-average effects present. This is probably because residualisation orthogonalises the individual maps with respect to the group-average maps and prevents the dominance of the typical activation patterns. Furthermore, this shows that separating out the modelling of overall amplitude from (group-mean-removed) map variability, and then recombining these parts of the model later, is at least as effective as predicting raw task maps from raw resting maps. This is valuable, as it suggests that these different data aspects can indeed be considered separately. The subject identification accuracies based on residualised predictions (with group-average effects added back in for evaluation) are again close to 100% for most contrasts, as shown in Figures S11 and S12.
We also benchmarked our model against a previous GLM-based method (Tavor et al., 2016) using the same subjects. The Tavor method is based on multiple GLMs and is essentially very similar to the baseline model, except for a few differences: (1) instead of training a global GLM for the whole brain between the resting-state and the task maps (as in our baseline model), the Tavor model fits multiple "local" GLMs within each of a set of pre-determined parcels; (2) the features of the Tavor model are seed-based connectivity maps, while our baseline model uses the dual-regression maps (i.e., multiple regression against the many "seed" timeseries output by the first stage of dual-regression). The ensemble model, trained either on the un-residualised data or on the variation maps, yielded higher prediction accuracy than the Tavor method. On the UKB dataset, the ensemble model substantially improved prediction accuracy and discriminability; on the HCP dataset, the Tavor method and the ensemble model trained on variations manifested comparable discriminability, both superior to the ensemble model trained on un-residualised data (see Figure S13 for all HCP task contrasts; the statistical tests can be found in Table S5). We also compared our approach with other studies (Cohen et al., 2020; Dohmatob et al., 2021) that also utilised ensemble techniques. Cohen et al. (2020) showed that random-forest bootstrap aggregation enhanced prediction accuracy, benchmarked against the linear regression approach used in Tavor et al. (2016). Dohmatob et al. (2021) used random parcellations to improve prediction accuracy: instead of using a single parcellation scheme to predict local patches of activations (one ridge regression per patch, concatenated afterwards), they averaged the predictions based on random parcellations as the final prediction. We show that our ensemble approach outperforms the others, particularly on the UKB dataset (Figure S14).
Note that, among the HCP contrasts, motor tasks exhibited weak prediction discriminability. A possible explanation is that the individual response profiles to motor-related stimuli have little cross-subject variation, such that the model was not able to extract sufficient information to discriminate between subjects. The relatively lower prediction accuracy of motor tasks is, on the other hand, unexpected, especially considering the strong activations in cortical regions that are supposed to enable the model to learn the mapping between resting-state and motor tasks. Understanding this discrepancy between motor tasks and resting-state activity requires future investigation and would be important for understanding the ongoing interplay of resting-state networks in task execution.
The fact that the model trained on the variations per se (with an explicit and separate amplitude prediction) can better capture patterns unique to individuals than its un-residualised counterpart corroborates the assumption that, in addition to the spatial layout (shape) of activations, the overall activation intensity may also contribute to the variability of task-evoked brain activity. Following this, we also tested whether incorporating resting-state amplitude as additional predictors explicitly at the ensemble stage would further facilitate capturing individual-unique patterns for the un-residualised model. We found that, though having little effect on the actual prediction accuracy (Figure S15a), including the PFMs' amplitudes as explicit predictors (in addition to the other two predictors, the baseline-model-predicted and sparse-model-predicted values in the corresponding voxel) did further improve discriminability (Figure S15b), particularly on UKB. This improved discriminability for the un-residualised model (Figure S15b, gray bars), however, is still not as good as the discriminability of the residualised predictions (Figure S15e, orange bars). For the ensemble model trained on the residualised data, regressing out the group-average response "removes" the overall activation intensity relative to the group-average activations for each individual. Therefore, introducing resting-state amplitude to the residualised ensemble model should, in theory, have little effect on model performance. In practice, however, we found that incorporating resting-state amplitude as additional features at the ensemble stage also increased prediction discriminability for the residualised ensemble model (Figure S15e). There are a few possible explanations for this discrepancy.
One possible explanation is that the group-average activation patterns were not entirely removed, particularly for very atypical subjects, probably due to the GLM's sensitivity to outliers or noise in the fitting (e.g., related to regression dilution). In this sense, including resting-state amplitude as additional features accounted for the remnants of the amplitude information, particularly for those atypical subjects, and thus increased the overall prediction discriminability. Another possibility is that the overall activation intensity may still inform the (strength of the) variability of the shape of activations. This possibility is partially supported by the finding that including resting-state amplitude as additional features at the ensemble stage further improved the fit with the spatial pattern of inter-individual variability (Figures S15c and S15f). Note, however, that the resting-state amplitude is not expected to be a perfect surrogate for the task amplitude, as the R² between the actual and the predicted task amplitude is actually small (Figures S3 and S4).

Prediction accuracy paralleled test-retest reliability
To evaluate whether the predicted task contrast maps can reliably capture individual differences in task-evoked brain activity, we compared the accuracy of task predictions against the corresponding test-retest reliability, leveraging the repeat fMRI scans in the UKB and HCP datasets. The test-retest reliability of task-fMRI was defined as the spatial correlation between the first-visit and repeat-session task contrast maps, for each subject and each contrast. For both datasets, the PFM-predicted contrast maps yielded higher overall accuracy than the repeat scans, consistently across all task contrasts (Fig. 4b and c, light blue and light red), suggesting that resting-state-predicted activations can surpass task-fMRI test-retest reliability. This coincides with previous findings that resting-state features serve as a reliable trait marker and may even be more heritable than task-fMRI phenotypes (Winkler et al., 2010).

Fig. 4. Test-retest reliability of PFM-predicted task maps. (a) Illustration of how first- and second-visit accuracy and test-retest reliability were calculated for the residualised and "group-average-added-back-in" predictions. (b) and (c) Blue: prediction accuracy of first-visit contrast maps. Green: prediction accuracy of second-visit contrast maps. Red: test-retest reliability. Light colours denote the prediction accuracy of the un-residualised/group-average-back-in maps, dark colours the accuracy of the residualised maps. The last columns show all contrasts pooled together. Error bars show 95% CI of the means. (d) and (e) For both datasets, the PFM-predicted task amplitude for the first-visit scans gave higher accuracy than the repeat scans (blue vs red). On HCP, the predicted amplitude also gave higher accuracy for the second-visit task amplitude than the corresponding test-retest reliability (red vs green). However, on UKB, this accuracy was worse (green vs red in Fig. 4d), again suggesting that the UKB resting-state sessions are less reliable, possibly due to the much shorter scanning time. The statistical tests can be found in Tables S6 and S7. See Figures S18 and S19 for equivalent plots of (c) and (e) for all HCP contrasts. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Note that the accuracy of the PFM-predicted task maps, which is on par with the test-retest reliability, is not a result of over-fitting to the first-visit task-fMRI data. To illustrate this, we also correlated the predicted task maps against the second-visit task contrast maps as "second-visit prediction accuracy" (Fig. 4a). If the model were over-fit to the noise component, or simply reflected some analytic circularity in the first-visit sessions, this second-visit prediction accuracy should be much lower than the first-visit accuracy and far below the test-retest reliability benchmark. However, we found that the second-visit accuracies were very close to the first-visit accuracies (Fig. 4b and c, blue vs green). This suggests that the model trained on a single session generalises well to the repeat sessions and is not over-fitting to the first-visit sessions.
As mentioned in previous sections, predicting residuals is of more interest. The residualised task activation maps, with the group average removed, consist of two components that explain the across-session variability: measurement noise and true individual-unique features. Thus, the test-retest reliability (correlation between sessions) of the residuals measures the proportion of the true signal's variance relative to the noise variability, and is a fundamental criterion for model performance (Lage-Castellanos et al., 2019). A model that yields prediction accuracy comparable with this reliability is fundamentally capable of capturing the inter-individual differences in activation. On the HCP dataset, the prediction accuracy of residualised activations was close to the test-retest reliability of task residuals (Fig. 4c) for most contrasts, and even gave higher accuracy for a few contrasts (GAMBLING_REWARD, GAMBLING_PUNISHMENT, SOCIAL_MATCH-REL, etc.; see Figure S18). On the UKB dataset, however, the (residualised) repeat-session task-fMRI scans still yielded much higher accuracy than the PFM-predicted task variations (Fig. 4b), possibly because of the much shorter resting-state scanning sessions. The repeat scans also had higher prediction discriminability than the group-average-back-in predictions (Figure S17), which is unsurprising given the noise-ceiling effect.
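Test-retest reliability and prediction accuracy are both row-wise spatial correlations over per-subject maps; a small helper sketch (the map matrices `visit1_maps`, `visit2_maps`, and `predicted_maps` in the usage comment are hypothetical names):

```python
import numpy as np

def per_subject_correlation(maps_a, maps_b):
    """Row-wise spatial correlation between two sets of per-subject maps.

    maps_a, maps_b: (n_subjects, n_voxels) arrays; returns (n_subjects,) correlations.
    """
    a = maps_a - maps_a.mean(axis=1, keepdims=True)   # centre each subject's map
    b = maps_b - maps_b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return num / den

# test_retest = per_subject_correlation(visit1_maps, visit2_maps)
# accuracy    = per_subject_correlation(predicted_maps, visit1_maps)
```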
We also investigated the reliability of task activation amplitude predictions by comparing the accuracy of the predicted amplitude against the test-retest reliability of the activation amplitude. Similarly, the test-retest reliability of activation amplitude is defined as the correlation between first- and second-visit regression betas (i.e., the coefficients obtained by regressing the group-average task map out of the individual task contrast maps) across subjects. We tested whether resting-state-predicted amplitude is more robust than that measured directly in repeat-session task-fMRI scans.

Fig. 5. Predicted, actual, and group-average activations of 6 example UKB subjects. The predicted activations captured the atypical activations in individual subjects (with group-mean-related components added back in). The subjects ranked between the 50th and 75th percentiles according to their correlations with the corresponding group-average activations. See Figure S20 for plots of the predicted and the actual task variation maps of the same example subjects.

The PFM-predicted activation amplitude
(using the first-visit resting-state scans) indeed proved more reliable than the repeat task-fMRI scans in replicating the actual activation amplitude (Fig. 4d and e, blue vs red) for both datasets. Similarly, to exclude the possibility of over-fitting in the amplitude model, we also correlated this PFM-predicted amplitude against the second-visit actual amplitude (Fig. 4d and e, green). For HCP, the predicted amplitude still gave higher accuracy for the second-visit data than the test-retest reliability (Fig. 4e, green vs red). However, this no longer held for UKB (Fig. 4d, red vs green), possibly also due to the much shorter scanning time. Overall, these results suggest that resting-state data is potentially a more stable trait marker than task-fMRI features, although this also depends on the reliability of the resting-state scans. Figures 5 and 6 show the comparison between the predicted, actual, and group-average activations, volumetrically (for UKB) and on the surface (for HCP). It can be seen that the predicted activations provide a "smoothed" estimation of the individual activations, while preserving the unique patterns of individual subjects (for the actual and the predicted task variation maps of the same example subjects, see Figures S20 and S21; for comparisons between the un-residualised predictions and the residualised predictions with the group average added back in, see Figures S22 and S23).

Discussion
In this paper, we extended previous GLM-based approaches (Cohen et al., 2020; Dohmatob et al., 2021; Tavor et al., 2016) and proposed an ensemble learner to model individual variations in task activations on two large datasets, UKB and HCP. Enabled by a recently developed technique, sPROFUMO, we exploited the richness of individual variability in resting-state data to reproduce task-evoked activation patterns unique to individuals. We demonstrated that sPROFUMO can accommodate higher predictive power than DR-ICA, especially in terms of the overall capacity to reproduce between-subject differences. This added advantage of sPROFUMO arises from its enhanced ability to depict fine-grained resting-state variability in rich detail, owing to its bidirectional and hierarchical architecture between the group average and the individual, in contrast to unidirectional group-average algorithms (e.g., DR-ICA). Furthermore, we showed that modelling the individual activation profiles as a combination of the group average and predicted variations can be more desirable than simply modelling the raw task map, suggesting that two sources of task variability, shape and amplitude, factorise into different compartments and can be modelled separately. Characterising different aspects of task variability is important for understanding the sources of these cross-subject differences. Overall, resting-state functional modes serve as a set of bases that can not only sufficiently reconstruct individual task-fMRI space but also yield more reliable localisation of individual task-evoked response profiles.
Our ensemble framework consists of a baseline model and a sparse model, each tailored to a different scenario. In the baseline model, for each individual, the resting-state modes span the space of the task activation maps and thus, in theory, can reproduce task-fMRI by themselves. In practice, however, more spatial complexity is often required to select local features that are "cleaner" or of more interest. The sparse model largely accounts for this limitation. For example, the motor network in resting-state modes contains components that are often in sync with each other and are part of the same spatial basis. The baseline model cannot split them, while the sparse model may select the components more desirable for prediction. However, the sparse model has another caveat. Despite existing rescaling techniques (e.g., fitting another OLS on top of the selected features; introducing a re-scale factor), the Lasso penalty often introduces too much shrinkage, particularly when the prediction involves many candidate features.

Fig. 6. Predicted, actual, and group-average activations of example HCP subjects. The predicted activations captured the atypical activations in individual subjects; these subjects ranked in the lower 50th percentile according to their correlations with the corresponding group-average activations. See Figure S21 for plots of the actual and the predicted task variation maps of the same example subjects.

As a result, the
predicted response may become too biased towards zero thus degrading model discriminability. The ensemble model, by fitting another OLS on top of each voxel, de-biases the over-shrinkage of the sparse model.
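The de-biasing idea can be illustrated with a minimal, generic sketch: the Lasso selects a feature subset, then an OLS refit on that subset undoes the shrinkage. This is a relaxed-Lasso-style illustration on toy data, not the exact per-voxel ensemble used here:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_then_ols(X, y, alpha=0.1):
    # Step 1: Lasso for feature selection (coefficients shrunk to zero
    # are dropped). Step 2: OLS refit on the selected subset, which
    # removes the shrinkage bias on the surviving coefficients.
    sel = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    beta = np.zeros(X.shape[1])
    if sel.size:
        beta[sel], *_ = np.linalg.lstsq(X[:, sel], y, rcond=None)
    return beta

# Toy data: only features 2 and 7 carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
y = 3.0 * X[:, 2] + 3.0 * X[:, 7] + 0.1 * rng.standard_normal(200)
beta = lasso_then_ols(X, y)
```

The pure Lasso estimates of the two active coefficients are pulled towards zero by roughly the penalty strength; the OLS refit restores them to near their true values, which is the same bias-correction motive behind the per-voxel OLS in the ensemble.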
It is important to consider our approach in the context of other frameworks, for example, the activity flow approach used in this line of studies ( Cole et al., 2014 ), which also investigated the idea of residualisation. It is a different type of framework in that it did not establish a supervised model trained to relate resting-state data to task-fMRI; instead, it sought to explicitly build the underlying mechanisms that link functional connectivity to task activations, testing whether such a mechanistic construct can capture task activations better than chance. The two frameworks differ in nature but both are able to reproduce individual task-fMRI patterns, and they are tailored to different purposes. The activity flow approach provides an explanation for the observed resting-state and task-fMRI relationships and tests the corresponding hypothesis against chance level; it may help us better understand the intrinsic mechanism or biology underlying functional connectivity and task activations. Our framework is a supervised model, which may give more accurate predictions of individual activations. There are also major differences between their residualisation and ours. For example, the resting-state data were residualised differently in the two studies: in our analysis, the component regressed out is the group-average spatial map, while in their model, this component is the subject-general voxel-to-region FC pattern. Furthermore, when predicting residual activations for a given subject, they used subject-general activation patterns (calculated from other subjects) in combination with the residualised FC patterns of the given subject, whereas we did not utilise this subject-general task activation information at all. Both approaches exhibited the benefits of residualisation.
Their approach showed that individualised functional connectivity routes can predict individual-specific activations better than chance, providing further evidence for the utility of residualisation in producing more individualised activation patterns for individual subjects.
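For comparison, the core of the activity-flow idea can be sketched in a few lines: each region's task activity is predicted as the connectivity-weighted sum of all other regions' activities, with self-connections excluded. This is a simplified, toy-data reading of that framework, not our supervised model:

```python
import numpy as np

def activity_flow(activity, fc):
    # Predict region j's activity from the activities of all other
    # regions, weighted by their functional connectivity to j.
    n = activity.shape[0]
    pred = np.empty(n)
    for j in range(n):
        others = np.arange(n) != j   # exclude the self-connection
        pred[j] = activity[others] @ fc[others, j]
    return pred

# Toy data: 50 regions, random symmetric FC matrix.
rng = np.random.default_rng(2)
fc = rng.standard_normal((50, 50))
fc = (fc + fc.T) / 2
act = rng.standard_normal(50)
pred = activity_flow(act, fc)
```

In the actual framework the predicted vector is compared with the measured activations (e.g., by correlation) and tested against chance; the sketch only shows the connectivity-weighted propagation step.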
Note that the group-average activation patterns alone can have considerable overlap with individual activation maps; one can thus obtain moderate prediction accuracy by simply reproducing the group-average. Hence, the accuracy of residualised predictions, or the discriminability of the group-average-added-back predictions, is more informative about the model's ability to make individualised predictions. This is of particular importance because many existing algorithms tend to push predictions towards the mean. In a higher-dimensional setting, the relation between the two measures becomes complicated, but it is not difficult to see that improving discriminability may degrade accuracy slightly. Training and evaluating the model on residualised resting-state and task data thus has more desirable properties: it not only simplifies the assessment of model performance but also maximises the ability to capture inter-individual differences. Other approaches to improving prediction discriminability include introducing a contrastive loss term that pushes between-subject differences to be large ( Ngo et al., 2021 ). Whether the two approaches are comparable is yet to be investigated. However, introducing extra terms may complicate the loss function (for example, turning a convex loss into a non-convex one) and thus may be less stable; training on residualised data keeps the original loss function structure and is usually simpler to train.
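The residualisation scheme, together with a simple identification-based discriminability score, can be sketched as follows (hypothetical helper names, toy data):

```python
import numpy as np

def residualise(task_maps):
    # Split task maps (subjects x voxels) into the group-average map and
    # subject-specific residuals; the residuals are the training target.
    group_mean = task_maps.mean(axis=0)
    return group_mean, task_maps - group_mean

def identification_rate(pred, actual):
    # Fraction of subjects whose predicted map correlates best with
    # their OWN actual map -- a simple discriminability score.
    s = len(pred)
    c = np.corrcoef(pred, actual)[:s, s:]   # subjects x subjects block
    return np.mean(np.argmax(c, axis=1) == np.arange(s))

# Toy data: 20 subjects, 300 voxels, shared group pattern plus
# subject-specific deviations.
rng = np.random.default_rng(3)
group = rng.standard_normal(300)
maps = group + 0.5 * rng.standard_normal((20, 300))
mean_map, resid = residualise(maps)
```

After a model is trained to predict `resid` from resting-state features, adding `mean_map` back in yields the final individualised map, and `identification_rate` quantifies whether predictions are differentiated across subjects rather than collapsed onto the mean.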
It is also important to note that the accuracy of such predictions is not by itself a good indicator of how well the model captures the underlying individual differences. Prediction accuracy is limited not only by inaccuracies of model assumptions or inappropriate specification (i.e. mismodelling) but also by other sources of variability such as measurement noise. In general, computational models do not account for the variability due to measurement noise, which imposes a bound on the model's ability to capture the true signal of interest. This effect is usually referred to as the noise ceiling, and can be estimated by calculating the correlation between two repetitions of the test set (i.e. a split-half estimator). Comparing the accuracy against the noise ceiling then indicates how much of the explainable variance the model captures: if a model can fundamentally capture the variations in brain response due to individual or stimulus differences, the accuracy of prediction should at least verge on the test-retest reliability, or even exceed it. It has previously been recommended in the neuroimaging community to report the performance of a model with respect to the noise ceiling ( Lage-Castellanos et al., 2019 ). In our results, the prediction accuracies on the HCP dataset almost paralleled the test-retest reliability for most task contrasts, suggesting that our model can fundamentally capture the individual variations in activation. On the UKB dataset, the accuracies were less encouraging, which is unsurprising given the shorter scanning sessions and higher noise level.
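A split-half noise-ceiling estimate can be computed as below (toy data; `noise_ceiling` is an illustrative helper, not from our codebase):

```python
import numpy as np

def noise_ceiling(run1, run2):
    # Split-half noise ceiling: per-subject correlation between two
    # repetitions of the same task contrast (subjects x voxels).
    return np.array([np.corrcoef(a, b)[0, 1]
                     for a, b in zip(run1, run2)])

# Toy data: a shared per-subject signal plus independent scan noise,
# mimicking two repetitions of a test-set contrast.
rng = np.random.default_rng(4)
signal = rng.standard_normal((10, 500))
run1 = signal + 0.5 * rng.standard_normal((10, 500))
run2 = signal + 0.5 * rng.standard_normal((10, 500))
ceiling = noise_ceiling(run1, run2)
```

A model's per-subject prediction accuracy can then be reported relative to `ceiling`: accuracies approaching the ceiling suggest the model captures most of the explainable (non-noise) individual variation.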
In addition to predicting individual-unique activations, it is also of value to investigate the causes of variations in task-evoked activations, in particular, what information in resting-state activity drives individual differences in task activity. For example, do variations in peak activation patterns correspond to changes in resting-state activity at the same location, or are they driven by a more complicated configuration of the dense connectivity pattern? Such investigations would help us understand the nature of the inherent resting-state features that characterise variations in task activity. For example, these features can be "structurally " inherent (characterised by brain organisation and connectivity) or "functionally " inherent (related to the cognitive state of subjects during the resting-state scan) ( Tavor et al., 2016 ), both of which may cause the re-configuration or re-allocation of peak activation patterns. Note that individual differences in task-evoked activations may be partially due to inter-subject misalignment; registration remains an empirical question and may be sub-optimal in practice. However, it is very unlikely that our results merely reflect misalignment between subjects, as the model captures variations not only in the shape and position but also in the topology of the activations. Indeed, it is likely that the relatively state-of-the-art alignment used here in preprocessing reduced inter-subject variability, rather than increased it.
Using resting-state fMRI scans to infer individualised task-evoked responses has a wide range of implications in translational and clinical neuroscience. One potential application of the proposed model is to infer individualised functional localisers from resting-state fMRI scans. This is important because task-fMRI scans are often of limited accuracy and reliability ( Elliott et al., 2020 ), possibly due to poor task performance and noise that is hard to remove in pre-processing. Such a framework can supplement task localisers, potentially improving the prediction of individual functional mapping (across multiple networks and regions) and facilitating investigations of individualised response profiles of localised brain regions. Although our results primarily focus on the residualised results, this does not conflict with application targets such as pre-surgical planning. One of the biggest challenges in surgical targeting is to account for individual differences in the location of a target structure, which may vary considerably across subjects. It is therefore desirable to have a model better capable of capturing the underlying inter-individual differences, not just reproducing the typical patterns shared by all subjects. The model trained on residualised data, with the group-average added back afterwards, yielded more differentiated or "individualised " predictions, which matches this goal. The next question is how many training subjects are required to create such a model and make its predictions useful, since in a surgical context it is impractical to collect as many subjects as in the UKB and HCP datasets. Firstly, it is worth noting that even though the numbers of training subjects were large in this study, neither of the two datasets requires a big "N " to achieve good prediction accuracy ( Figures S24 and S25). In fact, prediction accuracy as a function of training size converges very quickly.
Secondly, one does not necessarily have to re-collect training data using the same scanner or task for a given clinical dataset, because a model trained on one dataset may be transferable to another if aided by transfer learning or data harmonisation techniques. For example, Jones et al. (2017) applied the GLM model ( Tavor et al., 2016 ) trained on the HCP healthy cohort to a disease cohort and successfully learned the variations in their language maps. Another possibility is to generalise the resting-state features across different datasets, instead of transferring the mapping between resting-state and task data from one dataset to another. More specifically, one can generate matched dual-regression or sPROFUMO maps for different datasets; the coefficients learned on one dataset may then also be useful for another, as they refer to the same spatial configurations. Of course, accurately generalising such predictions requires more complicated modelling techniques; developing and validating such approaches is outside the scope of this paper. As numerous multi-site, multi-scanner consortia emerge, it will be important in the future to develop a model that learns features invariant across scanners and insensitive to confounds. Once generalisable to other populations, such a model could be used to localise regions of interest for those who cannot perform tasks, such as paralysed patients and infants.
This study has a few limitations. First, the ensemble model is a linear combination of two (largely) linear models and thus has limited ability to capture higher-order non-linear relationships between resting-state and task-evoked brain activity. Second, the decompositions into common modes of variation are unsupervised; in the future, more complex modelling could be adopted to simultaneously estimate the common modes of variation and the reconstruction coefficients. Third, the rich information derivable from T1 and diffusion MRI scans may further aid predictions of individual differences in task-evoked activity, and the model is yet to be adapted into a multi-modal framework.

Code availability
Code for the model and analysis in this paper can be found at https://github.com/yingqiuz/predict-task-individual-variability .
Code for obtaining PFMs will be made available in an upcoming FSL release. It is currently available at https://git.fmrib.ox.ac.uk/rezvanh/sprofumo_develop .