Evidence for Embracing Normative Modeling

In this work, we expand the normative model repository introduced in Rutherford et al. (2022a) to include normative models charting lifespan trajectories of structural surface area and brain functional connectivity, measured using two unique resting-state network atlases (Yeo-17 and Smith-10), and an updated online platform for transferring these models to new data sources. We showcase the value of these models with a head-to-head comparison between the features output by normative modeling and raw data features in several benchmarking tasks: mass univariate group difference testing (schizophrenia versus control), classification (schizophrenia versus control), and regression (predicting general cognitive ability). Across all benchmarks, we confirm the advantage (i.e., stronger effect sizes, more accurate classification and prediction) of using normative modeling features. We intend for these accessible resources to facilitate wider adoption of normative modeling across the neuroimaging community.


Normative modeling is a framework for mapping population-level trajectories of the relationships between health-related variables while simultaneously preserving individual-level information Marquand et al. (2016).

[…] cell represents a unique between-network connection. For clarification, we also note that the raw input data is the starting point of the normative modeling analysis; in other words, the raw input data is the response (dependent) variable that is predicted from the vector of covariates when estimating the normative model. Before entering the benchmarking tasks, and to create a fair comparison between raw data and deviation scores, nuisance variables including sex, site, linear and quadratic effects of age, and head motion (functional models only) were regressed out of the raw data (structural and functional) using least squares regression.
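The residualization step described above can be sketched as an ordinary least squares fit followed by subtraction of the fitted nuisance effects. This is a minimal illustration, not the authors' code; the function and variable names are hypothetical, and the exact site-encoding scheme used in the paper is an assumption.

```python
import numpy as np

def residualize(data, age, sex, site_ids, motion=None):
    """Regress nuisance covariates out of each column of `data`
    (subjects x regions) via least squares. Illustrative sketch;
    names and the one-hot site encoding are assumptions."""
    n = data.shape[0]
    # Design matrix: intercept, sex, linear and quadratic age
    cols = [np.ones(n), sex, age, age ** 2]
    # One-hot encode site, dropping one level to avoid collinearity
    for s in np.unique(site_ids)[1:]:
        cols.append((site_ids == s).astype(float))
    if motion is not None:  # head motion, functional models only
        cols.append(motion)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, data, rcond=None)
    return data - X @ beta  # residuals with nuisance effects removed
```

The residuals are, by construction, orthogonal to every column of the nuisance design matrix, so downstream models cannot exploit these variables.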

Benchmarking

The benchmarking was performed in three separate tasks: mass univariate group difference testing, multivariate prediction (classification), and multivariate prediction (regression), each described in further detail below. In each benchmarking task, a model was estimated using the deviation scores as input features and then estimated again using the raw data as input features. After each model was fit, the performance metrics were evaluated and the difference in performance between the deviation score and raw data models was calculated, described in more detail in the evaluation section below. An overview of the analysis workflow is shown in Figure 1.

Figure 1. Overview of Workflow. A) Datasets included the Human Connectome Project (young adult) study, University of Michigan schizophrenia study, and COBRE schizophrenia study. B) Openly shared normative models, pre-trained on big data, were estimated for large-scale resting-state functional brain networks and cortical thickness. C) Deviation (Z) scores and raw data, for both functional and structural data, were input into three benchmarking tasks: support vector machine (SVM) classification, group difference testing, and regression (predicting cognition). D) Evaluation metrics were calculated for each benchmarking task. These metrics were calculated for the raw data models and the deviation score models, and the difference between each model's performance was calculated for both functional and structural modalities.

Task 1 Mass Univariate Group Difference Testing

Mass univariate group difference (schizophrenia vs. control) testing was performed across all brain regions. Two-sample independent t-tests were run on the data using the SciPy Python package Virtanen et al. (2020). After multiple comparison correction, brain regions with FDR-corrected p < .05 were considered significant, and the total number of regions displaying statistically significant group differences was counted.
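The per-region t-testing and FDR counting can be sketched as follows. This is a minimal illustration under stated assumptions: it uses SciPy's `ttest_ind` (which the paper cites) but implements the Benjamini-Hochberg step-up procedure by hand, which may differ in detail from the authors' correction code.

```python
import numpy as np
from scipy import stats

def count_significant_regions(group_a, group_b, alpha=0.05):
    """Two-sample t-test per brain region (column), then a
    Benjamini-Hochberg FDR correction; returns the number of
    regions surviving correction. Illustrative sketch."""
    _, pvals = stats.ttest_ind(group_a, group_b, axis=0)
    m = len(pvals)
    ranked = np.sort(pvals)
    # BH step-up: largest k with p_(k) <= (k/m) * alpha
    thresh = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresh
    return int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
```

Counting surviving regions (rather than averaging statistics) matches the evaluation metric used for task one.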

196
For the purpose of comparing group difference effects to individual differences, we also summarized the individual deviation maps and compared these maps to the group difference map. Individual deviation maps were summarized by counting the number of individuals with 'extreme' deviations (Z > 2 or Z < −2) at a given brain region or network connectivity pair. This was done separately for positive and negative deviations and for each group, and visualized qualitatively (Figure 4B).

To quantify the individual difference maps in comparison to group differences, we performed a Mann-Whitney U-test on the count of extreme deviations in each group.
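The extreme-deviation summary and the group comparison of the resulting counts can be sketched together. This is a hedged illustration: function names are hypothetical, and the |Z| > 2 threshold follows the text, while the choice of a two-sided alternative for the U-test is an assumption.

```python
import numpy as np
from scipy import stats

def extreme_deviation_counts(z, thresh=2.0):
    """Per-region counts of individuals with extreme deviations.
    `z` is a subjects x regions array of deviation (Z) scores."""
    pos = (z > thresh).sum(axis=0)   # Z > +2 per region
    neg = (z < -thresh).sum(axis=0)  # Z < -2 per region
    return pos, neg

def compare_counts(z_patients, z_controls, sign="neg"):
    """Mann-Whitney U-test on the per-region extreme-deviation
    counts between two groups (illustrative sketch)."""
    idx = 1 if sign == "neg" else 0
    c_pat = extreme_deviation_counts(z_patients)[idx]
    c_con = extreme_deviation_counts(z_controls)[idx]
    return stats.mannwhitneyu(c_pat, c_con,
                              alternative="two-sided").pvalue
```

The nonparametric U-test is a natural choice here because region-wise counts are discrete and typically not normally distributed.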

Task 2 Multivariate Prediction - Classification

Support vector machine (SVM) is a commonly used algorithm in machine learning studies and performs well in classification settings. A support vector machine constructs a set of hyperplanes in a high-dimensional space and optimizes to find the hyperplane that has the largest distance, or margin, to the nearest training data points of any class. A larger margin represents better linear separation between classes and corresponds to a lower error of the classifier in new samples.

Samples that lie on the margin boundaries are also called "support vectors."

[…] For task one, the metric was the total count of models with significant group differences after multiple comparison correction (FDR-corrected p < 0.05). In task two, the metric was the area under the receiver operating characteristic curve (AUC), averaged across all folds within a 10-fold cross-validation framework. For task three, the metric was the mean squared error (MSE) of the prediction in the test set. Evaluation metrics for each task were calculated independently for both deviation score (Z) and raw data (R) models. Higher AUC, higher count, and lower MSE represent better model performance. We then have an observed statistic of interest, d_obs, which represents the difference between deviation and raw data model performance.
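The task-two evaluation, an SVM classifier scored by mean AUC over 10-fold cross-validation, can be sketched with scikit-learn. This is a minimal sketch, not the paper's pipeline: the linear kernel, standardization step, and stratified folds are assumptions, since the text does not specify hyperparameters.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_auc(features, labels, n_splits=10, seed=0):
    """Mean ROC-AUC of a linear SVM over stratified k-fold CV.
    Hyperparameters here are illustrative assumptions."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                         random_state=seed)
    scores = cross_val_score(clf, features, labels,
                             cv=cv, scoring="roc_auc")
    return scores.mean()
```

Running the same function once on deviation scores and once on residualized raw data yields the two AUC values whose difference is the task-two statistic.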
To assess whether d_obs is larger than would be expected by chance, we generated a null distribution for this statistic using permutations. Within one iteration of the permutation framework, a random sample is generated by shuffling the labels (in tasks 1 and 2 we shuffle the clinical group labels, and in task 3 we shuffle the g-factor labels). This sample is then used to train both deviation and raw models, ensuring the same row shuffling scheme across both the deviation score and raw data datasets within each permutation iteration. The shuffled models are evaluated, and we calculate d_perm for each random shuffle of labels. We set P = 10,000 and use the distribution of d_perm to calculate a p-value for d_obs in each benchmarking task. The permuted p-value is equal to (k + 1)/(P + 1), where k is the number of permutations in which d_perm > d_obs.
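The permutation scheme and the (k + 1)/(P + 1) p-value can be sketched directly. This is an illustrative sketch: `stat_fn` stands in for the full retrain-and-evaluate step (deviation model minus raw model performance) described above, and the symbols d_obs/d_perm follow the notation reconstructed here rather than the original typography.

```python
import numpy as np

def permutation_pvalue(d_obs, stat_fn, labels, n_perm=10_000, seed=0):
    """p = (k + 1) / (P + 1), where k counts permutations whose
    shuffled statistic exceeds the observed one. `stat_fn(labels)`
    must recompute the performance difference for a given label
    ordering, applying the same shuffle to both datasets."""
    rng = np.random.default_rng(seed)
    k = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)  # same shuffle for both models
        if stat_fn(perm) > d_obs:
            k += 1
    return (k + 1) / (n_perm + 1)
```

Adding 1 to both numerator and denominator keeps the p-value strictly positive and avoids the anti-conservative p = 0 that a naive k/P estimate can produce.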

The same evaluation procedure […]

Sharing of functional big data normative models

The first result of this work is the evaluation of the functional big data normative models (Figure 3). These models build upon the work of Rutherford et al. (2022a), in which we shared population-level structural normative models charting cortical thickness and subcortical volume across the human lifespan (ages 2-100). The data sets used for training the functional models, the age range of the sample, and the procedures for evaluation closely resemble those of the structural normative models.

The sample size (approx. N=22,000) used for training and testing the functional models is smaller than that of the structural models (approx. N=58,000) due to data availability (i.e., some sites included in the structural models did not collect functional data or could not share the data) and data quality […]

Figure 3 caption (excerpt): We point out that the age range of the transfer (controls) sample (shown in Figure 2A) falls into a range with sparse data; therefore, the lower explained variance observed in the transfer (controls) group compared to the test and transfer (patients) groups is likely due to epistemic uncertainty (reducible by adding more data points) in the model predictions in this age range. B) The distribution across all models of the evaluation metrics (columns) in the test set (top row) and both transfer sets (middle and bottom rows). Higher explained variance (closer to 1), more negative MSLL, and normally distributed skew and kurtosis correspond to better model fit.

The strongest evidence for embracing normative modeling can be seen in the benchmarking task one group difference (schizophrenia vs. controls) testing results (Table 2, Figure 4A). In this application, we observe numerous group differences in both the functional and structural deviation score […]

Table 2. Benchmarking Results. The Deviation (Z) score column shows the performance using deviation scores (AUC for classification, total number of regions with significant group differences at FDR-corrected p < 0.05 for case vs. control, mean squared error for regression), the Raw column shows the performance when using the raw data, and the Difference column shows the difference between the deviation scores and raw data (Deviation - Raw). Higher AUC, higher count, and lower MSE represent better performance. Positive values in the Difference column show better performance when using deviation scores as input features for the classification and group difference tasks, and negative values in the Difference column for the regression task show better performance using the deviation scores. * = statistically significant difference between Z and Raw, established using permutation testing (10,000 permutations).

The individual difference maps show that at every brain region or connection there is at least one person, across both patient and control groups, who has an extreme deviation. We found significant group differences in the count of negative deviations for both cortical thickness (p = 0.0029) and functional networks (p = 0.013), and a significant group difference in the count of positive cortical thickness deviations (p = 0.0067).

In benchmarking task two, we classified schizophrenia versus controls using support vector classification within a 10-fold cross-validation framework (Table 2, Figure 5). The best performing model used cortical thickness deviation scores to achieve a classification accuracy of 87% (AUC = 0.87).

The raw cortical thickness model accuracy was indistinguishable from chance (AUC = 0.43). The AUC difference between the cortical thickness deviation and raw data models was 0.44, and this performance difference was statistically significant. The functional models, using both deviation scores (AUC = 0.69) and raw data (AUC = 0.68), were more accurate than chance; however, the performance difference (i.e., the improvement in accuracy using the deviation scores) was small (0.01) and not statistically significant.

In benchmarking task three, we fit multivariate predictive models in a held-out test set of healthy individuals in the Human Connectome Project young-adult study to predict general cognitive ability […] to the raw data model (MSE = 0.708), and this difference was not statistically significant. For the functional models, both the deviation score (MSE = 0.877) and raw data (MSE = 0.890) models were less accurate than the structural models, and the difference between them (0.013) was also not statistically significant.

Figure 5 caption (excerpt): B) Support Vector Classification using cortical thickness (residualized of sex and linear/quadratic effects of age) as input features. C) Support Vector Classification using functional brain network deviation scores as input features. D) Support Vector Classification using functional brain networks (residualized of sex and linear/quadratic effects of age and motion (mean framewise displacement)) as input features.

[…] to strong (group difference testing) benefits of using deviation scores compared to the raw data features.

The fact that the deviation score models perform better than the raw data models confirms the utility of placing individuals into reference models. Our results show that normative modeling can capture population trends, uncover clinical group differences, and preserve the ability to study individual differences. We have some intuition as to why the deviation score models perform better on the benchmarking tasks than the raw data. With normative modeling we account for many sources of variance that are not necessarily clinically meaningful (i.e., site), while capturing clinically meaningful information within a reference cohort perspective. The reference model helps beyond just removing confounding variables such as scanner noise, because we show that even when the nuisance covariates (age, sex, site, head motion) are removed from the raw data, […]

There has been recent interesting work on "failure analysis" of brain-behavior models Greene et al. […]

[…] mathematical description, see Fraza et al. (2021). Briefly, for each brain region of interest, y is predicted as:

y = w⊺φ(x) + ε

where w⊺ is the estimated weight vector, φ(x) is a basis expansion of the covariate vector x, consisting of a B-spline basis expansion (cubic spline with 5 evenly spaced knots) to model non-linear effects […]

Z = (y − ŷ) / √(σ² + σ*²)

where y is the true response, ŷ is the predicted mean, σ² is the estimated noise variance (reflecting uncertainty in the data), and σ*² is the variance attributed to modeling uncertainty. Model […]
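The deviation score defined by these terms can be computed directly once a normative model supplies per-subject predictions and the two variance components. A minimal sketch, assuming NumPy arrays of matching shape (this is not the PCNtoolkit implementation, just the formula as stated):

```python
import numpy as np

def deviation_z(y, y_hat, sigma2, sigma2_star):
    """Z = (y - y_hat) / sqrt(sigma2 + sigma2_star).

    y           : observed response (e.g., cortical thickness)
    y_hat       : normative model's predicted mean
    sigma2      : estimated noise variance (uncertainty in the data)
    sigma2_star : variance attributed to modeling uncertainty
    """
    return (y - y_hat) / np.sqrt(sigma2 + sigma2_star)
```

Because both variance terms appear in the denominator, a subject in a sparsely sampled age range (high modeling uncertainty σ*²) receives a smaller |Z| for the same raw discrepancy, which is exactly the epistemic-uncertainty behavior noted for the transfer sample.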