Automated Machine Learning in Brain Predictive Modelling: A data-driven approach to Predict Brain Age from Cortical Anatomical Measures

The use of machine learning (ML) algorithms has increased significantly in neuroscience. However, from the vast range of possible ML algorithms, which is the optimal model to predict the feature of interest? What are the best parameters for such a model? Given the plethora of possible answers to these questions, automated machine learning (autoML) has been gaining attention in recent years. Here, we used TPOT, a tree-based pipeline optimisation tool that searches a space of models and their hyperparameters and returns the model with the highest accuracy. To explore autoML approaches and evaluate their efficacy on neuroimaging datasets, we chose a problem that has been the focus of extensive previous study: brain age prediction. Without any prior knowledge, TPOT was able to scan through the model space and create pipelines that outperform the state-of-the-art accuracy for Freesurfer-based models (MAE: 4.89 years) using only cortical thickness and subcortical volume information. It also suggests interesting ensembles that do not match the models currently most used for brain-age prediction but generalise well to an unseen dataset (MAE: 4.94 years). Thus, TPOT can be used as a data-driven approach to find ML models that accurately predict brain age.


Introduction
The last few decades have seen significant progress in neuroimaging methodologies and techniques focused on identifying the brain features, structural or functional, associated with brain disease states. With the advance of ML algorithms, which learn to identify significant and generalisable structure in datasets, the field of neuroimaging is moving towards becoming a predictive science, where inference can be made at the individual level rather than about the general behaviour of a group of individuals (Glaser, Benjamin, Farhoodi, & Kording, 2019; Liem et al., 2017; Yarkoni & Westfall, 2017). The aim of predictive modelling is to use ML algorithms to learn patterns in a large dataset and subsequently build a model to predict an independent variable of interest. The model's performance is then evaluated on an independent dataset, and new predictions can be generated by passing an unseen dataset to the trained model (Cole & Franke, 2017; Glaser et al., 2019; Liem et al., 2017; Yarkoni & Westfall, 2017). However, this raises various problems for analysis. Firstly, the richness of multivariate data can lead to significant overfitting of the model and a consequent loss of generalisation. Secondly, the sheer number of learning approaches available for datasets with a vast array of different properties presents a bewildering set of choices for the practitioner, each with advantages and disadvantages in terms of both generalisation and computational complexity. Hence, the optimal application of ML technology requires the answer to at least three questions: For the data at hand, which is the optimal model to predict the feature of interest? What are its best parameters? Are the chosen model and parameters overfitting to the dataset?
The fact that the answers to these questions are often arbitrary and based only on prior wisdom is a challenge for neuroimaging, which continues to face a significant replication crisis. Finding the best model and its hyper-parameters in a systematic, timely and computationally efficient way is the aim of autoML (Hutter, Kotthoff, & Vanschoren, 2019). This approach exploits the structure of the underlying dataset: while searching for the best model it optimises performance, whilst simultaneously attempting to maximise the generalisability of the resulting predictions.
In this paper, we explore a genetic autoML approach and evaluate its efficacy for both identifying and predicting patterns within neuroimaging datasets. As a test case, we chose a problem that has been the focus of extensive previous study: the use of structural brain data (in this case, cortical thickness) to predict a subject's age. We first analysed the performance of TPOT (Olson, Bartley, Urbanowicz, & Moore, 2016), then analysed the ML models suggested by TPOT and how well they generalised to a validation set.

Methods
For this analysis, T1-weighted MRI scans from N=10,307 healthy subjects (age range 18-89 years, mean age = 59.40) were obtained from 14 publicly available datasets and the UK Biobank (Sudlow et al., 2015). All subjects were screened to exclude those with major neurological or psychiatric diseases and were divided into train (n=1,030), test (n=464) and validation (n=8,813) sets in a pseudo-random fashion to ensure that the age and sex distributions were balanced across all three groups.
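A split of this kind can be sketched in pure Python as follows. This is a hypothetical re-creation for illustration only (the subject records, stratum width and allocation rule are our assumptions), not the actual code used for the paper's split:

```python
import random

def balanced_split(subjects, sizes, seed=0):
    """Pseudo-random split that keeps age and sex distributions similar
    across groups: bin subjects into (sex, age-decade) strata, shuffle each
    stratum, then allocate its members to the groups proportionally.
    Rounding leftovers stay unassigned in this simple sketch."""
    rng = random.Random(seed)
    strata = {}
    for s in subjects:
        strata.setdefault((s["sex"], s["age"] // 10), []).append(s)
    total = sum(sizes.values())
    groups = {name: [] for name in sizes}
    for members in strata.values():
        rng.shuffle(members)
        start = 0
        for name, size in sizes.items():
            take = round(len(members) * size / total)
            groups[name].extend(members[start:start + take])
            start += take
    return groups

# Hypothetical cohort of 10,307 subjects, split as in the paper.
cohort = [{"age": 18 + i % 72, "sex": i % 2} for i in range(10307)]
groups = balanced_split(cohort, {"train": 1030, "test": 464, "validation": 8813})
```

Because each stratum is divided proportionally, the age and sex composition of each group tracks that of the full cohort up to rounding.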
The feature space under analysis consisted of 116 features describing the thickness of cortical areas and the volume of subcortical structures, segmented from the individuals' T1-weighted brain images using the Desikan-Killiany atlas (Desikan et al., 2006) and the FreeSurfer recon-all pipeline (Dale, Fischl, & Sereno, 1999) (FreeSurfer v6.0).
To perform automated machine learning we used TPOT (Olson, Bartley, et al., 2016), a tree-based pipeline optimisation tool that uses genetic programming to search different hyper-parameter combinations and choose the most suitable model and its parameters for solving a classification or regression machine learning problem with high accuracy. It does so by finding the models with the best cross-validated performance on the training set in each generation and applying local perturbations (e.g., mutation and cross-over). This process is repeated for a specified number of generations, and the best-performing model is returned to the user. One particularly interesting feature of TPOT is that it creates ensembles; that is, it combines the predictions from different ML models into a single pipeline in order to enhance the model's accuracy.
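The select-and-perturb loop described above can be illustrated with a deliberately toy sketch. The "model space", fitness table and mutation scheme below are invented for illustration and stand in for TPOT's real cross-validated pipeline evaluation and tree-based representation:

```python
import random

random.seed(0)

# Toy model space: each "pipeline" is a (model_name, hyperparameter) pair.
# The base errors are made up; lower simulated MAE = fitter pipeline.
MODEL_SPACE = {"ridge": 8.0, "random_forest": 6.0, "extra_trees": 5.0}

def evaluate(pipeline):
    """Stand-in for cross-validated MAE: base error inflated by how far
    the hyperparameter sits from a (made-up) optimum of 1.0."""
    model, hp = pipeline
    return MODEL_SPACE[model] * (1.0 + abs(hp - 1.0))

def mutate(pipeline):
    """Local perturbation: occasionally swap the model, always jitter
    the hyperparameter (TPOT's mutation/cross-over analogue)."""
    model, hp = pipeline
    if random.random() < 0.2:
        model = random.choice(list(MODEL_SPACE))
    return (model, max(0.1, hp + random.gauss(0.0, 0.2)))

# Random initial population, then repeated selection + mutation.
population = [(random.choice(list(MODEL_SPACE)), random.uniform(0.1, 3.0))
              for _ in range(20)]
for generation in range(30):
    population.sort(key=evaluate)   # best simulated cross-validated MAE first
    survivors = population[:10]     # truncation selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(10)]

best = min(population, key=evaluate)
```

After a few generations the population concentrates on the fittest region of the toy space, mirroring how TPOT's candidate pipelines improve generation by generation.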
The model space consisted of a pool of 11 linear and non-linear ML algorithms commonly used to predict brain age (Aycheh et al., 2018; Table 1). We also extended the current TPOT software to include Gaussian Process Regressors (Pedregosa et al., 2011) and Relevance Vector Machines, as these are common methods for brain-age prediction.

Results
The analysed method is stochastic, so for simplicity we report only the results obtained with a predefined random seed.

Analysis of TPOT Performance
Our first step was to evaluate whether TPOT was able to increase prediction accuracy over generations. Fig 1 illustrates the change in prediction performance, evaluated using the MAE (Mean Absolute Error; lower values represent better accuracy), for the different models at every generation. During the first two generations, TPOT is still exploring the model space and there is a large variance in the models' MAEs. As shown by the decrease in the error bars, after the first 60 generations the model pool becomes more accurate. However, throughout the entire analysis we can observe three main groups of models: a group that predicts age poorly (MAE > 35 years), a second group whose accuracy oscillates around an MAE of 15 years, and a third group with MAE < 10 years.
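For reference, the MAE used throughout is simply the average absolute difference between predicted and chronological age; a minimal sketch with made-up ages:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute deviation between predicted and true ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predicted vs. chronological ages for five subjects.
true_ages = [23, 41, 58, 66, 80]
predicted = [27, 39, 52, 70, 77]
mae = mean_absolute_error(true_ages, predicted)  # (4+2+6+4+3)/5 = 3.8 years
```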

TPOT Suggested Models
We also analysed the presence of each model at every generation. As illustrated in Fig 2, Random Forests, Ridge Regression and Extra Trees Regressors achieve high accuracy and are passed on into future generations.

TPOT Generalisability
For this specific analysis, an ensemble of a LinearRegression and two combined ExtraTreesRegressors performed with the highest accuracy.
To assess whether this model shows evidence of overfitting to the training dataset, we applied it to a left-out validation dataset. While the MAE on the test set was 4.89 years, the MAE obtained on the validation dataset was 4.94 years.
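This overfitting check amounts to comparing the two errors; a minimal sketch using the numbers reported above (the 1-year threshold is illustrative, not from the paper):

```python
def generalisation_gap(test_mae, validation_mae):
    """Positive gap = worse error on unseen data; a large gap would
    indicate overfitting to the data seen during model search."""
    return validation_mae - test_mae

# Reported errors (years): test 4.89, left-out validation 4.94.
gap = generalisation_gap(4.89, 4.94)
overfitting = gap > 1.0  # illustrative threshold, chosen for this sketch
```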

Discussion
In the course of this work we have demonstrated that: (1) TPOT can be used to identify a good predictive model; (2) for this application there is no single analysis model that "best" predicts age from the underlying structural imaging data; the "best" models identified by TPOT consist of a mix of random forests, randomised decision trees and cross-validated lasso.
(3) The accuracy of the models suggested by TPOT is better than that of recent brain age models. When comparing the accuracy of different studies, it is important to take into account the age range of the analysed sample, as age prediction over a small range has less variability than over a large range. Liem et al. (2017), using only cortical thickness, reported an MAE of 5.95 years (analysed age range 19-92 years, mean = 58.68). One of the main advantages of the approach proposed here is that it makes no assumptions about the underlying statistics of the dataset and requires no fine-tuning of the chosen model, yet still achieves state-of-the-art accuracy. (4) We further evaluated the decisions made by TPOT on a completely independent dataset, to explore the stability of the TPOT-generated analysis pipeline, and found that the predictions generalise well to this 'unseen' dataset. Therefore, the autoML approach shown in this paper can be used as a data-driven method to learn patterns in the data while avoiding common pitfalls of ML algorithms such as overfitting.
Despite the high accuracy obtained using only the cortical thickness information, we think that the MAE can be further reduced by adding the surface-area information obtained from FreeSurfer to the TPOT models. Previous studies have observed an increase in accuracy when adding different cortical anatomical measures to the models (Liem et al., 2017; Valizadeh et al., 2017; Wang et al., 2014). The accuracy of our current approach might also improve with a better split of the training, test and validation sets. With the current settings, TPOT can only use around 2,500 subjects for the training and test datasets in order to explore the defined model space. As we have a large sample size available, it would be interesting to add more subjects to these groups. These two refinements will be tested in future work.