From brain to education through machine learning: Predicting literacy and numeracy skills from neuroimaging data

Abstract The potential of using neural data to predict academic outcomes has always been at the heart of educational neuroscience, an emerging field at the crossroads of psychology, neuroscience, and education sciences. Although this prospect has long been elusive, the rapidly growing use of advanced machine-learning techniques in neuroimaging may change this state of affairs. Here we provide a review of neuroimaging studies that have used machine learning to predict literacy and numeracy outcomes in adults and children, in both the context of learning disability and typical performance. We notably review the cross-sectional and longitudinal designs used in such studies, and describe how they can be coupled with regression and classification approaches. Our review highlights the promise of these methods for predicting literacy and numeracy outcomes, as well as their difficulties. However, we also found a large variability in terms of algorithms and underlying brain circuits across studies, and a relative lack of studies investigating longitudinal prediction of outcomes in young children before the onset of formal education. We argue that the field needs a standardization of methods, as well as a greater use of accessible and portable neuroimaging methods that have more applicability potential than lab-based neuroimaging techniques.


INTRODUCTION
The past few decades have seen a rapid increase in our understanding of how the brain changes over development and learning, leading a number of neuroscientists to consider implications of these findings for education. This has led to the emergence of the field of educational neuroscience (Ansari & Coch, 2006; Goswami, 2004, 2006), defined in a recent review (Thomas et al., 2019) as "an interdisciplinary research field that seeks to translate research findings on neural mechanisms of learning to educational practice and policy." However, this general endeavor has not been unchallenged. Critics have notably claimed that neuroscience findings are too remote from the classroom to be informative and to have practical implications for children or educational systems (Bruer, 1997). Others have argued that behavioral measures are more practical to characterize children's cognitive capacities than neuroimaging measures (Bowers, 2016).
In an earlier review, Gabrieli et al. (2015) argued otherwise and suggested that brain measures obtained through neuroimaging techniques may be useful for predicting future academic outcomes and, therefore, help design interventions, as well as for evaluating the success of interventions. A relatively limited number of studies were available at the time of Gabrieli et al.'s review. However, significant progress has since been made in both neuroimaging and machine-learning techniques. The term "machine learning" refers here to a set of computational methods that involve the development of algorithms and statistical models relying on patterns and inference derived from data. These computational methods typically use past information to improve their performance or to make accurate predictions over time (Mohri et al., 2012). Because these technological advances are changing the landscape of what may be possible in terms of the prediction of outcomes from neural signals, we aimed here to provide an updated review of recent advances in neuroscience and machine learning that may have application to both education and the treatment of neurodevelopmental disorders. Though the present review primarily focuses on the methodological framework, challenges, and main findings from these studies, we will also end by discussing the potential practical applications of this line of research.
The present review largely focuses on findings in the domains of literacy and numeracy skills (and associated disorders) for two reasons. First, literacy and numeracy skills are considered fundamental to modern science and technologies, and difficulties in acquiring these abilities may negatively impact academic attainment and financial well-being (Estrada-Mejia et al., 2016). Predicting reading and mathematical difficulties in children has, therefore, critical societal relevance. Second, literacy and numeracy are probably the academic domains for which the most progress has been made in developmental cognitive neuroscience over the past decades. We will, however, also include in our review several studies that have focused on other cognitive factors relevant to education. Finally, we will highlight future directions for studies aiming to apply machine learning to neural data in order to predict and improve educational outcomes.
Gabrieli et al. (2015) pointed out that the term "prediction" can have at least three different meanings in studies. In its weakest form, the term might be used to describe a correlation between two sets of variables obtained at the same time point. In a slightly stronger form, it can also be used to describe a correlation between two sets of variables obtained at different time points. In its strongest form, "prediction" may describe a model generalization to out-of-sample individuals, which typically relies on machine learning. This third meaning is arguably the closest to the definition of a "prediction" in common language. Studies demonstrating an out-of-sample generalization also have the most practical relevance because they suggest that a model would be applicable to novel data that are not specific to a given sample.

PREDICTING EDUCATIONAL OUTCOMES FROM BRAIN ACTIVITY: METHODOLOGICAL CONSIDERATIONS
The present review exclusively focuses on the term "prediction" as describing generalization to out-of-sample individuals, and, therefore, only includes studies demonstrating such generalization. As a side note, not all neuroimaging studies using machine-learning techniques are relevant to the question of individual differences in academic performance, learning, or development. For instance, studies may use machine learning to test differences in spatial distributions of neural activity across tasks (Nakai et al., 2023). These studies are not included in the present review either.
Broadly speaking, previous neuroimaging studies using machine learning to predict educational outcomes can be divided into two categories. The first category (Fig. 1, top row) encompasses studies using a cross-sectional design, such that different participants are evaluated at one (T1) or several time points (T1 and T2). The second category (Fig. 1, bottom row) includes studies using a longitudinal design, such that the same participants are evaluated at different time points (T1 and T2). These time points can be separated by days, weeks, or even years. Note that cross-sectional and longitudinal studies may use supervised learning to predict either a continuous distribution of achievement (e.g., reading, math) scores from brain activity or discrete categorical labels such as presence or absence of learning disability. While the former relies on regression analyses (Fig. 1, left column), the latter involves classification analyses (Fig. 1, right column) (Bishop, 2006).
Note that the three typical meanings of "prediction" in Gabrieli et al. (2015) can be categorized along the dimensions of "in-sample correlation vs. out-of-sample prediction" and "cross-sectional vs. longitudinal." That is, the first two meanings are similar in that they both focus on in-sample correlation but are different because one uses a cross-sectional design and the other a longitudinal design. The third meaning (out-of-sample prediction) can also be applied to both cross-sectional and longitudinal data (Fig. 1). In both cases, machine-learning models are trained with a subset of samples, and their generalizability is tested with left-out samples.
Regression and classification analyses use different analytic strategies. For instance, regression analysis as it is applied to a cross-sectional design (Fig. 1, upper left cell) relies on the generation of a predictive model based on the relationship between brain and behavioral data across participants from the training set at T1. The trained model is then used to predict behavior from brain data in left-out participants, also at T1. Regression analysis as it is applied to a longitudinal design (Fig. 1, bottom left cell) relies on the generation of a predictive model based on the relationship between brain data at T1 and behavioral data at T2 across participants from the training set. The trained model is then used to predict behavior at T2 from brain data at T1 in left-out participants. Classification analysis as it is applied to a cross-sectional design (Fig. 1, upper right cell) relies on an association between a discrete categorization of participants from the training set according to behavioral labels defined at T1 and their brain data at T1. This trained model is then used to assign labels to left-out participants based on their specific brain data, also at T1. Classification analysis as it is applied to a longitudinal design (Fig. 1, bottom right cell) relies on an association between a discrete categorization of participants from the training set according to behavioral labels defined at T2 (e.g., typically developing or learning disabled) and their brain data at T1. This trained model is then used to assign labels to left-out participants based on their specific brain data at T1. The specific methodologies underlying these analyses are discussed in a later section (see Studies Use a Range of Machine-Learning Methods). The present study does not include data from human or animal subjects and does not require approval from the ethics committee or informed consent.
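The longitudinal regression case (Fig. 1, bottom left cell) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are simulated, a single hypothetical brain feature stands in for high-dimensional imaging data, and ordinary least squares stands in for whichever regression method a given study used. The point is only the logic of fitting on training participants and evaluating on left-out participants.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

# Simulated data (hypothetical): one brain feature at T1, one score at T2.
rng = random.Random(0)
n_participants = 40
brain_t1 = [rng.gauss(0, 1) for _ in range(n_participants)]
behavior_t2 = [100 + 10 * b + rng.gauss(0, 5) for b in brain_t1]

# Split participants (not time points) into training and left-out sets.
ids = list(range(n_participants))
rng.shuffle(ids)
train_ids, test_ids = ids[:30], ids[30:]

# Fit on training participants: brain at T1 -> behavior at T2.
slope, intercept = fit_line([brain_t1[i] for i in train_ids],
                            [behavior_t2[i] for i in train_ids])

# Predict T2 behavior for left-out participants from their T1 brain data.
predicted = [intercept + slope * brain_t1[i] for i in test_ids]
observed = [behavior_t2[i] for i in test_ids]
r_out = pearson_r(predicted, observed)  # out-of-sample prediction accuracy
print(f"out-of-sample r = {r_out:.2f}")
```

The cross-sectional variant would differ only in using behavioral data from T1 as the target, and the classification variants would replace the continuous score with a discrete label.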

CAN NEUROIMAGING STUDIES PREDICT LITERACY SKILLS?
A number of cross-sectional (Table 1) and longitudinal (Table 2) neuroimaging studies have attempted to use brain data to predict literacy skills (see Supplementary Information for the selection criteria of articles and the methodology used to generate the tables). For example, using regression in a cross-sectional design, He et al. (2013) showed that gray matter (GM) structural MRI (sMRI) data from adult participants could predict various language abilities (phonological decoding, form-sound [...] (Kristanto et al., 2020), and task-fMRI (language, working memory, and motor tasks) (Tomasi & Volkow, 2020). Together, these studies show that it is possible to predict individual differences in literacy skills with different sources of neuroimaging data, indicating that such skills are related to brain data over multiple dimensions.
Fig. 1. Schematic chart outlining the methodology used in neuroimaging studies reviewed here. Studies can be categorized as following a cross-sectional or a longitudinal design (rows), as well as a regression or a classification approach (columns). T1, time point 1; T2, time point 2. Note that although cross-sectional design can be applied to multiple time points, we only describe the case of T1 to avoid confusion with the longitudinal design. Furthermore, we also simplified the description of the longitudinal design by excluding cases of using differences of behavioral data (T2 - T1) as explained variables. Labels #1 and #2 indicate discrete categorization of individuals (e.g., typically developing or learning disabled). ANN, artificial neural network; DA, discriminant analysis; kNN, k-nearest neighbors; NBC, naïve Bayes classifiers; RF, random forest; SVM, support vector machine; SVR, support vector regression.
Other studies have attempted to use neuroimaging data to classify between participants with and without dyslexia, a specific learning difficulty in word recognition, word decoding, and spelling abilities, with otherwise normal intelligence (American Psychiatric Association, 2013). For example, Tamboer et al. (2016) classified adults with and without dyslexia using sMRI (GM) data. Cui et al. (2016) and Joshi et al. (2023) further showed that such classification was not limited to adults based on dMRI and sMRI data, respectively. Using sMRI (GM) data, but with a larger sample size including children from three different countries (130 children with dyslexia and 106 typically developing children), Płoński et al. (2017) replicated successful dyslexia classification. Finally, some studies have reported successful classification between children with and without dyslexia based on task-electroencephalography (EEG) with word comprehension (Zainuddin et al., 2018) and auditory stimuli listening (Formoso et al., 2021), and resting magnetoencephalography (MEG) signals (Dimitriadis et al., 2018). Although many of the studies above rely on rest-fMRI or sMRI data, more recent studies have also used task-fMRI data. For example, Mascheretti et al. (2021) [...]
Studies have also attempted to distinguish between different subtypes of language-related disorders and language proficiency levels. Bailey et al. (2016) were able to distinguish children with dyslexia from those with specific reading comprehension deficits (SRCDs) based on their sMRI (GM) data. SRCD differs from dyslexia in that affected children have difficulty in reading comprehension despite adequate phonemic decoding (Landi & Ryherd, 2017). Cignetti et al. (2020) and Nemmi et al. (2023) classified between children with dyslexia and with developmental coordination disorder (DCD) using rest-fMRI and sMRI (GM and white matter [WM]) data. Zare et al. (2016) and Yu et al. (2022) classified whether children's families had a history of dyslexia using rest-EEG and rest-fMRI data, respectively. One study has also used functional near-infrared spectroscopy (fNIRS) to classify between higher and lower second language proficiency groups (Lei et al., 2020). Barranco-Gutiérrez (2020) classified between adults who are native English speakers and those who learned English as a second language. Zhang et al. (2023) classified second language (English) proficiency levels (high, moderate, low) of Chinese speakers and further predicted listening comprehension scores using fMRI with a story listening task. Mossbridge et al. (2013) found that good and poor readers were separable using EEG data during a sentence comprehension task.
In comparison with the number of studies that have used cross-sectional designs to predict literacy outcomes, a much smaller number of studies have used longitudinal designs to make out-of-sample predictions of literacy outcomes (Table 2). A pioneering study by Hoeft et al. (2007) combined both task-fMRI (rhyme judgment) and sMRI (GM and WM) data as inputs of multiple linear regression models. The authors found that brain data could predict later reading scores at the end of the same year. Bach et al. (2013) combined task-EEG and task-fMRI data (word comprehension) to predict reading scores measured 2 years later. In Feng et al. (2021), subjects underwent grammar training of an artificial language. Their final learning outcomes were predicted from task-fMRI data during training in earlier sessions. Beyer et al. (2022) used sMRI data (GM, surface area, and local gyrification) in preschoolers to predict literacy ability 2 years later. This study is particularly interesting because children were tested before they were exposed to formal education. This finding lends support to the argument that neuroimaging measures may be used as a way to improve the early detection of learning difficulty, in order to prevent difficulties later on (Mascheretti et al., 2017).
Some longitudinal neuroimaging studies have also attempted to use neural data to classify between children with and without dyslexia. For example, Hoeft et al. (2011) showed that a machine-learning classifier can distinguish whether certain dyslexic children will improve their reading skills or not 2.5 years later using fractional anisotropy (FA) of dMRI and task-fMRI (rhyme judgment) data. Skeide et al. (2016) also reported successful classification of future dyslexia based on sMRI (GM) data in children before formal education. Finally, Yu et al. (2020) demonstrated classification of children with and without familial risk of dyslexia using task-fMRI data (phonological processing) before formal education. These reports suggest that prediction of language ability before formal education may be applicable to the early detection of risk of language deficits. In sum, both cross-sectional and longitudinal designs suggest that neuroimaging data may have the potential to predict literacy skills and classify language disorders.

CAN NEUROIMAGING STUDIES PREDICT NUMERACY SKILLS?
As is the case for studies on literacy, neuroimaging studies that attempt to predict numeracy skills can be categorized as either cross-sectional (Table 3) or longitudinal (Table 4). Cross-sectional studies include, for example, Ullman and Klingberg (2017), who estimated math scores of 6- to 7-year-old children through a prediction model of brain age using dMRI (FA). Pina et al. (2022) predicted four types of math scores (math fluency, calculation, applied problems, quantitative concepts) using 100 radiomics features derived from sMRI data.
Other cross-sectional studies have attempted to classify groups of participants with respect to their numeracy skills, for example, those with and without dyscalculia. Dyscalculia is defined as a specific learning difficulty in processing numerical information, learning arithmetic facts, and performing calculations, with otherwise normal intelligence (American Psychiatric Association, 2013). For example, Rykhlevskaia et al. (2009), Jolles et al. (2016), and Dinkel et al. (2013) showed that children with and without dyscalculia could be classified using dMRI (number of pathways), rest-fMRI (FC), and task-fMRI data (dots comparison and calculation), respectively. Moreover, Mórocz et al. (2012) and Peters et al. (2018) showed that arithmetic task-fMRI data can be used to classify both dyscalculic and dyslexic children. Torres-Ramos et al. (2020) also showed that task-EEG data (digits comparison) could be used to classify children according to three different categorical levels of math achievement.
Several studies have focused on classifying other aspects of individual differences in numeracy skills. Shim et al. (2021) and Liu et al. (2022) reported classification of individuals based on their expertise in mathematics using rest-fMRI (FC) and sMRI data, respectively. Ventura-Campos et al. (2022) classified individuals who make errors in variable selection (reversal error) when writing equations to given word problems using algebra task-fMRI data.
In contrast to what has been done in studies focusing on literacy, a greater number of studies have used a longitudinal design to predict numeracy skills (Table 4). In a seminal study relying on multivariate regression, Supekar et al. (2013) showed that sMRI (GM) and rest-fMRI (FC) data could predict improvements in math performance of 8-year-old children after an 8-week tutoring program consisting of conceptual instruction and speeded arithmetic fact retrieval. Evans et al. (2015) further showed that prediction of longitudinal math outcome is possible even 6 years later using sMRI (GM) and rest-fMRI data. Chang et al. (2022) also reported similar prediction of change in performance after 4 weeks of training using rest-fMRI (FC) data. Schwartz et al. (2020) used fMRI data during a transitive reasoning task to predict math calculation skills 1.5 years later. Ullman et al. (2015) showed that math and working memory scores could be predicted at ages 5 and 7 from neonatal dMRI (FA), but not from sMRI data. Therefore, studies show that numeracy skills may be predicted from brain activity associated with domain-general processing, consistent with the role of these processes in math learning (Raghubar et al., 2010).
We found only one longitudinal neuroimaging study that focused on the classification of dyscalculia as depicted in Figure 1. Kuhl et al. (2021) distinguished children who would develop dyscalculia at 7-9 years of age from typically developing (TD) children based on dMRI and rest-fMRI data acquired before formal education (at 3-6 years of age). Overall, similar to language abilities, studies show that neuroimaging data may have the potential to predict numeracy skills and classify their disorders.
Note that some longitudinal studies do not neatly fall into the categories described in Figure 1. For example, Qin et al. (2014) used differences between addition task-fMRI data from two time points (T1 and T2, 1.2 years later) to predict improvements in the frequency of retrieval strategy for addition problem solving. Iuculano et al. (2015) showed that task-fMRI data (mental addition) can discriminate between children with and without dyscalculia before (but not after) 8 weeks of a tutoring program involving conceptual instruction and speeded arithmetic fact retrieval training. Michels et al. (2018) also reported similar results based on 5 weeks of mental number line training. These studies represent different ways to combine machine learning with neuroimaging data to explain differences in numeracy skills.

CAN NEUROIMAGING STUDIES PREDICT OTHER SKILLS RELEVANT TO ACADEMIC ACHIEVEMENT?
In our review of studies above, we exclusively focused on studies that have examined literacy and numeracy skills. However, studies have also tested whether neuroimaging may predict other skills that are relevant to academic achievement. This is notably the case for vocal communication. For example, Abrams et al. (2016) used task-fMRI data from 10-year-old children listening to their mother's voice to predict children's communication scores. This is also the case for affective traits related to academic achievement, particularly numeracy skills. Young et al. (2012), for example, classified children into high and low math anxiety groups using task-fMRI (addition and subtraction). Chen et al. (2018) predicted individual differences in positive attitudes toward mathematics using right hippocampal activity during an addition task. Supekar et al. (2015) showed that activity changes in task-fMRI during an addition task can predict changes in children's math anxiety elicited by the same tutoring program. Finally, studies have attempted to use brain information to enhance the diagnosis of autism spectrum disorder (ASD) and attention-deficit/hyperactivity disorder (Eslami et al., 2020; Nogay & Adeli, 2020), both of which can have an impact on academic achievement (Arnold et al., 2020; Whitby & Mancil, 2009). Iuculano et al. (2014) notably used task-fMRI data (mental addition) to classify between ASD and TD children, suggesting a potential relationship between the autistic trait and numeracy skills. While these developmental disorders are beyond the scope of this paper, they are important targets that cannot be ignored when considering the overall application of neuroimaging and machine learning to education.
In addition to predicting literacy and numeracy skills, studies have also used brain imaging data to predict academic achievement more generally. For example, Wang et al. (2019) predicted students' academic achievement at ages 17-20 years using sMRI data. Rasheed et al. (2021) predicted academic achievement (math and language test scores) of school children 4 years later using EEG data. Maglanoc et al. (2020) used a large sample of rest-fMRI data from the UK Biobank to predict educational attainment (based on the qualification variables, e.g., university degree). Studies have also investigated to what extent domain-general skills contributing to academic achievement may be predicted using neuroimaging, including working memory, attention, and intelligence. For example, Ullman et al. (2014) used sMRI and task-fMRI during a visuospatial working memory task to predict children's working memory capacity 2 years later. There are also a large number of studies on the prediction of intelligence quotient scores from brain data (see Vieira et al. (2022) for a recent systematic review). For example, Greene et al. (2018) used both rest- and task-fMRI data with working memory and emotion identification tasks and found that task-fMRI models outperformed rest-fMRI models in predicting fluid intelligence scores. Therefore, a number of studies provide evidence that neuroimaging may predict general cognitive functioning, though this may not be as relevant as the prediction of specific academic skills such as reading or math for the purpose of identifying children with specific learning difficulties.

ARE THERE ANY SPECIFIC BRAIN CIRCUITS SUPPORTING PREDICTION OF ACADEMIC OUTCOMES?
The studies reviewed here are important not only for practical reasons (i.e., to predict outcomes), but also for understanding the brain mechanisms supporting literacy and numeracy acquisition. Tables 1-4 report the main brain regions that have been identified in the specific studies. Some consistency can be seen across studies. For example, studies that have used MRI data to classify participants with and without dyslexia have often identified the left fusiform gyrus (FG) (Skeide et al., 2016; Tamboer et al., 2016; Yu et al., 2022; Zahia et al., 2020) and the left superior temporal gyrus (STG) (Joshi et al., 2023; Płoński et al., 2017; Usman et al., 2021; Zahia et al., 2020) as potential neuromarkers of the condition (see Supplementary Table S1 for a list of studies only focusing on dyslexia). Studies that have used MRI data to classify participants with and without dyscalculia have instead often identified the right intraparietal sulcus (IPS) (Dinkel et al., 2013; Jolles et al., 2016; Kuhl et al., 2021) (see Supplementary Table S2 for a list of studies only focusing on dyscalculia). Although the number of studies remains too limited to quantify the consistency of these findings in a meta-analysis, these findings suggest that these specific brain circuits may be important for academic learning and be the target of future studies.
However, as can also be seen from the tables, the brain systems identified across studies are wide and span the frontal, temporal, parietal, and occipital cortices, as well as subcortical areas. To some extent, this variability is expected given the different domains (e.g., literacy vs. numeracy), brain measures (e.g., EEG, fMRI, sMRI), and tasks (e.g., addition vs. reasoning) explored across studies. Another factor contributing to such variance may be the use of different tests to estimate math and reading scores, and inconsistent definitions of conditions such as dyscalculia and dyslexia. For example, while some studies (e.g., Jolles et al., 2016) considered children with dyscalculia as scoring at or below the 25th percentile on standardized math tests, others (e.g., Dinkel et al., 2013) have used more stringent criteria and focused on children scoring at or below the 10th percentile. In other words, variability in findings is expected given the wide variability in methods across studies. In what follows, we will argue that some critical differences in both machine-learning algorithms and cross-validation methods used across studies might also underlie some of this variability.
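To make the effect of the cutoff concrete, the sketch below labels the same set of scores under a 25th-percentile and a 10th-percentile criterion. It is a toy illustration under stated assumptions: the scores are hypothetical, and percentiles are computed within the sample for simplicity, whereas the studies cited above rely on norm-referenced standardized tests.

```python
def at_or_below_percentile(scores, percentile):
    """Flag scores whose within-sample percentile rank is at or below the cutoff.

    The percentile rank of a score is taken here as the percentage of
    scores in the sample that are less than or equal to it (an assumption;
    standardized tests use external norms instead).
    """
    n = len(scores)
    flags = []
    for s in scores:
        rank = 100.0 * sum(1 for t in scores if t <= s) / n
        flags.append(rank <= percentile)
    return flags

# Hypothetical math scores for 100 children (one distinct score each).
scores = list(range(1, 101))

lenient = at_or_below_percentile(scores, 25)  # 25th-percentile criterion
strict = at_or_below_percentile(scores, 10)   # 10th-percentile criterion

print(sum(lenient), "children labeled under the 25th-percentile cutoff")
print(sum(strict), "children labeled under the 10th-percentile cutoff")
```

In this toy sample, 15 of the 25 children labeled under the lenient criterion would be treated as controls under the strict one, so a classifier trained under one definition is not learning the same distinction as one trained under the other.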

STUDIES USE A RANGE OF MACHINE-LEARNING METHODS
As shown in Figure 1, neuroimaging studies predicting academic outcomes can be classified as belonging to one of the four categories. However, studies largely differ with respect to the specific machine-learning algorithms they rely on to predict behavior, which is the first important source of variability in the literature. Many classification studies have used linear support vector machine (SVM) (Tables 1-4). Briefly, SVM is a supervised classification algorithm that constructs a set of hyperplanes separating given classes in a high dimensional space, so as to maximize the distance between the nearest data points of any class (Cortes & Vapnik, 1995). The SVM, which is implemented in several decoding toolboxes as a default method (e.g., The Decoding Toolbox; Hebart et al., 2014), is useful for classifying among different groups, such as children with learning disability versus controls. However, studies have also used other techniques, such as logistic regressions (Cui et al., 2016), decision trees (Torres-Ramos et al., 2020), random forest (RF) (Nemmi et al., 2023), naïve Bayes classifiers (NBCs) (Formoso et al., 2021), discriminant analysis (DA) (Bach et al., 2013), k-nearest neighbors (kNN) (Ventura-Campos et al., 2022), and artificial neural networks (ANNs) (Tomaz Da Silva et al., 2021).
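For readers who prefer a formal statement, the margin-maximization idea behind a linear SVM can be written in its standard soft-margin form. The notation is introduced here for illustration and is not taken from the reviewed studies: x_i is the brain feature vector of participant i, y_i in {-1, +1} is the group label, w and b define the separating hyperplane, the slack variables xi_i allow some participants to fall on the wrong side of the margin, and C is a user-chosen penalty.

```latex
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert_2^2 \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad
  & y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \;\ge\; 1 - \xi_i,
  \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

Because the margin width equals 2 / ||w||, minimizing ||w|| maximizes the distance between the hyperplane and the nearest data points of each class. A left-out participant with features x is then assigned the label sign(w^T x + b).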
A number of different machine-learning methods have also been used in regression studies, though there is more homogeneity among these studies than among classification studies. For instance, some studies have used linear regression, while others have used support vector regression (SVR) (He et al., 2013), relevance vector regression (Yuan et al., 2023), kernel ridge regression (Schwartz et al., 2020), and elastic net (Beyer et al., 2022). Simple or multiple linear regression requires a reduction of input data into a limited number of variables, which has been achieved by focusing on predetermined regions of interest (Hoeft et al., 2007; Supekar et al., 2013) or connectivity among them (Chang et al., 2022). However, inclusion of too many parameters can cause models to overfit training data that contain a non-negligible amount of noise, resulting in reduced generalizability to test data (Bishop, 2006). Elastic net and other regularized regression methods implement constraints on the model weight values to minimize overfitting to the training data and are appropriate for high-dimensional brain data. More recently, connectome-based predictive modeling (CPM) based on linear regression has been adopted for the analysis of brain-behavior association (Shen et al., 2017). For example, researchers have used this technique to analyze the HCP dataset, which includes a large number of subjects (Kristanto et al., 2020; Tomasi & Volkow, 2020).
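The constraint on weight values imposed by regularized regression can be made explicit with the elastic net objective, given here in one common parameterization. The symbols are chosen for illustration rather than taken from the reviewed studies: X is the n-by-p matrix of brain features, y the vector of behavioral scores, w the weight vector, lambda the overall penalty strength, and alpha the mixing parameter between the two penalties.

```latex
\hat{\mathbf{w}} \;=\; \arg\min_{\mathbf{w}} \;
  \frac{1}{2n}\,\lVert \mathbf{y} - X\mathbf{w} \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert \mathbf{w} \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \mathbf{w} \rVert_2^2 \right),
  \qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Setting alpha = 1 gives LASSO regression, which drives some weights exactly to zero, and alpha = 0 gives ridge regression, which shrinks all weights toward zero without eliminating them. With lambda = 0 the penalty vanishes and the problem reduces to ordinary least squares, which is the setting most prone to overfitting when the number of brain features p is large relative to the number of participants n.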
Although the use of different algorithms is in itself not problematic, it may become so when no justification is given for using one method instead of another. This is unfortunately often the case in the literature. This methodological flexibility increases the researcher degrees of freedom and makes it difficult to parse out exploratory from confirmatory findings, especially given an absence of preregistration across studies (Poldrack et al., 2017). There is also a need for more direct comparison between methodologies. For instance, Płoński et al. (2017) tested SVM, logistic regression, and RF for the same dataset, and reported that logistic regression showed the highest classification accuracy for dyslexia. Furthermore, Ventura-Campos et al. (2022) compared 13 different classification methods and reported that flexible discriminant analysis outperformed other methods. This type of systematic approach can ensure the robustness of results independent of the analysis method. However, this also requires researchers to systematically adopt the most robust methods, which might not always be the case. For example, a meta-analysis on machine-learning application for disease prediction reported that SVM is the most frequently used algorithm in the literature, while RF shows superior accuracy (Uddin et al., 2019). By comparing six regression methods, Cui and Gong (2018) reported that least absolute shrinkage and selection operator (LASSO) regression was worse than the other algorithms when using FC of rest-fMRI data, while ordinary least-square regression was worse when using the sum of FC from each brain region, suggesting that performance of different algorithms also depends on preprocessing methods of the same brain data. To our knowledge, it remains unclear which method is more effective for predicting academic achievement.
Another source of variability in machine-learning methods is the cross-validation (CV) method employed (e.g., split-half, 10-fold, leave-one-out). CV is a widely known method in machine learning to iteratively split some data into training and test samples, testing the model generalizability while minimizing selection bias. In the case of k-fold CV, 1/k of the original data are selected as test samples in each iteration and this procedure covers all original data with k iterations. In contrast, leave-one-out CV (LOOCV) uses each individual data point (e.g., subject) as a test sample and iterates across all data. Among the studies included in the current review, LOOCV was the most widely adopted (23 studies), while other studies used various types of k-fold CV methods (10-fold CV, 8 studies; 4-fold CV, 6 studies). Recent studies suggested that the repeated random splits method is more reliable than the leave-one-out method (Valente et al., 2021; Varoquaux et al., 2017). In this method, CV based on different random sample splitting is repeated multiple times (e.g., 100 times) and the results are averaged; 10 studies adopted this technique (Beyer et al., 2022; Nemmi et al., 2023). Overall, there is wide variability in the machine-learning techniques used in neuroimaging studies, in terms of both algorithm selection and CV method. Both of these may have substantial influence on the model performance. This calls for a standardization in the field, and future research would require careful consideration of methodological choices.
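The three splitting schemes described above can be sketched in a few lines. This is an illustrative implementation using only the Python standard library, not the procedure of any reviewed study; the function names, the 20% test fraction, and the sample size of 20 are assumptions chosen for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k roughly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def leave_one_out_splits(n_samples):
    """LOOCV: each sample (e.g., subject) serves once as the only test sample."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]

def repeated_random_splits(n_samples, test_fraction, n_repeats, seed=0):
    """Draw a fresh random train/test partition on every repetition."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[n_test:], order[:n_test]

n_subjects = 20  # hypothetical sample size
kfold = list(k_fold_splits(n_subjects, k=10))
loo = list(leave_one_out_splits(n_subjects))
repeated = list(repeated_random_splits(n_subjects, test_fraction=0.2, n_repeats=100))

print(len(kfold), "k-fold iterations,", len(loo), "LOOCV iterations,",
      len(repeated), "random splits")
```

In each scheme, a model would be fitted on the training indices and scored on the test indices, and the scores averaged across iterations. LOOCV is the special case of k-fold CV in which k equals the number of samples, whereas repeated random splits allow the same subject to appear in the test set on several repetitions.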

LIMITATIONS AND FUTURE DIRECTIONS
As reviewed here, an increasing number of neuroimaging studies suggest that brain data can be used to predict individual differences in both literacy and numeracy skills, as well as other skills relevant for academic achievement. However, several limitations are apparent in the literature.
First, the majority of articles reviewed here have used sMRI or rest-fMRI data (Tables 1-4). Although some studies have used task-fMRI data, their sample sizes were generally smaller than those of sMRI and rest-fMRI studies. However, task-fMRI data can contribute to more accurate prediction of individual differences in academic achievement. For example, recent studies have reported the superiority of movie-watching task-fMRI data over rest-fMRI data in predicting various cognitive and emotional traits (Finn & Bandettini, 2021; Greene et al., 2018). Combining multiple task-fMRI datasets may further increase prediction performance (Hammer et al., 2015). Moreover, by targeting appropriate ROIs, task-fMRI can shed light on the heterogeneous profiles of children with dyscalculia or dyslexia, who might have specific difficulties in some cognitive skills (such as phonological or visual attentional deficits in the case of dyslexia) (Jednoróg et al., 2014; van Ermingen-Marbach et al., 2013).
Second, the literature is largely dominated by MRI data, and relatively few predictive studies have used EEG, MEG, or fNIRS. For instance, to the best of our knowledge, Dimitriadis et al. (2018) was the only example of using MEG data to predict language disorders. Lei et al. (2020) was likewise the only example of using fNIRS data to predict second language proficiency. The wide usage of MRI data might be due to its advantage in spatial resolution compared with the other methods. Considering their portability, however, EEG, fNIRS, and optically pumped magnetometer (OPM)-MEG (Boto et al., 2018; Brookes et al., 2022), as well as portable MRI (Liu et al., 2021), are interesting because they are more accessible for experimentation in schools and clinical practices than conventional MRI (Stangl et al., 2023). Given that early detection of potential learning disabilities is an important goal of several neuroimaging studies discussed here (Hoeft et al., 2007), efforts should be made to evaluate the potential of task-related portable neuroimaging data for predicting outcomes in children.
Third, most previous studies recruited subjects who were already exposed to formal education. However, predicting outcomes from neuroimaging data may be most interesting before potential difficulties occur at the behavioral level. That is, brain data might help detect a risk for learning disabilities before children begin formal education, which may help ensure that children receive appropriate educational support at the earliest stage. To our knowledge, four studies in the literacy domain (Beyer et al., 2022; Skeide et al., 2016; Yu et al., 2020; Zare et al., 2016) and two studies in the numeracy domain (Kuhl et al., 2021; Ullman et al., 2015) tested children before the onset of formal education. Most of these studies used either sMRI or rest-fMRI, and only one study used task-fMRI data (Yu et al., 2020). The relative lack of studies might reflect the inherent difficulty of pediatric MRI with young children. Again, this calls for the use of more child-friendly portable measurement techniques to inform the prediction of future academic outcomes.
Fourth, there is still room for the integration of sophisticated machine-learning methods. Although linear regression and SVM are the two most widely used techniques in previous studies, some recent studies have adopted ANNs (Joshi et al., 2023; Tomaz Da Silva et al., 2021; Zahia et al., 2020). An ANN is a computational model inspired by biological neural networks (BNNs). It consists of multiple layers of neuronal units, where the weighted sum of units in one layer is used as input for the next layer after a nonlinear transformation. One advantage of using ANNs is that one can compare ANNs and BNNs in terms of their representations across different layers/regions (Goldstein et al., 2022; Nakai & Nishimoto, 2023; Schrimpf et al., 2021). However, it remains unclear which ANN model is the most appropriate to explain developmental changes in brain representations and differences between those with and those without learning disabilities. Cross-validation techniques might also be improved. Although the large majority of studies use left-out sample predictions, this is not the only method for brain-based classification or regression. Siegelman et al. (2021), for example, recently proposed a Bayesian latent-mixture model framework to classify children with and without dyslexia. This framework does not need left-out samples because it constructs classification models using only neuroimaging data, without any categorical labels. In other words, it interprets the fit between the model's classification and the categorical labels as an estimate of its explanatory power. In a different approach, Astle et al. (2019) used an unsupervised self-organizing map to classify children into four groups (typically developing, broad cognitive deficits in both language and mathematics, working memory problems, and phonological difficulties). These alternative approaches can shed light on the search for more effective methods for predicting academic achievement.
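The layered computation described above for ANNs (a weighted sum of the units in one layer, passed through a nonlinear transformation to form the input of the next layer) can be illustrated with a minimal forward pass. This is a sketch under stated assumptions and does not correspond to the architecture of any study cited here: the tanh nonlinearity, the bias terms, and the toy weights are all choices made for the example.

```python
import math

def dense_layer(inputs, weights, biases):
    """One layer: weighted sum of the previous layer's units, then tanh."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        # Bias term is an assumption of this sketch, not part of the text above.
        weighted_sum = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(math.tanh(weighted_sum))
    return outputs

def forward(inputs, layers):
    """Return the activations of every layer, input included."""
    activations = [inputs]
    for weights, biases in layers:
        activations.append(dense_layer(activations[-1], weights, biases))
    return activations

# Toy network (hypothetical weights): 3 input features -> 2 hidden units -> 1 output.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
for depth, layer_activations in enumerate(forward([0.2, 0.4, 0.6], toy_layers)):
    print(depth, layer_activations)
```

Returning the activations of every layer, rather than only the final output, reflects the point made above: it is these intermediate representations that can be compared with representations in different brain regions.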
Fifth, a critical step for any neuroimaging study using machine learning is feature selection. As is clear from our survey of the literature, many studies have relied on the selection of specific regions-of-interest (ROIs) as features to construct machine-learning models (see Tables 1-4). A well-known issue with ROI analyses in neuroimaging studies is that the way ROIs are selected might bias the outcome of the analyses. For instance, selecting ROIs based on data that are nonindependent from the effect tested might lead to inflated effect sizes, an issue known as circular analysis (Kriegeskorte et al., 2009). Several neuroimaging studies reviewed here (i.e., 15 out of 30 ROI-based studies) have selected ROIs based on the same dataset that was used for their machine-learning analyses. This may inflate decoding accuracy and result in a lack of generalizability of decoding models, even if ROIs are selected using univariate analyses and subsequently tested with multivariate analyses. The use of nonindependent ROIs may further be inconsistent with the assumption of left-out sample prediction because the test samples have already been used for feature selection during model training. Therefore, studies using nonindependent ROIs could be considered as confirmatory, much like those that use in-sample correlations between two datasets (Dumontheil & Klingberg, 2012). Other feature selection methods may be used to circumvent this circularity issue. For example, some have interpreted contributing voxels based on nonzero decoding model weight values (Cui et al., 2018; Hoeft et al., 2011) or based on nested cross-validation (Cui et al., 2016). Although caution is needed in interpreting weight values (Haufe et al., 2014), both approaches can minimize bias in identifying contributing brain regions. We believe that an interesting approach to avoid circularity issues in feature selection is searchlight decoding analysis (He et al., 2013; Kuhl et al., 2021). This whole-brain analysis constructs a decoding model using voxels included in spheres centered around each cortical voxel. This makes it possible to identify brain regions in which multivoxel patterns are sensitive to the difference between conditions or subject groups (Kriegeskorte et al., 2006).
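The circularity issue discussed above can be illustrated with a small simulation on pure noise. This is a sketch under stated assumptions, not a reanalysis of any reviewed study: the data are random numbers with arbitrary group labels, the feature-selection rule (largest absolute difference in group means) and the nearest-centroid classifier are simple stand-ins chosen for the example, and all function names are hypothetical. Because the data contain no true group difference, any accuracy well above 50% reflects selection bias.

```python
import random

def make_noise_data(n_subjects=40, n_features=500, seed=0):
    """Pure-noise 'brain features' with balanced, arbitrary group labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_subjects)]
    y = [i % 2 for i in range(n_subjects)]
    return X, y

def class_means(X, y, rows, features):
    """Mean of each listed feature within each group, using only `rows`."""
    means = {}
    for label in (0, 1):
        members = [r for r in rows if y[r] == label]
        means[label] = [sum(X[r][f] for r in members) / len(members) for f in features]
    return means

def select_features(X, y, rows, n_keep):
    """Keep the features with the largest absolute group-mean difference."""
    all_features = range(len(X[0]))
    means = class_means(X, y, rows, all_features)
    gap = [abs(a - b) for a, b in zip(means[0], means[1])]
    return sorted(all_features, key=lambda f: gap[f], reverse=True)[:n_keep]

def predict(X, y, train_rows, test_row, features):
    """Nearest-centroid classifier restricted to the selected features."""
    means = class_means(X, y, train_rows, features)
    def dist(label):
        return sum((X[test_row][f] - m) ** 2 for f, m in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1

def loocv_accuracy(X, y, n_keep=10, circular=False):
    """LOOCV accuracy; `circular=True` selects features once on ALL subjects."""
    rows = list(range(len(X)))
    fixed = select_features(X, y, rows, n_keep) if circular else None
    correct = 0
    for test_row in rows:
        train_rows = [r for r in rows if r != test_row]
        # Non-circular: the left-out subject never influences feature selection.
        features = fixed if circular else select_features(X, y, train_rows, n_keep)
        correct += predict(X, y, train_rows, test_row, features) == y[test_row]
    return correct / len(rows)

X, y = make_noise_data()
print("selection on all subjects (circular):", loocv_accuracy(X, y, circular=True))
print("selection inside each training fold:", loocv_accuracy(X, y, circular=False))
```

The only difference between the two calls is whether the left-out subject contributes to feature selection. Repeating selection inside every training fold is the basic idea behind the nested cross-validation mentioned above.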
Sixth, because we attempted to provide a comprehensive review of the literature, several studies discussed here rely on relatively small sample sizes (see Tables 1-4). It is now acknowledged that small sample sizes can lead to a significant lack of reliability in neuroimaging data (Button et al., 2013). Therefore, conclusions from these studies must be considered with caution. Indeed, prediction accuracy can vary widely with sample size. For example, Tamboer et al. (2016) classified dyslexia with 80.0% accuracy in a relatively small group of participants (N = 49), while they obtained 59.0% accuracy in a second group with a much larger sample size (N = 876). In the case of classification between participants with a learning disability (dyslexia or dyscalculia) and typically developing participants, no study with a large sample size (N > 100) achieved more than 80% accuracy (Supplementary Tables S1 and S2). Usman et al. (2021) did report 94.7% accuracy with N = 148, but this study classified MRI image patches and did not directly classify the original brain data. Overall, this suggests that a machine-learning model with a classification accuracy of 80%, even if significantly above chance level, would lead to misdiagnosis in one subject out of five. This accuracy is relatively low for real-world applications, which should aim for highly accurate predictions rather than mere statistical significance.
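The arithmetic behind these figures can be made explicit with a short sketch. The first function simply converts an accuracy into an expected number of misclassified cases. The second computes an approximate 95% Wilson score interval for an observed accuracy, under the simplifying assumptions (ours, for illustration only) that every participant is classified exactly once and that classifications are independent; cross-validated predictions are not strictly independent, so the interval should be read as a rough indication of sampling uncertainty.

```python
import math

def expected_errors(accuracy, n_cases):
    """Expected number of misclassified cases at a given accuracy."""
    return (1.0 - accuracy) * n_cases

def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion observed on n cases."""
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 80% accuracy: on average one misclassification per five cases.
print(expected_errors(0.80, 5))

# Accuracies and sample sizes reported by Tamboer et al. (2016), as quoted above.
for accuracy, n in [(0.80, 49), (0.59, 876)]:
    low, high = wilson_interval(accuracy, n)
    print(f"N = {n}: accuracy {accuracy:.2f}, interval {low:.3f} to {high:.3f}")
```

Under these assumptions, the interval around the accuracy obtained with N = 49 is considerably wider than the one obtained with N = 876, which is one way to see why accuracies from small samples must be interpreted with caution.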
Finally, as is generally the case in neuroimaging research, openly sharing data will be fundamental to improving models that predict academic outcomes from brain data. Building reliable predictive models requires a large amount of data (Varoquaux, 2018). Eight studies constructed predictive models of literacy skills (Table 1) using openly shared datasets. In addition to the neuroimaging data published in the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018) or in UK Biobank (Littlejohns et al., 2020), researchers have published a series of open task-fMRI datasets of school children (Lytle et al., 2019, 2020; Suárez-Pellicioni et al., 2019; Wang et al., 2022). Such large neuroimaging datasets will be beneficial for future developments in predicting academic performance using machine learning. In addition, accelerating the sharing of open data and code would enable comparison of prediction accuracy across studies and may reduce inconsistencies between them.

ARE WE GETTING CLOSER TO REAL-WORLD APPLICABILITY?
In their review, Gabrieli et al. (2015) highlighted a number of challenges that would have to be met by neuroimaging studies predicting skills to have some real-world applicability, either in the classroom or in a clinical context. These notably included the reliability and representativeness of the findings, the added value compared with behavioral indicators, the economic cost, as well as the ethical and societal issues these methods may raise. We revisit here these challenges 9 years after Gabrieli et al. (2015).
The section above already outlines the critical limitations and challenges in the body of literature. On the one hand, the relative lack of consistency in methodology, experimental designs, and findings shows that there is much room for improvement for studies aiming to translate their findings to the real world. On the other hand, the literature has significantly expanded over the past 10 years. Although initial studies largely focused on literacy skills, investigation of academic skills has now largely expanded to numeracy. In comparison with earlier ones, studies have also now started to focus on long-term outcomes, sometimes over the course of several years (e.g., see Kuhl et al. (2021) for long-term prediction of dyscalculia). This is critical if neuroimaging is to be considered a tool for enhancing the detection of future learning difficulties before they occur (Raschle et al., 2012). Finally, recent technical advances in machine learning, as well as the availability of large-scale neuroimaging data, might accelerate practical applications. For example, ANNs with a large number of layers were not available 25 years ago (Liu et al., 2022). The development of machine-learning toolboxes such as scikit-learn (Abraham et al., 2014) has also reduced the barriers to attempting prediction analyses using neuroimaging data.
For neuroimaging measures to be useful indicators for clinical practice or in the classroom, they would of course need to add some explanatory power to the prediction of future academic skills that can already be gathered from behavioral assessments alone. Some studies suggest that a combination of behavioral and brain-based measures may outperform either behavioral or neuroimaging measures alone when predicting academic skills (Beyer et al., 2022; Hoeft et al., 2007), though most studies still lack a systematic comparison of prediction based on neuroimaging and behavior.
Most studies reviewed here have used MRI to predict academic achievement. Some common criticisms of MRI include its cost and accessibility, as well as the fact that pediatric MRI is relatively challenging. As also pointed out by Gabrieli et al. (2015), it would be important for any financial analysis to account for current practices, which may be costly and less effective as they are often targeted at children who are already failing school. Even though MRI may not be used in the population at large, some studies do suggest that early MRI measures may be useful for some targeted populations, for example, for children of parents with learning disabilities. Indeed, a large body of evidence indicates that such children are at greater risk of developing the disability than their peers. Brain-based measures, together with behavioral assessments, may thus enhance the early detection of at-risk children (Beyer et al., 2022; Kuhl et al., 2021). Another path for reducing the economic cost associated with collecting brain-based measures is a greater reliance on portable and wearable neuroimaging devices, such as wireless EEG or fNIRS. Critically, these methods have been increasingly used over the past 10 years, with several studies showing their applicability for collecting brain data in uncontrolled environments such as classrooms (Davidesco et al., 2021). The field is now ripe for testing how these techniques may be combined with machine learning to predict academic outcomes and how they compare with MRI measures.
Finally, any use of neuroimaging measures to predict aspects of academic achievement would have to take into consideration ethical and societal issues. Though behavioral measures such as intelligence quotient (IQ) have long been used to predict academic achievement (Chamorro-Premuzic & Furnham, 2008), studies have shown that brain-based measures may have a special status in the public eye and be easily misinterpreted (Racine et al., 2005). For example, there is evidence suggesting that people often perceive scientific claims as more credible when they include references to the brain or neuroscientific information (Weisberg et al., 2008), which suggests that people might give more weight to brain-based than behavioral indicators. Another critical aspect of the findings reviewed here is that they may raise ethical questions about whether they could be used to merely identify those with the highest likelihood of success instead of identifying individuals who are at risk and would need help. Although a discussion of these ethical and societal issues is beyond the scope of the present review, it is clear that they need to be considered by researchers, clinicians, educators, parents, students, and policy makers.

CONCLUSION
Nine years after the review of Gabrieli et al. (2015), the number of studies using machine learning to predict educational achievement and learning disabilities from brain activity has grown exponentially, particularly in the domains of literacy and numeracy. However, we found in this updated review considerable variation in algorithms and underlying brain circuits between studies. Studies also largely rely on relatively small samples and suboptimal models. We argue that the field needs a standardization of methods, as well as a greater use of accessible and portable neuroimaging methods that have more applicability potential than lab-based neuroimaging techniques.
classified dyslexic from nondyslexic children using a visual detection task, whereas Tomaz Da Silva et al. (2021) used a word-reading task. Finally, Zahia et al. (2020) used three different reading tasks to classify children with dyslexia, monocular vision (due to ocular motility disorders), and control groups.

Table 1. Cross-sectional prediction studies for literacy.

Table 2. Longitudinal prediction studies for literacy.

Table 3. Cross-sectional prediction studies for numeracy.

Table 4. Longitudinal prediction studies for numeracy.