Predicting the pattern and severity of chronic post-stroke language deficits from functionally-partitioned structural lesions

There is an ever-increasing wealth of knowledge arising from basic cognitive and clinical neuroscience on how speech and language capabilities are organised in the brain. It is, therefore, timely to use this accumulated knowledge and expertise to address critical research challenges, including the ability to predict the pattern and level of language deficits found in aphasic patients (a third of all stroke cases). Previous studies have mainly focused on discriminating between broad aphasia dichotomies from purely anatomically-defined lesion information. In the current study, we developed and assessed a novel approach in which core language areas were mapped using principal component analysis in combination with correlational lesion mapping and the resultant ‘functionally-partitioned’ lesion maps were used to predict a battery of 21 individual test scores as well as aphasia subtype for 70 patients with chronic post-stroke aphasia. Specifically, we used lesion information to predict behavioural scores in regression models (cross-validated using 5-folds). The winning model was identified through the adjusted R2 (model fit to data) and performance in predicting holdout folds (generalisation to new cases). We also used logistic regression to predict fluent/non-fluent status and aphasia subtype. Functionally-partitioned models generally outperformed other models at predicting individual tests, fluency status and aphasia subtype.


Introduction
Left hemisphere stroke often results in disrupted speech and language processes (aphasia). Under the single umbrella term of 'aphasia' there are considerable variations in patients' language and cognitive presentation, in both the pattern and severity of impairment to different language activities (e.g., comprehension, naming, reading, writing, speech, etc.). The consequence of this significant diversity is that individual patients will need very different types of intervention and clinical management (e.g., patients with primary comprehension or phonological deficits). By utilising fMRI in healthy participants (Price, 2010(Price, , 2012 and voxel-lesion symptom mapping (Bates et al., 2003) in aphasic patients, cognitive and clinical neuroscience has made considerable strides in mapping language performance and the underpinning cognitive mechanisms to different brain regions. Despite being a crucial step for clinical application, the reverse mappingusing neuroimaging results to predict individual aphasic profiles or typeshas only been attempted by a limited number of studies. The key aim of this investigation, therefore, was to embark on using new methods to generate lesion-based models which are able to predict both the detailed language profile of individual patients as well as their aphasia classification. For clarity, in this study we use prediction-based inference to determine how neural data can predict the current behavioural status using a k-fold cross validation approach. Future studies will be able to test whether similar models can offer accurate prediction in the temporal sense (using neural data to predict future behaviour). Indeed, the chronic stroke lesion is apparent long before patients' longterm language and cognitive abilities have stabilised (the partial, gradual recovery that most patients demonstrate extends to at least nine to twelve months post onset). Accordingly, accurate lesion-based prediction models would have considerable clinical utility, including improvements in the type of information that can be offered to patients and carers, enhanced clinical management planning, and appropriate patient stratification to treatment plans.
Studies using neural lesion information to predict behavioural outcomes have yielded inconsistent results. For example, earlier studies reported little advantage of using lesion information in improving predictions (Hand et al., 2006;Johnston et al., 2002;Johnston et al., 2007;Lazar et al., 2008;Willmes and Poeck, 1993). In contrast, more recent studies have found that models, designed to predict a single feature of aphasic performance or aphasia type, can be improved by including lesion information (Hope et al., 2013;Saur et al., 2010;Schiemanck et al., 2006;Thijs et al., 2000;Yourganov et al., 2015). For example, Hope et al. (2013) developed a predictive model using basic demographic information (age, gender, etc.) and structural lesion information obtained from a high-resolution T1-weighted image (lesion size and atlas-based lesions) to predict a composite speech production score (and its constituent individual speech test results), with the winning model containing time post-onset, lesion volume and 35 atlasbased predictors. They showed that the model could predict patients' composite speech production score over the first 200 months poststroke. In addition, the same group used anatomical regions to predict 22 subtests scores of the Comprehensive Aphasia Test (Swinburn et al., 2005) for mono-/bi-lingual patients (Hope et al., 2015). Another study used ridge regression in order to predict behavioural scores across seven domains (left/right motor, language, attention bias, verbal memory and spatial memory) in acute stroke cases (< 2 weeks) (Corbetta et al., 2015) -though, language was identified in a broad sense and thus the study did not allow for predictions of specific language deficits. Other groups have used support vector machines (SVM) trained on atlas-based lesion parcellations to predict six out of ten pairwise binary contrasts between aphasia subtypes at above chance levels (Yourganov et al., 2015). Saur et al. (2010) also used SVMs in order to predict patients' chronic outcome status (a binary classification; good/bad) as well as the type of improvement from the acute to chronic stage (good/bad). They found that age and a composite language recovery score (LRS) achieved above chance classification (62%). It is important to note that this particular study made use of fMRI measurements and showed that the fMRI data within targeted language areas improved prediction accuracy improved significantly (~86%), suggesting that functionally-as well as neuroanatomically-partitioned maps might be critical in improving predictive models. Furthermore, we also know that white matter connectivity (or disconnection) plays an important role in understanding behavioural deficits . A recent study has shown that damage to white matter pathways that converge into a bottleneck, for example in the posterior temporal lobe, are critical in predicting multiple behavioural deficits such as speech fluency, naming and auditory semantic decisions (Griffis et al., 2017).
The present study advances this handful of existing prediction models in two novel and important waysnamely, (a) how patients' lesions are partitioned (before they are used as predictors) and (b) in the nature and detail of what is being predicted. Our approach to both research aims was informed by a new, emerging conceptualisation of the aphasia phenotype and underlying brain systems. There is a longstanding tradition in aphasiology to categorise patients into different aphasia types according to clusters of behavioural deficits (e.g., Broca, Wernicke, conduction, etc.). These classifications provide an approximate descriptive shorthand for communicating and comparing cases across clinics/research institutions, and influencing treatment options (Horn et al., 2005). There is increasing agreement, however, that aphasia classifications have strong limitations because (a) there is considerable variability amongst patients within each category and (b) there are fuzzy boundaries between categories. Indeed, it is often difficult to place patients within a single category, leading to the diagnosis of "mixed aphasia". An alternative approach moves away from categorisation and clustering towards considering each patient as a point in a multidimensional space, where each dimension corresponds to a primary computational-brain system (Butler et al., 2014;Chase, 2014;Halai et al., 2017). In this conceptualisation, each patient's pattern of aphasia reflects a different weighting of the impairments to these primary systems. Likewise, each language activity (e.g., naming, comprehending, repeating, etc.) is not localised to a single brain region but rather reflects the joint action of the underpinning primary systems (Patterson and Lambon Ralph, 1999;Seidenberg and McClelland, 1989;Ueno and Lambon Ralph, 2013;Ueno et al., 2011). A simple analogy is that of the arrangement of different colour hues (cf. patients) across the red, green and blue (RGB) colour space. Whilst it is possible to demarcate and label (cf. categorise) approximate areas in the space as yellow (e.g., Broca), blue (Wernicke), etc., there are in fact many different kinds of each colour and the boundaries between them are fuzzy. Likewise, when presented with individual hues it is not always obvious which colour category they fall into (e.g., teal, maroon, indigo; cf. how to categorise a patient with mixed aphasia). Thus, like aphasia classifications, colour labels provide approximate albeit limited information about the underlying graded differences. This is sufficient to communicate broad distinctions between cases (e.g., blue vs. yellow; Broca vs. Wernicke) but not finer variations (the overlapping variations of orange vs. yellow; conduction vs. Wernicke). An alternative and more precise approach is to represent each hue (patient) in terms of its position along the RGB dimensions (cf. patients' performance in terms of the underlying primary language-cognitive systems).
With sufficient breadth of assessments (to sample the full spectrum of language activities) and patient numbers, it is possible to use statistical approaches such as principal component analysis (PCA) to uncover the underlying dimensions (Lambon Ralph et al., 2002;Lambon Ralph et al., 2003). Recent applications of this approach have not only recovered the same set of orthogonal dimensions (phonology, semantics, executive skills, speech quanta) but have found that each one is associated with damage to discrete brain regions (Butler et al., 2014;Halai et al., 2017). Importantly, for the present study, very similar or identical behavioural dimensions and lesion correlates have been observed across independent studies both in patients with chronic (Lacey et al., 2017;Mirman et al., 2015a;Mirman et al., 2015b) and acute aphasia (Kümmerer et al., 2013), indicating the robustness of these core underlying factors.
The ramifications of this aphasia conceptualisation on generating prediction models are as follows. In terms of prediction targets, the ultimate aim is to predict the full behavioural profile of each patient from the neuroimaging data. Thus, rather than focussing on individual language activities, in this study we predicted each patient's scores across the full range of assessments. Given the strong tradition of using aphasia classifications, we also generated a predictive classification model but rather than focussing on pairwise discrimination between pairs of aphasia types, we required the model to discriminate simultaneously between all major types (thus providing a full albeit coarsecoding of the aphasic multidimensional space). Secondly, in terms of deriving the best predictors for inclusion in these models, we utilised the finding that the core underlying 'primary' dimensions (phonology, semantics, etc.) have been associated with discrete lesion correlates. As such, one might expect the status of each of these key regions to be a strong predictor of the patients' performance across the full range of tests. Accordingly, each patient's lesion was functionally-partitioned according to the overlap with these primary language regions and the resultant four component model was used to predict each patient's individual test scores as well as aphasia classification.

Participants
Seventy post-stroke patients (53 males, mean age ± standard deviation [SD] = 65.21 ± 11.70 years) were recruited in the chronic stage (minimum 12 months post onset; mean = 56.6, SD = 50.17 months). A subset of cases (31/70) was the same as reported in two previous studies (Butler et al., 2014;Halai et al., 2017). The mean years in education was 12.11 (SD = 2.20). All cases were diagnosed with aphasia (using the Boston Diagnostic Aphasia Examination, BDAE), having difficulty with producing and/or understanding speech. No restrictions were placed according to aphasia type or severity (spanning from global to minimal aphasia). All subjects were right handed (premorbidly) using the Edinburgh Handedness Inventory (Oldfield, 1971), native English speakers and had only one known stroke to the left hemisphere. We excluded cases with damage to the right hemisphere or those that had multiple strokes. Data from 22 healthy age and education-matched controls (10 female, 12 male) were used to determine abnormal regions of the T1 weighted brain scan (see Neuroimaging analyses section for details). All participants gave written informed consent with ethical approval from the local ethics committee.

Neuropsychological assessment (dependent variables)
All participants underwent a large neuropsychological battery of tests to assess a range of language and cognitive abilities (Butler et al., 2014;Halai et al., 2017). These included subtests from the Psycholinguistic Assessments of Language Processing in Aphasia (PALPA) battery (Kay et al., 1992): auditory discrimination using non-word (PALPA 1) and word minimal pairs (PALPA 2); and immediate and delayed repetition of non-words (PALPA 8) and words (PALPA 9). Tests from the 64-item Cambridge Semantic Battery (Bozeat et al., 2000) were included: spoken and written versions of the word-to-picture matching task; Camel and Cactus Test (pictures); and the picture naming test. To increase the sensitivity to mild naming and semantic deficits we used The Boston Naming Test (BNT) (Kaplan et al., 1983) and a written 96-trial synonym judgement test (Jefferies et al., 2009). The spoken sentence comprehension task from the Comprehensive Aphasia Test (CAT) (Swinburn et al., 2005) was used to assess sentential receptive skills. Speech production deficits were assessed by coding responses to the 'Cookie theft' picture in the BDAE, which included tokens (TOK), mean length of utterance (MLU), type/token ratio (TTR) and words-per-minute (WPM) (see Halai et al., 2017 for more details). The additional cognitive tests included forward and backward digit span (Wechsler, 1987), the Brixton Spatial Rule Anticipation Task (Burgess and Shallice, 1997), and Raven's Coloured Progressive Matrices (Raven, 1962). Assessments were conducted with participants over several testing sessions, with the pace and number determined by the participant. The scores reflect the performance of each individual for each test and all scores were converted into percentage; if no maximum score was available for the test we used the raw score.
The neuropsychological measures were entered into a PCA with varimax rotation (SPSS 22.0). Factors with an eigenvalue exceeding 1.0 were extracted and then rotated orthogonally. Following varimax rotation, the loadings of each test allowed a clear behavioural interpretation of each factor. Individual participants' scores were obtained using the regression method for each extracted factor and used as target values in the prediction models.

Neuroimaging analyses
Structural MRI scans were pre-processed with Statistical Parametric Mapping software (SPM8: Wellcome Trust Centre for Neuroimaging, http://www.fil.ion.ucl.ac.uk/spm/). The images were normalised into standard Montreal Neurological Institute (MNI) space using a modified unified segmentation-normalisation procedure optimised for focal lesioned brains (Seghier et al., 2008b). Data from all participants with stroke aphasia and all healthy controls were entered into the segmentation-normalisation. This procedure combines segmentation, bias correction and spatial normalisation through the inversion of a single unified model (see Ashburner and Friston, 2005 for more details). In brief, the unified model combines tissue class (with an additional tissue class for abnormal voxels), intensity bias and non-linear warping into the same probabilistic models that are assumed to generate subjectspecific images. The lesion of each patient was automatically identified using an outlier detection algorithm, compared to a group of healthy controls, based on fuzzy clustering. The default parameters were used apart from the lesion definition 'U-threshold', which was set to 0.5 to create a binary lesion image. We modified the U-threshold from 0.3 to 0.5 after comparing the results obtained from a sample of patients to what would be nominated as lesioned tissue by an expert neurologist. The images generated for each patient were individually checked and visually inspected with respect to the original scan, and were used to create the lesion overlap map in Fig. 1 (2 mm 3 MNI voxel size). The binary lesion mask was then used to determine the predictor variables (see below). We selected the Seghier et al. (2008b) method as it is objective and efficient for a large sample of patients (Wilke et al., 2011), in comparison to a labour intensive hand-traced lesion mask. The method has been shown to have a DICE overlap > 0.64 with manual segmentation of the lesion and > 0.7 with a simulated 'real' lesion (where real lesions are superimposed onto healthy brains; Seghier et al., 2008b). All images were visually inspected and manually edited if required. We should note here, explicitly, that although commonly referred to as an automated 'lesion' segmentation method, the technique detects areas of unexpected tissue classand, thus, identifies missing grey and white matter but also areas of augmented cerebrospinal fluid (CSF) space.

Predictor variables
We used the automated binary lesion mask for each subject and calculated a percentage overlap with the whole left hemisphere, giving a value of overall left hemisphere damage (referred to as the lesion volume [LV] model). We used the same binary lesion to obtain percentage lesion overlap for three critical discrete principal languagerelated clusters identified using a PCA voxel based correlational methodology; phonology, semantics and speech quanta (clusters taken from Halai et al., 2017), which provided three additional predictor variables (percentage of damage to each cluster based on the individuals' lesion profile) and were used in conjunction with the residual left hemisphere LV variable (referred to as the LV-PCA model). In addition to the neural predictors, we used basic demographic information as predictors including: age at testing (age), years in education (edu) and months post stroke onset (onset).

Prediction analysis
For the following section we used prediction-based inference to determine how well each model performed. To clarify, we used neural information from chronic T1-wieghted images to predict the patients' current chronic behavioural profile and we are not performing any predictions in the temporal sense. In order to make sure our models and predictions remained unbiased we implemented a 5-fold cross-validation procedure throughout the following analyses. Our pipeline was as follows: 1) randomly split the data into 5 folds (N = 14 per fold), 2) perform a PCA on the behavioural data on training set (4 folds), 3) determine neural correlates with PCA factor scores using voxel based correlational methodology (VBCM), 4) obtain predictor variables (as outlined above) for each cluster related to phonology, semantics and speech quanta, 5) create regression models for each neuropsychological test, and 6) use the regression to predict the holdout fold and determine the correlation between predicted data and known targets. The analyses presented in the paper are split into three parts. First, we outline the model fit (using a different combination of predictor variables) to a range of behavioural scores (measured by adjusted R 2 ). The second analysis focused on the prediction accuracy to 'new' cases that were held out during the model building stage (measured by correlation). We identified the winning model by comparing the average model fit and ability to predict holdout cases across models. These evaluation measures were computed for two types of model: the first type consisted of the non-partitioned left-hemisphere lesion (LV) plus the demographic variables; and the second type of model consisted of the four-part functionally-partitioned lesion (LV-PCA predictors) plus the same demographic variables. To evaluate the statistical significance of each model (against chance) we undertook a Monte Carlo analysis. Specifically, to obtain a null distribution for each model of each behavioural test, the dependent variable (test score for each model) was randomised (N = 10,000) and the same model-fit analysis as above was performed, providing a means to test the real data against chance levels (p < 0.05). In addition, we also compared the real models directly using a Wilcoxon test (p < 0.05). In order to determine specific cases that were poorly predicted, we calculated the mean square root residual sum of squares across all test scores (sum(observed-predicted) 2 ) 1/2 . Any cases that were two SD away from the mean group were considered poor predictions.
Finally, a logistic regression analysis was used to determine if the winning version of the non-partitioned LV or functionally-partitioned LV-PCA models could predict aphasia type for all subjects: 1) at a coarse level, by splitting the patients into fluent/non-fluent groups based on BDAE criteria; and 2) the specific BDAE aphasia classification. In order to obtain a null distribution for each model, the dependent variable (fluent/non-fluent or subtype code) was randomised (N = 10,000) and for each iteration a logistic regression analysis was performed. The corresponding percentage correct was recorded for each model. The threshold was set at p < 0.05 to reject the null hypothesis. Table 1 provides demographic details on the cases included in the study (a subset of 31 cases were reported in Butler et al., 2014;Halai et al., 2017) and overall lesion volume (note that the cases are ordered according to their Boston naming test scores). Table 2 provides a summary of the participants' scores on all neuropsychological tests (dependent variables) and is ordered according to patients' scores on the Boston naming test (note that values are displayed as integers but all decimal values used in the analyses). A lesion overlap map for all cases is provided in Fig. 1, and primarily covers the left hemisphere area supplied by the middle cerebral artery (Phan et al., 2005). The maximum number of participants who had a lesion in any one voxel was 56 (−38, −9, 24 central opercular cortex and −20 −12 26 left cortico-spinal tract). Fig. 2 shows the cluster overlap figure for each behavioural component across the 5-folds for cross-validation, which were used on each fold to determine the percentage of lesion overlap per functionally-partitioned region.

Neuropsychological and lesion profiles
In order to evaluate the utility of the models the results are split into three sections. First, we summarise the adjusted R 2 values for each model across behavioural tests. Secondly, we determine which model had the best predictive power by comparing the correlation for holdout folds for each model across behavioural tests. Finally, we show how well the winning models can classify patients based on fluent/nonfluent membership and specific BDAE classification using logistic regression.

Best fit to data (adjusted R 2 )
For a graphical depiction of the adjusted R 2 values for all tests, see Fig. 3. Table 3 shows the mean adjusted R 2 values and highlights significant differences between all model pairings. For simplicity, the Figure only shows the adjusted R 2 values for LV-all and LV-PCA-all as these models were approximately the two best models. Overall, the fit to data for both LV and LV-PCA models (and the variant models with demographic information) was significantly better than chance (p < 0.05) for almost every behavioural test (see Supplementary Table  A1 for scores on each behavioural test, values marked in bold indicate non-significant models). Importantly, the average adjusted R 2 across all tests was significantly higher for the functionally-partitioned LV-PCA model compared to the LV model (0.27 and 0.15, respectively) (Wilcoxon Test: Z = 4.015, p < 0.001). There were four language assessments that there were problematic for some models: A) The LV-age and LV-edu models did not significantly fit to immediate word repetition and Boston naming test scores; B) The LV and LV-ons models did not significantly fit to the Brixton score; and C) The LV, LV-age, LV-edu, LVons, LV-all models did not significantly fit to type/token ratio (although it should be noted that this assessment had the lowest adjusted R 2 values overall for all models). Considering the non-partitioned LV models, the best-fitting (adjusted R 2 ) models when adding demographic variables were LV-ons (0.17) and LV-all (0.22). The LV-all model was significantly better than all other LV models (p's < 0.016) except LV-ons (Wilcoxon Test: Z = 1.616, p = 0.106). The LV-ons was only significantly better than LV (Wilcoxon Test: Z = 2.311, p = 0.021). Considering the functionally-partitioned lesion (LV-PCA) models, the bestfitting (adjusted R 2 ) models when adding demographic variables were LV-PCA-all (0.323) and LV-age (0.311). The LV-PCA-all model was   Table 2 Participant scores on behavioural assessment battery and speech production measures (dependent variables). All scores have been converted into percentage and rounded to an integer, except for TTR (continued on next page) A.D. Halai et al. NeuroImage: Clinical 19 (2018) 1-13 significantly better than LV-PCA, LV-PCA-edu and LV-PCA-ons (p's < 0.03) but not different to LV-PCA-age (Wilcoxon Test: Z = 0.539, p = 0.590). The LV-PCA-age model was not different to LV-PCA-edu but was trending towards significance against LV-PCA and LV-PCA-ons (Wilcoxon Test: Z = 1.894, p = 0.058). In order to compare the relative power of these models further, we summed the number of assessment tests for which each model had the best fit for the four models identified above. The functionally-partitioned models were the best for the vast majority of individual tests: LVons (0/21), LV-all (0/21), LV-PCA-age (13/21) and LV-PCA-all (8/21).

Predictive power (predicted R 2 )
The non-partitioned lesion-only (LV) model had significantly greater correlation values between predicted and observed than chance levels for all bar five measures (PALPA 8 immediate, PALPA 9 immediate, PALPA 9 delayed, TTR, and the Brixton spatial anticipation test). The functionally-partitioned LV-PCA only model had significantly greater correlation values between predicted and observed scores than chance levels for all bar two measures (PALPA 8 immediate and PALPA 9 immediate). It should be noted that most models failed to accuracy predict PALPA 8 immediate, PALPA 9 immediate and TTR (see Supplementary Table A2 for details on all models, values marked in bold reflect non-significant models). Table 3 shows the mean correlation between the predicted and observed values across holdout folds for all models. Overall, the correlations were significantly higher for the LV-PCA model than the LV model (0.41 vs. 0.32, respectively; Wilcoxon Test: Z = 3.146, p = 0.002). We report the correlation values for: LVage, LV-all, LV-PCA-age and LV-PCA-all as these produced the best models. There were a small number of behavioural tests that were not predicted above chance-level based on these four models: the LV-age model failed to generate better than chance predictions for PALPA 9 (delayed) and TTR measures; the LV-all model failed for PALPA 9 (immediate and delayed), PALPA 8 (immediate), forward digit span, and TTR; the LV-PCA-all model failed on PALPA 9 (immediate) and TTR; whilst the LV-PCA-age model did not fail on any measure.
For the non-partitioned lesion (LV) models, the highest mean correlation belonged to LV-age but this was not significantly greater than LV-all (0.385 vs. 0.369, respectively; Wilcoxon Test: Z = 0.782, p = 0.434). For the functionally-partitioned lesion (LV-PCA) models, LV-PCA-age did produce significantly higher values than LV-PCA-all (0.484 vs. 0.478; Wilcoxon Test: Z = 1.999, p < 0.046). Directly comparing the LV-age and LV-PCA-age models, showed the latter model had significantly higher correlations (Wilcoxon Test: Z = 3.736, p < 0.001). Therefore in both model groups, adding age to the neural information produced significantly higher correlation values than neural information alone, as well as all demographic variables combined. Fig. 4 shows the mean correlation values for all tests based on the LV-age and LV-PCA-age models.
We identified two cases for which the predictive model performed poorly (2/70 or 2.86%) and had residual scores more than two SD from the mean group (Case 37 and 40). On closer inspection the model was poor at repetition and picture naming performance for case 37 and speech tokens for case 40. The repetition and picture naming performance for case 37 was floor for all tests, however the lesion profile of the patient suggests that frontal regions related to speech output were disproportionally affected compared to regions related to phonological processes (proportion of neural damage: LV 9.1%, phonology cluster 7.7%, semantic cluster 7% and speech quanta cluster 50.6%). The discrepancy for case 40 was solely based on underestimating the number of tokens produced (predicted 72 vs. 315 observed).
Finally, in order to determine the distribution and significance of    Table 3 Comparing Best model fit (adjusted R 2 ) and predictive power (predictive R 2 ) across all models using a Wilcoxon repeated measures test. Mean values for adjusted and predicted R 2 are given in the second column, whilst significant differences between each pair-wise comparison (p < 0.05) is shown in bold.  Halai et al. NeuroImage: Clinical 19 (2018) 1-13 the betas in the winning model we determined the model fit including all participants across all test for the LV-PCA-age model (see Supplementary Fig. A1).

Aphasia classification
Overall, we deemed the LV-age and LV-PCA-age models to be the winning models in each group of models based on the model fit and predictive capabilities. The following classification analyses were split into two stages: fluent/non-fluent and specific subtypes. The results for the binary classification of fluent vs. non-fluent aphasia were significant for LV-age (80% accuracy, p = 0.043) and LV-PCA-age (88.6% accuracy, p = 0.003) when compared to distribution obtain using permutation tests (the mean chance level was 49.75% and 49.08%, respectively). The difference between the two models was at a trend, in favour of the LV-PCA-age model (Wilcoxon Test: Z = 1.90, p = 0.06). The LV-PCA-age model classified 87.9% (29/33) fluent cases and 89.2% (33/ 37) non-fluent cases correctly, compared to the LV-age model which achieved 84.8% (28/33) and 75.7% (28/37), respectively. The coefficients in the logistic regression for the LV-PCA-age model were as follows: LV (0.048), phonology (−0.044), semantic (−0.013), fluency (−0.096) and age (−0.036), where only fluency was significant (p = 0.003 and remaining betas had p > 0.092).
Secondly, we used a multinomial logistic regression to determine how well each model could classify patients simultaneously into the seven BDAE subtypes. The models were both significantly better than chance determined by permutation testing: LV-age (54.3% accuracy) and LV-PCA-age (68.6%) where the mean chance levels were 36.2% and 32.9%, respectively. The difference between the two models was significant, in favour of the LV-PCA-age model (Wilcoxon Test: Z = 2.36, p = 0.018). The LV-PCA-age model correctly classified 77.8% anomia, 58.3% Broca, 33.33% conduction, 42.9% global, 86.7% mixed non-fluent, 50% TMA and 100% TSA. In comparison the LV-age model correctly classified 85.2% anomia, 8.3% Broca, 16.7% conduction and 86.7% mixed non-fluent cases (remaining aphasia types, all 0%). Overall, the LV-PCA-age model outperformed the LV-age model in predicting fluent/non-fluent status and BDAE classifications.

Discussion
Our understanding of how our speech and language capabilities are organised in the brain has vastly improved over the past decade (reflected not only in a large number of published papers but also in vibrant dedicated, international learned societies such as the Society for the Neurobiology of Language: http://www.neurolang.org/). It is, therefore, both timely and critical to use this accumulated knowledge and expertise to address critical research challenges, including the ability to predict the behavioural deficits experienced after brain damage not only to validate the theoretical models but also to provide improved care and clinical management. The approach taken in this study to tackle these aims was based on a new, emerging conceptualisation of the aphasia phenotype and underlying brain systems (Butler et al., 2014;Chase, 2014;Halai et al., 2017). Specifically, rather than relating each language activity (e.g., repetition, naming, comprehension, etc.) singly to the underlying neural systems, the varying aphasia phenotype (both severity and type) is hypothesised to reflect graded differences in the level of damage to a set of primary neurocognitive systems (Patterson and Lambon Ralph, 1999;Seidenberg and McClelland, 1989;Ueno and Lambon Ralph, 2013;Ueno et al., 2011) and their interaction. By utilising a combination of principal component analysis on a large and detailed behavioural dataset and voxel-symptom lesion mapping, previous studies have shown that (a) the variable aphasia phenotype can be considered in terms of graded differences along a set of statistically-independent (orthogonal) dimensions (e.g., semantics, phonology, speech quanta, executive skill (Butler et al., 2014;Halai et al., 2017;Kümmerer et al., 2013;Lacey et al., 2017;Mirman et al., 2015a;Mirman et al., 2015b)) and (b) that each of these factors is associated with increased lesion likelihood in discrete brain regions (e.g., semanticsanterior temporal lobe; phonologyposterior superior temporal gyrus/inferior supramarginal gyrus, etc.).
Based on this new, evolving conceptualisation of aphasia, the current study tested a prediction model that had two important novel features: (a) as well as predicting aphasia type or individual features of aphasic performance (as done in the small handful of previous studies (Hope et al., 2015;Hope et al., 2013;Saur et al., 2010;Yourganov et al., 2015), we targeted the full spectrum of aphasia by predicting aphasia type as well as the full range of behavioural test scores; (b) the patients' lesions were functionally-partitioned before using them as predictors. Specifically, we demonstrated that the functionally-partitioned (LV-PCA) model outperformed a more general model that only incorporated the overall lesion volume (LV model), both in terms of percentage variance explained of the training data and correlation of predictions on left out cases. By adding age to the functionally-partitioned model, we further improved the explanatory and predictive power of the model for the majority of behavioural tests. Furthermore, this combined (LV-PCA plus age) model was better at classifying participants into aphasia subtypes compared to the unpartitioned lesion plus age (LV-age) model. Indeed, the LV-PCA-age model was able to classify a broad range of patient subtypes. The success of the LV-PCA-age model suggests that: 1) the a priori functional partitions (identified in previous studies using a combination of PCA-decomposition of detailed behavioural data and voxel-lesion symptom mapping) are suitable to capture variance across a wide range of aphasia patients; and 2) age proved to be the best demographic variable across the range of tests.
Previous studies that have investigated the utility neural information for predict behavioural outcomes/deficits have had mixed results. Earlier reports suggested that neural information does not add significantly to predictions (Hand et al., 2006;Johnston et al., 2002;Johnston et al., 2007;Lazar et al., 2008;Willmes and Poeck, 1993). Our results align with the more contemporary studies that have found significant improvements in predictions when using neural lesion information (Hope et al., 2013;Saur et al., 2010;Schiemanck et al., 2006;Thijs et al., 2000;Yourganov et al., 2015). Currently, the existing literature has either focused on differentiating between pairs of aphasia subtypes (i.e. Yourganov et al., 2015) or has targeted individual, important tests scores (Hope et al., 2013). There has been only one study that has predicted the subtests within the CAT using anatomical regions defined in pre-existing atlases (Hope et al., 2015). Our investigation suggests that the full range of aphasia subtypes can be predicted (to provide a 'coarse' picture of the nature of each patient's type and severity of aphasia; see Introduction section for the limitations of these measures) and can also predict a broad spectrum of individual assessment scores (to provide a detailed picture of each patient's phenotype).
A critical characteristic of the prediction model was its use of functionally-derived partitioning of the patients' lesions rather than the use of anatomical parcellations (e.g. Hope et al., 2013;Yourganov et al., 2015). It is important to note that the lesion correlations across 5-folds produced strikingly stable results, suggesting that the core areas identified are highly reproducible. This alternative approach resonates with a previous study by Saur et al. (2010) which found that including the level of signal from functionally-focused regions-of-interest in patients' acute fMRI scans, enhanced binary predictions of language outcome and improvement (rather than aphasia types or assessment profiles). Although we did not use fMRI data in the current study (instead deriving the functional partitioning from a combined PCA-lesion mapping approach (Halai et al., 2017)), the fact that both studies found considerable improvements in prediction power suggests that functionallyrelated information may be a critical ingredient for successful prediction models.
In clinical terms, the prediction accuracy of aphasia classification achieved here was very good and thus high enough to begin to contemplate how this model might be used in clinical management, including predicting language-cognitive abilities in the chronic stage from scans collected in the acute or sub-acute phase, to guide intervention plans and to stratify patientsin short, a form of 'neurocognitive' precision medicine. We stress that the current models were built and tested on chronic neural and behavioural data and were not tested in the temporal sense (acute to chronic)which can be explored in future studies (although this type of prediction has additional barriers; see Karnath and Rennig, 2017). In contrast, whilst predictions of the specific scores across the full test battery are statistically reliable and better than lesion-only models, further improvements of the models are required before they could be used clinically.
Future studies can explore how to improve the predictive power of these fine-grained prediction models. It seems likely that these explorations will fall into three classes: (i) more data: like our own investigation, most studies to date have used structural (T1/T2) neuroimaging to predict performancewhich is important given that routine clinical scanning often only includes this type of scan. The inclusion of other imaging modalities, such as fMRI (Saur et al., 2010), white-matter connectivity (e.g. Corbetta et al., 2015;Griffis et al., 2017;Saur et al., 2008) and functional-connectivity, might improve prediction models, especially when used together (which will require sophisticated methods for combined analyses of multimodal imaging data (Calhoun and Sui, 2016)). Furthermore, and perhaps critically, these additional imaging measures might provide critical insights about post-stroke functional reorganisation which is unlikely to be reflected in the core lesion itself but rather from changes in functional activation and connectivity. (ii) More predictors: as well as developing the precision of each existing predictor, it is likely that the models will be improved by increasing the range (in terms of additional areas) and type (in terms of modality) of neuroimaging predictors. These will include other aspects of language processing (e.g., perhaps the differentiation of receptive and expressive phonological abilities (Schwartz et al., 2012;Schwartz et al., 2009)), as well as non-language functions (e.g., executive abilities). The inclusion of non-language primary systems such as executive skill may be important given that it has been shown to vary across the aphasia population, is engaged by patients with better recovery, predicts response to therapy and forms a critical part of the aphasia phenotype (Brownsett et al., 2014;Butler et al., 2014;Fillingham et al., 2005;Halai et al., 2017). (iii) Individual differences: whilst some previous studies have predicted dichotomies (e.g., contrastive aphasias or good/bad outcomes), some like the present study have attempted to predict the individual differences in performance (e.g. Hope et al., 2015;Hope et al., 2013). There are, however, other sources of individual differences which may be important. First, there are premorbid differences in how individuals achieve each language activity, in terms of the reliance on different parts and connections within the neurocognitive network that underpins the behavioural activity. Indeed, the multiple parts and connections in these networks may promote (individually-varying) robustness to task complexity and damage (a notion capture by the mathematical term, degeneracy (Price and Friston, 2002)). Recent explorations using fMRI in healthy participants suggest that it may be possible to map and understand the limits of these premorbid differences and, with transcranial magnetic stimulation, to understand the likely impact following brain damage (Hoffman et al., 2015;Kherif et al., 2008;Seghier et al., 2008a;Woollams et al., 2016). Secondly, there are clearly individual differences in the level and type of recovery after brain damage. Prediction models will be improved, therefore, not only by identifying global factors that modulate the overall level of recovery (which might, perhaps, include age and domain-general mechanisms such as multi-demand executive skills, see above) but also the limits on how far key neurocognitive networks can re-distribute function after partial damage (Keidel et al., 2010;Ueno and Lambon Ralph, 2013;Ueno et al., 2011;Welbourne and Lambon Ralph, 2007;Welbourne et al., 2011).