Neurophysiological Correlates of Concussion: Deep Learning for Clinical Assessment

Concussion has been shown to leave the afflicted with significant cognitive and neurobehavioural deficits. The persistence of these deficits and their link to neurophysiological indices of cognition, as measured by event-related potentials (ERP) using electroencephalography (EEG), remains restricted to population-level analyses that limit their utility in the clinical setting. In the present paper, a convolutional neural network is extended to capitalize on characteristics specific to EEG/ERP data in order to assess for post-concussive effects. An aggregated measure of single-trial performance was able to classify accurately (85%) between 26 acutely to post-acutely concussed participants and 28 healthy controls in a stratified 10-fold cross-validation design. Additionally, the model was evaluated in a longitudinal subsample of the concussed group to indicate a dissociation between the progression of EEG/ERP and that of self-reported inventories. Concordant with a number of previous studies, symptomatology was found to be uncorrelated with EEG/ERP results as assessed with the proposed models. Our results form a first step towards the clinical integration of neurophysiological results in concussion management and motivate a multi-site validation study for a concussion assessment tool in acute and post-acute cases.

Traumatic brain injury (TBI) impacts upwards of 2.8 million individuals annually in the United States alone 1 . Concussions (henceforth used synonymously with mild TBI; mTBI) form a considerable subset of that figure and are defined as closed-head injuries that leave the affected with functional and cognitive deficits 2,3 . The current understanding of the underlying mechanisms of concussion remains lacking, with recurring concerns regarding both the identification and the management of the condition 4 . An expansive body of work has targeted the multiple facets of concussion, offering different means of elucidating the cognitive deficits caused by concussion and its co-morbid sequelae 5 . Electrophysiology is one tool with promising applications in concussion. Specifically, event-related potentials (ERPs) as recorded by electroencephalography (EEG) have shown persistent changes in concussed individuals in the post-acute stage and decades after insult [6][7][8][9][10] .
ERPs are non-invasively-recorded indices of cognitive function 11 . The P300, a positive-deflecting response peaking approximately 300 ms after stimulus onset, is a commonly studied component in neurophysiology that is associated with attentional resource allocation, orientation, and memory 12 . The P300 was found to be impacted by concussion immediately 13 after occurrence and decades post injury 6,[8][9][10] . P300 effects were observable when patients were symptomatic as well as after symptom resolution 14 and were affected cumulatively following a series of concussive blows to the head in comparison to a single hit 15 . The N2b is an ERP often linked to executive function manifesting as a fronto-central negative deflection 200 ms after stimulus onset 16 . Similar to the P300, the N2b was affected after sustaining hits to the head 7,10,15,17,18 . Research has demonstrated the versatility and sensitivity of both the P300 and N2b to concussion; however, a transition from controlled, group-level findings to individual assessment is required before clinical adoption is made feasible.
Machine learning (ML) has gained significant traction in the clinical field, offering a cost-efficient way of replicating expert judgements and decisions in a setting overloaded with data 19,20 . ML introduces a dynamic process that is able to ingest high-dimensional clinical data and learn complex patterns that might be difficult for a human expert to detect or visualize 19,20 . Despite some scrutiny due to black-box solutions 21 and susceptibility to bias in misapplication 22 , machine learning remains a valuable tool for exploiting resources to improve clinical standards 19,[21][22][23] . EEG data are characterized by a high dimensionality that requires some degree of aggregation to simplify for a human observer, quite possibly at the cost of losing critical information. That complexity has made ML a valuable method in EEG analysis [24][25][26][27][28][29][30][31][32] .
Although this study details the first EEG/ERP application of deep learning (DL) in mTBI, DL has been explored in various EEG applications 33 . Broadly, DL expands on traditional ML techniques by providing a multi-layer architecture that enables fitting complex and custom models that promote hierarchical feature extraction. In EEG, model complexity and layer stacking have been proposed as valuable tools for creating end-to-end solutions that integrate feature extraction and classification, as opposed to the more manual feature engineering of traditional ML 24 . Most DL applications on EEG to date have used resting-state data, with shifting windows in time as input, to provide datasets of sufficient size for training such complex models 27,28,33 . Recently, there have also been studies of DL to classify targets (P300) vs. non-targets in a brain-computer interfacing setup 27,33 .
In the present study, we developed the TRauma ODdball Net (TRODNet), a deep learning network that uses convolutional layers to extract information from single-trial EEG/ERP data and identify signs of concussion. The network learns a set of topographical maps that characterize different ERPs elicited in a multi-deviant oddball paradigm designed to elicit both the P300 and the N2b responses. The temporal activations of these maps form a set of automatically extracted features used to predict a single trial's label. TRODNet is trained and assessed using 10-fold class-stratified cross-validation on a dataset of 54 participants (28 controls). All concussed participants were clinically diagnosed and were symptomatic at the time of testing. Supplementary self-reports were collected to investigate concussive and depressive symptomatology as captured by the post-concussion symptom scale (PCSS) and the Children's Depression Inventory 2 (CDI), respectively. Nineteen of the 26 concussed subjects returned for a follow-up test (see Fig. 1A), nine of whom reported full symptom recovery (PCSS of 0) with the others developing post-concussion syndrome (PCS). Analyses on the longitudinal samples were run in parallel to assess whether symptom resolution was identifiable by the trained model (see Fig. 1C). Model interpretation is a critical factor for integrating machine learning into the clinical setting 21 . Thus, trained models were interpreted using the SHapley Additive exPlanations (SHAP) method, a recent introduction to the field with demonstrated success in clinical applications 23,24,34 .
The study was designed to investigate two primary hypotheses. First, the study examined whether single-trial classifications can be aggregated for each subject to provide a viable tool for detecting concussion-related neurophysiological effects using minimal feature engineering. Second, the model's judgements on longitudinal datapoints were examined. It was postulated that performance would deteriorate after symptom resolution due to a normalization of the recovered subjects' neurophysiological responses, as opposed to consistent performance in those who retained their symptoms. Model interpretability was prioritized to ensure a transparent representation of learned information and to serve as a confirmatory step for the model's results.

Results
Concussion identification. As the model was trained (and tested) on single trials, aggregation of the TRODNet output was performed to create a prediction on the subject level (see Methods for more details). As such, if more than 50% of a subject's trials were classified as concussed, the subject was predicted as belonging to the concussed group. The TRODNet model was able to achieve a single-subject cross-validation accuracy of 85%. Specifically, four control subjects were misidentified as concussed while four concussed subjects were misclassified as controls. This put the model's sensitivity to concussive effects at 84.6% and its specificity at 85.7%. Single-trial cross-validation accuracy was recorded at 74.4%; however, this figure should be assessed with care as discussed below. A detailed list of the model's single-trial accuracies; PCSS and CDI scores; demographics; and number of days since injury for each subject in the concussed group, including the longitudinal results, is reported in Table 1.
Longitudinal factors. Assessing the model's single-trial accuracy for the concussed subgroup that partic-  Fig. 2, showing a clear main effect of Testing Date that is not influenced by Recovery. Additionally, it can be observed that subjects who did not report symptom recovery had lower single-trial accuracies overall.
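As an arithmetic check on the subject-level figures reported under Concussion identification, the following minimal sketch recomputes sensitivity, specificity, and accuracy from the group sizes and misclassification counts stated in the text. The variable names are illustrative and are not taken from the authors' code.

```python
# Counts taken from the text: 26 concussed and 28 control subjects,
# with four subjects misclassified in each group.
n_concussed, n_control = 26, 28
missed_concussed, missed_control = 4, 4

true_pos = n_concussed - missed_concussed  # concussed predicted as concussed
true_neg = n_control - missed_control      # controls predicted as controls

sensitivity = true_pos / n_concussed
specificity = true_neg / n_control
accuracy = (true_pos + true_neg) / (n_concussed + n_control)

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
print(f"accuracy    = {accuracy:.1%}")
```

The counts give 22/26 and 24/28 for sensitivity and specificity, which match the reported 84.6% and 85.7%, and 46/54 for overall accuracy, in line with the reported 85%.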

Injury acuteness and correlation analyses. The effect of days since injury on perceived results was inconclusive for the first day of assessment (see Fig. 3 and Table 1). For the second date, self-reported symptoms seemed to increase as days since injury increased for the no symptom resolution (NSR) group. This effect was equally observable in the PCSS and CDI scores. Although the two measures are inherently confounded, this result suggests a layer of subjectivity, indicating a worsening of effects as an individual is subjected to symptom persistence. Conversely, no clear effect of days since injury was noted on the EEG/ERP results when accounting for symptom resolution. Overall, the symptom resolution (SR) subgroup reported a lower PCSS score at the date of the first test compared to the NSR subgroup. This is concordant with reports of symptom severity being a consistent measure of clinical recovery 2 .
Insights from model explanations. Upon interpreting the model with SHAP, TRODNet highlighted areas of interest overlapping with previously demonstrated effects in the literature 10,35 . The mean absolute SHAP values, indicative of feature importance, were reshaped for display on a 64-channel EEG plot for each condition (see Fig. 4). The two deviants had the most prominent features with important ones forming a bimodal distribution in the posterior regions, morphing into a unimodal shape in the frontal areas. The first and second peaks correspond in time and topography to the P300 and N2b, respectively 12,16 . Features tended to be uniformly important bilaterally, with slightly higher importance for the right side. Responses to the standard condition showed smaller and more dispersed distributions of feature importance, an unexpected finding considering an earlier study on chronic effects of concussion that showed early discernible effects to the standard tones 24 .

Discussion
Our results demonstrated the efficacy of an acute/post-acute automated system for concussion identification in individual subjects. In contrast to earlier work in concussion, the utilization of deep learning and convolutional networks enabled an end-to-end solution with minimal feature engineering 24,26,36,37 . Additionally, the hypothesis that single trials offer a more granular and effective method of assessing EEG/ERP data was supported.
Results relating symptomatology and neurophysiological effects were negative. Despite the misalignment between the present study's hypothesis and the data, symptomatology has been previously shown to have little correlation with EEG/ERP effects 6,10,35 , especially as neuropsychological measures completely return to baseline in most cases 38 . This disagreement extends to other assessment modalities such as quantitative EEG 36,39 . It is noteworthy that the model's performance drop may be attributable to the time elapsed since injury, a finding that agrees with a regression study conducted in parallel to the present one (under review). These results highlight the need to examine the multiple stages of concussion progression and their effects with care, as some may be observable strictly at a particular stage of injury and/or recovery. Moreover, in the longitudinal subset, the model predicted trials of subjects that exhibited symptom resolution as concussed more often than trials of subjects with persisting symptoms. Interestingly, that difference was observed irrespective of Testing Date (1st vs. 2nd; Fig. 2). These results introduce the possibility that a subject's recovery trajectory may be inferred from their EEG/ERP results during the symptomatic stage; however, no strong evidence could be drawn given the constraints of the present dataset. Of note, performance in the longitudinal sample is difficult to interpret given that at no time was the model trained on a longitudinal sample from our data. We are not able to draw conclusions on whether the results are due to PCS-related neurophysiology or a broader neurophysiological persistence of the injury that remains beyond symptom resolution. In practical terms, we posit that a sufficiently large PCS group is required in addition to a symptomatically resolved group to train a model to effectively differentiate the two against a control group. Ideally, given sufficient data, a model should also have access to the time since injury to properly account for a dynamically changing manifold of injury-affected responses.
The present study is the first report of ML-based EEG/ERP analysis in acute/post-acute concussion assessment. We reported a higher accuracy than previous studies classifying mTBI using resting-state (RS) EEG 26,37 and a marginally higher accuracy than a previous study on injury detection decades after injury 24 . A quantitative comparison with clinical tools typically used in mTBI assessment is not straightforward, as some of the best-reported tools decline in utility as soon as 5 days after injury 2,4 while our first day of data collection was an average of 20.2 (13.6) days separated from injury. Clinical tools such as self-reported symptoms, postural control evaluation, and a pen-and-paper assessment scored sensitivities of 68.0%, 61.9%, and 43.5%, respectively, when administered within 24 hours of injury 40 . Combining all these tools was reported to exceed 90% sensitivity, although it is critical to be mindful that with these increments in sensitivity, specificity of these methods deteriorates and, by definition, reduces accuracy. Overall, we argue that the implementation of a single-subject EEG/ERP evaluation for acute/post-acute concussion is feasible, given the group-level studies in the literature 17,35 as extended to single subjects by the methodology presented here. Clinical applicability beyond the acute stage, however, requires further investigations that would augment the data used for training as discussed above.
The interpretability layer on our neural network model confirmed our results' origins as pertaining to neurophysiological signals commonly affected by concussion. This provides strong evidence that the model's predictive power is linked to the ERPs that the experimental paradigm was designed to elicit. Primarily, in the deviant conditions, TRODNet's most important features, as extracted by SHAP, corresponded to the 100-500 ms window, encompassing both the N2b and the P300 (see Fig. 4). Topographical examination of feature importance showed the effects to be predominantly central, with an earlier effect that is marginally lateralized to the right. Examination of the standard condition showed a small parieto-occipital effect in the 100-300 ms range, likely related to the N1-P2 complex. While this finding agrees with previous work on chronic neurophysiological effects of concussion observable in responses to the standard tones in an oddball paradigm, the features show low and dispersed importance measures compared to what was observed in the earlier study 24 . This is compatible with a hypothesis that alterations in earlier responses (in the mismatch negativity 10 or the N1/P2 complex 24 ) may correspond to irreversible effects of concussion and are strictly prominent in chronic cases. Further, tracing the model's results provides additional empirical, data-driven support for mTBI's impact on facets of cognitive function linked to the P300 and N2b such as attention and executive function 10,17,41 .
The study exhibits two primary limitations. First, the difference in age between the two groups can be argued to contribute to the model's ability to discern between the two experimental groups. Although there have been several reports of age-related differences in ERPs and resting-state EEG, the evidence supports little to no differences in the range of our two groups (mean ages of 15.04 and 19.3 years) [42][43][44] . Thus, we argue that an effect pertaining to the presented age range is unlikely and, if present, minimal. Second, as correlations between model output and symptomatology were conducted post-hoc, further work is required to confirm the relationships between time elapsed since injury and ERP effects.
In sum, a strong case for the clinical utility of ERPs in individual assessment of acute/post-acute concussion patients has been presented. The current findings improve upon those from resting-state and quantitative EEG 36,37 to establish a modality that is able to capture the effects of concussion immediately after insult and years post-injury 24 . The intent of this research was not directed at the mechanisms of progression and symptom manifestation, which remain unclear. However, a major step in that direction has been achieved by translating a complex, multi-trial EEG signal into an accurate identification of concussion incidence on a single-subject basis. The proposed model, TRODNet, was able to capture distinguishing features without the need for feature engineering, enabling prospective application to other age groups and pathologies.

Data collection and EEG recordings.
Participants. Data were collected from 26 (7 male) adolescents (mean age = 15.04) with a recently sustained and clinically diagnosed concussion (mean days since insult = 20.15). A comparative group of 28 (5 male) participants (mean age = 19.3) acted as healthy controls, reporting no previous head injuries. All participants reported no neurological or auditory problems. The study was reviewed and approved by the Hamilton Integrated Research Ethics Board, Hamilton, Ontario, Canada. Prior to study participation, all participants provided informed consent in accordance with the ethical standards of the Declaration of Helsinki.
EEG stimuli and experimental conditions. ERPs were collected to a multi-deviant auditory oddball paradigm 10,45 . A 600-tone sequence was presented across two blocks of 300 each. Three deviant tones were presented pseudo-randomly in a continuous stream of standard tones. The standard tone was presented 492 times (82%) at 1000 Hz, 80 dB sound pressure level (SPL), and a duration of 50 ms. Each deviant was presented 36 times (6%) and differed from the standard tone in only one sound characteristic. The frequency deviant was 1200 Hz, the duration deviant was 100 ms, and the intensity deviant was 90 dB SPL. Participants were tasked to respond using one button to the standard and another button to all deviants. Due to technical issues, data from the intensity deviant were discarded during analysis.
EEG recording and preprocessing. Continuous EEG was recorded from 64 Ag/AgCl active electrodes (Biosemi ActiveTwo system) placed according to the extended 10/20 system using an elastic cap. Data were passed through an online bandpass filter of 0.01-100 Hz and referenced to the driven right leg. Data were digitized and saved at 512 Hz. Five external electrodes were recorded with the same settings. Three were placed on the mastoid processes and on the tip of the nose. The last two were placed above and over the outer canthus of the left eye to record eye movements. Stimuli markers were recorded and saved synchronously with the EEG data.
Data were processed offline using a 60 Hz notch filter and a 0.1-30 Hz (24 dB/oct) bandpass filter before re-referencing to the averaged mastoids. Artifacts were rejected manually using visual inspection followed by independent component analysis (ICA) decomposition. The two components found to correlate with horizontal eye movements and blinks were removed before recomputing sensor data. Trials with correct behavioural responses were segmented to 1200 ms intervals starting 200 ms before stimulus onset. Finally, segments were baseline corrected (−200 to 0 ms) and grouped into their respective experimental conditions before exporting the single trials. All EEG preprocessing was conducted using Brain Vision Analyzer (v2.01; Brain Products GmbH).
observations collected from concussed subjects on their second day of testing. We denote the main dataset tensor as $X_{main} \in \mathbb{R}^{T \times N \times S}$. All EEG data manipulation was conducted using the Python MNE package 46 .
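The segmentation and baseline correction described in the preprocessing paragraph can be illustrated with a short NumPy sketch. This is not the authors' pipeline, which used Brain Vision Analyzer; the function names are hypothetical, and rounding the 1200 ms window at 512 Hz to a whole number of samples is an assumption.

```python
import numpy as np

SFREQ = 512.0             # sampling rate in Hz, from the recording description
T_MIN, T_MAX = -0.2, 1.0  # epoch limits in seconds relative to stimulus onset

def epoch_times(sfreq=SFREQ, t_min=T_MIN, t_max=T_MAX):
    """Time axis of one epoch; the sample count is rounded (assumption)."""
    n_samples = int(round((t_max - t_min) * sfreq))
    return t_min + np.arange(n_samples) / sfreq

def baseline_correct(epoch, times):
    """Subtract each channel's mean over the pre-stimulus interval (t < 0).

    epoch: array of shape (n_channels, n_samples)
    times: array of shape (n_samples,), in seconds
    """
    baseline = epoch[:, times < 0].mean(axis=1, keepdims=True)
    return epoch - baseline
```

After correction, every channel averages to zero over the 200 ms preceding stimulus onset, so post-stimulus deflections are expressed relative to the pre-stimulus level.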

Statistical analyses.
Training and validation. Stratified 10-fold cross-validation was applied to estimate the generalization accuracy of the trained models (see Fig. 1C). X was split into $X_{train}$ and $X_{test}$ before standardizing both sets based on $X_{train}$, removing the mean and scaling to unit variance for each feature. Observations from one subject were contained exclusively in either $X_{train}$ or $X_{test}$ to ensure no performance inflation due to subject-specific idiosyncrasies. The learner was batch-trained on $X_{train}$ for 500 epochs, where each epoch passed a batch of B = 160 randomly picked observations from $X_{train}$. The resultant model predicted the labels of each observation in $X_{test}$ to produce the trial accuracy, $accuracy_t$. A thresholded version of $accuracy_t$ evaluated the subject accuracy, $accuracy_s$, over all trials from a single subject: if more than 50% was achieved, $accuracy_s$ for subject i was tallied as correct. In instances where $X_{test}$ contained one or more subjects that had undergone a second day of testing, the subjects' second set of trials was evaluated in parallel to assess their follow-up test's accuracy as described above. This procedure was done to ensure an identical training set for both testing dates as well as to eliminate the possibility of within-subject bias. No training was conducted on data collected at the second day of testing.
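The validation scheme above combines three steps: subject-exclusive folds, standardization fitted on the training set only, and a 50% threshold on single-trial accuracy per subject. The sketch below illustrates them with NumPy alone. It is not the authors' implementation, which used scikit-learn; the function names and the round-robin assignment of subjects to folds are assumptions made for illustration.

```python
import numpy as np

def subject_folds(subject_ids, labels, n_folds=10, seed=0):
    """Yield one boolean test mask over trials per fold.

    All trials of a subject share a fold, and subjects are dealt out
    class by class so that folds stay roughly class-stratified.
    """
    rng = np.random.default_rng(seed)
    fold_of = {}
    for cls in np.unique(labels):
        cls_subjects = list(np.unique(subject_ids[labels == cls]))
        rng.shuffle(cls_subjects)
        for i, subject in enumerate(cls_subjects):
            fold_of[subject] = i % n_folds
    trial_fold = np.array([fold_of[s] for s in subject_ids])
    for k in range(n_folds):
        yield trial_fold == k

def standardize(x_train, x_test):
    """Remove the mean and scale to unit variance per feature,
    using statistics estimated on the training set only."""
    flat = x_train.reshape(len(x_train), -1)
    mean, std = flat.mean(axis=0), flat.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features

    def scale(x):
        return ((x.reshape(len(x), -1) - mean) / std).reshape(x.shape)

    return scale(x_train), scale(x_test)

def subject_correct(trial_pred, trial_true):
    """A subject is tallied as correct when more than 50% of their
    trials receive the correct label."""
    return np.mean(trial_pred == trial_true) > 0.5
```

In a full loop, each mask from `subject_folds` would select the held-out trials, `standardize` would be applied to the resulting split, and `subject_correct` would be evaluated once per held-out subject after prediction.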
Neural network architecture and hyperparameters. Following the notion that a multi-channel EEG signal is the evolution of certain topographies across time 25,29 , TRODNet utilized convolutional layers to learn commonly occurring topographical maps (see Fig. 1B) 27,28 . The present architecture, based on an EEG ConvNet 28 and EEGNet 27 , was expanded to account for multiple conditions in the same input observation. Compared to EEGNet, TRODNet did not contain a convolutional layer that provided learned filtering settings, but split the depthwise convolution for each of the experimental conditions to extract topographical maps that best distinguish each condition. TRODNet corresponded in architecture to the shallow ConvNet 28 with the addition of the by-condition split and with the input limited to time-locked trials (see Fig. 1B). The network had four layers in total (in addition to input).
• $L_{input}$: This describes the input layer. The input tensor is of size B × N × S and is reshaped to B × N × S × 1 before passing to the next layer.
• $L_1$: The input tensor was split across three separate convolutional filters such that each was tasked with learning M = 5 maps that are specific to the condition. Kernel size was set to (64, 1). The output from each of the three sub-layers was of size B × 1 × S × M. The outputs were concatenated across the last dimension before passing to the next layer.
• $L_2$: A max-pooling layer was applied with a pool size of (1, 10) and a stride of (1, 5).
• $L_3$: Corresponded to a dense feed-forward layer of size 100.
• $L_{output}$: The output layer acted as the label predictor with softmax activation to separate the classes concussed and control.
All layers but $L_{output}$ had a rectified linear unit (ReLU) activation. L2-norm regularization was applied on all weights with λ = 0.25. The Adam optimizer was used during training with α = 5e-4. Training for a single cross-validation iteration was stopped after 500 complete epochs. These hyperparameters were set to optimize a separate dataset 10,24 collected using the same EEG/ERP protocol and were not modified throughout training. The code for training, testing, and visualization procedures is made readily accessible (see Data Availability section).
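To make the tensor shapes in the layer description concrete, the following NumPy sketch runs a forward pass with random weights. It is a shape walk-through under stated assumptions rather than the published TensorFlow model: it assumes the N rows of an observation are the three conditions stacked as 3 × 64 channels, that pooling uses no padding, and that a (64, 1) kernel acts as a weighted sum over channels at each time point. All function and parameter names are hypothetical.

```python
import numpy as np

N_CHANNELS, N_CONDITIONS, N_MAPS = 64, 3, 5
POOL, STRIDE, N_DENSE = 10, 5, 100

def relu(x):
    return np.maximum(x, 0.0)

def spatial_maps(x, weights):
    """x: (B, 3*64, S), weights: (3, 64, M) -> (B, S, 3*M).

    Each condition gets its own M spatial filters; the per-condition
    outputs are concatenated along the last axis.
    """
    b, _, s = x.shape
    per_condition = x.reshape(b, N_CONDITIONS, N_CHANNELS, s)  # assumed row order
    maps = np.einsum("bkcs,kcm->bskm", per_condition, weights)
    return relu(maps.reshape(b, s, N_CONDITIONS * N_MAPS))

def max_pool_time(x, pool=POOL, stride=STRIDE):
    """x: (B, S, F) -> (B, n_windows, F), without padding (assumption)."""
    n_windows = (x.shape[1] - pool) // stride + 1
    windows = [x[:, i * stride:i * stride + pool].max(axis=1)
               for i in range(n_windows)]
    return np.stack(windows, axis=1)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_params(n_samples, seed=0):
    """Random weights with the shapes implied by the layer description."""
    rng = np.random.default_rng(seed)
    n_windows = (n_samples - POOL) // STRIDE + 1
    n_features = n_windows * N_CONDITIONS * N_MAPS
    return {
        "maps": rng.normal(0.0, 0.1, (N_CONDITIONS, N_CHANNELS, N_MAPS)),
        "w_dense": rng.normal(0.0, 0.01, (n_features, N_DENSE)),
        "b_dense": np.zeros(N_DENSE),
        "w_out": rng.normal(0.0, 0.1, (N_DENSE, 2)),
        "b_out": np.zeros(2),
    }

def forward(x, params):
    """Return class probabilities of shape (B, 2)."""
    h = max_pool_time(spatial_maps(x, params["maps"]))
    h = relu(h.reshape(len(x), -1) @ params["w_dense"] + params["b_dense"])
    return softmax(h @ params["w_out"] + params["b_out"])
```

Writing the spatial convolution as an `einsum` over channels makes explicit that each learned map is a topography, that is, one weight per electrode, whose activation is then tracked across time.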
Model interpretation. The Deep Learning Important FeaTures (DeepLIFT) 47 implementation using Shapley values 34 was applied post-hoc on a model trained on all data to explain a model's decision on single-subject averages. An overall estimate of all features' influence on classification was calculated as the mean of the absolute SHAP values for all single-subject averages. The values were overlaid across the head to represent a 64-channel plot as commonly used in EEG/ERP studies. For visual clarity, each experimental condition was plotted independently.
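The aggregation step described above, taking the mean of the absolute SHAP values across single-subject averages and arranging the result per condition for plotting, can be sketched as follows. The sketch assumes the attributions are already available as a NumPy array and that the feature axis stacks the three conditions as 3 × 64 channels; it does not call the SHAP library, and the function name is hypothetical.

```python
import numpy as np

def mean_abs_importance(shap_values, n_conditions=3, n_channels=64):
    """Aggregate attributions into one importance map per condition.

    shap_values: (n_subjects, n_conditions * n_channels, n_samples)
    returns:     (n_conditions, n_channels, n_samples)
    """
    # Absolute values are taken first so that positive and negative
    # attributions do not cancel when averaged over subjects.
    importance = np.abs(shap_values).mean(axis=0)
    return importance.reshape(n_conditions, n_channels, -1)
```

Each slice along the first axis can then be drawn as a 64-channel time course layout, one plot per experimental condition.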

Data availability
The input set was imported and formatted using Python MNE 46 package version 0.16.1 running on Python 3.5.2. Cross-validation and scaling were applied using scikit-learn 0.19.1 48 . Deep learning used TensorFlow 49 (v1.8.0). All code is made available at https://github.com/boshra/TRODNet. Statistical analysis was conducted using R statistical software (v3.5.3) and the ez package (v4.4-0). Result storage, correlational plots, and feature importance visualizations were handled using the pandas (v0.24.1), seaborn (v0.9.0), and Python MNE packages, respectively. The single-trial data used to train the models of this study are available upon request from the corresponding authors (J.F.C. and R.B.).