Deep Long Short-term Memory Structures Model Temporal Deep Long Short-term Memory Structures Model Temporal Dependencies Improving Cognitive Workload Estimation Dependencies Improving Cognitive Workload Estimation

Using deeply recurrent neural networks to account for temporal dependence in electroencephalograph (EEG)-based workload estimation is shown to considerably improve day-to-day feature stationarity resulting in signiﬁcantly higher accuracy ( p < .0 0 01) than classiﬁers which do not consider the temporal dependence encoded within the EEG time-series signal. This improvement is demonstrated by training several deep Recurrent Neural Network (RNN) models including Long Short-Term Memory (LSTM) architectures, a feedforward Artiﬁcial Neural Network (ANN), and Support Vector Machine (SVM) models on data from six participants who each perform several Multi-Attribute Task Battery (MATB) sessions on ﬁve separate days spread out over a month-long period. Each participant-speciﬁc classiﬁer is trained on the ﬁrst four days of data and tested using the ﬁfth’s. Average classiﬁcation accuracy of 93.0% is achieved us-ing a deep LSTM architecture. These results represent a 59% decrease in error compared to the best previously published results for this dataset. This study additionally evaluates the signiﬁcance of new features: all combinations of mean, variance, skewness, and kurtosis of EEG frequency-domain power distributions. Mean and variance are statistically signiﬁcant features, while skewness and kurtosis are not. The overall performance of this approach is high enough to warrant evaluation for inclusion in operational systems. Published by Elsevier B.V. This is an open access article under the CC BY license. ( http://creativecommons.org/licenses/by/4.0/ )


Introduction
Teams composed of both humans and machines can potentially work together to mitigate their respective inherent weaknesses. A computer's strength is manifested in its ability to quickly and correctly compute answers, while humans exhibit superior flexibility of response to unexpected situations. Thus, Human-Machine Teams (HMTs) promise to mitigate inherent limitations on computational decision-making in all-human teams while simultaneously reducing the brittleness and inflexibility of fully-autonomous systems [11] . Team outcomes are improved when one agent (human or computer) assists another in the right way at the right time [7] . For computers to help humans in HMTs, they must know the human's cognitive state; this knowledge can be obtained through method for estimating mental workload is to first use statistical machine learning to fit a model which enables prediction of mental workload from the physiological signals, and then use that model to make mental workload estimates from newly-gathered physiological signals [44] .
The utility of an OFSA system will depend on the benefits of accurate assessment and the costs of errors. This cost-benefit tradeoff will be application-specific and different for correctly identifying high and low workload states depending on the types of augmentation tied to a given state and the consequences of incorrect/inappropriate activation or lack of activation. These errors directly impact an operator's trust in the automation, in-turn affecting future utility of that automation in a closed loop-fashion [28] . Rouse et al. [35] indicated that a 95% accuracy rate for workload estimation may be required for a system to be acceptable. Parasuraman et al. [31] went further and suggested that if the system does not approach 100% accuracy then the costs of inaccuracy and lack of trust may lead to the system being unacceptable, especially in safety-critical environments.
Unfortunately, current state-of-the-art systems are not yet able to achieve the required accuracy, due in part to the challenge of temporal non-stationarity in psychophysiological signals. This challenge relates to variation over longer periods of time and dependence within shorter periods. Both can negatively impact the generalizable long term accuracy of workload assessment systems [7] . Within shorter spans of time, signals tend to exhibit hysteresis or serial dependence. This suggests that there is inherent structure in the statefulness in the brain that can be exploited with appropriate machine learning techniques. While it is difficult to attribute this dependence to any discrete set of factors, some of the likely possibilities include consistency in default mode activity [34] and hysteresis exhibited by most physiological systems.
In the context of machine learning, temporal non-stationarity can be addressed in two ways. The first is through feature generation or selection. A better set of features will exhibit less long-term non-stationarity and will lead to better model performance. In this work, we examine several feature generation techniques to determine empirically if certain feature sets are superior to others. The second way to address non-stationarity with machine learning is to use algorithms that make different assumptions about the nature of the data being processed. As it stands, most published research on operator workload estimation implicitly assumed temporal independence from one time segment to the next. This is likely a poor assumption due to both the factors discussed above as well as longer term effects such as fatigue and performance hysteresis with mental workload transitions [22] . An example from aviation illustrates this nicely. If a pilot has just completed flying an instrument approach in instrument meteorological conditions (IMC) when an unexpected emergency requires attention, pilot workload will increase differently than if the pilot had the same unexpected emergency arise following a period of autopilot-on flight at cruising altitude in visual meteorological conditions (VMC). This simple example illustrates that what has happened in the recent past temporally, matters for operator workload assessment.
Machine learning algorithms that consider past information as well as current information when fitting models should perform better. Such algorithms must be able to learn a temporal representation of the data. A common model used for modeling temporal data is the Recurrent Neural Network (RNN). RNNs are neural networks that are able to learn sequences that are not composed of independent, identically distributed observations [16] . Rather, they are able to elicit the context of observations within sequences and accurately classify sequences that have strong temporal correlations [16] . Historically, RNNs had limitations when training models with more than 10-20 time steps which led to poor performance. Incorporating longer time-series data streams would cause computational sensitivity problems that stymied RNN training.
Recent developments have resulted in RNN architectural and training advances which mitigate these computational problems and allow much longer temporal sequences to be processed. One approach is the Long Short-Term Memory (L STM) layer. L STM architectures extend the length of sequences that can be considered by a RNN by overcoming computational sensitivities encountered during backpropagation [21] . For these reasons, they may offer improved workload classification accuracy over other methods when using EEG data. With these improvements in machine learning, there is no longer a reason to avoid incorporating temporal context in a workload model. We capitalize on these machine learning developments in our research.
The primary contribution of this research is demonstration of significantly improved cross-day workload classification accuracy by integrating contextually relevant algorithmic architectures with improved feature generation techniques. We statistically evaluate all combinations of mean, variance, skewness, and kurtosis of frequency-domain power distributions and contrast a variety of RNN architectures, to include deeply stacked LSTMs, with baseline algorithms and features. Both linear and Radial Basis Function (RBF) Support Vector Machines (SVMs) and single-layer feedforward Artificial Neural Network (ANNs) using mean-only features are used as baseline cases. We show that by accounting for temporal dependence using deep LSTM models trained with new feature combinations, we can maximize cross-day workload estimation accuracy resulting in a 58% reduction in classification error over baseline methods and a 59% decrease in error compared to the best published results for this dataset.

Background and related work
Temporal non-stationarity of electroencephalograph (EEG) signals within individuals is likely caused by a large number of intrinsic and extrinsic factors. Participant motivation and mental or physical readiness are examples of some intrinsic factors; extrinsic factors include significant differences in EEG electrode placement, changes in conductance, and different motion artifacts [8,23,30] . Due to the challenge of handling these factors, cross-day nonstationarity of EEG signals has motivated a number of related studies including several using the same dataset described below.

Dataset
Data for our investigation was used in the 2011 Cognitive State Assessment Competition [13] and was recorded during a prior human research study performed by Wilson et al. [43] . Eight participants completed scenarios within the Multi-Attribute Task Battery (MATB) [10] environment across five test days spread out over a month-long period. Monitoring, communication, resource management, and tracking tasks were presented and manipulated to induce three levels of difficulty: low, medium, and high [8,43] . Resource allocation errors, monitoring task reaction times, and communication response times were recorded and used to validate that participants experienced distinct low and high difficulty levels. Participants were trained to asymptotic proficiency prior to the first test day [43] .
For each participant, horizontal electrooculogram (HEOG), vertical EOG (VEOG), and 19 channels of EEG voltages (according to the International 10-20 System) were sampled at 256 Hz. On each of the five days, each participant performed three five-minute trials at low, medium, and high difficulty for a total of nine trials per day. Trials were presented in a random ordering with transition periods in between. Each participant completed a 30 s resting baseline at the start of each session prior to the MATB task. Only six of the participants were used in our study due to missing data from two of the original eight participants [19] . Similar to Christensen et al. [8] and Casson [5] , we selected data from the low and high workload conditions resulting in a total of 30 conditions-15 low and 15 high-for each individual.

Within-participant cross-day variability
Many researchers using this dataset focused their efforts on cross-participant variability or a combination of cross-participant and cross-day models rather than independently investigating cross-day variability within individuals. This section will only consider those works that exclusively examined within-participant cross-day variability.
In the workload literature using this dataset, three research teams focused solely on the performance of classifiers in crossday analyses. Hefron and Borghetti evaluated the variance of frequency-domain power distributions as a feature and found it to improve accuracy for random forest, Linear Discriminant Analysis (LDA), and K-Nearest Neighbors (KNN) classifiers for the twoclass-high versus low workload-problem. Models built with variance and mean features exhibited a 5.8% increase in accuracy over models built using only the mean of frequency-domain power distributions [19] . In their study, it was postulated that the use of skewness and kurtosis as features could generate further gains in classification accuracy [19] .
In another study, Christensen et al. evaluated cross-day classification performance of low versus high workload by training three classifiers: LDA, Support Vector Machine (SVM), and a singlehidden-layer feedforward Artificial Neural Network (ANN) [8] . Two SVM implementations were used-one used a radial basis function kernel and the other a linear kernel. The authors noted the linear kernel performed best and was used for reporting results. One portion of the experimental design used leave-one-out crossvalidation, holding out one day's data in each case. This procedure produced five separate testing periods that were used to evaluate algorithmic performance. Of note, the SVM achieved 68% accuracy, while the ANN attained 83% accuracy [8] . The results demonstrated significantly better cross-day performance for the neural network compared to more traditional machine learning techniques, namely LDA and SVM classifiers. These results indicated that neural networks may have more capacity to handle nonstationarity of feature-to-target mappings.
In a study very similar to Christensen's, Casson [5] evaluated the effect of temporal stability of feedforward ANNs trained on EEG signals with differences of seconds, minutes, hours, and days between collection of training data and testing data. While other interesting results were presented in the paper, the most relevant to our work concerned participant-specific, leave-one-session-out, cross-validated models. These models were aimed at producing excellent cross-session predictive performance and attained an average classification accuracy of 73%. Differences in preprocessing, training, and network architecture contributed to the differential between these results and ours.
In all of these investigations, the demonstrated classification accuracy fell short of recommendations for use in many operational settings. While accuracy requirements will be task specific, as discussed in Section 1 , neither Rouse's desired 95% accuracy rate [35] nor Parasuraman's near 100% accuracy [31] condition were met for workload estimation. Since all of these methods assumed temporal independence between time segments, we believe new models that account for temporospatial dependencies and the use of new features should be explored.

RNNs and LSTMs
Deep neural networks have been used extensively to achieve state-of-the-art results across a wide range of categories including image recognition, speech recognition, translation, and image captioning [16,17,26,40] . Recurrent Neural Networks (RNNs) are a type of deep network where depth is added via a recurrent connection in the hidden layer which is used to process sequential data. Processing sequential data is fundamentally different than processing independent identically distributed observations because of the distributional dependence on the sequence. In a traditional feedforward ANN, each observation is processed individually and the network state does not persist while other potentially sequentiallydependent points are processed. Conversely, in a RNN, recurrent connections pass state information across time steps allowing previously processed observations to affect the subsequent observations [29] .
Feedforward ANNs have a directed acyclic computational graph-one layer feeds to another with no cycles. RNNs can be understood as an extension to the feedforward structure allowing for cycles in the graph structure. Fig. 1 shows how very deep computational graphs can be created when the recurrent connections are unfolded in time. The process of unfolding a recurrent network simply requires removal of all cycles in the graph to form a directed acyclic graph [15] . To allow these networks to learn important pieces of information that may be located at different positions in the sequential data, the input, recurrent, and output weight matrices' parameters are shared [15] . If different parameters were used for each time step, no generalization across time would be possible; sharing parameters allows the network to keep track of state. Since temporal non-stationarity of EEG signals across days is a challenge, we needed to ensure the temporal sequences supplied to our model did not exceed periods of time over which feature-to-target non-stationarity would occur.
In addition to the depth of a RNN due to unfolding over time, depth can be added to the network when recurrent layers are stacked in a sequence-to-sequence fashion. This means that the output from one layer of a RNN returns a sequence of vectors which form the input to the next layer. Graves et al. demonstrated that deeply layering RNNs has a more beneficial effect than merely adding memory cells [17] . The lower layers transform input sequences into more easily learned representations in the higher layers, resulting in better classification accuracy [15] .
One problem with the simple recurrent structure shown in Fig. 1 is a result of shared weights which produce vanishing and exploding gradients during backpropagation through time. This limits the sequential depth of simple recurrent units to sequences of no greater than 10-20 observations because the signal from distant positions will not propagate to the current time [15] . A significant innovation which solved this problem was Long Short-Term Memory (L STM) [14,21] . L STM provides the algorithm fine control over what is put into memory and removed from memory in the hidden layer. This is achieved by a combination of three gates which control flow into and out of the memory cell: an in- For clarity, we will describe the state of a memory cell at time t with recurrent connections from time t − 1 , and to time t + 1 . As shown in Fig. 2 , there are two vectors that persist from time t − 1 : the hidden vector, h t−1 , and the memory cell state, s t−1 . The forget gate f t , as shown in Eq. (1) , determines what to remove from the memory cell state. In other words, it forces the memory cell to forget things that are not important based on error backpropagation: where the weight matrix W xf is the weight matrix from the input x to the forget gate f t , x t is the input at time t, W hf is the weight matrix from the previous hidden vector h t−1 to the forget gate f t , and b f is the forget gate bias. The from-to subscript convention is used to describe each weight matrix. The input gate i t determines how much each element of the candidate update vector, ˜ s t , should be added to the corresponding memory cell element at time t based on the recurrent connection from the hidden vector h t−1 and the sequential input at time t, x t . The gate scales the candidate update vector which is the output from a fully-connected tanh layer: The delicate balance required to maintain memory cell state over long sequences is attained by forgetting old information and incorporating new information. Forgetting is accomplished by multiplying the old memory cell state by the output of the forget gate, while the new information is supplied by adding the portion of each value specified by the element-wise product of the input gate with the candidate update: Finally, the output gate determines what should be output to the hidden vector from the memory cell state, given the temporal context of the time-step, to minimize error. The output gate, o t , and hidden vector, h t , are given by:

RNN models for EEG analysis
Despite excellent results in other fields, relatively few researchers have applied deep neural network techniques to classify EEG data. Bashivan et al. [2] trained a deep recurrent-convolutional neural network accounting for both temporal and spatial dependencies in the network. They began by performing a Fast Fourier Transform (FFT) on each time-series signal from each electrode and estimating the power in the theta (4-7 Hz), alpha (8)(9)(10)(11)(12)(13), and beta (13-30 Hz) frequency bands over 3.5 s working memory experiment trials with four classes of task difficulty. Then, they created a time-series of images by performing a 2-d Azimuthal Equidistant Projection (AEP) for power in each of the three different bands. This preserved distances from each electrode to the center point of the EEG cap, thus accounting for some spatial dependencies. These images were fed into a series of convolutional and max pooling layers. Their best performing model fed the output from the convolutional portion of the architecture into a 1-d convolutional layer, as well as a LSTM layer, and merged the output from both of these into a fully connected layer. This was then connected to a softmax layer which provided the final classification result. This complex architecture used 1.62 million parameters and was able to reduce the classification error from 15.34% to 8.89%; an impressive 42% reduction compared to a baseline radial basis function SVM [2] . While a large reduction in error over baseline methods was achieved, the complexity and amount of computation required to train such a deep network is significant. Our research sought to examine if similar reductions in error over baseline methods could be achieved using different preprocessing techniques and a less complex deep architecture which only used recurrent neural networks.
Two other researchers used a form of the LSTM to analyze EEG data. Davidson, et al. found that a small single-layer LSTM was able to identify lapses in attention based on EEG spectral data better than a tapped delay-line Multilayer Perceptron (MLP) during performance of a visuomotor tracking task [12] . They trained their networks in a leave-one-out, cross-subject manner. Their results evaluated temporal dependencies out to a length of 6 s and showed that accounting for temporal dependence of up to 4 s prior to a lapse improved detection. We use an extended temporal window of 30 s in this study. In other work, Binz et al. [3] used a derivative of the LSTM unit, called Dynamic Cortex Memory (DCM), to classify imagined sensorimotor imagery data from a Brain Computer Interface (BCI) workshop competition [4] . A single hidden layer was used with eight DCM units followed by a softmax output layer. While their results did not achieve state-of-the art accuracy, several advantages were present in their solution. Their network was able to provide real-time results and their solution did not need to specify a time window, since the network learned appropriate temporally-dependent sequences [3] . Binz's work largely differed from ours since a mixture of within-participant and crossparticipant models were used to evaluate trials that were separated from the training data by only seconds-to-minutes rather than days. The only similarity between our studies was that we both used forms of the LSTM architecture to account for temporal dependencies in EEG signals.
In a medical application, several research groups [18,27,36,39] successively improved performance of epilepsy diagnosis using RNNs. All four groups used various input features to train small Elman RNNs in a cross-subject manner using the dataset described by Andrzejak et al. [1] . The Elman RNN architecture has limited temporal representational capacity compared to the networks trained in our investigation. Like our study, each research group demonstrated the superiority of an RNN compared to other neural networks that did not incorporate temporal dynamics. However, unlike our experiment, the data sequences analyzed did not span temporal lengths great enough to examine day-to-day variability. Finally, Guler and Ubeyli both demonstrated that summary statistics (min, max, mean, standard deviation) of time-varying feature distributions can be useful input features for a recurrent model; however, no comparison to a baseline feature set was provided in either case.
The primary difference between previous research and ours is that we demonstrate improved within-participant, day-to-day feature stationarity by accounting for temporal dependencies in cognitive activity; whereas the aforementioned studies do not address day-to-day variability. Another difference is that the preceding research focused on either medical uses or state estimation unrelated to workload. A final differentiating factor was that many previous studies were conducted prior to breakthroughs that enabled drastic Table 1 Test matrix of all combinations of mean, variance, skewness, and kurtosis features. * Denotes feature sets that were included in models that incorporated the mean while all others are models that did not incorporate the mean.

Test run
Features included in dataset

Methodology
The goal of producing several workload models to evaluate the efficacy of new features and LSTM based classification algorithms required preprocessing of the EEG data to convert it into the frequency domain and to extract power in different frequency bands as outlined by [19] . Raw EEG data was transformed into features in clinical frequency bands (delta (1-4 Hz), theta (4-8 Hz), alpha (8-14 Hz), beta (15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30), and gamma (30-55 Hz) to conduct timefrequency analysis using the following process: The power spectral density was determined for 30 points spread out over a logspace from 3 Hz to 55 Hz by extracting power from complex Morlet wavelets [9] . Each wavelet was 2 s in length and the number of wavelet cycles increased logarithmically from 3 to 10 in conjunction with the frequencies. Mean power in each band was determined by averaging each power value for the evaluated frequencies within each of the clinical bands. Power was then aggregated over a ten second sliding window with 9 s of overlap, allowing for a new update each second.
Final features were generated by determining the mean, variance, skewness, and kurtosis of the power distribution in each of these ten second windows for all 19 EEG electrode sites across the five frequency bands. This process yielded 380 features for each second and approximately 90 0 0 observations per individual for the five day period. These features were then centered and scaled by session so that each session had a mean of zero and variance of one since none of the algorithms used for analysis were scale invariant. The data were split such that the first four days were used for training and cross-validation while the last day was reserved for testing.
To examine the effect of inclusion/exclusion of particular types of features, the test matrix shown in Table 1 Table 1 resulting in 90 final models for each algorithm. Final models were trained using each set of features and differences in classification accuracy due to choice of algorithm and feature set were considered. It is important to note that the cross-validation data was split so that each fold was a full day rather than splitting the folds by random selection of observations from the entire dataset. There were two compelling reasons for this choice. The first reason was that randomly selected crossvalidation points allow for too many temporally adjacent points to be split between the training and test sets which artificially inflates cross-validation accuracy due to the non-stationarity of the datasets. The second reason was to preserve temporal context of the data for processing when using RNNs.
Once final models were produced, classification accuracies for the holdout day 5 test set were determined for each individual and the algorithm average was calculated across participants. This enabled by-algorithm and by-feature set comparisons. Due to confounding effects of algorithm selection and feature sets on classification accuracy, an ANOVA test with five factor outcomes was performed to elicit the effect of varying levels of each factor. The first factor was algorithm selection which had six levels corresponding to each of the aforementioned algorithms. The remaining four factors grouped the feature sets in a binary fashion based on whether a feature was included or excluded from a particular test run. For example in Table 1 , all asterisked runs produced models that included the mean, while the others were models that excluded mean features. Interaction effects were not examined. A significance level of α = 0 . 05 was used to indicate if a factor had a statistically significant impact on classification accuracy. The Tukey Honest Significant Difference (HSD) test was performed following the ANOVA to determine which classification accuracies were different from each other in the six-level algorithm factor, and to determine the direction and magnitude of differences across all twolevel factor comparisons.
All neural networks were created using the Keras [6] and Theano [38] frameworks. The feedforward ANN had a single hidden layer that was fully connected with the input layer and the output layer. The input consisted of the appropriate number of features for a given test run, while the output layer was a single node with a sigmoid activation function which forced a classification as either high or low workload. Mini-batch gradient descent was performed using 600 observations per batch for all neural network implementations, and the Adam optimizer was chosen to optimize minibatch gradient descent due to its ability to handle non-stationary targets and noisy data [25] . A binary cross-entropy loss function was selected as the cost function [15] . The number of nodes in the hidden layer was tuned by performing 4-fold cross-validation while varying the number of nodes from 50 to 800 in steps of 50. This resulted in 3960 cross-validation models being trained. The lowest cross-validation error rate was used to determine both the number of nodes to use in the hidden layer and how many epochs to train the network.
As illustrated in Fig. 3 , the deep LSTM architecture consisted of an input layer, the first sequence-to-sequence LSTM layer, a manyto-one LSTM layer, a 20% dropout layer, and a final sigmoid activation function for binary classification. The first hidden layer contained 50 LSTM units while the second hidden layer used 10 units. With unlimited resources, we would have tuned the number of hidden layer nodes exhaustively using grid cross-validation. However, due to computational resource constraints, several hidden layer sizes were tested for each layer on smaller representative sets of data and it was found that reducing the number of LSTM units in each layer improved generalization and that empirically, 50 and 10 appeared to work well. Each network had a lookback of 30 s of pre-processed features or, for the case of the second layer, features generated from the output of the first LSTM layer. Dropout  Table 1 , and ranged from 90 to 380 features. Since LSTM 2 does not return a sequence, the tensor becomes two-dimensional at that point. on the input gates to each LSTM layer and between the final LSTM and fully-connected sigmoid layer served as a method of regularization and was set to 20% [15,37] . Dropout prevents co-adaptation of the hidden units by temporarily removing a percentage of randomly selected nodes, including their input and output, in a given layer during a training pass [ 20 ]. This forces hidden units to learn features without depending upon particular nodes to correct mistakes made during learning [ 20 ]. While dropout is the most widely used regularization method for deep neural architectures, it is also important to understand when it is not appropriate to use. The recurrent connections within the LSTM structure are one such case. The purpose of the recurrent connection in a LSTM is to store important long-term dependencies. Pham et al. [33] showed that if dropout is applied to the recurrent connection, then the long term memory becomes corrupted and inhibits learning rather than improving generalization. Similar to Zaremba et al. [46] , we found using a dropout of 20% on the input gates and between the final LSTM output and the sigmoid classification layer provided better results than the typically recommended 50% dropout for fully connected layers. The number of training epochs was tuned using cross-validation as previously specified in this section. The length of training with the lowest cross-validation error rate for each participant/feature set combination was selected for final network training. All training data was then used to retrain the network. To ensure that anomalous behavior was not present in the reported results from day 5, all other combinations of cross-validation with hold-out test day were performed for the deeply stacked LSTM model. No significant deviations from the reported upon results in Section 4 were observed due to permuting the test and training days.
Two recurrent networks were trained which were derivatives of the deep LSTM architecture. The differences between the architectures are detailed next. The deeply stacked simple RNN used the same architecture as the deep LSTM except that instead of using LSTM units, simple recurrent units were used. Due to parameter sharing in the weight matrix for the recurrent connections and a lack of gating functions, the vanishing gradient problem was present and restricted the effective temporal period to 10-20 s for this model despite 30 s of data being provided as input [15] . The second network consisted of an input layer, a many-to-one LSTM layer using 50 hidden units, and a sigmoid output layer. Training and classification methods using these networks mirrored those of the deep LSTM architecture.
Two categories of SVMs were trained, one with a linear kernel as recommended by Christensen et al. [8] , and one using a RBF kernel. The resulting models were used to compare neural network results to those from a more traditional machine learning algorithm. To train the linear SVM, the appropriate number of features for a given test run were provided for each observation supplied to the SVM. The tuning parameter C set a tolerance for number and severity of margin and hyperplane violations-effectively determining smoothness of the decision surface [24,32] . This hyperparameter was optimized using a cross-validated grid search across exponentially spaced values of C from 10 −4 to 10 2 resulting in 2520 cross-validation models for the linear SVM. Nearly all C values selected for the final models were either 10 −4 or 10 −3 . All training data was then used to retrain the linear SVM with the selected values for C . The procedure for training the RBF SVM was nearly the same as in the linear case, except a cross-validated grid search over values of γ and C was performed where the hyperparameter γ controlled the radius of influence for a single observation. The same range of C values were evaluated while γ was exponentially varied from 10 −3 to 10 2 , resulting in 15,120 cross-validation models for the RBF SVM. The best hyperparameters were selected and the final models were trained. This procedure ensured proper tuning to establish a valid baseline for comparison with new techniques.

Results
Results of the five-factor ANOVA indicate that algorithm type, mean features, and variance features had a statistically significant effect on classification accuracy (all p -values < 0 . 0 0 01 ). Skewness and kurtosis in the presence of mean and variance were not significant ( p -values > 0.43). Table 2 summarizes the results from  the ANOVA while Tables 3 and 5 display the post hoc Tukey HSD results. Results from the Tukey HSD test showed that the ANN  Table 4 show that there are 6 feature combi- This compares favorably to 84.8% and 84.6% for the best ANN and SVM cross-participant feature set performance. Furthermore, this represents a 58% decrease in error compared to the best baseline case-the mean-only RBF SVM. These results illustrate the importance of accounting for temporal dependencies in workload data. Further examining the results shows that the influence of using a temporally-stateful model on classification accuracy for workload estimation cannot be understated. In all cases tested, every temporally-stateful model outperformed the best performing non-stateful model and were found to be significantly different than all non-stateful models with Tukey HSD p -values all < 0 . 0 0 01 ( Table 3 ). Statistically, inclusion of mean and variance were significant, while skewness and kurtosis were not. ANOVA results show a significant effect dependent upon inclusion or exclusion of mean features (p < 0 . 0 0 01) . Post hoc comparisons using the Tukey HSD test indicate that average classification accuracy for models including the mean features results in an accuracy increase of 9.6% versus models excluding the mean features ( Table 5 ). A significant effect is also present dependent upon inclusion or exclusion of variance features ( p < .0 0 01). The Tukey HSD test shows that average classification accuracy for including the variance features increased 4.4% versus excluding the variance features. ANOVA and Tukey HSD results for skewness (p = 0 . 4689) and kurtosis (p = 0 . 4301) were not significant. However, by comparing models with and without a single feature, these results merely indicate that kurtosis does not add significantly to a model with skewness included and viceversa. We hypothesize that skewness and kurtosis may be more important in situations involving workload transitions. We did not investigate workload transitions in our research, but expect that transitions between different workload levels may first become apparent in the tails of the distributions.

Conclusion and future work
Cross-day workload estimation based on EEG is a difficult domain due to temporal non-stationarity of feature-to-target mappings. Previous research on cross-day workload estimation implicitly assumed independence of a participant's workload from one instance to the next due to the algorithms used for analysis. Theory and practical experience show that workload can build in a cumulative fashion and that a temporal dependence exists. Recurrent Neural Network (RNN) models, in particular those that use Long Short-Term Memory (LSTM) architectures, can account for both long-term and short-term temporal dependencies inherent in brain activity data. This work also statistically evaluated the utility of mean, variance, skewness, and kurtosis of frequency-domain power distributions and found only the mean and variance to be statistically significant.
This research demonstrated the utility of deep RNN models and particular feature sets for cross-day workload estimation and showed that drastically improved model accuracy can be achieved over Support Vector Machine (SVM) and feedforward Artificial Neural Network (ANN) models when working with nonindependent data. Previously, the best accuracy achieved using this dataset was 83%. Our models built with deep LSTMs increased that accuracy to 93.0%, representing a 58% reduction in classification error over baseline methods and a 59% decrease in error compared to the best published results for this dataset. This pushed us closer to the 95% threshold where adaptive operator augmentation may become feasible.
There is an abundance of future work to be pursued in this area. Due to time constraints and computational complexity, only a select number of deep architectures were examined during this research. A thorough evaluation of different deep RNN architectures to include variations in the depth of hidden layer recurrent connections, stacking of different sized LSTM layers, and interleaving fully-connected feedforward layers between sequence-to-sequence recurrent layers may yield additional improvement.
Another enhancement for future work would be to include workload transitions. We believe that skewness and kurtosis may be relevant in datasets where the target workload is transitioning across high and low workload conditions since distributional changes may first become evident in the tails of the distributions. Other ideas to pursue include creating ensembles of deep RNNs. This would almost certainly improve results as long as enough diversity could be added to the ensemble. In our research, we only supplied 30 s sequences to the recurrent networks. Exploring variations in temporal length supplied to a RNN to examine stationarity of target-to-feature mapping for workload estimation using electroencephalograph (EEG) would also be an interesting subject for future work. Deep RNN architectures could also be used to improve cross-participant and cross-day classification simultaneously by training and testing on all participants grouped together rather than individually. Finally, significant improvement could be realized if time-series data augmentation methods are developed capable of forcing a learned invariance to the sources of temporal non-stationarity.