Extracting duration information in a picture category decoding task using hidden Markov models

Objective. Adapting classifiers for the purpose of brain signal decoding is a major challenge in brain–computer interface (BCI) research. In a previous study we showed in principle that hidden Markov models (HMMs) are a suitable alternative to the well-studied static classifiers. However, since we investigated a rather straightforward task, advantages from modeling the signal could not be assessed. Approach. Here, we investigate a more complex data set in order to find out to what extent HMMs, as dynamic classifiers, can provide useful additional information. We show for a visual decoding problem that, besides category information, HMMs can simultaneously decode picture duration without requiring additional training. This decoding is based on a strong correlation that we found between picture duration and the behavior of the Viterbi paths. Main results. Decoding accuracies of up to 80% were obtained for category and duration decoding with a single classifier trained on category information only. Significance. The extraction of multiple types of information using a single classifier enables the processing of more complex problems, while preserving good training results even on small databases. It therefore provides a convenient framework for online, real-life BCI applications.

TVs, customer service etc). The established gold-standard method for these temporal decoding problems is represented by hidden Markov models (HMMs) [10][11][12]. Nevertheless, HMMs have rarely been used in BCI. The broad variety of possible configurations provided by HMMs may contribute to this underutilization: the particular setup needs to be chosen carefully in order to adapt to the given problem. Also, most of these configurations have an extensive number of free parameters that have to be estimated during training. With a limited amount of training data, a classifier without further restrictions to the model is likely to end up in a scenario known as the 'curse of dimensionality' [9, 13]. The curse of dimensionality refers to the case where the amount of training data is small compared to the dimension of the feature space. This leads to an improper description of the different classes and therefore often results in poor decoding performance. Despite these issues, it has been shown that HMMs can serve as a viable classifier if appropriate adaptations are applied. In a previous study [14] we investigated HMMs as an alternative method for BCI decoding purposes. We showed for a simple finger-tapping paradigm that HMMs can provide decoding accuracies similar to those of SVMs, which represent the gold-standard static classifier. Combining suitable features, rigorous restrictions to the model structure, and optimized, problem-adapted initialization, we were able to show that HMMs can compete with SVMs regarding decoding accuracy. This is especially true for high-accuracy cases, which might seem like a limitation at first glance. However, real-life BCI solutions demand low error rates in order to guarantee reliable function even in risky applications and to avoid user frustration [15].
Machine learning algorithms rely on information from large training ensembles. This becomes particularly important for noisy data and easily confusable classes. Small databases and noisy signals are typical in BCI research. Many studies try to compensate for this by long trial durations as well as restricted and highly distinguishable classes (see table 1) that favor less complex classifiers. These tasks are focused on pure decoding of class information in a condition where all trials of a class have the same properties (i.e. each class has only a single token). In such tasks, dynamic classifiers are not exploited to their full potential. Here, we increase the complexity of the problem by adding an additional attribute to the stimuli. Besides decoding the type of a stimulus ('quality'), we also aim at detecting its duration ('quantity'). Dynamic classifiers might be particularly suitable for this task, as they model the time course of given signals by default; static classifiers such as SVMs have no direct dynamic analogue for it. The aim of this work is to assess whether HMMs, as dynamic classifiers, after being trained for the quality classification, are also able to perform the task of quantity classification without any additional training effort. The routine we introduce here is stable and 'nearly' unsupervised. In section 2 of this paper, we present the methods and material used for this study, followed by the results and discussion in sections 3 and 4, respectively. Finally, we conclude our work in section 5.

Data acquisition and paradigm
The study is based on datasets from four different subjects, recorded with electrocorticography (ECoG) and magnetoencephalography (MEG). ECoG data were recorded from two volunteering patients (right-handed males) within a larger study that will be reported separately [20]. Both patients had been implanted with subdural electrode grids for clinical purposes in the course of pre-surgical planning of epilepsy treatment at Stanford, CA, USA. Electrode grids were placed solely based on clinical criteria and covered a variety of cortical areas including lateral occipital and medial temporal areas (for details see [20]). The patients gave their informed consent in advance of the recordings. The ECoG was recorded with a sampling frequency of 3051.7 Hz. For pre-processing, a high-pass filter (cut-off: 0.5 Hz) as well as a notch filter around the power line frequency (60 Hz) were applied to the data, and all electrodes were re-referenced to the common average reference. Afterwards, artifacts (e.g. epileptic activity, machine noise, loose contacts) were manually rejected by visual inspection. The resulting time series were epoched into individual trials covering the interval from −100 to 2000 ms with respect to picture onset times (see figure 1). MEG acquisition was performed with a whole-head BTi Magnes system (4D-Neuroimaging, San Diego, CA, USA). Data from 248 sensors were recorded from two healthy volunteers (male, age 26-28) with a sampling frequency of 1017.25 Hz. Artifacts (e.g. blinking, eye movement) were removed from the MEG data by visual inspection.

Table 1. Related HMM-based BCI studies: task, number of classes and trial duration (s).

Study           Task                                        Classes  Duration (s)
Obermaier [16]  left/right-hand imagination                 2        5
Chiappa [17]    left/right-hand imagination,                3        1-3
                mental generation of words
Lederman [18]   left/right-hand imagination                 2        6
                hand up-/downward movement imagination      2        5.5
Rezaei [19]     left/right-hand imagination                 2        ∼5
                5 mental tasks (10 binary problems)         2        10
Wissel [14]     finger tapping (4 fingers)                  4        ∼0.5
The signals were bipolar referenced and epoched into trials with the same interval (i.e. [−100, 2000 ms]) as for the ECoG data. The stimuli of the paradigm consisted of varying pictures from four categories (objects, faces, watches and clothes). Pictures were presented to the subjects for durations chosen randomly from five different time spans (300, 600, …, 1500 ms). The inter-trial interval varied in duration (600, 750, …, 1200 ms; see figure 1). To control attention, subjects responded to the presentation of a piece of clothing with a button press. These clothing target stimuli accounted for approximately 10% of the total trial count. The ECoG data were obtained from another study, for which the paradigm was originally designed. To avoid any bias in the decoding routine, the neural response to all stimulus durations needs to be fully covered. This is ensured if trial lengths are adapted to the longest appearing stimulus. With respect to the data (compare figure 4, bottom right, red line), we therefore chose to extend trial segments up to two seconds after stimulus onset. In this setup, combinations of brief picture presentation (e.g. 300 ms) with short inter-stimulus intervals (e.g. 600 ms) lead to additional stimuli at the end of a trial (see figure 1). As a consequence of this and our former findings [21], some parameters were slightly modified for the MEG recording. We changed the presentation durations to 500, 1000 and 1500 ms and increased the ISIs to 1500, 1650, …, 2100 ms. In order to focus on visual information only, target trials from the ECoG data have been omitted from the study, as they include motor responses; consequently, they were omitted from the MEG paradigm altogether. Stimuli were shown either on a projection screen (MEG data) positioned 1 m away from the subject or on a notebook screen (ECoG data) within the patient's reach. Detailed information on the number of recorded trials and rejections per class is listed in table 2.

Feature extraction and selection
In this study, features of two different types have been extracted: low-frequency time-domain (LFTD) features and high-gamma (HG) features.

Figure 1. The topmost snippet shows a typical trial segment from the ECoG experiment containing only a single stimulus. Due to the short presentation durations and inter-stimulus intervals used in that experiment, multiple stimuli can fall within one trial segment, as shown in the second snippet ('multi-stimulus issue'). The third example is taken from the MEG paradigm and shows the shortest possible combination of stimulus and inter-stimulus interval. Note that even in this 'worst-case' constellation, only a single stimulus (onset) is contained within one trial.

HG features were computed as the spectral power of the signals using a sliding Hann-window approach (window length = 250 ms). The square root of the power spectrum was computed by fast Fourier transform for each window and the resulting coefficients were averaged in the frequency band of 70-200 Hz. We computed HG features for 100 time points, which leads to a window overlap of approximately 92.5%. Consequently, the resulting features vary slowly over time (compare figure S1). Both feature extraction routines are described in detail in a previous study [14]. Channel selection is performed on training data only, using a two-stage approach. First, an algorithm based on the Davies-Bouldin index [14] is applied to select a predefined number of most informative channels for the picture category separation problem. The corresponding labels of the training data are used to select the channels. In a second step, channels containing information on the picture duration are selected using an unsupervised method as follows: raw data amplitude values of all trials are averaged separately for each channel. A channel is selected if the mean amplitude in the first 250 ms of the time series (i.e. up to 150 ms after image onset) exceeds a defined threshold.
That particular time frame was chosen manually to match the typical interval in which visual activation appears (compare figure S2). If more than the requested number of channels fulfill this condition, the threshold is iteratively increased until the channel count fits. Note that this algorithm does not depend on any labels. The number of channels selected by both methods has been fixed for all datasets (7 duration and 3 category channels). The total count of 10 selected channels is motivated by previous findings [14]: we use 3 channels to separate the different classes and the rest (i.e. 7 channels) for duration information. This choice represents a trade-off between optimizing category decoding performance (which would benefit from a higher category channel ratio) and incorporating the desired additional duration information. However, it has been made manually and cannot be justified theoretically.
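The two steps above, HG feature extraction and unsupervised duration-channel selection, can be sketched as follows. This is a minimal Python/NumPy illustration, not the study's MATLAB implementation; the function names, the initial threshold of zero and the relative threshold step are assumptions made for the sketch.

```python
import numpy as np

def hg_features(x, fs, n_points=100, win_ms=250, band=(70, 200)):
    """HG features: square root of the FFT power spectrum (i.e. the
    magnitude spectrum) per sliding Hann window, averaged over the
    70-200 Hz band, at n_points window positions across the trial."""
    win = int(round(win_ms / 1000.0 * fs))            # window length in samples
    starts = np.linspace(0, len(x) - win, n_points).astype(int)
    hann = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    feats = np.empty(n_points)
    for i, s in enumerate(starts):
        mag = np.abs(np.fft.rfft(hann * x[s:s + win]))  # sqrt of power spectrum
        feats[i] = mag[in_band].mean()
    return feats

def select_duration_channels(trials, n_select, n_samples=250, thr_step=0.05):
    """Unsupervised duration-channel selection: average the absolute
    amplitude over trials per channel within the first n_samples samples,
    then raise the threshold until at most n_select channels exceed it.
    trials: array of shape (n_trials, n_channels, n_timepoints)."""
    mean_amp = np.abs(trials).mean(axis=0)[:, :n_samples].mean(axis=1)
    thr = 0.0
    chans = np.flatnonzero(mean_amp > thr)
    while len(chans) > n_select:
        thr += thr_step * mean_amp.max()              # iteratively raise threshold
        chans = np.flatnonzero(mean_amp > thr)
    return chans
```

With a 250 ms window and 100 window positions over a ~2.1 s trial, the step between windows is about 18.7 ms, which reproduces the ~92.5% overlap stated above.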

Classification
The full procedure (see figure 2) contains a training of HMMs for each category class. Consequently, three HMMs (one each for objects, faces and watches) are built using the training dataset. A standard Baum-Welch algorithm is used for the estimation of all HMM parameters (i.e. transition matrices, state means and variances as well as priors). After training, classification is performed on the test set using a simple maximum-likelihood approach [14]. For each trial in the test set, the so-called Viterbi path of the HMM with the highest category likelihood is calculated. The Viterbi path is the state sequence of the corresponding HMM that is most likely to explain the given time series of features. Based on the Viterbi paths, duration information is extracted as follows. Using a pre-defined state threshold, the sample point s at which the Viterbi path first reaches this threshold is extracted. This point is mapped to the picture presentation duration d using a linear relation, d = m·s + n. The slope m in this relation is known, as it refers to the time increment from one feature sample to the next. The offset n in this equation is a parameter that is not known a priori at the present stage. It is introduced to compensate for the delay between picture offset and the corresponding HMM state changes. In order to determine it, we use the labels of one presentation duration only. Using this information, we calculate a representative (mean) sample point at which the Viterbi paths of trials with that particular duration typically reach the threshold. To compensate for fluctuations and outliers, we use a histogram-based method. We first calculate a rough histogram of the threshold crossings, dividing the sample point range into ten equally sized boxes b_i.
We then select the box with the highest count (b_max) and take the median of all sample points within 20 sample points of its center c as the representative crossing point; the offset n follows directly from this median and the known calibration duration. For comparison, category decoding is also performed with the gold-standard SVM. Classification is executed in a one-versus-one mode using linear kernels. The parameter C has been determined in an earlier study [21].
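The duration extraction and histogram-based offset calibration described above can be summarized in the following sketch. It is a Python/NumPy illustration under stated assumptions (function names are ours; tie handling and edge cases are not specified in the text), not the exact implementation used in the study.

```python
import numpy as np

def first_crossing(path, thr):
    """Index of the first feature sample at which the Viterbi path reaches
    the state threshold (None if the threshold is never reached)."""
    idx = np.flatnonzero(np.asarray(path) >= thr)
    return int(idx[0]) if idx.size else None

def estimate_offset(crossings, m, d_calib, n_bins=10, radius=20):
    """Estimate the offset n of d = m*s + n from the threshold crossings of
    calibration trials with one known duration d_calib: build a 10-box
    histogram, take the fullest box, and use the median of all crossings
    within `radius` samples of its center as the representative point."""
    s = np.asarray(crossings, dtype=float)
    counts, edges = np.histogram(s, bins=n_bins)
    c = 0.5 * (edges[:-1] + edges[1:])[counts.argmax()]  # center of fullest box
    s_rep = np.median(s[np.abs(s - c) <= radius])        # outlier-robust median
    return d_calib - m * s_rep

def decode_duration(path, thr, m, n):
    """Map the first threshold crossing to a duration via d = m*s + n."""
    s = first_crossing(path, thr)
    return None if s is None else m * s + n
```

The median within the fullest histogram box makes the calibration robust to single outlier trials whose paths cross the threshold far too late.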

Initialization and model constraints.
For the initialization of the HMMs we reused the algorithm presented in [14]. The routine is based on k-means clustering with several small extensions that allow for adaptations to specific needs originating from the nature of the data. The most important parameter for these adaptations is the 'coupling constant' τ, which introduces the time point of a feature sample as an additional dimension for clustering. High values of τ force the algorithm to strictly cluster feature values that are temporally close, while τ = 0 leads to an unmodified k-means clustering.
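The core idea of the coupling constant can be sketched as follows: append the τ-scaled time index as an extra dimension before running plain Lloyd's k-means. This is only a minimal illustration of the principle; the actual routine in [14] contains further extensions not reproduced here, and the function name is ours.

```python
import numpy as np

def coupled_kmeans_labels(features, k, tau, n_iter=50, seed=0):
    """k-means on feature vectors (temporal order assumed) with the time
    index appended as an extra dimension scaled by tau. High tau forces
    temporally adjacent samples into the same cluster; tau = 0 recovers
    an unmodified k-means. Returns one cluster label per sample."""
    features = np.asarray(features, dtype=float)
    t = np.arange(len(features), dtype=float)[:, None]
    X = np.hstack([features, tau * t])                 # augment with coupled time
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):                            # Lloyd's algorithm
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

For large τ the time dimension dominates the distance, so each cluster becomes a contiguous temporal segment, which is exactly the behavior wanted for initializing left-to-right HMM states.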
Since LFTD features are complex in their temporal structure compared to HG features, different values have been used for the two feature types (τ_LFTD = 25 and τ_HG = 0). Due to the limited amount of training data, introducing appropriate model constraints is critical to adapt the HMMs to the problem. We employed our former findings [14], introducing only small changes to adapt to the specific task. State transitions have been limited to one (instead of two) forward jumps to smooth the resulting Viterbi paths. For HG features, a single-step backward jump has been introduced, as it was found to compensate for the multi-stimulus issue mentioned above. This aspect is discussed in detail in section 4. Since back-jumping enables the re-use of early states, it allows the reduction of the total number of states to four (down from five). For LFTD features, back-jumping turned out to be not feasible, as it results in erratic Viterbi paths. This is most likely due to the complex structure of the LFTD features. A detailed discussion is also provided in section 4. Both topologies are illustrated in figure 3.
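Such topology constraints are commonly imposed by masking the transition matrix so that forbidden transitions stay at zero during Baum-Welch re-estimation. The following Python/NumPy sketch (our own helper, assumed for illustration) constructs the two masks described above.

```python
import numpy as np

def topology_mask(n_states, fwd_jump=1, back_jump=0):
    """Binary mask of allowed HMM transitions: self-loops plus up to
    `fwd_jump` forward steps and `back_jump` backward steps. Multiplying
    the Baum-Welch transition estimate by this mask (and renormalizing
    rows) keeps forbidden transitions at zero."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        lo = max(0, i - back_jump)
        hi = min(n_states, i + fwd_jump + 1)
        A[i, lo:hi] = 1.0
    return A

# LFTD topology: five states, pure left-to-right, single forward jump.
lftd_mask = topology_mask(5, fwd_jump=1, back_jump=0)
# HG topology: four states, single forward and single backward jump.
hg_mask = topology_mask(4, fwd_jump=1, back_jump=1)
```
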

Implementation.
The entire framework of this study was implemented in MATLAB R2012b (The MathWorks). HMM functionality is provided by the open-source toolbox by Kevin Murphy of the University of British Columbia [22]. SVM decoding is performed using the LibSVM library [23] for MATLAB.

Testing setup
All findings have been obtained using a five-fold cross-validation (CV) routine. Trials are assigned to specific fold sets using uniform random permutations with a balanced number of training samples per class. Balancing is required to ensure unbiased classification. In order to average out fluctuations in the decoding accuracies originating from varying allocations of trials to the folds, the CV procedure is repeated 20 times.
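A balanced fold assignment of this kind can be sketched as follows. The round-robin dealing after a per-class random permutation is one reasonable realization of the balancing described above (a sketch under that assumption, not necessarily the exact scheme used in the study).

```python
import numpy as np

def balanced_folds(labels, n_folds=5, seed=0):
    """Randomly permute the trials of each class and deal them round-robin
    into folds, so every fold holds a (near-)balanced number of trials per
    class. Returns a fold index (0..n_folds-1) per trial."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold
```

Repeating the whole CV with 20 different seeds and averaging the resulting accuracies corresponds to the 20 repetitions described above.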

Results
In previous studies we found that no informative HG features could be extracted from our MEG data [21]. Therefore, only LFTD features are used for the MEG datasets. As described above (see figure 1), the paradigm design used for ECoG acquisition leads to multiple stimuli within the time series of trials with briefer presentation durations. As a result of the complex structure of the LFTD features (compare supplementary figure S1) and the limited amount of training data available, the HMMs are not capable of distinguishing between true stimulus offsets and 'ghost' offsets resulting from the multi-stimulus issue. This is easily seen when multiple stimuli are eliminated by restricting the data to a shorter time interval. In this setup, longer picture presentations (i.e. 1200 and 1500 ms trials) are not fully covered and hence only brief-duration stimuli can be analyzed. Within this limitation, however, the Viterbi paths show exactly the expected behavior (see figure 4). As a consequence of these findings, we decided to leave out LFTD results for the ECoG data and focus on HG features only, which do not suffer from the above-mentioned problem. This aspect will be discussed later.

Category decoding
For all four datasets, the corresponding features have been extracted as described in section 2.2 and classification has been performed according to section 2.3. The results for the category decoding task are shown in figures 5 and 6. Decoding accuracies for the ECoG data reach 80% and 85% for HMMs and SVMs, respectively. For the MEG data, results are about 20% lower for HMMs and 10%-15% lower using the SVM decoder. While the performance gap between SVM and HMM is about 12%-15% for the MEG data, HMMs nearly reach the SVM results for the ECoG datasets (about 3%-6% difference). Detailed information on the decoding results can be taken from the confusion matrices (figure 6). The results show a consistent behavior across all datasets: face trials are detected best, and the majority of classification errors result from confusions of object and watch trials.
After category classification, Viterbi paths are extracted for all datasets (figure 7). For the purpose of better visualization and discussion of the general behavior in figure 7, paths have been averaged over all trials, separately for each of the presentation durations contained in the corresponding paradigm (i.e. 3 durations in the MEG and 5 in the ECoG datasets). Note that this average is not used for decoding. The paths' standard deviations as well as all single-trial paths are presented in supplementary figures S3 and S4. Apparently, the paths show a consistent behavior across subjects and data acquisition modalities, even on a single-trial basis. It is clearly visible that Viterbi paths are distinguishable for different presentation durations. Moreover, there is a strong indication that the time points at which the paths start to deviate from each other directly correlate with the corresponding presentation duration. This is analyzed explicitly in the following.

Extracting duration information from Viterbi paths
The procedure to extract duration information from the Viterbi paths is described in section 2.3. As a first step, the threshold needs to be specified. To optimally fit the actual scenario, the chosen values differ slightly due to the different structure of the two feature types. All values can be found in table 3. Using these values, the threshold crossings of all paths can be determined. The results are shown in figure 9. Afterwards, the offsets for duration decoding are estimated based on the threshold crossings of a single calibration class ('training'). To ensure unbiased results, each duration class was used once for calibration in a CV-like manner. Listed results (table 3) are averaged over all appearing duration classes (i.e. three for MEG and five for ECoG). In order to measure the accuracy of the decoded durations, root mean square errors (RMSE) with respect to the true durations have been computed. Individual decoding errors are visualized as histograms in figure 8. For easier comparability, predicted durations have been mapped to discrete classes via simple assignment to the nearest discrete duration value that appears in the paradigm (i.e. durations of 0-449 ms are assigned to the 300 ms class, 450-749 ms to the 600 ms class, etc). Classification accuracies are illustrated as confusion matrices in figure 10. Comparing figures 9 and 10, the confusion matrices can also be understood as a discretized form of the threshold crossings.
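The nearest-class discretization and the RMSE measure can be sketched as follows (Python/NumPy). The default class list corresponds to the five ECoG durations and would be (500, 1000, 1500) for the MEG paradigm; note that an exact tie at a class midpoint resolves to the shorter class in this sketch, an assumption on the boundary convention.

```python
import numpy as np

def to_duration_class(pred_ms, classes=(300, 600, 900, 1200, 1500)):
    """Map a continuous duration estimate (ms) to the nearest discrete
    duration value that appears in the paradigm."""
    classes = np.asarray(classes)
    return int(classes[np.abs(classes - pred_ms).argmin()])

def rmse(pred, true):
    """Root mean square error of predicted versus true durations (ms)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))
```
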

Discussion
The category decoding shows consistent behavior for all datasets. The confusion matrices reveal frequent confusions of watch and object images by the classifier. This might result from the similarity of both image types, since watches are a sub-category of objects. However, faces could be consistently decoded without prominent confusions with other categories, resulting in the highest accuracy among the three classes. This behavior is in accordance with other studies [24][25][26]. There is a slight superiority of the SVM decoder for the MEG data (figure 5). For the ECoG data, HMM and SVM provide comparable results. This is consistent with our former findings [14], suggesting that both classifiers are at the same level when dealing with high-quality features (i.e. in the upper accuracy range). The different properties of the two feature types used introduce the necessity to make adaptations in model structure and decoding strategy (e.g. the Viterbi threshold). LFTD features, as time-domain representatives, contain the ongoing brain background activity with superimposed task-related characteristics. Therefore, time segments following the offset of a picture are not expected to result in reproducible LFTD feature sequences. This complexity makes appropriate modeling of such feature sequences with HMMs more difficult. As a consequence, only very simple HMM topologies can be used considering the limited amount of available training data. We address this issue by choosing a pure left-to-right model. In contrast, HG features show a more convenient behavior (compare figure S1). This allows us to loosen the restrictions to the model and permit a backward jump without losing the smoothness of the Viterbi paths. Periods of picture presence lead to a significant increase in HG activity, and this activity quickly returns to its initial low value as soon as the picture presentation terminates.
In the HMM state sequence this is expressed in backward jumps to earlier states after termination of the picture. Since the decoding strategy identifies only the first crossing of the state threshold, a later change to higher states (and therefore possible further threshold crossings) has no influence on decoding quality. Hence, this routine is able to deal with the 'multi-stimulus issue' in the ECoG paradigm. In the absence of multiple stimuli within one trial, the backward jump is optional, thus exactly reproducing the setup from [14]. However, it would still lead to increased category decoding accuracy, since the model topology is simplified. As expected from the varying model structure, Viterbi paths of ECoG (HG features) and MEG data (LFTD features) show different behavior (see figure 7). The mean paths for the MEG datasets steadily increase up to approximately state three. From there, paths bend towards higher states with a larger slope. During that high-slope part, they cross state 4, which is used for duration decoding. After a second bend, back to a very slight slope, paths ascend slowly towards the final state, which acts as a kind of dump state accounting for periods of picture absence. Conversely, paths for the ECoG datasets make an early jump to state 2, associated with the picture onset. At the end of the image presentation, paths abruptly ascend to the highest state (4), where they form a more or less marked plateau. The subsequent decrease to state 2 can be interpreted as a return to the starting situation, which is eventually followed by additional increases ('multi-stimulus issue'). The described behavior of the MEG and ECoG Viterbi paths is found not only for averaged paths but also consistently on a single-trial basis (figure S4). This reliable behavior provides a solid foundation for the decoding of duration information. Applying the Viterbi path analysis, presentation durations have been extracted with RMSEs in the range of 123-279 ms.
For the best subject this results in a discrete accuracy of more than 82%. A more detailed view on the single classification results for each of the durations is provided by the confusion matrices. It becomes apparent that decoding accuracy decreases with increasing presentation duration. Additionally, there are confusions between the first two durations at a rate of 35%-40% in two of the datasets. These errors are most likely induced by the use of a fixed (constant) threshold for all time points. More sophisticated routines for the analysis of the Viterbi paths might solve these issues and will be the subject of future work. It is noteworthy that the ECoG Viterbi paths are already promising for potential online decoding, since it is to be expected that longer time segments (up to continuous feature streaming) would continue to transit between states 2 and 4 for periods of picture presentations and pauses. The offsets for duration decoding are estimated using a single calibration class. The results listed in table 4 show that in most cases (except subject MEG 1) the offset estimation provides very similar values for each calibration class. Moreover, results for the same feature type are comparable. This is an indication that the offset value might possibly be fixed across subjects, thus providing a fully unsupervised method. However, this needs to be evaluated on a larger number of subjects.

Conclusion
In this work, we show, to our knowledge, the first use of the beneficial properties of HMMs to extract additional stimulus information in the context of BCI decoding problems. We present a largely automated routine that shows consistent results across subjects and, with minor adaptations, across acquisition modalities. Our duration decoding results show strong correlations between Viterbi paths and picture presentation duration. For all four datasets, decoding accuracies substantially above chance level were obtained. The best dataset provided classification accuracies of more than 80% for both category and duration decoding with a single classifier trained only on category information. The presented approach of decoding 'quality' (category) and 'quantity' (duration) information may be transferred to everyday BCI tasks, e.g. movement direction and duration. However, for everyday BCI use, online decoding is essential. This can be addressed conveniently with HMMs, requiring only slight changes to the routines presented here. More sophisticated algorithms for Viterbi path analysis as well as simulations of online scenarios will be the focus of our upcoming work. This work emphasizes the benefit of dynamic classifiers compared to conventional static classifiers. In summary, it provides evidence that the properties of HMMs are efficient for such BCI applications.