Reconstructing Fine-Grained Cognition from Brain Activity

We describe the Sketch-and-Stitch method for bringing together a cognitive model and EEG to reconstruct the cognition of a subject. The method was tested in the context of a video game where the actions are highly interdependent and variable: simply changing whether a key was pressed or not for a 30th of a second can lead to a very different outcome. The Sketch level identifies the critical events in the game and the Stitch level fills in the detailed actions between these events. The critical events tend to produce robust EEG signals and the cognitive model provides probabilities of various transitions between critical events and the distribution of intervals between these events. This information can be combined in a hidden semi-Markov model that identifies the most probable sequence of critical events and when they happened. The Stitch level selects detailed actions from an extensive library of model games to produce these critical events. The decision about which sequence of actions to select from the library is made on the basis of how well they would produce weaker aspects of the EEG signal. The resulting approach can produce quite compelling replays of actual games from the EEG of a subject.


Introduction
The goal of this research is to track moment-by-moment what someone is thinking and doing over an extended period using the high temporal resolution of EEG. We will describe a method for achieving this goal that merges bottom-up information from classification of the EEG signal with top-down information from cognitive modeling. A great deal of research has studied classifying EEG signals and the results have been applied to a number of domains such as brain-computer interfaces (Lotte et al., 2018), emotion recognition (Kim et al., 2013), understanding human memory (Noh et al., 2014), and estimating workload (Brouwer et al., 2012), among others. With few exceptions (e.g., Su et al., 2018), this research involves tasks where the experimenter has control over the presentation of stimuli and examines activity in predefined intervals, typically locked to the presentation of these stimuli. However, in many realistic situations such as driving a car one does not have such experimental control and the sequence of events emerges as an interaction between the subject and the environment. Furthermore, the intervals between actions in such situations can be much shorter than the typical intervals used in most classification efforts.
To explore the tracking of mental state in such a context we chose video game play. There has been work on EEG and video games (e.g., Kerous et al., 2018), but typically focused on using traditional BCI methods to serve as a controller for the game. These applications typically leverage three types of EEG signals: (1) signals sensitive to the occurrence of rare events, such as the presentation of a letter that the individual is thinking of (i.e., the P300); (2) signals sensitive to how objects that the individual is attending to are presented (i.e., the SSVEP); and (3) signals sensitive to planned and imagined movement (i.e., the Mu rhythm). These studies demonstrate the potential of inferring control signals from EEG, yet they do not involve highly dynamic tasks with significant perceptual-motor demands.
There has been little focus on recognizing events that occur in free-flowing to select sequences of actions from a cognitive model (Anderson et al., 2019) that can play the game like actual players. The result is a reconstruction of the game that is coherent, human-like, and typically very similar to the game of the player whose EEG signal we are working from.

Space Fortress Game
The video game we studied was a variant of Space Fortress. This game has a long history in the study of skill acquisition and training methods, first being used in the late 1980s by a wide consortium of researchers (e.g., Donchin, 1989; Frederiksen & White, 1989; Gopher et al., 1989). Part (a) of Figure 1 illustrates the critical elements of the game. Players are instructed to fly a ship between the two hexagons. They are firing missiles at a fortress in the middle, while trying to avoid being hit by shells fired by the fortress. The ship flies in a frictionless space. To navigate, the player must combine thrusts in various directions to achieve a path around the fortress. Mastering navigation in the Space Fortress environment is challenging; while subjects are overwhelmingly video game players, most have no experience in navigating in a frictionless environment.
There have been EEG studies of Space Fortress. Maclin et al. (2011) recorded EEG from subjects as they played Space Fortress while concurrently performing a secondary task that involved counting rare auditory oddball stimuli. The amplitude of the P300 to rare stimuli in the oddball detection task increased following training on Space Fortress, while the amplitude of the P300 to stimuli in Space Fortress decreased. These results indicate that with training, the primary task of playing Space Fortress became less attentionally demanding, freeing resources for the secondary task. In subsequent work, Mathewson et al. (2012) found that event-related increases in frontal theta, an oscillation associated with attentional control, predicted individual differences in learning rate.
Together, these results show the importance of attention in Space Fortress and that there is a reduction in attentional demands with practice. We used the Autoturn version of the game introduced in Anderson et al. (2019). In this variant of the game, the ship is always aimed at the fortress and subjects do not have to turn it. The ship begins each game aimed at the fortress, at the position of the starting vector in Figure 1a, and flying at a moderate speed in the direction of the vector. To avoid having their ship destroyed, subjects must avoid hitting the inner or outer hexagons, and they must fly fast enough to prevent the fortress from aiming, firing at, and hitting the ship. When subjects are successful the ship goes around the fortress in a clockwise direction. They can destroy the fortress by shooting missiles at it to build up its vulnerability and then destroying it with a "kill shot" (two shots in rapid succession). If the fortress is destroyed it leaves the screen for 1 second before respawning. If the ship is destroyed it respawns after 1 second in the starting position flying along the starting vector. Our version of the game eliminated much of the complexity of scoring in the original game and just kept three rules:
1. Subjects gained 100 points every time they destroyed the fortress.
2. Subjects lost 100 points every time the ship was destroyed.
3. To reinforce accurate firing, every fire cost 2 points.
To keep subjects from being discouraged early on, their score never went negative. The replay site (http://andersonlab.net/reconstruction/) offers examples of game play. Anderson et al. (2019) found that subjects can achieve relatively high and fairly stable performance within an hour of playing AutoTurn (much faster than in original Space Fortress where subjects are also responsible for turning their ship among other things). To maintain a constant challenge of game play, a staircase procedure decreased the separation between the inner and outer hexagons as subjects got better. Subjects played 1-minute games. During the first 10 games the inner corners were 40 pixels from the center and the outer corners were 200 pixels from the center, producing a width of 160 pixels. After the tenth game, the border width was reduced by 10 pixels if the subject had 0 or 1 deaths in the prior game and it was increased by 30 pixels (to a maximum width of 160 pixels) if they had 2 or more deaths. In this way the death rate in the game was maintained at about 1 death per 1-minute game. For each 10 pixels the border is reduced, subjects get an additional 10 points for each fortress they destroy. Navigation becomes more difficult as one has to fly between narrower borders, with many deaths resulting from thrusting into the inner hexagon, a rare event with the original 160-pixel width.
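The staircase rule above can be summarized in a few lines of Python. This is only a sketch under stated assumptions: the function names are ours, and the 40-pixel floor is taken from the range of widths reported later for the model library rather than from the staircase description itself.

```python
MAX_WIDTH = 160  # pixels, the starting border width
MIN_WIDTH = 40   # pixels; assumed floor (narrowest width mentioned in the paper)


def next_width(width, deaths_in_prior_game):
    """Border width for the next game, applied only after the tenth game."""
    if deaths_in_prior_game <= 1:
        return max(MIN_WIDTH, width - 10)   # subject did well: narrow by 10
    return min(MAX_WIDTH, width + 30)       # 2 or more deaths: widen by 30


def points_per_kill(width):
    """100 points per fortress plus 10 for every 10 pixels of narrowing."""
    return 100 + (MAX_WIDTH - width)
```

For example, a subject at a width of 120 pixels earns 140 points per fortress, and two deaths in that game would send the width back to 150 pixels.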
The game advances at 30 ticks per second. Only two keys are pressed: a left-hand press of the W key to add thrust to the ship and a right-hand press of the space bar to fire at the fortress. Exactly when a player thrusts and fires is critical to performance. The difference of a single game tick can mean the difference between destroying the fortress and being destroyed. Critically, the impact of a key press depends on the past history of key presses as well: the consequence of a thrust depends on the ship's current position and flight path (determined by past thrusts) while the consequence of a fire depends on how preceding fires have affected the fortress's vulnerability.
Good performance involved mastering two skills: destroying the fortress and flying the ship in the frictionless environment. To destroy the fortress one must build up the vulnerability of the fortress (displayed at the bottom of the screen). When the vulnerability reaches 11, subjects can destroy the fortress by quickly firing an additional missile at it. Each fire increases the fortress's vulnerability by one, provided the fires are paced at least 250 ms apart. If the inter-fire interval is less than 250 ms the vulnerability is reset to 0 and one must begin the build-up of vulnerability anew. While subjects could easily make sure the fires building up vulnerability are at least 250 ms apart by putting long pauses between them, this would reduce the number of fortresses destroyed and points gained per game. Thus, subjects are motivated to pace the fires as close to 250 ms as they can without going below that 250 ms threshold and producing a reset. In contrast to the fires that build up the vulnerability, the fire to destroy the fortress must be less than 250 ms from the last fire.
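The firing rules in this paragraph amount to a small state update, sketched below in Python. The function name and return convention are illustrative assumptions, as is the choice to return a vulnerability of 0 after a kill; the game's actual implementation is not shown in the paper.

```python
KILL_THRESHOLD = 11      # vulnerability needed before a kill shot works
MIN_INTERVAL_MS = 250    # fires closer together than this count as "rapid"


def apply_fire(vulnerability, ms_since_last_fire):
    """Return (new_vulnerability, fortress_destroyed) after one fire.

    Pass float("inf") as the interval for the first fire of a sequence.
    """
    if ms_since_last_fire < MIN_INTERVAL_MS:
        if vulnerability >= KILL_THRESHOLD:
            return 0, True       # rapid fire on a vulnerable fortress: kill shot
        return 0, False          # rapid fire too early: reset
    return vulnerability + 1, False  # properly paced fire: build vulnerability
```

The same rapid double press is therefore either the best or the worst action available, depending entirely on the vulnerability accumulated by earlier fires.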
Since the ship is always aimed at the fortress, subjects do not need to turn their ship as in the original version of Space Fortress. To navigate around the fortress, they must press the thrust key at appropriate times and for appropriate durations. The direction of the ship after a thrust is determined by a vector sum of the current flight velocity and the acceleration they add in the direction of the fortress. The acceleration is determined by how long they hold the thrust key down. Subjects' average ship speed is a little over 1 pixel per game tick (the ship starts out flying at 1 pixel per game tick). Every game tick the thrust key is held down adds 0.3 units of speed in the current orientation of the ship (i.e., towards the fortress). As an example, suppose the ship is flying at 1.2 pixels per game tick, the angle between aim and ship direction (Thrust Angle in part a of Figure 1) is 120 degrees, and the thrust key is held down for 4 game ticks. The thrust will add a velocity of 1.2 pixels per game tick (4 ticks at 0.3 per tick) in the direction the ship is aimed. The resulting trajectory would still have a velocity of 1.2 pixels per game tick (more if the thrust angle was less than 120 degrees, less if it was more), and would now be in a direction that bisected the thrust angle. Thrusts at the wrong time or for the wrong duration can lead to death of the ship, which happens if the ship hits the inner or outer hexagons or if the ship flies so slowly the fortress can shoot it.
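The worked example above can be checked with a short vector sum. The sketch below is a simplification that we assume for illustration: it applies the whole thrust at once and treats the aim direction as fixed while the key is held, whereas in the game the thrust accumulates tick by tick as the ship moves. The function name is ours.

```python
import math


def thrust_result(speed, thrust_angle_deg, ticks_held, accel_per_tick=0.3):
    """Return (new_speed, turn_deg) after adding a thrust to the velocity.

    The current velocity lies along the x-axis; the thrust points at
    thrust_angle_deg from it. turn_deg is the angle between the old and
    new directions of flight.
    """
    push = accel_per_tick * ticks_held
    angle = math.radians(thrust_angle_deg)
    vx = speed + push * math.cos(angle)
    vy = push * math.sin(angle)
    return math.hypot(vx, vy), math.degrees(math.atan2(vy, vx))


new_speed, turn = thrust_result(1.2, 120, 4)
print(round(new_speed, 3), round(turn, 1))  # 1.2 60.0
```

The new direction is 60 degrees from the old one, which bisects the 120-degree thrust angle, and the speed is unchanged, as the text states.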

Overview of Sketch-and-Stitch Reconstruction
We developed the Sketch-and-Stitch method to infer a trace of the subject's cognition. While we apply the method here to a video game because that provides a demanding test, the underlying approach could be applied to any task. The method involves first developing a sketch of the critical mental events that happen occasionally during an extended task. To do this the method uses a combination of multivariate pattern analysis (MVPA) for identifying EEG patterns associated with the events and hidden semi-Markov models (HSMMs) for locating events in time. We have applied combinations of MVPA and HSMMs to parsing of fMRI data (e.g., Anderson et al., 2010, 2012) and to the processing of EEG and MEG data (e.g., Anderson et al., 2016, 2018), but nothing as time-critical as reconstructing video game play. After describing the approach taken in this paper, we will highlight its key features and innovations relative to past applications that enabled it to succeed in the task.
Having produced a hypothesis about when the critical events happened in a game, the Sketch-and-Stitch method then stitches in a detailed reconstruction of the subject's cognition that led to these critical events. In this video game these detailed steps of cognition are directly associated with a detailed trace of actions, providing a rigorous ground truth for judging the success of the effort. Stitching uses sequences of actions from runs of a simulation model that can produce human-like sequences of cognition. In our case, that simulation model is the ACT-R model described in Anderson et al. (2019), which produced a high-quality match to subject game play. Such a model (because it is stochastic like subjects) can be used to create a large library of candidate sequences for stitching between critical events. The Sketch-and-Stitch method selects among these candidate sequences according to how well they would produce EEG signals that match the subject.

Methods
This Methods section describes the source of all the information that goes into our application of the Sketch-and-Stitch procedure. The Results section will describe and assess how well this information can be used to reconstruct game play.

Subjects
A total of 25 subjects were recruited from the CMU population of students and researchers between the ages of 18 and 40. Five subjects were excluded because of poor performance (1 subject) and equipment problems (4 subjects), leaving 20 subjects (11 male, 9 female). All were right-handed. None reported a history of neurological impairment. Subjects were paid $75 for participation in the experiment, which lasted less than 2 hours.
All participants signed informed consent, and the experimental procedure and data handling were approved by the ethics committee of Carnegie Mellon University.

Game Play
After subjects studied game instructions, they played 60 1-minute games, choosing to move on to the next game at their own pace. The games varied in length from 1800 to 1820 ticks with a mean of 1819 game ticks (each game tick is a 30th of a second, making most games a little longer than 1 minute). Our analysis will focus on the first 1800 game ticks, or exactly 1 minute. The game records the state of the screen (where the ship is if alive, the direction and speed of movement, whether shells or missiles are on the screen, and whether a key is depressed) at each game tick. This serves as the ground truth both for training the decoder and for testing its predictions.

EEG Analysis
The EEG was recorded from 128 Ag-AgCl sintered electrodes (10-20 system) using a Biosemi Active II System (Biosemi, Amsterdam, Netherlands). The EEG was re-referenced online to the combined common mode sense (CMS) and driven right leg (DRL) circuit. Electrodes were also placed on the right and left mastoids. Scalp recordings were algebraically re-referenced offline to the average of the right and left mastoids. The EEG and EOG signals were filtered with a bandpass filter of 0.1 to 70.0 Hz and were digitized at 512 Hz. The vertical EOG was recorded as the potential between electrodes placed above and below the left eye, and the horizontal EOG was recorded as the potential between electrodes placed at the external canthi. The EEG recording was decomposed into independent components using the EEGLAB FastICA algorithm (Delorme & Makeig, 2004). Components associated with eye blinks were automatically identified and manually confirmed. All but the marked components were projected back to the EEG signal.
The EEG signal was recorded continuously for the entire experimental session and broken into 1-minute games. There was also a complete record of what happened in each game. Portions of the game periods identified as bad signal were excluded. Individual channels within an epoch were flagged based on having extreme values for mean absolute deviation, drift, or range. Flagged channels were interpolated. Epochs that still contained channels with extreme values after these steps were flagged and rejected. This resulted in loss of the signal for an average of 1.7 seconds per game for games used in the decoding (52.5% of the games had no lost signal; the worst game had 21.8 seconds of lost signal). This reflects a realistic complication in decoding where useful signal can be lost for some fraction of time.
After the processing above to produce the 512 Hz data, the EEG was down-sampled to 30 Hz to match the game ticks. This had the added benefit of making the data of a more manageable size. A one-second window around each game tick (14 game ticks before, the game tick, and 15 game ticks after) was used to classify whether a game tick contained a critical event. This means that each game tick had associated with it a vector of 30 × 128 = 3,840 electrode readings, representing regional effects, frequency effects (below 30 Hz), and their interactions. Because the vector associated with a game tick requires a complete signal for 1 second, game ticks at the beginning and end of a game do not have corresponding vectors, nor do game ticks in or near lost signal. Thus, 29 ticks at the beginning and end of the game have no vectors, as well as an average of 71.6 ticks in the vicinity of lost signals, leaving an average of 1,718.4 vectors per game. The available vectors for each game were z-scored to standardize them across games. To reduce dimensionality and filter out noise, the vectors for all games and subjects were subjected to a PCA analysis and the 1000 top dimensions were kept. Thus, we had an average of 1,718.4 1000-element vectors per game. These are what were used for all classification analyses.
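The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name and the list-of-lists data layout are assumptions, and the z-scoring and PCA steps are omitted.

```python
def tick_vectors(eeg, bad_ticks=frozenset(), before=14, after=15):
    """Build one flattened window per game tick.

    eeg is a list with one entry per game tick (30 Hz), each entry a list of
    channel readings. Ticks whose window runs off either end of the game, or
    touches a tick in bad_ticks (lost signal), get no vector.
    Returns {tick: flat list of (before + 1 + after) * n_channels readings}.
    """
    vectors = {}
    for t in range(before, len(eeg) - after):
        window = range(t - before, t + after + 1)
        if any(x in bad_ticks for x in window):
            continue
        vectors[t] = [value for x in window for value in eeg[x]]
    return vectors
```

With 128 channels each vector has 30 × 128 = 3,840 elements, and a game of T ticks with no lost signal yields T − 29 vectors. A single bad tick removes the 30 vectors whose windows contain it.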

Classification
All reconstruction efforts focused on the last 55 games, where performance is relatively stable. We excluded an additional 20 games of the remaining 1100 games because of particularly bad EEG signal or low activity by the subjects (1 game without good signal throughout, 8 further games where subjects failed to destroy a fortress without resetting or being killed, and 11 games with 12 or fewer critical events). This left 1080 games, which will serve as the focus of analyses.
We performed 4 classification analyses using the same 1000-element vectors produced by the PCA for game ticks in the 1080 games: to identify key presses, to identify critical events, to identify ship position, and to identify vulnerability changes. In each case we used the same leave-one-game-out approach: for a given target game of one subject, the training was done with all remaining games for that subject and all games for all other subjects. A linear discriminant classifier was trained to label the vectors of EEG activity with the category associated with the game tick that the vector describes. To reflect the fact that the target subject's own sensor activity may be most relevant, that subject's other games are weighted 15 times more than the games of other subjects. This was repeated for each game to get results for all 1080 games. We have explored neither different weightings of that subject relative to other subjects nor different classification methods. Thus, while the classification results are quite good, they probably are not the best possible, but they are good enough to enable the Sketch-and-Stitch method to achieve fairly high performance.
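The leave-one-game-out scheme with subject weighting can be sketched as below. The function name and the (subject, game) tuple representation are assumptions for illustration, and the paper does not say how the weights enter the linear discriminant fit, so only the assembly of the training set is shown.

```python
def training_weights(games, target, own_weight=15.0, other_weight=1.0):
    """Weights for every training game when `target` is held out.

    games is a list of (subject_id, game_id) pairs; target is one of them.
    Games from the target's subject get own_weight, all others other_weight.
    """
    target_subject = target[0]
    return {
        game: (own_weight if game[0] == target_subject else other_weight)
        for game in games
        if game != target
    }
```

Calling this once per game, training on the weighted remainder, and testing on the held-out game gives one classification result per game.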

Model
We used the same ACT-R model as described in Anderson et al. (2019).
To summarize the model: it starts with a declarative representation of the instructions about when to do what.This produces slow performance initially, but over time the model builds action rules that directly perform the actions in the appropriate situations (bypassing the need for declarative retrievals).
Critical to its performance in Autoturn are learning when to thrust and when to fire.A Controller module has been implemented within ACT-R that explores a range of values for when to fire and when to thrust and converges on appropriate settings, which it comes to exploit.The creation of action rules and the learning of control values for action underlie the improvement with practice in the model.
The behavior of the model is similar to subjects because it uses established ACT-R settings (on the basis of prior experiments) for the timing and variability of mental steps and motor execution. While this model was developed only for the 160-pixel-wide border separation, it generalizes to the narrower borders in this experiment because the model monitors for closeness to the borders.
We simulated 100 subjects by running the model 100 times for 60 games under the same game conditions as humans: as the model got better, the borders narrowed. If the model suffered more than one death in a game, the borders expanded. In addition, to collect enough games at each width to have a library for reconstruction, for each possible width 50 model runs of 60 games were executed at a fixed width. In all runs the model was learning and got better with later games. Since the first 5 games of subjects were excluded in the reconstruction efforts, we similarly excluded the first 5 games from each of these runs, yielding a library of 50 × 55 = 2,750 games at each border width. There are 13 possible widths from 40 to 160 pixels, making for a library of 2,750 × 13 = 35,750 model games to serve as a basis for reconstructing the 1080 subject games. Subjects vary in how tight a space they manage to fly in, but 13 of the subjects manage to reach a width of 70 pixels at some point and all but 1 reach 90 pixels (the other reaching 110 pixels). The 100 simulated subjects show a similar range, with 43 reaching 70 pixels and all but 9 reaching 90 pixels. The best subject reached 40 pixels while 3 of the 100 simulated subjects reached 40 pixels. Figure 3 shows how performance varies as a function of width (omitting the first 5 games where the rapid changes were taking place). Subjects earn somewhat more points with greater widths (Figure 3a). There is relatively little effect of width on the number of fortresses destroyed (Figure 3b) but a large effect on the number of deaths (Figure 3c). Speed is somewhat greater with wider borders (Figure 3d). The model also has these trends. As Figures 2 and 3 show and as is elaborated at length in Anderson et al. (2019), this model does a good job at capturing many aspects of human game play.
The rest of this Results section has 4 further subsections (second through fifth subsections) concerned with using this model and the EEG data to reconstruct game play. The second subsection describes a classifier for identifying key presses from the EEG signal and assesses whether this classifier's output can be used to reconstruct game play. The third subsection describes how a classifier for critical events can be combined with the model to create a sketch of the critical events in the game. The fourth subsection describes how two other classifiers can be combined with the model to stitch in the key presses between the critical events. The fifth subsection will evaluate how well this

Response-Based Classification
A direct model-free approach to reconstructing game play would be to try to recognize when the left (thrust) and right (fire) fingers are pressed. A fair amount of research has shown good discrimination between imagined left- and right-hand movements for application in BCI (Lotte et al., 2018) and classification of actual key presses by hand is also good (Krauledat et al., 2004). If we could correctly identify when the fingers are pressed we would be able to recreate the game play. Two features of the current task make it more challenging than the typical left-right discrimination. First, it requires not just a binary discrimination but rather a 4-way discrimination because the game includes ticks when neither finger is pressed (the majority: 68.2%) and when both fingers are pressed (rare: 0.2%). Second, the discrimination among these four categories must be made separately for every game tick (30th of a second).
Table 1 shows the classification performance on game ticks that have associated EEG vectors. Part a of the table shows the results when each game tick is assigned the label that makes the EEG vector most likely. Clearly, considerable discrimination can be achieved, as indicated by the large values on the main diagonal. The overall accuracy is 47% and the average pair-wise area under the curve (AUC) is .78. A d-prime (Wickens, 2002) measure of discriminability is .906. Only 25 of the 1080 games have negative d-primes. Excluding ticks with no or both keys pressed, a binary discrimination of ticks with fires from ticks with thrusts is better, as would be expected (d-prime 1.54; 77.8% accuracy; .861 AUC). Part b of Table 1 shows the 4-way classification results using the posterior probability that weights the likelihood by the prior probability of the category. The d-prime measure in this case is 1.03.
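The difference between the two labeling rules in Table 1 (most likely label versus most probable posterior label) can be illustrated with a small sketch. The function name is ours; the priors for "none" and "both" come from the percentages reported above, while the even split of the remainder between thrust and fire and the example likelihoods are hypothetical.

```python
def label_tick(likelihoods, priors=None):
    """Pick a label for one game tick.

    likelihoods maps each label to P(EEG vector | label). Without priors the
    maximum-likelihood label is returned; with priors, the label maximizing
    likelihood * prior (proportional to the posterior) is returned.
    """
    if priors is None:
        scores = dict(likelihoods)
    else:
        scores = {k: likelihoods[k] * priors[k] for k in likelihoods}
    return max(scores, key=scores.get)


# "none" and "both" priors are from the text; the thrust/fire split is assumed.
PRIORS = {"none": 0.682, "thrust": 0.158, "fire": 0.158, "both": 0.002}
example = {"none": 0.2, "thrust": 0.4, "fire": 0.3, "both": 0.1}

print(label_tick(example))          # thrust
print(label_tick(example, PRIORS))  # none
```

Weighting by the prior pulls ambiguous ticks toward the majority "none" category, which raises overall accuracy at the cost of missing some true key presses.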
The categorization uses a combination of spatial and frequency information over a 1-second interval surrounding a game tick and so the full patterns are complex. Figure 4 shows the average temporal pattern for three central electrodes and the full electrode patterns 5 ticks before and 10 ticks after the event.
At this level of aggregation the most prominent feature is that left frontal activation tends to drop after a fire and rise after a thrust. The difference between fire and thrust at the central electrode FZ 10 ticks after a thrust is highly significant (t(19) = 6.04, p < .0001). There is also greater right lateralized central negativity prior to a thrust and slightly greater left lateralized negativity prior to a fire, which would be the expected lateralized readiness potential opposite the hand pressing the key (e.g., Smulders et al., 2012). For instance, the difference between fire and thrust at the right central electrode F4 5 ticks before the thrust is highly significant (t(19) = 6.30, p < .0001). Notwithstanding these clear patterns that appear in average data, on single trials there is no electrode comparison on any game tick that provides even as much as 53% correct binary classification. The higher accuracy achieved by the classifier depends on more complex patterns.
How well could one use these key classifications to reconstruct the games?
As a test we took the posterior most probable class for each game tick and ran the resulting action sequence through the game. The average score was 13 points per game, in contrast to 1218 points achieved by the subjects who generated the EEG signal. Could one do better with a better classifier? To simulate a classifier that was "10 times" more accurate we generated a sequence of key activity that chose the true key presses for each game tick with probability .9 and otherwise the keys of the current classifier. The activity generated in this manner (which matches the actual press pattern on 97.2% of the game ticks) averages 92 points a game. We created a classifier that was "100 times" better by using the true key presses 99% of the time. The sequence of key activity generated in this manner (which matches the actual press pattern on 99.7% of the game ticks, unrealistically good) still fell short of human performance with 818 points per game. Although only about 5 or 6 game ticks (out of 1800 ticks in a 1-minute game) were being missed, these mistakes could include fires that created resets and threw the vulnerability count off. They could also include extra or lost thrusts that changed the direction of the ship. Once the vulnerability or direction was off, later actions did not have the same effect in the reconstructed game as in the original game. In conclusion, even with improvement in accuracy greater than seems possible with improved classification, it does not seem possible to reconstruct game play from direct action classification.
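The "10 times" and "100 times" better classifiers are mixtures of the true and decoded key sequences, which can be sketched as below. The function names are ours. The 72% baseline agreement used in the example is not reported directly; it is an assumption back-solved from the reported 97.2% match rate at a mixing probability of .9.

```python
import random


def mix_keys(true_keys, decoded_keys, p_true, rng):
    """Per tick, take the true key state with probability p_true,
    otherwise the classifier's decoded key state."""
    return [t if rng.random() < p_true else d
            for t, d in zip(true_keys, decoded_keys)]


def expected_match(p_true, base_match):
    """Expected fraction of ticks matching the true keys, given the
    classifier's own match rate base_match."""
    return p_true + (1.0 - p_true) * base_match


BASE = 0.72  # assumed baseline agreement, inferred from the 97.2% figure
for p in (0.9, 0.99):
    match = expected_match(p, BASE)
    print(p, round(match, 4), round(1800 * (1 - match), 1))
```

Under this assumption the 99% mixture matches about 99.7% of ticks, leaving roughly 5 mismatched ticks in an 1800-tick game, in line with the figures quoted above.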

Creating the Critical Sketch
Higher classification performance can be achieved if, rather than trying to classify every action, one just tries to classify the critical events that happen during the game. Five critical events could occur in the course of game play:
1. Kills. The fortress is destroyed and the player gains 100 points.
2. Fortress Respawns. The fortress is respawned after a second and the player can resume firing.
3. Deaths. The player's ship is destroyed and the player loses 100 points.
4. Ship Respawns. The ship is respawned a second after death and the player can resume thrusting and firing.
5. Resets. If the interval between fires is less than 250 ms and the vulnerability is less than 11, the fortress vulnerability will be set back to zero and the subject must begin building vulnerability anew.
The 1080 1-minute games in the pool averaged 9.38 kills, 0.86 deaths, and 1.15 resets.The numbers of critical events with EEG vectors are 9,742 kills, 9,731 fortress respawns, 819 deaths, 805 ship respawns, and 1,159 vulnerability resets.
In total, these are far less numerous than the thrust key and fire key events (row sums in Table 1). Figure 5 shows the EEG activity around the critical events, which produce much stronger signal changes than the key presses in Figure 4. Part a shows the 1-second window around a kill. There is a posterior positivity that reaches a maximum 100 ms before the kill, then a general negativity 100 ms after the kill, and then an anterior positivity about 300 ms after the kill. Part b, which is a continuation of Part a, shows the activity around the respawn of the fortress. There is a return to an anterior positivity about 300 ms after the fortress reappearance.
Parts c and d show the activity around a death and the respawn of the ship.
There is negativity in anticipation of the death, which switches to strong central positivity peaking 400 ms after the death. The positivity remains but becomes left lateralized after the ship respawns. Part e shows the response to a reset, where there is a strong central negativity 300 ms after the reset followed by an even stronger central positivity 500 ms after the reset. One feature common to kills, deaths, and resets (Parts a, c, and e) is a post-event positivity, although that positivity varies somewhat in its timing and distribution across the scalp.
The magnitude of this positivity varies with the rareness of the event, with the most common kills showing the smallest response and the least frequent deaths showing the largest response, as we would expect from a P300 (Polich, 2012).
The delay in the positivity for resets relative to the other two events may reflect the fact that resets are not anticipated until they occur.
The same classification method used for response classification was used to distinguish these 5 events from other game ticks. Given that critical events are rare, to prevent the classifier from being overwhelmed by non-events we limited the number of non-events in training the classifier by randomly choosing two non-critical game ticks for each critical game tick in the game. Once trained on other games, the classifier was applied to all game ticks in the target game. Part a of Table 2 shows classification of all game vectors by likelihood of the data (d-prime = 2.00) and Table 2b shows classification by posterior probability (d-prime = 2.18). The average accuracy in Table 2a is 59.6% and in Table 2b, which is thresholded to favor the majority null category, the average accuracy is 98.5%. The average pairwise AUC, which does not
While this is considerably better than classification of key presses (Table 1), these classification results by themselves would not produce particularly good game reconstructions. In Part a of Table 2, 97.5% of all labels are false alarms to game ticks that do not involve the ascribed event. In Part b of Table 2 the false labels are reduced to 70%, but now 84.4% of all critical events are missed.
In addition to these problems, identification of critical events alone does not provide the detailed behavior of the subject that produced them.
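The 2:1 limit on non-events in training can be illustrated with a minimal sketch. The function name `subsample_training_ticks`, the use of `None` to mark a non-critical tick, and the fixed seed are assumptions made for illustration; they are not taken from the original analysis code.

```python
import random

def subsample_training_ticks(labels, ratio=2, seed=0):
    """Pick training ticks for one game: every critical tick plus `ratio`
    randomly chosen non-critical ticks per critical tick.

    `labels` has one entry per game tick; None marks a non-critical tick
    (an assumed encoding, used here only for illustration).
    """
    rng = random.Random(seed)
    critical = [i for i, lab in enumerate(labels) if lab is not None]
    non_critical = [i for i, lab in enumerate(labels) if lab is None]
    # Never ask for more non-critical ticks than the game contains.
    n_keep = min(len(non_critical), ratio * len(critical))
    return sorted(critical + rng.sample(non_critical, n_keep))

# Toy game of 100 ticks with three critical events.
labels = [None] * 100
for tick, event in [(10, "kill"), (40, "death"), (70, "reset")]:
    labels[tick] = event
train_idx = subsample_training_ticks(labels)
```

With three critical ticks, the sketch keeps those three plus six randomly drawn non-critical ticks, so the null category can outnumber the critical events by at most two to one during training.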
While the classifier is not adequate in itself, it can be combined with statistical information from model runs to reconstruct fairly accurate critical sketches from the EEG signal, as illustrated in Figure 6. From the large library of model games it is possible to obtain reliable estimates of the probabilities of one event following another and how far apart these events are. These transition probabilities and event distributions can be used to parameterize an HSMM. Figure 7 shows the distributions of intervals between events in the model and the probability that one event will follow another. These distributions and probabilities also vary with the width between the borders (see Figure 3), with the most dramatic effect being on the probability of a death.
We used an HSMM to efficiently combine these model-based statistics and the conditional probabilities from the EEG classifier to estimate the most likely sequence of critical events in a game. Any sequence of events can be denoted $a_1, a_2, \ldots, a_n$ occurring at game ticks $t_1, t_2, \ldots, t_n$, where $a_1$ is the start of the game (hence $t_1$ is the first game tick), $a_n$ is the end (hence $t_n$ is the last game tick), and the rest are fortress kills and respawns, ship deaths and respawns, and resets. The following proportionality describes the probability of any such sequence relative to the probability of other sequences:

\[
\mathrm{Prob}(a_1, a_2, \ldots, a_n) \propto \prod_{i=1}^{n-1} \mathrm{trans}(a_i, a_{i+1}) \, f(t_{i+1} - t_i \mid a_i, a_{i+1}) \, P(\mathrm{EEG}(t_i + 1, t_{i+1}) \mid a_{i+1}) \tag{1}
\]

where $\mathrm{trans}(a_i, a_{i+1})$ is the probability of transition between the events $a_i$ and $a_{i+1}$, $f(t_{i+1} - t_i \mid a_i, a_{i+1})$ is the probability of the $t_{i+1} - t_i$ game ticks between the events $a_i$ and $a_{i+1}$, and $P(\mathrm{EEG}(t_i + 1, t_{i+1}) \mid a_{i+1})$ is the conditional probability of the EEG signal for this period if it ends in $a_{i+1}$. The conditional probabilities come from the classifier. Their use can be made much more efficient and robust by using the fact that, if the signals at different ticks were independent:

\[
P(\mathrm{EEG}(t_i + 1, t_{i+1}) \mid a_{i+1}) = \left[ \prod_{x = t_i + 1}^{t_{i+1}} P(\mathrm{EEG}(x) \mid \mathrm{Null}) \right] \frac{P(\mathrm{EEG}(t_{i+1}) \mid a_{i+1})}{P(\mathrm{EEG}(t_{i+1}) \mid \mathrm{Null})} \tag{2}
\]

The product involving the conditional probabilities $P(\mathrm{EEG}(x) \mid \mathrm{Null})$ will be the same for all sequences of critical events and can be ignored in determining which sequence has the highest proportionality. The only thing that matters is the conditional probability of the EEG signal on game tick $t_{i+1}$ if critical event $a_{i+1}$ happened relative to the probability of the signal if there were no critical event. Therefore we can rewrite Proportionality 1 as

\[
\mathrm{Prob}(a_1, a_2, \ldots, a_n) \propto \prod_{i=1}^{n-1} \mathrm{trans}(a_i, a_{i+1}) \, f(t_{i+1} - t_i \mid a_i, a_{i+1}) \, \frac{P(\mathrm{EEG}(t_{i+1}) \mid a_{i+1})}{P(\mathrm{EEG}(t_{i+1}) \mid \mathrm{Null})} \tag{3}
\]

The probability of any particular sequence will only involve about 20 critical events and thus only 20 ratios of conditional probabilities of the EEG on particular game ticks. These ratios are many game ticks apart (on average about 100 game ticks or about 3 seconds). While the ratios for adjacent game ticks are highly correlated due to the temporal correlation of the EEG signal, the ratios at these distances are not. Thus, the independence assumption underlying an HSMM will be approximately correct. The HSMM-MVPA applies to all game ticks, including the 4.7% that do not have a corresponding signal vector. For these game ticks the ratios were 1 for all critical events. Thus, for these stretches of time without signal, the only information about possible events comes from the transition probabilities and distribution of delays between events.
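In log form, the rewritten proportionality is a sum over consecutive event pairs of a log transition probability, a log interval probability, and a log likelihood ratio. The sketch below scores one candidate sequence in this way. The function name, the dictionary layout for transitions and ratios, and the flat interval distribution in the example are illustrative assumptions, not the paper's implementation.

```python
import math

def sequence_log_score(events, ticks, trans, interval_prob, ratio):
    """Unnormalised log score of one candidate sequence of critical events.

    events        : labels a_1 .. a_n (first is game start, last is game end)
    ticks         : game ticks t_1 .. t_n at which those events occur
    trans         : dict mapping (a, b) to the transition probability
    interval_prob : function (gap, a, b) giving the probability of `gap` ticks
    ratio         : dict mapping (tick, event) to P(EEG|event) / P(EEG|Null);
                    a missing entry counts as 1, as for ticks with no signal
    """
    score = 0.0
    for i in range(len(events) - 1):
        a, b = events[i], events[i + 1]
        gap = ticks[i + 1] - ticks[i]
        score += math.log(trans[(a, b)])
        score += math.log(interval_prob(gap, a, b))
        score += math.log(ratio.get((ticks[i + 1], b), 1.0))
    return score

def flat_interval(gap, a, b):
    # Toy assumption: every interval length has probability 0.01.
    return 0.01

trans = {("start", "kill"): 0.5, ("kill", "end"): 0.5, ("start", "end"): 0.01}
ratio = {(50, "kill"): 8.0}  # EEG at tick 50 favours a kill 8 to 1

with_kill = sequence_log_score(["start", "kill", "end"], [0, 50, 99],
                               trans, flat_interval, ratio)
without_kill = sequence_log_score(["start", "end"], [0, 99],
                                  trans, flat_interval, ratio)
```

In this toy case the sequence containing the kill scores higher than the sequence without it, because the likelihood ratio at tick 50 and the more probable transitions outweigh the cost of the extra interval term.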
We used the Viterbi algorithm (Rabiner, 1989) for hidden semi-Markov models to find the assignment of events that maximized $\mathrm{Prob}(a_1, a_2, \ldots, a_n)$. This produced for each game a set of inferred events and the game ticks at which they occurred. For each game the match between the assigned events and the actual events was calculated as a sum of a recall and a precision measure (Buckland & Gey, 1994) calculated from locations of kills, deaths, and resets (since the respawns of the fortress and ship were tied to the kills and deaths). The measure of recall focused on the events that occurred in the game and identified the closest predicted event. If that event type matched and was within 2.5 seconds, it was scored according to how many game ticks it was away; thus the maximum score was 75. If the closest event was further away or failed to match, it was also scored 75. The average of these recall scores for a game can vary from 0 (perfect match of all events) to 75 (worst possible). The measure of precision applied the same scoring procedure but now started with all predicted events and found the closest actual event. These recall and precision measures were identical for 379 of the 1080 games, but were somewhat different for the rest because of differences between the actual game and the reconstruction in the number of critical events or their timing. Even when they are different they do tend to be correlated (r = .624). The average measure of recall was 14.1 and the average measure of precision was 11.8, for a total average match rating of 25.9 out of 150.
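The recall and precision scoring just described can be written down compactly. The following is a minimal sketch, assuming events are stored as `(tick, type)` pairs, that ties in distance go to the first candidate listed, and that an empty list of reference events scores 0; none of these details are specified in the text.

```python
def one_sided_score(reference, candidates, window=75):
    """Average mismatch score for `reference` events against `candidates`.

    Each reference event is paired with the closest candidate in time.
    The score is the distance in game ticks when the types match and the
    distance is within `window` ticks (2.5 s at 30 ticks/s), else `window`.
    """
    if not reference:
        return 0.0  # assumption: nothing to recover means no mismatch
    scores = []
    for t_ref, kind_ref in reference:
        if not candidates:
            scores.append(window)
            continue
        t_near, kind_near = min(candidates, key=lambda e: abs(e[0] - t_ref))
        dist = abs(t_near - t_ref)
        ok = kind_near == kind_ref and dist <= window
        scores.append(dist if ok else window)
    return sum(scores) / len(scores)

def match_rating(actual, predicted, window=75):
    """Return (recall, precision, total); lower is better, total is at most 150."""
    recall = one_sided_score(actual, predicted, window)
    precision = one_sided_score(predicted, actual, window)
    return recall, precision, recall + precision

actual = [(100, "kill"), (300, "death"), (600, "kill")]
predicted = [(103, "kill"), (300, "death"), (800, "kill")]
recall, precision, total = match_rating(actual, predicted)
# recall == precision == 26.0 and total == 52.0 for this toy game
```

In the toy game the first kill is off by 3 ticks, the death is exact, and the last kill is 200 ticks away, so it is charged the maximum of 75 on both the recall and the precision side.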
Figure 8 shows the distribution of the recall and precision scores. Summing recall and precision, the average match rating is 25.9. To provide a chance measure one can use the average rating between actual events and predicted events of other games, which is 95.2. Of the 1080 games, 847 are best matched by the game reconstructed from their own EEG signal rather than by any other reconstructed game.
Nine games have all critical events predicted perfectly to the game tick. The mean rank of the reconstruction of a game out of the 1080 reconstructions is 7.7.
The range of possible rankings is 1 to 1080; if reconstructions were randomly assigned to games the expected ranking (chance) would be 540.5. Figure 9 shows a pair of games to illustrate the range of prediction. Part a is the 276th best-matched game with a rating of 10.6. All events but the last kill are closely predicted; the model does not complete its last kill by game's end.
Part b is the 787th best-matched game with a score of 36.4. Eight of the actual events (6 kills and 2 deaths) are identified relatively accurately. However, one death, one reset, and 2 kills are missed, while one kill and one death are predicted that did not occur.
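The ranking of a game's own reconstruction, and the chance value of 540.5, can be made concrete with a short sketch. The function names and the treatment of ties (a tied reconstruction does not count as better) are assumptions for illustration.

```python
def own_rank(ratings_row, own_index):
    """Rank of a game's own reconstruction among all reconstructions.

    ratings_row[j] is the match rating between this actual game and
    reconstruction j (lower is better); rank 1 means the game's own
    reconstruction matches it best.
    """
    own = ratings_row[own_index]
    better = sum(1 for j, r in enumerate(ratings_row)
                 if j != own_index and r < own)
    return 1 + better

def chance_rank(n_games):
    """Expected rank if reconstructions were assigned to games at random."""
    return (n_games + 1) / 2

# Toy example: ratings of one actual game against four reconstructions,
# where index 3 is the game's own reconstruction.
row = [42.0, 10.6, 95.2, 36.4]
rank = own_rank(row, 3)        # one other reconstruction is better, so rank 2
chance = chance_rank(1080)     # 540.5
```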
Table 2 reported how well the classifier labeled game ticks. We constructed a similar matrix from the results of the Viterbi algorithm, looking at how well it classifies each game tick. Table 3 presents those results, which have a d-prime of 2.75, superior to both matrices in Table 2. All 20 subjects are classified better using the Viterbi algorithm than using just the classifier (either method in Table 2). Moreover, the Viterbi algorithm results include classification of game ticks without a signal vector (the 1000 PCA values, which are not available when EEG is lost). Also, Figure 9 shows that even if the Viterbi algorithm does not identify the exact game tick it often identifies a nearby game tick (e.g., the cases in the figure where the predicted and observed kills are slightly offset).
Most important and not reflected in Table 3, the resulting positioning of events renders a coherent interpretation of game play, which is to say that one can create sequences of key presses that would produce these critical events. While this is a better result, it leaves us short of being able to reconstruct the detailed behavior of the subject. For that we need stitching, which we will describe next.

Stitching in Detailed Game Play
Stitching involves finding sequences of key presses (fires and thrusts) in simulated games that would produce the same timing of critical events as in reconstructed sketches like those in Figure 9. It is unlikely that one would find a complete simulated game whose critical sketch perfectly matched a reconstructed sketch. Even though there were 2750 simulated games for each border width, no simulated game had a critical sketch that matched the critical sketch of any other simulated game or actual subject game (which makes the fact that 9 of the Viterbi-reconstructed critical sketches perfectly matched the actual games quite remarkable). Rather than looking for complete games that matched, the stitching procedure just searched for segments of simulated games that could reproduce segments of the reconstructed sketch. Given that the effects of firing and thrusting are nearly independent, the procedure selected thrust patterns and fire patterns from the simulated games independently. (They are not fully independent because thrusting controls the distance of the ship, which determines when a shot hits the fortress, potentially turning a vulnerability reset into an increment or vice versa.)
Thrusts determine the flight path of the ship. The overall path in a game can be divided into a number of segments that begin with the ship flying in a starting configuration and end either in a death or with the end of the game. Figure 10 illustrates how thrusts are chosen from the model games for a segment in a critical sketch. For critical sketches of games with no deaths, there is no shortage of simulated games without deaths for that border width that can serve as candidate segments (which will all span the full game).
Beginning segments of these games also provide thrust sequences for segments in critical sketches that go from respawning of the ship after a death to the end of the game. The challenge is to find segments from the model that match segments in critical sketches that go from the appearance of the ship to a death.
There are nearly 1800 possible numbers of game ticks that can pass until death (the case illustrated in Figure 10). Even if they were equally likely (which they are not), ten times more simulations than the current 35,750 simulated games in the library would be required to come close to guaranteeing model runs with a death at each game tick for each width. As it was, model ship deaths did not occur at 32% of possible game ticks in the inferred critical sketches. We relaxed the criterion and selected any deaths within 5 ticks of the desired duration.
This reduced the number of missing cases to 2%. Forty-two cases were still not covered, and for these we expanded the allowed difference until there was a case. With these relaxations, there were on average 21 candidate segments ending in death for each death in a critical sketch. Unlike the case with the earlier action classifier, these segments involved plausible sequences of thrusts that produced a plausible flight pattern even if it ended in a death. The EEG signal was used to select among the candidate model thrust segments, exploiting information that the signal provides about where the ship is on the screen. Part a of Figure 11 shows the scalp activity when the ship is in different 30-degree sectors. We trained a classifier to recognize which sector the ship was in, plus when the ship was off the screen due to a death. We used this classifier for ship position to select among different sequences of game thrusts produced by the model. We fed the candidate sequences of thrusts in these segments into the game engine to produce flight paths and chose the sequence whose flight path had the greatest summed log probability [...]
[...] game ticks:
1-11: Game ticks with fires that raised the vulnerability to 1 through 11. In the averaged Figure 12a these would be the game ticks associated with the first 11 vertical lines.
12: Game ticks with fires that raised the vulnerability to more than 11 but were too slow to destroy the fortress (not shown in part a of Figure 12).
13: Game ticks with fast fires that killed the fortress after reaching a vulnerability of at least 11 (the last vertical line, numbered 12, in part a of Figure 12).
14: Game ticks with fast fires that reset the vulnerability to 0 before reaching a vulnerability of at least 11 (not shown in part a of Figure 12; typically the fire will be about 150 ms before the reset, which is time 0 in part e of Figure 5).
This reflects the importance of aligning the thrusts and fires with the identification of the critical events. If the reconstructed deaths are aligned with the actual deaths, the ships will be set back to the starting position at the same time. If the reconstructed kills are aligned with the actual kills, the vulnerability buildups will start at the same points.
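The selection of a thrust segment ending in a death, as described earlier in this section (a tolerance of 5 ticks, widened when no segment qualifies, followed by a choice based on summed log probability of ship position), can be sketched as follows. The dictionary layout of a segment, the one-tick widening step, and the assumption that each candidate flight path has already been aligned to the length of the classifier output are all illustrative choices, not details given in the text.

```python
def candidate_segments(library, target_ticks, tolerance=5):
    """Return (segments, tolerance_used) for segments whose time to death is
    within `tolerance` ticks of `target_ticks`, widening the tolerance one
    tick at a time (an assumed step size) until at least one is found."""
    if not library:
        raise ValueError("segment library is empty")
    while True:
        hits = [seg for seg in library
                if abs(seg["ticks"] - target_ticks) <= tolerance]
        if hits:
            return hits, tolerance
        tolerance += 1

def best_segment(candidates, sector_log_prob):
    """Choose the candidate whose flight path has the greatest summed log
    probability; sector_log_prob[t][s] is the classifier's log probability
    that the ship is in sector s at tick t. Paths are assumed to be no
    longer than the classifier output."""
    def summed(seg):
        return sum(sector_log_prob[t][s] for t, s in enumerate(seg["sectors"]))
    return max(candidates, key=summed)

# Toy library: two 3-tick segments and one 12-tick segment.
library = [
    {"id": "a", "ticks": 3, "sectors": [0, 0, 1]},
    {"id": "b", "ticks": 3, "sectors": [0, 1, 1]},
    {"id": "c", "ticks": 12, "sectors": [0] * 12},
]
sector_log_prob = [{0: -0.1, 1: -2.3}, {0: -1.6, 1: -0.2}, {0: -2.0, 1: -0.1}]

hits, used = candidate_segments(library, target_ticks=3)
chosen = best_segment(hits, sector_log_prob)
widened, used_wide = candidate_segments(library, target_ticks=20)
```

In the toy library, segments "a" and "b" both qualify for a 3-tick target and "b" is chosen because its path agrees better with the classifier, while a 20-tick target finds nothing within 5 ticks and the tolerance grows to 8 before segment "c" qualifies.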

Summary Evaluation of Reconstruction Accuracy
The combination of EEG classification and the model can do well at reconstructing the critical events (Figures 8 and 9) and capturing where the ship is in space (Figure 11) and what the fortress vulnerability is (Figure 12). To provide a summary measure of the overall correspondence between actual games and their reconstructions we used an equally weighted sum of three factors:
1. The z-score of the rating of the critical events (Figures 8 and 9). As noted earlier, the reconstruction scores well on this measure, with 818 of the 1080 games best predicted by their reconstruction and a mean ranking of the reconstruction of 8.4.
2. The z-score of the angular disparity between the position of the reconstructed ship and the actual ship, averaged across the game ticks. While much better than chance, this detail is less well reconstructed, with only 106 of the 1080 games best predicted by their reconstruction and a mean rank of 176.2 out of 1080.
The focus in this paper has been reproduction of the details of game play.
A less detailed reproduction may be adequate for many purposes. For instance, one might just want to assess how well the person is playing the game. One can use these detailed reconstructions for such summary evaluations. Figure 13 shows the correspondence between the final scores of reconstructed and actual games, for which the correlation is .848. For comparison we investigated the correlation with the games produced with Random Stitching and the Key Classifier (used in Figures 11 and 12). The correlations were .853 with the output of Random Stitching and .019 with the output of the Key Classifier. The fact that there is little difference between informed and random stitching indicates that the correlation in Figure 13 really depends on getting the kills, deaths, and resets right and not on the details of the actions in between. Given the poor performance of the Key Classifier in playing the game, its low correlation with participant score is not surprising. Earlier we described hypothetical Key Classifiers that were 10 and 100 times more accurate than our classifier. Their correlations were .243 and .687, still less than that obtained by stitching sequences of actions to match the critical sketch.

Discussion
The success of the Sketch-and-Stitch approach depends critically on having both the EEG signal from the subject and a good model of human behavior. The variability in game play is enormous and no two games are identical, eliminating any chance that a model by itself could predict what subjects are going to do.
Even though we had a library of over 35,000 games, none matched a subject game, nor did any two subject games match. On the other hand, while the EEG signal allows for much better than chance classification of the different aspects of game play, that information by itself is also not good enough to enable successful reconstruction.
The Sketch-and-Stitch method uses information from classification and the model at both the Sketch and the Stitch level. The Sketch level takes advantage of both the strong signals associated with critical events and statistical information about the order and timing of those events from the model. The Stitch level required the model to provide sequences of actions of the appropriate length to stitch in and selected among these according to their EEG signatures. Even if the procedure failed to choose the right detail for a period between two critical events, the reconstructed game stayed on a path that was consistent with the sketch. Thus, even after a period of significant mismatch between reconstruction and subject, the two would often come back to close correspondence.
At an abstract level, the approach in this paper bears similarities to BCI efforts for merging statistical regularities in language with EEG signals (Speier et al., 2016). These regularities can be leveraged to tailor the presentation of letters on an external device based on their base rates and posterior probabilities given the currently typed word. Likewise, these regularities can be used to support inference of intended spelling using EEG signals (Mora-Cortes et al., 2014). Our cognitive model effectively provides a "grammar" of gameplay and action. More generally, using such cognitive models could enable successful application of EEG decoding to a vastly wider range of tasks.
While we have focused on reproducing the subjects' actions because that is where we have the ground truth of the game record, the models also make commitments about the cognitive processes giving rise to these actions. For instance, as described in Anderson et al. (2019), decisions about how to pace fires are made by an evolving sense of the appropriate pacing. One could infer the threshold a subject is currently using to time fires by considering the threshold in the model segment that has been stitched in. Similarly, one could infer the subject's threshold for thrusting from the threshold in the inferred model segment. If there were alternative strategies for game play, one could infer the subject's strategy from which strategy provided the matching segments for stitching. One could use these inferences to provide feedback on what the subject needed to do to improve their performance.
The Space Fortress game has provided a good testbed for judging reconstruction from EEG. The domain is one where small differences in the timing of events can greatly affect the course of subsequent events. There is a strong sequential dependence where the earlier actions can change the consequences of later ones. The approach described in this paper handles these challenges well, and is capable of creating compelling reconstructions of game play. The approach combines training of classifiers to detect events with use of a computational model to provide the statistical information about the distribution of events and the details about how these events were achieved. While reconstruction performance is good, it might be improved by better approaches to classification or by tuning cognitive models to individual players' styles. However, the current results do establish the potential of combining classification and cognitive models to reconstruct mental activity.
The complete game record of Space Fortress provides a reliable ground truth for training classifiers and for judging the success of reconstructions. In other work (e.g., Anderson et al., 2016) we have had similar success in parsing EEG signals using the HSMM-MVPA method without such training events.
Rather, HSMM-MVPA discovered critical events in the absence of labeled training events. However, in the earlier work the intervals were much briefer and the EEG signal was constrained to be a simple ERP-like bump. Although those conditions accurately represent the majority of ERP studies with brief trials, this unsupervised method did not scale to the current situation with much longer intervals. Error in identifying events without supervision increases as one moves from the beginning or end of an interval such that replicable ERP-like bumps are no longer found. The current approach is appropriate for longer intervals provided that one has access to ground truth from some data to train the classifiers for other data.

Figure 1: (a) The Space Fortress screen, showing the inner and outer hexagon, a missile fired at the fortress, and a shell fired at the ship. The distance from the center (fortress) to the corners of the outer hexagon is 200 pixels and the distance to the corners of the inner hexagon is 40 pixels. The ship starts 120 pixels to the left of the center, flying at 30 pixels per second, parallel to the upper left side of the hexagon. The dotted lines illustrate an example path during one game. (b) A schematic representation of critical values for firing and flight control.

Figure 2 shows how various measures changed over the course of the 60 games for the experiment participants and the 100 simulated subjects. Part (a) tracks the width of the space between the two hexagons. This is held constant at 160 pixels by the experiment for the first 10 games, after which the staircase process sets in. The width then decreases until it is bouncing around an average of about 100 pixels. Points and kills (Parts b and c) increase rapidly over the first 10 or so games and then increase more gradually. Ship deaths drop rapidly over the first 10 games before rising to about 1 death per game, which is the goal of the staircase procedure (Part d). Unlike human subjects, the model flies fairly safely from the beginning. Once the staircase procedure sets in, both humans and the models show the expected rate of about 1 death per game.

Figure 2: Mean values (line) and standard errors (area around lines) per game for subjects and models as a function of game: (a) border width; (b) points before bonuses for kills at narrow borders; (c) number of fortress destructions; (d) number of deaths.

Figure 3: Mean values (line) and standard errors (area around lines) per game for subjects and models as a function of border width (a) points before bonuses for kills at narrow borders; (b) number of fortress destructions; (c) number of deaths; (d) speed of ship.Because of the few cases of 40 pixels (1 game for subjects, 4 for models), these data are averaged in with 50 pixels (16 games for subjects, 20 games for models).

Figure 4: (a) EEG activity around a tick with the fire key (right hand) depressed.(b) EEG activity around a tick with the thrust key (left hand) depressed.There are 30 game ticks per second.Shaded areas represent a standard error of the mean calculated from the standard deviation of the subject means.Scalp profiles are for -133 ms and +333 ms (-4 and +10 game ticks).

Figure 5: (a, b) EEG activity around the destruction of the fortress and its respawn with the scalp profiles ranging from -4 to 4 microvolts.(c, d) EEG activity around the death of the ship and its respawn with the scalp profiles ranging from -10 to 10 microvolts.(e) EEG activity around a reset of the vulnerability with the scalp profiles ranging from -6 to 6 microvolts.Shaded areas represent a standard error of the mean calculated from the standard deviation of the subject means.

Figure 6: Information from model and subject combined to create the critical sketch. On the left, runs of the model generate large numbers of game records from which critical sketches are produced. The statistics from these critical events are used to parameterize a semi-Markov model. On the right, signal vectors derived from an actual game's EEG are passed through a classifier trained on other games to determine the likelihood of critical events in a game. These probabilities are passed through the semi-Markov model to produce a critical sketch.

Figure 7: Probabilities of transitions (given as "p=") between events and distribution of intervals between events in the simulations.The statistics displayed are averaged over border widths from 40 to 160 pixels (the actual sketching procedure uses statistics specific to a width).

Figure 8: Distributions of recall and precision ratings on a scale where 0 is perfect match between game and reconstructed critical events and 75 is the maximum possible mismatch score.

Figure 9: Reconstructions of critical events for two games.

Figure 10: Illustration of how a thrust sequence is selected for a predicted sequence of critical events that goes from a ship spawn to a death or game end. On the left, runs of the model generate large numbers of flight segments. On the right, a classifier trained on other games assigns probabilities to flight positions in actual subject games. Thrust sequences from the model are chosen for segments of the critical sketches that are associated with the most probable positions.

Figure 11: (a) Scalp profiles when the ship is in various 30-degree sectors of the hexagon space. (b) Disparity between assigned angle and actual angle using different criteria for reconstructing ship position.
We then assigned each game tick to the category that maximized the probability of the EEG signal associated with that tick. This resulted in a d-prime for discrimination among the 13 categories of 0.57, an accuracy in choosing among the 13 categories of 17.2%, and an average pairwise AUC of .738. Part a of Figure 11 appears to show greater left-right discrimination than top-bottom discrimination, although the classification uses richer 1-second patterns than these single snapshots. A binary discrimination of the two left sectors from the two right sectors shows a d-prime of 1.33 (74.7% accuracy; AUC of .825), while a binary discrimination of the two top sectors from the two bottom sectors shows a d-prime of 0.87 (65.3% accuracy; AUC of .75).

Figure 12: (a) EEG warped to the period of an average kill (4.12 seconds; only games with exactly 12 fires are included). The black lines show the fires that increase the vulnerability to 11, with the last fire being the fire (12) that destroys the fortress. They are numbered with the vulnerability they produce. The EEG profiles between vulnerability changes have been warped to the average vulnerability intervals. The scalp profiles are for 0.4 seconds, 2.25 seconds, and 4.0 seconds. (b) Distance from an assigned fire to a fire with a matching vulnerability in the actual game. Negative values mean that the fire came before the matching actual fire and positive values mean it came after.

15: Game ticks with both the ship and fortress present without a fire (almost 75% of all ticks are in this category).
16: Game ticks with the fortress absent because of a kill (ticks covering 0 [...]
While Informed Stitching, which uses both the game library and the classifier, does best of the 4 alternatives at identifying both ship position (Part b of Figure 11) and vulnerability (Part b of Figure 12), it is more important to have a coherent segment of the right length (Random Stitching) than it is to use the results of the stitching classifiers alone (Position and Vulnerability Classifiers).

Figure 13: Relationship between the score of the 1080 actual games and their reconstructions.
3. The z-score of the difference between the fortress vulnerability in the reconstructed game and the fortress vulnerability in the actual game, averaged across the game ticks. 803 of the games are best matched by their reconstruction and the mean rank is 14.2.
While we weighted these three factors equally in coming up with a combined measure, they seem ordered as above in how compelling the reconstruction is as a match to the original game. By this combined measure 846 of the games are best predicted by their reconstruction and the mean ranking of the reconstruction is 6.1. The highest ranked seem quite compelling as reconstructions of the original games. Only one was worse as a reconstruction of the game than the majority of the reconstructions of other games.
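The equally weighted combination of the three factors can be illustrated with a brief sketch. The text does not say over which population the z-scores were computed; the version below assumes they are taken across the candidate reconstructions for one actual game and uses the population standard deviation. Both choices, like the function names, are assumptions.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardise a list of values (population standard deviation assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def combined_measure(event_rating, angle_disparity, vulnerability_diff):
    """Equally weighted sum of the three z-scored factors.

    Each argument holds one value per candidate reconstruction for a single
    actual game; on every factor a lower value means a closer match.
    """
    columns = [zscores(event_rating),
               zscores(angle_disparity),
               zscores(vulnerability_diff)]
    return [sum(triple) for triple in zip(*columns)]

# Toy example with three candidate reconstructions of one game.
event_rating = [10.0, 90.0, 100.0]
angle_disparity = [40.0, 50.0, 90.0]
vulnerability_diff = [1.0, 4.0, 5.0]
combined = combined_measure(event_rating, angle_disparity, vulnerability_diff)
best = min(range(len(combined)), key=combined.__getitem__)
```

Here the first candidate is lowest on all three factors, so it also has the lowest combined value and would be ranked first.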

Table 1: Classification of Game Ticks according to Key Press Activity
[...] sketch-and-stitch combination does at reconstructing game play.

Table 2: Classification of Game Ticks according to Critical Events (F refers to Fortress and S refers to Ship)

Table 3: HSMM-MVPA Labeling of Critical Events (F refers to Fortress and S refers to Ship)