How prior knowledge affects problem-solving performance in a medical simulation game: Using game-logs and eye-tracking

Computer-based simulation games provide an environment to train complex problem-solving skills. Yet, it is largely unknown how the in-game performance of learners varies with different levels of prior knowledge. Based on theories of complex-skill acquisition (e.g., 4C/ID), we derive four performance aspects that prior knowledge may affect: (1) systematicity in approach, (2) accuracy in visual attention and motor reactions, (3) speed in performance, and (4) cognitive load. This study aims to empirically test whether prior knowledge affects these four aspects of performance in a medical simulation game for resuscitation skills training. Participants were 24 medical professionals (experts, with high prior knowledge) and 22 medical students (novices, with low prior knowledge). After pre-training, they all played one scenario, during which game-logs and eye-movements were collected. A cognitive-load questionnaire ensued. During game play, experts demonstrated a more systematic approach, higher accuracy in visual selection and motor reaction, and a higher performance speed than novices. Their reported levels of cognitive load were lower. These results indicate that prior knowledge has a substantial impact on performance in simulation games, opening up the possibility of using our measures for performance


Introduction
Computer-based simulation games (CBSGs) are effective learning environments for complex skills. As simulations, they approximately replicate the complexity of real-life situations (Koivisto, Niemi, Multisilta, & Eriksson, 2017). As computer games, they provide a package of problems that are causally connected, based on learners' interaction with the game (Kiili, 2005). In this simulated problem-solving environment, learners can train specific professional skills in areas such as aviation, business management, and medicine (Dankbaar et al., 2016; De Freitas, 2006; Hernández-Lara, Perera-Lluna, & Serradell-López, 2019). However, CBSGs face a challenge in that the performance of a learner in the game is difficult to assess via traditional measurements such as achievement tests (Kang, Liu, & Qu, 2017). This challenge is mainly due to the open-ended nature of CBSGs (Squire, 2008), which allows for a large number of different behaviors. Therefore, recent research has focused on tracking users' in-game behaviors by looking at game data such as serious game analytics (Kang et al., 2017; Loh, Sheng, & Ifenthaler, 2015; Wallner & Kriglstein, 2013). These studies identified several limitations: data analysis without involving educational theoretical principles often fails to fully account for students' performance (Kang et al., 2017), game-logs without translation into high-level meaningful actions can yield confounding information (Zhou, Xu, Nesbit, & Winne, 2010), some important factors such as timing cannot be explained by analyzing sequences of events only (Clark, Martinez-Garza, Biswas, Luecht, & Sengupta, 2012), and empirical studies about how game data can be informative for performance assessment are scarce (Hou, 2015; Kang et al., 2017).
We believe that theories of complex-skill acquisition might help to develop performance assessments in open-ended game environments. We view the playing of a CBSG as a problem-solving process in which domain-specific prior knowledge (DSPK) has an essential role. DSPK comprises knowledge structures in long-term memory, also known as cognitive schemas (Bartlett, 1995). Without these schemas, learners depend on domain-general problem-solving strategies which are inefficient and time-consuming and, most importantly, hamper the schema construction processes (Van Merriënboer, 2013). This means that playing a CBSG without sufficient DSPK might lead to suboptimal learning. The goal of this study is to empirically examine the effect of DSPK on game performance by comparing learners with two distinct levels of DSPK: learners with high DSPK (i.e., experts) and learners with low DSPK (i.e., novices).
https://doi.org/10.1016/j.chb.2019.05.035 Received 13 February 2019; Received in revised form 30 April 2019; Accepted 29 May 2019
Expert-novice differences in the use of problem-solving strategies have been investigated in multiple studies, suggesting various indicators of these differences (Donovan & Litchfield, 2013; Manning, Ethell, Donovan, & Crawford, 2006; McLaughlin, Bond, Hughes, McConnell, & McFadden, 2017). However, these indicators are highly conditional because problem-solving strategies vary greatly depending on domains and task environments (Ericsson, Hoffman, Kozbelt, & Williams, 2018; Liversedge, Gilchrist, & Everling, 2011, Chapter 30). To make indicators informative to education, they should be developed within an integral theoretical framework in education, specialized to a given task environment via careful task analysis, and validated by empirical studies. Given that the task environments of CBSGs are exceedingly dynamic and the tasks require interactions of performers with the environment, this study demonstrates how to develop specific indicators for a CBSG based on complex-skill acquisition theories.
In this introduction, we will first theoretically compare how experts and novices generate problem solutions, suggesting aspects of problem-solving performance that are directly affected by the level of DSPK. We will then discuss how to define indicators of these aspects by decomposing the skill structure hierarchically. Finally, we present the hypotheses of this study.
Fig. 1 provides a process model that shows how experts and novices generate problem solutions differently, adopting concepts from the four-component instructional design (4C/ID) model (Van Merriënboer & Kirschner, 2018). The process involves two types of knowledge in long-term memory: domain models (i.e., schemas of how a domain is organized) at declarative level, and cognitive strategies (i.e., schemas of how to approach problems in the domain) at procedural level. Assume a continuum with novices with low DSPK at one extreme and experts with high DSPK at the other extreme. For novices, since their domain models are not yet structured, weak methods (i.e., slow and inefficient general problem-solving strategies such as general search or working backward) (Newell & Simon, 1972) are the only cognitive strategies they can use when solving a problem. This leads to inefficient approaches to the problem, and also to procedures with incorrect cognitive rules at the level of task performance (i.e., solution generated). For experts, on the other hand, well-structured domain models are interpreted and transformed into two types of stronger cognitive strategies: knowledge-based methods (i.e., heuristic strategies) and strong methods (i.e., algorithmic strategies) (Van Merriënboer, 2013). Knowledge-based methods guide students to reason within the domain and systematically approach nonroutine aspects of the problem (i.e., systematic approach). When a certain aspect of the given task is consistently repeated (i.e., routine aspects of tasks), cognitive if-then rules may be formed as strong methods. These rules provide algorithmic solutions to routine aspects of the task by matching conditional information in working memory (i.e., if part) with a coordinated reaction (i.e., then part), resulting in procedures with correct cognitive rules at the task performance level. As a function of extensive practice, the cognitive rules can be strengthened and eventually become fully automatized, leading to higher speed in performance (Palmeri, 1999).

Problem solution generation by experts and novices
Additionally, the schemas embodied in long-term memory cause one more distinction in task performance between experts and novices: reduced cognitive load resulting from optimized use of working memory. Problem-solving with weak methods imposes a heavy demand on cognitive resources in working memory (Sweller, Clark, & Kirschner, 2010), introducing high cognitive load or even cognitive overload (Sweller, 1988). However, with the availability of knowledge-based methods, cognitive schemas relevant for problem-solving are stored in long-term memory and retrieved into working memory as one element. Moreover, with fully automatized strong methods, cognitive schemas are activated directly without placing any demand on working memory resources, which further frees up working memory (Sweller, van Merriënboer, & Paas, 2019).
Consequently, we derive four constructs that represent aspects of task performance that are affected by DSPK: (1) systematicity in task approach (i.e., representation of acquired strategies), (2) accuracy in applying cognitive if-then rules (i.e., representation of formed cognitive rules), (3) speed in performance (i.e., representation of the strength of those rules), and (4) reduced level of cognitive load (i.e., representation of optimized process).
For example, in emergency medicine, an expert with knowledge and strategies in the domain would approach an emergency case systematically by reasoning in terms of priorities of interventions (systematicity in task approach). As for algorithmic rule-based aspects of the case (e.g., if the patient's oxygen level decreases, then apply oxygenation), the expert would perform the rule without errors (accuracy in applying cognitive rules). Since the expert has extensively practiced this rule, speed should be high (speed in performance). In addition, the expert would experience low cognitive load because knowledge from long-term memory can be applied directly (reduced level of cognitive load).
J.Y. Lee, et al. Computers in Human Behavior 99 (2019) 268-277

Skill decomposition
While the four aspects of task performance are applicable to all task environments, indicators to assess these aspects will be specific to a particular task environment. Researchers have strongly recommended that, to assess a certain task performance, constituent skills and their relationships should be identified in a process of skill decomposition (Gagne, 1968; Van Merriënboer & Kirschner, 2018). We deem that skill decomposition allows identification of the indicators of the four constructs mentioned above to be precise and theoretically sound.
The domain of this study is a resuscitation procedure, called the ABCDE method. The five letters ABCDE represent the five phases (i.e., Airway, Breathing, Circulation, Disabilities, Exposure) that a task performer goes through sequentially to stabilize an acutely ill patient. The sequence should be rigidly followed, based on the principle "treat first what kills first". AbcdeSIM (Erasmus University Medical Center & VirtualMedSchool, 2012), a CBSG for training the ABCDE method, is employed as the task environment. We decompose the task and develop a skill hierarchy by using Lee and Anderson's task analysis method (Lee & Anderson, 2001) (Fig. 2). In the hierarchy, the task-goal (i.e., stabilization of patient) is gradually divided into three levels: unit-task level, functional level, and physical level. To achieve the task-goal, unit-tasks (i.e., the five phases in the ABCDE method) are arranged accordingly. Each unit-task comprises multiple sub-tasks at the functional level (i.e., diagnosis and intervention). Every functional task is linked to individual activities at the physical level (e.g., look at "VFM", click "Talk to nurse"). What one can empirically measure is this physical level only, while other levels represent cognitive performance.
This hierarchy guides us in the development of the indicators of the four constructs, by identifying different task levels. For the first construct (systematicity), a systematic approach in the ABCDE principle can be defined as a high level of adherence to the order of the five phases at the unit-task level. The challenge of measuring this construct is that the systematic approach is not directly observable at the physical level. To see this, note that the knowledge-based methods deal with non-routine aspects of a task, using the same knowledge differently based on rules-of-thumb (Van Merriënboer, 2013). An action that is associated with a certain ABCDE phase can also be taken during other phases strategically. Thus, an irregular ABCDE sequence observed at the physical level does not necessarily represent irregular performance at other levels. Consequently, the indicator of this construct should concern the hidden cognitive performance at the unit-task level, rather than analyzing the physical level only.
For the second construct (accuracy in applying cognitive rules), we recall that cognitive rules consist of if and then parts. In CBSGs in general, the if part emerges as information gathering via visual selection (i.e., looking at a particular part of the screen), while the then part corresponds to motor reaction to the task environment (e.g., mouse clicks or keyboard input). The accuracy in visual selection and motor reaction might reside at the functional level in the skill hierarchy. This construct can be directly detected by observing the physical level, because the strong methods deal with routine aspects of a task, referring to the same use of the same knowledge (Van Merriënboer, 2013).
The third construct, the strength of cognitive rules, is situated in the connection between the sub-tasks at the functional level. If this connection is strong and stable, certain motor reactions at the end of a series of cognitive rules will be performed fast. The indicator of the strength should be the speed of this motor reaction, observed again at the physical level.
Lastly, the construct of reduced cognitive load originates from well-structured cognitive schemas. The entire structure of the skill hierarchy shows how the relevant cognitive schemas in long-term memory are developed, resulting in optimized use of working memory. The degree of optimization can be indicated by the level of cognitive load.

Hypotheses
Four hypotheses will be tested in the current study:
H1 (systematicity in approach). Participants with high DSPK (i.e., experts) show higher systematicity in approach than participants with low DSPK (i.e., novices) during performance in the AbcdeSIM, demonstrating a higher level of adherence to the ABCDE phases.
H2 (accuracy in applying cognitive rules). Experts show higher accuracy in visual selection by allocating more visual attention to critical diagnosis areas (H2a) and in motor reactions by completing more interventions (H2b) than novices.
H3 (speed in performance). Experts show higher speed in performance by completing interventions faster than novices.
H4 (reduced cognitive load). Experts experience lower cognitive load than novices.
Table 1 provides an overview of the constructs, hypotheses, and indicators. Details of indicators are described in the following method section.

Participants and design
Participants (N = 46) were recruited on a voluntary basis from a medical center in the Netherlands. For the expert group, 24 residents in their second to fifth (final) year of residency training with an average of 3.1 years of experience in emergency departments (SD = 1.6) were recruited (Md = 29 years with a range from 26 to 44; 17 females). For the novice group, 22 medical students in their second to sixth academic year who had been taught the basics of emergency medicine but had received no training were recruited (Md = 23 years with a range from 20 to 26; 12 females). A causal-comparative design is adopted with the level of expertise as the single factor.

AbcdeSIM game set-up
AbcdeSIM is a medical simulation game to train the ABCDE method for resuscitation. The game starts with a storyline where users meet a virtual patient in an emergency room. The users are provided with tools for diagnosis (e.g., stethoscope, penlight) and intervention (e.g., infusion fluids, medication). Human physiology (e.g., respiration, circulation) of the patient is implemented in the game, giving feedback on users' interventions. A regular adult patient scenario, hemorrhagic shock due to gastrointestinal bleeding (GIB), was used. GIB is a scenario where learners should follow the basic ABCDE method, with most emphasis on the circulation phase. During the experiment, the game was run on a personal computer (Intel Core i7 2.67 GHz CPU, 1.98 GB RAM) and presented on a Dell 22" LCD screen with a resolution of 1650 × 1080 pixels. Participants used a headset for sound effects, and interaction with the simulation was done via the mouse.

Eye-tracking and game-log recording
The game log data, containing user-input (e.g., tools that participant used, actions taken), changes in the game (e.g., patient's physiological changes), and time stamps, were saved in JSON file format (www.json.org). Participants' eye movements were measured by an SMI RED remote eye-tracker (SensoMotoric Instruments GmbH, Teltow, Germany) with a sampling rate of 250 Hz. The SMI Experiment Center 3.5 software (version 3.2.11, www.smivision.com) was used to implement calibration, validation, stimulus presentation, and screen recording. Eye movement data was gathered via SMI iView X software (version 2.7.13).

Cognitive load questionnaire
The NASA Task Load Index (NASA-TLX) (Hart & Staveland, 1988) was used as a validated self-report questionnaire of cognitive load. It is a mental workload assessment tool for human-machine interaction domains such as aviation and aeronautics (Shamo, Dror, & Degani, 1999), healthcare (Weinger et al., 2000), and socio-technical fields (Grigg, Garrett, & Benson, 2012; Warm, Matthews, & Finomore Jr, 2017). The NASA-TLX provides an overall workload score with six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration. Certain wordings of the questionnaire were adapted to fit the game environment.

Procedure
Individual sessions were carried out in an eye-tracking laboratory at Maastricht University. First, participants were asked to sign an informed consent form and fill out a questionnaire about demographics and experience in emergency medicine. Then, a pre-training was provided to ensure that the level of game-specific knowledge (i.e., how to operate the game) was comparable between the expert and novice groups. After pre-training, participants were given additional time to play around with a test scenario, to allow them to familiarize themselves with the game. When participants expressed their readiness, the GIB scenario was presented. The eye-tracking system was calibrated with a 9-point procedure, and validation followed directly. Participants had to stabilize the virtual patient in a maximum of 15 min, shown with a timer visible on screen during the entire session. As time pressure is an intrinsic component of cognitive load in medical emergencies, we controlled for the time pressure by measuring temporal demand, which is one of the six subscales of the NASA-TLX. After the scenario, participants filled out the NASA-TLX. The average time to complete a session was about 50 min.

Data analysis
For testing H1, H2b, and H3, the data from game-logs was used, while eye-tracking data was employed for testing H2a and H4. Parsing of the game-logs was performed using Python. Eye-tracking data of three experts and two novices were excluded due to low tracking ratio

Systematicity in approach (H1)
We consider Hidden Markov Models (HMM) a suitable method to develop a score for measuring systematicity in approach, since they can be used to model hidden state transitions (i.e., phase arrangement at the unit-task level) based on a sequence of emission states (i.e., arrangement of motor reactions observed at the physical level) (Baum, Petrie, Soules, & Weiss, 1970). The probability structure resulting from fitting the HMM to participant data contains information about the level of the adherence to the ABCDE sequence in hidden states. We used this probability structure to compute our score for systematicity in approach.
To do this, first, we classified the functional tasks of the GIB scenario into each of the ABCDE phases. Then, user-input data relevant to these functional tasks was extracted from the raw data in the game log file. The extracted data comprises the emission state sequences of ABCDE for each participant. An HMM was fitted to the sequences, resulting in a probability structure with two matrices: a state transition probability matrix and an emission probability matrix. From these matrices, we calculated the HMM score by averaging the sum of the diagonal and upper co-diagonal in the state transition matrix and the diagonal sum of the emission probability matrix (see Appendix for a complete explanation of the HMM score computation).
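One possible reading of this score computation can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation (the Appendix holds the complete definition): it assumes five hidden states ordered A to E, a square emission matrix over the same five labels, normalization of each sum by the number of states so that the result lies between 0 and 1, and equal weighting of the transition and emission parts. The function name `hmm_score` is hypothetical.

```python
def hmm_score(transition, emission):
    """Return an adherence score in [0, 1] from two square row-stochastic matrices.

    Assumption: rows and columns are ordered A, B, C, D, E, so staying in a
    phase (diagonal) or moving to the next phase (upper co-diagonal) counts
    as adherence to the ABCDE order.
    """
    n = len(transition)
    # Transition part: probability mass on "stay" or "advance by one phase".
    trans_sum = 0.0
    for i in range(n):
        trans_sum += transition[i][i]
        if i + 1 < n:
            trans_sum += transition[i][i + 1]
    # Emission part: probability that a hidden phase emits its own actions.
    emis_sum = sum(emission[i][i] for i in range(n))
    # Assumption: normalize each sum by n, then average the two parts.
    return (trans_sum / n + emis_sum / n) / 2.0


# Hypothetical matrices: a strictly sequential player versus a uniform one.
strict_transition = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
uniform = [[0.2] * 5 for _ in range(5)]
```

Under these assumptions, a player whose hidden phases only ever stay or advance, and whose actions always match the current phase, receives the maximum score, while a player with uniform transition and emission probabilities scores much lower.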

Accuracy in visual selection (H2a) and motor reaction (H2b)
Research in visual science reports that, compared to novices, experts allocate more attention to task-relevant than task-redundant areas (Gibson, 2014; Haider & Frensch, 1999; Reingold & Sheridan, 2011, p. 528). However, in a real-life simulation such as the AbcdeSIM, areas with information cannot simply be dichotomized as relevant versus redundant. Information and game functions are compactly organized within the limited area of the screen, and the level of relevance gradually differs. Thus, we categorized the regions of the screen into four groups in consultation with a medical professional: critical diagnosis area with critically relevant information for diagnosis (CDA), non-critical diagnosis area with information relevant for diagnosis to some extent but not critical (NDA), intervention area with functions for intervention (IVA), and neutral area with additional functions such as connecting different information (NA) (Fig. 3). We hypothesized that experts allocate more attentional resources to CDA than novices, thus formulating H2a.
All area groups mentioned above comprised areas of interest (AOIs) forming the basis of the eye-tracking data analysis (Holmqvist et al., 2011, Chapter 6). Since the appearance and layout of these areas dynamically change according to users' input and activated game function, we adapted the AOIs accordingly. The raw eye-tracking data was analyzed by SMI BeGaze 3.6 software. Fixations were identified when the gaze velocity was less than 40 visual degrees per second, with a minimum duration of 50 ms.
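The two fixation criteria can be illustrated with a simple velocity-threshold filter. This is a minimal sketch and not the BeGaze event-detection algorithm: it assumes gaze samples are already expressed in visual degrees, arrive at a fixed 250 Hz, and contain no missing data. The function name `detect_fixations` is hypothetical.

```python
from math import hypot


def detect_fixations(samples, rate_hz=250.0, max_velocity=40.0, min_duration_ms=50.0):
    """Return (start_index, end_index) pairs, inclusive, of detected fixations.

    samples: list of (x_deg, y_deg) gaze positions at a fixed sampling rate.
    A run of samples counts as a fixation when the sample-to-sample velocity
    stays below max_velocity (deg/s) for at least min_duration_ms.
    """
    dt = 1.0 / rate_hz
    # Velocity is defined between consecutive samples; the first sample has
    # no predecessor and is treated as slow (an assumption of this sketch).
    slow = [True]
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        slow.append(hypot(x1 - x0, y1 - y0) / dt < max_velocity)

    fixations = []
    start = None
    for i, is_slow in enumerate(slow + [False]):  # trailing sentinel closes the last run
        if is_slow and start is None:
            start = i
        elif not is_slow and start is not None:
            duration_ms = (i - start) * dt * 1000.0
            if duration_ms >= min_duration_ms:
                fixations.append((start, i - 1))
            start = None
    return fixations


# Hypothetical trace: 20 still samples (80 ms), a 5-degree jump, 5 still samples (too short).
trace = [(0.0, 0.0)] * 20 + [(5.0, 0.0)] * 5
```

In the hypothetical trace, only the first 80 ms run qualifies as a fixation; the run after the jump is shorter than 50 ms and is discarded.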
Three eye-movement measures are employed: dwell time (gaze visiting time for an AOI from entry to exit), fixation count (number of fixations on an AOI), and fixation duration (time duration when the eye is relatively still at a position). Each measure is expected to capture different aspects of attentional resources. Dwell time indicates the time that a participant spent fixating on an AOI, where constituent metrics are not decomposed (Orquin & Holmqvist, 2018). Fixation count indicates frequency of reference to the stimulus (Orquin & Loose, 2013), while longer duration of fixations can mean a deeper cognitive processing (Holmqvist et al., 2011, Chapter 11). To make the measures comparable across participants, relative values for each AOI group were calculated: the dwell time for each AOI group was divided by total play time, while the fixation count was divided by total number of fixations during the entire scenario. The mean fixation duration for each AOI group was calculated.
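The relative measures described above can be sketched as follows. This is a minimal illustration assuming that each fixation has already been assigned to an AOI group and that dwell times per group are available in milliseconds; the function name `relative_measures` and the input layout are hypothetical.

```python
def relative_measures(fixations, dwell_ms, total_play_ms):
    """Compute per-AOI-group relative dwell time, relative fixation count,
    and mean fixation duration.

    fixations: list of (aoi_group, duration_ms) for the entire scenario.
    dwell_ms: dict mapping aoi_group -> summed dwell time in ms.
    total_play_ms: total play time of the scenario in ms.
    """
    total_fixations = len(fixations)
    result = {}
    for group, dwell in dwell_ms.items():
        durations = [d for g, d in fixations if g == group]
        result[group] = {
            # Dwell time divided by total play time.
            "dwell_ratio": dwell / total_play_ms,
            # Fixation count divided by all fixations in the scenario.
            "fixation_ratio": len(durations) / total_fixations if total_fixations else 0.0,
            # Mean fixation duration within the group.
            "mean_fixation_ms": sum(durations) / len(durations) if durations else 0.0,
        }
    return result


# Hypothetical data for one participant.
example_fixations = [("CDA", 200), ("CDA", 300), ("IVA", 100), ("NA", 400)]
example_dwell = {"CDA": 600, "IVA": 120}
example = relative_measures(example_fixations, example_dwell, total_play_ms=6000)
```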
Visual selection and its associated motor reaction cannot be matched one-to-one, due to the dynamic characteristic of CBSGs. Thus, the accuracy in motor reaction was operationalized independently from the visual selection. We hypothesized that experts complete more intervention tasks than novices, formulating H2b. The intervention completion score was developed as follows. In consultation with a medical professional, we selected five intervention tasks that are theoretically essential in the GIB scenario: oxygen mask application, fluid administration, blood administration, blood order, and calling gastroenterologists. We then calculated the proportion of the interventions completed. This was done by extracting the corresponding data from the game log files.
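The completion score can be sketched as a short log-parsing routine. The log schema is not given in the text, so this sketch assumes a hypothetical JSON array of events, each with an `action` field, and hypothetical action identifiers for the five interventions.

```python
import json

# Hypothetical action identifiers; the real AbcdeSIM log vocabulary is not given in the text.
ESSENTIAL_INTERVENTIONS = (
    "apply_oxygen_mask",
    "administer_fluid",
    "administer_blood",
    "order_blood",
    "call_gastroenterologist",
)


def completion_score(log_json):
    """Return the proportion of the five essential interventions found in a log.

    log_json: JSON text holding a list of events such as
    {"time": 12.4, "action": "apply_oxygen_mask"} (assumed schema).
    """
    events = json.loads(log_json)
    performed = {event["action"] for event in events}
    done = sum(1 for name in ESSENTIAL_INTERVENTIONS if name in performed)
    return done / len(ESSENTIAL_INTERVENTIONS)


# Hypothetical log: three essential interventions plus one unrelated action.
example_log = json.dumps([
    {"time": 12.4, "action": "apply_oxygen_mask"},
    {"time": 30.0, "action": "use_stethoscope"},
    {"time": 95.1, "action": "administer_fluid"},
    {"time": 140.7, "action": "order_blood"},
])
```

Repeated clicks on the same intervention count once in this sketch, because the score concerns whether an intervention was completed, not how often.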

Speed in performance (H3)
The relative time to complete the five intervention tasks from H2b was used as the speed measure. Clicking the game start button was taken as the start point, with clicking the button for applying one of the five interventions as the end point for that intervention. We assume that the speed of intervention includes the speed of diagnosis as they are closely connected and performed simultaneously in emergency medicine. To make the time on each task comparable, z-scores were used. First, we checked whether the time-on-task per task was normally distributed. We then transformed each time-on-task into a z-score per task for each participant.
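The per-task standardization can be sketched as below. This is a minimal sketch assuming that times are pooled over all participants who performed a given intervention and that the sample standard deviation is used; neither detail is stated in the text. The function name `z_scores_per_task` and the task labels are hypothetical.

```python
from statistics import mean, stdev


def z_scores_per_task(times_by_task):
    """Standardize time-on-task within each intervention task.

    times_by_task: dict task -> dict participant_id -> seconds from game
    start to the intervention click. Participants who never performed a
    task are simply absent from that task's dict.
    """
    z = {}
    for task, times in times_by_task.items():
        values = list(times.values())
        # Assumption: sample SD (n - 1); needs at least two differing values.
        m, s = mean(values), stdev(values)
        z[task] = {pid: (t - m) / s for pid, t in times.items()}
    return z


# Hypothetical times for one task and three participants.
example_z = z_scores_per_task({"oxygen_mask": {"p1": 10, "p2": 20, "p3": 30}})
```

A negative z-score then indicates that a participant reached the intervention earlier than the group average for that task.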

Cognitive load (H4)
In addition to using the cognitive load questionnaire (i.e., NASA-TLX) as a subjective rating, we used eye-tracking measurements as an objective indicator of cognitive load. Several studies have shown that high cognitive load is related to long fixation durations (Korbach, Brünken, & Park, 2016; Park, Knörzer, Plass, & Brünken, 2015) and high fixation frequency (Van Orden, Limbert, Makeig, & Jung, 2001; Van Orden, Nugent, La Fleur, & Moncho, 1998; Zelinsky, Rao, Hayhoe, & Ballard, 1997). We also included transition rate (i.e., the number of movements from one AOI to another per second), which has been used in several studies of working memory capacity (Holmqvist et al., 2011, Chapter 12). As cognitive load represents the level of optimization of working memory, we assume that a robust transition rate might be interpreted as an active cognitive process with an optimal use of working memory (i.e., a low level of cognitive load). Average fixation duration and fixation frequency were calculated over the scenario. Transitions were counted using all individual AOIs from the four aforementioned AOI groups. Then the per-second transition rate was calculated.
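The transition rate can be sketched as a count of AOI changes in the fixation sequence divided by the scenario duration. This is a minimal sketch; it assumes that fixations falling outside every AOI are skipped rather than counted as a separate area, which the text does not specify. The function name `transition_rate` is hypothetical.

```python
def transition_rate(aoi_sequence, duration_s):
    """Return the number of AOI-to-AOI transitions per second.

    aoi_sequence: AOI label of each fixation in temporal order, with None
    for fixations that fall outside every AOI.
    duration_s: duration of the scenario in seconds.
    """
    # Assumption: off-AOI fixations are dropped before counting transitions.
    visited = [a for a in aoi_sequence if a is not None]
    transitions = sum(1 for prev, cur in zip(visited, visited[1:]) if prev != cur)
    return transitions / duration_s


# Hypothetical sequence with three transitions (A->B, B->C, C->A) over six seconds.
example_rate = transition_rate(["A", "A", None, "B", "C", "C", "A"], 6.0)
```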

Results
All measures for each construct were compared between experts and novices by t-tests for independent samples. When the data were not normally distributed, the Mann-Whitney U test was used instead. MANOVA was used for comparing multivariate variables. Table 2 provides an overview of the outcomes of the variables related to all constructs except visual selectivity, which is specified separately in Table 3.
Fig. 4 shows the distribution and the boxplot of HMM scores. The HMM score was significantly different between the two groups with a large effect size (t(32) = −3.49, p = .001; d = −1.16), indicating that experts adhered better to the ABCDE sequence than novices. There was no significant difference between groups in the length of the ABCDE sequences.
Table 3 provides an overview of the outcomes of visual selectivity measures for each AOI category. A MANOVA was conducted for all three relative measures of visual selectivity (i.e., dwell time, fixation count, and fixation duration) for CDA. The MANOVA revealed a significant difference between experts and novices (F(3,37) = 4.67, p = .007). Further, separate tests on all three variables showed significant differences: for CDA, experts showed a higher proportion of dwell time to total play time with a large effect size (t(38) = −2.62, p = .012; d = −0.82), a higher ratio of fixation count to total fixation counts with a medium effect size (U = 123, p = .023, r = 0.35), and longer fixation duration with a large effect size (t(38) = −2.15, p = .038; d = −0.67). There was no significant difference in other AOI groups, except for the experts' lower proportion of fixation count in IVA (t(38) = 2.07, p = .045; d = 0.65).
The intervention completion score was significantly higher, with a medium effect size, for experts than for novices (U = 167.5, p = .027; r = 0.33).
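The two effect-size measures reported here can be sketched with the standard library. This is a minimal sketch under common conventions, which may differ from the software the authors used: Cohen's d with a pooled standard deviation, and r computed as |Z| / sqrt(N) from the normal approximation of the Mann-Whitney U statistic without tie correction. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def mann_whitney_r(u, n_a, n_b):
    """Effect size r = |Z| / sqrt(N) from a Mann-Whitney U value.

    Uses the normal approximation without tie or continuity correction
    (an assumption of this sketch).
    """
    mu = n_a * n_b / 2.0
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mu) / sigma
    return abs(z) / sqrt(n_a + n_b)


# Hypothetical toy samples with equal variance and a one-unit mean difference.
toy_d = cohens_d([1, 2, 3], [2, 3, 4])
# U = 123 with assumed group sizes of 21 and 20 after the eye-tracking exclusions.
toy_r = mann_whitney_r(123, 21, 20)
```

With U = 123 and assumed group sizes of 21 experts and 20 novices, this approximation gives r of about 0.35, which is consistent with the value reported for the fixation count ratio in CDA.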

Speed in performance
Experts showed faster unit-task reaction time with a large effect size (t(40) = 3.77, p = .001; d = 1.14). There was no significant difference in the total time on entire scenario performance (U = 348, p = .067; r = 0.27).

Cognitive load
A MANOVA was conducted for all four measures of cognitive load: NASA-TLX score, average fixation duration, fixation frequency, and transition rate. The MANOVA revealed a significant difference between experts and novices (F(4,36) = 2.68, p = .047) in cognitive load. However, separate t-tests showed divergent results. The NASA-TLX score was lower for experts than for novices, with a large effect size (t(40) = 2.33, p = .025; d = 0.70). There was no significant difference
Note. a Due to non-normal distribution, the Mann-Whitney U test was used. U-value and r were calculated instead of t-value and Cohen's d. b p < .05.
Note. Critical diagnosis area (CDA), non-critical diagnosis area (NDA), intervention area (IVA), and neutral area (NA). Each measure was calculated as relative values: dwell time divided by total play time, fixation count divided by total fixation counts, and fixation duration divided by fixation duration averaged over the scenario. a Due to low normality, the Mann-Whitney U test was used. U-value and r were calculated instead of t-value and Cohen's d. b p < .05.

Discussion
This study aimed to empirically determine whether the level of domain-specific prior knowledge (DSPK) affects performance in a computer-based simulation game (CBSG). In the introduction, we argued that, to assess this performance, certain constructs should be developed by taking theories of complex-skill acquisition as a starting point. We suggested four theoretical aspects of problem-solving performance to represent the level of DSPK, and defined indicators of these aspects based on a skill hierarchy, which resulted in four hypotheses. To test these hypotheses, game-logs and eye-tracking data were collected and analyzed, using the methods corresponding to each aspect.
Hypothesis 1 stated that participants with high DSPK (i.e., experts) show higher systematicity in their approach during performance in a CBSG than participants with low DSPK (i.e., novices). The results of this study support this hypothesis. Systematicity for the AbcdeSIM task environment was defined as a high level of adherence to the ABCDE sequence at the unit-task level, while flexibly adjusting task performance at the physical level. According to the result from the HMM, the experts showed a higher level of adherence than novices to the ABCDE sequence at a hidden level. Additionally, the length of the ABCDE sequences at the physical level did not differ significantly between experts and novices. This implies that the important difference between experts and novices resides in the inner structure of the task performance, rather than the amount of physical action.
Hypothesis 2 concerns the accuracy in applying cognitive rules, stating that experts show the availability of more accurate cognitive rules. We decomposed the cognitive rules into two parts specific to CBSG environments: visual input of information from the environment (i.e., if part) and motor reactions to the environment (i.e., then part). Therefore, the hypothesis consisted of two sub-hypotheses: Experts show higher accuracy in both visual selection (H2a) and motor reaction (H2b) than novices.
H2a was confirmed, showing that experts have higher accuracy in visual selection of areas with critical information. This construct was operationalized as the ratio of allocation of visual attentional resources to critical diagnosis areas (CDA). All three eye-tracking metrics that were used (i.e. dwell time, fixation count, and fixation duration) indicated a higher allocation to CDA for experts. We observed no significant difference in other AOI groups (i.e., NDA, IVA, and NA), except the fixation count in IVA. Interestingly, for the areas with intervention functions (IVA), experts showed lower fixation counts compared to novices. Regarding that high number of fixation counts indicate frequent reference to the stimulus (Orquin & Loose, 2013), this result suggests that novices search more frequently for what to execute (i.e., intervention) in the absence of an accurate diagnosis. This also can be interpreted as novices using weak methods such as general search (Gick, 1986) and working backward (Larkin, McDermott, Simon, & Simon, 1980). As a result of critical information collected via effective   J.Y. Lee, et al. Computers in Human Behavior 99 (2019) 268-277 information gathering strategies, experts reacted to the environment more accurately. They achieved higher intervention scores, which supports H2b. Thus, Hypothesis 2 was largely confirmed: Compared to novices, experts allocate more visual attentional resources to critical information, followed by more appropriate motor reactions. With regard to H3 which concerns speed in performance, the results confirmed H3 by demonstrating higher speed in performing specific unit-tasks that are most essential for the designated scenario. On the other hand, one might note as well that the total play time on the entire scenario showed no significant difference between experts and novices. 
Although experts perform tasks faster than novices in general, we assume that the time on the entire task is not an applicable indicator of expertise in certain tasks. In this study, experts seem to complete the essential interventions faster and then repeatedly perform reassessment (i.e., monitoring the effect of applying the ABCDE procedure and controlling the process), resulting in a similar overall performance time for experts and novices. Reassessment is an essential part of the ABCDE method that is trained throughout the traineeship in emergency medicine and is often overlooked by novices. We suggest that the use of time on the entire task as an indicator of expertise should be considered carefully by analyzing the given tasks.
Lastly, Hypothesis 4 pertained to the lower level of cognitive load for experts compared to novices. This was supported in that experts reported lower cognitive load on a subjective rating scale (NASA-TLX), which was correlated with a high transition rate. While subjective rating scales are a well-validated measure of cognitive load, the interpretation of transition rates has been inconsistent in the literature. A high transition rate can be related to optimal use of working memory (Epelboim & Suppes, 2001; Miall & Tchalenko, 2001) or better integration between different information sources (Bartels & Marshall, 2006; Johnson & Mayer, 2012; Schmidt-Weigand, Kohnert, & Glowalla, 2010), which is in line with our interpretation. On the other hand, a high transition rate could also be connected to difficulties in integrating information sources (Holsanova, Holmberg, & Holmqvist, 2009), inefficient visual problem-solving strategies (Van Meeuwen et al., 2014), or extraneous cognitive load caused by seductive details in multimedia learning (Korbach, Brünken, & Park, 2017).
We presume that these different interpretations stem from differences in the characteristics of visual stimuli, which are highly task-dependent. When a task presents static stimuli with fixed information (e.g., static texts or figures), shifting eye-gazes between AOIs might indicate a stagnation within the same information. In this case, AOIs with information that has already been processed become irrelevant areas that do not require revisits (Van Meeuwen et al., 2014). On the other hand, when a task provides dynamically changing pieces of information, shifts between AOIs signify a different kind of process: vigorous progress in gathering new information. Especially in medical simulations, monitoring patients' physiological changes and reacting to them constantly (i.e., reassessment) is a major part of problem-solving, which can be facilitated by optimal use of working memory. Our result accords with the results of a previous study in ultrasound simulation (Aldekhyl, Cavalcanti, & Naismith, 2018), which also used a medical simulation task. Furthermore, this dynamic aspect of medical simulation tasks seems to reduce the sensitivity of fixation duration and fixation count during the overall task performance in measuring cognitive load, due to the fluctuation of these measures. Further research is needed on using eye-tracking to measure cognitive load in different task environments.
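As an illustration of the transition-rate measure discussed above, the following sketch counts gaze shifts between different AOIs and normalizes them by viewing time. The definition used here (transitions per second) and the function name `transition_rate` are assumptions for illustration; the exact definition applied in a given study may differ.

```python
def transition_rate(aoi_sequence, duration_s):
    """Return the number of gaze transitions between different AOIs per second.

    `aoi_sequence` is the chronological list of AOI labels, one per fixation
    (hypothetical input format). Consecutive fixations within the same AOI
    do not count as a transition. `duration_s` is the viewing time in seconds.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    transitions = sum(
        1 for previous, current in zip(aoi_sequence, aoi_sequence[1:])
        if previous != current
    )
    return transitions / duration_s


# Illustrative sequence only: three shifts between AOIs within 10 seconds.
rate = transition_rate(["CDA", "IVA", "IVA", "CDA", "NDA"], duration_s=10)
```

Under the interpretation proposed in the text, a higher value in a dynamic task would reflect more active gathering of new information, whereas in a static task the same value could indicate repeated revisits of already processed areas.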
The results of this study have several implications for indicator development in CBSGs. First of all, multiple aspects of performance should be considered as a whole within an integrated theoretical framework when determining constructs to assess performance. Researchers in education have argued that a well-designed performance assessment should combine all aspects of performance in a global manner, rather than using a simple checklist (Cunnington, Neville, & Norman, 1996; Dankbaar et al., 2014). We suggest that the same principle should be applied to performance assessment in CBSGs. This is the most important reason to use complex-skill acquisition theories as a driving force, because doing so facilitates considering different aspects of tasks (e.g., non-routine and routine) in an integrated theoretical framework. Secondly, since constructs are abstract and conceptualized broadly, they should be operationalized into concrete indicators that are fully tailored to a specific task. This should be done through a task analysis in consultation with domain experts, also driven by educational theories. For instance, systematicity in approach is one of the broad concepts we explored in this study, and it was problematic to operationalize. Thus, we rigorously applied theories of complex-skill acquisition and of the relevant domain (i.e., the ABCDE method) to define the indicator of systematicity. Thirdly, this study opens up the potential of combining eye-tracking data with game-logs to quantify performance in CBSGs. Since most CBSGs depend on visual stimuli, eye-tracking can be an important source for obtaining a complete account of certain aspects of performance. In this study, we have found two indicators from eye-tracking (i.e., visual selection and cognitive load) that can possibly be used in CBSGs. Future research is needed to apply these findings to assessment and the development of support in CBSGs (e.g., student modelling to adaptively support individuals).
The insights from this study can help educators to assess students' performance in CBSGs and provide scaffolding to students with a low level of DSPK. For instance, they might focus on the different aspects of students' performance, then adjust the level of scaffolding to enhance each aspect. When the systematicity in approach is not high enough, instructors might stimulate the student to construct domain-specific knowledge and strategies (e.g., advise them to consult learning resources with relevant information). When a student concentrates on reactions only, without sufficient information gathering, instructors can guide the student to pay more attention to information gathering as a sound foundation for taking actions in the game. Additionally, when the student's cognitive load is high, extra support could be given to manage the load. This can be done either by reducing the cognitive load itself (e.g., providing pauses during the game or presenting less complex scenarios) or by facilitating students' self-regulation to manage their own cognitive load (Sweller et al., 2019).
Several limitations of this study need to be mentioned. Firstly, our findings might not be generalizable to other CBSG environments, since the indicators were specialized for a specific task. Future research should examine how our methods can be applied in other CBSG environments. Secondly, the expert group consisted of residents, rather than medical doctors. In this study, we selected medical students as novices and residents as experts in order to form comparable groups. Although this led to a better controlled experimental setup, including a wider range of expertise levels could have yielded more informative results. Thirdly, although one could well argue that eye-hand coordination in performing cognitive rules is another aspect of performance, it was not explored in this study. Instead, we analyzed visual selection and motor reaction separately, based on our assumption that the two cannot be matched one-to-one in the dynamic environment of a CBSG. However, investigating eye-hand coordination as an aspect of performance via non-linear analysis would be an intriguing topic for future study.
In conclusion, this study has demonstrated the development of performance assessment that can be used in a highly dynamic game environment. This was accomplished by starting from theories of complex-skill acquisition and then identifying constructs for assessment and valid indicators. We believe that the empirical search for reliable indicators in CBSGs can be seen as a problem-solving process in itself. As in problem-solving by experts, the research should be driven by a certain knowledge structure (e.g., educational theories) to avoid an inefficient process and suboptimal solutions. Educational theories and empirical experiments are in a reciprocal relationship, where one cannot stand alone without the other.