Investigating the validity of subjective workload rating (NASA TLX) and subjective situation awareness rating (SART) for cognitively complex human-machine work

Subjective workload and situation awareness measures, such as the NASA task load index (TLX) and the situational awareness rating technique (SART), are frequently used in human-system evaluation. However, the interpretation of these ratings is debated. In this study, empirical evidence for the measures' theoretical assumptions was investigated by comparing operators' ratings collected immediately after performing a scenario with ratings collected after the operators had acquired knowledge of actual system states through a video review of the scenario. Eighteen licensed control room operators participated in the simulator study, running 12 relatively challenging scenarios. It was found that the interpretation of TLX items involving introspection remained stable after operators acquired factual scenario knowledge, while the interpretation of items involving the perception of external events, such as situation awareness and performance, depended on the operators' scenario knowledge. The results show that operators' ratings could discriminate between mental effort, performance, frustration, and situation awareness. No clear evidence for the SART index as a measure of situation awareness was found. Instead, a subjective situation awareness measure developed for this study was distinct from workload and related to operator performance, showing that this type of measure warrants future investigation of its validity. The study findings help in developing measurement procedures and interpreting subjective measures. Finally, the study reveals that informing operators about the scenario can provide useful subjective ratings of situation awareness and performance. Future research should include procedures for how to inform participants adequately and efficiently in subjective assessments.


Introduction
Mental workload and situation awareness are important criteria for designing and assessing human-machine systems (Endsley et al., 1998; Salmon et al., 2009; O'Hara et al., 2012). Workload and situation awareness measures are applied to obtain sensitive evaluations and gain an in-depth understanding of complex task performance (O'Donnel and Eggemeier, 1986; Endsley et al., 1998). Important questions include how human-system design and organisation of work influence workload and situation awareness, and to what extent a given human-machine configuration represents optimal or acceptable levels (Reid and Colle, 1988; Endsley, 2000a; Young et al., 2015). To adequately address these questions, measurement credibility and utility are important considerations (Muckler and Seven, 1992; Annett, 2002).
Mental workload and situation awareness are frequently viewed as separate but interrelated constructs (Endsley, 2000a; Vidulich and Tsang, 2015). For example, when the attentional resources involved in performance compete with the resources needed for monitoring and comprehension, situation awareness may decrease (Vidulich and Tsang, 2015). However, the operator's increased effort may also be related to increased situation awareness. By improving the human-system interface or increasing the level of expertise, one could reduce the workload and increase situation awareness, a frequent goal of system design efforts (Vidulich, 2000; Endsley, 2000a; Vidulich and Tsang, 2015). It follows that how workload and situation awareness relate to each other, and under what conditions, may provide important insights regarding human-system safety and efficiency (Vidulich and Tsang, 2015). Consequently, separate and valid measures of each construct are warranted (Endsley, 2000a; Parasuraman et al., 2008).
Subjective assessment techniques are frequently used due to their ease of use, low cost of application, and sensitivity to varying conditions (Reid and Nygren, 1988). The NASA task load index (TLX) (Hart and Staveland, 1988) is the most popular subjective workload measure (De Winter, 2014; Grier, 2015), and the situational awareness rating technique (SART) (Taylor, 1990) is the most widely used subjective situation awareness measure (Endsley et al., 1998; Salmon et al., 2009). However, subjective workload techniques have been questioned due to a lack of correspondence with performance-based and physiological workload measures (Yeh and Wickens, 1988; Matthews et al., 2020), and subjective situation awareness has been found to dissociate from objective measures (Endsley, 2020). Therefore, it is important to improve our knowledge of what subjective measures can inform us (Messick, 1990, 1995).
Subjective ratings concern the assessment of events in the external world and of internal sensations and feelings (Annett, 2002). The latter are obtained through introspection, while events in the external world are objects of direct verification or consensus view. Some external events are directly verifiable (Muckler and Seven, 1992), for example, that a system had stopped or that the time available for the task was 5 min. Other events are verifiable by consensus, for example, that the event created many alarms in need of attention or that the situation implied that the system needed to be shut down. Internal sensations are hard to verify, but mental workload, as an example, can to a considerable extent be inferred from external conditions and performance (Colle and Reid, 2005), psychophysiological response (Charles & Nixon, 2019), and overt behaviour (Gan et al., 2020).
Mental workload likely involves both private sensations regarding effort and verifiable elements, such as task assessment (Annett, 2002). Situation awareness involves the perception of events in the external world and is probably the object of direct verification and consensus (Parasuraman et al., 2008). Therefore, it is interesting to note that subjective ratings are frequently collected immediately after completing a scenario (post-session) without the participant having explicit information about the scenario's operational challenges, actual system states, or the consequences of one's actions. Ratings are based on the operator's memory recall of their observations during the performance. This is contrary to expert observers who are usually fully informed about these issues (Endsley, 2020). Thus, the operator may lack an adequate point of reference for the assessment. Could we better understand the meaning of subjective ratings by providing the operator with an external reference to what actually happened during the scenario?
Validity and reliability are critical considerations when selecting human factors measures and interpreting their results (Annett, 2002; Salmon et al., 2009). It is important that the measures represent the phenomena one attempts to investigate and that the measures collected can be interpreted according to the purpose of the evaluation. From a psychometric perspective (Murphy and Davidshofer, 1994; Messick, 1995; Annett, 2002), as with many human factors phenomena, subjective mental workload and subjective situation awareness are constructs. Construct validity is investigated through a process of collecting evidence for or against the accuracy of interpretations and actions taken based on the measurement (Messick, 1990, 1995). Validity evidence includes a review of the measurement content, the internal structure of its components, and relationships with phenomena external to the measure (Messick, 1990; Murphy and Davidshofer, 1994). Experts can assess the extent to which the measurement items represent the concept (Fracker, 1991; Salmon et al., 2009). One can compare the structure identified by factor analysis to the theoretically proposed structure (Annett, 2002). Also, sensitivity to varying loads and alternative human-machine configurations are important applied measurement criteria (O'Donnel and Eggemeier, 1986; Endsley, 2000b; Salmon et al., 2009).
Hart and Staveland (1988) developed the NASA TLX. The measure was developed based on substantial theory and considerable empirical testing (Hart, 2006; De Winter, 2014). Use of the measure has extended to contexts beyond its empirical developmental basis, for example, air traffic control, process control, healthcare, and military (Hart, 2006; Grier, 2015). Hart and Staveland (1988, p. 144) defined workload as "… the cost incurred by human operators to achieve a specific level of performance."
They focused on three aspects, with the following measurement items: (a) the external demand imposed by the tasks (three items consider mental, physical, and temporal demands); (b) effort based on the operator's perception of task demand, including the self-regulation of effort and the understanding of task demand based on perceived performance (the items effort and performance cover these aspects); and (c) the psychological impact of perceived task demand, effort, and performance, captured by the item frustration. Hart and Staveland (1988) also referred to these three aspects as task, behaviour, and subject related, respectively. The intended and most frequent application is immediately after completing the task or scenario (Hart and Staveland, 1988).
From the development of Hart and Staveland's (1988) measure, one can derive several assumptions about the NASA TLX. (a) The rating of demand, the work loaded on the operator, could be substantially influenced by the operator's perception of its significance and magnitude, including what system behaviour is detected and understood during a scenario. Effort or resources invested, however, can be viewed as representing introspective characteristics. One can hypothesise that being informed about what actually happened in a scenario, e.g., system and component states and the operational consequences of one's own and team members' performance, could influence the understanding of demand to a greater extent than it would influence the perception of effort invested. (b) TLX performance should reflect self-regulation and should therefore be related to effort. (c) TLX performance should be related to frustration, and this relationship should possibly be modified by task demand. According to Hart and Staveland (1988, p. 166), frustration provides "… information about how comfortable operators felt about the effectiveness of their efforts relative to the magnitude of the task demands imposed on them."
(d) In addition to its relation to performance, frustration represents the psychological impact of perceived task demand and effort, and one can hypothesise that frustration should relate to all TLX items. This type of relationship was found by Hart and Staveland (1988): their preliminary scales of stress and frustration were highly correlated with all other subscales.
Taylor's (1990, p. 3-3) working definition of situation awareness when developing SART is that "Situational Awareness is the knowledge, cognition and anticipation of events, factors and variables affecting the safe, expedient and effective conduct of the mission". The research behind the SART development was influenced by the workload paradigm with its aim of optimising operator workload (Taylor, 1990). Knowledge elicitation and structural analysis resulted in three broad dimensions (Taylor, 1990, pp. 3-7): (a) Demands on Attentional Resources (Instability, Complexity, Variability), (b) Supply of Attentional Resources (Arousal, Concentration, Division of Attention, Spare Capacity), and (c) Understanding of the Situation (Information Quantity, Information Quality, Familiarity). Several studies have found that SART is substantially correlated with workload (Hendy, 1995; Selcon et al., 1991; Loft et al., 2015). This is not very surprising given SART's developmental basis and the development procedure applied by Taylor (1990). The knowledge elicitation technique generated scenarios representing low and high situation awareness. The scenarios correspondingly varied in workload. For example, "Flying in formation in an unfamiliar aircraft working at the limit of your capacity" vs. "Approaching to land in good weather at a familiar airfield, in a familiar aircraft fitted with good displays".
Consequently, constructs elicited from subjects tended to describe the task demand, such as attentional demand, familiarity, and complexity, and constructs that one could relate directly to workload, such as spare capacity, workload, and arousal (Taylor, 1990, Table 1). Taylor (1990, p. 3-11) suggested that situation awareness can be enhanced by controlling the demand on attentional resources and improving the supply of attentional resources, for example, by prioritising and cuing tasks or exploiting mental resource modalities. Taylor and Selcon (1994) developed a formula for the SART index as SA = Understanding − (Demand − Supply). As the formula prescribes, an imbalance between demand and supply should increase or reduce SA beyond what is measured by the understanding element. For example, supply exceeding demand would increase situation awareness beyond an operator's understanding. From the SART basis, one can derive the following assumptions: (a) demand and supply of attention mainly capture workload; (b) similar to the NASA TLX dimensions, demand can be seen as the operator's perception of the external task, while supply of cognitive resources tends toward a subject-oriented element susceptible to introspection, so these items could be expected to behave similarly to the NASA TLX workload items depending on whether the operator is informed about what actually happened in a scenario; and (c) the demand-supply element of the SART formula, representing a factor influencing situation awareness, should be related to the rating of understanding.
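As an illustration of the SART formula described above, the following minimal Python sketch computes the index from three ratings. The function name and the example ratings are hypothetical, and the sketch assumes that all three items are coded so that a higher number means more demand, more supply, and better understanding.

```python
def sart_index(demand: float, supply: float, understanding: float) -> float:
    """SART index as SA = Understanding - (Demand - Supply)."""
    return understanding - (demand - supply)


# Hypothetical ratings on a 1-11 scale, all with the same understanding rating.
balanced = sart_index(demand=6, supply=6, understanding=8)   # demand equals supply
surplus = sart_index(demand=4, supply=7, understanding=8)    # supply exceeds demand
overload = sart_index(demand=9, supply=5, understanding=8)   # demand exceeds supply
```

With identical understanding ratings, the index rises when supply exceeds demand and falls when demand exceeds supply, which is the property the formula is meant to express.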
Since it is debatable to what extent the SART measure covers situation awareness, workload, or their relationship, the study found it imperative to consider a supplemental theoretical basis for situation awareness. Endsley's three-level theory (1995a; 1995b) is the most popular and probably the most cited theory of situational awareness (Salmon et al., 2009). Endsley (1995a, p. 36) defined situation awareness as "the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future". Situation awareness Level 1 concerns the observation of the status and behaviour of, for example, system parameters, alarms, and functionality. Level 2 concerns the understanding of the observations for managing and controlling the system, while Level 3 concerns the anticipation and prediction of future system development. The theory is the basis for the Situation Awareness Global Assessment Technique (SAGAT; Endsley, 1995b), an objective freeze-probe-based measure of situation awareness. However, to investigate subjective measures, this study developed a simple subjective measure based on Endsley's theory. Similar subjective measures have been developed and applied to aviation and military command and control (McGuinness and Foy, 2000; Matthews and Beal, 2002; McGuinness, 2021). From the theoretical basis and previous studies (Endsley et al., 1998), the assumptions include (a) that a subjective measure based on Endsley's theory would be distinct from workload, (b) that, being a pure situation awareness measure, it should be more closely related to operator performance than the SART, and (c) that operator ratings could distinguish the theory's three levels from each other.
This study set out to provide evidence related to the meaning of scores from the most popular post-session subjective measures of workload and situation awareness. Based on the measures' theoretical basis, it was hypothesised that operator assessment of workload and situation awareness items related to an external reference would be influenced by knowledge about the scenarios' intended task demand, actual scenario development, and knowledge of own performance. However, assessment based on introspection was expected to be minimally influenced by this type of knowledge. In addition, due to the SART measure integrating situation awareness and workload, the purpose was to explore a simple subjective measure of situation awareness based on Endsley's (1995a) theory while investigating the demand-supply element of the SART formula. The research questions were studied by collecting nuclear operators' subjective ratings immediately after performing scenarios in a full-scope research simulator (post-session, "non-informed" assessment) and collecting the same subjective ratings after the operators had been informed about the scenario demands and had completed a scenario replay/video analysis of the scenario (post-video, "informed" assessment).

Participants
Eighteen licensed operators from a Nordic nuclear power plant, organised as six three-person teams, participated in the study. Each team comprised a supervisor, a reactor operator, and a turbine operator. Their mean age was 39.2 years (SD = 11.9), ranging from 26 to 62 years, and their mean control room work experience was 10.6 years (SD = 10.8), ranging from 1 to 35 years. Five of the six teams comprised operators working as a team at their home plant, while one team was assembled from two different home plant teams. All members of a given team worked at the same reactor unit and thereby possessed shared competence in technical work, collaboration practices, and communication procedures. Operators maintained their competence by regular simulator-based training at their home plant. In this study, the teams were instructed to use teamwork practices and operating procedures as they ordinarily would in their daily work and in their home plant training. The study was reviewed and approved by the Halden Reactor Project Human Studies Review Committee and was performed according to the Halden Reactor Project's human participant protection procedures.

NASA TLX
The study utilised an unweighted version of the NASA TLX (Hart and Staveland, 1988), often referred to as the raw TLX (Byers et al., 1989; Nygren, 1991). The NASA TLX comprises the following six items: mental demand, physical demand, temporal demand, performance, effort, and frustration. All questions, except the performance question, offered a scale ranging from "very low" to "very high". The rating scale for the performance question ranged from "perfect" to "failure". Each question's scale ranged from 1 to 11.
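The unweighted (raw) TLX index can be sketched as the plain average of the six item ratings. The snippet below is a minimal illustration, not the study's scoring code: the item keys, the function name, and the example ratings are hypothetical, and no recoding of the performance item is applied because the text does not describe one.

```python
from statistics import mean

# Hypothetical item keys for the six NASA TLX items named in the text.
TLX_ITEMS = (
    "mental_demand", "physical_demand", "temporal_demand",
    "performance", "effort", "frustration",
)


def raw_tlx(ratings: dict) -> float:
    """Unweighted TLX index: the mean of the six item ratings (scale 1-11)."""
    missing = [item for item in TLX_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing TLX items: {missing}")
    for item in TLX_ITEMS:
        if not 1 <= ratings[item] <= 11:
            raise ValueError(f"{item} is outside the 1-11 scale")
    return mean(ratings[item] for item in TLX_ITEMS)


# Hypothetical ratings from one operator after one scenario.
example = {
    "mental_demand": 8, "physical_demand": 3, "temporal_demand": 7,
    "performance": 6, "effort": 8, "frustration": 4,
}
example_index = raw_tlx(example)
```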

Situation awareness rating technique (SART)
The study used the 3-item version of the SART (Taylor, 1990). This version is often referred to as 3D SART. The measure comprised the following dimensions: Demand-demands on attentional resources, Supply-supply of attentional resources, and Understanding-understanding of the situation. The rating scale for each question ranged from 1 to 11. The items were worded, and scale endpoints labelled as follows in parentheses; The situation was (Very stable, Simple and straight forward, Few variables changing-Unstable, changes suddenly, Many interrelated components, Many variables changing); Attention, my effort was (Low alertness, Focused on one aspect, Much spare capacity-High alertness, Concentrating on many aspects, No spare capacity); My understanding of the situation was (Fully informed and full understanding, Very familiar situation-Very limited informed/understanding, Very novel situation).

Subjective situation awareness three levels (SA3)
A self-rating measure based on Endsley's theory of situation awareness was developed specifically for this study. The measure was given the preliminary label "SA3". Starting from the definition "the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future" (Endsley, 1995a, p. 36), three items were developed to represent each of the three levels of situation awareness. The items were worded, and scale endpoints labelled, as follows in parentheses: My observation of critical information (Identified all needed information-Missed important information); My understanding of what was going on (Fully understood-Did not make sense to me); I could look ahead, and foresee what was going to happen (Very accurately-Could not predict). The rating scale for each question ranged from 1 to 11. The SA3 measure was similar to the subjective measures developed by McGuinness and Foy (2000) and Matthews and Beal (2002). However, beyond Endsley's three-level theory, these two measures included a workload subscale and items about synthesising situation awareness with one's course of action.

SCORE performance evaluation
The performance assessment used the Supervisory Control and Resilience Evaluation (SCORE) framework (Braarud et al., 2015, 2016).
Using the framework, a subject matter expert developed a task-specific assessment sheet for each scenario. The assessment items developed considered the control room team's monitoring, interpretation, strategy, action, verification, teamwork, work process, and goals. During operators' review of their own team performance, utilising the scenario replay tool described below, each operator individually evaluated each item regarding the degree of acceptability on a rating scale. The scale ranged from 1 to 6, where 1-2 defined levels of unacceptable and 5-6 defined levels of acceptable. The middle values of 3 and 4 represented borderline acceptability. To create a scenario index score for each operator, an average was calculated for the task-specific items belonging to each of the control room positions. Consequently, the calculation resulted in a task-specific performance index for each operator for each scenario, ranging between 1 and 6.
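The index calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (a list of position and rating pairs), the position labels, and the function names are hypothetical and are not taken from the SCORE framework itself.

```python
from statistics import mean


def acceptability_band(rating: int) -> str:
    """Map a 1-6 rating to the bands described in the text."""
    if rating not in range(1, 7):
        raise ValueError("rating must be an integer from 1 to 6")
    if rating <= 2:
        return "unacceptable"
    if rating <= 4:
        return "borderline"
    return "acceptable"


def score_index(item_ratings: list, position: str) -> float:
    """Average the ratings of the task-specific items belonging to one position."""
    relevant = [rating for item_position, rating in item_ratings
                if item_position == position]
    if not relevant:
        raise ValueError(f"no items for position {position!r}")
    return mean(relevant)


# Hypothetical ratings given by one operator for one scenario.
ratings = [
    ("reactor_operator", 5), ("reactor_operator", 4), ("reactor_operator", 3),
    ("turbine_operator", 6), ("supervisor", 5),
]
ro_index = score_index(ratings, "reactor_operator")
```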

Scenarios
Each team participated in 12 scenarios designed to last for about 20 to 30 min. The scenarios were designed by a subject matter expert with 20 years' experience in scenario design and operator performance assessment at nuclear training and research simulators. The subject matter expert addressed the study purpose by utilising the experiences from numerous human factors experiments previously performed at the research site (Braarud and Kirwan, 2011; Øvre, 2011). The resulting scenarios were jointly reviewed by the subject matter expert and a human factors researcher (the author) utilising the full-scope research simulator. The general scenario structure comprised initial work, normal operation, or periodic testing, including minor plant system failures, and aggravating deviations inducing actuation of plant safety functions (reactor protection systems). To make parts of the scenarios complicated and cognitively demanding, malfunctions were simulated for several safety system components. Malfunctions included failed instrumentation, spuriously actuated safety functions, and complicated plant system status due to the loss of external and internal power. Consequently, the scenario tasks posed the highest load on the control room team's reactor operator, although the turbine operator was also substantially involved in verifying and controlling plant safety. The supervisor had the overall task of overviewing plant safety, deciding strategy, and supervising the team's work. The scenarios included events commonly found in nuclear power plant safety analysis, such as loss of reactor coolant accidents, loss of all offsite power, loss of turbine condenser, and loss of main feedwater. While individual failures like these are included in the operators' regular home plant training, combining the main event and multiple safety component malfunctions made the specific combinations of malfunctions relatively unfamiliar to the operators.
The team applied event-based operating procedures to acquire an overview of the plant's safety status and to develop a basis for selecting a strategy to mitigate the situation and control plant safety. The scenarios were counterbalanced using the Latin-square procedure described in Kirk (1995).
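To illustrate the idea of counterbalancing scenario order, the sketch below builds a simple cyclic Latin square for 12 scenarios and takes six of its rows as presentation orders for six teams. This is an assumption for illustration only: it is not necessarily the specific procedure described by Kirk (1995), and the choice of every second row is hypothetical.

```python
def cyclic_latin_square(n: int) -> list:
    """Return an n x n cyclic Latin square with symbols 0..n-1."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def is_latin_square(square: list) -> bool:
    """Check that every symbol occurs exactly once in each row and column."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


N_SCENARIOS = 12
N_TEAMS = 6

square = cyclic_latin_square(N_SCENARIOS)
# Hypothetical assignment: team t runs the scenarios in the order of row 2*t.
team_orders = [square[2 * team] for team in range(N_TEAMS)]
```

Each team still runs all 12 scenarios once, and each team starts with a different scenario. With six teams and 12 scenarios, only half of the rows of the square can be used, so positions are not fully balanced across scenarios.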

Simulator and session recording
The study was performed in IFE's Halden Human-Machine Laboratory (IFE, 2021), which is a full-scope research simulator based on an advanced nuclear power plant. The simulator has a fully computerised human-machine interface. Fig. 1 shows the control room layout. The shift supervisor workstation is at the back (closest to the camera), the reactor operator workstation is to the left, and the turbine operator workstation is to the right. The large screen display at the front provides a plant overview.
The simulator sessions were recorded with the laboratory's Video Audio Data Analysing (VAD) tool for use in post-video assessment. Each operator wore a headset with a microphone. The tool provided synchronised play of simulator logs, video, and audio from a scenario completed in the simulator. The recording included the simulated plant's process development (alarms, process parameters, and process events), operator process commands, navigation and interfaces accessed by the operator, a video of each operator workstation alongside an overview video of the control room, and separate audio recordings from each of the control room operators. The operator could play, pause, rewind, and forward the scenario during the performance review.

Study design and study procedure
The NASA TLX, SART, and SA3 rating questionnaires, the performance assessment, and the scenario replay tool were explained to the participating operators before performing the scenarios. The explanation included demonstrating the rating questionnaires, the scenario replay, and the performance assessment tool. Just after completing the scenario, the rating questionnaires were administered. Operators answered the rating questionnaires individually. Thereafter followed a short team briefing of the scenario's task demand, including a brief explanation of the plant failures implemented during the scenario and their consequences to plant operation. No discussions between team members were allowed during this briefing. After the briefing, operators individually, at separate workstations, utilised the scenario replay tool and performed the SCORE performance assessment for the scenario just performed. Operators could, at their own pace, play, pause, rewind, and forward the scenario during the performance assessment. There was no time limit for the assessment. Laboratory staff were, upon request from the operators, available for assistance on the technical aspects of answering the rating questionnaires, the scenario replay tool, or the performance assessment tool. After completing the performance assessment, each operator individually performed a second rating of the subjective measures of NASA TLX, SART, and SA3. The sequence of activities is depicted in Fig. 2, and this sequence was repeated for each of the 12 scenarios.
Table 1 shows the mean and 95% confidence interval for each measurement item for the post-session condition and the post-video condition, for each control room position and the team average. The lower part of the table shows three indexes: the average of the six TLX items, SART calculated according to its formula (SA = U − (D − S)), and the average of the three SA3 items.
Table 1 shows that the post-session and post-video mean of the ratings and the 95% confidence intervals were not very different. Noteworthy differences were the slightly lower team average of NASA TLX performance and effort, and SA3 observing in the post-video condition, differences in team average of 0.26, 0.27 and 0.44, respectively. Table 1 also shows that the operators' ratings of the TLX dimensions varied substantially. Looking at the post-session ratings, the team average ranged from a physical demand of 3.34 to a performance of 7.40. The team average SART understanding was about 1 scale point above demand and supply. The TLX index was substantially lower than the SART and SA3 indexes for all three control room positions and the team average.
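A mean with a 95% confidence interval, as reported in Table 1, can be sketched as below. The text does not state how the intervals were computed, so this sketch assumes a normal approximation (mean plus or minus about 1.96 standard errors) and treats the ratings as independent; a t-based interval would be somewhat wider for small samples. The function name and the ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values: list, confidence: float = 0.95) -> tuple:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    m = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m, m - z * standard_error, m + z * standard_error


# Hypothetical item ratings on the 1-11 scale.
ratings = [6, 7, 8, 7, 6, 8, 7, 7]
m, lower, upper = mean_ci(ratings)
```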

Reliability in terms of Cronbach's alpha
Reliability, in terms of internal consistency, was assessed with Cronbach's alpha. The internal consistencies of the scales were high, and  video (assumed 6 items; α = 0.83), SA3 post-session (assumed 6 items; α = 0.94), and SA3 post-video (assumed 6 items; α = 0.95). Assuming six items, the results suggest similarly high reliability of the NASA TLX and the SART, while SA3 showed very high reliability.
Table 2 shows the bi-variate correlations between the NASA TLX, SART, and SA3 items for the operators' post-session and post-video ratings. The pattern of bi-variate correlations was similar for the post-session and the post-video ratings. Some noteworthy observations are the high bi-variate correlations between the TLX dimensions of mental demand, temporal demand, and effort. Also, the SART items demand and supply correlated substantially with the above-mentioned TLX items. The SA3 items were highly intercorrelated, while only the SART item understanding correlated with the SA3 items. The TLX performance item correlated positively with the SA3 items and the SART understanding item, while performance and frustration were substantially negatively correlated.
Table 3 shows the correlation between post-session and post-video ratings for each item of the NASA TLX, SART, and SA3 measures. As hypothesised, the correlations for effort and frustration were relatively high, while the correlations for TLX performance, SART understanding, and the SA3 items were relatively low. Surprisingly, contrary to the assumptions based on the NASA TLX theoretical basis, the correlations for mental demand, physical demand, and temporal demand were relatively high.
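For readers unfamiliar with the internal-consistency coefficient used above, the following minimal sketch computes Cronbach's alpha from item ratings with the standard formula alpha = k/(k − 1) · (1 − sum of item variances / variance of total scores). The data layout, function name, and ratings are hypothetical; the sketch does not reproduce the study's values.

```python
from statistics import variance


def cronbach_alpha(items: list) -> float:
    """Cronbach's alpha; items[i][j] is the rating of respondent j on item i."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    n = len(items[0])
    if any(len(item) != n for item in items):
        raise ValueError("all items need the same number of respondents")
    # Total score per respondent across all items.
    totals = [sum(item[j] for item in items) for j in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical ratings: three items rated by four respondents.
consistent = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
less_consistent = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]
alpha_high = cronbach_alpha(consistent)
alpha_lower = cronbach_alpha(less_consistent)
```

Items that rank respondents identically give an alpha of 1, while items that agree less give a lower value.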

Item correlation and construct structure
A factor analysis of the post-session and post-video ratings together revealed an interesting structure from the operator ratings. The interpretation of the resulting scree plot alongside the criterion of eigenvalues exceeding 1 (Kim and Mueller, 1978) suggested four underlying factors. The principal axis method for factor extraction was applied, and normalised varimax rotation was applied for interpreting the factors. Table 4 shows the resulting factor loadings of the items.
Factor 1 was interpreted as a workload dimension stable across both the post-session and post-video ratings. The TLX items mental demand, temporal demand, and effort, alongside the SART items demand and supply, from both the post-session and the post-video rating, loaded on this factor. Interestingly, regarding the SA3 items and SART understanding, the post-session and post-video ratings loaded on different factors. Factor 2 was interpreted as an "informed" situation awareness dimension, while Factor 4 was interpreted as a "non-informed" situation awareness dimension. NASA TLX performance post-session loaded mostly on the non-informed SA dimension, and NASA TLX performance post-video loaded mostly on the informed SA dimension. These TLX performance loadings supported the interpretation of an informed and a non-informed SA dimension. Factor 3 was defined by the high loadings from the TLX physical demand rating, both post-session and post-video. TLX frustration, both post-session and post-video, loaded moderately on this factor. Factor 3 was interpreted as a physical demand factor.
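The eigenvalue-exceeding-1 criterion mentioned above can be illustrated with a small sketch that counts the eigenvalues of a correlation matrix that are greater than 1. This is only an illustration of the retention criterion: it does not perform principal axis extraction or varimax rotation, the correlation matrix is hypothetical, and the eigenvalues are obtained with a plain Jacobi rotation routine written for this example.

```python
import math


def jacobi_eigenvalues(matrix: list, tol: float = 1e-12,
                       max_rotations: int = 10_000) -> list:
    """Eigenvalues of a symmetric matrix via Jacobi rotations, largest first."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(max_rotations):
        # Find the largest off-diagonal element in the upper triangle.
        p, q, largest = 0, 1, 0.0
        for i in range(n):
            for j in range(i + 1, n):
                if abs(a[i][j]) > largest:
                    p, q, largest = i, j, abs(a[i][j])
        if largest < tol:
            break
        # Rotation angle that zeroes a[p][q].
        theta = 0.5 * math.atan2(2 * a[p][q], a[p][p] - a[q][q])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):  # A <- A J (columns p and q)
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp + s * akq
            a[k][q] = -s * akp + c * akq
        for k in range(n):  # A <- J^T A (rows p and q)
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk + s * aqk
            a[q][k] = -s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_factor_count(correlations: list) -> int:
    """Number of eigenvalues exceeding 1."""
    return sum(1 for value in jacobi_eigenvalues(correlations) if value > 1)


# Hypothetical correlation matrix with two pairs of correlated items.
corr = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
eigenvalues = jacobi_eigenvalues(corr)
n_factors = kaiser_factor_count(corr)
```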

The SART Demand-Supply element
Investigating the bi-variate correlations and factor analysis suggested that the element D-S (Demand-Supply) was not systematically related to the SART item understanding or to the NASA TLX or SA3 items. The correlation between D-S and understanding was 0.10 (n.s.) and 0.13 (n.s.) for the post-session rating and the post-video rating, respectively. The correlation between the D-S element and SA3 ranged from 0.04 to 0.10 and from 0.06 to 0.09 for the post-session rating and the post-video rating, respectively, and the correlation between the D-S and TLX items ranged from 0.08 to 0.31 and from 0.10 to 0.24 for the post-session and the post-video ratings, respectively. Replacing the SART items demand and supply, both post-session and post-video, with the respective element D-S (Demand-Supply) in the factor analysis above resulted in no clear factor loadings. The loadings on the four factors ranged from 0.003 to 0.17.
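The correlation between the D-S element and another rating, as analysed above, can be sketched as follows. The snippet computes the demand minus supply difference per rating occasion and a Pearson correlation with the understanding item. All ratings are hypothetical, and the helper names are introduced only for this illustration.

```python
from math import sqrt
from statistics import mean


def demand_supply_element(demand: list, supply: list) -> list:
    """D-S element of the SART formula for each rating occasion."""
    return [d - s for d, s in zip(demand, supply)]


def pearson_r(x: list, y: list) -> float:
    """Pearson product-moment correlation between two equally long lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length of at least 2")
    mx, my = mean(x), mean(y)
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


# Hypothetical SART ratings on the 1-11 scale.
demand = [7, 5, 9, 6, 8, 4]
supply = [6, 8, 7, 6, 9, 5]
understanding = [8, 9, 6, 7, 8, 9]

d_minus_s = demand_supply_element(demand, supply)
r_ds_understanding = pearson_r(d_minus_s, understanding)
```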

Sensitivity
Sensitivity of the measures to varying loads for the positions of a control room team was measured by analysis of variance (ANOVA). The scenarios were designed with a predominance of malfunctions and operational challenges for the reactor side of the plant, thereby creating the highest load for the reactor operator. An overall analysis was performed for each index measure, and the effects of team position and rating condition (post-session vs. post-video) were investigated by a 3 × 2 ANOVA. The mean ratings and 95% confidence intervals for the two factors are illustrated in Fig. 3.
For the NASA TLX, the effect of operator position was significant, F(2, 213) = 11.65, p < .001, while there was no significant difference in rating post-session versus post-video, F(1, 213) = 0.12, p = .73. Tukey's post-hoc test revealed that reactor operators' ratings significantly exceeded those of both the turbine operators and the supervisors, p = .006 and p < .001, respectively. The turbine operators' and supervisors' ratings were not statistically different. For the SART, the effect of position was significant, F(2, 213) = 7.34, p < .001, while there was no significant difference in rating post-session versus post-video, F(1, 213) = 1.99, p = .16. Tukey's post-hoc test showed that both reactor operators' and supervisors' ratings exceeded turbine operators' ratings, p < .001 and p = .02, respectively. The reactor operators' and supervisors' ratings were not statistically different. For SA3, the effect of position was significant, F(2, 213) = 6.91, p = .001, as was the effect of post-session versus post-video rating, F(1, 213) = 5.43, p = .02. Tukey's post-hoc test revealed that turbine operators rated lower than supervisors, p < .001. Reactor operators' ratings were not significantly different from those of turbine operators or supervisors.
Table 2. Bi-variate correlation between items. Post-session to the left, post-video to the right. Correlations above or equal to 0.5 are in bold.
Besides the overall analysis, the sensitivity of the post-session and the post-video ratings individually to control room position (supervisor, reactor operator, turbine operator) was analysed. Table 5 shows the F-statistic and the partial omega squared effect size resulting from a one-way ANOVA with position as the independent factor. Generally, the results show that both post-session and post-video ratings were sensitive to varying loads of team positions. In particular, the TLX physical demand, TLX frustration, and SART supply showed high sensitivity to the control room team position. Table 5 shows that the NASA TLX index sensitivity to team positions was quite similar for the post-session and the post-video ratings. Sensitivity of individual TLX items was slightly lower for the post-video than for the post-session ratings, except for performance, which showed slightly higher sensitivity for the post-video ratings. Of the individual items, not considering the indexes, the NASA TLX performance, SART understanding, and SA3 prediction showed higher sensitivity for the post-video ratings than for the post-session ratings. Also, the SART understanding and the SA3 prediction were not significantly sensitive to position in the post-session ratings but were significantly sensitive in the post-video ratings.
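
To illustrate the type of analysis summarised in Table 5, the sketch below computes a one-way ANOVA F-statistic and the corresponding omega squared effect size in plain Python. It assumes the standard conversion omega squared = df_between (F - 1) / (df_between (F - 1) + N), which in a one-way design coincides with partial omega squared. The function names and the ratings are hypothetical and are not data from this study.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of rating lists."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def omega_squared(f_stat, df_between, n_total):
    """Omega squared from F (equals partial omega squared in a one-way design)."""
    effect = df_between * (f_stat - 1)
    return effect / (effect + n_total)


# Hypothetical ratings per control room position (illustrative only).
supervisor = [3, 4, 3, 4]
reactor_operator = [6, 7, 6, 7]
turbine_operator = [4, 5, 4, 5]

f_stat, df_b, df_w = one_way_anova([supervisor, reactor_operator, turbine_operator])
n_total = df_b + df_w + 1
print(round(f_stat, 2), df_b, df_w)                    # 28.0 2 9
print(round(omega_squared(f_stat, df_b, n_total), 2))  # 0.82
```

Unlike eta squared, omega squared subtracts the expected contribution of error variance from the effect, which makes it less biased in small samples. An F below 1 yields a negative estimate, which is usually reported as zero.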

Relating subjective ratings to performance
Validity evidence regarding the relationship with other criteria was investigated by relating the subjective ratings to the SCORE performance index. Table 6 presents the correlation between operator ratings, both post-session and post-video, and the SCORE performance index. Table 6 shows that the workload items correlated negatively with task performance, while the subjective performance and situation awareness items correlated positively with task performance. The NASA TLX index and the majority of individual items, except performance, correlated negatively and significantly with the performance index. Also, the SART demand and supply correlations with task performance were negative. Because the scenarios were relatively demanding, it was expected that an increase in experienced mental workload would relate negatively to performance. The TLX performance item was reversed to ease interpretation of the correlations, and the TLX performance correlated positively with the performance index. Also, the SA3 items and the SART understanding correlated positively with performance. Interestingly, only situation awareness items correlated significantly higher with performance in the post-video condition than in the post-session condition. Similar to situation awareness, the TLX performance correlation trended higher for the post-video rating than for the post-session rating (0.38 and 0.28, respectively), but this difference was not significant.
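
The effect of reversing an item before correlating it with a criterion can be sketched as follows. The example assumes a 0-100 rating scale on which a high raw TLX performance rating means poor perceived performance. The scale range, function names, and all values are hypothetical illustrations and not the study's data or scoring procedure.

```python
from math import sqrt


def reverse_score(rating, scale_min, scale_max):
    """Mirror a rating around the scale midpoint so that high means good."""
    return scale_max + scale_min - rating


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw ratings (high = poor) and performance scores (high = good).
raw_performance_item = [20, 40, 60, 80]
performance_index = [0.9, 0.7, 0.6, 0.3]

reversed_item = [reverse_score(r, 0, 100) for r in raw_performance_item]
print(pearson_r(raw_performance_item, performance_index) < 0)  # True
print(pearson_r(reversed_item, performance_index) > 0)         # True
```

Reversal changes only the sign of the correlation, not its magnitude, so it affects readability of a table such as Table 6 rather than the strength of the relationship.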
To relate workload and situation awareness jointly to performance, multiple regressions of both the post-session and the post-video ratings were performed with the SCORE performance index as the dependent variable. The resulting overall model fit and beta weights are presented in Table 7. Table 7 shows that both the model containing the post-session ratings and the model containing the post-video ratings were statistically significant. The adjusted R² was 0.07 for the post-session ratings and 0.22 for the post-video ratings. Only the SA3 index was statistically significant, and the beta weight increased substantially in the post-video regression. The beta weight for the TLX index was in the expected direction but not statistically significant. Interestingly, the SART index behaved similarly to the TLX workload index.
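
For reference, the adjusted R² reported above penalises the ordinary R² for the number of predictors. A minimal sketch of the standard formula follows. The function name and the input values are hypothetical and are not taken from Table 7.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical model: R^2 = 0.25 with 100 observations and 3 predictors.
print(round(adjusted_r2(0.25, 100, 3), 4))  # 0.2266
```

With three predictors and a sample of this size the adjustment is small, so differences in adjusted R² between two models with the same predictors mainly reflect differences in explained variance.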

Discussion
This study investigated operators' subjective workload and situation awareness ratings just after the scenario was completed (post-session) and after operators' video review of the scenario (post-video) to provide evidence for interpretations of this type of rating. Based on the theoretical basis of the measures, it was hypothesised that operators' post-video ratings of items having an external reference would differ from post-session ratings, while items involving introspection would be rated similarly post-session and post-video. The results supported this hypothesis. Factor analysis resulted in a mental effort factor defined by both post-session and post-video ratings, while the analysis identified separate factors for operators' post-session perceptions of situation awareness and their post-video perceptions of situation awareness. The NASA TLX item frustration was related to other TLX dimensions and performance, as suggested by theory. However, the NASA TLX performance could not clearly be interpreted as an indication of operator self-regulation of effort. The SART items demand and supply were correlated with workload but not with situation awareness items, and the SART element of demand-supply was not related to SART understanding. The operators' rating of the scale based on Endsley's three-level theory of situation awareness was distinct from the operators' rating of workload and was substantially positively correlated with operator task performance.

The NASA TLX
The study results did not support subjective ratings being able to capture the theoretical distinction between demand and effort. Similar findings have been reported in the literature (Hendy, 1995;Braarud, 2020). The factor analysis suggested one factor defined by mental demand, temporal demand, and effort regardless of operators' ratings being performed post-session or post-video. Although the distinction between demand and effort is theoretically sound (Gopher and Donchin, 1986;Hart and Staveland, 1988), it seems that the operator's rating of both mental demand and effort represents the mental effort invested. The NASA TLX scale description for mental demand (Hart and Staveland, 1988, Figure 8) reads "How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or demanding, simple or complex, exacting or forgiving?". The first part of the description may guide the operator to introspect rather than looking outward at task demand; consequently, mental demand is assessed similarly to the operator's perception of mental effort. As hypothesised, plausibly due to the operator's assessment relying on introspection, the correlation between effort and operator task performance was not influenced by being informed about the actual scenario demand.
Complex motivational processes may be difficult to disentangle from subjective scales alone, but the results of operator ratings did not support the interpretation of the NASA TLX performance as an indicator of self-regulation of effort. The TLX performance correlation with either effort or mental demand was low. However, performance substantially correlated with frustration (−0.53 and −0.52, post-session and post-video, respectively), suggesting that perception of poor performance was related to negative feelings. Similar to this study, the literature reports that the TLX performance item does not relate strongly to any of the other TLX items (Hendy, 1995;Bailey and Thompson, 2001;Braarud, 2020), and a hypothetical interpretation is that the TLX performance item represents how satisfied one is with one's own performance rather than self-regulation of workload. In addition to being correlated with performance, frustration correlated substantially with all other TLX items and loaded broadly on several factors, which can be interpreted, as theoretically proposed, as the psychological impact of workload.
A somewhat surprising result, given the highly mental characteristics of modern control room work, was the identification of a physical demand factor. Looking at the NASA TLX scale description for physical demand (Hart and Staveland, 1988, Figure 8), it reads "How much physical activity was required (e.g., pushing, pulling, turning, controlling, activating, etc.)? Was the task easy or demanding, slow or brisk, slack or strenuous, restful or laborious?". Given that the operators were seated at computerised workstations, it is hypothesised that they rated this item based on the physical human-computer interaction, such as navigating between process formats, navigating computerised procedures, and acknowledging alarms. This interpretation corresponds with findings relating human-computer interaction to operator workload (Lin et al., 2013;. Being informed about the scenario's task demand and reviewing one's own performance seemed to have no effect on the operator's rating of physical demand. This seems reasonable since the scenario replay likely did not result in any new insights about the scenario's physical demands. It is also interesting to note that the operators rated physical demand as relatively low, which corresponded to the mental characteristics of control room work. The item was also highly sensitive to the varying load of the team position, e.g., reflecting the physical human-machine interaction related to the reactor operators' relatively high information load.

SART
The results question the interpretation of the SART index as an indicator of situation awareness. Similar to other studies (Hendy, 1995;Selcon et al., 1991;Loft et al., 2015;Braarud, 2020), this study found that the ratings of demand and supply were related to the NASA TLX workload items rather than to situation awareness items. The rating of both demand and supply seems to involve introspection. Similar to TLX mental demand and effort, both the post-session and post-video ratings of the SART demand and supply loaded on the same factor, which can be interpreted as mental effort. However, the SART understanding item loaded on the non-informed and informed situation awareness factors as expected, and understanding correlated with operator task performance.
The challenging part of the SART index seems to be the demand-supply element. There is a lack of evidence that the rating of demand-supply should modify situation awareness beyond the explicit rating of understanding. Unfortunately, the post-session demand-supply element of the SART formula did not relate to the understanding item or any of the SA3 situation awareness items. Also, informing the operators of the scenario demand (the post-video evaluation) did not change the operators' rating of demand in a way that made the demand-supply element relate significantly to the situation awareness ratings. The study's results on SART correspond with previous research (Endsley et al., 1998;Salmon et al., 2009), suggesting that the use of the SART index as a measure of situation awareness is questionable. The SART index may behave differently depending on the level of mental workload and may, in some cases, behave as a measure of workload rather than situation awareness (Pierce et al., 2008;Loft et al., 2015). Consequently, the extent to which the index represents situation awareness more accurately than the item understanding may not be clear. A reasonable interpretation of SART seems to be to interpret the dimensions individually: understanding indicating situation awareness, and demand and supply both indicating mental effort.

SA3
The SA3 rating of situation awareness was clearly distinct from workload. The SA3 items loaded highly on the two situation awareness factors and did not load on the workload factors. As hypothesised, being informed about scenario demand and reviewing one's own performance influenced operators' rating of SA3 situation awareness: the non-informed ratings (post-session) and the informed ratings (post-video) defined separate factors. The SA3 index was also more highly related to operator task performance than the SART index. Also, as expected, the informed rating (post-video) of SA3 was more closely related to operator task performance than the non-informed (post-session) rating. A probable explanation for this result is that, being informed, the operators could base their rating on what actually occurred in the scenario. As such, informed rating compared to non-informed rating would be hypothesised to be more closely related to a hypothetical true measure of situation awareness. Regarding the SA3 measure's three levels, the bi-variate correlations between the three items suggest that operators did not strongly discriminate these three items. The post-session bi-variate correlations ranged from 0.68 to 0.80, which is quite similar to correlations reported for similar items by Matthews et al. (2002), ranging from 0.66 to 0.74. Theoretically, the three levels should be related to each other (Matthews et al., 2002;Endsley et al., 2000). However, separate processes may be involved in the three levels, and human-system interface conditions may influence these three SA elements differently (Parasuraman et al., 2008;Endsley, 2000b).
Table 7. Multiple regression of TLX index, SART, and SA3 on performance. Post-session and post-video ratings.
A plausible explanation for the relatively high bi-variate correlations is that the study was not explicitly designed to distinctively influence the three levels of situation awareness and that the overall rating of the relatively long scenarios made it difficult for the operators to separate the levels.
The positive correlation between SA and operator task performance increased significantly for post-video ratings compared to post-session ratings (post-session correlations ranged from 0.20 to 0.24, while post-video correlations ranged from 0.37 to 0.53). Also, the correlation between SA3 observing and operator task performance increased more from the post-session to the post-video rating than for the other two SA3 items. This indication of a somewhat distinct meaning of the SA3 observing item is plausibly related to the scenario replay providing relatively concrete information on the operator's observation of malfunction symptoms, while providing an improved general basis for the rating of comprehension and prediction. However, the post-video correlations raise the question of to what extent operator ratings represent perceptions of own performance rather than situation awareness (Endsley et al., 1998;Endsley, 2020). The increased correlation could mean that the informed rating approached objective situation awareness, which in supervisory control settings is assumed to be relatively highly related to task performance. Alternatively, the post-video SA rating may have included an element of an informed task performance rating. While subjective SA and performance seem distinguishable, future research needs to investigate whether informed subjective SA rating approaches objective SA or rather represents elements of operator task performance.

Sensitivity of subjective measures
Corresponding with the literature (O'Donnel and Eggemeier, 1986;Endsley et al., 1998;Hart, 2006), the results of this study suggest that subjective measures are sensitive. The NASA TLX index was more sensitive to the control room position than the SART index and the SA3 index. The higher sensitivity of the NASA TLX seems reasonable because the scenarios were designed to impose different task loads on the control room positions. Both post-session and post-video ratings were sensitive to the control room position. However, ratings may be interpreted differently even when they have similar sensitivity. Looking at the SA3 index as an example, the effect size was 0.05 for both the post-session and post-video ratings. However, the SA3 post-session and post-video ratings loaded on different factors, and the post-video SA3 index was more strongly correlated with the operator task performance index than the post-session SA3 index. This type of result reminds us that sensitivity does not equal validity, which is well worth emphasising regarding subjective ratings of complex phenomena.

Study limitations and future research
The study included a relatively modest sample of 18 operators from six control room teams. This relates to the practical challenge of recruiting professional control room operators for full-scope simulator studies, and future research should investigate the replicability of the results and the extent to which the results generalise to less dynamic non-supervisory work. The operators' informed ratings (post-video) were performed after the operators had both been informed about scenario demand and reviewed their own performance. The study did not investigate the separate contributions of being informed and of reviewing one's own performance to the subjective ratings. However, the study demonstrated that being informed about scenario demand and reviewing one's own performance affected ratings of items involving the perception of external events but affected ratings involving introspection only to a limited extent.
The SA3 situation awareness measure was developed for this study. The results and previous studies (McGuinness and Foy, 2000;Matthews and Beal, 2002) suggest that further investigation of this type of subjective situation awareness measure is desirable. Future studies are needed to investigate the degree to which SA3 is correlated with objective situation awareness measures, subjective performance, and confidence in one's own situation awareness to better assess its validity as an indicator of operator situation awareness (Endsley, 2020). Such studies could also investigate to what extent subjective ratings can distinguish between Endsley's (1995a) three levels of situation awareness. It can also be noted that the scale end points should preferably be reversed compared to the items used in this study. The study utilised the so-called 3D quick version of the SART measure. An application of the 10-item version (Taylor, 1990) might reveal nuances of the SART measure not captured in this study. Future studies could also investigate to what extent informed subjective ratings are more closely related to objective measures than non-informed subjective ratings. Hence, future research could address whether more efficient approaches than the scenario replay applied in this study can adequately inform participants' subjective ratings.
It is also worth considering that overall subjective ratings, such as those investigated in this study, are generally not very good at capturing the detailed dynamics of work, nor do they accurately measure human factor phenomena of interest (Lysaght et al., 1989;Endsley et al., 1998). However, establishing the adequate interpretation of subjective ratings, e.g., what the measures are valid for, is important in guiding the selection of measures and the interpretation of their results. To assess complex mental work, subjective measures can serve a purpose in combination with other types of measures (Lysaght et al., 1989;Annett, 2002) and can be applied to screening scenarios or performance episodes for further analysis (Meister, 1976).

Conclusion
The study found that subjective assessment involving introspection seems to have a robust interpretation across conditions, while the interpretation of items involving perception of external events depends on the participants being informed about what actually happened in the scenario. To obtain valid subjective measures of situation awareness and performance, it is recommended to inform participants of the system malfunctions implemented in the scenario and of the system performance implications of their actions. Operators' ratings seem to disentangle separate and interpretable constructs for workload, situation awareness, and performance. However, subjective workload ratings did not distinguish between mental demand and effort. There is evidence of the multidimensionality of the NASA TLX measure applied to complex cognitive work. Operators' ratings could distinguish between mental effort, physical activity, subjective performance, and frustration. The results of the study question the SART index as a measure of situation awareness.
The results did not support an interpretation of the SART index's demand-supply element as an indicator of situation awareness. However, the interpretation of the SART dimensions individually seems reasonable: demand and supply indicate operator effort, while the dimension understanding represents situation awareness. A subjective measure based on Endsley's (1995a) three-level situation awareness theory showed promising results, adding to previous studies of similar subjective measures (McGuinness and Foy, 2000;Matthews and Beal, 2002). The measure was distinct from workload and related to operator task performance. The promising results warrant future research to determine the validity and utility of this type of subjective measure. Future research could investigate whether informing participants about what actually occurred in a scenario results in subjective ratings of situation awareness and performance that approach the results of objective measures. To what extent and how to inform participants adequately and efficiently in subjective ratings could also be further researched. Finally, future research could investigate whether the study's findings can be replicated in related domains and if the findings extend to less dynamic non-supervisory work.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.