1 Introduction

Future military operations may require single individuals to control multiple unmanned aerial vehicles (UAVs) to decrease demand for operators, safeguard human lives, increase efficiency of operations, and increase military capability [1]. To realize this ambition, automated systems must be used extensively to guard against operator overload [2]. For these automated systems to work effectively, operators must rely appropriately on them. Unfortunately, recent work has shown that humans are imperfect at properly calibrating their use of automation [3]. With this in mind, automation researchers have sought to develop closed-loop systems to help operators make proper use of their automated aids [4]. To power such systems, measures must be utilized during operation to continually gauge operator state to inform the system of the proper intervention or interface configuration for a given situation. Recent work in our lab [5] has focused on the problem of fatigue induced by prolonged low workload portions of intelligence, surveillance, and reconnaissance (ISR) missions.

2 Fatigue

Fatigue is a complex and multifaceted construct about which there is an extensive literature but no single, exacting definition [6]. Figure 1 presents a useful conceptual model for understanding fatigue that distinguishes its trait and state components [6]. The experience of state fatigue emerges over time (i.e., time on task) as a result of a fatiguing agent and the moderating effects of fatigue proneness. The conscious and nonconscious components of a person’s fatigue state can be assessed through subjective self-report measures and physiological techniques, respectively [6]. Additionally, fatigue state is influenced by self-regulation, which, along with task performance, can serve as a behavioral gauge of fatigue.

Fig. 1. A simple trait-state model for fatigue. Adapted from [6]

3 Overview of Available Measures and Metrics

In this overview, we refer only to momentary assessments which can be used to gauge fatigue in response to task performance.

3.1 Subjective Measures

Subjective measures are easy to use and provide a more nuanced account of experienced fatigue than physiological or performance-based measures alone can afford. These scales can be unidimensional [7] or multidimensional. Typically, state fatigue inventories are administered following task performance and are compared with a version taken prior to tasking. For example, the Dundee Stress State Questionnaire (DSSQ) [8] assesses indicators of self-regulation during task performance: distress, task engagement, and worry. These indicators identify different fatigue states, such as active and passive fatigue [9]. Subjective fatigue assessments used online are typically simpler than multi-item post-task assessments, often using a single item to gauge fatigue as a univariate construct. The Karolinska sleepiness scale (KSS) [10] assesses sleepiness on a 9-point scale and has been used to assess instantaneous fatigue during driving tasks [11]. Subjective fatigue assessments are typically administered on paper or electronically; however, similar measures have been administered verbally [12].

3.2 Performance Measures

There is a long history of using decreased performance with time on task as a gauge of fatigue [13]. In sustained attention research, this decrease is often referred to as the “vigilance decrement” and can occur as a result of monotony or sustained periods of high task-load. The vigilance decrement can occur within 15 min and is characterized by a steep decline in performance followed by a continued, steady decline [14]. Performance decrements resulting from fatigue may occur even before an operator is aware of them [15]; in this respect, performance measures can be more sensitive than subjective measures. However, the goal of fatigue detection is often to preserve task performance, and a metric that depends on a performance decrement to detect fatigue may therefore not be ideal.
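
As a purely illustrative sketch (not drawn from any of the cited studies), the Python snippet below estimates a simple decrement index as the linear trend of mean reaction time across successive time-on-task blocks; the block structure and sample values are assumptions made only for this example.

```python
import numpy as np

def decrement_slope(block_rts):
    """Estimate a simple vigilance-decrement index: the linear trend of
    mean reaction time (RT) across successive time-on-task blocks.
    A positive slope (RT slowing per block) is consistent with fatigue."""
    block_means = np.array([np.mean(b) for b in block_rts])
    blocks = np.arange(len(block_means))
    slope, _intercept = np.polyfit(blocks, block_means, 1)
    return slope

# Hypothetical example: mean RTs (ms) drift upward over five blocks.
rts = [[420, 430, 415], [440, 455, 438], [470, 462], [490, 505], [530, 520]]
print(decrement_slope(rts))  # > 0 suggests a performance decrement
```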

The explanation for the impact of fatigue on performance depends on the facet of fatigue in question. The model proposed by Desmond and Hancock [16] distinguishes two types of fatigue states: active and passive. Active fatigue is thought to result from a depletion of resources and is brought on by sustained high levels of task-load [16]. Conversely, passive fatigue results from long periods of low task-load and is characterized by loss of task engagement and motivation. With low task-load, operators may adopt an energy conservation strategy and pressure to redirect attention off-task may mount [17]. Performance declines become more severe as these pressures overcome goal based coping strategies designed to maintain attention to the task.

Declines in performance can be detected in a task that is of primary interest or in a secondary task of lesser, or even purely diagnostic, importance [18]. The latter method is typically employed to serve as an early warning of the onset of fatigue, because compensatory coping strategies can typically sustain performance on a primary task for some time after fatigue sets in [18]. Examples of performance metrics susceptible to fatigue include reaction time [19], response speed to emergency events [20], and lapses [21].

Probe tasks are another way to measure operator fatigue, by testing aspects of the task that are not under the operator’s primary focus. The use of a probe in automation monitoring tasks has been demonstrated in several studies [3, 20]. These tasks may assess fatigue by asking operators to recall a certain feature of the task without forewarning them about the question, or by having operators react to an unexpected event. A driving study [20], for example, had participants avoid a van that suddenly appeared in the road. In the studies cited here, the probe task was able to discriminate conditions that induced passive fatigue despite participants’ success in maintaining performance on the primary task.

3.3 Physiological Measures

This discussion will focus on cognitive fatigue measures that can reasonably be used during UAV operation (e.g., we omit functional magnetic resonance imaging; fMRI). Researchers have investigated fatigue and related constructs using several different physiological methods, including metrics derived from cardiac activity such as inter-beat interval (IBI) and heart rate variability (HRV) [18], cerebral blood flow velocity (CBFV) [22], electroencephalography (EEG) [23], and eye tracking metrics [24].

Prominent dual process theories have proposed that information processing is supported by two hierarchical levels, a lower level of automatic processing, and a higher level of controlled processing [25]. Whereas controlled processing is characterized by conscious and effortful processing, lower level processing functions effortlessly and largely without conscious awareness. Physiological measures may gauge the impact of fatigue at the lower, automatic level [24] more effectively than subjective measures of fatigue which are more oriented to higher, controlled level processing.

Electrocortical.

Thus far, research exploring physiological assessments to gauge fatigue in adaptive automation systems has focused on EEG metrics and event-related potential (ERP) analysis [26]. With the onset of fatigue, EEG registers relatively reliable increases in slow wave activity, related to drowsiness and sleep, and alpha wave activity, inversely related to cortical arousal [23]. These changes in wave activity may occur before performance is impacted [27].
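
To make the nature of such metrics concrete, the following Python sketch computes relative theta (used here as a stand-in for slow wave activity) and alpha band power from a single EEG channel using a standard Welch spectral estimate; the 4–8 Hz and 8–12 Hz band boundaries and the relative-power formulation are common conventions assumed for illustration, not the specific metrics used in the cited adaptive-automation work.

```python
import numpy as np
from scipy.signal import welch

def band_power_ratios(eeg, fs):
    """Relative theta (4-8 Hz) and alpha (8-12 Hz) power from a single
    EEG channel; rising values are commonly read as increasing fatigue."""
    f, pxx = welch(eeg, fs=fs, nperseg=int(4 * fs))  # 4 s segments
    df = f[1] - f[0]
    total = np.sum(pxx[(f >= 1) & (f <= 30)]) * df   # broadband reference
    theta = np.sum(pxx[(f >= 4) & (f < 8)]) * df
    alpha = np.sum(pxx[(f >= 8) & (f < 12)]) * df
    return theta / total, alpha / total

# Example with synthetic data (256 Hz, 60 s); real use would stream
# windows of operator EEG and track the ratios over time on task.
fs = 256
eeg = np.random.randn(fs * 60)
print(band_power_ratios(eeg, fs))
```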

Electrocardiographic.

Electrocardiography (ECG) measures have a history of use for detection of fatigue effects with much of the initial work originating in the late 1970s and early 1980s [18]. Typically, fatigue is characterized by an increase in IBI and an increase in HRV. Increase in HRV has been tied to increased self-regulatory effort, or effort to inhibit impulses and persist at difficult tasks [28].
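
As an illustration of how such metrics are derived, the sketch below computes mean IBI and two common time-domain HRV indices (SDNN and RMSSD) from a series of R-peak times; R-peak detection is assumed to have been performed upstream, and the example data are synthetic.

```python
import numpy as np

def ibi_and_hrv(r_peak_times_s):
    """Compute mean inter-beat interval (IBI) and two time-domain HRV
    indices (SDNN, RMSSD) from R-peak times given in seconds."""
    ibis_ms = np.diff(r_peak_times_s) * 1000.0        # successive R-R intervals
    mean_ibi = ibis_ms.mean()
    sdnn = ibis_ms.std(ddof=1)                        # overall variability
    rmssd = np.sqrt(np.mean(np.diff(ibis_ms) ** 2))   # beat-to-beat variability
    return mean_ibi, sdnn, rmssd

# Example: slightly irregular beats about 0.85 s apart (~70 bpm).
peaks = np.cumsum(np.full(120, 0.85) + np.random.normal(0, 0.02, 120))
print(ibi_and_hrv(peaks))
```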

Hemodynamic.

Several measures related to blood flow and oxygenation have been tied to fatigue and performance of vigilance tasks [29] but are seldom, if ever, employed to power adaptive systems. CBFV, as measured by transcranial Doppler sonography (TCD), has been shown to decline reliably with vigilance decrements [22]. Further, declines in the left and right hemispheres depend on task characteristics. Interestingly, CBFV on short tasks, which may gauge resource availability, predicts performance on vigilance tasks, but concurrent measures of CBFV have been less successful in predicting performance [22]. Concurrent CBFV did, however, predict the subjective experience of fatigue. Cerebral oxygen saturation, as measured by functional near-infrared spectroscopy (fNIRS), is not commonly used to detect fatigue [29]. One study [30] found that participants performing a three-hour drive had lower oxygenation in the left frontal lobe than those in a control group who performed no task. However, there was a relationship between declining oxygenation and reaction time.

Eye Tracking.

Like EEG, eye tracking has been evaluated for online state detection (e.g., of workload) in adaptive systems [31]. A large body of research has linked eye tracking metrics to states of fatigue [32]. A prominent eyelid closure metric linked to fatigue is percentage of eye closure (PERCLOS) [33], which is considered a standard drowsiness gauge by many researchers. PERCLOS is the proportion of time that a person’s eyes are more than 80 % closed and reflects slow eyelid closures rather than blinks (<500 ms), which are usually excluded from the computation (e.g., [34]). Lid closures greater than 500 ms are usually defined as microsleeps [35].
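
A minimal sketch of a PERCLOS-style computation, assuming the 80 % closure threshold and 500 ms blink cutoff described above, is shown below; real implementations differ in how closure episodes are detected and windowed.

```python
import numpy as np

def perclos(closure, fs, closed_thresh=0.8, blink_max_s=0.5):
    """PERCLOS-style metric: proportion of samples in which the eyelid is
    more than 80 % closed, counting only closure episodes that last longer
    than blink_max_s (so ordinary blinks are excluded).
    closure: array of eyelid closure fractions in [0, 1], sampled at fs Hz."""
    closed = np.asarray(closure) > closed_thresh
    min_len = int(blink_max_s * fs)
    slow_closed_samples = 0
    run = 0
    for c in closed:
        if c:
            run += 1
        else:
            if run > min_len:               # keep only slow (non-blink) closures
                slow_closed_samples += run
            run = 0
    if run > min_len:
        slow_closed_samples += run
    return slow_closed_samples / len(closed)

# Example: 60 Hz signal, eyes mostly open, with one 1 s slow closure
# (counted) and one ~170 ms blink (excluded).
fs = 60
sig = np.zeros(fs * 30)
sig[600:660] = 1.0
sig[1200:1210] = 1.0
print(perclos(sig, fs))
```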

Another eye tracking method for gauging fatigue is fixation duration. Eye movements consist of frequent, quick movements called saccades, interspersed with periods of steady gaze called fixations [36]. During fixations, perception and cognitive activity occur [35], and extended fixations can indicate difficulty extracting information [36]. Specifically, as a person struggles to maintain focus and attention with fatigue, fixations lasting 150–900 ms, which are associated with cognitive processing, decrease in proportion. Fixations longer than 900 ms, indicative of staring, and shorter than 150 ms, which may relate to low-level unconscious control but not deep processing, increase [35]. Mean fixation duration, by contrast, does not reliably relate to fatigue.
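
The binning itself is straightforward, as the sketch below illustrates using the 150 ms and 900 ms cutoffs from [35]; fixation detection (e.g., by a dispersion or velocity criterion) is assumed to have been done upstream, and the example durations are invented.

```python
import numpy as np

def fixation_bins(durations_ms):
    """Bin fixation durations into express (<150 ms), cognitive (150-900 ms),
    and staring (>900 ms) categories, returning each as a proportion of all
    fixations; a falling cognitive proportion is read as growing fatigue."""
    d = np.asarray(durations_ms)
    n = len(d)
    return {
        "express": np.sum(d < 150) / n,
        "cognitive": np.sum((d >= 150) & (d <= 900)) / n,
        "staring": np.sum(d > 900) / n,
    }

# Example: a mix of short, ordinary, and very long fixations.
print(fixation_bins([120, 180, 250, 400, 650, 950, 1400, 90, 300]))
```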

4 Criteria for Metric Selection

Selection of an appropriate metric for any situation requires close regard for the context within which a state is to be measured and a determination of what state, or facet of a state, is of particular interest. Criteria used to select metrics are not always explicitly enumerated; however, Eggemeier and colleagues [37] identified six properties of workload assessment techniques, of which three were principal: sensitivity, diagnosticity, and intrusiveness. We have framed our discussion of metrics for multi-tasking environments below around these criteria and added one more criterion, robustness. Alongside the presentation of each criterion, we evaluate measures and metrics in the context of our lab’s current effort to identify metrics for online detection of passive fatigue during multi-UAV operation.

4.1 Sensitivity

Sensitivity may refer to signal-to-noise ratio as well as the quickness with which a measure can detect changes in state [37]. Concerning the former, sensitivity of the instrument over the entire range of a state (e.g., from sleep to hyper-vigilance) is not always an important requirement. Rather, an instrument should be chosen that provides sensitivity within the range of the state that is of interest. In our effort, we were concerned with variance in fatigue while operators were still relatively wakeful, so that severe fatigue could be prevented from occurring altogether. Thus, a measure that is very sensitive to variation at higher levels of fatigue (drowsiness) but relatively insensitive to variation at lower levels of fatigue, such as PERCLOS, was not ideal. Further, our aim to detect and respond to operators’ fatigue before their compensatory efforts to sustain performance were exhausted precluded the use of primary task performance decrements as a gauge of fatigue.

Concerning sensitivity related to time, an important concept for online intervention or adaptive systems is the window size of the measure. Window size refers to the amount of time immediately preceding the present moment from which data are analyzed to determine current state. For example, an online measure may use the data from the most recent 5 min. Window size is determined by the amount of data required to yield a reliable indication of state. Our effort did not require immediate fatigue detection, as might be afforded by EEG (4 s window [38]), but did require detection before operator coping efforts failed and performance was impacted.
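
A minimal sketch of the windowing idea is shown below; the 5-minute window and once-per-second sampling are illustrative assumptions, and the summary statistic would in practice be whatever metric the system uses (e.g., a band-power ratio or a fixation proportion).

```python
from collections import deque

class SlidingWindowGauge:
    """Keep the last window_s seconds of (timestamp, value) samples and
    report a summary statistic over that window; window size trades
    responsiveness against reliability, as discussed above."""
    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self.samples = deque()

    def add(self, t, value):
        self.samples.append((t, value))
        # Drop samples that have fallen out of the window.
        while self.samples and t - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def mean(self):
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)

# Example: a 5-minute window over a once-per-second fatigue indicator.
gauge = SlidingWindowGauge(window_s=300.0)
for t in range(900):                        # 15 minutes of simulated samples
    gauge.add(float(t), 0.3 + 0.001 * t)    # indicator drifting upward
print(round(gauge.mean(), 3))               # mean over the most recent window
```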

One interesting consideration is that subjective measures, which are highly time sensitive, may not be sensitive to fatigue that has not yet reached conscious awareness (implicit fatigue) [15]. Indeed, physiological measures may be capable of detecting onset of fatigue before operators can report it, despite requiring a large window size to provide a reliable signal. Implicit fatigue may be a low-magnitude fatigue state that has yet to reach consciousness, in which case a highly sensitive metric is required. Yet, sleep studies suggest that operators can be highly compromised without realizing it [39], which implies a component of fatigue distinct from subjective tiredness. A test would need diagnosticity to distinguish the two components.

4.2 Diagnosticity

Diagnosticity refers to the ability of a measure to distinguish between different components of a construct. It is especially important for measuring fatigue, which is multifaceted. One advantage of multivariate self-report measures relative to univariate self-report measures is their ability to distinguish fatigue types. For example, the DSSQ is able to discriminate active fatigue, which is distinguished by an increase in distress, from passive fatigue, which is associated with a loss of task engagement [9].

Generally, task performance decrements may be less severe for active fatigue, in which performance can be bolstered by maintained effort, than for passive fatigue, which is associated with a loss of motivation [40]. A recent driving simulator study designed to illuminate the relative performance impacts of active and passive fatigue revealed a higher standard deviation of lateral position (SDLP), as well as longer braking and steering response times to an emergency event, for those in the passive fatigue condition compared with controls. Unfortunately, relative differences in performance decrement may also signify greater or lesser fatigue, which might be difficult to distinguish from differences resulting from fatigue type.

Like performance indices, CBFV is useful for detecting fatigue as it shows a reliable decrease with time on task. Whereas disengagement may indicate either a lack of resources or an unwillingness to allocate resources to a task, CBFV may relate specifically to resource availability [22]. Some work has shown that CBFV decreases as a function of task difficulty, and that changes do not occur without a work imperative, e.g., when monitoring automation that is successfully performing a task [41]. Thus, CBFV might serve as a gauge of resource mobilization, making it a potentially useful gauge of active, rather than passive, fatigue. It is in that regard that CBFV may be useful for multi-UAV monitoring.

In addition to the active and passive fatigue facets discussed above [16], an online gauge of fatigue must be able to detect different types of fatigue expression. One important type is described in driving research as highway hypnosis or driving without awareness (DWA) [39]. Anecdotally, this state involves the competent performance of basic tasks without conscious engagement. DWA can be induced by bright points of fixation and highly predictable environments [39], such as monotonous roads. Control stations for UAV operation may be at risk of inducing DWA states, as they offer very little environmental variation and require operators to monitor backlit displays. Although some have questioned the usefulness of eye closure metrics for detecting DWA, this state is purportedly marked by changes in gaze behavior [39]. A binned fixation duration approach [35] might circumvent the diagnosticity problems associated with eye closure.

4.3 Intrusiveness

This criterion refers to the level of disruption caused by the use of a measure. In the case of fatigue, it is important not to use measures that might exacerbate the problem. For example, when operators are actively fatigued, the addition of regular, intermittent subjective fatigue assessments (i.e., instantaneous self-assessment; ISA) may be disruptive to performance [42]. Specifically, such assessments can increase workload when workload is already too high, which can result in task shedding and a failure to respond. Conversely, passive fatigue may produce an acquiescent response bias.

The issue of compounding workload also applies to the use of secondary diagnostic tasking. In laboratory settings, the use of secondary tasking requires assigning the secondary task a low priority [37]. This prioritization remains constant throughout the experiment, which allows secondary task performance to be used as a gauge of workload or fatigue. In practice, evaluating a secondary task in this way is unrealistic because the priority of any secondary task relevant to an operation is likely to shift across situations [37]. Adding a noncritical diagnostic task to operationally relevant tasking presents other problems. Such tasks add workload or, at minimum, distract from important tasking. Operators may reject the tasking as artificial or bothersome and blame it for performance deficiencies [37]. Further, such a method is likely to be valid only at lower levels of workload or fatigue, as this task would be among the first to be shed under higher workload conditions.

Probe tasks may also be disruptive to task performance and not ideal for providing a continuous online gauge of fatigue level. In the driving fatigue study mentioned above [20], drivers were distressed following the emergency situation probe such that expected stress state differences due to the experimental manipulation may have been masked in a post-task assessment. Less jarring probes may be vulnerable to habituation; the operator may learn to change behavior in anticipation of them [43]. As a result, a probe might only be used successfully one time.

Physiological measures offer mixed potential with regard to intrusiveness. Despite the demonstrated utility of EEG for online detection of fatigue, it is relatively impractical for use in day-to-day operations supported by closed-loop adaptive systems. For a UAV operator shift, the setup of a capable EEG system would require an assistant to place electrodes, baseline calibration, and pre-task testing (e.g., impedance checking). During the shift, occasional calibration checks would have to be included to correct for electrode drift and/or unsecured electrodes. After the shift, the system components would have to be removed, cleaned, and stored, likely with the help of an assistant. During operations, EEG electrodes would need to be in contact with the scalp, usually in multiple locations, and a conductive cream or gel typically must be used. The operator therefore must maintain a hairstyle conducive to electrode placement and clean his or her hair after the shift. Further, the electrodes and the unit that relays data may restrict movement during operation, and accidental shifts of the equipment may distort the signal or even cause electrodes to lose contact with the scalp. Future wearable systems, however, may be less intrusive. Similar problems affect the setup and use of TCD, which requires headgear and careful setup and calibration with the help of assistants. Specifically, assistants would need to mount the headgear, find an appropriate window through which to locate the middle cerebral artery, and run a baseline assessment. The headgear and wires would restrict movement of the operator during tasking.

Eye tracking shows promise as a solution to the intrusiveness of other available measures. An eye tracking unit may require only a single calibration for each operator, which can then be used for all subsequent shifts. Further, that calibration can be done by the operator without assistance, and no post-shift cleanup or storage would be required. Eye tracking is restricted in that operators must remain in view of the tracking cameras and must not obstruct large portions of their faces, but these requirements may not be difficult to meet, as attention should be focused in the direction of the monitors for task performance and the hands should remain on the controls.

4.4 Robustness

In a controlled laboratory setting, many metrics may show promise because noise can be eliminated from the signal to be detected. Unfortunately, the applied setting does not offer the same ability to control other variables, and thus it is crucial to account for the noise added by other factors. For example, gaze pattern-based metrics may be problematic for detecting state because of their sensitivity to the relative spatial distribution and frequency of critical signals, which cannot be controlled in an applied setting. EEG and TCD are disrupted by talking and chewing, and by head movements that might cause the electrodes or probes to lose good skin contact.

Thus, the metric must be robust to data quality problems. There are three considerations here. The first is whether missing data for a metric tend to be missing at random or systematically. In the former case, a measure might provide only 500 out of 1000 possible observations but nonetheless yield a reliable assessment; if the loss of data is related to particular events in the scenario, however, the assessment is much more problematic. The second consideration is how quickly the metric loses its ability to detect a state as data quality decreases. The third, and potentially most problematic, is whether the outcome indicated by the metric changes qualitatively as a result of shifts in data quality. One example of this problem arose in our assessment of an eye tracking technique based on binning fixations by duration, which was guided by a previous effort that examined alertness using electrooculography (EOG) [35].

Our expectation was that, with fatigue, participants would show a decreasing proportion of fixations between 150 ms and 900 ms, the purported cognitive fixation range, relative to total fixations. Although cognitive fixations declined as expected overall, at the individual level our performance and subjective data suggested that those with the lowest proportion of cognitive fixations were the best performers and the least fatigued. However, the method we used to identify fixations was potentially sensitive to missing data, which led to an underestimation of fixation duration. This meant that the quickest cognitive fixations may have been categorized as express fixations (<150 ms).

Cognitive fixations may have lengthened with the diminished perceptual efficiency caused by fatigue. This lengthening would cause the incorrectly classified short cognitive fixations to be (correctly) reclassified as cognitive fixations with the onset of fatigue. Operators more resistant to fatigue would maintain shorter cognitive fixations, which would remain incorrectly classified as express fixations. The result is that those who struggled less were less likely to have their misclassified cognitive fixations (those mistakenly counted as express fixations) shifted back into the cognitive bin; the outcome indicated by the fixation duration binning metric thus changed qualitatively when data quality was poor. Maintaining reasonable data quality for this metric is therefore essential.
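
The following toy example, with entirely hypothetical durations and a fixed 40 ms underestimation standing in for the effect of dropped samples, illustrates how the qualitative reversal can arise: the alert operator’s short cognitive fixations fall below the 150 ms cutoff and look “express,” while the fatigued operator’s lengthened fixations remain in the cognitive bin.

```python
def cognitive_proportion(durations_ms):
    """Proportion of fixations falling in the 150-900 ms 'cognitive' bin."""
    return sum(150 <= d <= 900 for d in durations_ms) / len(durations_ms)

def underestimate(durations_ms, loss_ms=40):
    """Hypothetical effect of dropped samples: every duration is shaved by loss_ms."""
    return [d - loss_ms for d in durations_ms]

# Hypothetical true durations: an alert operator with short cognitive fixations,
# and a fatigued operator whose fixations have lengthened by ~80 ms.
alert = [160, 170, 180, 200, 220]
fatigued = [d + 80 for d in alert]

# With underestimation, the alert operator's short cognitive fixations drop below
# 150 ms and are binned as 'express', while the fatigued operator's stay cognitive.
print(cognitive_proportion(underestimate(alert)))     # 0.4
print(cognitive_proportion(underestimate(fatigued)))  # 1.0
```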

5 Conclusions

Using these four criteria in our own effort to identify metrics for passive fatigue, we have concluded that binned fixation duration eye tracking metrics hold the most promise. Like other physiological measures, these metrics are sensitive to lower levels of fatigue and can therefore be used to identify fatigue prior to subjective awareness and performance decrements. Binned fixation duration is diagnostic of passive fatigue whether or not operators are in a DWA state. Finally, eye tracking is noninvasive and easy to set up. Unfortunately, binned fixation duration is not robust to quality problems with our current equipment (a Seeing Machines faceLAB 5 eye tracker recording across two 21 in. monitors at 60 Hz). However, systems far more capable than ours are already available and demonstrate much higher and more consistent tracking quality with a dual monitor setup than we were able to achieve. Although other measures may also improve with technological advances, these improvements are currently less promising than those for eye tracking. For example, less intrusive EEG systems lack the quality of the medical-grade systems typically used in laboratory settings. In closing, we hope that this elaboration on Eggemeier and colleagues’ criteria [37], using our work as a case study, has provided some guidance for metric selection for operator state assessment in a closed-loop system.