Benchmarking Teamwork of Humans and Cobots—An Overview of Metrics, Strategies, and Tasks

Human-robot teaming receives an ever-increasing level of attention in research, development and industry. Novel approaches to task sharing in hybrid teams range from optimized schedules to intelligent cobot assistants with a high degree of autonomy. These approaches must prove their usefulness and benefits compared to manual work or full automation – particularly when it comes to assessing their potential for productive industrial use. This creates a demand for standardized, repeatable benchmarks to compare approaches and measure improvements in a reproducible way. Designing such benchmarks is challenging as numerous aspects, from safety considerations to human factors and team performance, must be considered. This survey seeks to contribute to the future development of benchmarks for the field of collaborative assembly, handling, and industrial cobot applications by giving a comprehensive overview of relevant metrics, evaluation strategies, and tasks for human-robot teams.


I. INTRODUCTION
Human-robot teaming has lately gained increased attention also in the context of industrial applications. Hybrid teams of humans and lightweight robots promise the symbiotic use of individual human and robot strengths in small and medium-sized enterprises (SMEs) [1]. Various planning methods for organizing joint task execution have been proposed: Static teaming approaches target the calculation of fixed schedules for mixed teams (e.g. [2], [3], [4], [5], [6]). By contrast, dynamic teaming methods emphasize situation-dependent co-working similar to human team coordination (e.g. [7], [8], [9], [10], [11], [12], [13]). These methods leverage robot reasoning and decision-making competencies to achieve dynamic online adaptation, e.g. regarding varying assembly sequences, product variety, worker availability etc. They integrate several complex techniques from different fields of cognitive robotics and artificial intelligence to make a robot communicate, sense, plan, and act together with humans in possibly unknown environments.
FIGURE 1. This paper addresses the key steps for benchmarking industrial cobot use: The process starts with the goals of cobot use under the constraint of ensuring safety. Evaluation strategies are then derived. Based on these strategies, concrete tasks and metrics are picked. Insights from HRI research influence metrics and strategies.

Independently of the teaming mode (static or dynamic), the introduction of robot co-workers in SMEs comes with investment costs for suitable manipulators and sensory equipment. Particularly cognitive robot systems as proposed in recent years must therefore be thoroughly tested regarding their usefulness to justify the investment. This creates a demand for structured benchmarks enabling the comparative assessment of different approaches, a demand further increased by the currently prevalent replication crisis in human-robot interaction (HRI) research [14]. Compared to prior benchmarks for intelligent robots (cf. e.g. [15], [16], [17]), establishing such benchmarks for collaborative robots is even more challenging. Metrics, strategies, and tasks must take both human and robot into account: Firstly, human-robot teaming requires an evaluation along multiple scales in addition to task efficiency and effectiveness [18], e.g. human factors and occupational safety. Secondly, these scales are partly based on subjective metrics which cannot be measured directly. And thirdly,
an additional challenge is raised by dynamic methods: Static approaches produce fixed human-robot workflows that are optimal according to some optimization criterion (e.g. physical ergonomics [5]). This yields an objective standard of comparison. By contrast, dynamic approaches are designed to adapt robot behaviour to different situations online. These situations emerge e.g. from human decisions or errors. Hence, teaming performance depends on the interplay of robot and human decisions, actions, and events that are not necessarily deterministic and fully foreseeable for robots. Benchmarking strategies for dynamic teams must therefore cover a range of situations which might occur during joint task execution to explore the average performance to expect from adaptive robot assistants.
Prior publications have already gathered sets of metrics for objective as well as subjective and cognitive aspects in the general field of HRI [19], [20], [21], [22], [23], [24]. In contrast to these surveys, this paper seeks to contribute more targeted insights for collaborative robotics and puts a stronger focus on the overall benchmarking process (Figure 1): After a short description of the literature review process (Section II), we will first reframe relevant metrics from HRI research with regard to typical goals of human-robot teaming in industrial settings (Section III). Strategies for collecting these metrics are then reviewed and discussed regarding their ease of use, reproducibility, and versatility (Section IV). Finally, we will give an overview of recently used tasks and model sets which might inspire future unified benchmark problems (Section V). To the best of our knowledge, this holistic view on hybrid teaming benchmarks has not yet been taken in the literature.

II. METHODOLOGY
This survey is based on literature we gathered during our work in the field of human-robot teaming [25], [26], [27], [28], [29], [30], starting from 2017. We have expanded and completed this literature collection through an exploratory forward and backward search. For each item in our initial collection, we recursively screened the works referenced by the item and more recent works referencing the item.
For the latter, we used the Google Scholar 'Cited by' functionality. The key criterion to include a publication was whether individual studies involved task sharing between human and robot agents according to the following nomenclature: A task or process $(T, \prec_T)$ is a pair of a set $T = \{\tau_1, \ldots, \tau_{|T|}\}$ of $|T|$ subtasks or process steps denoted $\tau_i$ ($i \in \{1, \ldots, |T|\}$) and a partial order $\prec_T$ that encodes ''earlier-later'' relations (also called precedence constraints) between subtasks. Each subtask $\tau_i$ is assumed to be feasible for only one of the agents (human or robot) exclusively, for both agents alike, or for neither of them alone in the case of temporally synchronized collaboration. The problem we seek to benchmark is, hence, the coordinated division of subtasks among the members of a hybrid team. Accordingly, we will not cover benchmarking the performance of individual robot system components (perception, navigation, manipulation, etc.; see [23]) but rather consider measuring the effects of multi-agent teams as a whole. We will refer to the task definition when formalizing evaluation metrics in the next section.
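To make the nomenclature concrete, the following minimal Python sketch represents a task as a set of subtasks with precedence constraints and per-agent feasibility. All names (Subtask, Task, the example subtasks) are illustrative and not part of any cited formalism.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Subtask:
    name: str
    feasible_for: frozenset  # subset of {"human", "robot"}; an empty set would model
                             # steps requiring synchronized collaboration of both agents

@dataclass
class Task:
    subtasks: list                                 # T = {tau_1, ..., tau_|T|}
    precedence: set = field(default_factory=set)   # pairs (a, b) meaning a precedes b

    def ready(self, done: set) -> list:
        """Subtasks whose predecessors are all completed."""
        return [t for t in self.subtasks
                if t.name not in done
                and all(a in done for (a, b) in self.precedence if b == t.name)]

# Illustrative example: a three-step assembly
base  = Subtask("insert_base",   frozenset({"human", "robot"}))
screw = Subtask("fasten_screws", frozenset({"robot"}))
check = Subtask("final_check",   frozenset({"human"}))
assembly = Task([base, screw, check],
                {("insert_base", "fasten_screws"), ("fasten_screws", "final_check")})
print([t.name for t in assembly.ready(done={"insert_base"})])  # -> ['fasten_screws']
```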

III. EVALUATION METRICS
Metrics for measuring aspects of HRI have comprehensively been discussed [19], [22], [23], [24], [31], [32], [33], [34], [35]. In this section, we summarize this extensive body of work. We extracted metrics that can be numerically quantified and compared and put them into the context of industrial human-robot task sharing. These metrics are hereinafter clustered by the major goals that companies pursue when considering human-robot co-working for production. These are [1]
• increasing productivity by combining the strengths of humans and robots and sharing work in situations in which full automation of a process would be inefficient,
• increasing flexibility to reflect the trend towards small-batch production for mass customization without the long changeover times of classical automation systems,
• and increasing job attractiveness to improve the reputation of manufacturing work by reducing physical and mental stress and fostering innovative technologies.
In line with these goals, we have clustered metrics related to productivity as a profitability-oriented view on mixed teams (Section III-A), those measuring flexibility (Section III-B), and those suitable to quantify job quality (Section III-C). These goals cannot be pursued without ensuring occupational safety in line with applicable laws and standards, e.g. as defined by ISO/TS 15066 [36]. Therefore, we have also included metrics related to worker safety (Section III-D). Figure 2 summarizes the metrics considered hereinafter.
A. PRODUCTIVITY
Increased productivity is a central motivation for introducing human-robot teams. We assume that workers are generally skilled and that robots are also equipped with appropriate manipulation and perception skills to work effectively towards the correct completion of the shared task. Based on these assumptions, the first cluster of metrics targets the effects of hybrid teams on goal achievement with regard to efficiency. Definitions of ''efficient goal achievement'' are, of course, strongly application-dependent. For the manufacturing domain, we propose to consider the following aspects:

The overall time to completion $D$ is the time needed to finish a task [23], e.g. assembling one product instance. This time span is also referred to as makespan [5] or cycle time [56] in the terminology of production key performance indicators. Let $D_H$ denote the duration of the purely manual process as carried out by a human worker H. Concrete values can then be estimated with standard motion time systems (e.g. Methods-Time Measurement (MTM) [57]). Further, let $D_{H/R}$ denote the duration of the same task when partly automated by a human-robot team. We then propose defining the cooperative speed-up $S_{H/R}$ [10] as
$$S_{H/R} = \frac{D_H}{D_{H/R}}. \tag{1}$$
A desirable cobot system must achieve high speed-up values across different tasks - reducing the time to completion carries over to production costs induced by human labour as well as robot energy consumption.

Reducing task durations by work sharing is not only a matter of economic considerations but also one of technology acceptance: Expectations of being relieved from parts of a task have been found to influence workers' attitudes towards robots positively [58] - this brings us to the notion of helpful robots ''trying to play a positive role [for humans] with the task at hand'' [37]. Therefore, a robot co-worker should generally be perceived as a helpful partner. Freedman et al. have lately proposed relative helpfulness $H_R$ as a quantitative metric that relates the decrease in generic costs to achieve some goal by human-robot teaming to the cost of human-only task execution [37]. With this definition, $H_R$ is directly related to cooperative speed-up (Equation 1) if we take time to completion as a measure for task costs:
$$H_R = \frac{D_H - D_{H/R}}{D_H} = 1 - \frac{1}{S_{H/R}}. \tag{2}$$
Although dual to cooperative speed-up, relative helpfulness provides a more worker-centred quantitative view on the overall relief induced by robot teammates.

The alignment of subtask allocation with individual agent capabilities is another central aspect of efficiency in human-robot teaming. E.g., humans are still ascribed superior sensorimotor abilities, whereas robots excel in precision [1]. It would therefore be inefficient to allocate a strongly dexterous subtask to a robot while the worker is assigned process steps to place small workpieces with high precision in the meantime - in this example, each subtask would likely be performed far more quickly by the other agent in line with respective capabilities. Structured methods to determine so-called capability indicators have been proposed early [38]. These indicators rate to what extent subtasks $\tau$ are suitable for execution by either human or robot. To this end, they typically condense several criteria into real-valued scores $c_H(\tau)$ for humans and $c_R(\tau)$ for robots, with higher indicator values meaning better suitability of the corresponding agent. Static planning approaches use capability indicators as optimization criteria before task execution (e.g. [6], [38]).
Yet, accumulated capability indicators of subtasks assigned to each agent can analogously be used to evaluate the quality of decisions for dynamic teaming approaches after task execution.
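As one possible way to operationalize this, the following sketch accumulates capability indicators over an observed allocation to rate decision quality after task execution. The indicator values and the aggregation by summation are illustrative assumptions, not the schemes of [6] or [38].

```python
# Hypothetical capability indicators per subtask, in [0, 1]
c_human = {"insert_base": 0.9, "fasten_screws": 0.4, "final_check": 0.8}
c_robot = {"insert_base": 0.6, "fasten_screws": 0.9, "final_check": 0.2}

# Allocation actually observed during (or planned before) task execution
allocation = {"insert_base": "human", "fasten_screws": "robot", "final_check": "human"}

# Accumulated capability score of the chosen allocation vs. the best possible one
achieved = sum(c_human[t] if a == "human" else c_robot[t] for t, a in allocation.items())
optimum  = sum(max(c_human[t], c_robot[t]) for t in allocation)
print(f"capability alignment: {achieved / optimum:.2f}")  # 1.0 = perfectly capability-aligned
```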
Agile robots can make mistakes [59], especially when humans are around and unpredictably modify parts in the workspace. Consider e.g. manipulation failure [23], [59] or erroneous robot task allocation decisions colliding with human actions due to incomplete knowledge of the task progress [10]: any human or robot error requires time to resolve. This timespan is lost from the productivity point of view. The number of errors should therefore be considered an important metric for mixed teams [19]. As we are focussing on the impact on productivity here, the concrete error source is of no importance. Consequently, we can say that trying to execute a subtask can either succeed or fail. Overall, the robot will successfully complete $N^{success}_R \leq |T|$ of all $|T|$ subtasks. To this end, it will issue a cumulated number of $N^{attempt}_R \geq N^{success}_R$ attempts to handle subtasks. The number of attempts may exceed the number of completed subtasks, as subtasks that failed may be retried several times. We can then define the robot error rate $E_R$ as a relative metric:
$$E_R = \frac{N^{attempt}_R - N^{success}_R}{N^{attempt}_R}. \tag{3}$$
This metric can analogously be used to measure human error.

Teaming fluency, as discussed by Hoffman [34], helps to analyse the timing aspect during shared task execution beyond makespans. The corresponding metrics are connected with efficiency and productivity as follows: Human Idle Time $D^{idle}_H$ accumulates timespans during which the human was willing to contribute to task progress but was delayed, e.g. while waiting for the robot to fulfil an ''earlier-later'' constraint. Similarly, phases of Robot Idle Time $D^{idle}_R$ can emerge while waiting for human input or action. Both sorts of idle time indicate poor utilization of production resources and should be minimized from the perspective of productivity. By contrast, measuring long timespans $D^{coop}$ of Concurrent Activity, during which human and robot are equally active, and a small deviation $D_{H/R} - D^{coop}$ between concurrent work and overall task duration indicates a successful division of work and efficient use of available agent capacities. Normalizing the aforementioned measures by the task duration $D_{H/R}$ renders them comparable across different tasks and robot systems. Hoffman [34] furthermore proposed to track the functional delay as the time between an agent's action and the beginning of her/his partner's action. This metric is particularly relevant for approaches where the agents take turns one after another or in the context of collaborative interaction (e.g. when waiting for a part to be handed over); however, this metric is redundant from a productivity-centred point of view, as idle time includes such timespans.
Concurrent activity $D^{coop}$ provides a basic understanding of resource utilization. Still, this metric does not necessarily indicate a proper division of labour amongst agents. Compared to humans, robots may work slowly, e.g. due to safety considerations or limited dexterity. Teamwork will then not speed up tasks significantly. We have previously used the robot participation rate $P_R$ [10] to capture this aspect. This metric relates the number of subtasks $N_R$ handled by the robot to the overall number of subtasks $|T|$:
$$P_R = \frac{N_R}{|T|}. \tag{4}$$
Assuming subtasks of similar durations (as is the case in e.g. pick-and-place tasks within the bounded workspace of typical cobots), a competitive robot co-worker should achieve $P_R$ values close to the fraction of robot to human working speed. Otherwise, the system is not performing up to its potential as a partner. This may be caused by the performance of perception and planning components or by coordination schemes with an overhead of interaction effort - such issues would also be indicated by negative helpfulness scores ($H_R < 0$) and speed-up values $S_{H/R} < 1$. When interpreting $P_R$, one must consider that its value is capped to the percentage of subtasks the robot can contribute to if subtasks exist that only humans are capable of (Section II).
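To illustrate how these productivity and fluency metrics could be derived from an execution log, the following sketch computes speed-up, helpfulness, error rate, participation rate, and normalized idle times from hypothetical timing data; all numbers and variable names are made up for illustration.

```python
# Hypothetical observations from one benchmark run (times in seconds)
D_H = 300.0           # manual baseline duration (e.g. estimated via MTM)
D_HR = 220.0          # duration with the human-robot team
idle_human = 25.0     # accumulated human idle time
idle_robot = 60.0     # accumulated robot idle time
concurrent = 140.0    # accumulated time of concurrent activity
n_subtasks = 12       # |T|
n_robot_done = 5      # subtasks completed by the robot
n_robot_attempts = 6  # attempts issued by the robot (one retry)

speedup       = D_H / D_HR                                            # Eq. (1)
helpfulness   = (D_H - D_HR) / D_H                                    # Eq. (2)
error_rate    = (n_robot_attempts - n_robot_done) / n_robot_attempts  # Eq. (3)
participation = n_robot_done / n_subtasks                             # Eq. (4)
rel_idle_h, rel_idle_r = idle_human / D_HR, idle_robot / D_HR
rel_concurrent = concurrent / D_HR

print(f"S_H/R={speedup:.2f}  H_R={helpfulness:.2f}  E_R={error_rate:.2f}  "
      f"P_R={participation:.2f}  idle_H={rel_idle_h:.2f}  idle_R={rel_idle_r:.2f}  "
      f"concurrent={rel_concurrent:.2f}")
```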

B. FLEXIBILITY
Low changeover times in small-batch production are a major goal of industrial human-robot teaming [1], which we refer to as task flexibility. Programming has been identified as the most time-consuming activity in this context [60], and we can hence assume this step to also constitute the major part of overall changeover times in robot co-working systems.
To quantify this effort, Marvel et al. proposed using the programming time as a metric for multi-robot teams. As there are various programming approaches used for instructing collaborative robots (e.g. visual programming with skills [10] or learning from demonstration [61]), the more general term teaching time $D^{teach}$ will be used hereinafter. It is important to note that this term does not necessarily refer to the teaching of robots only: safe and efficient co-working is supported by well-trained employees [62], and qualification times can thus additionally be taken into account when measuring $D^{teach}$. This absolute metric is hard to compare across tasks. We, therefore, propose the normalized teaching time $\bar{D}^{teach}$: Assume $TA(\tau)$ to break down a subtask $\tau \in T$ of some task $(T, \prec_T)$ into a set of work items with equally short durations using standard task analysis (TA) techniques (e.g. MTM [57]). The normalized teaching time is then given by
$$\bar{D}^{teach} = \frac{D^{teach}}{\sum_{\tau \in T} |TA(\tau)|}. \tag{5}$$
According to this definition, low teaching times per work item ($\bar{D}^{teach} \rightarrow 0$) indicate a robot teammate that can quickly be adapted when partial automation of a new task is desired.
Ongoing operational costs are a key concern when introducing hybrid teams [1]. Even if low $\bar{D}^{teach}$ scores are achieved, the absolute effort for commissioning may still prevent profitable system operation. Therefore, the duration of teaching a new task must also be put in relation to the subsequent use times [39]. This is achieved by the teaching-to-use time rate $T$ with
$$T = \frac{D^{teach}}{N^{lot} \cdot D_{H/R}}, \tag{6}$$
where $N^{lot} \in \mathbb{N}$ denotes the lot size to produce. In line with the effects of automated mass production, $T$ tends towards zero with increasing lot sizes. When comparing two human-robot teaming approaches for small-scale partial automation, lower $T$ scores indicate better task flexibility. Teaching times should, in any case, not exceed the gain in productivity as expressed by a decrease $D_H - D_{H/R}$ in production times per task execution (Section III-A). This interplay between productivity and task flexibility metrics can be expressed with an alternative formulation $H'$ of relative helpfulness
(Equation 2): As this metric is defined for generic costs [37], the lot size and teaching time can be integrated by adding the contribution $D^{teach}/N^{lot}$ of teaching per production cycle to the expected human-robot time to completion $D_{H/R}$:
$$H' = \frac{D_H - \left(D_{H/R} + \frac{D^{teach}}{N^{lot}}\right)}{D_H}. \tag{7}$$
In analogy to Equation 2, high scores for $H'$ indicate a helpful robot with regard to a given task. When we assume that teamwork speeds up tasks ($D_{H/R} < D_H$, or equivalently $H_R > 0$), negative values of $H'$ carry the additional information that teaching efforts exceed the achieved gain in productivity.
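A small sketch of how these flexibility metrics could be computed for a given lot size follows; the numbers and work-item counts are hypothetical, and the formulas correspond to Equations 5 to 7 as reconstructed above.

```python
# Hypothetical inputs
D_teach = 1800.0                      # teaching/qualification time in seconds
work_items = {"insert_base": 4, "fasten_screws": 6, "final_check": 2}  # |TA(tau)| per subtask
D_H, D_HR = 300.0, 220.0              # manual vs. team cycle time (seconds)
N_lot = 50                            # lot size to produce

norm_teach = D_teach / sum(work_items.values())           # Eq. (5): teaching time per work item
teach_to_use = D_teach / (N_lot * D_HR)                    # Eq. (6)
H_prime = (D_H - (D_HR + D_teach / N_lot)) / D_H           # Eq. (7)

print(f"normalized teaching time: {norm_teach:.1f} s/work item")
print(f"teaching-to-use rate: {teach_to_use:.3f}, H': {H_prime:.2f}")
# H' < 0 would indicate that teaching effort outweighs the productivity gain at this lot size.
```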
In addition to task flexibility, particularly dynamic teaming methods are also directed towards teaming flexibility, i.e. towards adapting to variance during the joint execution process. A need for such adaptations can e.g. arise in situations when a worker leaves temporarily, e.g. at shift changes, or when handling a more important intermediate task is necessary [10], [13]. Further aspects can be derived from considerations on robot agility in general [59]: To what extent can a robot handle (self- or human-induced) failure in the process? Can it adapt to environmental changes, e.g. a tool moved to a different position by workers? All these aspects of teaming flexibility can be covered by classifying the autonomy level of a robot co-worker. To this end, early definitions of Levels of Automation (LOA) for the division of work among humans and any sort of machines in general (e.g. [63], [64]) have inspired more specialized taxonomic frameworks for HRI. In particular, the Levels Of Robot Autonomy (LORA) proposed by Beer et al. [40] are suited for evaluating human-robot teaming [18]. Respective frameworks enable a categorization of the overall system, e.g. in the case of LORA by evaluating the function allocation to either human or robot for the basic actions of sensing, planning, and acting [40].
For the task-sharing scenarios under consideration in this survey, answers to the question of whether certain functions are allocated to the robot are not necessarily binary for a given system - they may depend on the concrete task that a mixed team works on: Consider e.g. a stationary robot manipulator with a camera attached to the robot hand as the only sensor. According to the LORA taxonomy, a task-sharing system using this sensory setup will certainly implement sensing capabilities for object detection in general. Still, the degree of autonomy will depend on the concrete task design. If parts are out of perception range, support by the partner is required to gather information on these objects. Similarly, the capability to act on parts may vary within a continuum depending on part locations and robot reach. A quantitative placement along this continuum of shared control in which humans, as well as robots, are accountable for a (possibly sliding) amount of allocated functions [41] can be achieved by measuring the amount of intervention or using the related neglect tolerance metric [40]: The amount of intervention is defined as the fraction of time during which a human controls the robot [41]. Yanco and Drury have proposed it in the context of mobile robots able to navigate with varying amounts of control by a human supervisor - this is opposed to our task-sharing setting in which the human task is not supervision but equally contributing to a shared goal. We can still reframe this metric accordingly: Human intervention can be assumed to occur as punctual events of short duration in the context of productive robot co-workers, e.g. providing pieces of information, supporting collaborative subtasks, or especially helping out in case of robot failure. The amount of intervention can then be measured implicitly via other metrics, e.g. the interaction effort (Section III-C) or the robot error rate (Equation 3).
Yanco and Drury have further defined the autonomy level as ''the percentage of time that the robot is carrying out its task on its own'', complementary to the amount of intervention [41]. Similarly, neglect tolerance has been introduced by Olsen and Goodrich [35]. Neglect tolerance quantifies the timespan that a system can work without human attention or intervention on a level of effectiveness that is considered sufficient (so-called neglect time). Although originally motivated by man-machine interface design for remote robot operation, neglect tolerance has semantics well-suited for industrial co-working: High neglect tolerance expresses a productive, failure-proof robot even in situations when humans are not available or not willing to interact. Consider e.g. a cobot which needs humans to explicitly confirm each subtask they completed (e.g. [65], [66]). Systems relying on this coordination scheme will stop working if the input on task progress is no longer provided. They have low neglect tolerance. By contrast, approaches with implicit coordination by action or world state observation (e.g. [10], [11]) are more neglect tolerant. Human attention is here only needed when a subtask requires the collaboration of both agents simultaneously. For precise measurements of neglect tolerance, Olsen et al. have proposed to observe the mean neglect time $\bar{D}^{neg}$ between two subsequent user interactions by observing the points in time when users actually intervene, or effectiveness drops below the acceptable threshold [35]. For our nomenclature regarding task sharing (Section II), we can e.g. say that a robot is sufficiently effective as long as it productively works on one of the subtasks and that it is no longer effective when entering an idle time phase (Section III-A). As with participation rates (Equation 4), neglect tolerance must be carefully interpreted in the context of the task at hand. For the same robot system, a task which is dominated by subtasks that only humans are capable of might lead to low neglect tolerance. In contrast, tasks with a high potential for parallel work will yield high neglect times.
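As an illustration of this measurement idea, the sketch below estimates the mean neglect time from a list of timestamps at which the human intervened or the robot became idle; the event list is hypothetical and the simple averaging of gaps is an assumption rather than the exact procedure of [35].

```python
# Hypothetical timestamps (seconds) at which the robot required human attention,
# i.e. the human intervened or robot effectiveness dropped (e.g. an idle phase started)
attention_events = [0.0, 45.0, 130.0, 150.0, 260.0]

# Mean neglect time: average interval during which the robot worked unattended
gaps = [b - a for a, b in zip(attention_events, attention_events[1:])]
mean_neglect_time = sum(gaps) / len(gaps)
print(f"mean neglect time: {mean_neglect_time:.1f} s")  # -> 65.0 s
```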

C. JOB QUALITY
In addition to productivity and flexibility, another goal of cobot use is to enhance human job quality. There are two major research fields which offer metrics to quantify working conditions: physical ergonomics assesses loads impacting the human body to prevent disorders of muscles, nerves, and joints (Section III-C1). By contrast, cognitive ergonomics aims at mental health and perceived comfort (Section III-C2). We will hereinafter review relevant metrics from both areas.

1) PHYSICAL ERGONOMICS
There are several established tools for assessing the physical ergonomics of a workplace: frameworks such as the Strain Index (SI) [42], Revised Strain Index (RSI) [67], Rapid Entire Body Assessment (REBA) [43], Rapid Upper Limb Assessment (RULA) [44], European Assembly Worksheet (EAWS) [68], or the Washington Industrial Safety and Health Act (WISHA) [69] have been used as indicators to evaluate human-robot workflows with regard to human physical strain (e.g. [2], [3], [5], [18], [70], [71], [72], [73]). These frameworks condense the information on the human posture (encoded by limb angles, the weight of handled parts etc.) into a single numerical score. Typically, lower values indicate ergonomically favourable situations -accordingly, unfavourable subtasks can be shifted to robot co-workers to reduce the physical load by capability-based task allocation. On top of the indicators for single process steps or motions, attempts have been made to cumulatively rate the effects of sequences of actions, ultimately taking muscle fatigue into consideration [71]. Aside from the ergonomics scores based on observing limb angles with the aforementioned frameworks, electromyography (EMG) records of the electrical signals transmitted to muscles can be used to estimate muscle activity and fatigue online [74], [75], [76], [77]. We refer the reader to further literature that discusses measurement methods for physical risk assessment in depth [31], [78].

2) COGNITIVE ERGONOMICS
Improving cognitive ergonomics can influence productivity and product quality positively [79], [80]. Hence, the benefits of task sharing and enhanced physical ergonomics should not be compensated by the negative impacts of cobots on cognitive ergonomics, e.g. increased stress or fatigue as a consequence of high mental workload [81], [82] or of resistance to cooperating with the robot [83]. In contrast to the above productivity, flexibility, and physical ergonomics metrics, assessing concepts related to human factors is more challenging as they concern subjective human impressions during the teaming processes. These are investigated with two predominant strategies, which the remainder of this section puts emphasis on:
• Questionnaires are the most common tool for participant self-reporting in human factors analysis. They are frequently composed of questions to rate one's impression on a 5- or 7-point Likert scale and have been applied to measuring a broad range of specific aspects.
• Physiological Measurements can be used to monitor respiratory rates, heart activity (electrocardiography, ECG), brain activity (electroencephalography, EEG), eye activity (electrooculography, EOG), or the electrodermal activity (EDA; also known as galvanic skin response).
Aside from these methods, there are further, less frequent strategies which we name for the sake of completeness: (i) Direct Input Devices, such as joysticks or sliders, allow subjects to input values continuously. The input data can then e.g. be used to quantify emotions in terms of valence and arousal [84], [85]. (ii) Behavioural Assessment relies on video recordings of subjects to categorize their behaviour after the actual experiment [27], [86], [87], [88]. (iii) Computational Models relate quantitative values to human factor concepts, e.g. trust to objective metrics [89], [90], [91], [92], cognitive workload to physiological signals [93], [94], stress to skin temperature [95], or anxiety to facial expressions [96].
These methods are used to collect data on a wide variety of human factors. To identify concepts previously raised in the context of human-robot task sharing, we reviewed prior surveys of Baltrusch et al. [97], Hopko et al. [98], Lorenzini et al. [31], Nelles et al. [21], Rubagotti et al. [52] and Wurhofer et al. [99]. We unified their terminology and extended them towards recent publications. This gave us the list of concepts below, with prominent examples of applied measurement instruments (see Tables 1 and 2 for a comprehensive list of questionnaires and physiological measures used with these concepts):
Cognitive Workload is a ''multidimensional concept that consists of four components: 1) task complexity; 2) mental workload; 3) performance; and 4) depletion factors (e.g. stress, fatigue, motivation)'' [143]. Reaching the limits of one's cognitive capacities can lead to stress and anxiety [82]. In the long run, high mental workload causes fatigue, increases error rates, and hence decreases performance. The well-known NASA Task Load Index (NASA-TLX) [45] puts emphasis on the perceived physical and cognitive workload when using a system. Helton et al. have used an extension of the NASA-TLX towards perceived teaming workload (e.g. in terms of perceived effort for coordination and communication, team support etc.) [100]. If only cognitive workload is relevant, the Subjective Workload Assessment Technique (SWAT) [102] is a validated alternative. When only mental effort is relevant, the Rating Scale Mental Effort (RSME) [103] is another established measurement instrument. It gives users more guidance by providing descriptions at certain scale levels [150]. These questionnaires can be complemented with physiological signals related to cognitive load (Table 2).
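For illustration, the following sketch computes an unweighted ('raw') NASA-TLX score as the mean of the six subscale ratings; the ratings are invented, and the raw-TLX simplification (skipping the pairwise-comparison weighting of the full instrument) is a common but assumed shortcut here.

```python
# Hypothetical NASA-TLX subscale ratings on a 0-100 scale for one participant
tlx_ratings = {
    "mental_demand": 55, "physical_demand": 30, "temporal_demand": 60,
    "performance": 25,   # lower = better perceived performance on this subscale
    "effort": 50, "frustration": 35,
}
raw_tlx = sum(tlx_ratings.values()) / len(tlx_ratings)
print(f"raw NASA-TLX workload: {raw_tlx:.1f} / 100")  # -> 42.5
```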
Affect refers to the experience of feelings, emotions, or mood [151]. Negative emotions towards the robot and the interaction with it may degrade trust and acceptance. Frequently investigated sub-aspects of affect are anxiety, frustration, emotional stress and (dis)comfort. Prominent scales with a focus on measuring these sub-aspects are the State-Trait Anxiety Inventory (STAI) [47], the Positive and Negative Affect Schedule (PANAS) [105], and the Negative Attitudes Towards Robots scale (NARS) [106], [107].
TABLE 2. Overview of studies which target human-robot cooperation and use physiological measures for human factors analysis. References in grey indicate that the study did not find statistically significant differences between conditions with respect to the concept.
Satisfaction is the ''extent to which the user's physical, cognitive and emotional responses that result from the use of a system (...) meet the user's needs and expectations'' [152]. Important factors influencing satisfaction are feature consistency, robot support [100], contentment with the interaction [99], self-efficacy [108], trust, and pleasure/frustration [152]. Satisfaction is (besides efficiency and effectiveness) a core dimension of usability [152]. The most prominent usability evaluation method is the System Usability Scale (SUS) [46]. It accurately captures the usability and learnability of a system [153] while only moderately correlating with task performance [154]. A compact variant is the Usability Metric for User Experience (UMUX) [101].

Subjective Performance complements the objective performance metrics (Section III-A). Respective questionnaires capture how subjects perceive overall efficiency, effectiveness, and output quality, as well as the individual contribution of human and robot to the task. E.g., objective productivity metrics such as the concurrent activity have counterparts capturing human subjective ratings of teaming performance in variations of the fluency questionnaire [34].
Acceptance encompasses a person's attitude and behaviour towards a robot [58], and behavioural acceptance may range from commitment to refusal. Early models for measuring the acceptability of automation are the Technology Acceptance Model (TAM) [155] and the Unified Theory of Acceptance and Use of Technology (UTAUT) [108]. When using more recent extensions of these models, as proposed by Venkatesh et al. [156], [157], it must be taken into account that interaction with technology can be mandatory rather than voluntary in industrial settings [158], [159]. A better-suited variant, particularly for human-robot cooperation, has been constructed and validated by Bröhl et al. with ease of use, usefulness, and intention to use as core factors [113].
Personality refers to the personality traits that humans attribute to the robot based on its behaviour. Examples are likeability, intelligence, appreciation, respect, cooperativeness or legibility of behaviour. The Godspeed questionnaire is a well-known example of capturing these aspects [104].
Interaction Quality refers to the perceived fluency and naturalness during joint task execution. Coordination, communication, and time-sharing demands [100], as well as the experienced teaming and waiting times, are important factors influencing this concept. Similar to subjective performance, some aspects of interaction quality can objectively be measured, e.g. with productivity-related fluency metrics such as human idle times, functional delays [34], or the interaction effort [35]. A questionnaire for the subjective fluency of HRI has been introduced by Hoffman et al. [160] and modified for several studies [48], [61], [109], [161]. Paliga et al. have refined the concept into human-oriented, robot-oriented, and team-oriented components [111]. Importantly, subjective fluency scores do not necessarily correlate with corresponding objective measures: a cumbersome interaction which objectively increases efficiency can feel less fluent than a natural but less performant interaction [34].
Trust is a multidimensional concept which emerges from the interaction of two partners. Wurhofer et al. define it as ''the extent to which the user feels confident that the system will behave as intended'' [99]. An appropriate level of trust forms a keystone towards efficient teamwork [89]: Both overtrust (i.e. users overestimating robot capabilities) and distrust (i.e. users underestimating a robot and, hence, intervening too frequently) can degrade the overall team performance [162]. Established measurement instruments for trust in automation exist [163], [164], but these are not directly usable for HRI [165], where robots can act autonomously and humans may play the role of a teammate [166]. In consequence, modelling techniques (see e.g. the surveys of Khavas et al. [167] and Hancock et al. [168]) and items in several questionnaires (Table 1) have been proposed to specifically target trust in HRI. Yet results must be cautiously interpreted since subjects tend to put more trust in robots when experiments are conducted in controlled lab settings [48].
Situational Awareness (SA) is ''the perception of the elements in the environment within a volume of time and space, comprehension of their meaning and the projection of their status in the near future'' [169]. High SA can thus help to predict and prevent mistakes. We found three major strategies to evaluate SA: (i) Freeze-probes freeze the task at some point in time and ask the user questions about his/her understanding of the situation [51], [110]. (ii) Attention can be measured by observing for how long subjects' gaze is directed towards certain regions, either with an eye-tracking system [96], [136] or with questionnaires [53]. (iii) Intervention can be observed by intentionally creating exceptional situations and giving users a small time window to understand and react [136].

D. SAFETY
Keeping humans safe at any time during HRI is a primary obligation for ethical reasons in general and due to laws and regulations in manufacturing environments in particular. Two sorts of safety are commonly distinguished [20]: Physical safety is concerned with preventing unintended, forceful human-robot contact, which might injure the human body. Even if a robot is technically capable of stopping in time to avoid such harm, overly fast motions close to a human may still cause discomfort [85] - this sort of psychological harm, as also induced by stress, anxiety, or the violation of social norms, renders an additional consideration of psychological safety necessary. The achievement of safety can be measured with the following metrics: Regarding physical safety, the International Organization for Standardization defines the protective separation distance and force limits, which may not be exceeded in case of direct contact [36]. The protective separation distance is defined as the minimum distance that lets a robot stop before colliding with a human based on speed, reaction times and positioning uncertainties. Accordingly, comparing the current distance (and the currently exerted force in case of wanted human-robot contacts) with the protective separation distance can be used as a metric to judge the obtained level of physical safety. This comparison requires estimating the human body pose, e.g. by inertial measurement units (IMU) [170], [171], [172], [173], by colour segmentation [174], by skeleton tracking in camera images [175], [176], with data gloves [177], [178], or with dedicated markers attached to the body and captured by a surrounding detection system [179], [180], [181], [182], [183]. An alternative perspective on separation distance is the time to collision [184]. We direct the reader to Kumar et al. [33] for an in-depth discussion of speed and separation monitoring and a list of measures.
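To make the comparison concrete, the following sketch evaluates a simplified speed-and-separation-monitoring check; the additive structure loosely follows the protective separation distance of ISO/TS 15066, but the parameter values and the constant-speed approximations are assumptions for illustration, not normative values.

```python
def protective_separation_distance(v_human, v_robot, t_reaction, t_stop,
                                    d_stop, intrusion, unc_human, unc_robot):
    """Simplified protective separation distance (additive terms inspired by ISO/TS 15066):
    human motion during reaction + stopping time, robot motion during reaction time,
    robot stopping distance, intrusion distance, and position uncertainties."""
    s_human = v_human * (t_reaction + t_stop)
    s_robot = v_robot * t_reaction
    return s_human + s_robot + d_stop + intrusion + unc_human + unc_robot

# Hypothetical parameters (metres, seconds, m/s)
s_p = protective_separation_distance(v_human=1.6, v_robot=0.5, t_reaction=0.1,
                                      t_stop=0.3, d_stop=0.15, intrusion=0.1,
                                      unc_human=0.05, unc_robot=0.02)
current_distance = 0.9  # e.g. obtained from skeleton tracking
print(f"S_p = {s_p:.2f} m, margin = {current_distance - s_p:+.2f} m")
# A negative margin would indicate that the robot must slow down or stop.
```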
According to Rubagotti et al. [52], the core concepts related to psychological safety are trust, comfort, stress, fear, anxiety, and surprise. We have already encountered these aspects in the context of cognitive ergonomics (Section III-C), where we subsumed a part of them (comfort, stress, and emotions like fear and surprise) under the notion of affect. As a consequence of this similarity, perceived psychological safety is often evaluated with questionnaire items in conjunction with these aspects (e.g. [48], [51], [104], see Table 1), with items explicitly targeting safety (e.g. [49], [50], [112]), or via physiological measurements in the case of anxiety (Table 2).

E. COMBINED METRICS
A benchmark to comprehensively evaluate human-robot teaming must ideally cover all relevant goals. By combining complementary metrics, one better understands a system's overall impact on workers. We found two ways in the literature to achieve this: (i) Several goals can be combined into a single numerical indicator. Zhang et al. [54] have proposed the Throughput Rate per Unit of Work Effort Time. It combines productivity and physical ergonomics by putting throughput rates and the Strain Index into relation. Rephrased in the formalization nomenclature of this paper, this measure is defined by
$$\frac{1}{D_{H/R} \cdot SI}, \tag{8}$$
with throughput rate $1/D_{H/R}$ [56] and the overall Strain Index [42] $SI$ accumulated across all subtasks.
(ii) Alternatively, several measurement instruments can be combined into more complex evaluation frameworks. Following this strategy, the framework of Gervasi et al. [18] embeds established metrics from Sections III-A to III-D (e.g. Levels of Robot Autonomy, NASA-TLX, EAWS, SUS) into a higher-level rating scheme. Similarly, Wallström and Lindblom [185] have proposed to combine measures for productivity (e.g. effectiveness and efficiency) and for job quality (e.g. trust and safety) into an HRI design process inspired by user experience (UX) design goals. An important feature of the latter method is that it does not require large sample sizes - a convenience sample is often sufficient, hence rendering the framework well-suited for early prototype design [185].

F. DISCUSSION
Productivity, flexibility, job quality and safety are the main goals for evaluating industrial human-robot applications. Metrics to measure productivity (Section III-A) and flexibility (Section III-B) are often objective and easily measurable by an observing experimenter (e.g. task completion time, idle time, teaching time). There are also objective metrics to quantify aspects of job quality (e.g. physical ergonomics frameworks, Section III-C) and safety (e.g. separation distances, Section III-D), some of them requiring more sophisticated measurements of physiological signals (e.g. anxiety). However, particularly job quality is strongly linked to human factors, which are predominantly evaluated with users' self-reports based on questionnaires. Physiological signals and questionnaire-based metrics come along with specific challenges, as discussed below - more general guidelines for the design and conduct of HRI studies to gather these metrics are outlined in Section IV-E. Particularly questionnaires need to prove their reliability and validity, i.e. it must be shown that there is a ''correlation between respondent's scores and the true level of the concept being measured'' [186] and that there is a high ''degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses'' [187]. Long-term established and well-known standard questionnaires often have a broad range of literature showing under which circumstances reliability and validity can be assumed (e.g. SUS [46], NASA-TLX [45]). Whenever using modified or self-designed questionnaires, researchers should consider newly proving the reliability and validity as a part of their evaluation. This can e.g. be achieved by adhering to the structured design process for measuring new concepts as proposed by Rueben et al. [186] (Figure 3). A popular way to prove reliability is to investigate the internal consistency via the tau-equivalent reliability (better known as Cronbach's alpha) - see Cho et al. for detailed guidelines [188]. In their work, Rueben et al. also discuss indicators related to validity in the process of creating and adapting questionnaires. Despite following these guidelines, the response process can still undermine the validity of the measurement instrument. Different sorts of biases need to be considered when interpreting results, e.g. a bias towards the extremes of a scale, the social desirability bias [189], or subjective biases resulting from one's personal (dis)likes [190]. Rosenthal et al. [191] give detailed insights into study design for behavioural assessment.
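As a small illustration of such an internal-consistency check, the following sketch computes Cronbach's alpha for a hypothetical questionnaire scale from participant-by-item ratings; the data matrix is invented, and the formula is the standard alpha definition rather than the full procedure discussed by Cho et al. [188].

```python
import numpy as np

# Hypothetical ratings: rows = participants, columns = items of one questionnaire scale
ratings = np.array([
    [4, 5, 4, 3],
    [3, 4, 4, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
])

k = ratings.shape[1]                         # number of items
item_vars = ratings.var(axis=0, ddof=1)      # per-item variances
total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the summed scale
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")  # values >= 0.7 are commonly taken as acceptable
```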
Compared to questionnaire-based self-reporting, physiological signals promise less subjective bias and higher sampling rates [121]. In turn, two other challenges arise: (i) Corresponding measurements of the underlying physical property must be obtained. This can partly be achieved with consumer devices, such as smartwatches or chest-strap sensors (e.g. for measuring the heart rate), but often requires specialized medical devices (e.g. EEG). For the latter, high deployment costs, limitations of the human movement space, accuracy issues, and the problem of choosing an appropriate measurement horizon and baseline measurement must be considered [192], [193]. (ii) The measured values must relate to the concept under investigation, i.e. the question of reliability and validity arises for questionnaires and physiological measures alike. Finding a fitting measure and categorization is still challenging: The current literature is e.g. still inconclusive regarding the best-suited combination of physiological signals for measuring cognitive workload [31], [143], [144], [145], [146] when physical workload and anxiety towards the robot might change as well during the same experiment. Models developed so far might hence still be insufficient to match the results obtained with self-reports [115], [144], [145], [146], or insufficient to discriminate different conditions (grey references in Table 2).
A comprehensive evaluation needs more than one metric to cover all concepts relevant to human-robot teaming (Section III-E). However, using multiple measures and evaluation methods for a single concept within the same study should also be considered. This can help to reduce measurement errors and leads to more valid, reliable, and conflict-free results with regard to a given concept [194]. The strategy of using three or more measures is generally called triangulation (see e.g. [195] for a general in-depth discussion). HRI studies have e.g. combined questionnaires with objective physiological measures [53], [196], [197] and/or task performance metrics [115], [198].

IV. EVALUATION STRATEGIES
Insights on human-robot teaming can be gathered by applying different evaluation strategies to a given system. We will discuss these hereinafter: Starting with considerations on research demonstrators in Section IV-A, Section IV-B gathers recent works on user study design. Section IV-C then elaborates on simulation-based evaluation strategies where human-robot interactions are observed (partly or fully) in virtual spaces. We conclude with analytical performance models in Section IV-D and discuss the properties and interrelations of individual methods in Section IV-E.

A. RESEARCH DEMONSTRATORS
Demonstrators play a vital role in research and development processes. Moultrie has formulated a comprehensive model that differentiates between several types of demonstrators according to their purpose [199]. For basic research on human-robot teams, physical prototypes are important artefacts for advancements under laboratory conditions [200]. Yet, partially or fully virtual prototypes of hybrid workstations may also be built with modern simulation and virtual reality techniques (e.g. [201], [202], [203]). Demonstrators can moreover serve as 'boundary objects' between individuals with different expertise (e.g. academic researchers and stakeholders from industry) [199] when approaching higher technology readiness levels. They are then used as common ground for structured interdisciplinary dialogue and participative design steps, e.g. when jointly investigating human-robot safety solutions [204]. Demonstrators are themselves an evaluation strategy as any proof-of-concept implementation directly measures 'feasibility' as a binary metric. This is usually achieved by showcasing concrete use-cases (e.g. [205]) or by reporting on the typical sequence of actions when interacting with a cobot (e.g. [206], [207], [208]).

B. HUMAN-PARTICIPANT STUDIES
Based on a prototype system, human-participant studies (often also called user studies) are frequently used experimental tools for evaluating HRI [209]. They enable users to work with a system so that researchers can observe and gather measures (cf. Section III) as required for judging and comparing the performance of different approaches. Human-participant studies will, hence, certainly be part of future benchmarking protocols. This section is intended to introduce a brief taxonomy of design characteristics and methods (Figure 4). A more comprehensive coverage of the field is provided by recent works and guidelines of Hoffman [210], Bartneck et al. [209], and Bethel et al. [211].
Three major study paradigms can be differentiated [212]: Insight-driven studies are directed towards developing general ideas or new theories in a problem context. Examples from the field of industrial cobot use are e.g. understanding (un-)desired job attributes to inform cobot deployment by interviewing assembly line workers [213], finding aspects relevant to trust in robots [50], or identifying challenges associated with cobot deployment [214], [215]. In contrast to the exploratory nature of insight-driven studies, a design study is specifically directed towards designing concrete robots or robot behaviours, e.g. during participatory work on demonstrators [204]. Finally, hypothesis-driven studies seek to objectively test hypotheses with statistically significant, numerical data. In the context of industrial applications, we can e.g. hypothesize that a mixed human-robot team will enhance productivity, job quality etc., compared to the baseline condition of the same process when performed by a human.

The above paradigms are pursued with qualitative and quantitative experimental procedures according to the type of data they provide [210], [212]. Qualitative methods, such as (semi-)structured interviews, focus groups, generative activities, or reflective and narrative accounts, provide textual participant responses, field notes, audio or video recordings to be interpreted by the researcher. The resulting data can typically not be expressed numerically [209] - qualitative methods are, hence, predominantly used for insight-driven and design studies [212]. By contrast, quantitative methods condense complex aspects into directly and objectively comparable numerical measures as needed for statistical hypothesis testing. The chosen measure defines the experimental procedure during quantitative studies: Subjective measures are gathered from participants by asking them to self-report their experience with the system, e.g. by answering questionnaires. By contrast, objective measures can be obtained through independent observations, e.g. measuring task durations or human physiological variables [216]. With most of the metrics discussed in this paper being numerical, Section III is a catalogue of subjective and objective quantitative measures.
Recruitment and involvement of participants are of crucial importance in human-participant studies. It is not only important to ensure a sufficient number of participants to achieve an appropriate level of significance (e.g. using an a priori power analysis [194]) but also to consider how these participants are divided into groups to test a system under different conditions. With regard to this design dimension, a separation into between-subject, within-subject, and mixed-model studies is common [194]: In a between-subjects study, the participants are randomly divided into several groups. Each group will then take part in one variation of the experiment, and results will be compared across the groups (e.g. [125], [217], [218], [219], [220], [221]). By contrast, each subject experiences several experimental conditions in a randomized ordering in within-subject studies (e.g. [119], [197], [222], [223], [224], [225], [226], [227], [228]). Compared to between-subject designs, this approach enables comparisons within participants and allows the collection of more data per participant - this may, on the other hand, lead to habituation and fatigue effects [194]. Finally, mixed-model factorial designs (e.g. [229], [230]) combine the aforementioned designs by running a within-subject study (e.g. regarding interaction experience [229]) with each member of groups in a between-subjects study (e.g. regarding robot anthropomorphism [229]).
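As an example of the a priori power analysis mentioned above, the following sketch estimates the required group size for a between-subjects comparison of two conditions using statsmodels; the effect size, significance level, and power are placeholder choices.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical planning values: medium effect (Cohen's d = 0.5),
# 5% significance level, 80% desired power, two-sided independent t-test
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative="two-sided")
print(f"required participants per group: {n_per_group:.0f}")  # roughly 64
```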
Independently of the assignment of participants to conditions, user studies can further be distinguished regarding the location where they are carried out: Experiments in the field with demonstrators of productive use-cases are still rather rare, especially when it comes to true task sharing rather than mere human-robot coexistence [205], [231], [232]. Accordingly, studies in laboratory settings are predominant (e.g. [119], [197], [217], [218], [220], [221], [222], [223], [225], [228], [229], [230]), albeit partly seeking to replicate realistic industrial settings rather than relying on synthetic tasks (e.g. [219], [224], [233]). Laboratory studies can be conducted with physical or virtual prototypes. In the latter case, different parts of the robot system under test can be virtualized utilizing simulation: Virtual hardware can be accessed by participants to share tasks with robots using virtual reality techniques [197] (cf. Section IV-C). Likewise, the existence of software components can be simulated by substituting autonomous teaming capabilities with a researcher teleoperating an embodied robot in Wizard-of-Oz studies [220]. Finally, the potential for large, representative samples through crowdsourcing has motivated using the internet as a study context, e.g. by presenting videos [234] or by enabling the aforementioned interaction in virtual spaces [235], [236].

C. VIRTUAL COMMISSIONING
The term virtual commissioning has traditionally been used for procedures during which physical hardware is connected to a real-time-capable simulation system, e.g. for programming and testing Programmable Logic Controllers decoupled from the physical production system [237]. Lechler et al. have only recently pointed out that this technique also offers a high potential for applications beyond this early use-case, particularly for HRI [238] - indeed, simulations of robot as well as human behaviour have been part of several HRI experiments in recent years. This section frames corresponding evaluation strategies in the context of virtual commissioning terminology. As we are focussing on settings in which humans and robots actively share a joint task (Section II), approaches where humans passively observe simulated robot behaviour (e.g. [227], [234]) will not be considered in depth. Following the nomenclature of Erős et al., we distinguish between immersed and virtual human-in-the-loop virtual commissioning [239].
1) IMMERSED HUMAN-IN-THE-LOOP VIRTUAL COMMISSIONING
Human interaction influences the virtual environment, and the control and planning algorithms under test produce corresponding robot reactions - just as in human-subject studies with physical robot prototypes. Accordingly, the design dimensions outlined in Section IV-B equally apply to user studies conducted by Immersed Human-in-the-Loop Virtual Commissioning. Corresponding studies mostly seek to gather quantitative data (e.g. [53], [115], [196], [197], [198], [240], [241], [242], [243], [247]), but virtual reality also enables insight-driven research [248]. A major difference is that human safety during experiments is not an issue in virtual environments as opposed to physical interaction in laboratory setups. Consequently, we found that (possibly dangerous) variations of robot trajectories in terms of speed or unexpected, jerky movements are a frequently manipulated variable [115], [196], [197], [198], [242], either directly or indirectly when testing different cobot safety mechanisms [243]. Other variables are, e.g., the use of different communication channels [240], [247], robot morphology [197], or unforeseen events in the surroundings of the shared task [246]. Dependent measures cover the full spectrum of metrics outlined in Section III: Productivity in terms of completion times, errors, and fluency has been observed [115], [196], [198], [240], [243], as well as the flexibility of humans regarding changes to the robot working speed [198]. In terms of job quality, physical ergonomics scores have been calculated from skeletal tracking data [241], and cognitive workload related to stress has been evaluated with corresponding questionnaires (e.g. [115], [196], [247]). Lastly, the safety of planned robot motions can be tested against true human motion data (e.g. [201], [240], [243]). Constructs related to psychological safety, including anxiety, trust, and behavioural patterns (e.g. leaning back or stepping away from the robot), have also been observed in VR-based experiments [53], [196], [197], [198], [242]. All aforementioned studies have been conducted in laboratories. However, VR technology is also a suitable medium for crowdsourcing and remote participation in HRI experiments via the internet, i.e. without having to invite subjects to the laboratory [235].

2) VIRTUAL HUMAN-IN-THE-LOOP VIRTUAL COMMISSIONING
In contrast to Immersed Human-in-the-Loop virtual commissioning, Virtual Human-in-the-Loop Virtual Commissioning relies not only on virtual robot hardware but also on simulated human behaviour - human subjects are thus not part of corresponding experiments [239], but the HRI is fully simulated. Related approaches can be classified according to the expressiveness of underlying digital human models (DHMs) used to replicate human behaviour: Most DHMs emphasize physical aspects of human action [249], e.g. regarding motion times and ergonomics. Accordingly, systems for simulating manual work (e.g. ema work designer [250]; IPS IMMA) have been used to analyse cycle times and ergonomic metrics of collaborative workflows [4], [203], [251]. In contrast to these 3D simulations with realistically animated manikins and robots, 2D simulations of workers and mobile robots on the shop floor [252] or of hand motions above a working surface plane [242], [253], [254] have been proposed to estimate task times. In addition to productivity- and ergonomics-related metrics, simulations are particularly well-suited to investigate safety mechanisms without actually endangering human subjects. To this end, expected contact forces can be estimated with biomechanical collision models (e.g. according to ISO/TS 15066 [36]) to design workstations that are prepared to pass the risk assessment afterwards [203], [255]. Particularly approaches based on commercial simulation tools are designed to iteratively draft and test candidate HRC workflows by precisely entering work items for all involved agents. In contrast, Antakli et al. [202] have proposed a simulation architecture for more interactive testing with coupled agent behaviours: Production planners can here create different situations by manipulating objects and agent states at simulation runtime, hence influencing the course of actions emerging from a near-optimal optimization scheme and human motion synthesis.
Cognitive aspects of human behaviour are hard to model and have been addressed less frequently in the context of industrial HRC evaluation. There are practical models capturing individual factors and dependencies, e.g. between
• fatigue, learning, and human error rates [256]
• robot performance, task complexity, and human physical and cognitive workload [257]
• trust and human-robot teaming performance [258]
• eye gaze and hand-reaching motions [259]
• consecutive decisions when choosing from several assembly steps yet to be done [260].
These can be used as components in models of high-level human decision behaviour to simulate non-deterministic but plausible human action in industrial settings [10], [261].
Recent models include behaviours such as inter-agent communication and leaving the workstation for a break [261], e.g. by relating transition probabilities in a Markov Decision Process to human fatigue [261], [262] or frustration [261] accumulated over time.
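As a concrete illustration of this modelling idea (not a reimplementation of the cited approaches), the following sketch couples a simulated worker's probability of leaving for a break to accumulated fatigue via a logistic function; the state set, parameters, and coupling are assumptions chosen purely for illustration.

```python
import math
import random

def p_take_break(fatigue: float, steepness: float = 4.0, threshold: float = 0.6) -> float:
    """Probability that the simulated worker leaves for a break, modelled as
    a logistic function of accumulated fatigue in [0, 1] (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-steepness * (fatigue - threshold)))

def simulate_shift(n_cycles: int = 20, fatigue_per_cycle: float = 0.05,
                   recovery: float = 0.3, seed: int = 42):
    """Roll out a simple two-action behaviour model: 'work' accumulates
    fatigue, 'break' is drawn with fatigue-dependent probability and
    partially resets it."""
    rng = random.Random(seed)
    fatigue, trace = 0.0, []
    for _ in range(n_cycles):
        if rng.random() < p_take_break(fatigue):
            fatigue = max(0.0, fatigue - recovery)
            trace.append(("break", round(fatigue, 2)))
        else:
            fatigue = min(1.0, fatigue + fatigue_per_cycle)
            trace.append(("work", round(fatigue, 2)))
    return trace

print(simulate_shift())
```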

D. ANALYTICAL MODELS
The final evaluation strategy that we survey makes use of analytical models. Such models seek to encode parametric relationships in sets of mathematical equations. They are primarily applied to study the impact of cobots on productivity metrics while varying economic parameters. Within this class, process-independent and process-oriented approaches can be distinguished. Process-independent models are designed to estimate the minimum productivity gain (e.g. in terms of assembly line throughput [263]) that must be achieved to justify the introduction of cobots [263], [264]. To this end, expected cycle time reductions resulting from teamwork are set in relation to the costs of cobot use. Beyond the initial hardware acquisition expenditure, the cost model proposed by Calvo and Gil [264] also reflects product changes during the operative lifetime, wage raises over time, and social costs such as welfare support for human workers replaced by robots. In contrast, the approach of Cohen et al. puts a stronger emphasis on cobot-based compensation of productivity losses emerging from the temporary absence of experienced workers and their replacement with less experienced ones in assembly lines [263].
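In the spirit of such process-independent models (and not the exact formulations of [263] or [264]), the following back-of-the-envelope sketch estimates the minimum per-unit cycle-time reduction needed to amortize a cobot; all cost figures, the amortization horizon, and the valuation of station time are illustrative assumptions.

```python
def required_cycle_time_reduction(cobot_invest_eur: float,
                                  yearly_operating_cost_eur: float,
                                  amortization_years: float,
                                  units_per_year: float,
                                  value_per_station_second_eur: float) -> float:
    """Minimum cycle-time reduction per unit (in seconds) so that the saved
    station time pays for the cobot within the chosen horizon."""
    total_cost = cobot_invest_eur + yearly_operating_cost_eur * amortization_years
    saving_needed_per_unit_eur = total_cost / (units_per_year * amortization_years)
    return saving_needed_per_unit_eur / value_per_station_second_eur

# Illustrative figures only (EUR); not taken from the cited cost models.
print(f"{required_cycle_time_reduction(35_000, 2_000, 3, 50_000, 0.02):.1f} s per unit")
```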
The applicability of such high-level models presumes that estimates of cycle times for manual processes, and particularly for human-robot teams, are already available. Although this question is strongly product- and process-dependent, several authors have shown how analytical process-oriented models for hybrid workstations can be shaped. On this level of system analysis, deterministic and stochastic models have been proposed. Deterministic models such as the one proposed by Faccio et al. [265] are applicable to a class of processes in which individual process steps do not depend on each other and are assumed to have equal durations. More generally, arbitrary processes with precedence relations between operations can be analysed by modelling human-robot teamwork as a multi-agent scheduling problem [266]. This model can then be solved for different HRC settings (e.g. varying numbers of humans and robots, discrepancy between human and robot working speed, percentage of process steps that the robot is capable of, etc.), yielding the optimal workflow and the time savings compared to manual work. A consequence of optimality is that this model does not capture the variance of dynamic, flexible teams across different runs of the same process. In contrast, stochastic models have been proposed to account for uncertainties in manual operations. Similar to the aforementioned model of Faccio et al., these models rely on a limited process pattern, but they assume process step durations which follow exponential distributions [54] or gamma distributions [267] – this makes it possible to calculate the expected value of the overall process duration and even to derive formulas for the probability of a product being finished within given time bounds [267], [268].
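The stochastic view can also be illustrated with a small Monte-Carlo sketch (the cited works derive closed-form expressions instead): sequential process steps are assumed to have gamma-distributed durations, and the probability of finishing within a deadline is estimated by sampling; the step parameters below are arbitrary.

```python
import random

def prob_finished_within(deadline_s: float,
                         step_params: list[tuple[float, float]],
                         n_samples: int = 100_000) -> float:
    """Estimate P(total process duration <= deadline) for sequential steps
    whose durations follow gamma distributions (shape, scale per step)."""
    hits = 0
    for _ in range(n_samples):
        total = sum(random.gammavariate(shape, scale) for shape, scale in step_params)
        hits += total <= deadline_s
    return hits / n_samples

# Three sequential steps with arbitrary gamma parameters (shape, scale)
steps = [(4.0, 2.5), (2.0, 3.0), (6.0, 1.5)]
print(f"P(done within 30 s) ~= {prob_finished_within(30.0, steps):.3f}")
```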

E. DISCUSSION
From our point of view, which is based on prior considerations on benchmarking in the computing domain [269], three major goals must be satisfied for benchmark protocols to become an established part of the scientific method:
• Ease of use: Since benchmarks are merely a tool to gather the data needed for investigating research questions, they should be designed as researcher-friendly, easy-to-use testbeds which enable cost-effective, time-efficient, and scalable experiments.
• Reproducibility: It must be feasible to repeat prior experiments to the greatest possible extent, as this renders research results transparent.
• Versatility: Benchmarking protocols should be designed to gain information flexibly regarding various constructs and relevant metrics to ensure versatility and foster high acceptance in the community.
Beyond this general, practicality-oriented view, another important aspect of experimental setups must be discussed in the particular context of humans interacting with robots:
• Participant Well-Being: During any benchmark or experimental procedure, it must be easy to ensure that human participants are safe and not harmed at any time.
Each of the strategies in Sections IV-A to IV-D has individual properties, (dis)advantages, and best practices which influence the above aspects – the corresponding dependencies are discussed in the sections below. A brief summary of the resulting classification is given in Table 3, with the achievable versatility in terms of metrics covered by individual strategies being further elaborated in Table 4.

1) EASE OF USE
Building research demonstrators usually involves significant engineering effort, which limits the ease of use: intelligent robots require complex software stacks to expose all skills necessary to collaborate with humans. When seeking to evaluate novel planning and interaction methods, it is, e.g., also necessary to implement state-of-the-art vision and manipulation algorithms. As these system components are a necessity which mostly does not directly contribute to individual research goals, the corresponding software and hardware are often closed-source, simplified, or specifically tailored to work in individual laboratories. This not only negatively impacts reproducibility but also leads to rather heavy-to-use systems [14].
When conducting human-participant studies with (physical) cobot setups, challenges beyond engineering effort and unstable, error-prone prototypes further reduce the ease of use: (i) Particularly when relying on participant self-assessment with questionnaires, various influences on human behaviour with potential impacts on the validity of results must be considered (e.g. social desirability bias; novelty vs. habituation effects; side effects outside the study protocol, such as robot failure [14], [194]). Accordingly, conducting high-quality human-participant studies is challenging, and there has lately been profound criticism of a lack of methodological rigour in the field. This concerns a lack of reproducibility, conclusions drawn from overly small samples [14], a strong focus on convenience samples [210] with corresponding biases (e.g. regarding participants' age [270]), and incorrect design and statistical testing of Likert scales and associated data [271]. (ii) With field demonstrators for human-robot task sharing still being rare [231], [232], most experiments take place in laboratories [270] and are often conducted with synthetic model sets and tasks – this raises further questions regarding the trade-off between experimental control and external/ecological validity (see Section V). All in all, achieving rigour in the design, execution, and reporting of human-participant studies is a complex task. We therefore refer the reader to further literature which introduces best practices in depth [209], [210], [211], [271], [272].
Immersed-HITL Virtual Commissioning is a special case of human-participant studies with virtual rather than physical robot prototype systems. Certainly, not needing to build possibly expensive physical prototypes and not having to ensure the safety of participants, as in the laboratory operation of physical robots, are favourable aspects regarding the ease of use. Despite these advantages, we still judge Immersed-HITL VC to be a similarly complex, hard-to-use evaluation strategy as studies with physical robots: beyond the issues of rigorous studies, experiments conducted in VR raise the question of the transferability of results from the virtual to the 'real' world. Although VR experiments are already frequently used, this assumption is still debated in the literature (e.g. by Wijnen et al. [273]); transferability should thus not generally be assumed. It is widely accepted that presence, i.e. the feeling or illusion of actually being in an immersive virtual environment [274], is a key prerequisite to ensure realistic participant responses when subjects interact with a virtual robot [275]. Presence is therefore often measured and discussed as a part of VR user studies to justify the validity of results (e.g. [53], [115], [196], [198], [246], [247], [276]). A comprehensive list of available questionnaires covering the presence construct has been compiled by Schwind et al. [277]. With presence being related to the fidelity [274], [278] and validity [278] of a virtual world, these aspects should already be considered during a structured design phase for high-quality virtual environments [245].
The complexities associated with human subject handling do not arise for Virtual-HITL Virtual Commissioning with simulated humans: building and running fully virtual human-robot tasks is supported by commercial tooling (see [279] for a listing) as well as by freely available simulation platforms with physics, sensor simulation, and human motion animation capabilities (e.g. Gazebo, https://gazebosim.org/home; Webots, https://cyberbotics.com/). Once a simulation environment has been prepared, adjustments to the layout, to the human-robot task assignment etc. can be evaluated arbitrarily. It has been shown that workflow variations can also be generated automatically [10], [261], hence enabling large-scale experiments without needing to recruit human subjects and even without an experimenter's active supervision (see the sketch below). These factors increase the ease of use compared to human-participant studies with physical or virtual prototypes. However, we still consider the gathering and integration of realistic simulation environment assets (CAD models) a task which needs a certain degree of expert knowledge and experience (e.g. when modelling complex assembly processes with professional tools such as the ema Work Designer [250]). This sets a limit to the ease of use.
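As a minimal, purely illustrative sketch of such automatic workflow variation (not the cited tools' actual mechanisms), the following enumerates all subtask orderings that respect a small, hypothetical precedence relation and naively expands each ordering into human/robot assignments that could then be fed into simulation runs.

```python
from itertools import permutations, product

# Hypothetical precedence constraints: a subtask may only start once all
# of its listed predecessors are finished.
PRECEDENCE = {"place_gear": {"place_base"},
              "fasten_screws": {"place_gear"},
              "attach_cover": {"place_gear"}}
SUBTASKS = ["place_base", "place_gear", "fasten_screws", "attach_cover"]

def valid_orders(subtasks, precedence):
    """Yield every subtask ordering that respects the precedence relation."""
    for order in permutations(subtasks):
        done, ok = set(), True
        for task in order:
            if not precedence.get(task, set()) <= done:
                ok = False
                break
            done.add(task)
        if ok:
            yield order

def with_agent_assignments(order, agents=("human", "robot")):
    """Expand one ordering into all (naive) agent-to-subtask assignments."""
    for assignment in product(agents, repeat=len(order)):
        yield list(zip(order, assignment))

variants = [v for order in valid_orders(SUBTASKS, PRECEDENCE)
            for v in with_agent_assignments(order)]
print(f"{len(variants)} workflow variants to simulate")
```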
Lastly, we consider Analytical Models, the most experimenter-friendly tooling in this survey. Such models can be compiled with 'pen-and-paper' work, without realizing complex robot software stacks. They provide reliable and verifiable results without effortful system implementations or demanding user studies. Different scenarios can easily be evaluated within the bounds of the aspects considered in the model by determining suitable input parameter values and solving for the output metrics.

2) REPRODUCIBILITY
The aforementioned aspects related to closed-source, simplified, and specifically tailored hardware and software of laboratory research demonstrators render the reproduction of most robotics experiments hardly feasible [280]; they are seen as a major source of the so-called 'replication crisis' in HRI [14]. We consequently consider the overall reproducibility of research demonstrators as low at present (Table 3). These issues directly propagate to human-participant studies, as these are usually based on prototype cobot implementations. Yet there is even more to reproducible human-participant studies than the reproduction of the mere experimental platform (hardware and software): it is moreover necessary to make the experimental setup (e.g. the benchmark task, cf. Section V) as well as the experimental procedure (study design, questionnaires etc.) available [281]. These issues are addressed by initiatives to foster publications with extended, detailed information on the hardware and software implementation used: so-called 'R-Articles' must be accompanied by mandatory, in-depth system descriptions, code, and further data relevant for reproducing experimental setups [282]. This step towards more transparent experiments is supported by online platforms to publish the required data (e.g. CodeOcean, IEEE DataPort). From a technical point of view, the situation may be improved in the future by unified architectures for experimental cobot systems [283] and by the use of containerization techniques [284], [285]. Yet, to the best of our knowledge, these approaches have not yet been applied to complex HRI user studies – we hence classify the reproducibility of human-participant studies as low.
When transitioning from human-participant studies with physical prototypes to Immersed-HITL VC, the experimental platform is reduced to software to be built for and run on commercially available standard VR hardware. Beyond this reduction of required hardware, questionnaires (see [286] for design guidelines of in-VR-questionnaires) and the overall experimental procedure can be embedded into the code base (e.g. by means of 'mini games' [287]). In our ranking, the reproducibility that such integrated VR experiments could offer is only superseded by fully simulated Virtual-HITL VC and analytical models. Structured containerization of simulation components has been proposed [288], and closed-form analytical performance models may not even need additional materials for reproduction. In both cases, it may not only be possible to easily reproduce the experiment but even to precisely replicate and verify prior results.

3) VERSATILITY
The experimental versatility enabled by building a research demonstrator is limited to proving the technical feasibility of a novel approach. Demonstrators can certainly be used to measure technical aspects of individual system components (e.g. object classification accuracy, positional accuracy, etc.), but to acquire any further metrics, laboratory prototypes must be embedded into human-participant studies measuring aspects related to HRI. The user studies surveyed in Section IV-B report metrics across all relevant categories outlined (Table 4). The versatility of human-participant studies is thus very high, and the metrics to be raised are only bounded by two limiting factors: (i) the physical integrity of participants is a requirement during human-participant studies, and physical safety metrics can thus not be evaluated; (ii) guiding participants through a laboratory HRI user study usually needs time to introduce the topic, to perform tasks together with a cobot, and to query feedback with questionnaires. This expenditure of time is limited by the capacities of experimenters and participants. In consequence, the number of different workflows and scenarios that can be tested is equally limited, which restricts the ability of human-participant studies to assess productivity metrics under the influence of team flexibility.
These limitations can be overcome with Virtual Commissioning methods. When applying Immersed-HITL VC, user studies can measure physical safety metrics (e.g. the separation distance) without taking the risk of harmful human-robot collisions. VR moreover enables crowdsourcing, which can help to increase sample sizes and reduce the required time by moving user studies to participants' homes [235]. Virtual-HITL VC enables experiments ranging from tests of single, fixed workflows [203], [251] to large-scale evaluations of automatically varied human-robot co-working processes [10], [261] and structured situation coverage [289]; in particular, not building experiments around human subjects as a limited resource means that team flexibility and the effects of dynamic task sharing can be covered fully. Yet modelling plausible human behaviour which reflects the indeterminism of human actions is still an open challenge [290]. As a consequence, we see the simulation-based evaluation of qualitative human factors (cognitive ergonomics, psychological safety etc.) as still in its infancy. Due to the high importance of human factors metrics, we assign a medium versatility to this approach despite the broad range of scenarios and metrics covered. Still, Virtual-HITL VC is already a versatile tool, which should especially be considered during early preparatory steps of enabling-technologies research prior to future detailed human subject experiments [291].
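As an example of a physical safety metric that becomes measurable in virtual or simulated settings, the following sketch computes the minimum separation distance from synchronously sampled human and robot positions; representing both agents as single points is of course a simplification, and the trajectories are made-up illustrative values.

```python
import math

def min_separation(human_traj, robot_traj):
    """Minimum Euclidean distance between synchronously sampled human and
    robot positions (each a sequence of (x, y, z) tuples in metres)."""
    return min(math.dist(h, r) for h, r in zip(human_traj, robot_traj))

# Toy trajectories sampled at the same rate (illustrative values only)
human = [(0.90, 0.00, 1.10), (0.70, 0.10, 1.10), (0.55, 0.20, 1.00)]
robot = [(0.20, 0.00, 1.00), (0.30, 0.10, 1.00), (0.35, 0.20, 1.00)]
print(f"min separation: {min_separation(human, robot):.2f} m")
```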
Compared to the above strategies, Analytical Models are less versatile. To our knowledge, existing models target the estimation of productivity gains as the key output metric. Aspects related to team flexibility (e.g. absenteeism of workers [263], number of humans/robots in the team [253], [266]) or to task flexibility (e.g. switching costs for programming and exchanging hardware components [264], structure of the task [254]) are not evaluated but used as input parameters or constants tailored to specific use-cases. The high-level view on whole processes or abstract process steps enables complex economic considerations on the long-term implications of human-robot teaming [264]; however, concrete system details cannot be modelled on this high level of abstraction, hence preventing detailed evaluations of human factors or safety.

4) PARTICIPANT WELL-BEING
The issue of participant well-being emerges for the two human-subject-based strategies. It is important here to guarantee that human subjects are not harmed, neither psychologically nor physically, for ethical and legal reasons. When working with hardware prototypes, physical integrity can theoretically be retained by safety mechanisms in line with relevant norms and standards on occupational safety (see e.g. [60] for a comprehensive overview). Yet compliance with the still rigid risk assessment procedures is hard to realize for flexible robot systems which plan and behave dynamically [292]. This leaves us with the strategy of an experimenter carefully overseeing the situation and operating the dead man's button [228]. When combined with lightweight, intrinsically less harmful robots, as practised in recent user studies (e.g. [207], [218], [219], [222], [223], [225], [229]), this strategy is an acceptable solution. Simulator sickness is a problem related to Immersed Human-in-the-Loop VC for which no similarly common solution exists. Despite the technical improvements to VR hardware in recent years, this phenomenon still frequently affects humans during and even after using VR hardware [293]. The influence of technical and temporal aspects [293] as well as of the content displayed in VR [294] should therefore be considered when designing the virtual HRI environment. Established tools such as the Simulator Sickness Questionnaire (SSQ) [295] can be used to validate one's setup with regard to this aspect. Additionally, a debriefing phase with each participant can help to identify unforeseen harms and offer assistance if needed [210]. Overall, any experimental procedure (virtual or with physical robots) should be reviewed by institutional ethics committees or review boards (see e.g. [211]). This is not only a necessary prerequisite for publishing results with some venues, but also ensures a holistic view of potential dangers that researchers accustomed to robots might overlook.

V. BENCHMARK TASKS
With metrics (Section III) and experimental strategies (Section IV), we have so far covered two main components of benchmarks for human-cobot teamwork. The definition of the actual task is the last component. More formally, benchmark problems are needed in terms of object model sets and associated actions that are "designed or used to establish a point of comparison for the performance or effectiveness of something" [296]. Accordingly, several model sets have already been proposed in the general field of robotics research. They range from household settings [16], [17], [297] to industrial bin-picking [298], [299] and assembly scenarios [15], [300], [301]. The corresponding tasks are mainly designed to challenge robot grasping and manipulation skills based on robot performance metrics (e.g. time to completion and success rates [15]). Raising such metrics is also important in the field of HRI (Section III-A). Yet the scalability of manipulation complexity alone is not sufficient to evoke human- and team-related effects with an impact on job quality, safety etc. By contrast, appropriate reference tasks for collaborative scenarios require scalability regarding (i) individual agent capabilities and contributions to the task, (ii) the complexity of the required interaction and coordination, also in terms of communication, and (iii) applicability to the different teaming modes still under investigation (e.g. predefined task allocation, negotiation, implicit mutual adaptation) [302]. There have been a few attempts to define reference collaboration tasks and model sets with these requirements in mind (Section V-A). Beyond these works, we have comprehensively surveyed the individual tasks used in previous experiments to provide further inspiration (Section V-B).

A. DEDICATED MODEL SETS AND REFERENCE TASKS
The number of publications with an explicit focus on dedicated model sets and reference tasks for human-robot task-sharing benchmarks is rather limited. Zeylikman et al. have proposed a modular model set including plywood panels, dowels, and freely available 3D-printed connectors [302]. From these components, pieces of furniture of varying complexity can be assembled. Besides this scalability of task complexity and duration, a mixture of actions feasible for both human and robot agents (e.g. bringing and holding certain parts) and actions which exclusively require human dexterity (e.g. screwing) enables scaling the interaction in terms of role assignment. Our prior work [25] has a similar focus on easy-to-reproduce, task-centred interaction. The model set spans a broader range of domains (simple building blocks, abstracted electrical circuitry, gear meshing). All parts are designed for robust manipulation by humans and particularly by robots using specific shapes and adhesive forces. This way, confounds resulting from robot failure during user studies are actively prevented. In contrast to these assembly-inspired model sets [25], [302], the task described by Sarthou et al. is based on the 'Director Task' known from psychology studies and hence emphasizes cognitive and behavioural aspects more strongly [303]. The involved agents face each other with a shelf in between, from which cube-shaped objects have to be picked. This approach offers less scalability regarding task-centred complexity (e.g. coordination due to assembly precedence relations [25]), but it intrinsically challenges referential communication, perspective taking etc.

B. TASKS USED IN JOINT ACTION EXPERIMENTS
In addition to the initial attempts towards establishing reference tasks as common ground for comparable, repeatable experiments (Section V-A), a broad range of tasks has been used in HRI experiments. In Table 5, we have clustered by domain those tasks which fall within the scope of this survey, i.e. which target scenarios in which subtasks are allocated to different agents in line with our definitions in Section I: tasks during which agents have to pick-and-place several objects from start to goal locations are frequently used in experiments with physical robots and in virtual reality. The parts to manipulate range from primitively shaped, often distinctly coloured objects (e.g. [25], [125], [223], [303]) to everyday objects (e.g. apples [304], USB keys [9]) and paper cut-outs [217]. Tasks composed of pick-and-place subtasks are often intended to represent packaging jobs as an underlying use-case [9], [10], [217], [304]. They may also incorporate parts stacking as a strongly abstracted form of assembly (e.g. [10], [25], [115]).
Another widespread category of tasks we identified is toy assemblies. Building 2D (e.g. [7], [226], [305]) or 3D structures (e.g. [90]) from interlocking plastic bricks is similar to pick-and-place tasks but requires a slightly higher level of manipulation skill, especially on the robot side. Other construction toy sets (e.g. 'Baufix' with wooden screws, nuts, and bolts [174], [306]) bring experiments closer to the next group, which we have named product mock-up assemblies. This category covers artificial products intended to simulate realistic assembly processes. In contrast to the real process, products are here built from specially constructed, often strongly simplified parts. Several HRI experiments have applied human-robot teamwork to scaled pieces of furniture [11], [302], [307], [308], [309]. Other exemplary tasks are gear meshing and gearboxes [72], [219], electrical circuitry [310], a jet engine mock-up [315], a bicycle subassembly [314], a sanding machine [313], and flange assembly from metal parts and standard screws [12]. In addition to these tasks created by systematically abstracting from real products and production processes, fully synthetic tasks such as the Cranfield Benchmark [312], Bourjault's Pen [311], and simple 3-component products [222] have been used.
Compared to product mock-up assemblies, our final category of realistic product assemblies summarizes tasks which involve joint work with real parts taken from industrial processes. Within this scope, we found a group of experimental setups in which humans and robots needed to coordinate while placing and fastening screws [65], [198], [221], [316]. Aside from this cluster, use-cases are highly individual. They include the assembly of desktop PCs [8], candy tins [224], emergency buttons [318], a filament winding head [241], car engine sub-assemblies [233], [317], pin-back buttons [240], USB adapters [320], or carbon fibre shells [243].

C. DISCUSSION
When investigating aspects of human-robot task sharing, the model set and tasks used are strongly linked with the chosen experimental strategy (Section IV). We will therefore discuss the matter of benchmark tasks in the context of the same goals as defined for experimental strategies (ease of use, reproducibility, and versatility; see Section IV-E).
Ease of Use: Research prototypes in the HRI field are often based on simplifications to ease robot system implementation (see subsection 'Realistic Product Assembly' in Table 5). Robot vision, for instance, is often supported by attaching fiducial markers to parts.
Certainly, all the aforementioned simplifications mean stepping away from realistic tasks and cobot use-cases: realism and fidelity are traded for robustness and, hence, increased experimental control. This abstraction step is an important feature, particularly in the context of user studies with physical prototypes: unintended robot failure due to unstable robot vision and manipulation would constitute a confound. Such confounds reduce internal validity and can even render samples invalid. However, experiments with abstract model sets and tasks raise the question of the validity and transferability of results to the real world. Regarding this question, we can draw on insights from experimenting with synthetic task environments in the field of human factors research: synthetic tasks inherit the relevant functional relationships from real-world tasks [321], but they have reduced physical fidelity regarding the equipment, environment etc. Still, investigations related to high-level human skills (e.g. teamwork, dynamic problem-solving) can be valid as long as they are conducted with high psychological fidelity regarding functional, cognitive, and construct-related aspects of the tasks under consideration [322]. To this end, synthetic tasks should emerge from a systematic abstraction and validation process based on identifying the research objectives and the concrete field of practice [323].
Reproducibility: Highly realistic research demonstrators are important to prove the value of human-robot teaming in practice. However, highly realistic assembly station environments (e.g. [319], [320]) are hard to replicate due to high costs and a lack of information and plans in the corresponding publications. Even if experiments are not conducted in fully developed assembly cells, replication can be prevented if the parts involved in joint tasks are taken from industrial processes (e.g. [233], [317]) and, hence, are not broadly available. As discussed in Section IV-E, these problems can be reduced by shifting experiments into virtual spaces. In the case of user studies with physical demonstrators, synthetic tasks offer further beneficial properties with regard to reproducibility: product mock-ups can easily be made with 3D printers [25], [72], [219], [302], [311], [312], [315], which have meanwhile become a broadly available resource. Unfortunately, prior publications with interesting model sets are not always accompanied by the required CAD models (e.g. [72], [219], [315]). The replication of experimental setups could hence be strengthened by using specifically designed, online-available model sets [25], [302] (Section V-A), or by referring to online repositories with free CAD models, as e.g. done by Cramer et al. [311]. As an alternative to 3D printing, parts such as standard aluminium profiles [313], [314] or toy sets [7], [90], [226], [305], [306] are commercially available and easy to acquire. When relying on such parts, it is important from a reproducibility point of view to provide precise measures of screws and other metal components (as e.g. in [12]), to report the concrete workspace layout (e.g. in [7], [217], [304], [305]), and to specify the patterns or structures to assemble (e.g. [7], [90], [223], [226], [305], [306]) with corresponding reference graphics or photographs.
Versatility: Lastly, we discuss the properties which tasks should possess to be suited as versatile benchmark problems for different HRI experiments. Versatility is achieved if a benchmark problem has variables which can be manipulated to create individual tasks with differently scaled characteristics, ideally related to the various constructs and metrics under investigation. We extracted the following important dimensions of scalability from the tasks considered in our analysis (Table 5): It is important to be able to scale the amount of work of a task. This can easily be achieved in the pick-and-place and toy assembly domains, where additional parts and associated subtasks can be added as needed. This also opens the possibility of arranging parts in the workspace and creating settings that encourage parallel working or provoke conflicts during reaching motions in spatially narrow situations [10], [223]. Similarly, product mock-ups with a higher degree of abstraction and a design specifically focused on scalability enable various options for configuring tasks (e.g. [25], [302]). Other mock-ups (e.g. [72], [219]), as well as realistic demonstrators and case studies with high fidelity (e.g. [319], [320]), replicate exactly one product and can thus hardly be scaled regarding the necessary subtasks. A necessity to employ individual agents' capabilities can be created implicitly by adding particularly heavy parts (e.g. [9], [233], [319]). This also establishes a link to embedding physical contact between humans and robots by means of hand-guiding [233], [319]. Another way of implicitly introducing complementary capabilities is limiting robot manipulation skills to certain parts [9], [72], [233], [302], [310], [313]. Alternatively, capability-aligned contributions within an assembly process can explicitly be enforced by assigning a specific sort of process step to certain agents (e.g. one teammate placing and the other fastening screws [65], [198], [221], [316]). It is common here to restrict robots to picking/placing/handing over parts [12], [174], [240], [306], [307], [308], [315], [317], [324] and holding sub-assemblies, whereas the human partner performs dexterous manipulations [241], [243], [309], [314], [324] or operates tools [240], [317]. On the one hand, embedding this distribution of work statically into experimental tasks has two advantages: it reflects the situation 'as is', with robots still being less skilled than humans in dexterous manipulation, and it contributes to the ease of use, as picking, placing, and holding actions can easily be implemented and enable robust execution during user studies. On the other hand, fixed roles can limit the interaction to a single, fixed task allocation [176], [219], [224], [319], [320] or even mean that robots merely deliver kits for humans to assemble without further interaction during the actual assembly process (e.g. [88], [110], [229]). As opposed to these limitations, it has been shown that especially the easier-to-implement pick-and-place tasks [125], [217], [218], [223], [303], [304], toy assemblies [7], [226], [305], and abstract product mock-ups [11], [25] can also be realized in a way that enables the other extreme case of equal, symmetric capabilities. In consequence, these tasks can be used to benchmark the full range of HRC modes, from statically planned optimal schedules to fixed 'leader-follower' role distributions and, ultimately, fully dynamic mutual adaptation among equal peers.
Independently of the HRC mode under investigation, the cognitive load can be scaled by superimposing an additional cognitive task on the original assembly task (e.g. solving the Towers of Hanoi [222], product quality inspection [196]).
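To make these scaling dimensions tangible, the following hypothetical sketch shows how a configurable benchmark task could be parameterized along them; the field names and values are our own assumptions, not an existing benchmark format.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class BenchmarkTaskConfig:
    """Hypothetical parameterization of a scalable human-robot benchmark task."""
    n_parts: int = 12                                    # scales the amount of work
    precedence: dict[str, set[str]] = field(default_factory=dict)  # coordination complexity
    robot_capable_parts: set[str] = field(default_factory=set)     # complementary capabilities
    heavy_parts: set[str] = field(default_factory=set)             # motivate hand-guiding / robot use
    teaming_mode: str = "dynamic"                        # "static_schedule" | "leader_follower" | "dynamic"
    secondary_cognitive_task: str | None = None          # e.g. quality inspection to raise workload

config = BenchmarkTaskConfig(
    n_parts=8,
    precedence={"gear": {"base"}, "cover": {"gear"}},
    robot_capable_parts={"base", "gear"},
    teaming_mode="leader_follower",
    secondary_cognitive_task="quality_inspection",
)
print(config)
```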

VI. CONCLUSION
Standardized benchmarks are an important foundation of reproducible research and comparable performance evaluations. From our point of view, such benchmarks are yet to emerge in the field of collaborative human-robot task sharing. Towards this target, our survey seeks to give an overview of aspects related to HRI experiments. Compared to prior literature reviews with a focus on HRI metrics, we provide a broader overview of the field: we have surveyed metrics more specifically in the context of the currently most frequent use-case of human-robot collaboration in industrial settings by investigating their suitability to measure productivity, flexibility, job quality, and safety (Section III). Evaluation strategies to raise these metrics, particularly when dynamic teaming approaches and mutual adaptation would require large numbers of test runs or subjects, are discussed in Section IV (e.g. human-participant studies, variations of virtual commissioning). Lastly, we have gathered dedicated object model sets and tasks previously used in HRI experiments in Section V. We hope this comprehensive overview will serve the community as a starting point and inspiration for the future design of scalable benchmark problems, protocols, and evaluation procedures.