Usability Assessments for Augmented Reality Head-Mounted Displays in Open Surgery and Interventional Procedures: A Systematic Review

Augmented reality (AR) head-mounted displays (HMDs) are an increasingly popular technology. For surgical applications, the use of AR HMDs to display medical images or models may reduce invasiveness and improve task performance by enhancing understanding of the underlying anatomy. This technology may be particularly beneficial in open surgeries and interventional procedures for which the use of endoscopes, microscopes, or other visualization tools is insufficient or infeasible. While the capabilities of AR HMDs are promising, their usability for surgery is not well-defined. This review identifies current trends in the literature, including device types, surgical specialties, and reporting of user demographics, and provides a description of usability assessments of AR HMDs for open surgeries and interventional procedures. Assessments applied to other extended reality technologies are included to identify additional usability assessments for consideration when assessing AR HMDs. The PubMed, Web of Science, and EMBASE databases were searched through September 2022 for relevant articles that described user studies. User assessments most often addressed task performance. However, objective measurements of cognitive, visual, and physical loads, known to affect task performance and the occurrence of adverse events, were limited. There was also incomplete reporting of user demographics. This review reveals knowledge and methodology gaps for usability of AR HMDs and demonstrates the potential impact of future usability research.


Introduction
Augmented reality (AR) describes the display of virtual objects integrated with the real environment. AR has been applied across various fields, including medicine, with surgical applications among the most common medical applications [1]. While interest in AR for surgical applications began as early as the 1980s [2][3][4], the development of AR systems for surgical planning and procedures has increased in recent years, driven by technical advances and proliferation of commercially available head-mounted displays (HMDs) [5,6].
HMDs provide a hands-free view of virtual 2D and 3D text and images anchored to the physician's field of view, to a specific location in the surgical suite, to objects in the environment, or to anatomical landmarks in or near the surgical site, with minimal visual occlusion of the surroundings. With these capabilities, the use of AR HMDs for surgery is purported to enhance diagnosis and surgical planning through interaction with virtual, patient-specific models [7]; facilitate intraoperative navigation by displaying medical images on or near the surgical site [8][9][10]; improve ergonomics by displaying one or more datasets in easily viewable locations [10][11][12]; and increase attention to patient vital signs and alarms compared to conventional visual displays [13,14].
While the purported benefits of AR HMD use could be widely applicable across surgical approaches, their use in open surgery and interventional procedures demands special attention. In particular, the use of AR HMDs presents an opportunity to enhance the visualization available for procedures that are not amenable to the use of endoscopes, microscopes, or other advanced visual aids. Further, maintaining an egocentric view of the surgical scene is a key feature of the technology, and it is not yet fully exploited in scope-mediated procedures. As such, the usability of AR HMDs for open surgery and interventional procedures is the focus of this work. Given the distinct visual, cognitive, and physical demands of scope-mediated procedures (e.g., dependence on a camera for visualization of the patient's anatomy, flattening of 3D anatomy visualization to a 2D video stream, and complex transformation of hand movement to tool movement), the application of AR technologies for these procedures is not reviewed here, although it has been studied elsewhere [15][16][17][18].
Incorporating AR HMDs into conventional surgical and interventional applications is a non-trivial pursuit, requiring careful studies of physical, perceptual, and cognitive considerations that accompany the use of AR HMDs [19,20]. The potential use of AR HMDs in such high-stress and high-stakes applications as surgery necessitates a thorough understanding of their usability (i.e., the users, use cases, and user-device interactions) to ensure safety and effectiveness. To lay groundwork for future usability research and device evaluations for AR HMDs in open surgery and interventional procedures, a comprehensive description of published usability assessments and reporting for AR HMDs and related technologies is presented. As the use and evaluation of AR HMD technologies is still relatively new and may be limited, the authors included usability assessments for devices that share a subset of technical aspects or device use characteristics with AR HMDs. This broadened search included devices that display virtual information (i.e., anatomical models, markings, or annotations) beyond the standard use of monitors to display 2D medical information; devices that display any information superimposed onto the user's field of view, whether in the main display area or an inset display; devices that project content onto the patient or the environment; and devices that utilize the principles of stereoscopy to create the illusion of depth in 2D content. The included devices expand beyond the spectrum of partial to fully virtual visualizations known as extended reality (XR); they will be referred to here as XR+. The inclusion of devices that are related to AR HMDs was intended to capture relevant usability assessments that are applicable for the evaluation of AR HMDs but might not have been applied to the assessment of AR HMDs yet.
Several published reviews have addressed the usability of AR HMDs in surgical applications. A review published in 2004 showed the use of AR in a variety of surgical specialties, with an emphasis on HMDs [as well as heads-up displays (HUDs)] as emerging technology [21]. The most prominent medical specialties in which AR had been applied at that time were neurosurgery, otolaryngology, and maxillofacial surgery. The author acknowledged the limitations associated with nascent AR systems, recognizing that they were a relevant tool only for surgical procedures requiring low-performance surgical dexterity. A literature review of HMDs (as well as HUDs) in surgical applications through 2017 enumerated the specific display devices used in a variety of medical specialties, how they were used, and the medical specialties for which they were used [22]. Special attention was given to use with live human patients. The most common applications of AR in surgery included visualization of surgical microscopes, navigation, monitoring of vital signs, and display of preoperative images. While this review described the broad scope of HMD devices and use cases, an equally thorough treatment of usability assessments has not been presented and would be beneficial for the design of future user studies. AR usability has been reviewed broadly across diverse applications, including medicine, education, navigation and driving, and tourism and exploration [1]. In this article, which included articles from 2005 to 2014, medicine was the third largest category of published literature, behind perception and interaction. The most common uses of AR in medicine at that time were laparoscopy, exposure therapy for phobia treatment, and physical rehabilitation. Common usability assessments included subjective ratings, error/accuracy measures, and task completion time. 
In the years since the publications included in this review, there has been a continued surge in AR applications and user studies in medicine, which may have been influenced by technological improvements and the availability of consumer AR systems. As a result, it is now possible to identify trends and analyze gaps in user study designs and assessments.
There have also been review articles specific to the use of AR and other XR devices in individual medical specialties, such as neurosurgery [23-26], orthopedic surgery [27] and plastic surgery [28], and for various laparoscopy and robotic surgery applications [15][16][17][18][29]. A recent review of optical see-through head-mounted displays in augmented-reality-assisted surgery noted the important role of human factors for the devices' utility [30].
In this review, our objective was to identify trends in the literature related to the use of AR HMDs and other XR+ devices in open surgery and interventional procedures and the study of their usability. Specifically, we assessed (1) the growth of AR HMD and XR+ applications and the types of devices used for these applications, (2) their relative use in various medical specialties and in pre- vs. intraoperative applications, (3) the reporting of user demographics, and (4) methods for assessing the usability of AR HMDs and other XR+ devices. Usability methods were analyzed to determine which aspects of usability have been addressed, how they have been addressed, and which additional assessments may be informative in future assessments. User studies for surgical planning and procedures were considered regardless of participants, medical specialty, publication year, and number of citations. However, for the reasons stated above, scope-mediated approaches were excluded. Explicit comparisons were not made.

Materials and Methods
A systematic literature review was conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) method (Figure 1). The PubMed, PubMed Central, EMBASE, and Web of Science databases were searched to find articles pertinent to AR and other XR+ devices in surgical applications. HUDs were referenced in the search terms as these displays are not always reported as XR in the literature and might otherwise be under-represented in our results. The following terms and Boolean logic, modified from the work of Dünser, Grasset [31] and Dey, Billinghurst [1], were applied to the titles, abstracts, and keywords of all available records through September 2022: ("Augmented reality" OR "Mixed reality" OR "Virtual reality" OR "Augmented virtuality" OR "Stereoscop*" OR "Head$ up Display" OR "3D visualization") AND (surg*) AND ("user evaluation$" OR "user stud*" OR "survey*" OR "interview*" OR "questionnaire$" OR "pilot stud*" OR "usability" OR "human factors" OR "user experience$" OR "ergonomic*" OR ("participant$" AND "study") OR ("participant$" AND "studies") OR ("participant$" AND "experiment$") OR ("subject$" AND "study") OR ("subject$" AND "studies") OR ("subject$" AND "experiment$")).
The initial search found 4525 records. After the removal of duplicates, 2748 records remained. Another 2533 records were excluded because they were not in English (54 records), did not describe the use of XR+ for surgery or interventional procedures (318), did not provide a complete description of an original research study (691), only focused on surgical training or education (1211), did not include augmentation or replacement of visual information (31), did not include a user study (28), or focused on a scope-mediated procedure (200).
For the remaining 215 records, their respective articles were subject to an in-depth, full-text analysis performed by five of the authors (E.J.B., K.F., A.S.K., K.L.K., and H.L.B.). To start, 10 articles were analyzed by each author, and the results were compared and discussed to arrive at a consistent coding approach. Coding involved extraction of data pertaining to the display hardware used, XR+ visualization type, displayed information, medical specialty, task information, user demographics, usability assessments, and dependent variables. The remaining articles were then divided among the group, coded, and later reviewed by one author (E.J.B.) for consistency. The extracted information was organized to interpret the trends in (1) device type and number of studies per year, (2) distribution of applications by medical specialty and in preoperative vs. intraoperative settings, (3) user demographics related to experience level, sex, and age, and (4) usability and related assessments. Information related to medical specialty grouping and terminology was reviewed by a surgeon (B.B.). During this phase, an additional 68 articles were excluded for the above-mentioned criteria, leaving 147 articles for qualitative data synthesis (Supplemental Table S1). Fifty-three articles described assessments of AR HMD devices.
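The record counts reported above can be checked with simple arithmetic. The following Python sketch is illustrative only and is not part of the review methodology; the function name `prisma_flow` and the dictionary keys are hypothetical labels chosen here, while every count is taken from the text.

```python
# Record counts as reported in the text above.
IDENTIFIED = 4525
AFTER_DEDUPLICATION = 2748

# Exclusions at title/abstract screening, keyed by the reason given in the text.
SCREENING_EXCLUSIONS = {
    "not in English": 54,
    "no XR+ use for surgery or interventional procedures": 318,
    "no complete description of an original research study": 691,
    "surgical training or education only": 1211,
    "no augmentation or replacement of visual information": 31,
    "no user study": 28,
    "scope-mediated procedure": 200,
}

# Exclusions during full-text analysis.
FULL_TEXT_EXCLUSIONS = 68


def prisma_flow():
    """Return the number of records removed or remaining at each stage."""
    excluded_at_screening = sum(SCREENING_EXCLUSIONS.values())
    full_text_assessed = AFTER_DEDUPLICATION - excluded_at_screening
    return {
        "duplicates_removed": IDENTIFIED - AFTER_DEDUPLICATION,
        "excluded_at_screening": excluded_at_screening,
        "full_text_assessed": full_text_assessed,
        "included": full_text_assessed - FULL_TEXT_EXCLUSIONS,
    }


if __name__ == "__main__":
    for stage, count in prisma_flow().items():
        print(f"{stage}: {count}")
```

Under these counts, the seven screening exclusions sum to the reported 2533 records, leaving 215 for full-text analysis and 147 for synthesis.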

Device Types
The 147 articles included for data synthesis were published from 1995 to September 2022 (Figure 2). Fifty-three articles described the use of an AR HMD, the first of which was published in 2004. Several types of hardware have been used for XR+ visualization. Figure 3 shows the number and temporal distribution of publications for AR HMDs and the additional XR+ devices. Since 2017, AR HMDs have been the most prominent XR+ device type. Articles about AR HMDs and other XR+ devices increased until 2020; it is possible that the reduced article count in 2021 could be attributed to the COVID-19 pandemic, which may have reduced both the ability to conduct in-person user studies and hospital research capacity. Because the end date for the search was September 2022, counts for the year 2022 do not represent a full year of articles.

Surgical Applications
XR+ surgical applications spanned twelve specialties, nine of which included the use of AR HMDs. The most featured specialties were orthopedic and spinal surgery (40 XR+ articles, 16 for AR HMDs), neurosurgery and interventional neuroradiology (26 and 8),
[...] percent of all analyzed articles (85) and 79% of AR HMD articles (42) were used to perform a real or simulated surgical procedure. XR+ devices were used for planning alone in 49 articles (8 with AR HMDs), although some of these applications involved rehearsal for a real surgery using a patient-specific simulation [89,91,154]. Thirteen articles describe surgical planning and use of this planning technology during the procedure, with three of these using AR HMDs. The distribution of all pre- and intraoperative applications over time (Figure 4a) suggests an emphasis on XR+ use during procedures, particularly for AR HMD use (Figure 4b).

User Demographics
One hundred and nine (109) XR+ articles (83%) specified the total number of subjects in the user studies. The mean number of subjects was 13.6, with a median of 10 and a range from 1 to 77 subjects. Only 24 articles (18%) specified the number of female users, precluding further analysis. For AR HMDs, 35 articles (80%) specified the total number of subjects. The mean and median sample sizes were 12.9 and 10, with a range of 1 to 62. Eleven (11) articles reported the number of female users (25%).
Across all XR+ articles and the subset of AR HMD articles, user demographic data based on surgical experience revealed a lack of complete reporting, particularly for studies that included expert users (Table 2). For this review, novices were defined as subjects with no reported medical education, trainees were subjects currently engaged in medical training (e.g., medical students, residents, and fellows), and experts were subjects who were reported as attending or expert surgeons or proceduralists. One hundred and four (104) XR+ articles included expert users, but only 15 of these articles (12%) included statistics on the number of female subjects, while 14 articles (14%) included the age of the users. For articles that included novices, 41% reported the number of female users (13), and 52% reported ages of subjects (16). The reporting percentages for articles that included trainees were similar to, but higher than, those of articles including experts (35% for number of female users and 33% for age statistics). For AR HMDs, 36 articles included expert users, of which 6 articles (17%) included statistics on the number of female subjects and 9 articles (25%) included the age of the users. In the 12 articles where novice subjects were included, 5 articles reported the number of female users (42%) and 7 reported age (58%). For articles that included trainees, reporting percentages for age and sex were on par with those for novice users (57% for both).

Usability Assessments
For each article, user assessments and dependent variables from the included user studies were summarized and categorized. These results are summarized in Table 3. The included assessments covered open questions of feasibility, hardware and system characterization, and human factors relating to physiological loads on the user. Seven categories of usability assessments and two additional related categories emerged during analysis, listed in descending order of use in XR+ articles. Usability assessments were:

2. User Experience (80): Interviews, surveys, or other user-reported feedback about the usability and effectiveness of visualization type or hardware.
4. Cognition (27): Assessments of mental and attentional demands or changes in decision making.
5. Visual Effects (22): Objective assessments of visualization quality or accuracy, relative effectiveness among visual augmentations or rendering options, visual perception, or adverse physiological events related to visual perception.
7. Physical Loads (1): Quantification of muscle activity or body movement.
Additional assessments related to usability were:
8. System Performance (39): Measurements of visualization hardware and software accuracy and speed.
9. Validity/Reliability (12): Comparison of simulators to real situations, or comparisons within and across users or observers.

Task performance was most often assessed using a measurement of error from a landmark in the surgical space or from the surgical plan. Other common metrics included the number of task completions within a time frame, the rate of successful task completions across trials, performance ratings based on standardized evaluation tools or observation by experts, general feasibility of device use, or remarks regarding post-surgical patient outcomes. Interestingly, Scherl, et al. [159] used an AR HMD during a live human surgery and noted that a reversible complication that occurred in 3.7% of procedures without the device did not occur during its use. The Objective Structured Assessment of Technical Skills (OSATS), a validated global rating scale developed to assess surgical skills of trainees [174,175], was also used.

XR+ User Experience
As this review focused on articles describing usability assessments, more than half of the reviewed XR+ articles described the use of user experience (UX) questionnaires to gather feedback on the effectiveness, usefulness, or comfort of using various XR+ devices. The System Usability Scale (SUS) is a widely used questionnaire that was featured in several papers [176]. The vast majority of UX questionnaires and usability assessments captured the experiences of single users working alone. In contrast, Willaert, Aggarwal [91] captured single subject and group dynamics by using the SUS along with the Non-technical skills (NOTECHS) for surgeons rating scale [177] and the Mayo High Performance Teamwork Scale (MHPTS) [178]. Whole surgical teams evaluated virtual surgical simulations for patient-specific rehearsal. The application of these or similar assessments for communication and teamwork during multi-user AR HMD-assisted surgeries would provide valuable insight into the safety and effectiveness of AR HMD use in surgical applications. Indeed, real surgeries are almost exclusively performed in teams, and failures in teamwork are a strong predictor of surgical errors [179]. While UX and user preferences are often captured through subjective measures, Harake, Gnanappa [120] endeavored to link survey data to objective measures. To quantify user preferences between 2D and 3D presentations of echocardiograms, the investigators measured time spent viewing each image type and rotations of the images as a metric of engagement. Beyond capturing user preferences, this type of analysis could be used to understand how users interact with the visualized data and which aspects of the visualized data or user interface are of most value or draw the most attention.
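For readers unfamiliar with the SUS, its conventional scoring can be sketched in a few lines. The Python example below is a minimal illustration that assumes the standard ten-item form with responses on a 1 to 5 scale; the function name `sus_score` is a hypothetical label, and the reviewed articles may have applied modified versions of the questionnaire.

```python
def sus_score(responses):
    """Compute a System Usability Scale score on a 0-100 scale.

    `responses` holds the ten item responses in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("The SUS has exactly ten items.")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("Each response must be an integer from 1 to 5.")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded: contribution = response - 1.
            total += response - 1
        else:
            # Even-numbered items are negatively worded: contribution = 5 - response.
            total += 5 - response
    # The summed contributions (0-40) are scaled to 0-100.
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

A respondent who answers every item with the neutral midpoint (3) receives a score of 50 under this scheme.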

Completion Times
Metrics used for completion time included completion times for system setup, system calibration, co-registration of virtual information to the real environment, and pre- and intraoperative task performance. One of the unique metrics in this category was the time needed for a team to discuss and reach consensus on congenital heart disease diagnoses [121]. This analysis answers the overarching question of potential time reduction but also targets behavioral and cognitive aspects of the clinical process that might be enhanced by novel visualization technologies.

Cognition
To address issues of cognition, some articles measured cognitive load, spatial cognition, attention, and decision making. Cognitive load was assessed through use of the NASA-TLX [6,123] and a surgery-focused variant, the Surg-TLX [50]. The NASA-TLX [180,181] remains a popular assessment tool despite questions regarding the construct validity [182] and interpretability [183,184] of the results. Objective measures may prove favorable alternatives, such as metrics derived from physiological signals [e.g., cortical activity via electroencephalography (EEG) or functional near-infrared spectroscopy (fNIRS) or physiological stress via galvanic skin response (GSR) or heart rate variability (HRV)].
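As an illustration of how NASA-TLX results are commonly summarized, the sketch below computes both the unweighted ("raw") score and the weighted score. It assumes the usual six subscales rated from 0 to 100 and weights obtained from the 15 pairwise comparisons; the names `SUBSCALES`, `raw_tlx`, and `weighted_tlx` are hypothetical, and individual studies may have scored the instrument differently.

```python
SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings):
    """Unweighted NASA-TLX: the mean of the six subscale ratings (0-100)."""
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


def weighted_tlx(ratings, weights):
    """Weighted NASA-TLX.

    `weights` counts how often each subscale was chosen across the
    15 pairwise comparisons, so the weights must sum to 15.
    """
    if sum(weights[name] for name in SUBSCALES) != 15:
        raise ValueError("Pairwise-comparison weights must sum to 15.")
    return sum(ratings[name] * weights[name] for name in SUBSCALES) / 15


if __name__ == "__main__":
    ratings = dict(zip(SUBSCALES, (70, 20, 55, 30, 65, 40)))
    weights = dict(zip(SUBSCALES, (5, 1, 3, 2, 3, 1)))
    print(raw_tlx(ratings), weighted_tlx(ratings, weights))
```

The weighted form emphasizes the subscales a participant considers most relevant to the task, whereas the raw form treats all six equally.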
To assess attentional demands, Andersen, Popescu [143] recorded gaze shifts between the novel XR+ visualization and the available standard monitor. Where XR+ devices are purported to increase attention by improving the location of displayed data or providing multiple data streams in the user's field of view, analyses of gaze shift may be used to assess this claim. Another paper compared the display viewing times for subjects who received task instructions on an AR HMD or standard display terminal [153]. The results present some complexity in the assessment of attention, as subjects completed tasks faster with the monitors but spent a higher percentage of time viewing them and shifted attention from the task to the display more frequently. Furthermore, analyses of attentional focus could benefit from additional information on "blindness" or inattention to quantify the extent to which a novel visualization method is distracting or requires such excessive cognitive demand that overall situational awareness is reduced [185,186]. Spatial cognitive demand may be dependent upon the user's spatial reasoning abilities. The Mental rotation test [187], used in Pahuta, Schemitsch [64] and Sadri, Kohen [151], or a similar assessment may help determine which users will benefit most from these technologies.
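Gaze-shift and viewing-time measures of the kind described above can be derived from a labeled gaze record. The following Python sketch assumes a hypothetical input format, a time-ordered list of (timestamp in seconds, area-of-interest label) samples; it does not reproduce the analysis pipeline of any cited study.

```python
def gaze_summary(samples):
    """Summarize a time-ordered gaze record.

    `samples` is a list of (time_s, label) tuples, where `label` names the
    area of interest being viewed (e.g., "task" or "display").
    Returns (number_of_shifts, fraction_of_time_per_label).
    """
    shifts = 0
    dwell = {}
    for (t0, label0), (t1, label1) in zip(samples, samples[1:]):
        # Attribute the interval between samples to the earlier label.
        dwell[label0] = dwell.get(label0, 0.0) + (t1 - t0)
        # A change of label between consecutive samples is one gaze shift.
        if label1 != label0:
            shifts += 1
    total = sum(dwell.values())
    fractions = {k: v / total for k, v in dwell.items()} if total > 0 else {}
    return shifts, fractions


if __name__ == "__main__":
    record = [(0.0, "task"), (1.0, "task"), (2.0, "display"),
              (3.0, "task"), (4.0, "task")]
    print(gaze_summary(record))
```

Reporting both the shift count and the viewing fractions helps separate how often attention moved from how long it stayed on each target.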
Another dimension of cognition considered in some studies was the impact of visualization on surgical decision making. Kendoff, Citak [61] showed that intraoperative use of a monitor-based AR technology revealed technical errors not visible with standard visualization and prompted immediate revision in 46 of 248 joint reconstruction surgeries. Others describe changes to the surgical intervention or strategy during planning [7,86,88,92,154]. This evidence speaks to potential increases in the safety and effectiveness of surgical interventions through advanced visualization technology.

Visual Effects
Several articles evaluated visual effects, including depth perception, color rendering and perception, visibility of anatomical landmarks, text readability, and grayscale contrast perception. For example, the work of Hansen, Wieferich [145] dealt with rendering techniques to improve understanding of relative depth of blood vessels in the liver. These considerations may be particularly important across applications dealing with vasculature, nerves, or abundant overlapping structures. In Sadri, Kohen [151], the Stereo Fly Test (Stereo Optical Co., Inc., IL, USA) and the Pseudo-Isochromatic Plate (PIP) color vision test [188] were utilized as pre-tests for depth and color perception, respectively. This baseline information was useful to identify any confounding factors affecting the user's ability to understand and manipulate color-coded 3D virtual heart models displayed via AR HMD. Qian, Barthel [6] demonstrated text readability and contrast perception assessments adapted for AR HMD use. These types of analyses are of particular importance given the transparency of the display, which allows ambient lighting conditions to impact the appearance of virtual content. Contrast perception of grayscale images is also relevant for the display of medical images, such as intraoperative X-ray fluoroscopy used for the orthopedic surgery application presented in the article. Despite the significance of contrast perception and text readability for speed and accuracy of data interpretation, this article presents the only objective, quantitative analysis of these characteristics among the articles included in the present review. Other aspects of image quality and accuracy were addressed by Southworth, Silva [115], such as geometric and rendering distortion and the dynamic range of color and grayscale test patterns under darkened and bright ambient conditions. 
Additional studies are needed to assess the risks of HMD alignment with the eyes, such as errors in the adjustment of interpupillary distance or subpar placement of the HMD on the head, and consequences for visual perception. Published and anecdotal data suggest that small alignment errors can create significant errors in perceived location of virtual data [189,190].
Prolonged visual loading and certain visual stimuli can cause eye fatigue, dizziness, cyber-sickness, or seizure. Some articles mentioned the potential for these events or gathered feedback on the occurrence of symptoms, but no objective measures of physiological predictors or symptoms were recorded. Potential indicators include blinking, pupil diameter, HRV, gastric activity, GSR, and other physiological stress signals. Additionally, system performance metrics that relate to adverse visual effects could be quantified and assessed for safety. Low frame rates, visual lag, and flicker can cause cyber sickness [191][192][193] as well as errors in task performance [194], and flicker may also cause seizures in susceptible individuals [191].
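As one example of an objective physiological indicator, HRV is often summarized by the root mean square of successive differences (RMSSD) between heartbeat intervals. The sketch below is a minimal illustration under that assumption; the function name `rmssd` is hypothetical, and none of the reviewed articles is known to have used this computation.

```python
import math


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    if len(rr_intervals_ms) < 2:
        raise ValueError("At least two RR intervals are required.")
    # Differences between each interval and the one that follows it.
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    print(rmssd([800, 810, 790, 800]))
```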

Efficiency
Efficiency was quantified with quality of movement (e.g., path length when using tools) and minimal use of additional imaging or materials (e.g., X-ray acquisitions or contrast volume used). The enhanced spatial cognition provided by XR+ technologies may lead to reduction in the patient and surgeon's exposure to radiation, improving the safety of procedures requiring intraoperative imaging.
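Path length, one of the movement-quality metrics mentioned above, is the summed distance between consecutive tracked tool positions. The Python sketch below assumes positions are available as 3D coordinates in a common unit; the function name `path_length` is a hypothetical label.

```python
import math


def path_length(positions):
    """Total distance traveled along a sequence of tracked 3D positions."""
    # Sum the straight-line distance between each pair of consecutive samples.
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))


if __name__ == "__main__":
    tool_tip = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    print(path_length(tool_tip))
```

For the same task outcome, a shorter path length is generally interpreted as more economical movement.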

Physical Loads
Measures of physical load such as EMG and motion capture data allow for objective, quantitative analysis and comparison of ergonomics across visualization types by pinpointing altered muscle activity, movements, or postures that could result in chronic pain or injury. One article quantified physical loading or biomechanics of XR+ device use. Zuo, Jiang [152] collected electromyography (EMG) data to detect fatigue of the sternocleidomastoid muscle, a rotator and flexor of the neck, during AR HMD use.

User Demographics
A few articles provided age and sex data, but most articles did not. This incomplete reporting is in stark contrast to the detailed information often provided about the patients in these studies and is particularly troublesome given that trainees as well as experts may be the intended users for these devices. This lack of information limits understanding of the usability results and their generalizability across user groups.
The age and sex of users can influence the safety and effectiveness of XR+ device use. Age-related presbyopia, an inability to focus on close objects, causes blurred vision of close objects. Hardening of the eye lens during natural aging hinders accommodation, or the mechanism of adjusting the eye lens to focus between distant and near images [195,196]. This is especially important to consider for near-eye displays like HMDs [196], for the tasks being performed (surgical tasks are often done within arm's length) [197], and for the intended users (surgeons are often in their middle age or older) [198].
Sex also plays a role in XR+ device usability. Various studies have shown that women experience cybersickness more often than men [199][200][201]. Some studies have also shown that females tend to have lower visuospatial reasoning ability [202][203][204], although other studies have shown no differences [205][206][207][208] or differences that are diminished with training [209]. In any case, users with lower spatial reasoning skills may benefit most from the improved spatial understanding that AR can provide.

System Performance
System performance was most often characterized by frame rate, system display lag, co-registration error, and calibration error. A major challenge of AR is the integration of virtual imagery with real objects and environments. Accurate image-to-patient co-registration is needed for surgical navigation. Several papers described novel co-registration methods, such as the use of externally affixed fiducial markers [87], manual image placement through hand gestures [79], automatic co-registration based on selected virtual points [8], or tracking of 3D surface features [137]. After initial co-registration, intraoperative physiological movements and shifting of tissue require continuous monitoring and corrections to ensure accuracy throughout the procedure. Physiological movement, such as breathing and heartbeat, can cause rhythmic co-registration errors [210][211][212]. User movement profiles also introduce varying levels of registration error [152]. Compensation for these movements may require advanced algorithms for motion correction. Another source of co-registration error is intraoperative tissue shifting due to handling or removal of tissue. Intraoperative imaging has been used to maintain co-registration during these shifts. Ultrasound methods have been demonstrated for brain shift in neuro-oncology [78], orthopedic surgery at the pelvis [33], and pedicle screw placement [51]. El-Hariri, Pandey [33] reported RMS co-registration errors ranging from 3.22 to 28.30 mm, which they deemed promising but insufficient for surgical applications. These results suggest the need for studies of image co-registration to establish error thresholds for intraoperative use.
Another aspect of system performance considered for AR HMDs was the effectiveness of the user interface under varying conditions. Kubben and Sinlae [72] evaluated the effects of various lighting conditions and colors of the surgeon's gloves on hand gesture recognition for manipulation of virtual objects. While the authors reported that there were no noticeable differences across conditions for the HoloLens, similar analyses may be informative for other commercially available or custom-built AR HMDs for surgical use.
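As an illustration of the RMS co-registration error metric discussed in this section, the following minimal Python sketch computes the root-mean-square Euclidean distance between paired landmark positions in the virtual overlay and on the physical anatomy. It is not the procedure used in any of the cited studies; the function name `rms_registration_error` and the landmark coordinates are hypothetical and chosen only for illustration.

```python
import math


def rms_registration_error(virtual_points, physical_points):
    """Return the RMS Euclidean distance between paired 3D points.

    Both arguments are sequences of (x, y, z) tuples in the same units
    (e.g., millimetres) and in the same order, so that the i-th virtual
    point corresponds to the i-th physical point.
    """
    if len(virtual_points) != len(physical_points) or not virtual_points:
        raise ValueError("point lists must be non-empty and of equal length")
    # Squared Euclidean distance for each landmark pair.
    squared_distances = [
        sum((v - p) ** 2 for v, p in zip(v_pt, p_pt))
        for v_pt, p_pt in zip(virtual_points, physical_points)
    ]
    # Root of the mean squared distance.
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Hypothetical landmark coordinates in millimetres (illustrative only).
virtual = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
physical = [(3.0, 4.0, 0.0), (10.0, 0.0, 5.0), (0.0, 10.0, 0.0)]

rms = rms_registration_error(virtual, physical)
print(f"RMS co-registration error: {rms:.2f} mm")
```

Because the distances are squared before averaging, a single poorly aligned landmark raises the RMS value more than it would raise a simple mean distance, which is one reason reported ranges can be wide.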

Validity and Reliability
Metrics of validity and reliability included face validity, content validity, and various metrics of intra- and inter-rater or user variability. For example, Timonen, Iso-Mustajarvi [163] validated the accuracy of a virtual surgical simulation by measuring the distances between anatomical landmarks in a cadaver and in the virtual model and performing Bland-Altman analyses of similarity.
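For readers unfamiliar with Bland-Altman analysis, the minimal Python sketch below shows the standard calculation: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias plus or minus 1.96 times the sample standard deviation of those differences. The function name `bland_altman` and the distance values are hypothetical and do not reproduce data or code from the cited study.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the limits of
    agreement are bias +/- 1.96 * SD of the differences, which cover
    about 95% of differences if they are roughly normally distributed.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical inter-landmark distances in millimetres (illustrative only).
cadaver_mm = [10.0, 20.0, 30.0, 40.0]
model_mm = [9.0, 21.0, 29.0, 41.0]

bias, lower, upper = bland_altman(cadaver_mm, model_mm)
print(f"bias = {bias:.2f} mm, limits of agreement = [{lower:.2f}, {upper:.2f}] mm")
```

A bias near zero with narrow limits of agreement would indicate that the virtual model reproduces the cadaver measurements closely; what counts as acceptably narrow depends on the clinical application.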

Limitations
The authors acknowledge that the search terms do not include less popular terms, such as "extended reality" or the specific terms for various hardware types; "augmented reality" and "virtual reality" were expected to be present in any articles that also included synonymous or similar terms. The search terms were also broadened to include surgical applications that did not involve AR HMDs. Uses of AR HMDs are emphasized in the text for clarity.
To focus on articles that included user studies, the articles were required to mention "user(s)," "participant(s)," or other common usability research terms in the abstract, title, or keywords. This stipulation may have excluded otherwise eligible articles from consideration. Further, the term "surg*" was intended to find articles related to surgery. No synonyms were included in the search terms.
Scope-mediated procedures were excluded to focus on open surgery and interventional procedure application spaces. However, usability articles for AR in scope-mediated procedures do include methodologies that could be pertinent for AR HMD use; future work could catalogue these methodologies as well. Lastly, the articles included are all published works. Publication bias may have excluded usability assessment information that would be relevant to this review. Because no effect sizes were estimated, a protocol has not been archived, and this review was not registered.

Conclusions
This systematic literature review identified trends in the use of AR HMDs and related technologies for surgical applications, including the number of publications for AR HMDs and related devices, the distribution of AR HMD and XR+ publications across medical specialties in pre-operative vs. intraoperative applications, the user demographics reported in these user studies, and how usability has been assessed. The results show a growth in the published literature over the past two decades for XR+ devices, with AR HMDs particularly gaining prominence within the past decade. Orthopedic applications were most common, followed by neurosurgery and oncology. Whereas XR+ devices have been growing in use in both pre- and intraoperative settings, AR HMDs have most commonly been featured in intraoperative use. We found a lack of objective usability assessments for physiological loading and underreporting of user age and sex. The perceptual, cognitive, and physical loads imposed by XR+ device use are key components of device usability that are often overlooked, as demonstrated by the lack of articles that address the relevant assessment categories. For this review, perceptual loading assessments were limited to vision and related adverse effects due to the lack of user studies for haptic and auditory augmentations in surgery. While it is known that heightened physiological loads can diminish task performance, cause use errors, and negatively affect the user's health [213], only 29% of the XR+ articles directly assessed physiological loads.
The contributions of this work include a list of usability assessments, categories of usability considerations to address, and potential assessments to address common methodology gaps. These findings are intended to inform future usability research for AR surgical applications.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Note:
The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the Department of Health and Human Services.