Augmented Reality for AI-driven Inspection? – A Comparative Usability Study

Inspection in the aerospace industry can, like many other industrial applications, benefit from Augmented Reality (AR) due to its ability to superimpose helpful digital information in 3D, leading to fewer errors and decreased mental demand. However, each AR device has advantages and disadvantages, and not all AR devices are suitable for use in industrial settings. We compare a tablet-based AR solution (Apple iPad Pro) mounted on a tripod with an adjustable arm to head-mounted AR (Microsoft HoloLens 2) and a traditional, computer screen-based human-machine interface (HMI), all three designed to guide operators based on previously performed AI-based image analysis. Following an iterative design process with three formative evaluations, a final field test on a real industrial shop floor with 6 professional inspectors revealed an overall preference for the tripod-fitted iPad variant, which received the best scores in most dimensions covered by both a usability-focused SUS questionnaire (score 71) and a NASA-RTLX form focused on perceived workload. More specifically, the tripod-fitted iPad was considered more usable (SUS) than the classic computer display HMI (M=5.83, SD=4.92, p=0.034, N=6); the temporal demand (NASA-RTLX) was considered lower using the iPad compared to both HoloLens 2 and the HMI (M=6.67, SD=4.08, p=0.010; M=10.83, SD=9.70, p=0.040, respectively; N=6).


Introduction
Visual inspection of manufactured aerospace components is an important operation that requires high operator skill and attention. This demand makes it a suitable task for collaborative automation, where part of the operators' skills and cognitive effort can be supplemented by a robot and a computer system. The defects to be searched for and identified differ between component types, further increasing the complexity of the inspection task. Inspectors' skill levels vary, and inspectors' employment retention differs between geographical locations in the aerospace sector. Many products are inspected fully manually, but automatic visual inspection technology is desired to reduce both the time needed and the number of faulty products sold. Currently, when inspecting cylindrical components with symmetrical features, it is challenging to locate indications of defects because the components lack distinctive visual features for localization.
In our implemented inspection guidance system, shown in Fig. 1, robotic systems with sensors scan the components and use AI algorithms to analyze the scans. The automatically identified potential defects are then presented to an experienced human inspector for further investigation and the actual decision-making through one of three devices, which we evaluate in this study: an Apple iPad placed on a tripod; a Microsoft HoloLens 2 head-mounted display; and a Human-Machine Interface on a computer screen (HMI). We compare these three devices in an industrial environment, evaluated by experienced inspectors. We compare usability, perceived workload, and task completion time to understand which device is the most suitable for visual inspection.

Related work
Augmented Reality (AR) has lately received growing interest across multiple fields ranging from medicine [1,2] to industry [3,4,5]. Industrial AR applications are used to guide operators in assembly, maintenance, and quality assessment due to their ability to superimpose digital information on relevant parts of the real world, reducing cognitive load. Additionally, current Head-Mounted Displays (HMD) allow for hands-free interaction, an obvious benefit in many industrial scenarios. However, the actual adoption of AR in industry remains limited [6], and few studies demonstrate its advantages.

Augmented Reality (AR) guidance techniques
Visual presentation is the dominant way of conveying information to users in AR [7]. Visual highlighting of real-world objects and the presentation of dynamic arrows to guide users to areas where maintenance is needed, as in [8] and [9] respectively, are two traditional and widely used AR guidance techniques. Recent applications use (visual) AR as an alternative to voice-based interfaces, as in pick by AR [10], or for visualizing what cannot normally be perceived, such as inner machine processes [8] and robot trajectories [11].
Yet another AR guidance approach involves the usage of avatars, which operators either "climb into" and embody or aim to mimic by observing them from a third-person perspective. In some instances, this approach has been found to cause less cognitive strain than AR arrow-based solutions [12]. The usage of avatars can be pushed even further by allowing experts to record and transfer their guidance to an avatar to train users with their own tutoring content, as in [13].

Evaluation of AR devices in production environments
Several studies evaluated the usage of augmented reality compared to handheld devices in the production environment [3,4,5,14]. For example, remote maintenance using Microsoft HoloLens was compared to a telephone in an experiment involving six maintenance engineers in a production environment [14]. Describing situations with HoloLens was rated as more efficient than with the telephone. Furthermore, the participants felt safer taking the advice provided by the expert in the HoloLens scenario, and the HoloLens condition required approximately 20% less time [14]. A field study with professional machine operators performing a setup task for an injection molding machine was reported in [5], comparing a HoloLens 2 HMD to a tablet computer (Dell Latitude). The authors captured task performance times, number of errors, and perceived workload, combined with usability questions. Participants felt that HoloLens 2 was more comfortable to wear and noted turning away from the task less frequently compared to the tablet. The task completion times were similar, even though participants perceived them as reduced for HoloLens 2.
Three studies focused on inspection tasks [3,4,15]. In the first, [3], an AR inspection tool for worker support was developed in Unity for an auxiliary baseplate system, using QR codes for calibration. The experiment was performed on a handheld tablet (Galaxy Tab), comparing two different user groups, engineers and factory workers, and capturing usability and perceived workload as well as measured time and number of errors. A marker-less AR HMD approach to printed circuit board assembly inspection was evaluated in [4], comparing the HMD approach to handheld devices (mobile phone and tablet). In that study, the participants significantly preferred the proposed HMD system over the handheld devices. Closest to our application, one of the cases presented in [15] used augmented reality on a tablet device coupled with an external tracking system to visualize inspection results on airplane wings. We have not found other examples of mounting handheld AR devices on a tripod as in our case.

AR inspection tool design and development
We followed an iterative design process during the development of both AR versions of the inspection tool. There were three formative evaluations before the final field experiment. In these formative pilot tests, company employees who were not part of the field experiment provided feedback. After each session, the design was refined and adjustments were made.

Development of the iPad version
The iPad version was developed for iPad Pro 11 in Unity using the Open XR Plugin and AR Foundation. We used the Vuforia SDK and a multi-target 3D-printed cube for tracking and rotation of the component. A simplified virtual version of an actual component was used for manual alignment.

Development of the HoloLens 2 version
The HoloLens 2 version was developed in Unity using the Mixed Reality Toolkit (MRTK). We used the Vuforia SDK together with a scan of the surroundings made with the LiDAR sensor of the iPad Pro 11 during the development phase. To track the user rotating the component, an absolute rotary encoder connected to a Raspberry Pi transmitted the rotational values of the component over the MQTT protocol. A simplified virtual model of an actual component was used for manual alignment.
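The encoder-to-headset link described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the encoder resolution (`counts_per_rev=4096`), the topic name, and the JSON payload layout are all assumptions made for illustration, and the MQTT client is abstracted as a `publish(topic, payload)` callable so that the example stays self-contained.

```python
import json


def counts_to_degrees(raw_count, counts_per_rev=4096):
    """Convert an absolute encoder reading to an angle in [0, 360).

    counts_per_rev is a hypothetical resolution, not a value from the study.
    """
    return (raw_count % counts_per_rev) * 360.0 / counts_per_rev


def make_payload(raw_count, counts_per_rev=4096):
    """Build a JSON message carrying the component rotation in degrees."""
    angle = counts_to_degrees(raw_count, counts_per_rev)
    return json.dumps({"angle_deg": round(angle, 2)})


def publish_rotation(publish, raw_count,
                     topic="inspection/component/rotation"):
    """Send one reading through any publish(topic, payload) callable.

    On a Raspberry Pi this callable would wrap an MQTT client;
    the topic name here is purely illustrative.
    """
    publish(topic, make_payload(raw_count))


if __name__ == "__main__":
    # Stand-in for an MQTT client: print instead of sending over the network.
    publish_rotation(lambda t, p: print(t, p), raw_count=1024)
```

Keeping the conversion separate from the transport makes it possible to test the angle computation without a broker or encoder hardware.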

Interface design
A minimalistic interface was designed to guide users during inspection, as shown in Fig. 2. We took care to keep the language and menu features consistent across all devices and to match the previously developed HMI (Fig. 2c). Both AR versions included: a list of all indications of defects previously identified by visual techniques; an inspection mode where users make decisions on the type of defect and whether it should be accepted or not; a photo of the defect previously captured by an industrial robot; an arrow at the bottom of the component for rotation direction guidance; a blue cube that encloses the area where a defect is located; and settings allowing manual alignment and visualization of the simplified virtual copy of the component.
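One way the rotation direction arrow could be driven is by the signed shortest angular difference between the current component rotation and the angular position of the selected defect. The sketch below only illustrates that idea; the function name, the tolerance, and the convention that angles increase counterclockwise are assumptions, not details taken from the study.

```python
def rotation_hint(current_deg, target_deg, tolerance_deg=2.0):
    """Return (direction, signed_difference) for the shortest rotation.

    Assumes angles in degrees that increase counterclockwise.
    The signed difference lies in [-180, 180).
    """
    diff = (target_deg - current_deg + 180.0) % 360.0 - 180.0
    if abs(diff) <= tolerance_deg:
        return "aligned", diff
    direction = "counterclockwise" if diff > 0 else "clockwise"
    return direction, diff


if __name__ == "__main__":
    # Component at 350 degrees, defect at 10 degrees: crossing the
    # 0/360 boundary is shorter than rotating back by 340 degrees.
    print(rotation_hint(350.0, 10.0))
```

Wrapping the difference into [-180, 180) avoids sending the inspector the long way around when the target lies just across the 0/360 degree boundary.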

Pilot studies
Three formative pilot evaluation sessions were arranged to assess the usability of the AR prototypes under development. We used SUS questionnaires to evaluate whether the prototypes reached a good level of usability. The participants were introduced to the problem area and presented with a simplified inspection task of checking three pre-selected indications, the same for each participant.
We first evaluated the AR system on an iPad as a handheld device with five participants. In addition to SUS scores, the participants provided feedback that holding the iPad constrained their hands and made it difficult to operate during the task. Additionally, there were problems with maintaining precise tracking when the tracking cube was obstructed or far from the iPad camera's field of view.
To tackle these issues, for the next evaluation session, in which eight other participants tried out both the iPad and HoloLens 2, we placed the iPad on a tripod with a flexible arm. This adjustment led to an increase in the SUS score, as can be seen in Fig. 3: the handheld iPad received a score of 68 in our first pilot study, whereas the tripod-fitted version received a score of 81 in this second pilot study. A paired t-test (N=8) on the two conditions shown as the two rightmost bars in Fig. 3 indicated that the difference in SUS scores between the tripod-mounted iPad and the HoloLens 2 interface was significant (M=6.1, SD=7.56, p=0.031).
The last formative evaluation was performed by two professional inspectors who tried the tripod-fitted iPad and HoloLens 2. Their interactions with both systems were video-recorded, and the inspectors subsequently took part in a semi-structured interview in which they were asked about the inspection procedure, how long the inspection usually takes, and which device they thought would be easier to use. The inspectors pointed out that the prototype only examines indications of defects on the outside, whereas they usually inspect the inside of the component as well. They appreciated being able to use the picture taken by the robot as part of the report they must submit at the end of the inspection procedure. Regarding inspection time, according to one of the inspectors, a complete inspection of a single, big component takes about three hours. In their opinion, the iPad would be preferable for older generations and HoloLens 2 for younger ones. The screen captures and, in the case of HoloLens 2, first-person view recordings of the inspectors' actions in the new AR system gave us input on how inspectors navigate the interface. For example, the blue arrow showing rotation direction at the bottom of the component, shown in Fig. 2, was added after this evaluation.

Field experiment
We conducted an experiment comparing three interfaces: the HMI, presenting a 2D map of the inspected component with marked locations of defects; an AR-based interface on an iPad placed on a tripod with an adjustable arm, showing the locations of the defects on top of the component; and an interface using the Microsoft HoloLens 2 head-mounted display, with the same ability to show the locations of the defects on the component. All three variants enabled participants to work hands-free at an inspection station on the shop floor at GKN Aerospace Engine Systems in Trollhättan, Sweden.

Experimental design
We used a repeated-measures experimental design with one independent variable (device) with three test conditions (iPad on tripod, HoloLens 2, HMI). We counterbalanced the order of conditions to minimize learning effects. Dependent variables were task completion time, usability score (SUS), and perceived workload (NASA-RTLX) score. Participants performed an inspection task covering 12 different defect indications provided in the interface, one after another. The order of defect indications was randomized before each condition. Each participant performed the task three times, once for each condition. Participants' data were anonymized.
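As an illustration of how such a schedule could be generated, the sketch below assigns each participant one of the six possible orders of the three conditions and shuffles the 12 indications before each condition. The text does not state which counterbalancing scheme was used; full counterbalancing over all permutations is an assumption that happens to fit six participants, and all function names are hypothetical.

```python
import itertools
import random

CONDITIONS = ("iPad on tripod", "HoloLens 2", "HMI")


def counterbalanced_orders(conditions, n_participants):
    """Cycle through all permutations of the conditions.

    With 3 conditions there are 6 orders, so 6 participants each
    receive a distinct order (full counterbalancing, assumed here).
    """
    perms = list(itertools.permutations(conditions))
    return [perms[i % len(perms)] for i in range(n_participants)]


def randomized_indications(n_indications, rng):
    """Return indication numbers 1..n in a freshly shuffled order."""
    order = list(range(1, n_indications + 1))
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only to make the demo repeatable
    for pid, order in enumerate(counterbalanced_orders(CONDITIONS, 6), 1):
        for condition in order:
            print(pid, condition, randomized_indications(12, rng))
```

With this scheme every condition appears in every serial position equally often, which is what allows order effects such as learning to average out across participants.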

Procedure
Before the start of the experiment, each participant received a sheet with information about the study. We followed the Swedish Ethical Board's recommendations (https://etikprovningsmyndigheten.se) on what such a document should contain. After reading the information and having time to ask questions, the participants signed an informed consent form and then filled in a demographic questionnaire. Each participant was introduced to HoloLens 2 through the same training phase. We used the Examples application from Microsoft and let the participants explore the Button Examples and the Scrolling Object Collection examples, since scrolling menus and pressing buttons are dominant features of the inspection interface. After the training phase, when comfortable with the interactions, each participant was guided to an allocated inspection table on the shop floor to perform the experiment.

Participants
We recruited 6 participants (5 men and 1 woman), on average 44 years old, all experienced inspectors with an average of 11 years of experience. One had used AR before. All participants were Swedish-speaking and took part in the experiment as part of their work, so no compensation was provided.

Measures
We measured task completion time, perceived usability using the SUS questionnaire, and perceived workload using NASA-RTLX, both translated to Swedish. SUS [16] is a questionnaire widely used for evaluating the usability of various applications, including industrial ones [3], providing reliable results even with small sample sizes [17]. NASA-RTLX [18] is the raw, unweighted version of the NASA-TLX questionnaire. The questionnaire evaluates perceived workload based on ratings on six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration. We considered measuring error rate; however, it was impossible for us to determine whether the responses provided by the participants were correct or not. The reasons for this were some inspectors' unfamiliarity with the type of component and the unavailability of certain inspection tools during the task, for example to measure the depth of a scratch.
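For readers unfamiliar with how the two questionnaires are turned into numbers, the following sketch shows the standard SUS scoring rule (ten items rated 1 to 5, odd items contribute the rating minus 1, even items contribute 5 minus the rating, and the sum is multiplied by 2.5) and the raw TLX score as the unweighted mean of the six subscale ratings. The function names and the dictionary keys are illustrative, and the rating scale used for the subscales is left to the caller because it is not specified in the text.

```python
SUBSCALES = ("mental demand", "physical demand", "temporal demand",
             "performance", "effort", "frustration")


def sus_score(responses):
    """Score one SUS questionnaire (10 items, each rated 1..5) on 0..100."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for item, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("SUS ratings must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


def raw_tlx(ratings):
    """Raw (unweighted) TLX: the plain mean of the six subscale ratings."""
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
    print(raw_tlx({name: 30 for name in SUBSCALES}))
```

The subscale ratings can also be analyzed one by one, as is done for temporal demand in the results.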

Results
In the following, we present the results regarding usability, perceived workload, and task completion time. All analyses were conducted using Microsoft Excel and IBM® SPSS, with confidence intervals set to 95%.

Usability
To compare SUS scores between the conditions completed by our participants, we performed a t-test on each pair of conditions. As Table 1 illustrates, the only pair where the difference in SUS scores is significant is iPad-HMI (p=0.034, N=6), with the iPad condition being significantly more usable than the HMI based on participants' responses. Table 1 also presents the System Usability Scale (SUS) scores for each of the three conditions (iPad, HoloLens 2, and HMI). As can be seen, the scores for iPad, HoloLens 2, and HMI are 71, 68, and 65, respectively. Compared to the average score of 68 from previous studies [17], participants rated the usability of the iPad version of our inspection system as above average, the HoloLens 2 version as about average, and the HMI as below average. Given that most of the previous research has been done on standardized systems, mainly focusing on the web or mobile apps [17], we can interpret the results as good overall.
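The relation between the reported summary statistics and the p-value can be checked with a short calculation, assuming the test is a two-sided paired-samples t-test, so that t = M / (SD / sqrt(N)) with N - 1 degrees of freedom, where M and SD are the mean and standard deviation of the per-participant differences. The sketch below is not the SPSS procedure used by the authors; it integrates the Student t density numerically with the standard library only, and the function names are illustrative.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n):
    """Return (t, df) for a paired t-test from summary statistics."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps is even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return 1 - central_mass


if __name__ == "__main__":
    # iPad vs. HMI SUS difference as reported: M=5.83, SD=4.92, N=6.
    t, df = paired_t_from_summary(5.83, 4.92, 6)
    print(round(t, 2), df, round(two_sided_p(t, df), 3))
```

Under the same assumption, the summary values reported for temporal demand (M=6.67, SD=4.08 and M=10.83, SD=9.70, N=6) can be passed to the same functions.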

Perceived workload
Paired t-tests across the three conditions on the participants' ratings for the six NASA-RTLX subscales show that temporal demand was considered lower using the iPad compared to both HoloLens 2 and the HMI (M=6.67, SD=4.08, p=0.010; M=10.83, SD=9.70, p=0.040, respectively; N=6). In fact, perceived workload was rated lowest for the iPad version of the system in all cases (Fig. 5); however, no other differences were statistically significant (see Table 2).

Time
Average task completion times were 9.1, 9.4, and 11.8 minutes for the iPad, HoloLens 2, and HMI, respectively. None of the differences in time was statistically significant according to the t-test we performed.

Pilot tests
When analyzing the SUS scores for the early AR prototypes evaluated in the pilot tests, it can be beneficial to translate SUS numbers to adjectives. Using an empirically established mapping suggested in the literature [19], mean SUS scores of 50.9, 71.4, 85.5, and 90.9 correspond to the usability adjectives "OK", "Good", "Excellent", and "Best Imaginable", respectively. From Fig. 3 we can conclude that while all three user interface solutions (handheld iPad, tripod iPad, HoloLens 2) were regarded as clearly better than "OK", the iPad on a tripod was regarded as "Excellent". The high level of acceptance gave us a good base for testing the devices in a real industrial environment. Participants mentioned during the interview that HoloLens 2 could be more suitable for younger inspectors, whereas the iPad would be better for older ones. However, we observed no such preference in our data. Additionally, the inspectors mentioned being more familiar with using a touch screen than with virtual buttons in the HMD.

Field test in industrial environment
Previous studies comparing HMD AR with handheld devices often stress the benefits of using HMDs. Advantages reported in previous work include greater efficiency [14], saved time [14], or a feeling of saved time [5]. In one of the previous studies, participants felt safer with the support provided through the HMD [14]. Perhaps surprisingly, one study mentioned comfort benefits from using an HMD compared to a handheld device [5]. While HMD-based AR has been found to work well in inspection tasks [4], it was beaten in our usability-focused study by an iPad placed on a tripod. We observed no statistically significant difference in task completion time and therefore cannot confirm saved time as in [14]; however, it is possible that the difference would become evident with a larger data sample. Reflecting on the SUS scores given by the participants during the shop floor evaluation in the field test, we can interpret the iPad-based inspection system as "good" and HoloLens 2 and the HMI as "OK" [19]. Note that the experts involved in the shop floor study generally rated the system lower (71 for iPad on a tripod, 68 for HoloLens 2) than the novices involved in the pilot (81 for iPad on a tripod, 75 for HoloLens 2). This is in line with previous research [20,21] reporting that experts found more usability issues than novices.
There are several possible reasons why the tripod-mounted iPad outperformed the HoloLens 2 for our investigated task. One possible explanation is that by placing the iPad on a tripod we removed otherwise occurring tracking instability [22] and made both devices hands-free, taking away the general advantage of HMDs. Another explanation can be related to the social aspect of the inspection task. Inspectors like to have discussions with their coworkers, something which can be hampered by "cyborg" feelings [22]. Furthermore, vision problems such as eye strain [21,23] and blurred view [23] caused by wearing HoloLens can also be reasons for preferring the iPad variant over HoloLens. Finally, participants' iPad preference can also be explained by the mere-exposure effect [24], a phenomenon by which people tend to prefer what they are familiar with, which in our case is the tablet.

Limitations
The main limitation of our study is the small number of participants (N=6), which reduces the generalizability of the results. This was due to the limited number of inspectors working at the company, a common issue [5,14]. Another limitation is that end-users were involved rather late in the design process. Involving end-users earlier might have improved both the tested prototypes and the experimental procedure.

Conclusion
With the overall aim of identifying the most usable way to support visual inspection in production processes, based on data from a previously performed AI-driven image analysis, we compared two AR-based inspection guidance approaches (iPad and HoloLens 2) with a standard computer display-based HMI. The iterative design process included three formative evaluations and, amongst other things, made us place the iPad on a tripod for improved ergonomics and increased tracking precision. This tripod-fitted iPad turned out to be the most appreciated inspection support solution amongst the three alternatives, receiving the best scores in both usability (SUS) and perceived workload (NASA-RTLX), although statistical significance was only observed in the following cases: the tripod-fitted iPad was considered more usable than the classic computer display HMI (p=0.034, N=6); temporal demand was considered lower using the iPad compared to both HoloLens 2 and the HMI (M=6.67, SD=4.08, p=0.010; M=10.83, SD=9.70, p=0.040, N=6).

Fig. 1. Overview of the proposed inspection system. In this work, we evaluate usability, perceived workload and task completion times of the different devices used for the inspection system interface (marked in orange).

Fig. 3. SUS scores from pilot studies: The purple bar illustrates the average score participants gave the handheld iPad version in Pilot 1 (N=5). The green bar illustrates the average score participants gave the version with the iPad placed on the tripod in Pilot 2 (N=8), and the red bar the score for HoloLens 2 in Pilot 2 (N=8).

Table 2. Paired samples t-test performed on each pair of the NASA-RTLX subscales regarding mental demand, physical demand, temporal demand, performance, effort, and frustration for the three pairs.