Journal Pre-proof A

ABSTRACT


INTRODUCTION
Surgical innovations are complex and characterized by a development phase where new procedures and devices are iteratively modified and improved [1,2]. This refines processes and outcomes so that innovations are optimized until no further modifications are required. Theoretically, innovations progressing through the translational pathway subsequently undergo randomized evaluation to establish effectiveness and cost-effectiveness, as illustrated in the IDEAL (Idea, Development, Exploration, Assessment and Long-term follow up) framework [1]. In practice, few innovations follow this incremental pathway and full evaluation in a main trial does not always occur before the procedure is widely adopted [3][4][5][6][7][8]. A key factor influencing the development and uptake of a new procedure is the experience of delivering the innovation (operator experience). Positive and negative experiences shape the development process. For example, physical hardship caused by poor ergonomics may inspire a device to be re-designed [9][10][11]. Similarly, psychological stress created by a highly complex procedure may prompt improvement by simplifying tasks [12,13]. It is expected that the introduction of new procedures will involve additional effort, risks and uncertain benefits compared to routine care [14]. Operators' perceptions of these risks and potential benefits are viewed through the lens of their experiences and consequently influence their willingness to pursue the development and adoption of innovation.
Measuring operators' experience is therefore integral to understanding how and why surgical innovations are developed and to explore subsequent uptake of innovations. Efforts have been made to capture operators' experience of surgery. Typically, these include observer or self-reported measures of physical, psychological and emotional experiences in routine care. Evaluation of operators' experiences of innovative surgery, however, is inconsistent and lacks a standard measurement instrument [6,7]. This might hinder evidence synthesis, prevent shared learning between investigators, and slow development cycles [1,15,16]. This study aims to identify, critically appraise, and recommend a measure of operators' experience of performing innovative surgery to inform efficient and systematic evaluation of innovation.

Methods were informed by COSMIN (COnsensus-based Standards for the selection of health Measurement Instruments) guidelines for systematic reviews of outcome measurement instruments that were modified to design this study [17]. These guidelines provide a framework to generate a comprehensive overview of the quality of measurement instruments to support evidence-based recommendations for the selection of the most suitable instrument for a given purpose. There were three phases: 1) identification of measurement instruments and development of a conceptual framework, 2) appraisal of instrument quality, 3) supplemental appraisal of content validity in the context of surgical innovation. A flow chart illustrating the study design is presented in Figure 1.

Definitions
Operators' experience is defined as the self-reported perception of performing an invasive procedure. It may be unidimensional (measuring only one concept) or multidimensional (measuring multiple concepts) and includes, but is not limited to, physical (e.g. comfort), psychological (e.g. mental complexity) and emotional (e.g. anxiety) experiences. Self-reported perceived competence is included; observer-assessed measures of competence (e.g. analysis of learning curve) are excluded. Published definitions for an 'invasive procedure', 'innovative procedure', 'operator' and 'outcome' are used and are all provided in Supplemental File 1.

This study used multiple data sources to identify measures of operator experience in studies of early phase innovations and to develop a framework of concepts being measured in the context of surgical innovation. This approach was used because scoping work revealed that traditional systematic literature search strategies would not identify relevant articles, as key words for the subject of interest are not available. Three literature reviews were therefore undertaken that were designed to identify 1) author-reported IDEAL innovation studies, 2) studies of known innovative devices from a broad range of medical disciplines, and 3) a sample of studies of colorectal cancer surgical innovation. Detailed methods and results for each review are described elsewhere [6,7,18]. Data sources were supplemented by targeted searches for innovation studies, a scoping search for existing systematic reviews of measures of surgeon experience used in routine surgery, and snowball searches of reference lists. Search terms for 'invasive procedures' and 'measurement instruments' were combined with a systematic review filter and applied to the Ovid version of MEDLINE with no restrictions. Included were systematic reviews of studies measuring operator experience. Excluded were non-human and non-English language articles.
Self-reported measurement instruments were selected based on the presence of a development paper, defined by COSMIN as any "qualitative or quantitative study that were performed in order to develop a measurement instrument, including pilot testing of a draft measurement instrument, concept elicitation and/or testing of a new measurement instrument" [19]. Development papers were obtained and snowball reference list searching was used to further identify articles of relevance.

Data extraction and analysis
Outcomes and measurement instruments relevant to operators' experience were extracted from data sources verbatim through line-by-line coding, including details of measurement items and scales. Verbatim outcomes, measurement items and scales were categorized into conceptual domains by two researchers independently. Conceptual domains were summarized to create a framework of concepts being measured in the context of surgical innovation to inform the appraisal of instrument quality in phase 2.
Characteristics of measurement instruments that underwent formal development were obtained from development papers and summarized using descriptive statistics including number of items (single or multi-item), number and description of dimensions, and scope. The scope of instruments is described as generic (designed to apply in healthcare and non-healthcare contexts), healthcare-specific (designed to apply in any healthcare context), surgery-specific (designed to apply to any invasive procedure), or technique-specific (designed to apply in specific surgical techniques). All self-reported measurement instruments which underwent formal development were eligible to be brought forward to phase 2.
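The descriptive summary of instrument characteristics described above can be illustrated with a minimal sketch. The instrument names and item counts below are hypothetical placeholders and are not data from this study; the function name `summarize_items` is likewise an assumption introduced for illustration.

```python
from statistics import median

# Hypothetical item counts per instrument (illustrative only, NOT study data).
item_counts = {
    "Instrument A": 1,   # single-item measure
    "Instrument B": 6,
    "Instrument C": 10,
    "Instrument D": 20,
}

def summarize_items(counts):
    """Summarize the number of items per instrument with simple descriptive statistics."""
    values = sorted(counts.values())
    return {
        "instruments": len(values),
        "total_items": sum(values),
        "median_items": median(values),
        "range": (values[0], values[-1]),
        "single_item_instruments": sum(1 for v in values if v == 1),
    }

summary = summarize_items(item_counts)
print(summary)
```

The same structure could be extended with a scope label per instrument (generic, healthcare-specific, surgery-specific or technique-specific) if counts by scope were also of interest.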

Phase 2: Appraisal of instrument quality
Quality of identified measurement instruments was appraised to determine which instruments are suitable and of sufficient quality to be taken forward into phase 3. Content validity was evaluated because guidelines consider it the most important measurement property to ensure the instrument is relevant, comprehensive, and comprehensible with respect to the construct of interest and target population [20]. COSMIN methodology was used to support quality appraisal [19]. COSMIN methodology was developed to review patient-reported outcome measures and was adapted to this setting.
Each measurement instrument underwent two assessments that were summarized to inform a single overall quality rating. Steps and deviations from COSMIN methodology are detailed below.

Assessment of the quality of the development paper
The first assessment rated the quality of the measurement instrument development paper. Two reviewers (AM, CH) independently evaluated the quality of instrument development using 35 COSMIN standards that were rated 'very good', 'adequate', 'doubtful' or 'inadequate'.

Evaluation of the content validity of the measurement instrument
Results from the first assessment informed ratings of measurement instrument development against the 10 criteria for good content validity described by COSMIN [17,21]. Ratings considered the context, construct and population of interest as described in the relevant development paper.
A second assessment rated the measurement instrument. Two reviewers independently evaluated the content of each instrument against the construct (conceptual framework of surgical innovation developed in phase 1), population (surgeon innovators) and context (surgical innovation) of interest.
Ratings for each instrument against COSMIN's 10 criteria were provided as above.
Individual reviewer ratings were then reconciled in discussions between authors to produce combined ratings for assessments 1 (rating of development paper) and 2 (rating of the measurement instrument in the context of surgical innovation).
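The reconciliation step can be sketched as a simple comparison of two reviewers' independent ratings, listing the criteria that need discussion. This is an illustrative sketch only: the function name `find_discrepancies`, the three criterion labels and the example ratings are hypothetical, and the rating labels are assumed to be the four categories used in this paper (sufficient, insufficient, inconsistent, indeterminate).

```python
# Rating categories assumed from the terms used in the text.
VALID_RATINGS = {"sufficient", "insufficient", "inconsistent", "indeterminate"}

def find_discrepancies(reviewer_a, reviewer_b):
    """Return the criteria on which two reviewers' ratings differ."""
    if reviewer_a.keys() != reviewer_b.keys():
        raise ValueError("both reviewers must rate the same criteria")
    for ratings in (reviewer_a, reviewer_b):
        unknown = set(ratings.values()) - VALID_RATINGS
        if unknown:
            raise ValueError(f"unknown rating(s): {sorted(unknown)}")
    return {
        criterion: (reviewer_a[criterion], reviewer_b[criterion])
        for criterion in reviewer_a
        if reviewer_a[criterion] != reviewer_b[criterion]
    }

# Hypothetical ratings for one instrument (illustrative only, NOT study data).
reviewer_a = {
    "relevance": "sufficient",
    "comprehensiveness": "insufficient",
    "comprehensibility": "sufficient",
}
reviewer_b = {
    "relevance": "sufficient",
    "comprehensiveness": "indeterminate",
    "comprehensibility": "sufficient",
}

discrepancies = find_discrepancies(reviewer_a, reviewer_b)
print(discrepancies)
```

Only the flagged criteria would then be taken to discussion; the sketch does not attempt to automate the consensus rating itself, which in this study was reached through discussion between authors.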

Selection of measurement instruments for supplemental validation.
In a final step, all ratings for each measurement instrument were reviewed jointly by the two reviewers who qualitatively summarized data to subjectively rank instruments according to which was considered a suitable and sufficiently high-quality measure to assess operators' experience for surgical innovation. Discrepancies between reviewers' ratings were resolved in discussions with the wider study team. Collective review of the reviewers' ranking of instruments by the multidisciplinary study team, alongside further discussions, informed decisions on which instruments to bring forward for supplemental appraisal of content validity in phase 3.

Phase 3: Supplemental appraisal of content validity in the context of surgical innovation
Measures brought forward from phase 2 underwent further appraisal to explore any deficiencies in content validity identified during quality appraisal. Semi-structured interviews with operators from multiple countries with experience of surgical innovation considered whether the instruments' content is an adequate reflection of operators' experience of surgical innovation (as defined by the conceptual framework and stakeholders' own experience) by exploring relevance, comprehensiveness and comprehensibility, and views on the most suitable measure for clinical use. Interviews were conducted over video conferencing software (Zoom, MS Teams) by two researchers (AM, colorectal surgeon and CH, social scientist) trained and experienced in qualitative research and with diverse backgrounds to enable triangulation. A topic guide was created and piloted to ensure discussions covered pre-defined areas of interest while being applied flexibly to allow participants freedom to explore new topics. Any arising deviant views were actively explored. A purposive sampling strategy was implemented to ensure participants represented the target population (i.e. operators with experience of surgical innovation) and to maximize variation in participant characteristics by sex, geographic location, experience with surgical innovation and self-described clinical specialty. Interview participants were identified through the authors' personal network and [...]. Interviews and the focus group were audio recorded and transcribed. Principles of thematic analysis were applied using a framework approach [22], whereby transcripts were read and re-read for familiarization, line-by-line coding was undertaken to assign meaning to relevant text, and themes were identified by collating similar codes and revised through a process of constant comparison with new data and discussion with the study team.
The analysis primarily focused on the framework of a priori topics described above; however, an inductive approach was also undertaken to allow any new themes to emerge from the data. Results are presented by theme.

Phase 1: Identification of measurement instruments
A total of 243 studies were included from multiple data sources including 48 author-reported IDEAL studies [7], 128 studies of innovative devices, 51 randomly sampled studies of colorectal cancer surgical innovation [6], and 16 from supplemental searches, including 1 systematic review [13] (Figure 2). For most outcomes (281, 92% of total outcomes extracted) no further detail of how they were measured was provided. For example, verbatim outcomes described that the innovation was "easy to learn" [23] or that "difficulty during surgery was evaluated" [24], but without any details of how this was assessed or whether a measurement instrument was used.
There were 21 measurement instruments identified which underwent formal development. No measurement instrument was used more than once. Instruments contained a total of 146 items (median 9, range 1 to 40). The most frequently measured conceptual domains were 'Psychology' (112 items across 13 instruments), followed by 'Usability' (34 items across 3 instruments). 'Physical comfort' was represented in three measurement instruments. The conceptual framework, derived from 304 verbatim outcomes and 146 measurement items, is summarized in Table 1.
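The counts reported above can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names and the `percent` helper are assumptions introduced for illustration.

```python
# Study counts by data source, as reported in the text.
studies_by_source = {
    "author-reported IDEAL studies": 48,
    "studies of innovative devices": 128,
    "randomly sampled studies": 51,
    "supplemental searches": 16,
}
total_studies = sum(studies_by_source.values())

# Verbatim outcomes extracted, and those reported without measurement detail.
outcomes_total = 304
outcomes_without_detail = 281

def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

share_without_detail = percent(outcomes_without_detail, outcomes_total)

print(total_studies)         # 243
print(share_without_detail)  # 92
```

Both values agree with the totals reported in the text (243 studies; 92% of outcomes without measurement detail).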
[Table: content validity ratings of the measurement instruments. Cell values are not recoverable; recoverable row labels include SWAT and BPD/LED.]
* Measurement instruments taken forward to phase 3.
a Order of instruments represents subjective ranking according to which instrument was considered a suitable and sufficiently high-quality measure to assess operators' experience for surgical innovation.

Two measurement instruments were considered of inconsistent quality: the Spielberger State-Trait Anxiety Inventory (STAI) [38] and the Imperial Stress Assessment Tool (ISAT) [26].
Comprehensiveness of both instruments was rated as insufficient, with indeterminate and inconsistent relevance, and sufficient and indeterminate comprehensibility, respectively. Similarities in the ratings of these instruments were expected, as the STAI forms one dimension of the ISAT with the addition of two physiological measures (cortisol and heart rate). The SURG-TLX, STAI and ISAT progressed to phase 3; however, the STAI was presented within the ISAT to avoid duplication.

Phase 3: Supplemental appraisal of content validity in the context of surgical innovation
Interviews were conducted between July and November 2020 and lasted between 30 and 45 minutes. A total of 20 professionals (7, 35% female) participated from a range of surgical specialties internationally (see Supplemental File 1, Table S[...]).

SURG-TLX: "The Surgical Task Load Index seems to be much more comprehensive in nature and much more pertinent to the topic at hand, so I would say that by a long shot." [P21]
ISAT: "Just from the pragmatic perspective, are we really going to be recommending a tool which suggests that you're going to have to capture cortisol and heart rate? I think realistically … I mean yes in the perfect world but this seems to be much more of a research tool to be honest." [WP6]

Emergent themes and supporting quotations
Procedures occur in stages: "So it's no longer just the global procedure and getting a score for everything or getting feedback for everything, it's start to think about how can we break down those steps of the procedure or device procedure into phases and steps that you can then really finesse which parts and which phases of the procedure we[re] particularly difficult and complex." [P24]
Patient complexity: "I think somehow you've got to be able to know that within that procedure, the general question about was it an average procedure? Or was it more difficult? Or more not? Nothing to do with the instrument but about the patient themself." [P19]
Impact of wider operating team: "I think...it's important to ask different persons or people from the team."

Relevance
Overall, participants explicitly noted the high relevance of the six concepts measured by the SURG-TLX. Nine (45%) provided unprompted support for the relevance of all six concepts measured by the SURG-TLX to measure operators' experience with innovation. Half used examples to illustrate the relevance of mental demands, physical demands, task complexity and situational stress to innovation without prompting. Task complexity was often referred to as the most relevant concept.
Temporal demands and distractions were described as least relevant by five (25%) participants. The SURG-TLX was considered equally relevant to new procedures and devices.
The relevance of the ISAT's cortisol and heart rate measurement to surgical innovation was questioned by the majority (11, 55%). Participants highlighted practical difficulties in measurement and interpretation. Similar doubts were expressed about the relevance of self-reported anxiety to surgical innovation.

Comprehensiveness
All participants agreed that the SURG-TLX was comprehensive and that concepts generally "capture most of the themes associated with a new procedure" [P29]. Conversely, participants viewed the ISAT as insufficiently comprehensive because of the focus on stress and anxiety. Two additional concepts emerged that were not addressed by either measure. Five (25%) professionals described a need to capture overall satisfaction with the innovation. Similarly, six (30%) participants described the value of measuring "usability" of devices.

Comprehensibility
Few concerns were raised about the comprehensibility of either instrument. Two participants described difficulties understanding the SURG-TLX item 'temporal demands' and how it related to innovation rather than surgery in general.
Subjective instrument suitability for practical use
All participants described the SURG-TLX as the more suitable measurement instrument because it was perceived as more relevant and comprehensive, provided richer information about operator experience, and data collection was thought to be easier or more practical. Many (10, 50%) professionals felt the physiological components of the ISAT were "more objective", and two (10%) thought it may be a useful research tool, but still favored the SURG-TLX for the routine evaluation of surgical innovation.

Emergent themes
There were nine themes evident from the data that did not fit within the a priori framework (Table 4). Professionals discussed how several contextual factors in a real-world setting may influence the subjective experience with surgical innovation. For example, procedures occur in stages, only some of which may be innovative, and operator experience may be influenced by patient complexity and the wider operating team/environment. Professionals also felt that baseline attitudes towards the innovation, emotional factors and proficiency were important to consider, and that these are likely to evolve over time. Finally, two participants questioned the trustworthiness of professionals completing subjective self-assessments when the results could be reported to the wider surgical community.

DISCUSSION
Measuring and understanding operators' experience of innovation is consistent with recommended methods for the development and evaluation of novel invasive procedures and devices [7,18]. The IDEAL framework describes the process by which interventions move from first-in-human studies, through development phases before definitive randomized evaluation and long-term monitoring.
Key to the early phases of this process is a detailed understanding of the innovation to identify when and how modifications are necessary to drive optimization. Feedback from professionals about the physical and psychological experience of using novel procedures and devices may help identify beneficial modifications or better characterize the root cause of complication or failings. In later phases, where innovations have stabilized and no further modifications are necessary, measuring operators' experience may provide some indication of when novice surgeons have become comfortable and achieved some level of proficiency [43]. Selecting a suitable measurement instrument will ensure data collection is standardized and easily comparable. It may also benefit the translational pathway because innovative procedures that lead to 'good operator experience' are more likely to be subsequently used or undergo full evaluation.
This study used robust methodology in accordance with international guidelines [19,20], but there are some weaknesses. Identification of studies of surgical innovation is challenging, and multiple targeted reviews were used to overcome limitations of traditional search methodology in this context. It is possible that measurement instruments were missed, but it is unlikely that any additional instruments would significantly alter the conceptual framework that underpinned the appraisal process.
We modified COSMIN methodology for assessing content validity of measurement instruments and this may have affected the results. Step 3b, for example, was modified to include a subjective judgement summarizing all ratings for the evaluation of content validity to select suitable and sufficiently high-quality instruments to bring forward to interviews. It is possible that application of unmodified COSMIN methods would have brought forward more instruments for interview, but it was anticipated that these would have been less relevant, comprehensive, and comprehensible to the context of surgical innovation. Interviews in phase 3 also identified themes relevant to operators' experience not represented in the conceptual framework developed in phase 1. Refining the framework to include these themes may have caused some minor alteration to ratings in phase 2 but is not likely to have significantly changed the results. Supplemental validation of the highest quality instruments was completed using interviews with operators from a range of locations and specialties, but it was limited to professionals from high-income anglophone countries. Examining cross-cultural validity in non-English speaking and low- and middle-income countries will be required to ensure generalizability in those settings. Our work did not review the total body of evidence for each instrument because the included instruments were rarely used in the context of surgical innovation. Instead, multiple data sources were used to identify measures of operator experience used in studies of surgical innovation. This means that an assessment of the overall quality of available evidence (as determined by the GRADE approach through Step 3c of the COSMIN methodology) was not completed and therefore not considered during the selection of a suitable instrument in phase 2.
Synthesizing the total body of evidence for 16 instruments may have been valuable to explore instrument validity in different target populations but was considered unlikely to significantly change the conclusions of this review and exceeded the resources available to complete the work. There is potential value in completing a full, formal COSMIN review as the subject of future research, once instruments have been more widely validated in the context of a refined conceptual framework of surgical innovation; such a review may consider findings from the present work.
Content validity of the SURG-TLX has been established in the context of surgical innovation, with other instruments performing poorly. More research is now required to define other robust measurement properties. Further studies validating the use of the SURG-TLX for evaluating surgical innovation can enable calculation of test-retest, inter-rater and intra-rater reliability. Responsiveness to change can be evaluated by measuring operator experience before and after known improvements at different times through the innovation lifecycle, and interpretation of the SURG-TLX may be improved through studies that define the minimally important difference. Finally, interviews with professionals highlighted some deficiencies of the SURG-TLX with regard to assessing satisfaction and usability of innovative devices. Complementary use of relevant measures identified in this work (e.g. System Usability Scale) can be considered in specific contexts.
The recent development of the COHESIVE core outcome set for all studies of surgical innovation [18] is an important step to enable systematic evaluation of complex, novel procedures and devices.
International stakeholders agreed that operator experience is one of eight domains that are essential to measure and report in early phase studies. The present study completes a necessary step to operationalize the core outcome set; however, there is an ongoing need to establish the measurement of the other seven domains.
In conclusion, the SURG-TLX has sufficient validity to be preliminarily recommended for use in studies evaluating surgical innovation. Routine measurement of operators' experiences may facilitate optimization of novel procedures and devices to enable safe and efficient innovation.