Toward an Objective Assessment of Implementation Processes for Innovations in Health Care: Psychometric Evaluation of the Normalization Measure Development (NoMAD) Questionnaire Among Mental Health Care Professionals

Background: Successfully implementing eMental health (eMH) interventions in routine mental health care constitutes a major challenge. Reliable instruments to assess implementation progress are essential. The Normalization MeAsure Development (NoMAD) study developed a brief self-report questionnaire that could be helpful in measuring implementation progress. Based on the Normalization Process Theory, this instrument focuses on 4 generative mechanisms involved in implementation processes: coherence, cognitive participation, collective action, and reflexive monitoring. [...] The 4-factor model performed better than the hierarchical model (χ²₂=22.5, P≤.05). However, the difference is small and possibly does not outweigh the practical relevance of a total score and subscale scores combined in one hierarchical model. One item was identified as weak (λ CA.2=0.10). A moderate-to-strong convergent validity with 3 control questions was found for the Cognitive Participation scale (.47≤r≤.54, P≤.05). Conclusions: NoMAD's theoretical factor structure was confirmed in Dutch mental health settings to acceptable standards but with room for improvement. The hierarchical model might prove useful in increasing the practical utility of the NoMAD questionnaire by combining a total score with information on the 4 generative mechanisms. Future research should assess the predictive value and responsiveness over time and elucidate the conceptual interpretability of NoMAD in eMH implementation practices.


Background
More than 2 decades of research has shown that psychotherapy delivered through the internet, also referred to as eMental Health (eMH) interventions, can be an effective way to treat patients with common mental disorders such as depression and anxiety disorder [1]. Several examples of clinics routinely offering innovative and new eMH services exist, such as the Australian MindSpot clinic [2], GGZ InGeest Mindway [3] and Interapy in the Netherlands [4], Internetpsykiatr in Sweden [5], and Internetpsykatrien in Denmark [6,7]. Despite these examples, and although the technical infrastructure seems to be in place, large-scale use of eMH interventions in routine care is still lower than expected [8]. Knowledge on factors hindering or facilitating implementation is maturing [9,10]. However, measuring implementation outcomes reliably remains a challenge [11,12]. We conducted a psychometric validation study of a recently developed theory-informed implementation measurement instrument: the Normalization MeAsure Development (NoMAD) questionnaire.

Theoretical Underpinning
Various frameworks and theories for understanding implementation processes and evaluating outcomes exist [13,14]. For example, models such as the Knowledge-to-Action model [15] have been specifically designed to describe and guide implementation processes. Determinant frameworks such as the Consolidated Framework for Implementation Research (CFIR) [16] provide taxonomies of barriers and hindering factors to aid the evaluation of implementation outcomes. Similarly, the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) framework [17] summarizes key indicators for implementation success to inform policy and decision making. Classic psychological behavior change theories such as the Theory of Planned Behavior [18] have been used to study the role of attitudes and intentions in the behavior of individuals involved in and affected by implementation processes. Although such theories can be useful in describing behavior change mechanisms and explaining how change occurs in individuals involved in implementation processes, they focus on beliefs and attitudes and do not necessarily consider what people actually do when implementing innovations in health care practice. The Normalization Process Theory (NPT) [19,20] aims to fill this void by specifically looking at the process of implementation.
NPT (Figure 1 [19]) states that a normalization process is a process of embedding and integrating health care innovations in routine care as a product of action of individuals and groups. It focuses on the things that people individually and collectively do to normalize an innovation, that is, for it to become part of routine health care practice. NPT is a heuristic tool to understand the work of implementation, embedding, and integration of new practice and the contribution and roles of individuals and groups to this work. According to the theory, 4 mechanistic constructs play a central role in generating the work of implementation:
• Coherence (CO) of the innovation with the goals of daily routine. Individuals and groups go through a process of sense-making to establish the meaningfulness of the innovation for normal service delivery goals and practices.
• Cognitive participation (CP) as a process of enrollment and engagement of individual participants and groups involved in the implementation processes, through which they become committed to the normalization of the innovation.
• Collective action (CA) by individuals and groups to apply the innovation in daily routine. Here, applying an innovation has certain implications as to what and how normalization should be achieved, which requires investments of effort by the participants.
• Reflexive monitoring (RM) through which participants in the implementation process evaluate and appraise the use of the innovation in practice.
These 4 constructs are influenced by group processes and social conventions as well as the organizational factors and social structures people operate in. In turn, this social and organizational context defines factors that promote or inhibit the work of individuals and collectives in implementing innovations in daily routines.
Earlier work showed that NPT has good face validity in designing and evaluating implementation processes of innovations [21]. A recent literature review of 108 studies indicated that NPT successfully aids in the conceptual understanding of implementation processes and outcomes across a wide variety of health care settings [22]. In alignment with the general approach of NPT, the NoMAD study developed a brief self-report questionnaire for the purpose of determining factors likely to affect normalization processes [23-25]. Ultimately, the questionnaire aims to enable (1) assessment of progress toward normalization over time in an implementation project and (2) comparison of normalization (progress or outcomes) between sites in multicenter studies. The NoMAD is intended to be used by people involved in the implementation of innovations in a health care setting and aims to be neutral to the implementation object. The target populations of the instrument are the deliverers and facilitators of the innovation being implemented, such as medical specialists, general practitioners, therapists, nurses, administrators, and managers.
A pool of 46 construct items was generated, appraised, and validated in 5 UK samples and 1 Australian sample of health care staff (pooled n=413) involved in 6 different implementation projects [23-25]. A psychometric evaluation of the initial item pool resulted in a 20-item questionnaire of which the theoretical model approximated the data acceptably and appeared to have good internal consistency (total Normalization Process Scale [NPS]: alpha=.89, CO: alpha=.71, CA: alpha=.78, CP: alpha=.81, RM: alpha=.65) [25].

Objective
We translated the questionnaire into Dutch and aimed to confirm the theoretical factor structure in mental health professionals working to implement eMH in Dutch mental health care settings. We tested 3 factor structures: (1) a 4-factor model to confirm the theorized model, summarizing item scores per construct; (2) a unidimensional model to test whether the items in the questionnaire can be summarized by 1 single factor score; and (3) a hierarchical model to test whether the 4 first-order factors can be expressed in 1 second-order factor. Where the first model aims to capture a more detailed view on implementation processes, the second model might lend support for practical comparison of those processes. The third model might provide a more detailed understanding of normalization processes on the construct level combined with the practical value of the overall total normalization score in 1 measurement model. Conforming to the English validation study [23-25], we explored potential improvements and the questionnaire's convergent validity with 3 control questions.

Sample and Recruitment
Using a cross-sectional design, mental health professionals with various occupational backgrounds involved in implementing eMH in Dutch routine mental health care practices were invited to complete the NoMAD questionnaire. We defined involvement in implementation as the situation in which respondents were in the early stages of using eMH in their occupational tasks. By this definition, applying such interventions in routine care was assumed to be novel to the respondent. Following the English NoMAD study, an open sampling strategy was applied to obtain a sample of 300 respondents. Considering the commonly applied rule of thumb of 7 to 10 complete cases per item with a minimum of 100 complete cases, we expected this target sample size to provide satisfactory statistical power and precision for estimating the model's parameters [26,27]. Recruitment targeted mental health professionals involved in using novel eMH interventions in (1) primary care for patients with mild symptomatology (general practitioners or general practice-based mental health nurse specialists), (2) basic care for patients with moderate symptomatology, and (3) specialized care provided by specialists to patients with severe mental health complaints. A total of 3 groups of Dutch mental health professionals were identified as suitable for recruitment:
• Group 1: mental health care professionals in 4 large regional mental health organizations for common mental disorders and post-traumatic stress disorders.
• Group 2: general practice-based mental health nurse specialists, in the context of the national electronic health (eHealth) Monitor survey conducted in 2016 for which panels and professional associations were sampled [28].
• Group 3: attendees of the annual CBT conference.
A total of 3 different recruitment strategies were applied. Sample 1 was obtained through convenience sampling by which participants were recruited through key contact persons in various mental health organizations. Sample 2 was obtained through existing respondent panels and professional associations in the context of the national eHealth survey. Participants for samples 1 and 2 were invited by email providing general information about the study, a link to more in-depth information, and an anonymous link to the Web-based survey. Sample 3 was recruited through an information kiosk and leaflets at the annual CBT conference.

Translation
The classical Brislin approach to questionnaire translation [29] was used to translate the English NoMAD questionnaire into Dutch. A small sample (N=3) of experts in implementing and using eMH interventions was asked to verbalize their thoughts while interpreting the translated items in a cognitive group interview [30]. The interview focused on the interpretation of the questions, the response scales, and the identification of terms that needed to be adjusted and/or rephrased. Problematic items were rephrased to form the final version of the Dutch NoMAD instrument. This version was back translated by a blinded professional translator, after which the principal investigator (TF) of the English NoMAD confirmed the semantic equivalence of the corresponding individual items. The final Dutch translation of the questionnaire is included in Multimedia Appendix 1.

Data Collection
The questionnaire was administered via a commercial Web-based survey system (NETQ Internet Surveys 6.5 [31]). The research team tested the survey for sequencing of the items, technical reliability, and data export procedures. Participants were asked to provide consent for using their (anonymized) data in this study. They provided this digitally through the survey platform before they were allowed access to the survey.

Normalization MeAsure Development Questionnaire
The NoMAD questionnaire in this study consisted of 3 parts: Part A tapping basic demographic information, Part B collecting general normalization ratings about the current use and likelihood of using the intervention in the future, and Part C comprising 20 items measuring the 4 NPT constructs. Users of the questionnaire are required to tailor the implementation object (ie, intervention) to the context of its application. In this study, the term "[the intervention]" was replaced with "eMental health."
Part A: Demographic variables. In line with the English NoMAD, basic demographic variables were included in the first part of the questionnaire, including gender, age, years of working experience, professional job category, and relevant care sector.
Part B: General normalization items. Part B contained 3 questions addressing perceptions of respondents regarding past, current, and future normality of the intervention. The 3 questions were scored on a 1 to 10 Visual Analogue Scale [32]. To increase comparability to the UK study, these 3 items were added to the questionnaire as control questions to assess its convergent validity; that is, the 3 questions are not to be regarded as an integral part of the core of the NoMAD questionnaire [23-25].
Part C: NPT constructs. Part C consisted of the 20 items representing the NPT constructs in 4 subscales with the following allocation: CO: 4 items; CP: 4 items; CA: 7 items; and RM: 5 items. The 20 original items are listed in Textbox 1.
The items were rated on a 5-point Likert scale (1=completely agree to 5=completely disagree), with an additional response option to indicate that a statement was not applicable (0=not applicable). Item 2 (CA.2) in the CA scale is negatively formulated; all other items are positively formulated. Respondents were required to rate all statements.
The Dutch translations are in Multimedia Appendix 1.

Scoring
Scale scores were calculated by taking the mean of the answered items of a scale. A minimum of 2 items within a scale had to be rated to calculate a scale score. Items rated as "not applicable" were excluded from the calculation. The total NPS score was calculated by taking the mean of all answered items; for this score, cases were considered complete when they had less than 15% missing data.
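The scoring rules above can be illustrated with a minimal Python sketch. The item labels (for example "CA.2"), the dictionary layout, and the function names are assumptions made for illustration, as is the choice to count "not applicable" ratings as missing for the total score; the sketch applies no reverse scoring because the text does not describe any.

```python
from statistics import mean

# Hypothetical item labels following the Part C allocation (CO: 4, CP: 4, CA: 7, RM: 5 items).
SCALES = {
    "CO": [f"CO.{i}" for i in range(1, 5)],
    "CP": [f"CP.{i}" for i in range(1, 5)],
    "CA": [f"CA.{i}" for i in range(1, 8)],
    "RM": [f"RM.{i}" for i in range(1, 6)],
}
ALL_ITEMS = [item for items in SCALES.values() for item in items]


def rated(responses, items):
    """Return the 1-5 ratings for the given items, skipping 0 (not applicable) and absent items."""
    return [responses[i] for i in items if responses.get(i) in (1, 2, 3, 4, 5)]


def scale_score(responses, scale, min_items=2):
    """Mean of the rated items of one scale, or None when fewer than min_items were rated."""
    values = rated(responses, SCALES[scale])
    return mean(values) if len(values) >= min_items else None


def total_score(responses, max_missing=0.15):
    """Mean of all rated items, or None when the share of unrated items is not below max_missing."""
    values = rated(responses, ALL_ITEMS)
    missing_share = (len(ALL_ITEMS) - len(values)) / len(ALL_ITEMS)
    return mean(values) if missing_share < max_missing else None
```

With 20 items, this strict reading of "less than 15%" accepts up to 2 unrated items; allowing exactly 15% (3 items) would require changing the comparison to `<=`.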

Data Analyses
Descriptive statistics were calculated to summarize the item and scale scores. Internal consistency of the total score and the 4 theoretical constructs was analyzed by calculating the Cronbach alpha [27] for the pooled dataset. The quality of the construct structure was further assessed by applying a confirmatory factor analysis (CFA) using Structural Equation Modelling (SEM). A total of 3 models were evaluated: (1) the theorized 4-factor model, (2) a unidimensional model, and (3) a hierarchical model. All 3 models included the 20 items from Part C of the questionnaire. The items were scored on a 5-point Likert scale, resulting in ordinal data. The sum scale score of the items approximates a continuous scale, for which we expected the latent constructs to be normally distributed. The CFA was run with the robust Weighted Least Square Means and Variances (WLSMV) estimator using polychoric correlation matrices [26]. Model fit was assessed by estimating the misfit between the observed and implied covariance matrices using the chi-squared test (χ²≤3df). This was supplemented with 4 other fit indices to strengthen the basis for our conclusions: the Standardized Root Mean Square Residual (SRMR≤0.08) as an absolute index of the average discrepancy between the correlations in the implied model and the observed data; the Root Mean Square Error of Approximation (RMSEA≤0.06) providing a population-based badness-of-fit indication corrected for model complexity; the Comparative Fit Index (CFI≥0.95) providing an index of goodness-of-fit relative to a null model (ie, no covariances between items); and the Tucker Lewis Index (TLI≥0.95) as an index of goodness-of-fit relative to a null model corrected for model complexity [26,33,34]. The 3 models under evaluation are expected to be nested. We applied the scaled chi-square difference test (χ²diff test, analysis of variance) to compare the fit of the 3 models [26].
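As an illustration of the internal consistency statistic mentioned above, the following Python sketch computes the Cronbach alpha from complete cases using the standard formula alpha = k/(k−1) × (1 − Σ item variances / variance of the sum score). The function name and the list-of-rows data layout are assumptions for illustration; this is not the analysis code used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for complete cases.

    rows: list of respondents, each a list with one rating per item (k items, k >= 2).
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the respondents' sum scores.
    sum_score_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / sum_score_variance)
```

For example, `cronbach_alpha([[1, 2], [2, 1], [3, 3], [4, 5]])` evaluates to 8/9 (about .89). Dropping one column before calling the function gives the "alpha-if-item-dropped" variant.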
Potential improvements to the factor structure were explored by identifying low item-factor loadings (λ<0.3) to ensure that items are meaningfully related to the respective factors [26]. Modification indices (MIs; χ²diff≥3.84) were assessed to identify item-item error covariances that might improve the model fit.
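The two thresholds in this paragraph can be related to familiar quantities. The sketch below, using only the Python standard library, shows that 3.84 corresponds to the critical chi-square value for 1 degree of freedom at a significance level of .05, and that a standardized loading of 0.3 corresponds to 9% explained item variance. The variable and function names are illustrative assumptions, and the second function presumes standardized loadings with each item loading on a single factor.

```python
from statistics import NormalDist

# A chi-square variable with 1 degree of freedom is a squared standard normal variable,
# so its .05 critical value equals the squared two-sided z quantile (about 1.96 squared).
z_critical = NormalDist().inv_cdf(1 - 0.05 / 2)
mi_cutoff = z_critical ** 2  # approximately 3.84


def explained_variance(standardized_loading):
    """Share of an item's variance explained by its factor (squared standardized loading)."""
    return standardized_loading ** 2
```

Under these assumptions, `explained_variance(0.3)` is approximately 0.09, so the λ<0.3 rule flags items for which the factor explains less than about 9% of the variance.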
In the absence of a gold standard for the assessment of normalization, we exploratively used the 3 general normalization items (part B) to assess the convergent validity of the theorized model. We assessed the Pearson correlation coefficients for all 4 constructs and general normalization items and applied the following strength indicators for the correlations: 0≤r<.3 is weak, .3≤r<.5 is moderate, and r≥.5 is strong [35]. These quality indicators were applied in all correlation assessments.
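A minimal Python sketch of the correlation assessment described above is given below. The function names are hypothetical, and classifying negative coefficients by their absolute value is an assumption, because the stated thresholds only cover non-negative values of r.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Label a correlation as weak (<.3), moderate (.3 to <.5), or strong (>=.5)."""
    size = abs(r)  # assumption: negative correlations are judged by magnitude
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"
```

In use, the scale scores of one construct and the ratings on one general normalization item would be passed to `pearson_r` for respondents who have both values, and the result would be labeled with `strength`.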

Sample
Over a period of 10 months (May 2016 to February 2017), 262 respondents completed the questionnaire. Table 1 provides an overview of the samples and participant characteristics. On a pooled level, participants were on average middle-aged (mean age 45, SD 11), mostly female (81.7%), and the majority had over 11 years of working experience in their respective fields (52.9%). The response rate for group 2 was 22.8% (125 out of 547) [28]. For sample groups 1 and 3, response rates are not available because of the convenience and open sampling approach. The time required to complete the questionnaire was 7.56 min on average (SD=6.48, n=134, based on questionnaire log files). Figure 2 shows the distributional characteristics of the scale scores for the combined samples. The 4 subconstructs (CO, CP, CA, and RM) and the NPS follow similar response patterns.

Scale scores
Considering the length of the boxplot for the scales, respondents vary less in responses to items for the CO construct and more for CP and CA. The distributions of 3 subscales appear to have a slight tendency toward agreement with item statements where CA received mostly neutral responses. Most outliers are in the disagreement end of the scales. Table 2 shows the mean scale scores, indicating that respondents on average agreed with the item statements. Respondents disagreed considerably with item CA.2, indicating that they did not find the intervention disruptive to working relations (Figure 3).
Table 2 footnotes: (a) For the total NPS scale, a maximum of 15% missingness was allowed. For the subscales, a minimum of 2 rated items was needed to calculate a mean. (b) n varies because of item nonresponse. (c) Low and High represent the lowest (1) and highest (5) score, respectively, rated by the respondents.

Internal Consistency
Considering the number of items, the internal consistency of the translated NoMAD questionnaire is good for the total score (alpha NPS=.85) and ranges from questionable to acceptable for the subscales (.62≤alpha≤.75; Table 3). Internal consistency improved to good when items were dropped.
Table 4 summarizes the CFA results and the fit indices for the 3 models: (1) the first-order 4-factor model in which normalization is defined by 4 correlated constructs, (2) the first-order unidimensional model, and (3) the hierarchical model in which a second-level factor accounts for the correlations among the 4 first-order factors. Considering the number of items, all 3 models fitted the data reasonably well. Both the 4-factor model and the hierarchical model represented the observed data significantly better than the unidimensional model (respectively: χ²₆=220.7, P≤.05, and χ²₄=198.1, P≤.05). The 4-factor model performed better than the hierarchical model (χ²₂=22.5, P≤.05) with less discrepancy between the obtained and implied data (χ²₁₆₄=559.7, SRMR=0.12), better fit per variable (RMSEA=0.10), and better fit relative to a baseline model (CFI=0.90, TLI=0.88). Notwithstanding the significance, the difference for the chi-square test statistic and the fit indices is small and potentially does not outweigh the practical relevance of a total summary score and subscale scores combined in one measurement model. Therefore, the factor structure of the hierarchical model is displayed in Figure 4.
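The degrees of freedom attached to the chi-square statistics above can be retraced with a short counting sketch in Python. It uses conventional bookkeeping for a CFA with 20 indicators (20 loadings and 20 residual variances, with factor variances fixed for identification); this parameterization and the function name are assumptions for illustration, and software for ordinal data may arrive at the same counts through a different parameterization.

```python
def cfa_df(n_items, n_free_parameters):
    """Degrees of freedom: unique variances and covariances minus free parameters."""
    unique_moments = n_items * (n_items + 1) // 2
    return unique_moments - n_free_parameters


N_ITEMS = 20
BASE = N_ITEMS + N_ITEMS  # 20 loadings + 20 residual variances in every model

four_factor = cfa_df(N_ITEMS, BASE + 6)   # plus 6 covariances among the 4 factors
hierarchical = cfa_df(N_ITEMS, BASE + 4)  # plus 4 second-order loadings
unidimensional = cfa_df(N_ITEMS, BASE)    # single factor, no additional parameters

print(four_factor, hierarchical, unidimensional)
print(unidimensional - four_factor,
      unidimensional - hierarchical,
      hierarchical - four_factor)
```

Under these assumptions, the counts are 164, 166, and 170, and the differences of 6, 4, and 2 match the degrees of freedom of the reported difference tests.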

Potential Model Improvements
We explored possibilities to improve the measurement accuracy and reliability of the proposed hierarchical model. From evaluating the item-factor loadings, it can be concluded that item CA.2 has a weak relationship with CA (λ=0.10), indicating that less than 1% of the variance in this item is explained by this factor. This is confirmed by the "Cronbach alpha-if-item-dropped" statistic indicating an improvement in the measured internal consistency of the theorized model when this item is removed (Table 3). MIs were consulted for the 4-factor model, and 2 error term covariances, CA.3 to CA.4 (MI=76.9, δ=0.56) and CP.3 to CP.4 (MI=51.1, δ=0.59), were identified as potential improvements to the model. For indicative purposes, the CFA was performed for an adapted hierarchical model in which the weakest item (CA.2) was removed and the 2 error term covariances were added. The modified model performed slightly better than the unmodified models (Table 4).

Convergent Validity
Following the UK study, we also explored the convergent validity of the original 20-item 4-factor model by correlating the observed mean factor scores with the mean scores for the 3 general normalization questions. Table 5 summarizes the findings. Weak correlations with the general normalization items were found for CO, CA, and RM (−.02≤r≤.27, .09≤P≤.81). The CP subscale had a moderate-to-strong correlation with the normalization items (.42≤r≤.59, P≤.05).

Principal Findings
Accurate and reliable instruments for measuring implementation factors and progress are currently few but required to improve the uptake of eMH interventions in routine care [11,43,44]. For this study, we translated NoMAD from English to Dutch and sought to confirm its theorized 4-factor structure in mental health care settings.
Our study suggests that the NoMAD can be used reliably in assessing normalization processes in Dutch mental health care settings. Using structural equation modelling, the CFA showed that the 4-factor model fitted the observed data best. This finding points in the same direction as the English psychometric study (CFI=0.95, TLI=0.93, RMSEA=0.08, SRMR=0.03, estimator: maximum likelihood) [25]. The hierarchical model might prove useful in increasing the practical utility of the NoMAD questionnaire. It offers implementation practitioners and researchers an overall normalization score enabling comparisons across implementation projects. In addition, the subscale scores can provide a more fine-grained understanding of normalization processes and aid in identifying specific areas for improvement.
Considering the factor loadings of both the 4-factor and hierarchical models and the measured internal consistency, improvement of the theorized model seems desirable. Most notably, the explained variance in item CA.2: [the intervention] disrupts working relationships, was below validation standards (see Figure 3). Even though the extent to which people are using eMH interventions in practice might influence their perceived disruptive nature, a possible explanation might be found in the limited variance in ratings for this item as 70% of the respondents strongly disagreed with this item. This might stem from differences in linguistic interpretation by the respondents. For example, it could be that the translation of the term disrupt in CA.2 has a more negative connotation among the Dutch respondents than it has among English native speakers, leading to a tendency toward negative responses in the Dutch sample. However, this is speculative, and we feel it is too early to discard the item. We suggest further deliberation on the theorized meaning of the latent and observed variables to determine the influence of sample characteristics, implementation objects, and linguistic differences in the item formulation before conclusive decisions about possible improvements to the theorized model can be made [26,33]. In addition, we observed an increase of responses rating items in the RM scale as not applicable. Given the novelty of eMH to the care setting, it might be that the respondents have had limited exposure to the intervention to reflect on its implementation. This corresponds with a sequential interpretation of the NPT constructs but is not the only approach to the ordering of NPT mechanisms [22], and we did not measure the stage or type of implementation trajectory that respondents were currently engaged with, making it too early to draw any conclusions on the item response patterns at this stage.

Limitations
In view of the heterogeneity in implementation objects and health care settings included in current and previous NoMAD validation studies, the relevance of items should be taken into account when administering the questionnaire to specific groups of respondents [24,25,45]. Although an open recruitment strategy was used for this study, it may be that the respondents had a natural inclination to partake in research or had a pioneering standpoint toward implementing eMH. This could have led to certain trends in the data that are not necessarily representative of the wider mental health care community involved in implementing eMH interventions. In this respect, it must be noted that sample groups 1 and 3 (mental health specialists and attendees of a national annual CBT congress, respectively) were convenience sampled. Sample 2 (48% of the pooled sample used in the CFA) was obtained through surveying members of a national professional association of general practice-based mental health nurse specialists [28]. We aimed to reduce selection bias by including these 3 sampling sources, but results need to be interpreted with care because of the open design.
For the questionnaire, a 5-point Likert scale has been used for scoring the items. It goes beyond the scope of this study to discuss the consequences of this choice in depth, but it is important to note that we approached the individual responses as ordinal data because the numbers in Likert scales represent verbal statements and not numeric entities. The mean is commonly applied to summarize data allowing for calculating SDs and CIs. However, these indicators can be biased by outliers in non-normal distributions, possibly resulting in a distorted indication of the centrality of the data [46]. In addition, the approach to item nonresponse (or missingness) should be considered. Item nonresponse means that even though the respondent has participated in the study, data for certain items are unavailable for analysis [47]. In this study, respondents needed to either rate their agreement with the NoMAD items or indicate the item as not applicable.
For calculating a scale score, 2 rated items per scale were required. This approach might be limited in informing normalization processes, as 50% of the items in the same scale (more for scales of more than 4 items) could be rated as not applicable. One possibility to reduce this nonignorable form of nonresponse is to apply a forced-choice approach by removing the option for respondents to rate an item as "not applicable" from the questionnaire. However, there is a risk that forcing a rating might lead to an artificial response when a respondent feels they do not have a choice. Another possibility is to apply a theoretically defined and empirically confirmed cutoff for allowable missingness in calculating the scale scores.
A further methodological limitation of this study relates to the fit indices used to evaluate the models in the CFA. As the fit indices we used were developed for maximum likelihood-based parameter estimators, they should be interpreted with caution for ordinal data using robust Weighted Least Square estimators such as the WLSMV that we applied. It is argued that the distribution of the data and sample size have a consistent influence that might lead to overestimation of fit indices with ordinal data [48].

Future Research
With this study, we have successfully contributed to the ambition of NoMAD in delivering a generic implementation measurement instrument for measuring normalization processes across different health care settings, including mental health [11,12,23-25,49,50]. Future research should assess relative predictive value and add to the practical interpretability and utility of the questionnaire. The hierarchical model provides the added value of a singular score for situations that require comparative evaluations of different implementation processes, while retaining the possibility to assess context-specific implementation processes at the construct level for understanding where implementation challenges exist in the development of effective and efficient implementation activities.
However, and although interpretability of the sub-scale scores and the total NPS score does make sense from a mathematical perspective, the meaning and normativity of the scores in practice need to be established before these scores can serve implementation research and practice meaningfully. Future research should establish normative data and assess the implied factor structure of the hierarchical model in different datasets.
To increase comparability with the UK psychometric study, the 3 general normalization items were added to the questionnaire solely for assessing convergent validity [23-25]. Although this gives some indications of correlation of the NoMAD items with similar scales, the status of these 3 items is unclear. They do not belong to the core of the questionnaire, and users are advised to disregard them. Instead, different measures of comparable constructs should be examined to establish a stronger assessment of convergent validity. Preferably, a multitrait-multimethod matrix should be used to strengthen conclusions about construct validity by using different methods such as organizational data on normalization success [34,51,52].
Test-retest reliability should be assessed to examine responsiveness of the questionnaire over time, to establish the ability of the questionnaire to measure changes when they occur. Responsiveness can be considered a measure of longitudinal validity and can be assessed by testing predefined hypotheses about expected differences in changes between known samples at different time points [27]. As the duration to achieve implementation success can vary across contexts, implementation objects, and implementation activities, careful consideration is needed regarding an appropriate time frame for repeat testing to assess responsiveness of the NoMAD questionnaire [53]. In the ImpleMentAll project (study protocol forthcoming), a large-scale, stepped-wedge randomized controlled trial, NoMAD is used to measure change in normalization processes over time to test the effectiveness of tailored implementation compared with usual implementation activities for eMH interventions.

Conclusions
Accurate and reliable assessment of implementation processes is needed to advance the implementation of eMH interventions in routine care. The translated NoMAD questionnaire proves to be a promising instrument for measuring implementation processes of innovative interventions in Dutch mental health care settings. The theorized 4-factor model approximated the observed data acceptably, but there is room for improvement. The hierarchical model might prove useful in increasing the practical utility of the NoMAD questionnaire. Future research should add to the practical utility of the questionnaire by establishing normative data and assessing the relative predictive value and responsiveness of the questionnaire over time.