Test sensitivity in assessing competencies in nursing education

The identification of effects of vocational education and training conditions on competence development in nursing education requires longitudinal studies. An important precondition is the availability of a test of nursing competence which is economical in use, measures a homogeneous construct throughout years of nursing education and across nursing specializations, and can detect increases in the required competence, hence allowing for sensitive testing. This article describes a cross-sectional study that aimed to optimize a computer-based test measuring nursing competence in care for the elderly—the TEMA test—through the selection of items on the basis of measurement error, differential item functioning, and item difficulty. Evidence of the test sensitivity of the optimized TEMA-L instrument is presented for the second and third year of nursing education. The total sample consisted of n = 133 German nursing students from clinical and geriatric nursing. The resulting instrument includes two test booklets consisting of 36 (WLE = 0.72) and 35 items (WLE = 0.70) respectively for the second and third year of training. The cross-sectional data indicate that the test likely has good properties for sensitive testing of nursing competence in a future longitudinal study. Hence, it might be used to study factors contributing to increases in nursing competence in German VET and serve as an example for similar studies in other countries. Limitations of the current study and related subjects of future research are discussed.

the quality of instruction (Naumann et al. 2019) in both the theoretical and the practical sphere (Deutscher and Winther 2018). Naumann et al. (2017) consider test sensitivity, understood as the overall variation of test scores across time points or groups, to be a prerequisite for identifying instructional sensitivity. This concept, which is the focus of our paper, implies that the test measures growth on a homogeneous construct over time.
In contrast, item sensitivity, understood as a relative measure, can be defined as the degree to which the sensitivity of the respective item deviates from overall test sensitivity; it is usually measured through differential item functioning (DIF; Naumann et al. 2019) and should be low in a sensitive test.
There are only a few existing domain-specific competence testing measures for VET that are suitable for larger samples (Abele et al. 2021) and allow for longitudinal application (e.g., Deutscher and Winther 2018). This is particularly true of nursing education, which in Germany is mostly conducted in non-academic settings and, while not officially part of the dual system of VET, is also an example of a dualistic non-academic form of VET with school-based instruction on the one hand and practical on-site training in care institutions on the other (Bals and Wittmann 2009; Lehmann et al. 2014). In this field, most of the internationally available measurement instruments for competencies have for many years consisted of either self-reports (Wu et al. 2015; Yanhua and Watson 2011) or clinical evaluations in real-world settings (e.g., objective-structured clinical examinations; see Solà-Pola et al. 2020). There has been a lack of systematically and consistently developed, valid and reliable assessment instruments in clinical practice (Immonen et al. 2019). Whereas examining nursing competence in real-world situations is preferable to self-reporting in terms of validity (Kajander-Unkuri et al. 2016), it is not only inefficient with larger samples but also deficient in terms of standardization and reliability, particularly in the case of repeated long-term testing. This is likely a reason why longitudinal studies are rare (e.g., Fan et al. 2015). One way to address these issues is through computerized testing, which the National Council Licensure Examination (NCLEX) requires for nursing licensure in the United States. To address standardization issues in these admission examinations for vocational nursing practice, Woo and Dragan (2012) carried out item sensitivity analyses for content relevance to subgroups based on DIF analyses.
However, we could not find any study of nursing competence in the international and national literature conducted with the purpose of testing this construct across years of nursing education or even preparing for its sensitive and economical longitudinal testing. We aim to lay the foundation for such testing in the study presented in this paper.
To address issues of valid and reliable testing of nursing competencies in larger samples, we developed a computer-based test on nursing competence in care for the elderly using a video-based situational judgment approach. We reported in Kaspar et al. (2016) on the measurement quality of the TEMA test in a calibration study, using empirical evidence from a cross-sectional large-scale assessment with 402 geriatric nursing students at the end of nursing education. The test construction supports its curricular and content validity to test nursing competence across geriatric and clinical nursing. However, we were not able to examine its suitability for testing across years of nursing education. Hence, the TEMA test could be used reliably to determine and compare the results of apprentices in geriatric nursing at the end of VET across the expected capability range for students but not (yet) to determine progress throughout VET. In addition, the TEMA test comprises 77 items, requiring almost two hours of testing time, which restricts its economical application in combination with other instruments, such as measures of the quality of VET.
With the cross-sectional study presented in this paper, we therefore aim to further optimize this computerized instrument in two ways. This involves, first, enhanced test economics through a reduction in the number of items and, second, the design of an instrument that allows for tracing progress on a homogeneous core construct of nursing competence in care for the elderly. In preparation for a future longitudinal study, we present evidence of the intended test sensitivity of the TEMA test for the second and third year of nursing education and across nursing specializations (clinical and geriatric nursing). Hence, our research questions in this study are whether it is possible (1) to create an economical short form of the TEMA test providing for acceptable reliability, (2) to maximize test sensitivity by reducing the number of items whose relative item sensitivity deviates substantially from overall test sensitivity by applying differential item functioning (see Naumann et al. 2016, 2017), and (3) to create a test enabling us to account for increases in achievement according to years of education, specifically to avoid floor effects. We pursue these targets while at the same time maintaining curricular and content validity and being fair across years of nursing education and nursing specializations. The purpose is to create an economical, reliable, and homogeneous test in which item difficulties balance out across the test for these subgroups, that is, preconditions other than increasing overall achievement on the core construct. With the resulting instrument, it should be possible to examine its aptitude for longitudinal analysis of competence development or to establish instructional sensitivity in a future study, for example by linking test results to the quality of VET (Wittmann et al. 2022; see Naumann et al. 2019).

The TEMA test
Against the background of the increasing relevance of care work for the elderly, implying specific foci such as multimorbidity or cognitive decline, we developed the TEMA test in order to evaluate the learning outcomes for nursing students regarding care for the elderly. To achieve this goal, we proposed a conceptual model of geriatric care competencies to guide the selection of a set of care situations and specific nursing behaviors for competence testing and to define a statistical model for estimating proficiency on the basis of test data. The TEMA test refers to competent action and interaction with care recipients and family members. The instrument is intended to acknowledge care as a continuing mutual relationship with the care recipient and to align with the central elements of the care process, including diagnosis, intervention, and reflection.
The test is provided in the form of a video-based situational judgment test. Since competence assessment relies critically on the adequate representation of situations calling for the required behavior, we defined and validated a sampling space of everyday demands and challenges in care for elderly persons by means of systematic curricular analysis and expert interviews and refined it on the basis of Hundenborn's (2007) concept of care situations. The test environment provides a set of care situations from three institutional fields of practice covering three major incidents of care for the elderly (dementia, chronic diseases, end of life): (1) long-term group care (LTC) for patients with dementia (dementia hostel), (2) outpatient care (OTC) with a focus on chronic diseases and multimorbidity, and (3) institutional palliative care (PAL); they include five hypothetical care recipients with multiple care needs as cases. Within the fields of practice, we developed an overall set of twelve situations referring to care affordances identified as typical on the basis of curriculum analyses and expert interviews, such as wound and pain management, care planning, nutrition counseling, and emergency measures, among others, providing for item prompts. The situations were transformed into short video sequences of about 1 or 2 min each, with the filming monitored by trained nurses to enhance authenticity of the settings and the acting. Curricular and content validation of the test comprised the breadth of nursing education relevant to care for the elderly in Germany, meaning geriatric and clinical nursing, as well as a generalized program curriculum comprising both specializations since 2020 (see Wittmann et al. 2022). Table 1 provides an overview of the institutional fields of practice, major incidents, and situations.
High test proficiency levels should represent respondents' complex cognitive appraisals, which can serve as a basis for nursing activity in real care situations, including interaction and communication. We thus operationalized them with systematic reference to recognition of emotion, communication of empathetic understanding, and control of emotional expression (for details, see Kaspar and Hartig 2015), as well as bioscience knowledge required for competent diagnostics of the care recipients (Abele et al. 2021).
Item formats cover typical care activities (Fichtmüller and Walter 2007), such as the selection of one of several possible appraisals of situations or states (e.g., Which information will you make use of …?), behaviors and action plans (e.g., How would you respond to …? How would you proceed …? How would you prioritize …?), or evaluations of observed behavior (e.g., How would you evaluate/interpret …?). During test construction, the video-based situational stimuli were checked by nursing students in two pilot studies for issues such as undue exaggeration, stereotyping, and lack of consistency with the adjoining test items. To provide for standardized scoring, item responses must be given in closed format (multiple choice, true-false, image map, right order). In Table 2, we list the number of situations and items as well as maximum point scores for each of the curriculum- and content-validated activity fields in the TEMA test. Since respondents can achieve up to three points for items in the true-false format, a maximum score of 95 points is possible for the entirety of 77 items.
To estimate the psychometric qualities in the original calibration study, we asked 402 geriatric nursing students from 24 German schools at the end of VET to respond to the computer-based test. Multi-dimensional item response theory (IRT) modeling served as a means of estimating proficiency. The standardized computer-based testing (CBT) measures nursing students' client-directed care competence with acceptable precision (WLE = 0.76) in an optimized test version using 64 items, and does so across the whole range of observed proficiency levels. Test items from all proposed institutional fields of practice substantially contribute to the overall test reliability, supporting its structural validity (Messick 1987, 1995). It should be noted that the test can be expected to be rather demanding, since test subjects in the original calibration study attained only 45% of the maximum test score.
Another recent cross-sectional study carried out by Ries (2020) on a sample of 408 students in clinical nursing supports the conclusion that the test can be meaningfully applied in clinical nursing as well; it possesses even higher reliability (WLE = 0.87) than in the calibration study, which may indicate that the test works slightly differently in geriatric nursing than it does in clinical nursing. Similarly to the original calibration study, attainment averaged 44% of the maximum score, raising the question of how to apply the test meaningfully to second- or first-year students while avoiding floor effects. The findings therefore underscore the need to determine how items can be selected for longitudinal testing with the TEMA assessment for the breadth of non-academic nursing education as it relates to the elderly.

Methods
We aim to select items, particularly anchor items, to be able to use the TEMA test efficiently for the purpose of a future longitudinal study, while at the same time validating its fit for students in both clinical nursing related to care for the elderly and geriatric nursing. To achieve this goal, we merged two samples from clinical nursing and geriatric nursing respectively, leading to an overall cross-sectional sample of 133 nursing students. Our sampling strategy involved selecting two classes per year of nursing for each of the subsamples, with second- and third-year data collected at the same schools, in order to create comparable data sets for these nursing education subgroups. The combined sample slightly skews towards geriatric nursing (57.1% vs. 42.9% as opposed to 51.5% vs. 48.5% in federal nursing student population data). As the test had proved in the previous studies to be rather difficult for students at the end of nursing education, we expected the test might be too difficult for first-year students and therefore did not include them in the study. Since class sizes are mostly smaller in the third than in the second year due to dropout, roughly 60% of the students in the overall sample were in the middle of their second year when the test was conducted. Our final sample largely matches the sample from the original calibration study, where 83.1% of respondents were female and the age was heterogeneous, varying from 19 to 54, with an average of 29 years, and 29.4% of respondents originated from families in which languages other than German were spoken. With regard to the overall nursing student population, male student nurses were somewhat underrepresented (25.1% in the overall nursing student population; BMBF 2021), and the oldest age group (> 25) was, due to its high share in the geriatric nursing subsample, considerably overrepresented (27.8% in the overall nursing student population; Federal Statistical Office of Germany 2021).
While gender played no role as an explaining factor in the original calibration study, proficiency slightly increased with age. This may point to larger issues of heterogeneity, particularly in geriatric nursing, and should be taken into account when interpreting our results. Table 3 provides an overview of the sample. Congruent with previous findings, respondents averaged 42.26 of a maximum of 95 points on the test.
First, we used one-dimensional Rasch modeling and iteratively excluded items with measurement error in mind, particularly those with low item-total correlation, while preventing one-sided item exclusion by analyzing distribution across the fields of practice and the situations in the TEMA test. Second, we carried out differential item functioning (DIF) analysis to ensure that the test items could be used for sensitive testing in the second and third year of nursing education while being fair across the different nursing education specializations, meaning geriatric and clinical nursing education. To assess subgroup invariance in our sample, we refer to common recommendations from the NEPS study for assessing DIF (Pohl and Carstensen 2012). Thus, we consider absolute differences in estimated difficulties greater than 1 logit to be very strong DIF, and absolute differences between 0.6 and 1 to be worthy of attention for further investigation (Pohl and Carstensen 2012). While the overall test should differentiate between years of nursing education, item DIF should be low. Hence, items that showed a strong subgroup difference were discarded. In the final step, we used the results of these analyses for selecting items to avoid floor effects when testing second- and third-year nursing apprentices, again applying curricular and content validity as well as reliability considerations. Analyses were carried out with ConQuest 2.0.
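The DIF thresholds named above can be illustrated with a minimal Python sketch. It is not the ConQuest procedure used in the study; the function name, the labels, and the example difficulty values are hypothetical, and treating the lower bound of 0.6 as inclusive is an assumption.

```python
def classify_dif(difficulty_group_a, difficulty_group_b,
                 attention=0.6, strong=1.0):
    """Classify the absolute difference (in logits) between the item
    difficulties estimated for two subgroups.

    Thresholds follow the rule of thumb cited in the text:
    > 1 logit is very strong DIF, 0.6 to 1 logit is worth a closer look.
    Treating exactly 0.6 as 'noteworthy' is an assumption of this sketch.
    """
    diff = abs(difficulty_group_a - difficulty_group_b)
    if diff > strong:
        return "very strong"
    if diff >= attention:
        return "noteworthy"
    return "negligible"


# Hypothetical difficulty estimates (logits) for second- vs. third-year students
examples = {"item_x": (0.2, 1.5), "item_y": (0.0, 0.8), "item_z": (0.1, 0.3)}
for name, (year2, year3) in examples.items():
    print(name, classify_dif(year2, year3))
```

In this sketch only items labelled "very strong" would be candidates for exclusion, which mirrors the decision rule described above.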

Item reduction through measurement error minimization
Two items had to be excluded since they were constants, meaning that all item answers were false, and therefore no diagnostic value could be obtained. Rasch scaling of the remaining 75 items led to a reliability score of WLE = 0.75. In a first step, we selected items iteratively with reliability in mind in an effort to minimize measurement error (see Appendix Table 7). We considered three measures for this purpose: weighted mean square (WMNSQ), t-statistics, and corrected item-total correlation. Applying common rules of thumb, we considered values of WMNSQ < 1.15 as indicative of a close item fit, 1.15 ≤ WMNSQ < 1.20 as a small item misfit, and WMNSQ ≥ 1.20 as a considerable item misfit (Smith et al. 2008; Gnambs and Nusser 2019). By conventional standards, we interpreted t-values greater than +2 or less than −2 as less compatible with the model than expected (p < 0.05) (Bond et al. 2021). While WMNSQ lay within a range of 0.8 to 1.2 for all items, suggesting an acceptable model fit (Pohl and Carstensen 2012), the t-value was outside the range of −2.0 to 2.0 for only one item, indicating that the observed data were less consistent with the model than expected (p < 0.05) and suggesting that the item should be omitted. In addition to the WMNSQ and t-value, we evaluated item-total correlations. According to common rules of thumb for evaluating the correlations of the item score with the total score, values > 0.20 are considered acceptable (Pohl and Carstensen 2012). With curricular and content validity in mind, we excluded items only if their item-total correlation was less than 0.15. Items with an item-total correlation less than 0.15 may indicate a low discrimination between low and high performers; furthermore, the low correlation can be interpreted as a problem of construct validity. This resulted in the exclusion of 18 items. The remaining item pool contained 56 items with acceptable values on all three criteria and an improved reliability of WLE = 0.78.
In order to evaluate whether, while increasing reliability and test economics, items were excluded to the detriment of curricular and content validity, we analyzed the content and the distribution of the excluded items with regard to fields of practice and situations. The content analysis shows that the remaining items continue to reflect the fields of practice and the situations. Table 4 also indicates that items are distributed fairly across fields of practice and situations, indicating that the 56-item pool maintains curricular and content validity.
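The three screening criteria can be sketched in Python as follows. This is an illustration only, not the ConQuest output used in the study: the function names and the toy response matrix are hypothetical, and the corrected item-total correlation is assumed to be the usual Pearson correlation between an item and the sum of all other items.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation for two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(scores, item):
    """Correlate one item with the total score of all *other* items.

    scores: list of per-person lists of item scores (assumed layout).
    """
    item_scores = [row[item] for row in scores]
    rest_scores = [sum(row) - row[item] for row in scores]
    return pearson(item_scores, rest_scores)


def classify_wmnsq(wmnsq):
    """Apply the WMNSQ rule of thumb quoted in the text."""
    if wmnsq >= 1.20:
        return "considerable misfit"
    if wmnsq >= 1.15:
        return "small misfit"
    return "close fit"


def keep_item(t_value, item_total, min_corr=0.15):
    """Keep an item if |t| <= 2 and its item-total correlation is >= 0.15."""
    return abs(t_value) <= 2.0 and item_total >= min_corr


# Toy data: four respondents, three dichotomous items (hypothetical values)
toy_scores = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
r_item0 = corrected_item_total(toy_scores, 0)
print(round(r_item0, 3), classify_wmnsq(1.17), keep_item(1.2, r_item0))
```

The 0.15 cut-off is deliberately more lenient than the conventional 0.20, reflecting the choice described above to protect curricular and content validity.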

Differential item functioning
DIF analyses serve to ensure that test items show the same order of difficulty for varying subgroups in the overall sample. DIF exists for an item if its difficulty interacts with subgroup membership (Osterlind and Everson 2009), meaning that test subjects with the same ability score vary in the likelihood that they will answer an item correctly depending on subgroup membership, and implying that a test containing the respective item discriminates against at least one of the subgroups. The main intent in our study is to ensure that no such discrimination occurs for second- and third-year nursing students and across both geriatric and clinical nursing (Koller et al. 2012). It should be emphasized again that the rationale for this is to create a homogeneous test suitable for tracing overall achievement on the core construct of nursing competence over years of nursing education. Item parameters were fixed at 0. With N = 133 nursing students, we conducted both local, that is, item-based, and global, meaning across-item, DIF analyses.
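For orientation, the idea that item difficulty interacts with subgroup membership can be written as a group-specific shift in a dichotomous Rasch model. The notation below is an assumption introduced for illustration, not the parameterization reported by the authors, and the polytomous TEMA items would require a partial credit extension.

```latex
% theta_p    : ability of person p
% beta_i     : overall difficulty of item i
% gamma_{ig} : hypothetical DIF shift of item i in subgroup g
P(X_{pi} = 1 \mid \theta_p, g)
  = \frac{\exp\bigl(\theta_p - (\beta_i + \gamma_{ig})\bigr)}
         {1 + \exp\bigl(\theta_p - (\beta_i + \gamma_{ig})\bigr)}
```

Under this sketch an item is free of DIF when $\gamma_{ig} = 0$ for every subgroup, and the absolute difference $|\gamma_{i1} - \gamma_{i2}|$ between two subgroups is the quantity compared against the logit thresholds named in the Methods section.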
While geriatric nursing students scored slightly higher than their counterparts in clinical nursing, global DIF scores indicated no significant DIF between the geriatric nursing and clinical nursing subgroups (see Appendix Table 10). Item-level analyses indicated that twelve items varied significantly between the program subgroups (A05d, A16d, A19d, A21d, A25d, A47d, A49d, A54d, A70o, A77d, A78d, A81d), six of them strongly (A05d, A19d, A47d, A54d, A70o, A78d), with seven of the items being more difficult for geriatric nursing students (A21d, A25d, A49d, A54d, A70o, A77d, A81d) and five more difficult for clinical nursing students (A05d, A16d, A19d, A47d, A78d; see Appendix Table 11). Curriculum analysis for the six items with strong DIF (A05d, A19d, A47d, A54d, A70o, A78d) indicated that differences could consistently be attributed to differences in comprehensiveness or explicitness within the curriculum, with the exception of item A78d, which requires bioscience knowledge and might therefore be easier to answer for students from clinical nursing (see Friedel and Treagust 2005). While we therefore excluded these items with strong DIF to generate an instrument for tracing increasing overall achievement in nursing in general, they may still be suitable for a test intended to trace longitudinal development regarding effects of specializations. Differences in the curriculum are of particular significance with regard to item A19d, an item concerning biography-oriented activity, which is granted considerably more time in the geriatric nursing specialization. As was to be expected, third-year students scored higher on the tasks than second-year students, albeit only slightly. However, DIF analyses regarding the year of education did not lead to significant global DIF either (see Appendix Table 8). On the item level, an inhomogeneous picture arose, with some items being solved more frequently by second-year students and some more frequently by third-year students.
Three items showed strong DIF, with two of these items being more frequently solved by third-year students (A03d, A77d) and one more frequently by second-year students (A22d) (see Appendix Table 9). Item A22d refers to restricting freedom in a dementia case, an explicit part of the second-year curriculum for both clinical and geriatric nursing and therefore possibly more easily recalled in the second year. Differences for item A03d concerning planning measures for food intake could not be sufficiently traced to the curriculum but may be related to the amount of practical experience. In contrast, item A77d, which refers to hypoglycemia symptoms, requires bioscience or medical knowledge and may be better recalled close to the final exam. Overall, this supports the assumption that these items with strong DIF may be excluded from an instrument aiming at measuring progress in nursing competence.
To ensure that the new TEMA-L instrument will be fair for the subgroups and that the competencies of the different subgroups can be compared with each other, we excluded nine items with strong DIF from the test as a result of the DIF analysis (Pohl and Carstensen 2012, p. 12). Table 5 displays the distribution of the omitted and the remaining tasks according to the fields of practice and the situation. Items remain fairly distributed across the fields of practice, and our analyses show that they continue to cover all situational content, except for the biography-oriented activity mentioned above (situation 1.4).

Item selection for measurement in the second and third year of nursing education
Our final step in the construction of a sensitive test was the selection of test items out of the remaining 47 items which can be applied for assessment in the second and third year of nursing education. The criterion for the item selection was that the items chosen should have the most diagnostic information value for different points of measurement in a future longitudinal study, creating an efficient test while at the same time maintaining a satisfying level of reliability. To this end, items should be selected for two measurement points in the second year and the third year of education, respectively. Hence, our purpose was to avoid floor effects at a first point of measurement in the second year of nursing education and ceiling effects at the second point of measurement in the third year (Rost 2004). Additionally, we intended a fair distribution specifically across the fields of practice in caring for the elderly, but also across situations, in order to maintain curricular and content validity.
To accomplish this, we first identified a set of 30 items of medium difficulty and good fit values as anchor items. Another 11 items were distributed according to the intended points of measurement, leading to an overall 36 and 35 items respectively for each of them. We chose five items that were solved by a higher share of third-year students and with slightly higher levels of difficulty for measurement in the third year of nursing education. Six items, which were generally solved at a high rate, were selected for measurement in the second year of nursing education. The results are depicted in Table 6, which shows that the items selected remain fairly distributed across fields of practice and situations, except situation 1.4. This means that, while biography-related items remain part of the instrument, the situation on biography-oriented activity was excluded entirely from the final instrument. Again, it must be emphasized that the purpose of our analysis is to generate an instrument which is fair for the different nursing specializations in the tracing of overall achievement. We would, however, strongly suggest integrating items regarding situation 1.4 into test environments for tracing increases of competence in the geriatric nursing specialization and taking this aspect of nursing education into account for future instrument development. The selected items continue to reflect the situational content of the remaining situations.
To ensure that the test booklets were fair at both intended measurement points, in a final step we performed DIF analyses between the second and third year of nursing education and between the two nursing specializations for the test booklets of the two measurement points. The DIF analyses did not result in a significant global DIF for either of the test booklets. Students in the third year of nursing education and geriatric nursing performed better on the tasks than students in the second year of nursing education and clinical nursing in the test booklet for the third year of nursing education, albeit only slightly (Appendix Table 16). In addition, the item-level results show that no item exhibited strong DIF (see Appendix Tables 12–19). Overall, the reliability of WLE remains at a satisfactory level, with WLE = 0.72 for measurement in the second and WLE = 0.70 for measurement in the third year of nursing education.
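The booklet design described above can be checked with a short Python sketch. The item identifiers are placeholders, not the real TEMA item codes; only the counts (30 anchor items, six additional second-year items, five additional third-year items, 77 items in the original test) are taken from the text.

```python
ORIGINAL_ITEM_COUNT = 77  # size of the full TEMA test

# Placeholder identifiers; the actual item codes are listed in Table 6
anchor_items = [f"anchor_{i:02d}" for i in range(1, 31)]    # medium difficulty
second_year_only = [f"easy_{i}" for i in range(1, 7)]       # solved at a high rate
third_year_only = [f"hard_{i}" for i in range(1, 6)]        # slightly more difficult

booklet_year2 = anchor_items + second_year_only
booklet_year3 = anchor_items + third_year_only

# Items shared by both booklets link the two measurement points
shared_items = set(booklet_year2) & set(booklet_year3)

reduction_year2 = 1 - len(booklet_year2) / ORIGINAL_ITEM_COUNT
reduction_year3 = 1 - len(booklet_year3) / ORIGINAL_ITEM_COUNT

print(len(booklet_year2), len(booklet_year3), len(shared_items))
print(f"{reduction_year2:.1%}", f"{reduction_year3:.1%}")
```

The shared anchor set is what allows scores from the two booklets to be placed on a common scale, while the booklet-specific items target floor effects in the second year and ceiling effects in the third.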

Discussion and limitations
The purpose of this paper was to optimize the TEMA assessment of nursing competencies in care for the elderly in preparation for longitudinal testing in the second and third year of nursing education, under the condition that it must be sufficiently reliable and economical with regard to testing time. To this end, the resulting assessment should be a sensitive test with regard to the core construct: it should reflect learning progress with regard to nursing competence over time without inherently preferring one programmatic group over another or varying the construct according to years of experience. Using Rasch scaling, we managed to identify a body of items which contribute to item-total correlation and reliability, are fair across existing nursing education specializations and across years of nursing education, and generally maintain the curricular and content validity, which formed the basis of the construction of the TEMA test. As the new test version results in only 36 and 35 items respectively per point of measurement, it has the benefit of reducing the number of test items by more than 50%, leading to significant gains in testing time in a longitudinal design.
The sampling strategy involved only two classes per group, and a future study should be broader in scope. Overall, the results of this study must also be interpreted with caution due to the heterogeneity of the sample. While the sample somewhat underrepresents males with regard to the general nursing student population, we consider this to be of lesser importance, as gender was not significantly linked with TEMA measurement of nursing competence in our previous large-scale assessment calibration study, conducted with a much broader sample of third-year nursing students. Since age was a significant factor linked slightly positively with proficiency in the calibration study, the high share of students older than 25 in the second year of the geriatric nursing subgroup may contribute to third-year students performing only slightly better than second-year students. Generally, while third-year students scored better than second-year students in the resulting third-year booklet, global DIF between third- as opposed to second-year nursing education subgroups was not significant. As this is likely due to the small sample size, significant differences might be expected in a larger sample. However, the study was carried out as a cross-sectional study, and longitudinal testing will be necessary to exclude the possibility that the DIF found in this study with regard to the years in the program is an artifact of the cohorts and does not represent individual development throughout nursing education. While the advantage of our cross-sectional study is that it is sample-conserving and avoids repetition effects, our intent in a subsequent study using the TEMA-L instrument is to disentangle such effects using an experimental longitudinal design with a larger sample.
Furthermore, while the sample comprised nursing students from both clinical and geriatric nursing and retains satisfactory reliability for our overall sample, and although the curricular and content validity of the test was examined against the generalized nursing curriculum in place since 2020, empirical evidence for the new cohort that started in 2020 has yet to be collected and should be a subject of future study. Finally, sensitive items may indicate variations in either school-based instruction or practical training and might be used in future studies in a controlled manner to elucidate possible origins of variance.

Conclusions
In our cross-sectional study, we developed a reliable and economical technology-based assessment of nursing competence in care for the elderly that is sensitive to years of nursing education. These findings lead us to assume that the instrument can be validly applied in a future study for longitudinal testing of nursing competence concerning care-related action and interaction with clients, patients, and family in the second and third year of nursing education. Moreover, this instrument can be used to conduct systematic research on factors improving nursing competence in the context of VET, both in the theoretical and in the practical sphere. Since our cross-sectional approach to testing sensitivity avoids problems of repeated testing attached to longitudinal studies, it might also serve as an empirical example on making sense and use of cross-sectional data in preparing longitudinal test designs.
et al. Empirical Res Voc Ed Train (2022) 14:3