Applicability and validation of the Reaction to Tests Scale (RTT) in a sample of Portuguese medical students

Test anxiety is a crucial factor in determining academic outcomes, and it may lead to poor cognitive performance, academic underachievement, and psychological distress, interfering specifically with their ability to think and perform during tests. The main objective of this study was to explore the applicability and psychometric properties of a Portuguese version of the Reactions to Tests scale (RTT) in a sample of medical students. A sample of 672 medical students completed the RTT. The sample was randomly split in half to allow for independent Exploratory Factor Analysis (EFA) and to test the best fit model—Confirmatory Factor Analysis (CFA). CFA was used to test both the first-order factor structure (four subscales) and second-order factor structure, in which the four subscales relate to a general factor, Test Anxiety. The internal consistency of the RTT was assessed through Cronbach’s alpha, Composite reliability (CR) and Average Variance Extracted (AVE) for the total scale and each of the four subscales. Convergent validity was evaluated through the correlation between RTT and the State-Trait Anxiety Inventory (STAI-Y).To explore the comparability of measured attributes across subgroups of respondents, measurement invariance was also studied. Results from exploratory and confirmatory factor analyses showed acceptable fits for the Portuguese RTT version. Concerning internal consistency, results indicate that RTT was found to be reliable to measure test anxiety in this sample. Convergent validity of the RTT with both state and trait anxiety STAI-Y’s subscales was also shown. Moreover, multigroup analyses showed metric invariance across gender and curriculum phase. Our results suggest that the RTT scale is a valid and reliable instrument for the measurement of test anxiety among Portuguese Medical Students.


Introduction
Much research has been done regarding the role of emotion on performance, with anxiety-usually characterized by sentiments of tension, worry and negative physiological reactions-being the key variable of interest in comprehending the role of emotion in performance [1]. Higher levels of anxiety are often manifested in situations in which we are evaluated. These scenarios are part of our routine, both at the academic and at the professional level, and can arise as anxiety and stress enhancers [1,2]. Although anxiety can be useful, encouraging learning and motivating students, extreme levels can have health repercussions both at the mental and physical levels [2]. One of the most common anxiety situations reported by students is test anxiety [3]. Regarding educational settings, test anxiety is frequently described by context specific stimuli and academic subject specific reactions, being distinguished from other forms of anxiety through its focus on evaluative circumstances [1]. In an academic context, college students (and particularly, medical students) are no exception to the rule [4]. Test anxiety is a situation-specific personality trait generally regarded as having two psychological components: worry and emotional stimulation [5]. Test anxiety is considered as a broader "evaluation anxiety" construct and is composed of cognitive, emotional, behavioural, and bodily responses that are associated with concerns about potential negative outcomes or failure when on evaluative situations [1,[6][7][8]. Test anxiety is a crucial factor in determining academic outcomes, and it may lead to poor cognitive performance, academic underachievement and/or psychological distress [8].
Research on this topic is not new: the first studies on this subject date back to 1914, with the concept arising in 1952, when Text Anxiety Questionnaire was published by Mandler and Sarason [9]. In Liebert and Morri's [10] early designation, test anxiety was analyzed as a bidimensional construct involving two components: worry and emotionality. Worry reflects the cognitive aspect of test anxiety and refers to concerns relating to performance during the exams, while emotionality encompasses students' physical reactions experienced during the testing situation. Perhaps the most important contributions to test anxiety research were the distinction between anxiety as a temporary state and as a personality trait [5,9] and the distinction between two basic dimensions in anxiety: worry and emotionality [5,6,9]. Later, in the 80's, an influential theoretical shift into a multidimensional view of test anxiety emerged. Sarason and Wine postulated that test anxiety is a complex phenomenon that consists of cognitive, emotional, behavioral and bodily discriminable components [11,12].
Considering test anxiety as a multidimensional construct, Sarason [11] established a four-factor Reactions to Test Scale (RTT) to assess this matter. Later, Benson [13] developed a shorter scale where they combined the RTT scale [11] and the Text Anxiety Inventory [14], and removed items that were redundant or incapable of loading substantially in any factor, creating, therefore, a scale that combined the strengths of both tools with a total of 18 items. They then pursued to enhance the precision of the scale by adding new items, especially in the Bodily symptoms component. The outcome was the 20-item RTT scale [6,15].
In this work, we aimed to answer the call of Benson and Bandalos [13] for more validation studies of RTT 20-item with other populations to analyze the stability and generalization of the first and second-order factor models of RTT 20-item. Overall, there were three goals to the present work: (1) to understand the occurrence of test anxiety in a sample of Portuguese medical students, (2) determine if the validity and reliability of the RTT could be replicated in a Portuguese sample and (3) obtain data regarding the convergent validity of the RTT and STAI questionnaires. The tested hypothesis was that the RTT scale is a valid and reliable way to measure test anxiety in Portuguese medical students.

Sample
Data was collected from pre-clinical (the first three) and clinical (the last three) years of medical college students from the School of Medicine of the University of Minho (UM) and the Nova Medical School in Lisbon (NMS/ UNL). Regarding missing values, a Little's MCAR test was performed to understand whether missing values were randomly distributed. Considering that the test revealed no statistical significance, we assume that the data is missing completely at random and, therefore, we proceeded to replace missing values by the variable median. The final sample comprised 672 medical students (553 in pre-clinical phase and 119 at the clinical phase). Age ranged from 17 to 39, with a mean of 20.6 (SD = 2.75. Of these, five hundred and eleven were females (76%), and one hundred and sixty-one were males (24%).Of the 672 students in this sample, 393 belong to UM and 279 to NMS/UNL. At NMS/UNL, 21.9% of the students are male and the rest are female (n = 218). This tendency remains when we refer to UM, where 25.4% of the students are male and 74.6% are female. The average age in both universities, 19.35 (SD = 2.28) at NMS/UNL and 21.43 (SD = 2.73) at UM. The higher mean age at UM may be explained by the fact that between the two universities, only UM has students in the last 3 years of the course, i.e., in the clinical phase (n = 119).
EFA was performed on a randomly selected half of the data to examine the factor structure of the scale. A CFA was conducted in the other half of the sample. The gender distribution of the EFA sample is 24.7% male (n = 83) and 75.3% female (n = 253). The sample has an average age of 20.4 years (SD = 2.76), with 188 students belonging to UM, and the rest being students from NMS/ UNL (n = 148). Considering the clinical phase, 83.9% of the participants were in the preclinical phase (first 3 years) and the remaining were in the last 3 years of the academic pathway. For the CFA sample, the mean age was 20.67 years (SD = 2.74), with 23.2% being male and 76.8 female (n = 78 and 258, respectively) 0.188 students were students from UM and 148 from NMS/UNL and about 80.7% of them were in the preclinical phase (first 3 years of the medical degree).

RTT Scale
Given the complexity of the test anxiety phenomenon, various instruments have been developed for its determination and analysis. The Reaction to Tests Scale (RTT) (see additonal files) is a measure of test anxiety based on the interference model proposed by Sarason [11] and it represents the first shift to a multidimensional view of test anxiety. The RTT evaluates four dimensions of test anxiety: (a) tension, (b) worry, (c) test-irrelevant thinking and (d) bodily symptoms. Specifically, the (a) tension assesses feelings of muscle tension; (b) worry evaluates the presence of distracting worrying thoughts related to test performance; (c) test-irrelevant thinking factor contains items that measure the frequency and intensity of thought that are irrelevant to the testing situation, and (d) bodily symptoms includes physiological symptoms of anxiety. Meanwhile, based on the theoretical four-factor dimensionality proposed by Sarason [11], Benson and Bandalos [13] developed a shorter 20-item scale revision by reducing redundant items of the 40-items RTT. Participants are asked to rate each item on a four-point Likert format scale (1 = not all typical of me, 2 = only somewhat typical of me, 3 = quite typical of me and 4 = very typical of me), except item 20 that is reverse coded. Five items measure tension, six evaluates worry, five measure test-irrelevant thinking and four measure bodily symptoms. This shorter version presented high reliability, ranging from 0.68 for bodily symptoms, 0.85 for test-irrelevant thinking, 0.82 for worry, 0.91 for tension and 0.90 for the total scale.

STAI-Y
The STAI-Y [16] is a measure composed by two subscales with 20 items allocated to each of them. The State Anxiety subscale (S -Anxiety) evaluates the current state of anxiety asking how individuals feel "right now". The trait anxiety subscale (T -Anxiety) assesses relatively stable aspects of "anxiety proneness". Each item is scored on a scale of 1 to 4, based on the intensity and frequency. Range of scores of each subscale is 20-80, the higher score representing greater anxiety.

Procedure
The study was conducted in two moments: (1) translation of the English version into Portuguese and (2) validation of the Portuguese version. The translation process of RTT was conducted according to the following steps: (a) translation of the English version into Portuguese by one person without prior knowledge of the subject and two people with knowledge in the area; (b) direct comparison of the translated versions and synthesis of a single Portuguese version of RTT, after solving discrepancies through consensus; (c) back-translation and (d) pilot test of the pre-final Portuguese version on a randomly selected sample of medical students (n = 67). After modifications in RTT, the final version (additional file 1) was applied to the participants in the validation phase of the study. No changes were made to the scoring system and the rating criteria of the original instrument.

Data analysis
Data regarding the sample and RTT psychometric characteristics was analysed using IBM SPSS version 26. RStudio Version 1.2.5042 was used for reliability analysis, EFA, and CFA. Descriptive statistics of the RTT scale included the mean score, standard deviation, skewness (Sk) and kurtosis (Ku). Values higher than three for Sk and 10 for Ku were considered as severe violations of normal distribution of the items (Kline, 2011). The sample was randomly split in half to allow for independent EFA and to test the best fit model-CFA. The factorial structure was explored in RStudio by performing an EFA using the GPArotation package [18] and a CFA using the Lavaan package [19]. Parallel analysis was used to determine the number of extracted factors. CFA confirmed the best fit model, and it was also used to test both the first-order factor structure (four subscales) and secondorder factor structure (four subscales related to a general factor, Test Anxiety). As in the study by Benson and El-Zahhar [6], the present study sought to see if the correlation between factors could translate into a more general construct-test anxiety.
In order to evaluate model fit, chi-square by degrees of freedom ratio (χ 2 /df ), Comparative Fit Index (CFI), Tucker-Lewis index (TLI), Root Mean Square Error of Approximation (RMSEA) and Akaike Information Criterion (AIC) were used. The model was considered to have an acceptable fit if the value of χ 2 /df was less than five, RMSEA < 0.08 [20], and if CFI and TLI > 0.9 [20]. For AIC, the model with the lowest values fits the data better [21].
Modification Indices were also analyzed to identify correlations among errors, considering values above 11. From the values above this limit, the one that had the most significant value was added. Internal consistency of the RTT was assessed through Cronbach's alpha (α) and McDonald's Omega (ω) for the total scale and each of the four subscales. Composite Reliability (CR) and Average Variance Extracted (AVE) were also computed, considering the criteria of ≥ 0.5 as acceptable value [21]. Convergent validity was assessed by testing correlation matrix between the RTT and STAI-Y. Both Scales were implemented simultaneously to UM students.
Gender and curriculum phase differences were studied by applying an independent t-test to the RTT Total scale (test anxiety), and multivariate analysis of variance (MANOVA) to RTT subscales. The effect size, along with the 95% confidence interval for gender and pre-clinical/ clinical years differences in each RTT subscale and total scale was also calculated, considering benchmarks of effect sizes proposed by Cohen (small: d = 0.2; medium: d = 0.5; large: d = 0.8) [22,23].
To explore the comparability of measured attributes across subgroups of respondents [24], measurement invariance was tested for gender and curriculum phase. Five nested models with gradually constricted parameters were tested: Model 1 tested for Configural invariance (basic model structure), Model 2 tested for Metric invariance (same loadings across groups), Model 3 for Scalar invariance (constrained factor loadings and item intercepts) and Model 4 for Residual invariance (same measurement errors) [25]. The differences between nested models regarding CFI and RMSEA indices were considered acceptable for the following values: ΔCFI ≤ − 0.02, Fig. 1 Parallel analysis ΔRMSEA ≤ 0.03, for tests of factor loading invariance and ΔCFI ≤ − 0.01 and RMSEA ≤ 0.01 for testing scalar invariance [26].

Descriptive statistics
Descriptive statistics for RTT items are presented in Table 1. Items' sensitivity was assessed through Sk and Ku analysis, with values higher than 3 and 10, respectively, indicating severe deviance from normal distribution of the items [17]. All items show acceptable Skewness (ranging from − 0.86 to 2.10) and kurtosis (ranging from − 1.13 to 3.78).

Exploratory factor analysis
The dataset was split into two random samples. EFA was performed in 336 randomly selected individuals. To estimate the number of factors to retain, a parallel analysis was performed. Parallel analysis estimated four factors, as seen in Fig. 1.
Successively, a four-factor solution was inspected with an EFA with a principal axis factor analysis using a promax rotation, with loadings below 0.30 suppressed. EFA revealed a four-factor structure, similar to the one reported by Benson and Bandalos [13]. The four factors explained 45.0% of the variance. Standardized factorial weights and individual item's reliability for the model are presented in Table 2. The factors were construed as Test Irrelevant Thinking, Tension, Worry and Bodily Symptoms.

Confirmatory factor analysis
CFA confirmed the other half of the initial sample (N = 336), which supported the four-factor structure for first-order factor, with almost all items loading substantially on hypothesized factors. Loadings ranged from 0.22 to 0.85. Only item 20 presented a value below 0.40. (Cf. Table 3).
Fit indices suggested that the model provided a good fit for the data, as seen in Table 4. The first-order model revealed satisfactory fit indices (χ 2 /df = 2.9, CFI = 0.90, RMSEA = 0.075, SRMR = 0.059), although TLI did not meet the previously stipulated criteria for acceptable fit (TLI = 0.89). The Modification index revealed a correlation between errors in the tension subscale, which were added to the previously computed model, resulting in a new modified model with satisfactory fit indices (Cf. Figure 2).
Regarding the second-order latent factor model (test anxiety), only χ 2 /df reached acceptable values. The correlation between errors was added (e2 and e4) (Cf. Fig. 2) improving overall model fit, except for TLI, whose value did not reach the target level (TLI = 0.89).    Concerning AIC values and goodness of fit indices, the first-order factor model with correlated errors presented the lowest value for AIC and best-fit indices.

Reliability
In the present study, the RTT scale demonstrated high internal consistency with a Cronbach's α of 0.90 for the total scale (Cf.  Figure 3 shows the correlation matrix between the subscales and the total scale. Reported correlations between subscales Worry (W), Tension (T) and Bodily Symptoms (BS) range between 0.55 and 0.65, while those between Test Irrelevant Thinking (IT) and other factors were much lower, ranging from 0.17 and 0.43. For the correlation between subscales and total scale (TA), all values presented satisfactory correlations, with the lowest being for Test Irrelevant Thinking.

Convergent validity
The RTT intercorrelations with STAI Inventory for the UM students (n = 393) are summarized in Fig. 4. Concerning state anxiety subscales, significantly and positively correlations with RTT subscales and total score were found. All the RTT subscales and total score are significantly positively associated with trait anxiety subscales, thus pointing to its convergent validity. The positive correlations between the measures are expected, as reported in Bados et al. [15].

Measurement invariance
For Measurement invariance testing, a series of multigroups CFA were conducted between groups: curriculum phase (pre-clinical vs clinical) and gender (male vs female) (Cf. Table 6). Regarding the second order CFA modified model, comparisons between gender show that configural invariance was supported, as it revealed good indexes of fit (CFI = 0.90, RMSEA = 0.075, AIC = 15,406, BIC = 15,902), meaning that configural invariance is maintained. We also found support for metric invariance (ΔCFI = 0.001, ΔRMSEA = 0.002, p = 0.595). Concerning Scalar invariance, the model is acceptable, as it fits in the suitable ranges (ΔCFI = 0.002, ΔRMSEA = 0.002, p = 0.838). Although strict invariance is supported by almost all fit measures for gender (ΔCFI = 0.01, ΔRMSEA = 0.0019), the p value is < 0.001, which may suggest that it does not fit our data particularly well.
The same models were applied to the total sample (n = 672) (Cf. Table 7). Regarding the second-order CFA modified model for the total sample, only strict invariance for gender was not supported (ΔCFI = 0.010, ΔRMSEA = 0.001, p value < 0.001).

Gender and year of medical training comparison
Gender and phase differences for the total scale were analyzed through an independent samples Student's T-test. Concerning gender, significant statistical differences with medium effect sizes were found in the total scale (t (670) = − 5.1; p < 0.001, d = 0.46(0.28-0.64)) with female medical students reporting higher scores than male students. No statistical differences were found concerning phase (t (670) = 1.3; p = 0.189, d = 0.13(0.065-0.331)).
A MANOVA was conducted to test the hypothesis that there would be differences between gender and phase at a subscale level. For gender, we obtained significant results: the Tension, Worry and Bodily Symptoms subscales were significantly different in terms of gender (F (1, 670) = 42.4, p < 0.001; F (1, 670) = 28.1, p < 0.001 and F (1, 670) = 12.9, p < 0.001). Only Test Irrelevant thinking showed no differences for gender (F (1, 670) = 0.007, p = 0.934). Concerning the curriculum phase, with a MANOVA analysis no significant differences were found.

Discussion
The three main objectives of this study were to understand the occurrence of test anxiety in a sample of Portuguese medical students, to determine if the validity and reliability of the RTT could be replicated in a Portuguese sample and to obtain data regarding the convergent validity of the RTT and STAI questionnaires. The EFA reinforced the expected 4-factor structure. CFA validated the first and secondorder factor model [13]. Nevertheless, even though both demonstrated good model fit, the TLI of the modified second-order factor was slightly below the acceptable values (TLI = 0.89). In terms of internal consistency and composite reliability, the results suggest that the sensitivity, construct validity and reliability of the RTT scale were acceptable. Only the Bodily Symptoms subscale presented a Cronbach's α value below 0.70, but in an acceptable level (≥ 0.60), which can be explained by the fact that Cronbach's alpha is influenced by the number of items [27] and this subscale is the only with 4 items. These results are similar to those reported by Benson et al. [13] and Bados et al. [14]. For AVE, the Worry and Bodily Symptoms subscales and the total Test Anxiety scale presented values below 0.5. This can be because the worry subscale has one item (20-Após um teste, digo para mim mesmo "Acabou e eu fiz o melhor que pude") that has loading inferior to 0.3. However, since the Worry subscale and the total Test Anxiety scale CR value were above 0.7, this value is acceptable (Fornell et al., 1981). In contrast, the Bodily Symptoms subscale reached neither the AVE nor the CR value. The scale has only four items, one of which (item 1-A minha boca fica seca durante um exame) has a low loading value (0.38), which may explain this result. Concerning convergent validity, the trait anxiety subscales for the UM students presented significantly and positively correlations with RTT subscales and total score, evidencing the expected convergent validity. Regarding gender, female medical students showed higher test anxiety scores compared to male students. There were no differences relating to curriculum phase in any of the subscales or total scale. Finally, as expected, support for metric invariance was found across different groups (gender and phase). Although strict invariance for gender was supported in almost all fit measures, the p value was < 0.001, which may indicate variance. Concerning phase, although the p value did not allow to prove the thesis of scalar invariance, the rest of the indices contradict this conclusion. However, chi-square statistic often has higher power to detect minor model misspecification with larger sample sizes, and since all the other parameters are according to the stipulated values, we can accept invariance [29].
This study allowed us to understand the test anxiety patterns of medical students at two Portuguese universities. Additionally, the validation of this scale will allow an increase in the standardization of results among countries where the scale is already validated. A potential limitation of our study is the fact that participants were only from one university course (medicine), which should be considered when generalizing results. Second, our study has a cross-sectional design and does not allow to analyze the stability of the RTT 20-item version over time. Third, the sample is mainly composed of female students and pre-clinical students, which limits the measurement invariance across gender and curriculum phase.

Conclusions
To the best of our knowledge, our study is the first attempt to validate the RTT 20-item version in a sample of Portuguese medical students. Additionally, it responds to the call of Benson et al. [13] for validation studies of RTT 20-item with other populations to analyze the stability and generalization of the first and second-order factor models of the RTT 20-item version. The results support the validity and reliability of the Portuguese RTT 20-item among medical students and confirm the factor structure of the four-factor model (first order model) and the second order factor model. Given the challenges in applying test anxiety instruments across various cultures, the present study is a preliminary indicator that the RTT scale may prove a useful cross-cultural instrument.