Structural validity and invariance of the Feedback Perceptions Questionnaire

Despite a growing interest in instructional feedback, students' feedback perceptions have received limited attention. We examined the structural validity and measurement invariance of the Feedback Perceptions Questionnaire (FPQ). The FPQ measures feedback perceptions in terms of perceived fairness, usefulness, acceptance, willingness to improve, and affect. Secondary school students (N = 1486) received a fictional scenario containing Concise General Feedback or Elaborated Specific Feedback from a fictional peer. Students rated their perceptions as if they had received the feedback themselves. Confirmatory Factor Analysis (CFA) supports the structural validity of the FPQ and its invariance across the two types of peer feedback, gender, four grade levels and two tracks. Perceived fairness of peer feedback was a strong positive predictor of willingness to improve and affect, whereas perceived usefulness and acceptance of peer feedback showed a more complex pattern in predicting willingness to improve and affect.


Introduction
Feedback is one of the most powerful instructional techniques (e.g., Evans, 2013; Hattie & Clarke, 2018; Hattie & Timperley, 2007; Jönsson, 2013; Narciss, 2008; Van der Kleij, Feskens, & Eggen, 2015; Winstone, Nash, Parker, & Rowntree, 2017). Feedback can be defined as all post-response information which informs learners about their actual state of learning and/or performance, in order to help them detect whether their current state corresponds to the learning aims (Narciss, 2008). Feedback provided by an external source can be designed and delivered in many ways, and unfolds its impact depending on a complex interplay of individual and situational factors of a given instructional context.
Thus, despite the considerable volume of research on feedback, many aspects of how feedback operates are still poorly understood. The historical review and meta-analysis by Kluger and DeNisi (1996) on external evaluative feedback of task or process aspects of performance revealed, for example, that one-third of external feedback interventions actually reduced performance. They proposed the Feedback Intervention Theory (FIT) and added the metacognitive, motivational and affective planes to explain (non)processing of feedback by the recipient. Although self-regulation was considered by Kluger and DeNisi (1996), the review by Butler and Winne (1995) explicitly synthesized feedback and self-regulation. They highlight the interplay between internal feedback (within oneself) and external feedback (by others), as a result of which the recipient may (a) ignore the external feedback, (b) reject the external feedback, (c) judge the external feedback irrelevant, (d) consider the external and internal feedback as unrelated, (e) re-interpret the external feedback to make it conform to the internal feedback, or (f) make superficial rather than fundamental changes. Building on both reviews, Narciss (2006, 2008, 2013, 2017) developed the Interactive Tutoring-Feedback (ITF) model. The ITF model adopts a multidimensional view on feedback by differentiating the core internal and external factors and processes that influence how the content of feedback (e.g., evaluative and/or informative remarks) provided by an external source (e.g., teacher, peer, computer-based learning environment) is perceived, processed and used for further learning. It provides a heuristic framework for examining how the interplay of internal and external factors contributes to the effects of interactive feedback strategies on learners' cognition, metacognition and motivation. Although slightly different in their assumptions and external feedback components compared to Narciss, Hattie and Timperley (2007), Shute (2008) and Yang and Carless (2013) also advocate a multidimensional view on feedback.
In line with the increased attention for the interplay of internal and external feedback and the resultant (non)processing of external feedback, recent research emphasizes the agency and active role of the recipient in processing feedback (Boud & Molloy, 2013; Price, Handley, & Millar, 2011; Strijbos & Müller, 2014; Winstone et al., 2017). In fact, the emphasis on recipience processes echoes Mory's (2004) call for research into feedback perceptions as an explanatory, moderator and/or mediator variable of feedback processing. Feedback perceptions refer to the outcomes of how recipients spontaneously experience the feedback content provided by an external source (or the feedback process as a whole) in terms of cognitive, metacognitive, motivational, and/or affective reactions. Feedback perceptions may vary depending on the feedback content (e.g., evaluative and/or informative remarks) and/or the feedback source (e.g., teacher, peer, computer-based learning environment), and thus unfold their moderating or mediating impact in various ways. Investigating feedback perceptions and their role in feedback processing requires reliable and valid instruments to measure feedback perceptions across various feedback processes and contexts (cf. Brown & Harris, 2018). In this paper we first outline how feedback content and source may influence feedback perceptions in the processing phase. Second, we review how prior work has measured feedback perceptions in the processing phase. Finally, we describe the Feedback Perceptions Questionnaire (FPQ) and investigate its structural validity and invariance.

Impact of feedback content and source on feedback perceptions and processing
The external feedback content, characteristics of the source (e.g., Leung, Su, & Morris, 2001), or a combination of both (Berndt, Strijbos, & Fischer, 2018; Raemdonck & Strijbos, 2013; Strijbos, Narciss, & Dünnebier, 2010) can influence feedback perceptions in the processing phase. However, there might be differences in the perceived credibility of various feedback sources (e.g., teacher vs. peer; supervisor vs. co-worker) due to a perceived implicit and/or explicit power differential (Boud & Molloy, 2013; Carless, 2006; Leung et al., 2001; Raemdonck & Strijbos, 2013; Steelman, Levy, & Snell, 2004; Strijbos & Müller, 2014; Strijbos et al., 2010; Winstone et al., 2017; Yang, Badger, & Yu, 2006). Most studies in education focus predominantly on the teacher as the source of feedback, which is not surprising given that the teacher is (historically) considered a subject-area expert. However, the increased interest in peer feedback signifies that the 'peer' is also a potential and relevant feedback source in education. Nevertheless, as students are not experts in a subject area, peer feedback is susceptible to variation in content (Lockhart & Ng, 1995). Moreover, students often doubt their own and their peers' knowledge within a subject area, as well as their own and their peers' evaluation and feedback skills (McConlogue, 2012; Rotsaert, Panadero, Estrada, & Schellens, 2017; Strijbos, Ochoa, Sluijsmans, Segers, & Tillema, 2009; Van Gennip, Segers, & Tillema, 2010). Additionally, peer feedback can vary due to interpersonal relationships, whereby students prefer not to assess their peers too harshly (Cheng & Warren, 1997; Panadero, Romero, & Strijbos, 2013). As such, feedback from a peer might differentially affect the recipient's feedback perceptions in the processing phase, possibly even more so than teacher feedback.

Existing questionnaires that measure multiple dimensions of feedback perceptions in the processing phase
In line with a multidimensional view on feedback, it is evident that the measurement of feedback perceptions should also be multidimensional. Thus, we compiled an overview of existing instruments that measure multiple dimensions of feedback perceptions in the processing phase. Since feedback can be provided by various sources, we adopted a broad perspective for our review and also included questionnaires developed for a workplace context. Studies measuring only a single dimension (e.g., perceived usefulness, fairness, timeliness, helpfulness, etc.) were excluded. Given our quantitative questionnaire orientation, we also required that (a) psychometric information on reliability and/or validity was provided for all (sub)scales, (b) (sub)scales consisted of multiple items, and (c) the phrasing of all items was available in the main text, a table or an appendix. Our literature search identified twelve questionnaires, of which eight met these three criteria; they are summarized in Table 1.
First of all, the questionnaires are more or less explicitly based on a definition of feedback perceptions. Four out of eight were developed in an iterative and bottom-up process, and the aspects of feedback perceptions measured by the (sub)scales have to be derived from the item content (AEQ, AFQ, IFOS, and SCoF). In contrast, the FES is based on an a priori model of feedback environment factors, and the FOS is rooted in the construct of feedback orientation. Only the FS was based on a brief, yet explicit, definition of feedback perceptions: "Feedback perceptions are thus concerned with how a learner perceives the feedback, which is assumed to be influenced by the feedback message, characteristics of the feedback provider and the frame of reference of the feedback receiver" (De Kleijn et al., 2013, p. 1014), and its items were based on Hattie and Timperley (2007) and Shute (2008). Finally, the development of the FPQ was guided by the ITF model (Narciss, 2008), yet its definition of feedback perceptions was only implicitly provided through item content. Second, the eight questionnaires differ in their main focus: five measure perceptions of the overall feedback practice and/or environment (AEQ, FES, IFOS, FOS, and SCoF), whereas three measure perceptions of specific feedback on a specific task (AFQ, FPQ and FS). This is reflected in the phrasing of items, as evidenced by the sample item per (sub)scale in Table 1. The questionnaires adopting an overall orientation contain items measuring feedback perceptions across a range of tasks and situations (i.e., trait-like), whereas questionnaires with a specific orientation contain items measuring perceptions of feedback by a specific source in relation to a specific task in a specific situation (i.e., state-like). This key difference between trait-like and state-like items and scales is well known in questionnaire research on emotions (Pekrun, Goetz, Frenzel, Barchfeld, & Perry, 2011), a topic that is gaining ground in research on feedback (Goetz, Lipnevich, Krannich, & Gogol, 2018). The distinction between trait-like and state-like is important because how feedback is perceived in general explains much less (or nothing at all) about how specific feedback on a specific task by a specific source in a specific situation is perceived, and might thus lead to biased inferences. Third, the questionnaires differ in the degree to which multiple sources are included and/or considered while developing the instrument (FES, FPQ, and SCoF), and in whether they can be used in multiple contexts (FPQ and FOS).
With respect to the psychometric quality of an instrument, it is important to determine whether the instrument is adequate (i.e., usable in terms of reliability and validity to answer research questions and detect a desired effect size, e.g., small to medium in an educational context). Table 1 shows that for all questionnaires at least an exploratory analysis of the assumed subscale structure was reported (i.e., PCA, PAF or EFA; typically in combination with Cronbach's alpha). For three questionnaires, structural validity (i.e., the degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured; see Mokkink et al., 2010) was determined through confirmatory factor analysis (CFA). However, psychometric quality also depends on whether the instrument is robust (i.e., whether it performs the same in various contexts and populations), and any numerical comparisons require that the instrument is invariant between them. When invariant, items are interpreted in a conceptually similar manner and will be answered in the same way by different subgroups in the sample. Since students typically receive different feedback depending on their work, it is important to determine that a questionnaire is at least invariant for different types of feedback. Furthermore, there is some evidence pointing to gender differences in the processing of feedback (Dweck, Davidson, Nelson, & Enna, 1978; Roberts & Nolen-Hoeksema, 1989; Schmidt, 1995; Turner & Gibbs, 2010). Finally, in order to enable cross-sectional and longitudinal studies of feedback perceptions, it is important to ensure that the questionnaire is invariant for grade level (a proxy for age) and different tracks (a proxy for cognitive ability; dependent on the educational system of a specific country). If achieved, invariance signifies that (a) group differences in factor means are unbiased and (b) group differences in observed means are directly related to group differences in factor means and not contaminated by differential response bias (Gregorich, 2006). If subgroups have statistically similar characteristics, then it can be concluded that they have been drawn from the same population and comparison of scale mean scores can thus proceed (Wu, Li, & Zumbo, 2007). In sum, invariance provides a strong(er) foundation for claims based on subgroup comparisons, which are a prime focus in educational research. Yet, none of the eight existing questionnaires has been tested for measurement invariance.
Out of these eight questionnaires, we focus specifically on the FPQ. First of all, the FPQ distinguishes two broad dimensions: (a) perceptions that relate to the cognitive function of feedback and the degree to which its contents are perceived in terms of fairness, usefulness and acceptance, and (b) perceptions that relate to the motivational function of feedback and the degree to which its contents motivate the recipient to improve their performance while weighing affective reactions prompted by the feedback. Second, the FPQ measures feedback perceptions as a state-like phenomenon (i.e., specific perceptions of feedback on a specific task in a specific situation), whereas most instruments measure feedback perceptions as a trait-like phenomenon (i.e., general perceptions of feedback across a range of tasks and situations). Since most studies investigate student perceptions in relation to specific tasks in a specific situation, trait-like measures are problematic. Third, the FPQ has been used in various studies investigating perceptions of peer feedback (Berndt et al., 2018; Dijks, Brummer, & Kostons, 2018; Huisman, Saab, Van Driel, & Van den Broek, 2018; Peters, Körndle, & Narciss, 2018; Strijbos et al., 2010), teacher feedback (Agricola, Van der Schaaf, Prins, & Van Tartwijk, 2020), both peer and teacher feedback (Prins, De Kleijn, & Van Tartwijk, 2017), and feedback on study progress (Fonteyne et al., 2018).

Research aims
The present study reports on the structural validity and measurement invariance of the Feedback Perceptions Questionnaire (FPQ) developed by Strijbos et al. (2010). In light of the increased attention for the peer as a feedback source and its impact on feedback perceptions, the study was conducted in the context of peer feedback. Finally, as questionnaires are likely to be used for subgroup comparisons, measurement invariance is crucial, because violations of invariance may preclude meaningful interpretation of compared data. Hence, we investigated three research questions: (a) Can peer feedback perceptions be measured adequately and robustly with the FPQ?, (b) Are perceived fairness, usefulness and acceptance predictive of willingness to improve and affect?, and (c) Is the FPQ invariant for two types of peer feedback, gender, four grade levels and two tracks?

Sample
The initial sample consisted of 1535 secondary education students in the Netherlands from 132 schools. There were 817 female and 713 male students (five students did not report their gender). Their mean age was 15.75 (SD = 1.19). The data were collected in classrooms from four grade levels (9-12) and two tracks (Senior general, Academic). In the Dutch educational system, 'track' represents students' cognitive ability and the curriculum, which is tailored to the track. In each school, data were collected in three classrooms, covering all disciplines ranging from arts to sciences; four students per classroom were randomly selected by their teacher to complete the questionnaire. Student participation was based on informed consent. In all, 422 schools were approached, of which 132 (31.28 %) agreed to participate, 222 (52.60 %) declined, and 68 (16.11 %) did not respond. The classroom response rate was 98.40 %, and the participating schools were spread across different regions in the Netherlands to avoid a bias towards urban areas.

Materials
As part of the large-scale questionnaire study, secondary school students were presented with a scenario in which a fictional student received feedback from a fictional peer. The scenario was embedded in the task of writing a formal (business) letter, which is part of the Dutch language curriculum and an authentic task for all students. In addition to the peer feedback, the students received the evaluation criteria for a (business) letter (main criteria: components, content, spelling and style) and a fictional 'letter assignment'. Two feedback scenarios were designed in line with Narciss' (2008) feedback classification. The feedback was either Concise General (CGF) or Elaborated Specific (ESF); both are very common in classroom feedback and constitute two extremes in terms of feedback content. CGF contained only general remarks regarding the performance, whereas ESF provided the position and type of each error, as well as information on how to proceed. To enhance comparability, ESF was constructed as an elaboration of CGF. Appendix A shows the CGF and ESF, and the error types according to the classification by Narciss (2006, 2008).

Measures
We used the multidimensional 18-item Feedback Perceptions Questionnaire (FPQ) by Strijbos et al. (2010). The FPQ measures feedback perceptions in terms of perceived Fairness (FA, 3 items), Usefulness (US, 3 items), Acceptance (AC, 3 items), Willingness to Improve (WI, 3 items) and Affect (AF, 6 items: three items measuring positive affect and three measuring negative affect). Items were measured on a 10 cm visual analogue scale from 0 (fully disagree) to 10 (fully agree). Negatively phrased items were recoded. All items were phrased using English, German and Dutch adjectives, and through translation and back-translation we ensured that all items addressed the same semantic aspects regardless of language. Appendix B displays the subscales and items of the FPQ.
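For illustration, the recoding of negatively phrased items on the 0-10 visual analogue scale can be sketched as follows (a minimal sketch; the function name and the example score are ours, not part of the FPQ materials):

```python
def recode_negative(score: float, scale_max: float = 10.0) -> float:
    """Reverse-score a negatively phrased item on a 0..scale_max scale."""
    if not 0.0 <= score <= scale_max:
        raise ValueError("score outside scale range")
    return scale_max - score

# A rating of 8 on a negatively phrased item becomes 2 after recoding,
# so that higher scores consistently indicate more positive perceptions.
print(recode_negative(8.0))  # → 2.0
```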

Procedure
Participating schools were informed prior to the study, and information about the study was provided in a letter to the parents. The secondary schools were visited by research assistants who distributed the questionnaire in the classrooms. The purpose of the study was explained beforehand, and students received written instructions on how to answer the questionnaire. Students created an anonymous code and were informed that participation was voluntary and that they could stop whenever they wanted. When presented with the scenario, students were asked to consider the peer feedback as if they had received it themselves, and to indicate how they perceived the peer feedback in terms of fairness, usefulness, acceptance, willingness to improve, and affect. Reading the scenario and answering the FPQ took about 10 min.

Data-analysis
The structural validation was conducted via Confirmatory Factor Analysis (CFA) using Structural Equation Modelling (SEM) to determine (a) the adequacy and robustness of the factor structure reported by Strijbos et al. (2010), (b) whether feedback perceptions in terms of fairness, usefulness and acceptance adequately predict willingness to improve and affect, and (c) whether the factor structure is invariant. CFA and SEM were performed in R version 3.6.1 with the lavaan package (Rosseel, 2012). Measurement invariance was tested in EQS version 6.1. We used the unbiased 'wishart' estimator in R to make the output as similar to EQS as possible (Rosseel, 2012).

Confirmatory factor analysis
To interpret a model's fit, it is customary to use strict criteria for excellent fit. However, as Browne and Cudeck (1992) have demonstrated, the sample size places a soft upper bound on the fit that can be attained. We use the typical indicators of excellent fit (e.g., Byrne, 1998), and use the sample-size-constrained soft upper bound as an indicator of adequate fit. As such, the following indicators were used: a Standardized Root Mean-square Residual (SRMR) and Root Mean Square Error of Approximation (RMSEA) below 0.10 are considered adequate fit and below 0.05 excellent fit, and Comparative Fit Index (CFI) scores above 0.90 indicate adequate fit and above 0.95 excellent fit. Moreover, we also used Gamma Hat (γ) to determine model fit. Gamma Hat does not penalize small or simple models as RMSEA does, and in the case of small models it is less sensitive to model misspecification than CFI (Fan & Sivo, 2007). Since the χ² statistic becomes increasingly unreliable with large samples (N > 250), it was not used as a criterion for model fit (Byrne, 1998; Putnick & Bornstein, 2016).
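These decision rules can be expressed as a small helper function (a sketch using the cutoffs stated above; the function name and example values are ours):

```python
def classify_fit(srmr: float, rmsea: float, cfi: float) -> str:
    """Classify model fit using the cutoffs adopted in this study:
    SRMR/RMSEA < .05 and CFI > .95 = excellent fit;
    SRMR/RMSEA < .10 and CFI > .90 = adequate fit; otherwise poor."""
    if srmr < 0.05 and rmsea < 0.05 and cfi > 0.95:
        return "excellent"
    if srmr < 0.10 and rmsea < 0.10 and cfi > 0.90:
        return "adequate"
    return "poor"

print(classify_fit(srmr=0.03, rmsea=0.04, cfi=0.97))  # → excellent
```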
Negatively worded items in scales can result in two-factor models, or in unidimensional models with an underestimate of reliability, due to statistical artifacts caused by wording effects. The fit and the reliability of scales with negatively worded items can be improved either by correlating the errors of the negatively or positively worded items, or by loading the negatively worded items on an (extra) negative wording-effect factor (DiStefano & Motl, 2009). In this study we correlated the error terms of the two negatively worded items in the AC subscale and of the three positively worded items in the AF subscale.
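In lavaan model syntax, such correlated residuals can be specified alongside the factor definitions. The following is an illustrative sketch only: the item labels (fa1, ac2, af1, etc.) and the choice of which items are negatively worded are hypothetical placeholders, not the actual FPQ item codes.

```r
# Hypothetical five-factor FPQ specification with correlated residuals
fpq_model <- '
  FA =~ fa1 + fa2 + fa3
  US =~ us1 + us2 + us3
  AC =~ ac1 + ac2 + ac3
  WI =~ wi1 + wi2 + wi3
  AF =~ af1 + af2 + af3 + af4 + af5 + af6

  # correlated errors of the two negatively worded AC items
  ac2 ~~ ac3
  # correlated errors of the three positively worded AF items
  af1 ~~ af2
  af1 ~~ af3
  af2 ~~ af3
'
# fit <- lavaan::cfa(fpq_model, data = fpq_data)
```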

Invariance tests
The FPQ was designed to compare populations that received different types of feedback (CGF or ESF), and it is to be expected that the FPQ will also be used to compare between genders, grade levels, and tracks. In order to ensure that such comparisons are valid, the invariance of the FPQ was tested for two types of peer feedback, gender, four grade levels and two tracks. Invariance relates to the requirement that quantitatively measured constructs have the same meaning across groups, and that group comparisons of sample estimates (e.g., means and variances) reflect true differences, free from contamination by group-specific attributes that are unrelated to the constructs of interest. The CFA framework allows invariance to be tested in a sequence of increasingly strict levels. The most basic form of invariance is dimensional (baseline) invariance, when the same number of factors can be found in all groups, regardless of the underlying configuration of items and factors. The most commonly tested form of invariance is configural invariance, in which it is tested whether the same factors are associated with the same items in all groups. Because factor intercorrelations and factor weights per item can differ across groups, configural invariance is insufficient to defend quantitative comparisons between groups (Gregorich, 2006). The next level of invariance is metric invariance (sometimes called weak factorial invariance), which tests whether the (common) factors have the same meaning across groups, expressed as equal factor loadings. The first level of invariance that is sufficient to defend quantitative group comparisons is scalar invariance (sometimes called strong factorial invariance), when factor loadings and factor intercorrelations are equal across groups (meaning that the common factors have an invariant meaning across populations) and there is no differential response bias between populations (i.e., equal item and factor intercepts). Scalar invariance allows for meaningful comparisons between populations because (1) group differences in estimated factor means are unbiased and (2) group differences in observed means will be directly related to group differences in factor means and will not be contaminated by differential additive response bias. Finally, the most stringent form of invariance is strict factorial invariance (sometimes called residual invariance), when item residuals are equal across populations. This allows researchers to meaningfully compare not only means between populations, but also their variance estimates (Gregorich, 2006). Whereas Cheung and Rensvold (2002) recommended ΔCFI to evaluate invariance because it outperformed Δχ², we also report ΔRMSEA and ΔSRMR in line with recent recommendations by Putnick and Bornstein (2016), and we additionally add ΔGamma Hat. We used the following critical values to assess invariance: -0.01 ΔCFI for metric and scalar invariance and -0.015 for strict invariance; 0.015 ΔRMSEA for metric and scalar invariance; and 0.030 ΔSRMR for metric invariance and 0.015 ΔSRMR for scalar invariance (Chen, 2007; Putnick & Bornstein, 2016). Since there is not yet consensus on a critical value for the change in Gamma Hat (γ), we adopted the -0.01 used for ΔCFI also for ΔGamma Hat. Finally, it should be noted that, akin to using multiple fit indices to determine model fit, none of the change indicators and proposed critical values reigns supreme (Putnick & Bornstein, 2016).
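The decision rule, comparing a more constrained model against the preceding less constrained one, can be sketched as follows (an illustration of the critical values above, interpreted as magnitudes of change; the function, dictionary keys and example fit values are ours):

```python
# Critical values for change in fit indices (Chen, 2007; Putnick & Bornstein,
# 2016), expressed as maximum allowed magnitude of deterioration.
CUTOFFS = {
    "metric": {"cfi": 0.010, "rmsea": 0.015, "srmr": 0.030, "gamma_hat": 0.010},
    "scalar": {"cfi": 0.010, "rmsea": 0.015, "srmr": 0.015, "gamma_hat": 0.010},
}

def invariance_holds(level: str, constrained: dict, unconstrained: dict) -> bool:
    """True if no change index exceeds its critical value when moving from the
    less constrained to the more constrained model. Each fit dict holds
    'cfi', 'rmsea', 'srmr' and 'gamma_hat' values."""
    c = CUTOFFS[level]
    return (
        unconstrained["cfi"] - constrained["cfi"] <= c["cfi"]            # drop in CFI
        and constrained["rmsea"] - unconstrained["rmsea"] <= c["rmsea"]  # rise in RMSEA
        and constrained["srmr"] - unconstrained["srmr"] <= c["srmr"]     # rise in SRMR
        and unconstrained["gamma_hat"] - constrained["gamma_hat"] <= c["gamma_hat"]
    )

configural = {"cfi": 0.970, "rmsea": 0.040, "srmr": 0.030, "gamma_hat": 0.980}
metric     = {"cfi": 0.965, "rmsea": 0.045, "srmr": 0.040, "gamma_hat": 0.975}
print(invariance_holds("metric", metric, configural))  # → True
```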

Data-inspection
All variables were examined for accuracy of data entry, missing values, and the fit of their distributions. Variables had between 16 and 19 missing cases (between 1.0 % and 1.2 %). No variable had more than 5 % missing values, and there was no pattern to the missing data (MCAR χ²(9) = 1.50, p = .997). Missing values were replaced by EM estimates (see Musil, Warner, Yobas, & Jones, 2002) based on all other variables in the dataset. No continuous variables deviated from the normal distribution. No univariate extreme cases (z > |3.00|) were found for US and AF. For the variables FA (N = 2), AC (N = 5) and WI (N = 10), extreme values ranged between -3.59 ≤ z ≤ 3.01. Forty-nine cases were identified via the Mahalanobis distance as extreme multivariate outliers (p < .001), about equally distributed across gender. These outliers were removed from subsequent analyses. The final sample consisted of 1486 students (796 female, 685 male, five missing; mean age = 15.74, SD = 1.19).
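The univariate screening step, flagging cases with |z| > 3.00, can be sketched as follows (a minimal pure-Python illustration with made-up ratings, not the study data):

```python
def z_scores(values):
    """Standardize a list of values using the mean and (population) SD."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def extreme_cases(values, cutoff=3.0):
    """Return the indices of univariate outliers with |z| > cutoff."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]

# Nineteen typical ratings plus one extreme rating of 0 on the 0-10 scale:
ratings = [6.0] * 19 + [0.0]
print(extreme_cases(ratings))  # → [19]
```

Multivariate outliers were additionally screened via the Mahalanobis distance, which generalizes this idea to the joint distribution of all variables.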

Descriptives
We computed scale means for the five subscales, which show a considerable degree of common variance, as reflected in moderate to high correlations (Table 2).

Confirmatory factor analysis
As a baseline test, a single-factor multilevel CFA with school as the clustering variable was run with all 18 items sharing one common factor. This yielded a very poor fit, χ²(270) […] greater than 10, which indicates essentially no support for the reversed model in comparison to the theoretical model.

Invariance tests
Tests for measurement invariance were conducted for FA + US + AC and for WI + AF on (a) type of feedback, (b) gender, (c) grade level, and (d) track. No multilevel models that included all five latent factors converged. Removing school as a cluster variable still yielded no converging models. For this reason, measurement invariance was tested without a multilevel component, separately for FA + US + AC and for WI + AF. Table 3 presents the findings for all tested models for FA + US + AC. Table 4 presents the findings for all tested models for WI + AF.

Invariance for type of peer feedback
Testing the hypothesized baseline model for FA + US + AC yielded an excellent fit in both the CGF (model F1a) and the ESF condition (model F1b). Incremental addition of constraints for equal factor loadings and covariances (model F3), and for observed variable intercepts (model F4), indicated scalar invariance. Constraints on all estimated error terms did not yield a satisfactory fit on all fit indices (model F5). Testing the hypothesized baseline model for WI + AF yielded an excellent fit in the CGF (model F6a) and the ESF condition (model F6b). Incremental addition of constraints on factor loadings and covariances, observed variable intercepts, and all estimated error terms showed an adequate fit (model F10), providing evidence for strict invariance on all fit indices.

Invariance for gender
Testing the hypothesized baseline model for FA + US + AC yielded an excellent fit for male (model G1a) and female (model G1b) students. Incremental addition of constraints provided evidence for strict invariance (model G5). The only fit index rejecting strict factorial invariance was ΔGamma Hat (0.011), which was slightly above our cutoff of -0.01; all other indicators were far below the cutoff thresholds. Testing the hypothesized baseline model for WI + AF yielded an excellent fit for male (model G6a) and female (model G6b) students. Incremental addition of constraints provided evidence for strict invariance (model G10). The only fit index rejecting scalar and strict factorial invariance was ΔGamma Hat (-0.012 and -0.013, respectively), which was slightly above our cutoff of -0.01; all other indicators were far below the cutoff thresholds.

Invariance for grade level
Testing the hypothesized baseline model for FA + US + AC yielded an excellent fit in Grade 9 (model L1a), Grade 10 (model L1b), Grade 11 (model L1c), and Grade 12 (model L1d). Incremental addition of constraints provided evidence for strict invariance (model L5). The only fit index rejecting strict factorial invariance was ΔGamma Hat (-0.014), which was above our cutoff of -0.01; all other indicators were far below the cutoff thresholds. Testing the hypothesized baseline model for WI + AF yielded an excellent fit for Grade 9 (model L6a), Grade 10 (model L6b), Grade 11 (model L6c), and Grade 12 (model L6d). Incremental addition of constraints provided evidence for strict invariance (model L10). The only fit index rejecting scalar factorial invariance was ΔGamma Hat (-0.017), which was well above our cutoff of -0.01; all other indicators, even Δχ², were far below the cutoff thresholds.

Invariance for track
Testing the hypothesized baseline model for FA + US + AC yielded an excellent fit in both the Senior general (model T1a) and the Academic track (model T1b). Incremental addition of constraints on factor loadings and covariances, observed variable intercepts, and all estimated error terms showed an adequate fit (model T5), providing evidence for strict invariance on all fit indices, including all chi-square tests, save for strict invariance, χ²(14) = 37.79, p < .001. Testing the hypothesized baseline model for WI + AF yielded an excellent fit in the Senior general (model T6a) and the Academic track (model T6b). Incremental addition of constraints on factor loadings and covariances, observed variable intercepts, and all estimated error terms showed an adequate fit (model T10), providing evidence for strict invariance on all fit indices, including all chi-square tests, save for scalar invariance, χ²(7) = 21.9, p = .003.

Discussion
In line with the increased interest in instructional feedback and the multidimensional view on feedback and feedback perceptions, we examined the psychometric quality of the Feedback Perceptions Questionnaire (FPQ; Strijbos et al., 2010). We focused specifically on the FPQ because it distinguishes two broad dimensions, namely perceptions that relate to the cognitive function of feedback (perceived fairness, usefulness and acceptance) and perceptions that relate to the motivational function of feedback while weighing affective reactions (willingness to improve and affect); because it measures feedback perceptions as a state-like phenomenon; and because it has been used in various studies investigating peer and teacher feedback. More specifically, we investigated (a) the adequacy and robustness of the FPQ, (b) whether perceived fairness (FA), usefulness (US) and acceptance (AC) predict willingness to improve (WI) and affect (AF), and (c) whether the FPQ is invariant for two types of peer feedback, gender, four grade levels and two tracks.
Perceived fairness, usefulness, acceptance, willingness to improve and affect were found to be correlated, yet distinct, measures comprising two dimensions: the FA + US + AC part of the FPQ relates to the cognitive function, whereas the WI + AF part relates to the motivational function. Perceived fairness, usefulness, and acceptance were confirmed as predictors of willingness to improve and affect, and this model was far more likely than the reverse. Perceived fairness is a strong positive predictor of willingness to improve and affect, whereas perceived usefulness and acceptance showed a more complex pattern. More specifically, a student perceiving the peer feedback as potentially useful will hold slightly more positive affect (0.12), but because the feedback also signals that the performance was not yet good enough, it reduces their willingness to improve (-0.38). Similarly, a student with an accepting perception of the peer feedback will be willing to improve their performance (0.86); however, doing so also implies acknowledging that the performance was not up to standard, which leads to more negative affect (-0.41). These results clearly underscore the need for, and added value of, a multidimensional view on feedback as well as feedback perceptions.
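This opposing-paths pattern can be made concrete with a purely arithmetic sketch using the standardized coefficients reported above (the fairness coefficients are not restated in this section, so only the usefulness and acceptance paths are used; in a linear structural model the effects of simultaneous one-SD increases are additive):

```python
# Standardized path coefficients reported above (see also Fig. 1):
paths = {
    ("usefulness", "affect"): 0.12,
    ("usefulness", "willingness"): -0.38,
    ("acceptance", "willingness"): 0.86,
    ("acceptance", "affect"): -0.41,
}

# For a student perceiving the feedback as both useful and acceptable,
# the opposing paths partially cancel for each outcome:
net_willingness = (paths[("usefulness", "willingness")]
                   + paths[("acceptance", "willingness")])
net_affect = (paths[("usefulness", "affect")]
              + paths[("acceptance", "affect")])
print(round(net_willingness, 2), round(net_affect, 2))  # 0.48 -0.29
```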
Furthermore, we found strict factorial invariance for gender, grade level and track for both the FA + US + AC and the WI + AF part of the FPQ; for type of peer feedback, FA + US + AC reached scalar invariance and WI + AF strict invariance. Both parts are thus not only robust, but also similarly interpreted across peer feedback types, gender, grade levels and tracks. Moreover, the scalar level of invariance for types of peer feedback, gender, grade level and track implies for researchers in the social sciences that comparisons of group means are meaningful; scalar invariance is statistically important to meaningfully defend comparisons of factor and observed means. The strict invariance of FA + US + AC for gender, grade level and track makes not only comparisons of group means defensible, but also comparisons of observed variances and covariances; the same holds for WI + AF, which was strictly invariant for types of peer feedback, gender, grade level and track. Although strict invariance is deemed desirable for quantitative comparisons, it is usually considered excessively stringent (Byrne, 2006), and scalar invariance is a more readily attainable goal (Gregorich, 2006) that is in most cases adequate for comparative research. Moreover, to counter statistical artifacts due to wording effects and an underestimate of reliability, we correlated the errors for two negatively worded items in the AC subscale and for three positively worded items in the AF subscale (cf. DiStefano & Motl, 2009). In sum, our results underline that students' peer feedback perceptions-in terms of FA, US, and AC, and WI and AF-can be adequately and robustly measured by the FPQ.
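The invariance hierarchy invoked in this discussion, and the cross-group comparisons each level licenses, can be summarized schematically (a sketch; the names and structure are ours, the substance follows the discussion above):

```python
# Each invariance level constrains one more parameter set equal across groups:
INVARIANCE_LEVELS = {
    "configural": [],                                  # same factor structure only
    "metric": ["loadings"],                            # equal factor loadings
    "scalar": ["loadings", "intercepts"],              # + equal item intercepts
    "strict": ["loadings", "intercepts", "residuals"], # + equal error terms
}

def permitted_comparisons(level):
    """Cross-group comparisons each level defends, per the discussion above."""
    return {
        "scalar": "factor and observed means",
        "strict": "means plus observed variances and covariances",
    }.get(level, "model structure only")

print(permitted_comparisons("scalar"))  # factor and observed means
```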

Limitations
It should be noted that the reported structural validity and invariance are, strictly speaking, limited to peer feedback and to students in secondary education as the feedback recipients. However, as the items are not source specific, we consider it likely that a similar pattern in terms of estimates and invariance would be observed with the teacher as the feedback source or in other educational contexts, especially given the scalar invariance and, in 7 out of 8 cases, even strict factorial invariance. Nevertheless, some factor loadings of individual items are somewhat lower than preferred, and these signal areas for further improving the FPQ. Future research could examine the psychometric quality of the FPQ in the case of teacher feedback to (partially) replicate our findings. Likewise, the study could be repeated in higher education and, for example, compare responses by students from different disciplines.
We acknowledge that the use of scenarios might be considered artificial; however, scenarios have been shown to evoke reactions that are almost identical to those in real situations (Robinson & Clore, 2001). In fact, a scenario's proximity to real-life settings and the opportunity to (at least partially) control contextual variables contribute to scenarios' high internal and external validity (Atzmüller & Steiner, 2010). Moreover, given the vast diversity in external feedback content, intrapersonal, interpersonal and situational factors, and the associated diversity in feedback perceptions, scenarios including worked-out task exemplars with typical errors offer more control for investigating feedback perceptions and feedback effects (Narciss, 2013). Finally, the present study examined only one scenario due to time constraints. Future research could include multiple scenarios (see e.g., Atzmüller & Steiner, 2010), for example, comparing peer and teacher feedback. Future research could also reaffirm the FPQ's structural validity and measurement invariance when measuring feedback perceptions of a recipient's actual own performance, thus enhancing ecological validity (enhanced realism), albeit to some extent at the expense of internal validity.

Implications for practice
Given the increased interest in (peer) feedback, students' feedback perceptions could be a crucial determinant of how they process the (peer) feedback and could possibly help to uncover why elaborated (peer) feedback types are not always more efficient (see Berndt et al., 2018; Raemdonck & Strijbos, 2013; Strijbos et al., 2010). Moreover, due to the state-like measurement of feedback perceptions, the FPQ can directly inform teachers on how feedback by a specific peer on a specific task is perceived by a particular student, or how feedback by a teacher on a particular student's task performance is perceived. Although the findings of the present study are limited to a setting with peer feedback, the established degree of invariance shows promise for obtaining comparable psychometric quality in a future study of the FPQ using teacher feedback. In sum, the FPQ offers researchers a structurally valid, reliable and invariant questionnaire to investigate relations between (peer) feedback perceptions, performance and feedback efficiency, and offers teachers a questionnaire to assess student responses to (peer) feedback and adjust instructional support accordingly.

Declaration of Competing Interest
None.

Appendix A. Concise General Feedback (CGF) and Elaborated Specific Feedback (ESF) components

Components
- There are errors in technical components, like in the letter head. [KR + KM]
- Also the address and date are not written correctly. [KR + KM]
- The paragraphs are not neatly aligned, although this should be the case. [KR + KM + KH (implicit)]

Content
- Everything that should be in the letter is included, although it is a bit extensive. [KR + KM (general)]
- Sometimes things are included that do not apply, such as the bank account number. [KR + KM + KH (implicit)]
- Marieke does not ask for a financial refund, thus it is not necessary to write the bank account number. [KR + KM + KH]

Spelling
- There are spelling errors in the letter. [KR + KM (general)]
- In the last sentence of the first paragraph, for example, the word "voltooid" is written with a t and not with a d, even though the present perfect applies. [KR + KM + KH (implicit)]

Style
- Sometimes there are style errors in the letter. [KR + KM (general)]
- In business letters, for example, you cannot use the "&" symbol. [KR + KM + KH (implicit)]
- You also cannot start a sentence with "And". [KR + KM + KH (implicit)]
- It is better not to write "Again a disappointing telephone conversation". [KR + KM + KH]

Note. KR (knowledge of result/response), KM (knowledge of mistakes) and KH (knowledge on how to proceed). See Narciss (2006, 2008) for more detail.

Fig. 1. Structural equation model of perceived fairness, usefulness and acceptance as predictors of willingness to improve and affect.

Table 1
Overview of existing questionnaires and psychometric indicators for quality of measurement.

Table 2
Means, standard deviations and correlations between fairness, usefulness, acceptance, willingness to improve, and affect (p < .01).

Feedback Perceptions Questionnaire (FPQ) subscales, items and Cronbach's alpha for the present study
Note. In addition to the passive phrasing suited for a scenario study (i.e., "would"), actively phrased items (i.e., "will") for the measurement of feedback perceptions on one's own performance in a classroom setting can be obtained from the first author (it should be noted that these items are not yet validated).