A systematic review of effective quality feedback measurement tools used in clinical skills assessment

Background: Objective Structured Clinical Examination (OSCE) is a valid tool to assess the clinical skills of medical students. Feedback after OSCE is essential for student improvement and safe clinical practice. Many examiners do not provide helpful or insightful feedback in the text space provided after OSCE stations, which may adversely affect learning outcomes. The aim of this systematic review was to identify the best determinants for quality written feedback in the field of medicine. Methods: PubMed, Medline, Embase, CINAHL, Scopus, and Web of Science were searched for relevant literature up to February 2021. We included studies that described the quality of good/effective feedback in clinical skills assessment in the field of medicine. Four independent reviewers extracted determinants used to assess the quality of written feedback. The percentage agreement and kappa coefficients were calculated for each determinant. The ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) tool was used to assess the risk of bias. Results: 14 studies were included in this systematic review. 10 determinants were identified for assessing feedback. The determinants with the highest agreement among reviewers were specific, described gap, balanced, constructive and behavioural, with kappa values of 0.79, 0.45, 0.33, 0.33 and 0.26 respectively. All other determinants had low agreement (kappa values below 0.22), indicating that even though they have been used in the literature, they might not be applicable for good quality feedback. The risk of bias was low or moderate overall. Conclusions: This work suggests that good quality written feedback should be specific, balanced, and constructive in nature, and should describe the gap in student learning as well as observed behavioural actions in the exams. Integrating these determinants in OSCE assessment will help guide and support educators in providing effective feedback for the learner.


Introduction
During their undergraduate education, medical and health sciences students are subjected to numerous clinical practical assessments in order to evaluate their performance 1 . Feedback is a fundamental and important learning tool in medical education 2 . Good and effective feedback assists students in accomplishing both learning and professional development, enhancing student motivation and satisfaction [3][4][5] . The Objective Structured Clinical Examination (OSCE) is a commonly utilized clinical skills assessment in medical and health sciences that has a positive impact on medical education 6 . The OSCE is useful in the field of medicine for evaluating student performance for a variety of reasons: it simulates the realities of clinical practice, enhancing students' confidence and ensuring safe clinical practice, with assessment based on objective determinants 7-10 . The OSCE is a valid and reliable assessment tool in a variety of fields, including medicine 9,[11][12][13][14][15] . During the OSCE, examiners are requested to input the students' observed marks on score sheets (without knowing total marks) and can also provide their professional opinion on students' performance using the Global Rating Scale (GRS) (Fail, Borderline, Pass, Good, Excellent) based on experience 16,17 . Previous research has shown a mismatch between observed marks and GRS 18 . For example, a student may score a high result in the observation section but receive a 'fail' on the Global Rating Scale, and potentially vice versa.
However, written feedback for OSCE is optional, despite previous research showing it to have a significant positive impact on students' learning outcomes 19,20 . It is argued that many examiners may find it difficult to offer detailed or useful written feedback during OSCE evaluation 21,22 due to time constraints as well as a 'judgement dilemma' of not knowing how much feedback or the type of feedback to give 23 .
Even if written feedback is provided, to date there is no recognised objective measurement scoring tool that measures the quality of written feedback from the OSCE. Measuring the quality of written feedback will help examiners to improve their skills in feedback delivery, as well as encourage students to understand the OSCE marks they received and where they can improve in the future. In other fields of education, feedback quality measurement tools are used effectively to improve the quality of written feedback for students [24][25][26] . In order to develop such a tool, it is necessary to identify the determinants that result in effective written feedback. The main objectives of this systematic review are to identify and evaluate studies that have measured written feedback quality.

Study design
This is a comprehensive systematic review to identify the most relevant determinants that describe good and effective written feedback in the field of medicine. Measurement tools that measure feedback quality both quantitatively and qualitatively were included.
We sought assistance from a university librarian to enhance our search strategy. The reference section of initially selected studies was also searched thoroughly for any additional relevant publications. A bibliographical database was created to store and manage the references.

Selection of articles
Each author independently screened retrieved articles against inclusion and exclusion criteria, and as a team agreed on the included studies. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines 27 , only studies written in the English language, published in the last 10 years in the field of medicine were included. Both quantitative and qualitative studies were included. Studies were included if they described the quality of good/effective feedback in clinical skills assessment, or attempted to evaluate the quality of feedback in clinical skills assessment (e.g., OSCE), or described the quality of written feedback by enumerating determinants of effective feedback involving undergraduate students and postgraduate trainees. Exclusion criteria included papers not written in English language, case reports, 'grey literature' (which includes conference proceeding studies) and commentaries. Due to different cognitive demands and scopes of practice, publications relating to nursing, paramedical disciplines, pharmacy, and veterinary education were excluded (Figure 1). Reference lists of included studies were also explored to identify any additional studies.

Amendments from Version 1
In this updated version of the manuscript "A Systematic Review of Effective Quality Feedback Measurement Tools Used in Clinical Skills Assessment", we have made several revisions to address the reviewers' feedback.
Firstly, we have clarified our study scoring methodology in the paper. The process involves a panel of four reviewers independently evaluating each study. If a determinant was explicitly addressed within the study, a plus score (+) was assigned, with multiple plus scores (++ or +++) indicating a stronger presence of a determinant. Conversely, an unaddressed or absent determinant was assigned a minus score (-), with multiple minus scores (-- or ---) indicating a noteworthy absence. Each study received an aggregate score based on these evaluations. This revised description enhances the transparency of our scoring methodology.
Secondly, we improved the clarity of the caption for Figure 1. The revised caption now reads: "Flow diagram illustrating the study selection process employed in this systematic review, guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines 27 ." This change provides a clearer explanation of the figure's content and its relation to the PRISMA guidelines.
Lastly, we added a summarizing statement about what the study brings to the field. We underscored the importance of the quality of written feedback in clinical skills assessments, highlighting its distinctive value as an independent instructional tool. Even though written feedback does not involve direct conversations with learners, it still requires specificity, constructiveness, and actionability to guide learners' self-improvement effectively. This statement illuminates the importance of ensuring these parameters in written feedback, reinforcing its role in the holistic educational experience of learners in the medical field.
These updates have strengthened our manuscript and clarified points that were previously ambiguous or lacking detail.

Data extraction
Four independent reviewers (A.A, D.L, M.N and T.K) identified and extracted the determinants used to evaluate written feedback from the included studies (Table 1). An accumulated list of identified determinants and their respective definitions was compiled. Each reviewer then scored each of the included studies against the accumulated list of determinants. The approach involved a panel of four reviewers who independently evaluated each study. A determinant was assigned a plus score (+) if it was explicitly addressed within the study, with multiple plus scores (++ or +++) for a stronger presence of a determinant. Conversely, an unaddressed or absent determinant was assigned a minus score (-), with multiple minus scores (-- or ---) if the absence was particularly noteworthy. This led to each study receiving an aggregate score.

Table 1. The 10 determinants of feedback quality measurement identified.

No. | Determinant of feedback measurement | General description
1 | Specific | Detailed information of what was done well or poorly.
2 | Balanced | Contains both positive and negative comments.
3 | Behavioural | Observed action in exam (not personal).
4 | Timely | Given immediately after assessment is completed.
5 | Constructive | Supportive feedback identifying a solution to area of weakness they may have.
6 | Quantifiable | Feedback that can be used to develop detailed statistical data.
7 | Focused | Feedback that is given around key results.
8 | Described task | Focuses on the knowledge and skills associated with a task: sufficient or insufficient.
9 | Described gap | Detailed about what is missing in the task.
10 | Described action plan | Detailed plan of action needed to reach one or more goals.
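The plus/minus scoring scheme in this section can be illustrated with a short sketch. The text does not state how marks were combined into an aggregate score, so the rule used here (each "+" counts as +1, each "-" as -1, summed over determinants) is an assumption made purely for illustration, and the function names and example marks are hypothetical.

```python
def symbol_score(mark: str) -> int:
    """Convert a mark such as '++' or '---' into a signed integer.

    Assumption (not stated in the review): each '+' is worth +1
    and each '-' is worth -1.
    """
    mark = mark.strip()
    if not mark or set(mark) not in ({"+"}, {"-"}):
        raise ValueError(f"mark must consist only of '+' or only of '-': {mark!r}")
    sign = 1 if mark[0] == "+" else -1
    return sign * len(mark)


def aggregate_score(marks: dict[str, str]) -> int:
    """Sum the signed scores over all determinants for one study."""
    return sum(symbol_score(mark) for mark in marks.values())


# Hypothetical marks from one reviewer for one study.
example_marks = {"Specific": "+++", "Balanced": "+", "Timely": "--"}
print(aggregate_score(example_marks))
```

Other aggregation rules (for example, averaging across the four reviewers before summing) would be equally compatible with the description given.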

Statistical analysis
Level of agreement scores (%) and Kappa coefficients between the reviewers were calculated for each determinant included. Both level of agreement and kappa were measured using an online calculator (http://justusrandolph.net/kappa/) 28 . Percentage agreement does not account for agreement that would be expected by chance; kappa corrects for this. The average Kappa coefficients were interpreted as follows: <0 indicates no agreement, 0.01-0.20 indicates slight or poor agreement, 0.21-0.40 indicates fair agreement, 0.41-0.60 indicates moderate agreement, 0.61-0.80 indicates substantial agreement, and 0.81-1.00 indicates almost perfect agreement 29 . Determinants with the highest Kappa were identified as being most useful for providing written feedback for OSCE. In addition, included studies were assessed to identify which studies were best for measuring the quality of feedback.
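To make the relationship between percentage agreement and kappa concrete, the following is a minimal sketch of a multirater agreement calculation. It is not the online calculator's code, and the review does not say which kappa variant was reported, so both the free-marginal form (chance agreement fixed at 1/k for k categories) and the fixed-marginal, Fleiss-style form (chance agreement estimated from observed category proportions) are shown. The function name and the example ratings are hypothetical.

```python
from collections import Counter


def multirater_agreement(ratings, categories):
    """Overall agreement and two chance-corrected kappas for several raters.

    ratings: one list per rated item, holding one label per rater.
    categories: all labels a rater could have chosen (at least two).
    """
    if not ratings or len(categories) < 2:
        raise ValueError("need at least one item and two categories")
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters")

    totals = Counter()
    per_item = []
    for item in ratings:
        if len(item) != n_raters or any(label not in categories for label in item):
            raise ValueError("every item needs one valid label per rater")
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_observed = sum(per_item) / len(ratings)
    n_total = len(ratings) * n_raters

    # Free-marginal: chance agreement assumes all categories equally likely.
    p_chance_free = 1 / len(categories)
    # Fixed-marginal: chance agreement uses the observed category proportions.
    p_chance_fixed = sum((totals[c] / n_total) ** 2 for c in categories)

    kappa_free = (p_observed - p_chance_free) / (1 - p_chance_free)
    kappa_fixed = (
        None  # undefined when every rating falls in a single category
        if p_chance_fixed == 1
        else (p_observed - p_chance_fixed) / (1 - p_chance_fixed)
    )
    return {
        "percent_agreement": 100 * p_observed,
        "kappa_free": kappa_free,
        "kappa_fixed": kappa_fixed,
    }


# Hypothetical data: four raters judging one determinant across four studies.
example = [
    ["present", "present", "present", "present"],
    ["present", "present", "present", "absent"],
    ["absent", "absent", "absent", "absent"],
    ["present", "present", "absent", "absent"],
]
result = multirater_agreement(example, ["present", "absent"])
print(result)
```

In this example the raw agreement is noticeably higher than either kappa, which is the gap that chance correction is meant to expose.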

Risk of bias and certainty assessment
Two independent reviewers used the ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) tool to assess the risk of bias of each included study (Table 3). Confounding, selection, classification, intervention, missing data, measurement, and reporting were all checked for bias. The ROBINS-I was used to assess the certainty in the body of evidence in the context of GRADE's (Grading of Recommendations, Assessment, Development and Evaluations) approach 30,31 .
When there were any conflicts, the entire review team was consulted, and the disagreements were then addressed by consensus.

Search results
The initial search yielded 2441 studies (Figure 1). After the duplicates were removed, 1330 studies remained. 1290 studies were found to be irrelevant to the main topic after scanning the title and abstract. The 40 remaining articles were thoroughly evaluated by reading the full text. A further 26 articles 2,7,21-23,32-44 were removed leaving 14 studies for inclusion in this systematic review ( Figure 1).
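The selection counts reported here can be checked with simple arithmetic. The sketch below uses only the four stage totals given in the text; the stage labels and the function name are illustrative, and the number of duplicates removed is derived rather than reported.

```python
# Stage totals as reported in the search results paragraph.
stages = [
    ("records identified", 2441),
    ("after duplicates removed", 1330),
    ("full texts assessed", 40),
    ("studies included", 14),
]


def removed_between_stages(stages):
    """Return how many records dropped out between consecutive stages."""
    drops = []
    for (name_a, count_a), (name_b, count_b) in zip(stages, stages[1:]):
        if count_b > count_a:
            raise ValueError(f"count increases from {name_a!r} to {name_b!r}")
        drops.append((f"{name_a} -> {name_b}", count_a - count_b))
    return drops


for step, removed in removed_between_stages(stages):
    print(f"{step}: {removed} removed")
```

The second and third differences match the 1290 records excluded at title and abstract screening and the 26 articles removed after full-text review.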

Determinants identified
A total of 10 determinants to assess the quality of written feedback were identified from the combined 14 studies (Table 1).
Each reviewer then scored each of the 14 included studies against the accumulated list of determinants (Table 2).
The number of determinants identified in the individual studies ranged from 7 to 10. The determinants with the highest agreement (kappa values) among reviewers were Specific (0.79, substantial agreement), Described gap (0.45, moderate agreement), Balanced (0.33, fair agreement), Constructive (0.33, fair agreement), and Behavioural (0.26, fair agreement). All other determinants had low agreement (kappa values below 0.21, slight or poor agreement), indicating that even though they have been used in the literature, they might not be applicable for good quality feedback. The identified determinants with highest level of agreement among reviewers were included in seven of the ten studies, of which the study by Abraham et al. 19 had the highest level of agreement 19,28,32,33,35,38,39 .
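The agreement labels attached to these kappa values follow the interpretation bands given under Statistical analysis. A small helper, sketched below, applies those bands. The function name is hypothetical, and because the stated bands leave values between 0 and 0.01 unassigned, treating them as "slight or poor" is an assumption.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa coefficient to the agreement bands used in this review."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:  # assumption: values from 0 to 0.01 fall in this band
        return "slight or poor agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"


# Kappa values reported for the five highest-agreement determinants.
reported = {
    "Specific": 0.79,
    "Described gap": 0.45,
    "Balanced": 0.33,
    "Constructive": 0.33,
    "Behavioural": 0.26,
}
for determinant, kappa in reported.items():
    print(f"{determinant}: {kappa} ({interpret_kappa(kappa)})")
```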

Risk of bias
We used the ROBINS-I tool to assess the risk of bias across seven domains: confounding bias, selection bias, classification bias, intervention bias, bias due to missing data, measurement bias, and reporting bias (Table 3). Almost all the studies included had low confounding, selection, and measurement biases. The overall risk of bias was low to moderate for the included studies, which is understandable considering the non-randomized character of the research and dependence on self-reported measures. The remaining post-intervention biases were variable, ranging from mild to moderate.

Certainty in body of evidence
In systematic reviews, the GRADE working group has created a widely accepted approach to evaluate the certainty of a body of evidence based on a four-level system: high, moderate, low and very low. The current GRADE strategy for a body of evidence linked to interventions starts by categorizing studies into one of two groups: randomized controlled trials (RCT) or observational studies (also called non-randomized studies, or NRS). The body of evidence begins with high certainty if the relevant studies are randomized trials, and with low certainty if the relevant studies are observational 31,58 .

Discussion
The aim of this systematic review was to evaluate studies that have measured the quality of written feedback for clinical exams and identify which determinants should be used to provide quality written feedback. Improving the quality of written feedback for students in the field of medicine will improve student performance.
Four independent researchers critically appraised 14 studies using 10 identified determinants. The five determinants with the highest Kappa values were: (1) Specific, (2) Described gap, (3) Balanced, (4) Constructive, and (5) Behavioural. The other five determinants were deemed to be lacking in agreement, suggesting some confusion and complexity amongst the reviewers in ascertaining them. Hence, these determinants may not be good elements for qualitatively measuring feedback quality.
The number of determinants in each individual study ranged from seven to ten. The five key determinants appeared in seven of the ten studies in this systematic review. It may be worthwhile to consider including these five key determinants in feedback and performance assessments.
Feedback delivery is influenced by a number of factors. One of them, according to research, is that examiners may lack the ability to translate their observations into detailed, non-judgmental, and constructive feedback 3,60 . As a result, feedback can be ambiguous and meaningless to students seeking to improve their performance 60 .
Effective feedback tools, from the perspective of educators, should include determinants that aid in the learning process, such as helping students comprehend their subject area and providing clear guidance on how to enhance their learning. Structuring feedback by using the five identified determinants will improve alignment between GRS and observed marks. That will lead to a better understanding of how the GRS relates to observed marks.
Developing a digital tool to evaluate written feedback from OSCE will help in cases where there is a discrepancy between observed marks and the Global Rating Scale result. In these cases, written feedback can be used to inform a pass/fail decision. This could demonstrate the significance of feedback in decision-making, as well as how written feedback is viewed as a learning tool that leads to improvements in student performance. The five identified determinants with the highest kappa values could be used as a method to quantify written feedback. Tutors and educators should be made aware of these determinants prior to the OSCE so that they provide beneficial feedback to students. Having a structured comments section could also help overcome the writing challenges tutors currently face when marking the OSCE.
Further study is needed to categorize determinants and sub classify them as part of a quantification approach. This digital measurement tool in medical education will help improve students' performance and knowledge acquisition.
This systematic review had some limitations. For example, grey literature was not included in the study and we reviewed only English studies which may mean results are not generalizable.
Another limitation is the focus on written feedback in one type of clinical skills assessment (OSCE). Future research should also consider other training feedback in postgraduate training as well as undergraduate training.

Conclusion
This work suggests that good quality written feedback should be specific, balanced, and constructive in nature, and should describe the gap in student learning as well as observed behavioural actions in the exams. Integrating these five core determinants in OSCE assessment will help guide and support educators in providing effective and actionable feedback for the learner. Lastly, this study underscores the importance of the quality of written feedback in clinical skills assessments, highlighting its distinctive value as an independent instructional tool. While many studies have evaluated verbal feedback, this research brings written feedback to the fore. Although it does not involve direct conversations with learners, written feedback still demands specificity, constructiveness, and actionability to guide learners' self-improvement effectively. Our work illuminates the criticality of ensuring these parameters in written feedback, reinforcing its role in the holistic educational experience of learners in the medical field.

Data availability
All data underlying the results are available as part of the article and no additional source data are required.

improving the quality of written feedback and also using tools to evaluate the quality of these narratives. I agree that faculty training around writing high quality and impactful narratives would be much more useful to learners than simple ratings. Specificity of narratives would guide learners in self-reflection and formulating performance improvement plans. Appropriate methodology and data collection for the purpose of the study, followed guidelines for systematic review and reaching their conclusions.

Reporting guidelines
Only minor comments:
○ use of words like insightful feedback will raise the question: insightful as perceived by whom and to whom. Perhaps actionable is more appropriate.
○ 'measurement' of the quality of written feedback. The implication is that only numbers can be 'objective'. This seems contradictory when we are talking about written feedback being more meaningful than just numerical ratings
○ Lastly, I would like to have seen a sentence on what this study adds to the field. Many studies on verbal feedback have included all these criteria of quality: specific, constructive, actionable etc. Just a sentence that these have been stated for verbal feedback conversations and equally important when clinical teachers are required to provide written or narrative feedback where they are not having conversations with learners.

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes
Are sufficient details of the methods and analysis provided to allow replication by others? Yes

Is the statistical analysis and its interpretation appropriate? Yes
Are the conclusions drawn adequately supported by the results presented in the review? Yes

Reina Abraham
Nelson Mandela School of Medicine, University of KwaZulu-Natal, Durban, KwaZulu-Natal, South Africa
Thank you for requesting me to review this paper: A systematic review of effective quality feedback measurement tools used in clinical skills assessment.
This was an interesting article evaluating studies that have measured the quality of written feedback in clinical exams to identify determinants for providing quality written feedback. This is significant because developing effective feedback quality measurement tools that include these determinants will improve the quality of written feedback provided by educators, thereby assisting students in their learning process to improve their clinical skills performance.
The introduction builds a logical case and context for the problem statement. The problem statement is well articulated and the rationale for, and objectives of, the Systematic Review are clearly stated. The literature review is up to date, well integrated and critically appraised.
The study design is stated and is suitable for a systematic review. The search strategy using the PRISMA guidelines is described. Please provide a rationale for why you are limiting your search from Jan 2010? Why not search all? The publication sources and data extraction process are sufficiently described and referenced. Measures to ensure quality of the publication sources are explained.
The statistical analysis is quite robust, and its interpretation is reported correctly and appropriately.
The search results are organized and easy to understand. The determinants of quality feedback are clearly indicated and appropriate. However, the way the reviewers came about scoring each of the 14 studies against the list of determinants in Table 2, as one +, two +, three +, four + or one -ve, two -ve etc is not explicitly described. Two tables and a figure are presented and agree with the text. I would suggest rephrasing the text for figure 1: Flow diagram showing study selection process based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines 27.
The discussion is in the same order as the result findings. The feedback quality determinants are correctly interpreted and compared with the literature. The conclusion is drawn adequately and is supported by the results presented in the systematic review. The study limitations are explained. Practical significance or theoretical implications are discussed and guidance for future studies is offered.
Title and abstract -The title is representative of the content and breadth of the study. The title captures the importance of the study and the attention of the reader. The abstract is complete. The results in the abstract are presented in sufficient detail. The conclusion in the abstract is justified by the information in the abstract and the text. No inconsistencies between the abstract and the text.

Is the statistical analysis and its interpretation appropriate? Yes
Are the conclusions drawn adequately supported by the results presented in the review? Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 10 Jun 2023

Dear Dr Reina Abraham
We are grateful for your insightful feedback and recommendations on our manuscript "A Systematic Review of Effective Quality Feedback Measurement Tools Used in Clinical Skills Assessment". Your comments have been instrumental in helping us improve our work.
In response to your query regarding our decision to limit our search to papers from January 2010 onward, our primary aim was to capture the most recent trends and developments in written feedback for clinical skills assessments. We chose this timeframe based on preliminary research indicating a significant surge in publications related to our topic of