Formative assessments during COVID-19 pandemic: an observational study on performance and experiences of medical students

Background: Because of COVID-19, the 2020 written medical examinations were replaced by mandatory formative online assessments. This study aimed to determine students’ performance, self-assessment of performance, and perception about the switch from a summative to a formative approach.
Methods: Medical students from year 2 to 5 (n=648) were included. They could repeat each test once or twice. They rated their performance after each attempt and were then given their score. Detailed feedback was given at the end of the session. An online survey determined medical students’ perception about the reorganization of education. Two items concerned the switch from summative to formative assessments.
Results: Formative assessments involved 2385 examinees totaling 3197 attempts. Among examinees, 30.8% made at least 2 attempts. Scores increased significantly at the second attempt (median 9.4, IQR 10.8), and duration decreased (median -31.0, IQR 48.0). More than half of examinees (54.6%) underestimated their score, female students more often than male. Low performers overestimated, while high performers underestimated their scores. Students approved of the switch to formative assessments. Stress was lessened but motivation for learning decreased.
Conclusions: Medical students’ better scores at a second attempt support a benefit of detailed feedback, learning time and re-test opportunity on performance. Decreased learning motivation and a minority of students repeating the formative assessments point to the positive influence of summative assessment on learning.


Introduction
The Faculty of Medicine at the University of Geneva, like most medical schools internationally, had to adapt rapidly to the changes brought about by the COVID-19 pandemic, and set up online teaching (Dost et al., 2020; LeBlanc III, 2022; Stoehr et al., 2021). With regard to assessment, most medical schools decided to cancel summative examinations or to defer them until the end of lockdown (Jodheea-Jutton, 2021). Others resorted to different approaches to administer summative exams such as online random multiple-choice questions (MCQ) and open-book examination (Birch & de Wolf, 2020; Pather et al., 2020). In March 2020, the Geneva Faculty of Medicine decided to cancel all the summative oral and written examinations, except for the high-stakes selection examination taking place at the end of the first year. The decision was based on educational, ethical, and public health-related considerations. There was too little time to organize valid and reliable distance summative assessments. As medical students were called upon for clinical and logistical work during the health crisis, equity towards students engaged in much-needed volunteer activities was also a concern.
The main aim of summative assessment is to evaluate whether the required skills and knowledge have been acquired at some definite timepoint. Summative assessments are usually high-stakes as they are used for promotion decisions. Formative assessment is a process of evaluation and feedback. The objective is to support and accompany the learning process. Formative assessments are typically low-stakes, ongoing, and seek to determine where the students are in their learning, what they need to improve and how they can do it (Dixson & Worrell, 2016; Dolin et al., 2018).
Any kind of assessment, however, contributes to and drives learning (Crossley et al., 2002; Schuwirth & Van der Vleuten, 2011; van der Vleuten & Schuwirth, 2005). Formative assessments are described as having a positive impact on learning in medical education (Dunlosky et al., 2013; Roediger & Karpicke, 2006), especially when they are associated with detailed feedback (Agius & Wilkinson, 2014). Despite the constraints associated with the COVID-19 pandemic, maintaining some form of assessment seemed crucial to evaluate students' competencies, to stimulate and support their learning, and as a benchmark for students to measure their progress. Therefore, the summative examinations at the Geneva Faculty of Medicine were replaced with mandatory formative online assessments (i.e. students were required to participate, but not required to pass).
There is a paucity of data regarding students' performance and perception when summative examinations were substituted with formative assessments in response to COVID-19. In one study, first-year medical students felt that formative online tests were useful for their learning (Snekalatha et al., 2021). The aim of this study was to determine medical students' performance at the formative assessments, their self-assessment of performance and their perception about the switch from a summative to a formative approach.
In June 2020, an online survey was conducted among all the medical students from year 2 to 6 to determine how students were organizing their activities, and the impact of the pandemic on their personal life, training and professional identity (Wurth et al., 2021). The current study focused on formative assessment during the COVID-19 pandemic in the same cohorts of medical students.

Methods
We conducted an observational study including all the medical students from year 2 to 5 (n=648) of the Geneva Faculty of Medicine affected by the switch from written summative exams to formative assessments.
At the Geneva Faculty of Medicine, the 6-year curriculum is divided into Bachelor years (1 to 3) and Master years (4 to 6). Year 6 is a clinical year. After completing their clinical clerkships, students sit the Swiss federal licensing exam. The medical students from year 2 to 5 were affected by the online formative assessments and were included in the study (Table 1). Participation was mandatory to obtain the ECTS (European Credit Transfer and Accumulation System) credits and validate the current year.

Amendments from Version 2
In this revised version of the manuscript, we added a table in the Methods section describing the written formative assessments (disciplines) during Bachelor and Master years.
We modified Figure 5: the pie chart was replaced by a bar chart.
Finally, in the Discussion section, we changed the order of the strengths and limitations, starting with the strengths followed by the limitations.
Any further responses from the reviewers can be found at the end of the article.
At the end of each teaching unit, a written and/or oral exam is given to medical students. During the pandemic, all the written MCQ exams were maintained as formative assessments and conducted online on a secured platform of the Faculty of Medicine (Moodle Platform version 2.1.0). In Bachelor years, five written formative assessments were held for respiration, osteo-articular system, infectious diseases, integration, and community dimensions. In Master years, ten written formative assessments were held for surgery, internal medicine, primary care, paediatrics, gynaecology and obstetrics, psychiatry, radiology, pathology, ophthalmology, and emergency medicine and intensive care (Table 2). We collected the data for all the written formative assessments. Objective Structured Clinical Examinations (OSCEs) and oral examinations were cancelled, or adapted and conducted at distance, and were not included in the present study.

In June 2020, we conducted an online anonymous cross-sectional survey (Wurth et al., 2021) of all the medical students from year 2 to 6 to determine their perception of the impact COVID-19 had on the curriculum. Data about students' perception of the change from summative to formative assessment were included in this study.

Formative assessments
To reach our aim (i.e., to support students' drive for learning through assessment), we relied on a combination of incentives and formal requirements. The tests took place during the period of time originally devoted to the examination session. They were similar to the usual exams concerning content and format. The tests had the same length (number of questions, examination time) and the content was chosen according to the blueprint of each discipline. The questions were not specifically developed for the formative tests, but they were taken from the pool used for the summative examinations. Previous exams were reused in several disciplines: in Bachelor years, the respiration, osteo-articular, and infectious diseases units; in Master years, psychiatry, gynaecology-obstetrics, internal medicine, primary care medicine, paediatrics, ophthalmology, radiology and pathology. The formative tests offered to the students were therefore valid and reliable.
Once students had begun a test, they had a limited amount of time to complete it, identical to the usual examination time.
They had to answer the questions sequentially, i.e. they could not go backwards and modify the answers of previously answered questions. After completion of an attempt, students were given their scores. The same set of questions was used for each attempt. At the end of the whole assessment session, they were given detailed feedback for each test. They had access to all the questions, the correct answers, and their own answers. Master students were also provided with the following summary: their own score for each attempt, and for comparison the lowest and the highest observed scores, and the 10%, 25%, 50%, 75%, 90% quantiles.
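The comparison summary given to Master students can be illustrated with a minimal Python sketch. The function name `score_summary`, the dictionary keys and the toy cohort are hypothetical, and the interpolation rule used for the quantiles is an assumption, since the article does not state how the quantiles were computed.

```python
from statistics import quantiles

def score_summary(own_scores, cohort_scores):
    """Build a feedback summary: the student's own scores plus cohort
    minimum, maximum and selected quantiles (hypothetical helper)."""
    # n=20 yields 19 cut points at 5% steps; index i corresponds to (i + 1) * 5%.
    # Assumption: "inclusive" interpolation; the article does not specify a method.
    cuts = quantiles(cohort_scores, n=20, method="inclusive")
    return {
        "own": list(own_scores),
        "min": min(cohort_scores),
        "q10": cuts[1],
        "q25": cuts[4],
        "q50": cuts[9],
        "q75": cuts[14],
        "q90": cuts[17],
        "max": max(cohort_scores),
    }

# Toy cohort of percentage scores (made up for illustration only)
cohort = list(range(0, 101, 10))
summary = score_summary([55, 70], cohort)
```

A student could then read off where each attempt falls relative to the cohort, for example whether a score lies above the median (`q50`).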
In Bachelor years, three written formative tests, namely respiration, osteo-articular, and infectious diseases, were opened for two periods of 24 hours separated by a 24-hour break.
Only one attempt was authorized per period, for a total of two attempts. Two disciplines, namely integration and community dimensions, were organized according to a different modality: three attempts were authorized within a time span of 5 days.
In Master years, ten written formative assessments were held and students could access the test for each discipline within a predetermined time span of 72 hours, with a maximum of 3 attempts allowed.
Before each test, students were given recommendations about how to benefit from the formative assessments. They were instructed to take the test in conditions favouring concentration, to answer the questions on their own, to look up the information they had been missing after each attempt, and to use the detailed feedback to supplement their learning based on identified gaps.
Students were asked to complete a short questionnaire at the end of each formative test. The questionnaire included one item about self-assessment of performance ("I think my score in this formative test is: 1) less than 20%, 2) between 20 and 40%, 3) between 40 and 65%, 4) between 65 and 80%, or 5) more than 80%") and one item about their wish for support in learning ("I would like to have feedback or advice about how to study"). The choice of answers was: 1) "not at this stage", 2) "maybe", 3) "yes".
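One plausible way to compare a self-rated band with the score actually obtained is sketched below in Python. This is an illustration, not the authors' procedure: the article does not describe how under- and overestimation were determined, the function names are hypothetical, and the handling of scores that fall exactly on a threshold is an assumption.

```python
from bisect import bisect_right

# Percent thresholds separating the five answer options of the questionnaire item
BAND_EDGES = [20, 40, 65, 80]

def score_band(percent):
    """Return the band (1 to 5) that a percentage score falls into.
    Assumption: a score exactly on a threshold is placed in the higher band."""
    return bisect_right(BAND_EDGES, percent) + 1

def classify(self_rated_band, actual_percent):
    """Compare the self-rated band with the band of the actual score."""
    actual_band = score_band(actual_percent)
    if self_rated_band < actual_band:
        return "underestimated"
    if self_rated_band > actual_band:
        return "overestimated"
    return "correct"

# A student who chose option 3 (40 to 65%) but scored 72% underestimated the score.
example = classify(3, 72)
```

Under this reading, an estimate counts as correct only when the chosen band contains the actual score, which is a coarse criterion because the bands have unequal widths.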

Survey
An online survey (evasys software, version 8.2) was conducted among all the medical students from year 2 to 6 (N=803) in June 2020 (Wurth et al., 2021). The aim was to explore the students' perception about the reorganization of teaching and the switch from summative to formative assessments. In that previous study, the data concerning formative assessments were not analysed. Two items concerned assessment: 1) "Most examinations have been reorganized since March and switched from summative exams (pass/fail with grades) to mandatory formative assessments. I think this was a good decision (1 strongly disagree, 4 strongly agree)"; 2) "What impact did the switch from summative to formative assessments have on you? (free text)". Medical students from year 2 to 5 (n=648) were affected by the switch to formative assessments and participated in this research. We used a conventional content analysis approach to analyse item 2. Two researchers (VL, ME) independently read the participants' comments. A list of seven codes was developed, and then independently applied to all the comments. Coding discrepancies were identified and resolved by consensus.

Statistical analysis
Fifteen formative assessments were considered for the analyses: five Bachelor exams and ten Master exams.
For each attempt, the test scores were computed by dividing the number of points of each examinee by the maximum number of points achievable and expressing the result as a percentage (e.g., 50 if half of the points were achieved). The scores for each exam were standardized (mean 100 and standard deviation 10). The duration of an attempt was defined as the interval between the beginning of the attempt and either the final validation of the answers by the examinee or the automatic validation of the answers by the program once the time allowed had elapsed. Since the duration of the formative assessments differed among the disciplines, it was standardized (mean 0 and standard deviation 10) when the association between the scores and the duration of each attempt was investigated.
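The score computation and standardization described above can be sketched in Python as follows. The function names and the raw points are hypothetical, and the use of the population standard deviation is an assumption, as the article does not say whether the population or the sample formula was applied.

```python
from statistics import mean, pstdev

def percent_score(points, max_points):
    """Score as a percentage of the maximum number of points achievable."""
    return 100.0 * points / max_points

def standardize(values, target_mean, target_sd):
    """Linearly rescale values to a chosen mean and standard deviation."""
    m = mean(values)
    s = pstdev(values)  # assumption: population SD; the article does not specify
    if s == 0:
        raise ValueError("cannot standardize values with zero variance")
    return [target_mean + target_sd * (v - m) / s for v in values]

# Made-up raw points for one exam with 60 achievable points
raw_points = [30, 42, 48, 51, 57]
scores = [percent_score(p, 60) for p in raw_points]

# Scores: mean 100, SD 10. Durations would use target_mean=0 with the same SD.
std_scores = standardize(scores, target_mean=100, target_sd=10)
```

Standardizing within each exam puts disciplines with different difficulty or length on a common scale, which is what allows scores and durations to be pooled across the fifteen assessments.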
We used chi-square tests to investigate the association between categorical variables and analysis of variance to investigate the association between a numerical variable and a categorical variable. We used a linear mixed-effects model to investigate the association between the standardized scores and the standardized duration and gender (fixed effects), with students as a random factor.
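As an illustration of the chi-square test of association, the following Python sketch computes Pearson's statistic for a contingency table and the p-value for a 2x2 table. It is not the authors' analysis, which was run in R: the counts are invented for the example, the function names are hypothetical, and no continuity correction is applied.

```python
from math import erfc, sqrt

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom
    (the case of a 2x2 table)."""
    return erfc(sqrt(stat / 2))

# Invented counts: rows are two groups, columns are underestimated yes / no
toy_table = [[30, 20], [20, 30]]
chi2 = chi_square_statistic(toy_table)
p = p_value_df1(chi2)
```

With a Type I error rate of 0.05, the association in this toy table would be declared significant only if `p` falls below 0.05.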
No data imputation method was used to estimate missing values. Tests were used with a Type I error rate of 0.05. All analyses were made using R software, version 4.1.1 (The R Foundation for Statistical Computing, Vienna, Austria).
For Master students the average interval between the first and the second attempt was 16 hours (median 7 hours), and 18 hours between the second and the third attempt (median 14 hours). Apart from a few outliers (very short or very long durations) there was a trend of a negative association between the score and the duration of the attempt (p<.0001 for the t value associated with the estimate of the slope; Figure 1).

Subgroup of examinees with several attempts.
For the subgroup of examinees who made at least two attempts (734 examinees; 30.8%), the average increase in score between the first and the second attempt was +10.5 (range -41.6 to +51.7; median 9.4, IQR 10.8). The average change in the duration of the attempt was -33.1 min (range -123.0 to +53.0; median -31.0, IQR 48.0).
The highest score was most often achieved at the second attempt (85.4%), the other situations being equally distributed between the first (7.5%) and the third attempt (7.1%) (Figure 2).In the most common situation (529 examinees, 72.1%) the highest score was reached at the second attempt and the longest attempt duration was the first.

Self-assessment and score
Examinees self-assessed their scores for 2998 attempts (93.8%). More than half of them (54.6%) underestimated their score. They correctly estimated their score in 40.7% of cases, and they overestimated their score for a minority of attempts (4.7%). More female than male students underestimated their scores (60.7% vs. 45.5%; p<0.0001). Students who performed best were more likely to underestimate their scores, while low performers overestimated their scores (p<0.0001). More high-performing female students underestimated their scores compared with their male counterparts (p=0.0066) (Figure 3). Bachelor students were less likely to underestimate their score than Master students (48.0% vs. 57.8%; p<0.0001).

Wish for support in learning
Across all attempts, 1388 examinees (43.4%) indicated that they had no wish for support at this stage, 900 (28.1%) that they may like to have support, and 500 (15.6%) that they did wish for support. A minority of examinees (n=409; 12.8%) did not respond. There was no association between the wish for support and self-assessment of score (p=.365), or the actual normalized score (p=.109; Figure 4).

Survey results
Out of 648 medical students, 390 (60.2%) students answered the questions about assessment. The majority (87.9%) strongly agreed or agreed with the switch from summative assessments to mandatory formative assessments. A minority (11.0%) disagreed or strongly disagreed, and 1.0% had no opinion.
Among the respondents, 305 (78.2%) answered the open-ended question about the impact of the switch to formative assessments. The responses were in French and have been translated into English. We did not code 57 comments (19.6%) either because they were made by 6th-year students or because they were not related to the question. Six themes emerged: 1) alleviation of stress, 2) decreased motivation for study, 3) no impact, 4) increased motivation for study, 5) fear of gaps in knowledge, and 6) free time for other activities (Figure 5).
Representative quotes for each theme are presented in Table 3. The most common perceptions were relief (33.8%), and decreased motivation for study (33.4%). The two feelings were often associated in the same comment: "Less stress, but consequently more difficult to get motivated and study". A minority of students (11.5%) experienced enhanced motivation: "I enjoyed learning not for an exam, but to increase my own knowledge, out of the desire to understand the processes". Some students wondered about potential gaps in knowledge and how these could impact their future professional curriculum. Others appreciated the extra free time, which they used for other activities, especially volunteer work for the COVID-19 health crisis.

Discussion
In this study we investigated medical students' performance and perception after summative examinations were replaced by mandatory formative assessments.
Table 3. Representative quotes for each theme*

Increased motivation for study
More motivated to study. More time to gain in-depth knowledge.
Renewed motivation. I again enjoyed learning new things, for the future, and not just for an exam.
I learned more out of curiosity and pleasure, and less according to the learning objectives.

Fear of gaps in knowledge
Anxiety about going on with my studies, without any evidence of my competencies and capacity to pass to the next level.
Less involvement in studying, less knowledge acquired
All our 4th-year exams cancelled, hence fear for the Federal Licensing Exam

Free time for other activities
Since the assessments were not summative, I could engage in volunteer work at the hospital.
It allowed me to participate a little more in the health crisis.
I have free time for other activities, it makes me happy.

*The original responses were in French, and were translated for the manuscript.

For each test, 2 and 3 attempts were authorized in a predefined time frame for Bachelor and Master students, respectively. Among the 30.8% of examinees who repeated the test, the average score increased, and the average duration of the attempt decreased, which supports a benefit of the combination of detailed feedback, learning time, and re-test opportunity.
Taken together with the time elapsed between two attempts, this suggests that the students who repeated the test studied the learning contents. This is consistent with previous findings about formative assessments during the COVID-19 pandemic where many students (72.2%) indicated that feedback, given after the tests, motivated them to study (Snekalatha et al., 2021). Students who repeated a test were more often in Master than in Bachelor years. One reason for the difference could be the subject matter. In the survey, Master students felt motivated to study topics they considered clinically useful, and Bachelor exams mainly evaluate fundamental knowledge.
It was disappointing that a majority of students (3 out of 4 in Bachelor years and 2 out of 3 in Master years) did not repeat the test. Although their average score at the first attempt was better than the score of the other students, they missed a learning opportunity. Other studies showed that personal study coupled with a second test improves the anchoring of knowledge and the ability to use it (Agarwal et al., 2012; Roediger & Butler, 2011). This retrieval effect can persist in time and knowledge retrieval is an important aspect of clinical work.
Students' self-assessment of scores was sub-optimal since less than half of the examinees' estimations were correct. High-performing students underestimated their scores, while poor performers overestimated them. Female students and students in Master years underestimated their scores to a greater extent than male students and students in Bachelor years, respectively. Inaccuracy in self-assessment in medical students and a difference between female and male students have been observed previously (Blanch-Hartigan, 2011; Gordon, 1991).
Students can be inaccurate due to the complexity of self-assessment per se and the difficulty in identifying gaps in a specific domain (Eva & Regehr, 2005; Regehr & Eva, 2006).
Overestimation by poor performers and underestimation by good performers have been described (Colthart et al., 2008). Explanations can be a regression to the mean and a lack of the cognitive skills needed for accurate self-assessment in the less competent individuals. Female students tend to be more anxious and less confident than their male counterparts (Blanch et al., 2008; Colbert-Getz et al., 2013), traits that can be related to a marked underestimation in high performers. The difference can also be related to gender bias. In the clinical vignettes used for teaching at the Geneva Faculty of Medicine, professional gender roles tend to be stereotyped, e.g. physicians are more often male (Arsever et al., 2023). Depicting women in leading positions would strengthen the image of women as competent professionals. The assessed subject matter rather than a difference in self-assessment capacity could explain the difference we found between Bachelor and Master students. Master students were asked to evaluate their performance in clinical reasoning whereas Bachelor students evaluated their competence in fundamental knowledge. Students may be more familiar with the latter kind of content and therefore more capable of assessing their performance.
The proportion of examinees who wished for support and those who did not was similar, and wish for support was not associated with self-assessment of performance or the actual scores. De Jong et al. found a relationship between feedback-seeking behaviour and performance: high-performing students were more motivated and self-determined to seek feedback compared to low-performing students (de Jong et al., 2017). The study setting was different from ours since it included veterinary medicine students in their final year who underwent a summative evaluation at the workplace. The medical students of the Geneva Faculty of Medicine were in their preclinical and clinical years and assessment was formative.
Students had extensive online feedback on their performance and were offered support for learning. [...] et al., 2018; Kulasegaram et al., 2017).
Our findings underscore the need for formative assessment to be part of a larger assessment system (programmatic assessment) to motivate and benefit medical students best. The role of scholar defined in the CanMEDS (https://www.royalcollege.ca/ca/en/canmeds/canmeds-framework.html) or PROFILES (https://www.profilesmed.ch) frameworks requires lifelong learning and a planned approach to learning from physicians. In a competency-based curriculum, students must be supported to acquire the expected autonomy in self-improvement. Regular self-evaluation of scores at formative and summative assessments could help medical students better estimate their competences. Overestimation of performance is a marker for potentially less competent students. Recurrent overestimations could be used to identify medical students in need of support in learning.
This study has several strengths and limitations. The formative assessments offered to medical students of the Geneva Faculty of Medicine were robust: the questions were taken from the pool used for the summative examinations, students were given detailed feedback, and they had the opportunity to do the test again and monitor their progress. However, we do not have any information about the learning behaviour of the students who made only one attempt. They may have studied after the test to fill knowledge gaps and have benefited from the feedback offered. The participation in formative assessments was mandatory, so we do not know how the students would have behaved if the assessments had been optional. However, we observed a low participation rate in a couple of early formative assessments which were kept optional because of short decisional and organizational timelines.

Conclusion
Medical students in the Geneva Faculty of Medicine welcomed the switch from summative to formative assessments during the COVID-19 pandemic. A drawback was decreased motivation in learning. Despite formal requirements, thoughtful test organization and detailed feedback, a sizeable majority of students did not take the opportunity of repeated testing.
Students who did take the test more than once significantly improved their performance.

Ethics approval and consent to participate
The study was submitted in February 2022 to the University Committee for Ethical Research at the Geneva University (CUREG) which waived the need for review because the study was an analysis of anonymized collected data.
All methods were carried out in accordance with relevant guidelines and regulations.
Written informed consent was obtained from participants in the survey. A general written informed consent is given by all the medical students from year 2 to 6 to use the anonymized results of assessments for research and quality purposes.

Consent to publish
The authors confirm that there is no information which can potentially identify a participant.

Reflexivity
VL is a biologist and responsible for the Bachelor examinations. ME is a physician and responsible for the Master examinations at the Geneva Faculty of Medicine. ME conducted or was involved in previous qualitative research projects. VL and ME had no direct contact with the students apart from sending written information related to the taking of the formative assessments. They organized the formative assessments and ensured that they ran smoothly.
reviewers but the write-up probably needs major revision before being accepted for publication.

Introduction
1st para: adequate.
2nd para: the authors tried to explain summative assessment and formative assessment as general terms (giving references to other journals). Suggestion is to include the actual/operational definition of summative and formative assessments used by the Faculty of Medicine, University of Geneva, so that readers are able to appreciate the actual summative and formative assessment done by the faculty during this study period. These were partly explained in the Methods section and it took a while to understand how the assessments were being conducted at the faculty.
3rd para: this paragraph stated that there are 1) students' performance and 2) perception of switching from summative assessment to formative assessment. Unfortunately, they don't match the Abstract, as in the abstract the aims are to determine 1) students' performance, 2) self-assessment of performance and 3) perception of students towards the switching from summative to formative approach.
The following sentences after the aim of this study should be omitted as no comparison between the June 2020 study and the current study is being evaluated. The terms "shares the same medical students pool and focuses on formative assessment…" create more confusion and suggest one same study's data sliced into two publications.

Methods
1 st para-this paragraph is confusing and not sure what information is being delivered eg number 2020 is representing the batch of students or the year or something else.
2 nd & 3rd and subsequent paragraphs-suggest rewriting this paragraph to explain the curriculum and its content in the beginning (the disciplines) and then explain the assessment approach taken (pre-covid era).Then explain what the changes are done and how it is conducted during covid19.
Anything less than 10 should be written in words, not numbers.
Suggest also to omit the June 2020 study as it makes things more confusing.
Ethics paragraph should be located after the methods, not in the middle of the methods section.
Formative Assessment section to be rewritten together with the paragraph before the ethics based on the earlier suggested methods layout (intro to assessment in the faculty, pre-covid assessment (summative), formative assessment, the conduct of the assessment).
Suggest separating Bachelor level and Master level as both cannot be compared to each other.The comparison should be Bachelor Summative vs Bachelor Formative and Master Summative vs Master Summative.
Overall generalisation should not be done as Bachelor and Master students are not homogenous.
How were the questions selected? Given that the same questions were used by previous batches/years, this basically nullifies the assessment, as the senior students are able to pass their exam questions on to their juniors. What ways were taken by the faculty to secure these exam questions? (Questions selected based on degrees of difficulty from a pool of many questions? Are the questions vetted for covid batches? Or any other ways?)
The last paragraph of the Formative Assessment section relates more to students' self-perception of their performance than to their actual performance (based on grades or marks).

Survey
1 st para-suggest omitting as study done in June 2020 make lots of confusion to the current study that are being focused (eg N=803 vs n=648).
This section should state how the survey was conducted and what are the items in the survey/instrument/questionnaire.Has this instrument been validated?
Reflexivity or Roles by Authors should be located at the end, rather than be a content in the main study text.

Results
Ideally the result should answer the aims of the study which are to determine 1) students' performance, 2) self-assessment of performance and 3) perception of students towards the switching from summative to formative approach. This section would ideally be presented using a table for the demographic data.
Please do include the disciplines during bachelor and master in the table format.
All the numbers in this section will be best presented in table form, so that comparison between bachelor, masters, discipline, number of attempts can be made adequately, not in narrative form.
Table 1 should be in Result section.
My suggestion to review the usage of box plot in Figure 2, how best each discipline be presented and easily understood by the readers.
Major rephrasing and restructuring of this section are needed so that the readers able to appreciate the process and the flow of result of this study.
For Figure 5, I believe that bar chart will be better than pie chart as these are free texts and many ideas can be obtained from 1 respondent hence 100% of the pie chart doesn't represent the whole batch.
Figure 2 and Table 2 are explaining the same outcomes

Discussion
This section was written not according to the flow or the aims of the study. The best way to present is to match with the flow of the result so that the readers will be able to appreciate the discussion and relate directly with the result section.
There are numbers and percentages in discussion section that should be stated in result section as the discussion section will explain the nature of multiple attempts by the students.
Major rephrasing and restructuring of this section are needed so that the readers able to appreciate the flow of discussion of this study.
For limitation and strength, suggest starting with strength and followed by limitation.
The author needs to show to the reader why this study is important and can be learnt by others, then followed by the limitations occurred during the study.

Conclusion
The conclusion adequately portrays the summary of the study. Due to the missing students' performance, the last sentence can't be made as a conclusion, as the study at present showed self-perception of performance rather than actual performance (as the responses were anonymously recorded).

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly

Have any limitations of the research been acknowledged? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly

Are the conclusions drawn adequately supported by the results? Partly
Introduction
1st para: adequate.
2nd para: the authors tried to explain summative assessment and formative assessment as general terms (giving references to other journals). The suggestion is to include the actual operational definition of summative and formative assessments used by the Faculty of Medicine, University of Geneva, so that readers can appreciate the actual summative and formative assessments done by the faculty during this study period. These were partly explained in the Methods section, and it took a while to understand how the assessments were being conducted at the faculty.
The definition of summative and formative assessments, given in the Introduction, is the one used by the Geneva Faculty of Medicine. We explained in the first paragraph of the Introduction that all the summative assessments were cancelled. The summative written assessments were replaced by mandatory formative online assessments (see last sentence of 3rd paragraph: "the summative examinations at our Faculty were replaced with mandatory formative online assessments (i.e. students were required to participate, but not required to pass)").
3rd para: this paragraph states that the aims are 1) students' performance and 2) perception of the switch from summative assessment to formative assessment. Unfortunately, these do not match the Abstract, where the aims are to determine 1) students' performance, 2) self-assessment of performance and 3) perception of students towards the switching from summative to formative approach.
The sentence was modified in the last paragraph of the Introduction: "The aim of this study was to determine medical students' performance at the formative assessments, their self-assessment of performance and their perception about the switch from a summative to a formative approach."
The following sentences after the aim of this study should be omitted, as no comparison between the June 2020 study and the current study is being evaluated. The wording "shares the same medical students pool and focuses on formative assessment…." creates more confusion and suggests that the data of one study were sliced into two publications.
The current study and the 2020 study are two separate studies with different methods and different aims. The aim of the current study is to determine students' performance at the formative online assessments, self-assessment of performance, and perception about the switch from summative to formative assessment, whereas the aim of the 2020 study was to determine how students were organizing their activities, and the impact of the pandemic on their personal life, training and professional identity. Since the two studies explore different aspects of the impact of the pandemic, they both include the same cohorts of students. We specified the link between the 2020 study and the current study and the inclusion of the same cohorts of students at the editorial team's request. It underlines how complementary the two studies are. We put the information in a new paragraph ("In June 2020, etc") and rephrased the last sentence to make the link clearer ("The current study focused on formative assessment during COVID-19 pandemic in the same cohorts of medical students.").

Methods
1st para: this paragraph is confusing and it is not clear what information is being delivered, e.g. whether the number 2020 represents the batch of students, the year, or something else.
2020 is the year of the switch from summative to formative assessments. To avoid confusion, we modified the sentence in the first paragraph: "We conducted an observational study including all the medical students from year 2 to 5 (n=648) of the Geneva Faculty of Medicine affected by the switch from written summative exams to formative assessments in 2020."
2nd, 3rd and subsequent paragraphs: suggest rewriting these paragraphs to explain the curriculum and its content at the beginning (the disciplines) and then the assessment approach taken (pre-COVID era). Then explain what changes were made and how the assessments were conducted during COVID-19.
The focus of the study is assessment, and not the content of teaching, and we prefer to keep the focus of the manuscript on assessment. At the end of each teaching unit, a written and/or oral summative exam is given to medical students. During the pandemic, the written summative exams were replaced by formative assessments. The details of the formative assessments during the COVID-19 period are given in the section Formative assessments. We modified the beginning of the second paragraph: "At the end of each teaching unit, a written and/or oral summative exam is given to medical students. During the pandemic, all the written MCQ exams were maintained as formative assessments and conducted online on a secured platform of the Faculty of Medicine (Moodle Platform version 2.1.0)."
Anything less than 10 should be written in words, not numbers.
We replaced "5 written formative assessments" by "five written formative assessments" and "10 written formative assessments" by "ten written formative assessments".
Suggest also to omit the June 2020 study as it makes things more confusing.
The data about medical students' perception of the change from summative to formative assessment were collected as part of the study conducted in June 2020. Therefore, it is necessary to mention the study. We changed the last sentence: "Data about students' perception of the change from summative to formative assessment were included in this study".
The Ethics paragraph should be located after the methods, not in the middle of the Methods section.
We moved this paragraph after the conclusion and before the Consent to publish section.
Formative Assessment section to be rewritten together with the paragraph before the ethics based on the earlier suggestion methods layout (intro to assessment in the faculty, pre-covid assessment (summative), formative assessment, the conduct of the assessment).
All the relevant information is already given in the section, and we prefer to keep it this way since the topic of the paper is "Formative assessments during COVID-19 pandemic".
Suggest separating Bachelor level and Master level as both cannot be compared to each other. The comparison should be Bachelor Summative vs Bachelor Formative and Master Summative vs Master Summative.
We agree with the reviewer that Bachelor students and Master students should not be compared and we did not compare them. The circumstances were so different between the summative exams in "normal" years and the formative assessments in the COVID-19 year that we chose not to compare the medical students' performance in summative vs formative assessments.
Overall generalisation should not be done as Bachelor and Master students are not homogenous.
The cohorts of Bachelor and Master students are homogenous. The students do all their medical education (from the first year of Bachelor to the last year of Master) in the Geneva Faculty of Medicine. Furthermore, Figure 2 shows the results of each single assessment separately.
How were the questions selected? Giving the same questions used by previous batches/years basically nullifies the assessment, as senior students are able to pass their exam questions on to their juniors. What measures does the faculty take to secure these exam questions? (Are questions selected based on degree of difficulty from a pool of many questions? Are the questions vetted for the COVID batches? Or any other ways?)
As explained in the 1st paragraph of the Formative assessment section, formative assessments were similar to the usual exams concerning content and format. The tests had the same length (number of questions, examination time) and the content was chosen according to the blueprint of each discipline. The questions were not specifically developed for the formative tests, but they were taken from the pool used for the summative examinations. The pool of questions is kept on a secured platform and new questions are regularly added. The questions used for a summative exam are not made available to the students, and the ones used for the formative assessments during the pandemic will no longer be used in summative exams.
The last paragraph of the Formative Assessment section speaks more to students' self-perception of their performance than to their actual performance (based on grades or marks).
Figure 3 precisely shows the shape of the association between perceived and actual performance.

Survey
1st para: suggest omitting, as the study done in June 2020 creates a lot of confusion with the current study in focus (e.g. N=803 vs n=648).
As explained previously, it is necessary, and requested by the editorial team, to cite the 2020 study to explain how the data were collected and what the link between this and our study is. The study done in June 2020 included 803 students, i.e. the students from year 2 to year 6. The current study only includes the students affected by the switch from summative to formative assessments, i.e. 648 students, from year 2 to year 5, since year 6 is a clinical year and students don't take any exams.
This section should state how the survey was conducted and what the items in the survey/instrument/questionnaire are. Has this instrument been validated?
All the details regarding the survey can be found in the article we refer to (Wurth S, Sader J, Cerutti B, et al.: Medical students' perceptions and coping strategies during the first wave of the COVID-19 pandemic: studies, clinical implication, and professional identity. BMC Med Educ. 2021; 21(1): 620). It is not relevant for the current study to describe the survey in detail again.
Reflexivity or Roles by Authors should be located at the end, rather than be a content in the main study text.
We moved the Reflexivity section after the Consent to publish section.

Statistical analysis -OK
Results
Ideally the result should answer the aims of the study which are to determine 1) students' performance, 2) self-assessment of performance and 3) perception of students towards the switching from summative to formative approach. This section ideally presented using table for the demographic data. Please do include the disciplines during bachelor and master in the table format.
We added a table in the Methods section after the second paragraph (Table 2).
All the numbers in this section will be best presented in table form, so that comparison between bachelor, masters, discipline, number of attempts can be made adequately, not in narrative form.
All the results described in the Results section are reported as figures.
Table 1 should be in Result section.
The data presented in Table 1 are not a result, but a description of the cohorts of students. This is the reason why Table 1 is in the Methods section.
My suggestion to review the usage of box plot in Figure 2, how best each discipline be presented and easily understood by the readers.
Box plots are adequate to present the results of each formative assessment. They are visual and informative.
Major rephrasing and restructuring of this section are needed so that the readers able to appreciate the process and the flow of result of this study.
The Results section is logically structured according to the aims of the study: sub section Formative assessments, which describes students' performance (aim 1), and students' self-assessment of performance (aim 2), followed by sub section Survey, which relates to students' perception of the switch from summative to formative assessment (aim 3). The manuscript is written using the usual descriptive style.
For Figure 5, I believe that bar chart will be better than pie chart as these are free texts and many ideas can be obtained from 1 respondent hence 100% of the pie chart doesn't represent the whole batch.
We modified Figure 5.
Figure 2 and Table 2 are explaining the same outcomes
The editorial team rightly requested that we add some participants' quotes in a table. It is commonly done in qualitative research and supports the results presented in Figure 5.

Discussion
This section was written not according to the flow or the aims of the study. The best way to present is to match with the flow of the result so that the readers will be able to appreciate the discussion and relate directly with the result section.
The Discussion section is written following the structure of the Results section, which follows the aims of the study: discussion of the students' performance (aim 1), of the students' self-assessment of performance (aim 2), and of the students' perception of the switch from summative to formative assessments (aim 3).
There are numbers and percentages in discussion section that should be stated in result section as the discussion section will explain the nature of multiple attempts by the students.
All the numbers and percentages mentioned in the Discussion section are reported in the Results section.
Major rephrasing and restructuring of this section are needed so that the readers able to appreciate the flow of discussion of this study.
As mentioned previously, the Discussion section is written following the structure of the Results section, which follows the aims of the study.
For limitation and strength, suggest starting with strength and followed by limitation. The author needs to show to the reader why this study is important and can be learnt by others, then followed by the limitations occurred during the study.
We changed the order starting with strengths followed by limitations: "This study has several strengths and limitations. The formative assessments offered to medical students of the Geneva Faculty of Medicine were robust: the questions were taken from the pool used for the summative examinations, students were given detailed feedback, and they had the opportunity to do the test again and monitor their progress. However, we do not have any information about the learning behaviour of the students who made only one attempt. They may have studied after the test to fill knowledge gaps and have benefited from the feedback offered. The participation in formative assessments was mandatory, so we do not know how the students would have behaved if the assessments had been optional. We observed a low participation rate in a couple of early formative assessments which were kept optional because of short decisional and organizational timelines."

Conclusion
The conclusion adequately portrait the summary of the study. Due to the missing students' performance, the last sentence can't be made as a conclusion, as the study at present showed self-perception of performance rather than actual performance (as the responses were anonymously recorded).
We do present students' actual performances (see Figures 1, 2, and 3).
Conclusion might need rewriting based on the new writeup of Result and Discussion sections.
Since the purpose of the conclusion is to give clear and synthetic messages, we prefer not to change it.
The online and distance formative assessments affected all medical students from years 2 to 5. Formative assessments were conducted online using a secure platform and closely resembled the usual exams in content and format.Students received detailed feedback after each attempt.A survey was conducted to gather students' perceptions of the switch to formative assessments.
Most students (72.2%) did not repeat the test, missing the opportunity for increased learning.However, students who repeated the test showed improved scores, suggesting that retesting was beneficial.
Students' self-assessment of scores was suboptimal, with high performers underestimating their scores and low performers overestimating them.
Interestingly, most students did not seek additional support in learning, and there was no significant association between the wish for support and self-assessment or actual scores.
Medical students generally supported the switch to formative assessments, primarily due to the stress alleviation during the pandemic.
The study highlights that medical students welcomed the switch from summative to formative assessments during the pandemic.Nevertheless, a drawback was observed, as it led to decreased motivation to study, despite the mandatory formative assessments.
Students who took the test multiple times showed improved performance, indicating the benefits of detailed feedback and retesting.
The study also reveals that students' self-assessment of scores was inaccurate, with potential implications for their learning and performance.
The manuscript presented is appropriate and covers an interesting aspect of the COVID-19 pandemic's impact on educational institutions.Implementing new strategies for teaching and assessing students during the pandemic, particularly using formative assessments as an alternative to traditional summative assessments, is highlighted.However, it is essential to acknowledge that this approach does not guarantee progress throughout the course.Moreover, students learn content and how to respond autonomously and in socially desirable ways.
The study identifies two groups of students: those who use assessment opportunities to improve their learning and those who perform them out of obligation.Conducting a correlation analysis between these groups and various factors would enhance the quality of the data presented.Additionally, it is crucial to clarify how the repetition of assessments was conducted and to analyze Bachelor and Master year students separately.Students closer to medical practice may exhibit a greater tendency to take advantage of feedback and learning opportunities.
Have you considered exploring learning styles as well? This could be an interesting topic, as students' perceptions of assessment models and feedback optimization are closely related to their learning styles. Additionally, delving into the differences between growth and fixed mindsets could be a valuable point of discussion. By exploring these two lines of inquiry, you may gain insight into how some students embrace challenges and take more risks, while others do not. Finally, investigating how motivation can be affected by a shift from summative to formative assessment could be an insightful addition to the conversation. Taking into account the mindset and learning style of each student might help in understanding why some students experience diminished motivation while others experience an increase in motivation.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Introduction
○ The Introduction should elaborate on summative and formative assessments among medical students in general.
○ The first sentence of the Introduction could be referenced, since it mentions that many schools adapted to the changes and switched to online teaching.
○ Please use the name of the Faculty of Medicine or its abbreviation throughout the text rather than referring to it as "our".
○ Please restructure and combine the following sentences to reflect that the information comes from the same study: "In June 2020, an online survey was conducted among all the medical students from year 2 to 6 (Wurth et al., 2021). This study focused on students' main activities, the impact of the pandemic on their personal life, training and professional identity."

Methods
○ The Methods should start with a study design and setting. The design of the study should be mentioned clearly in the Methods, and which students were included in the analysis should also be clearly stated (all years or selected years) at the beginning of the Methods. The first paragraph of the Methods appears to be a program description rather than the inclusion criteria of the study.
○ The courses/modules in which the assessments took place should be relocated to the study setting rather than the formative assessment section.
○ When was the survey administered? One time after all assessments, or after each assessment? Please specify.
○ The number of possible attempts appears confusing in the bachelor years.

Results
○ The results should start with a sample description, number of students in each year, courses, etc., because they were used later in the analyses.
○ Why were the survey responses made by 6th year students not coded?
○ In Figure 2, the number of attempts is missing for primary care, psychiatry, and surgery.
○ Why didn't the authors correlate students' perception and assessment outcomes? An additional analysis modeling students' perception category (from the survey) and formative assessment outcomes would be very important.

Discussion
○ The authors have to discuss in more depth why there were significantly more single attempts by bachelor students than by master's students. The difference between fundamental knowledge and clinical usefulness of topics does not appear to be a valid reason.
○ More details about the questions in the second and third attempts should be discussed. Did the assessment involve the same or different questions in each attempt? Or what percentage of questions from the first attempt was repeated in the second and third attempts?
○ Around 40% of participants didn't provide their feedback through the survey assessment: how can this affect the generalizability of the findings about students' perception?
○ The referencing style should be unified throughout the manuscript; for example, the authors used citation number [10] in one paragraph of the Discussion whilst they used the APA style (author and year) throughout most of the text.
○ The self-assessment of scores would have been better compared between academic years or courses/modules rather than gender. What are the pedagogical outcomes from the gender comparison?
○ Participation in multiple assessment attempts does not necessarily correlate with better knowledge and clinical work because the average time between attempts was less than 24 hours. This time does not seem adequate to provide a greater learning opportunity as mentioned by the authors.
○ The discussion should include practical implications for the findings of the study. This discussion part should be inserted before the limitations.

Conclusion:
The conclusion should reflect the results and data of the findings rather than hypothesis. So the following sentence can be hypothesized and presented in the discussion rather than a final conclusion: "Students who did take the test more than once significantly improved their performance, which supports a benefit of the combination of detailed feedback, learning time, and re-test opportunity". The authors should not include in the conclusion the reason for students' performance enhancements.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes

Have any limitations of the research been acknowledged? Yes
Are all the source data underlying the results available to ensure full reproducibility?

life, training and professional identity."
We modified the text as suggested: "In June 2020, an online survey was conducted among all the medical students from Year 2 to 6 to determine how students were organizing their activities, and the impact of the pandemic on their personal life, training, and professional identity (Wurth et al., 2021)."

Methods
The Methods should start with a study design and setting. The design of the study should be mentioned clearly in the Methods, and which students were included in the analysis should also be clearly stated (all years or selected years) at the beginning of the methods. The first paragraph of the methods appears to be a program description rather than the inclusion criteria in the study.
We revised the Methods section to clarify the design of the study at the beginning. Table 1 gives the distribution of students by year and gender.

The courses/modules that the assessments took place in should be relocated to the study setting rather than the formative assessment section.
We did as suggested.

When was the survey administered? One time after all assessments, or after each assessment? Please specify.
The survey was a one-shot cross-sectional survey conducted among all the students in June 2020. Hence, it was not administered after every assessment.

The number of possible attempts appears confusing in the bachelor years.
Three written formative assessments (namely respiration, osteo-articular, and infectious diseases) allowed two attempts (one attempt per 24-hour period). Two written formative assessments (namely integration and community dimensions) allowed three attempts within a five-day period. We modified the text to make it clearer.

Results
The results should start with sample description, number of students in each year, courses, etc., because they were used later in the analyses.
We described the number and types of courses in Bachelor and Master years, and Table 1 gives the distribution of students by year and gender.

Why were the survey responses made by 6th year students not coded?
The sixth and last year of the pre-graduate curriculum is entirely dedicated to clinical clerkships. There are no faculty summative exams, so there was no switch to any formative assessments. This is the reason why we did not analyse the answers of these students.

In Figure 2, the number of attempts are missing for primary care, psychiatry, and surgery.
We thank the reviewers for pointing out this oversight. Figure 2 was corrected.

Why the authors didn't correlate students' perception and assessment outcomes? An additional analysis modeling students' perception category (from the survey) and formative assessment outcomes would be very important.
Because the survey was anonymous, we were not able to estimate any correlation. The survey and our study were originally designed as separate studies.

Discussion
The authors have to discuss more why there was a significantly higher single attempts by bachelor students compared to master's students. The difference between fundamental knowledge and clinical usefulness of topics does not appear to be a valid reason.
The students' comments in the survey show that students tended to study what they thought useful for clinical practice, and left out other, less related topics. Students were less motivated to study than for summative assessments. Taken together, these elements support our hypothesis. This is a novel finding; we found no data in the literature about this topic. Therefore, we prefer not to discuss it further.

More details about the questions in the second and third attempts should be discussed. Did the assessment involve the same or different questions in each attempt? Or what percentage of questions from the first attempt is repeated in the second and third attempt?
The assessment in each discipline involved exactly the same set of questions for each attempt. We added a sentence in the Methods section, sub-section Formative assessment: "The same set of questions was used for each attempt."

Around 40% of participants didn't provide their feedback through the survey assessment: how can this affect the generalizability of the findings about students' perception?
The analysis of the survey was qualitative and focused on identifying themes, which is not dependent on the sheer number of respondents. We cannot exclude that some themes would have emerged if more students had answered. However, it is unlikely because we found a broad homogeneity in the answers. In previous surveys, a 50% response rate allowed us to obtain reliable results regarding a whole cohort's perception. Previous generalizability studies conducted in our institution estimated that, considering the size of our cohorts of students, a 50% response rate would be compliant with the capacity to keep the standard error of the mean within the limits of desired levels of precision.

The referencing style should be unified throughout the manuscript, for example, the authors used citation number [10] in one paragraph of the Discussion whilst they used the APA style (author and year) throughout most of the text.
Thanks for pointing that out. We have corrected the referencing style.

The self-assessment of scores comparison would have been better to be compared between academic years, courses/modules rather than gender. What are pedagogical outcomes from the gender comparison?
Many studies have dealt with gender-group differences in learning behaviour, hence it was relevant to split our results by gender. In addition, desegregation of data by 34174186; PMCID: PMC8443002. We also made some comparisons (not reported) by academic years: Master students clearly underestimated their performance more frequently than Bachelor students did (see the figure in supplementary data: https://doi.org/10.6084/m9.figshare.21755882). We agree with the reviewers that pedagogical outcomes are an important issue. Drawing attention to biases, whether gender or other biases, matters, but may not be sufficient to change attitudes. An analysis of the clinical vignettes developed for teaching at the Geneva Faculty of Medicine showed that gender professional roles were stereotyped. Physicians were often male, while nowadays most medical students and more and more physicians are female. A similar observation was made with regards to the scenarios for the clinical skills examination (OSCEs) of the Swiss federal licensing exam. Our results support the change in learning material about to be implemented in the Geneva Faculty of Medicine. To depict women in leading positions will strengthen the image of women as competent professionals. It also better reflects the actual situation since more and more women occupy leading and academic positions. It would be interesting to determine whether female undergraduate medical students still underestimate their performance more than male students in a few years' time. We added a comment in the Discussion: "The difference can also be related to gender bias. In the clinical vignettes used for teaching at the Geneva Faculty of Medicine gender professional roles tend to be stereotypes, e.g. physicians are more often male. To depict women in leading positions would strengthen the image of women as competent professionals."

Participation in multiple assessment attempts does not necessarily correlate with better knowledge and clinical work because the average time between attempts was less than 24 hours. This time does not seem adequate to have a greater learning opportunity as mentioned by the authors.
We would agree with this comment if the students had to answer different questions at the second and third attempts. Since all the attempts involved the same set of questions, the students had enough time to study what they did not know.

The discussion should include practical implications for the findings of the study. This discussion part should be inserted before the limitations.
Thank you for suggesting this addition. Our results support changes in the teaching material concerning the socio-professional characteristics of women (see above): "To depict women in leading positions would strengthen the image of women as competent professionals". We added a paragraph in the Discussion section before the study limitations that describes other practical implications: "Our findings underscore the need for formative assessment to be part of a larger assessment system (programmatic assessment) to motivate and benefit medical students best. The role of scholar defined in the CanMEDS (https://www.royalcollege.ca/ca/en/canmeds/canmeds-framework.html) or PROFILES (https://www.profilesmed.ch) frameworks requires lifelong learning and a planned approach to learning from physicians. In a competency-based curriculum, students must be supported to acquire the expected autonomy in self-improvement. Regular self-evaluation of scores at formative and summative assessments could help medical students better

Figure 1. Association between the score and the duration of the attempt. The continuous blue line was obtained using smoothing spline fitting.

Figure 2. Boxplot of the scores at first and second attempt (subgroup of students who made several attempts).

Figure 3. Relationship between standardized scores, self-assessment and gender.

Figure 4. Boxplot of the standardized scores split by indication regarding the perceived need for support in learning.

Figure 5. Survey results: "What impact did the switch from summative to formative assessments have on you? (Free text)".
Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Have any limitations of the research been acknowledged? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: PhD in Sciences, Specialist in Medical Education, Coordinator of the Health Education Research Group at CEDEM in the Faculty of Medicine of USP. I study topics such as faculty development, assessment, healthcare students' quality of life, resilience, and happiness.

Number of attempts and duration. Fifteen
They seemed to find this opportunity moderately appealing, which is somewhat surprising. Because of the sanitary measures required by the COVID-19 epidemic, students were isolated and could not benefit from the usual learning opportunities and motivation between peers. In the survey, they actually reported worrying about potential knowledge gaps.
Medical students were supportive of the switch from summative to formative assessments. The main reason was an alleviation of stress, but participants also mentioned the opportunity it gave them to volunteer for COVID-19-related activities. A frequently reported drawback was a decreased motivation to study despite the mandatory character of the formative assessments, the authenticity of the tests, and the extensive feedback given. Some students enjoyed the freedom to choose what they learned. A matter of concern is the way they discarded knowledge pertaining to basic sciences to focus on subjects they deemed useful for clinical work. One role of teachers as experts is to guide students in building up their knowledge by providing relevant learning content. Students' ability to correctly select what matters for knowledge scaffolding is questionable. The understanding of biological and physiopathological mechanisms is essential in clinical work: it helps relate various signs and symptoms, and disease manifestations, to treatment choices. The risk of knowledge gaps in basic sciences is faulty clinical reasoning (Castillo
Gerbase MW, Germond M, Cerutti B, Vu NV, Baroffio A. How Many Responses Do We Need? Using Generalizability Analysis to Estimate Minimum Necessary Response Rates for Online Student Evaluations. Teach Learn Med. 2015;27(4):395-403. doi: 10.1080/10401334.2015.1077126. PMID: 26507997.