Manuscript: Intern preparedness for the CanMEDS roles and the Dunning-Kruger effect: a survey

Abstract

Background: The purpose of this study was to determine whether the first cohort of graduates from a new undergraduate medical programme in Botswana was adequately prepared for internship.

Methods: The authors surveyed 27 interns and 13 intern supervisors on site, who rated intern preparedness for 44 tasks using a previously validated instrument. Tasks were grouped according to the seven roles of the physician in the CanMEDS framework, and Cronbach α values confirmed internal consistency. To determine the direction of differences between intern and supervisor ratings for tasks, Likert scale ratings were treated as interval data and mean scores calculated. Rating frequencies for each role were compared using the χ² statistic. Reasons for differences between intern and supervisor ratings were explored by determining correlations between scores using the Spearman ρ statistic, and by analysing qualitative data generated by the questionnaire.

Results: Preparedness for all seven roles and the majority of tasks was found to lie between 'Fairly well prepared' and 'Well prepared'. The ratings for four roles (Medical expert, Communicator, Collaborator, Professional) differed statistically, but not those for the other three (Leader, Health advocate, Scholar). Interns rated their proficiency higher than their supervisors did for the tasks in six roles; for the 'Professional' role intern ratings were mostly lower. Correlations between intern and supervisor scores were significant for only three roles (Medical expert, Communicator, Collaborator). Qualitative data provided further insights into the reasons for these associations.

Conclusions: Intern preparedness for tasks and roles varied but was generally satisfactory. Seeming discrepancies between intern and supervisor ratings were investigated and explanations are offered. For three roles the data indicate that the component tasks are understood in the same way by interns and supervisors, but this does not hold for the other roles. The Dunning-Kruger effect offers a plausible explanation for the higher intern scores for tasks in six of the roles. For the 'Professional' role, differences between interns' internal, individual understanding and supervisors' external, group understanding may explain the lower intern scores. The fact that respondents may understand the tasks they rate differently has implications for all research of this nature.

Background
Intern preparedness for medical practice has been the subject of discussion and research in medical education over the past decades. The term 'intern' refers to a recent medical graduate undergoing a period of supervised professional practice. Determining the preparedness of graduates of a new medical programme to practise as interns is a valued outcome metric for medical programmes (1). The World Federation for Medical Education recommends that a medical school must 'analyse performance of cohorts of students and graduates in relation to its mission and intended educational outcomes, curriculum and provision of resources' (2). In this sense 'analysis' is required so that successes, and especially shortcomings, are identified; the shortcomings can then be addressed.
In terms of scope, studies have dealt with single institutions and programmes as well as multiple institutions, ranging from two institutions (23) and several institutions (8)(12)(27) to an entire country (5)(6)(24)(33). There have also been comparisons of two cohorts of the same programme, such as where a traditional programme is being replaced by a new one with a problem-based ethos (7)(8)(9)(10) or by one in a rural setting (11), and comparisons of interns who had or had not undergone preparatory short courses (12). In cases where the effect of curricular innovations was studied the results could be positive (10)(16); in one study interns from a rural programme were judged to be better prepared for district hospital internship (11). Other studies comparing innovative and traditional programmes revealed little or no difference in the resulting preparedness for clinical performance, but positive differences were found in areas of focus for the new programmes, such as ethics and law, interpersonal skills, self-directed learning, health system functioning and collaboration, and negative ones in (for example) understanding of disease processes (7)(9)(10). Beyond formal programmes, there has also been evaluation of the effectiveness of preparatory courses or 'boot camps' in preparing graduates for internship, whether general ones (17)(29)(30)(31) or those with a particular focus such as surgery or paediatrics (12)(28)(32).
Studies have investigated both general and specific aspects of preparedness. For instance, some studies have evaluated how well interns are prepared against official standards, such as the General Medical Council recommendations in the United Kingdom (13)(15)(33); where preparedness was measured against national standards, interns were sometimes not achieving them (15). Other studies have focused on the effect of programmes to prepare students for specific competencies such as disaster preparedness (18), emotion regulation (19), memorable 'firsts' (20), infant lumbar punctures (21), basic medical procedures (22), career preparation and guidance (23), vaccination (24), health advocacy (in residents/registrars in this case) (25), a particular internship rotation (26), or developing a professional identity (27). Additionally, studies have assessed the effect of entire undergraduate programmes on interns' feelings of competence (3)(4)(5)(6). When evaluating how well an entire undergraduate programme had prepared new interns overall, reports varied from reasonably well prepared (8)(10) to inadequately prepared (3)(4)(6). In such cases deficiencies were identified relating to handover, planning treatment, pain management and communication (35), as well as managing stress (19). Such instances pointed to remedial action needed in undergraduate training.
Overall, studies have tended to be conducted in developed countries although there is emerging literature from African contexts (4) (30) (34). In many cases the study population consisted of the interns themselves (21)(22)(23)(26)(29) (30), and in other studies their supervisors as well (4)(9)(10)(13)(31)(33) (37). In most studies interns were approached at some stage in the course of their internship years. In some cases the study focused on final year students, investigating their prospective feelings of preparedness (17) (36).
This review of the literature revealed several challenges relating to the evaluation of intern preparedness. Firstly, while it is common for studies to use intern self-evaluation of performance, the validity of self-ratings was seldom questioned or taken into account in arriving at conclusions. Secondly, emerging evidence suggests that reliability can be enhanced through multi-source feedback, that is, through repeated observations by a number of evaluators rather than single observations (40). Thirdly, studies have tended to treat ordinal data obtained from Likert type questions as interval data for the purposes of analysis. Finally, in studies where both interns and supervisors gave ratings, supervisors' ratings were higher than interns' in some cases (7)(31)(37) and lower in others (4)(13); this varied because in several cases interns and their supervisors observed different aspects of preparedness (9)(10)(33). For instance, a systematic review of physician self-assessment compared with observed measures of competence showed that of the 20 included studies, 13 showed little, no, or an inverse relationship, and only seven demonstrated positive associations (39). The tendency of some interns to over-rate, and others to under-rate, themselves may be explained by the well-established Dunning-Kruger effect (41)(42).
In this article we describe research to determine the preparedness of the first cohort of graduates of a problem-based undergraduate medical programme in a middle-income country. We evaluated intern performance according to the interns' own perceptions together with those of their clinician supervisors, using a parallel instrument. To conceptualise intern competency we investigated specific competencies as well as how they indicate competency for the seven CanMEDS roles (43) which our Faculty of Medicine has recently used as part of a framework for evaluating the MBBS programme. We also explored the possible role of the Dunning-Kruger effect on self-evaluation data.

Context
The first group of medical graduates of the new Bachelor of Medicine, Bachelor of Surgery (MBBS) programme at the University of Botswana completed their internship at the end of 2015. The programme is outcomes-based and uses problem-based learning as its main learning strategy throughout its five-year duration. Towards the end of the internship period we realised that this was an opportunity not to be missed: we urgently needed to carry out an evaluation of their performance as interns, and to attempt to link this performance to the sum of their experiences in the undergraduate programme. The study objective was therefore to determine the extent to which the undergraduate MBBS programme at the University of Botswana prepared graduates to function effectively as interns.

Study design and data collection
A survey study design was used, with information gathered from interns and their supervisors. The study population consisted of 35 interns and 20 physicians who had supervised them during internship. An anonymous questionnaire for interns with quantitative and qualitative elements was used, with permission, from the University of Stellenbosch (34), which had in turn adapted it from an Australian study (8) to be more relevant to the local situation. The instrument was again modified slightly (two items added and three excluded) to suit our programme in Botswana. In addition, a parallel questionnaire was prepared for supervisors (the source instrument was for interns only), since we believed it would add to the validity of the study: interns assess their individual experience (an 'internal' perspective), whereas their supervisors report on the interns they have worked with as a group (an 'external' perspective). The final instrument required both groups of respondents to grade preparedness for 44 routine internship tasks on a five-level Likert type scale (Table 1) and to provide qualitative comments on tasks for which preparation was felt to have been good or insufficient. It was piloted with five interns and their supervisors at a site whose interns had trained at other universities and were therefore not eligible for inclusion in the research; a few minor corrections to the instruments were made. The questionnaire was administered in English, since English is the language of secondary and tertiary education in Botswana and graduating students are fluent in it, as are the doctors supervising the interns.

Table 1 Rating scale for internship preparedness: interns and their supervisors

1 = not prepared
  Intern: I did not know how to do this/I did not feel prepared to do this, even with supervision
  Supervisor: The interns appear not to know how to do this/do not seem prepared to do this, even with supervision
2 = a little prepared
  Intern: I was rather unsure of how to do this/I needed someone to guide me through the process
  Supervisor: The interns seem rather unsure of how to do this/need someone to guide them through the process
3 = fairly well prepared
  Intern: I was fairly sure of my ability/I was willing to try with some help
  Supervisor: The interns seem fairly sure of their ability/are willing to try with some help
4 = well prepared
  Intern: I felt that I knew how to do this/I could do this, but would have liked to have someone check my work
  Supervisor: The interns seem to know how to do this/can do this, but still want someone to check their work
5 = fully prepared
  Intern: I knew how to do this really well/I felt able to do this well without any assistance
  Supervisor: The interns know how to do this really well/are able to do this well without any assistance

Ethical approval for the research was obtained from the Institutional Review Boards of the University of Botswana and the Ministry of Health. As explained above, we collected the data near the end of the internship year, with the evident risk that ratings would be coloured by learning undergone during the year; there was, however, also a logic to this timing, since only by then would the interns have completed all the internship rotations, and some of the questionnaire items were specific to particular internship disciplines. The interns and their supervisors were visited in their workplaces in the three geographical internship hubs in the country and requested to participate voluntarily in the survey. The response rate was 77% for interns and 65% for intern supervisors. Each respondent was given a code and the data from each questionnaire were entered into Excel spreadsheets. The entered data were checked and cleaned.
Statistical calculations were performed using Excel and the Social Science Statistics website (https://www.socscistatistics.com/).

Data analysis
The 44 tasks in the questionnaire were grouped according to the seven roles given in the well-known 2015 version of the CanMEDS framework (43). We introduced this second level of analysis because it would help us to evaluate the interns' preparedness not only for individual tasks but also for roles that are locally and internationally considered to be important. One of the authors, who is familiar with CanMEDS, was tasked with studying the CanMEDS document and allocating each task to the role which seemed to fit best. This was not always straightforward; for example, the task 'Evaluate the impact of family factors on illness' could potentially be allocated to any of three CanMEDS roles, so the role in which the wording of the CanMEDS competencies most closely corresponded to that of the task was selected. After two iterations of this process the group of researchers approved the allocation. The result is shown in Table 2.

The quantitative data from interns and supervisors were analysed as follows:

1. Frequency distributions for the ratings given by interns and supervisors for each task were determined and summarised for each role. The percentages of ratings for each task and each role were calculated.

2. For both sets of respondents, Cronbach α values were determined for the group of tasks in each role, to establish the internal consistency of the tasks in a role. One task, 'Function effectively in a resource constrained environment', was excluded from the 'Leader' role since its inclusion resulted in an unacceptably low α value for the interns' group. The subsequent results are given in Table 3; there was then acceptable internal consistency in the way tasks were grouped in the roles.

3. To compare the summarised ratings given by interns and supervisors for a role, the two sets of five rating levels were compared using the χ² test, the ratings being ordinal rather than interval. This was done to establish the degree to which interns and their supervisors differed in their ratings of roles.

4. The direction of differences in rating frequencies can be problematic, since sets of ratings (even given as percentages) do not always show the direction of differences clearly, nor do they make it possible to estimate the size of a difference. To gain an understanding of the direction and size of these differences, the scores were treated as interval data (using the allocated values from '1' to '5'). This practice is also observed in other studies using Likert type scales (9)(12)(25). In the text the results of this operation are referred to as 'mean scores'.
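Steps 2 to 4 above can be sketched in code. The following is a minimal illustration using hypothetical Likert ratings (not our study data), assuming NumPy and SciPy are available; the `cronbach_alpha` helper and the rating matrices are introduced here purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_respondents x n_items) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)          # variance of each task
    total_var = ratings.sum(axis=1).var(ddof=1)      # variance of respondents' totals
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 Likert ratings for the tasks grouped in one role
interns = np.array([[4, 4, 5, 3],
                    [3, 4, 4, 3],
                    [5, 5, 5, 4],
                    [4, 3, 4, 4],
                    [2, 3, 3, 2]])
supervisors = np.array([[3, 3, 4, 3],
                        [4, 4, 4, 3],
                        [3, 3, 3, 2],
                        [4, 4, 5, 4]])

# Step 2: internal consistency of the tasks within the role
alpha = cronbach_alpha(interns)

# Step 3: compare the two frequency distributions over the five rating
# levels (ordinal data) with a chi-square test; levels used by neither
# group are dropped to avoid zero expected frequencies
levels = np.arange(1, 6)
freq = np.array([[(interns == k).sum() for k in levels],
                 [(supervisors == k).sum() for k in levels]])
chi2, p, dof, _ = chi2_contingency(freq[:, freq.sum(axis=0) > 0])

# Step 4: treat the ratings as interval data to obtain mean scores,
# which show the direction and size of the difference
intern_mean, supervisor_mean = interns.mean(), supervisors.mean()
```

In this toy example the intern mean score exceeds the supervisor mean score, mirroring the pattern reported for six of the seven roles.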

Having completed Step 1 of the analysis, we noted varying patterns of differences between intern and supervisor ratings for tasks and roles. In an attempt to understand these differences better, the correlations between the mean scores for tasks within each role were also determined, using the Spearman rank correlation test. This test was used because the Shapiro-Wilk test showed that none of the 14 datasets was normally distributed, with negative skewness in each case.
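As a sketch of this step, the fragment below applies the Shapiro-Wilk test and the Spearman rank correlation to hypothetical per-task mean scores for one role (again, not our study data), assuming SciPy:

```python
import numpy as np
from scipy.stats import shapiro, spearmanr

# Hypothetical mean scores, one value per task within a single role
intern_task_means     = np.array([3.9, 3.6, 4.1, 3.4, 3.8, 3.2, 4.0])
supervisor_task_means = np.array([3.5, 3.3, 3.9, 3.1, 3.6, 3.0, 3.7])

# Shapiro-Wilk normality check: a small p-value argues against
# normality, motivating a rank-based (Spearman) correlation rather
# than Pearson's r
stat_i, p_normal = shapiro(intern_task_means)

# Spearman rank correlation between intern and supervisor mean scores
rho, p = spearmanr(intern_task_means, supervisor_task_means)
```

A high, significant ρ (as in this constructed example) would suggest that interns and supervisors rank the tasks similarly even when their absolute ratings differ; a poor correlation would suggest that the two groups conceive of the tasks differently.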
We considered threats to validity inherent in the chosen instrument. The ideal would have been to observe interns in their daily practice, using for example workplace-based assessment tools such as the mini-CEX and DOPS, and to follow this with discussions about the degree to which undergraduate training had contributed to the observed performance.
This method would however have been very time-consuming and the Hawthorne effect could be expected to operate (44). The validity of the survey tool had received attention in the two previous studies in which it had been used, but reducing the observation of complex performance to a set of numbers is still a very serious threat to validity. We attempted to enhance the validity by piloting to ensure comprehension and by extending its use to supervisors as well as interns (thereby triangulating perceptions).
The qualitative data collected have not been used to comment on quantitative findings, and this could be the subject of a separate paper in future.

Results
The results of intern and supervisor ratings are given as mean scores, rather than as frequency distributions of ratings, which would have been too cumbersome. These mean scores are given in Table 4. A summary of the ratings given by interns and their supervisors for each of the seven CanMEDS roles, and overall, is given in Table 5; the values are given as percentages for easier comparison. The significance of the differences was determined by conducting χ² comparisons of the N values in each rating category for each role. The ratings given by interns and supervisors for individual tasks and roles differed considerably. Overall, and for four of the roles, the differences between intern and supervisor ratings were significant at a level of p<0.05, using the χ² test for these ordinal data (Table 5). As explained in the Methods section, the direction of differences between intern and supervisor ratings is difficult to interpret, so the ratings were analysed further by means of mean scores. In an attempt to explain possible reasons for the differences, the correlations between mean scores for all the tasks in each role were also calculated (Table 6). Overall, and for six of the roles, the intern mean scores were higher than those of their supervisors; for the 'Professional' role, however, the supervisor mean score was higher.
Varying patterns emerge in the mean scores of interns and supervisors: from large differences with significant correlation (the 'Medical expert' role), to large differences with poor correlation (the 'Scholar' role), to small differences with poor correlation (the 'Leader' role). Two of these patterns are illustrated in Figure 1 and Figure 2 below.

Discussion
The principal objective of the research was to determine whether the interns and their supervisors believed that the new undergraduate medical programme at the University of Botswana had adequately prepared the interns to work in the hospitals where they had been placed. The overall mean scores that interns and their supervisors gave to the selected tasks were 3.61 and 3.40 respectively, in other words between 'fairly well prepared' and 'well prepared', and well short of the 'fully prepared' level. This was also true of each individual role (Table 5).
As shown in the introduction, studies of intern preparedness internationally are characterised by their variety of objectives, designs and instruments, geographical settings, study populations and findings. The present study shares characteristics with other studies of this nature but is identical to none of them. It investigated a new programme rather than an established one (33). It investigated the outcome of one programme, rather than comparing two programmes (7)(8). The survey was done on completion of the internship year (5), rather than early in it (9) or even before embarking on it (35). Data were gathered from interns and their supervisors (31)(33)(37), not only from the interns themselves (22)(26)(30). The study investigated overall competence (3)(4), rather than competence in one specific area (23)(32). It focused on interns' ability to carry out relevant tasks grouped according to the CanMEDS roles, whereas other studies investigated competence in relation to national guidelines (13)(15). In this study interns rated themselves more highly than their supervisors did (12), rather than the other way around (7)(31). In common with the findings of other studies, it reveals broad overall competence in tasks and roles, but with specific shortcomings (34): in this case surgical procedures, managing labour, communicating with patients with terminal illness, managing uncooperative patients, serving in leadership roles, and selecting drugs in a cost-effective way. A further special characteristic of the present study is that it was conducted in a middle-income Southern country.
The interpretation of the findings raises important questions about the congruence of intern and supervisor ratings in studies of this kind. The supervisors rated interns' ability to carry out tasks lower than the interns themselves did for Roles 1 to 6. For Roles 1, 2 and 3 there is good internal consistency, a significant difference in ratings and a significant correlation between mean scores. In such cases it appears likely that interns and supervisors had the same conception of the tasks (and of the roles overall), but that the supervisors were stricter in their evaluation (see Figure 1). The research done by Kruger and Dunning (41) indicates that persons who are less skilled tend to rate their performance in tests more highly than they should: their actual performance belies their judgment of it. The most likely explanation for this phenomenon is that such persons not only lack knowledge and skill, but also lack the metacognition required to judge their performance as inadequate; the very knowledge and skill that they lack is also what they need to judge their performance. Relatively inexperienced interns would tend to lack the metacognitive ability they need to evaluate their own performance. The situation in Roles 4, 5 and 6, on the other hand, is different. There is good internal consistency in the ratings of both interns and their supervisors, but the combined ratings of interns and supervisors are not significantly different, nor is the correlation between their mean scores significant. The wording in the intern and supervisor questionnaires was practically identical and both were pre-tested, so the most likely explanation seems to be that within each group members have similar conceptions of the tasks (cf. the good Cronbach α values), which nevertheless differ from the conceptions of the other group, hence the poor correlation (see Figure 2).
This seems to be an almost inevitable weakness of studies of this nature, in which a small number of observers with differing backgrounds make the observations.
In their study of the reliability of multi-source feedback, Moonen-van Loon et al. showed that achieving high reliability requires several occasions for observation by a relatively large number of assessors, and that non-physicians' scores for the 'Scholar' and 'Health advocate' CanMEDS roles, and physicians' scores for the 'Health advocate' role, had a negative effect on composite reliability (40); the latter finding seems to parallel the findings of this study. But even for these three roles the Dunning-Kruger effect may be operating overall.
Williams, Dunning and Kruger (42) have also demonstrated that there is a curvilinear relationship between objective performance and self-evaluations of ability. The difference between actual and self-perceived ability is greatest in highly unskilled persons, and this difference decreases as the person becomes more skilled or knowledgeable. Medical graduates are unlikely to be 'highly unskilled' or 'highly lacking in understanding', so the difference between their estimation of their performance and their actual performance (as judged by their more skilled and experienced supervisors) should not be too great. This is in fact what our study found: for Roles 1 to 6 the difference in mean scores is a mere 7.6% overall.
The 'Professional' role seems to be a special case. In five of the seven tasks interns rate themselves lower than their supervisors do. We propose two possible explanations. Interns necessarily have an 'internal', individual, personal view of their competence, whereas supervisors have an 'external' view of the competence of a group of interns. For tasks like 'Deal with own emotions when a patient dies', 'Cope with stress caused by work' and 'Balance own work and personal life' interns know that they should appear to cope. They may be rated highly by their externally observing supervisors, while being aware of their own struggles and feelings of inadequacy and therefore rating themselves lower. Another possible explanation is Kruger and Dunning's finding that very highly skilled persons tend to be less confident about their relative performance than they should be (41).

Study limitations
This study has several limitations. Although the response rates were fair, the numbers of interns and supervisors were relatively small. The data collection instruments were originally designed for a different setting, and the data collected with them were not validated against direct observation of intern performance.

Conclusions
The study objective was achieved: we now know that our new MBBS programme prepared interns reasonably well. We also know of specific deficits in their performance which need to be corrected. The study has helped us to gain more insight into a process of data analysis which was initially carried out rather mechanistically, copying what was used in similar studies. During data analysis we were brought to question the pattern of differences between intern and supervisor ratings, which sometimes seemed contradictory.
We attempted to explain these by referring to the Dunning-Kruger effect, and by considering the different ways in which interns and their supervisors may have experienced and conceptualised tasks: interns evaluating themselves personally and internally, and supervisors evaluating a collective of interns externally. Remedial activity in the MBBS programme is currently underway; to evaluate its effect we need to repeat this study with a new group of interns after a suitable delay. Additional qualitative data in such a study should help to determine whether our explanations about the nature of differences in the data hold water, and may therefore be important in other studies; the fact that respondents may understand the tasks they rate differently has implications for all research of this nature.

Data obtained from participants were recorded anonymously, without any identifiers that could link the information to a particular participant.

Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
The research was funded by the Faculty of Medicine, University of Botswana. Other than suggesting the need for the research, the Faculty had no further input into the study.

Figure 1: Comparing intern and supervisor scores for the 'Medical expert' role. Figure 1 demonstrates good correlation between intern and supervisor scores, whereas the opposite is shown in Figure 2.