Reliability and acceptability of six station multiple mini-interviews: past-behavioural versus situational questions in postgraduate medical admission

The multiple mini-interview (MMI) is increasingly used for postgraduate medical admissions and in undergraduate settings. MMIs use mostly Situational Questions (SQs) rather than Past-Behavioural Questions (PBQs). A previous study of MMIs in this setting, where PBQs and SQs were asked in the same order, reported that the reliability of PBQs was non-inferior to SQs and that SQs were more acceptable to candidates. The order in which the questions are asked may affect reliability and acceptability of an MMI. This study investigated the reliability of an MMI using both PBQs and SQs, minimising question order bias. Acceptability of PBQs and SQs was also assessed. Forty candidates applying for a postgraduate medical admission for 2016–2017 were included; 24 examiners participated. The MMI consisted of six stations with one examiner per station; a PBQ and an SQ were asked at every station, and the order of questions was alternated between stations. Reliability was analysed for scores obtained for PBQs or SQs separately, and for both questions. A post-MMI survey was used to assess the acceptability of PBQs and SQs. The generalisability (G) coefficients for PBQs only, SQs only, and both questions were 0.87, 0.96, and 0.80, respectively. Decision studies suggested that a four-station MMI would also be sufficiently reliable (G-coefficients 0.82 and 0.94 for PBQs and SQs, respectively). In total, 83% of participants were satisfied with the MMI. In terms of face validity, PBQs were more acceptable than SQs for candidates (p = 0.01), but equally acceptable for examiners (88% vs. 83% positive responses for PBQs vs. SQs; p = 0.377). Candidates preferred PBQs to SQs when asked to choose one, though this difference was not significant (p = 0.081); examiners showed a clear preference for PBQs (p = 0.007). Reliability and acceptability of the six-station MMI were good among 40 postgraduate candidates; modelling suggested that four stations would also be reliable.
SQs were more reliable than PBQs. Candidates found PBQs more acceptable than SQs and examiners preferred PBQs when they had to choose between the two. Our findings suggest that it is better to ask both PBQs and SQs during an MMI to maximise acceptability.


Background
The single-station personal interview (SSPI) is widely used for medical and non-medical admission interviews. However, the SSPI has two significant problems: context specificity [1,2] and interviewer bias (i.e., the halo, or 'similar-to-me' effect) [2]. The multiple mini-interview (MMI), first used in 2004, is an interview method designed to overcome these problems [2].
SSPIs and MMIs either utilise situational questions (SQs) or past-behavioural questions (PBQs). SQs ask candidates what they would do in a certain hypothetical situation, whereas PBQs ask about the candidate's actual experience. Until recently, it was common to ask SQs rather than PBQs in MMIs [16,17], although both PBQs and SQs have been widely used in SSPIs [18]. Studies of non-medical admissions have demonstrated that reliability and acceptability are similar for PBQs and SQs in SSPIs, though PBQs have a higher predictive validity for high-complexity jobs, compared with SQs [16,18]. One study [17] reported that the reliability of PBQs was noninferior to SQs for an MMI-format postgraduate medical admission interview and that an MMI with five stations and two examiners per station was sufficient to ensure reliability when a structured approach was used. However, the study generated several additional questions about MMIs that need further investigation. Candidates were asked two questions per station: a PBQ and an SQ, always in that order. The answer to the first question may have affected the answer to the second, as they were asked at the same station; the reliability of SQs and the acceptability of PBQs and SQs may therefore have been affected by the fixed order of questions. Candidates in the study considered SQs more acceptable and easier to answer than PBQs, which may have been because they had adapted to the interview and were feeling more comfortable when answering the second question (an SQ). An investigation of the reliability of MMI using both types of questions, in different orders, would be of value.
This study aimed to investigate the reliability of PBQs, SQs, and both question types together using a six-station MMI with one examiner per station and an alternating question order at each station to minimise question order bias. It also aimed to assess the acceptability of PBQs and SQs among candidates and examiners.

Settings and participants
After completing medical school, graduates in Japan obtain their medical licence by passing a national board examination. This is followed by the completion of the two-year National Obligatory Initial Postgraduate Clinical Training Programme (NOIPCTP) [17,19], after which physicians hold unlimited licences and must obtain specialty training to become board-certified specialists. This study was conducted among individuals applying for specialty training in internal medicine, surgery, and emergency medicine. The selection was held on two days in September and October 2015 and two days in September 2016 at Tokyo Bay Urayasu Ichikawa Medical Center (TBUIMC), a midsize community hospital in Chiba, Japan, which has used MMIs since 2013 [17]. There were 24 examiners (23 men and one woman) involved over the 4 days, all of whom were licensed attending physicians in internal medicine, surgery, or emergency medicine at TBUIMC. All candidates, regardless of the specialty for which they were applying or their postgraduate year level, were examined by all examiners in attendance on each day. Examiners were randomly allocated to stations and stayed at the same station throughout the process.

Intervention
This study used six stations, each with one examiner assigned. There were two reasons for the reduction in the number of stations from the usual ten to six. First, a previous study in this setting demonstrated that an MMI with six stations and one examiner per station could ensure good reliability [17]. Second was the issue of cost. In Japan, especially in small to midsize community hospitals, attending physicians as examiners are a very limited resource. Numbers of examiners were therefore reduced as much as possible while maintaining reliability.
In 1999, the Accreditation Council for Graduate Medical Education introduced six domains of clinical competency for physicians: medical knowledge; patient care and procedural skills (PCPS); system-based practice (SBP); interpersonal and communication skills (ICS); practice-based learning and improvement (PBLI); and professionalism (Pro) [20,21]. Each domain included two to eight sub-domains [20]. Each station was set up to examine one of the domains of competence, with one station for each of PCPS, PBLI, ICS, and SBP, and two stations for Pro. The domain of medical knowledge was excluded because it was not considered appropriate for assessment through MMI. Two stations were set up for Pro because the TBUIMC training programme committee regarded it as the most important of the six domains. Each domain was randomly allocated two of its associated sub-domains (one per question) for each station (Table 1). All of the PBQs and SQs were constructed based on questions previously used in MMIs at TBUIMC, some of which have been previously reported [17].
One PBQ and one SQ were asked to every candidate at every station. The six stations were divided into two groups of three stations each: in the first group, the PBQ was asked first; and in the second group, the SQ was asked first. Candidates were assessed at group one and group two stations in alternate order to minimise question order bias. Each station was allotted 10 min, with 5 min allowed for each question and a 1-min break between stations.
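The counterbalanced layout described above can be sketched in a few lines of Python. This is only an illustration: the station labels (`G1-1`, `G2-1`, and so on) are hypothetical, the strict interleaving of group one and group two stations is our reading of "alternate order", and the timing calculation assumes that breaks occur only between consecutive stations.

```python
from collections import Counter

STATION_MINUTES = 10   # from the text: 10 min per station
QUESTION_MINUTES = 5   # from the text: 5 min per question
BREAK_MINUTES = 1      # from the text: 1-min break between stations


def build_circuit(stations_per_group=3):
    """Interleave PBQ-first (group one) and SQ-first (group two) stations.

    Station labels are hypothetical placeholders, not the study's names.
    """
    group_one = [(f"G1-{i}", ("PBQ", "SQ")) for i in range(1, stations_per_group + 1)]
    group_two = [(f"G2-{i}", ("SQ", "PBQ")) for i in range(1, stations_per_group + 1)]
    circuit = []
    for pbq_first, sq_first in zip(group_one, group_two):
        circuit.extend([pbq_first, sq_first])
    return circuit


def circuit_minutes(n_stations):
    """Total time for one candidate, assuming breaks only between stations."""
    return n_stations * STATION_MINUTES + (n_stations - 1) * BREAK_MINUTES


circuit = build_circuit()
# Count how often each question type is asked first across the circuit.
first_questions = Counter(order[0] for _, order in circuit)

for label, order in circuit:
    print(label, "->", " then ".join(order))
print("PBQ first:", first_questions["PBQ"], "| SQ first:", first_questions["SQ"])
print("Minutes per candidate:", circuit_minutes(len(circuit)))
```

Under these assumptions each question type is asked first at three of the six stations, so neither format systematically benefits from being the second question.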
Before asking a PBQ, the candidate was informed that the question was about their experience during their junior residency; the Situation-Task-Action-Result (STAR) approach was applied to guide the answers [17,22]. Before asking an SQ, the examiner explained that the question was about what would happen if they were to work as a senior resident at TBUIMC; a hypothetical scenario was described: candidates were presented with an ethical dilemma and asked what they would do, selecting one of two or more mutually exclusive possible courses of action [17,18]. This was followed by structured probing by the examiner [16,17].
All candidates were fully informed about the logistics of the MMI by email in advance and orally on the day of the MMI; all agreed for the results to be published. No information about which competency sub-domains would be assessed was provided to the candidates. Sixteen (67%) of the 24 examiners had previous experience in MMIs at TBUIMC and had therefore undergone training in the previous year. The remaining eight (33%) first-time examiners were trained prior to beginning the MMI using a method previously described [17]. Changes made to earlier methods were detailed. Examiners were given general instructions to keep the interview questions on track and to minimise close rapport-building with the candidates during the examination.
To assess candidates, examiners used rating rubrics that have been used for interviews at TBUIMC since 2013 [17] (Additional file 1). These included evaluation of three areas: 'communication skills', 'strength and certainty of the answer', and 'suitability for the programme'. A five-point scale, each point defined with a descriptor, was used to score each area. All three rubrics were applied to each question. On the day of the MMI, a group of candidates rotated through all six stations.
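The rubric structure described above can be sketched as a small validation helper. Two points here are assumptions rather than details from the study: the five-point scale is coded numerically as 1 to 5, and the three area ratings are summed into a per-question total. The rubric descriptors themselves are in Additional file 1 and are not reproduced.

```python
AREAS = (
    "communication skills",
    "strength and certainty of the answer",
    "suitability for the programme",
)
SCALE_MIN, SCALE_MAX = 1, 5  # assumed numeric coding of the five-point scale


def question_total(ratings):
    """Validate one question's ratings and return their sum.

    Summing the three areas is an illustrative assumption, not a
    scoring rule stated in the text.
    """
    missing = [area for area in AREAS if area not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for area in AREAS:
        if not SCALE_MIN <= ratings[area] <= SCALE_MAX:
            raise ValueError(f"rating out of range for {area!r}: {ratings[area]}")
    return sum(ratings[area] for area in AREAS)


def ratings_per_candidate(n_stations=6, questions_per_station=2):
    """Number of individual ratings each candidate receives."""
    return n_stations * questions_per_station * len(AREAS)


example = {
    "communication skills": 4,
    "strength and certainty of the answer": 5,
    "suitability for the programme": 3,
}
print(question_total(example))
print(ratings_per_candidate())
```

With six stations, two questions per station, and three rated areas per question, each candidate receives 36 individual ratings.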

Post-MMI survey (Table 2)
At the end of the MMI process, all candidates and examiners answered a brief, anonymous survey, which was based on post-MMI surveys used at TBUIMC since 2013 [17]. In general, overall acceptability of MMI is evaluated by integration of face validity, candidate (or examiner) reaction, fairness, and feasibility. Therefore, to assess face validity, participants were asked about general satisfaction with the MMI method (Table 2: 1C, 1E), candidates' satisfaction with the abilities assessed, and examiners' opinions about the accuracy of assessing these abilities based on PBQ and SQ formats (Table 2: 2C, 2E); to assess candidate or examiner reaction, they were asked about the adequacy of time and ease in answering or asking questions in both formats (Table 2: 3C, 3E, 4C, 4E); and to assess general fairness, comparisons were made with SSPIs and questions asked about the acceptability of workloads (Table 2: 5C, 5E, 6C, 6E).
Table 1 notes: (a) competency number in the Accreditation Council for Graduate Medical Education (ACGME) common programme requirements [21]; (b) sub-domain number within the competency in the ACGME common programme requirements [21]. PCPS, patient care and procedural skills; PBLI, practice-based learning and improvement; ICS, interpersonal and communication skills; Pro, professionalism; SBP, system-based practice; PBQ, past behaviour question; SQ, situational question.
Table 2 notes: (C), questions for candidates; (E), questions for examiners; PR, positive response, includes "mostly agree" and "agree"; NR, negative response, includes "mostly disagree" and "disagree"; MMI, multiple mini-interview; PBQ, past behavioural question; SQ, situational question; SSPI, single station personal interview. Survey instruction: "Please write the reason in the space provided for free comments."
All responses were recorded using a four-point Likert scale (disagree [1], mostly disagree [2], mostly agree [3], or agree [4]). Participants were also asked two additional questions: whether they preferred the interview to include both question formats or only one; and, if they had to select only one type, which of PBQs or SQs they would choose. Space was provided for comments about these two questions. Participants were informed that individual survey answers would be kept confidential, would be used for research purposes, and would not affect selection decisions.
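The four-point coding and the grouping into positive and negative responses can be sketched as follows. The response list in the example is synthetic and purely illustrative; it is not study data. The helper names are ours.

```python
# Four-point Likert coding as described in the text.
LIKERT = {1: "disagree", 2: "mostly disagree", 3: "mostly agree", 4: "agree"}


def is_positive(response):
    """A positive response is 'mostly agree' (3) or 'agree' (4)."""
    if response not in LIKERT:
        raise ValueError(f"not a valid Likert code: {response}")
    return response >= 3


def positive_rate(responses):
    """Proportion of positive responses in a non-empty list of codes."""
    if not responses:
        raise ValueError("no responses given")
    return sum(is_positive(r) for r in responses) / len(responses)


def as_percent(count, total):
    """Whole-number percentage, as reported in the survey summaries."""
    return round(100 * count / total)


synthetic = [4, 3, 3, 2, 1, 4]       # illustrative only, not study data
print(positive_rate(synthetic))      # 4 of the 6 codes are 3 or 4
print(as_percent(53, 64))            # a count of 53 out of 64 respondents
```

Dichotomising in this way discards the distinction between "mostly agree" and "agree", which is the usual trade-off when Likert responses are summarised as a single positive-response percentage.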

Data analysis
To determine reliability, the MMI scores were analysed using generalisability (G) theory.

Reliability
We calculated the G-coefficients used in G and D studies.
The estimated variance components of candidates' ability on PBQs, SQs, and both questions were 0.312–0.476 (Table 3), suggesting that the candidates were not a standardised group, but had moderate differences. The estimated variance components of the stations were small, suggesting that the level of difficulty in each station was adequate. In the D study, the G-coefficients for PBQs alone, SQs alone, and both question formats were 0.87, 0.96, and 0.80, respectively, with six stations and one examiner (Table 4). These values were 0.82, 0.94, and 0.73, respectively, when this was reduced to four stations.
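The relationship between the six-station and four-station coefficients can be illustrated with a simplified projection. The sketch assumes a single-facet design (candidates crossed with stations) in which the relative error variance is divided by the number of stations, so that G = σ²_p / (σ²_p + σ²_δ / n). This is an assumption made for illustration; the study's own G-theory model and the variance components in Table 3 are not reproduced here, and the function names are ours.

```python
def error_ratio_per_station(g, n_stations):
    """Back out (relative error variance / candidate variance) for one station.

    Rearranges G = 1 / (1 + ratio / n) for a single-facet design.
    """
    if not 0 < g < 1:
        raise ValueError("G coefficient must lie strictly between 0 and 1")
    return n_stations * (1 - g) / g


def project_g(g_observed, n_observed, n_new):
    """Project the G coefficient to a different number of stations."""
    ratio = error_ratio_per_station(g_observed, n_observed)
    return 1 / (1 + ratio / n_new)


# Six-station coefficients reported in the text.
reported_six = {"PBQ": 0.87, "SQ": 0.96, "both": 0.80}

# Projected four-station coefficients under the single-facet assumption.
projected_four = {
    name: round(project_g(g, n_observed=6, n_new=4), 2)
    for name, g in reported_six.items()
}
print(projected_four)
```

Under this simplification the projected values round to 0.82, 0.94, and 0.73, in line with the four-station figures reported above, although agreement at two decimal places does not show that the study used this exact model.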

Acceptability
All 64 participants (n = 40 candidates and n = 24 examiners) answered the post-MMI survey regarding acceptability. Overall, 53/64 (83%) participants were satisfied with the MMI in this study (

Discussion
We conducted an MMI with six stations and one examiner per station and found that the overall performance of this MMI format was reliable. In contrast to previous work in this setting, the reliability of SQs was superior to PBQs, which may be the result of minimising question order bias. As previously described, PBQs have been shown to have good reliability and validity in non-medical admissions, particularly showing a higher predictive validity for high-complexity jobs when compared with SQs [16,18]. A Canadian study also reported that PBQs were more reliable than SQs in medical admissions [23]. We therefore tried to compare the reliability of PBQs with SQs in the setting of postgraduate medical admission in Japan because applicants are likely to have had more experience and more exposure to complex work than undergraduates. Our study showed that the reliability of SQs was better than PBQs. However, in general, G-coefficient scores of 0.80 or higher are considered to represent excellent reliability. Therefore, both PBQs and SQs were sufficiently reliable for junior residents under NOIPCTP in Japan. The reliability of both PBQs and SQs was better than in a previous study in this setting [17]. Other than minimising question order bias, the good reliability observed may be because two-thirds of the examiners had previous experience in MMIs at TBUIMC and the remainder were trained in advance [17]. The examiners were therefore sufficiently similar and the assessments of each examiner had a certain amount of homogeneity. This study also showed that an MMI with four stations and one examiner per station using PBQs or SQs was sufficiently reliable, suggesting that MMIs can be conducted with fewer examiners and stations if context specificity, interviewer bias, and training of examiners are carefully accounted for. This finding may contribute to improvements in MMIs for postgraduate medical admissions.
However, acceptability may decrease if MMIs use either PBQs only or SQs only, as over 80% of participants preferred to use both question formats rather than only one. Reliability of SQs was very high, but this may have been because SQs evaluated a narrower range of candidates' abilities, suggesting that the validity of an MMI using SQs alone may not be satisfactory. We plan to evaluate the validity of an SQ-only MMI method in the future. In addition, reliability was analysed in the context of two questions per station. Future studies should investigate the reliability of each type of question when asked alone at one station if we want to determine the reliability of PBQs only or SQs only with more accuracy.
Overall, over 80% of participants gave positive responses ('mostly agree' or 'agree') to most questions in the post-MMI survey; 83% of participants were satisfied with the MMI method used and over 93% were satisfied in terms of fairness and workload, suggesting that the overall acceptability of this MMI method was good. In particular, the acceptance of the workload by 96% of examiners suggests that this MMI method may be feasible for use in midsize community hospitals like TBUIMC. In contrast to previous findings among candidates at TBUIMC, PBQs were more acceptable and easier to answer than SQs in this study [17]. Minimising question order bias may provide a more accurate estimate of acceptability of the MMI. The majority of participants indicated that both questions were acceptable but examiners clearly preferred PBQs when they were asked to choose between them (p = 0.007). Based on the free comments in the surveys, some of the candidates and examiners who preferred PBQs to SQs felt that PBQs assessed candidates' actual experience and therefore seemed more reliable; those who preferred SQs to PBQs felt that SQs used a complicated scenario with an ethical dilemma and therefore seemed more suitable for evaluating a candidate's ability. Irrespective of these differences, 85% of candidates and 83% of examiners preferred to use both PBQs and SQs, instead of only one question format. The most frequently listed reason for this was that using both question formats provided more chances to express or evaluate abilities. With these findings in mind, we suggest it would be preferable to use both question formats to maximise acceptability of the MMI. However, reliability and acceptability are only two aspects of question format; validity is also an important aspect and requires consideration and further investigation.
This study had limitations. First, it was conducted in one medical centre, which does not allow for generalisation to other medical programmes. Therefore, multicentre studies are needed to further investigate the reproducibility of these findings. Second, it is usual, in MMIs, for each ability to be assessed by a separate examiner at each station. In this study, a single examiner asked both a PBQ and an SQ at each station. This was potentially a major source of bias. However, we arranged for the conditions of the two types of question to be the same and therefore thought it would not be a problem when comparing PBQs with SQs.

Conclusions
This MMI method, with six stations, one examiner per station, and PBQ and SQ question formats that alternated in order at each station, showed good reliability and acceptability. SQs were more reliable than PBQs. Modelling suggested that an MMI with four stations and one examiner per station using either question format may be sufficiently reliable. Candidates found PBQs more acceptable than SQs and examiners preferred PBQs when they had to choose between the two. Our findings suggest that it is better to ask both PBQs and SQs during an MMI to maximise the acceptability of the assessment.