Examiner Quality and Consistency across LanguageCert Writing Tests

This paper reports on a study of the training and standardisation of examiners who mark LanguageCert’s International ESOL (IESOL) suite of English language tests linked to the Common European Framework of Reference (CEFR). Subjects in the study were a set of examiners (N=27) who had been marking LanguageCert’s IESOL Writing tests across the six CEFR levels. The focus of the study was on the consistency of marking in terms of severity within and across the six tests that the examiners mark. Correlations between examiner person measures across all six tests indicated that examiners were broadly consistent across tests, with examiner person measures generally correlating highly with their ‘partner’ test: A1 with A2, C1 with C2, and B1 with B2 tests. LanguageCert examiners – who undergo careful training and standardisation – may therefore be seen to mark consistently and accurately across a range of ability levels.


Introduction
One of the maxims of assessment is that tests be valid and provide accurate assessments of candidates' abilities, in particular in terms of how far a given test score may be interpreted as an indicator of the abilities or constructs to be measured (Bachman & Palmer, 1996; Messick, 1989). Under such a precondition, the marking of candidates' performances needs to be accurate if reliable assessments are to emerge. However, accurate marking in performance assessment involving examiner judgment is an enduring challenge because scores assigned to candidate performance are mediated, interpreted and applied by examiners, who are a potential source of error (Engelhard, 2002). It follows that examiners need to be properly trained and standardised, in particular for subjectively marked performance tests such as Speaking and Writing.
This paper reports on a study of the training and standardisation of examiners who mark LanguageCert's International ESOL (IESOL) suite of English language tests linked to the Common European Framework of Reference (CEFR). Subjects in the study were a set of examiners (N=27) who had been marking LanguageCert's IESOL Writing tests across the six CEFR levels.
The focus of the study was on the consistency of marking in terms of severity within and across the six tests that the examiners mark.

Background to Tests, Examiners and Scripts
The data in the study were drawn from the six examinations which comprise LanguageCert's International ESOL suite of English language tests. In the LanguageCert Writing tests, candidates complete two writing tasks which elicit a range of writing skills. Responses are marked using an analytic mark scheme which reflects the CEFR descriptors. Examiners award separate marks for four aspects of writing ability: Task Fulfilment, Accuracy and Range of Grammar, Accuracy and Range of Vocabulary, and Organisation of the text. This set of criteria ensures that a wide range of writing skills is considered, thus enhancing the reliability and representativeness of test scores.
The format of the tests and the nature of the assessment criteria reflect the broad multi-faceted construct underlying these examinations. Communicative ability is the primary concern, while accuracy and range are increasingly important as the CEFR level of the test increases.

Examiner Training
Examiner training has long been accepted as an essential factor in determining the reliability of any English language examination (see e.g., Webb et al., 1990). Although empirical studies on examiner training have generated mixed results, the general consensus is that examiner training, if well designed, can improve the reliability and validity of examiner-mediated assessment (Kang et al., 2019). Studies have shown trained examiners to be more reliable (Saito, 2008) as well as more self-consistent (Davis, 2016) than untrained examiners.
In the case of performance-based assessment, it is important to attempt to ensure reliability through extensive examiner training and standardisation, even to the point of sanctioning inconsistent examiners (see Elder et al., 2007). Webb et al. (1990) discuss the problems associated with examiner stringency, leniency and inconsistency. They state that problems with examiner stringency and leniency can be handled by statistical adjustment. They make it clear nonetheless that examiner training is essential for other problems, specifically examiner inconsistency. Weigle (1998) found examiner training to be more effective in enhancing intra-examiner reliability than inter-examiner reliability. Lumley & McNamara (1995), in discussing examiner inconsistency, report not only that training and standardisation are essential, but also that further moderation is required shortly before the administration of Writing or Speaking tests, because inconsistencies re-emerge when there is a time gap between training and the assessment event.
In order to address the issue of consistency, severity and leniency amongst the group of LanguageCert examiners, Multi-Faceted Rasch Analysis (MFRA), via the computer program FACETS (Linacre, 2020), has been utilised. A brief outline of the Rasch measurement model and MFRA is given below.

The Rasch Model
The use of the Rasch model enables different facets (person ability and item difficulty in the current instance) to be modelled together. First, in the standard Rasch model, the aim is to obtain a unified and interval metric for measurement. The Rasch model converts ordinal raw data into interval measures which have a constant interval meaning and provide objective and linear measurement from ordered category responses (Wright, 1997). This is not unlike measuring length using a ruler, with the units of measurement in Rasch analysis (referred to as 'logits') evenly spaced along the ruler. Second, once a common metric is established for measuring different phenomena (candidates and test items being the most obvious), person ability estimates are independent of the items used, and item difficulty estimates are independent of the sample recruited. This is because the estimates are calibrated against a common metric rather than against a single test situation (for person ability estimates) or a particular sample of candidates (for item difficulty estimates). Third, Rasch analysis has an advantage over Classical Test Analysis statistics in that it calibrates persons and items onto a single unidimensional latent trait scale (Bond, Yan & Heene, 2020).
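For reference, the dichotomous form of the model outlined above is commonly written as shown here. The symbols are conventional Rasch notation rather than notation taken from this study: P_ni is the probability that person n succeeds on item i, B_n is the person's ability and D_i is the item's difficulty, both expressed in logits.

```latex
\[
\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i
\]
```

Because the log-odds depend only on the difference B_n - D_i, a one-logit gap between a person and an item has the same meaning anywhere along the scale, which is the sense in which the units are evenly spaced.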
Person measures and item difficulties are placed on an ordered trait continuum, which allows direct comparisons between person measures and item difficulties. Consequently, results can be interpreted with a more general meaning. The use of MFRA adds flexibility to the measurement by allowing the incorporation of facets in addition to person ability and item difficulty. As the current study focuses on the examiner facet (leniency vs severity of marking) in IESOL Writing tests, the analysis includes three facets: candidates, rating scales, and examiners.
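To illustrate how an examiner facet enters the model, the following is a minimal sketch of a many-facet rating scale formulation, in which the log-odds of adjacent score categories are taken to be candidate ability minus criterion difficulty minus examiner severity minus a category threshold. The function names, parameter names and all numeric values are hypothetical; the sketch does not reproduce the FACETS program or the model specification used in the study.

```python
import math


def category_probabilities(ability, criterion_difficulty, examiner_severity, thresholds):
    """Return the probability of each score category (lowest first).

    All arguments are in logits. thresholds[k - 1] is the step difficulty
    between category k - 1 and category k, so a four-point scale has
    three thresholds.
    """
    theta = ability - criterion_difficulty - examiner_severity
    # The log-numerator of category k is the running sum of (theta - threshold);
    # the lowest category has a log-numerator of 0.
    log_numerators = [0.0]
    for step in thresholds:
        log_numerators.append(log_numerators[-1] + theta - step)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(probabilities):
    """Expected category index (0 = lowest category)."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical values: same candidate and criterion, two different examiners.
thresholds = [-1.5, 0.0, 1.5]
lenient = category_probabilities(0.5, 0.0, -1.0, thresholds)
severe = category_probabilities(0.5, 0.0, 1.0, thresholds)
print(round(expected_score(lenient), 2), round(expected_score(severe), 2))
```

Under these assumptions, the same candidate performance receives a lower expected score from the more severe examiner, which is the effect the examiner facet is intended to capture.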

Principles and Procedures in Training Examiners
As stated earlier, in any examination of direct performance it is important to attend to the question of examiner reliability. Although there is no agreement regarding the most effective training and standardisation methods (Kogan et al., 2015), in assessments of performance which rely wholly on examiner applications of the criteria established for the assessment, reliability can be established through a process of:
• agreement on the validity of assessment constructs
• creation of detailed specifications
• creation of valid, detailed and usable descriptors
• provision of credible and regular examiner training and standardisation
See also Feldman et al. (2012), where a cogent summary of different modes of examiner training is provided.
The purpose of standardising examiners is to ensure that strong measures of agreement occur whenever a number of examiners apply grade descriptors to a criterion-referenced assessment instrument. This is the case with the LanguageCert Writing tests. In criterion-referenced assessment, which depends on the application of examiners' judgements to the criteria described in the descriptors, it is important that two principles are adhered to:
• Judgements by one examiner over time with a number of candidates need to be consistent.
• Different examiners judging an individual candidate should provide assessments that are in close agreement.
There are a number of well-established standard procedures that can be used to train and standardise language examiners (see e.g., Coniam & Falvey, 2018). These procedures were applied in the specific training procedures used with trainee examiners for the IESOL Writing Tests and are described below.

Participants
All writing examiners must meet minimum requirements in terms of professional qualifications and experience in order to be eligible for consideration as an examiner. Prospective examiners go through a training process before they are approved and allowed to mark. The training process includes marking sample scripts. Candidates for the examiner role must show they can mark accurately and consistently before they are certificated as examiners. Once certificated, examiners are monitored on an ongoing basis and required to attend standardisation meetings regularly. During live marking, an examiner who is found to be marking inaccurately and/or inconsistently may be removed from the marking session and/or retrained or dismissed as an examiner.
The participants were 27 examiners who had been marking LanguageCert's IESOL suite of examinations for a considerable period of time. All 27 examiners marked the A1 and A2 scripts; however, only 24 examiners were available for the other four tests, i.e., B1 to C2.

Standardisation
Examiners were familiar with the rating scales, since they had been using them for five years. The standardisation session described in this paper took place in 2018 and is a regular feature of the re-training and standardisation undergone by LanguageCert assessment personnel. The process was led by the Chief Examiner, who has marked examinations linked to the CEFR for over 20 years.
Examiners were first given the rating scales and LanguageCert's Guide for Examiners and asked to familiarise themselves with the constructs and levels in the scales. Some brief discussion was then followed by two stages of training, Induction and Training, each consisting of the assessment of 36 benchmarked scripts (six per CEFR level) and subsequent discussion of queries, potential discrepancies between raters, the applicability of descriptors, etc. The sample scripts shared with examiners during the Induction and Training stages exemplified the four criteria along with the performance descriptors which constitute the marking scheme.
Over a period of a day and a half, examiners then marked, one test at a time, six scripts from each of the six tests in the LanguageCert IESOL suite (i.e., from A1 to C2). The marking began with the six A1 scripts and progressed upwards through the levels. After each set of marking, and once all examiners had submitted their awarded marks, the Chief Examiner revealed the scores he had awarded and led some discussion about the merits of different scripts.
LanguageCert training and standardisation procedures and practices may therefore be seen to equate primarily with those employed under performance dimension training (PDT; see Kogan et al., 2015), as all three training stages (Induction, Training, Standardisation) are based on the assessment of a series of sample scripts (performances), selected and/or adapted to demonstrate certain issues in candidate performance. To account for potential discrepancies in marking as a result of raters' idiosyncratic tendencies (e.g., leniency), elements of a frame of reference training (FoRT) methodology were employed so that the role of subjectivity in the application of the marking criteria was minimised.

The IESOL Writing Test
The IESOL Writing tests comprise two tasks, as laid out in Figure 1. Concerning marking, all tasks conform to CEFR 'can do' statements for writing and are assessed on a four-point scale on four domains. Figure 2 illustrates.

Figure 2. Rating scale domains
• Task Fulfilment
• Accuracy and range of grammar
• Accuracy and range of vocabulary
• Organisation

Method
The key research question for this study is whether examiner severity will be comparable within each test and across tests at the six CEFR levels; i.e., whether examiners will apply the marking descriptors accurately and be consistently lenient / severe on tests within a level and across levels. Two indicators of examiner severity and consistency were examined to address the research question.
The first indicator generated from the Rasch analysis is the person fit statistic. This statistic is not a direct indicator but a pre-requisite of examiner consistency. Examiner performance has to satisfy Rasch measurement requirements (i.e., the fit to the Rasch model) before any meaningful discussions on severity estimates may be made. The computer program FACETS (Linacre, 2020) provides a number of statistics which give an indication as to how well the data fits the model. One of these is the mean square statistic. For person fit statistics (examiners, in our case), acceptable practical limits of fit have been proposed as 0.5 for the lower limit and 1.5 for the upper limit (Lunz & Stahl, 1990).
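The mean square fit statistics referred to above can be sketched as follows, using the standard definitions: infit is the sum of squared residuals divided by the sum of the model variances, and outfit is the unweighted mean of the squared standardised residuals. The function name and the sample numbers are hypothetical, and the expected ratings and variances are assumed to have been obtained from an already fitted model; this is an illustration of the arithmetic, not of how FACETS is implemented.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean squares for one examiner's ratings.

    observed  -- ratings actually awarded by the examiner
    expected  -- ratings expected under the fitted model
    variances -- model variance of each rating (must be positive)
    """
    if not observed or not (len(observed) == len(expected) == len(variances)):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Infit: information-weighted, so ratings with more model variance count more.
    infit = sum(squared_residuals) / sum(variances)
    # Outfit: plain average of squared standardised residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(observed)
    return infit, outfit


# Hypothetical ratings, model expectations and variances for one examiner.
infit, outfit = fit_mean_squares(
    observed=[2, 3, 1, 2],
    expected=[2.2, 2.6, 1.4, 1.9],
    variances=[0.6, 0.5, 0.6, 0.7],
)
print(round(infit, 2), round(outfit, 2))
```

A value near 1.0 indicates that the examiner's ratings vary about as much as the model predicts, which is why the practical limits cited above are set on either side of 1.0.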
The second indicator relates to examiner invariance across tests. While MFRA provides a framework for obtaining fair measurements of examinee ability that can be statistically invariant over examiners, tasks, and other aspects of performance assessment procedures, this applies only within a single test. In the current study, examiner invariance across the six tests is examined via Spearman's rho, which reports rank order correlations between tests. A high correlation indicates consistency in the rank order of examiner severity estimates.
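As an illustration of the rank order correlation described above, the following sketch computes Spearman's rho as the Pearson correlation of ranks, with tied values given the mean of their ranks. The severity measures shown are invented for the example and are not data from the study; the function names are likewise hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two sets of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when all values in a list are tied")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical severity measures (logits) for five examiners on two tests.
severity_b1 = [-0.8, -0.2, 0.1, 0.5, 1.1]
severity_b2 = [-0.6, 0.0, -0.1, 0.7, 0.9]
print(round(spearman_rho(severity_b1, severity_b2), 2))
```

In this invented example only two adjacent examiners swap places between the two tests, so the correlation remains high; a larger reshuffling of the severity ranking would lower it.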

Examiner Fit to the Rasch Model
As the cornerstone of good rating is fit to the Rasch model, results are first presented below for the examiners on each of the six tests. Tables 2a and 2b present the results for the 27 examiners who participated in the standardisation exercise. As mentioned, 24 examiners marked all six tests, with the whole cohort of 27 examiners marking tests A1 and A2. In the tables, Infit is reported. Infit shows the 'big picture' in that it scrutinises the internal structure of a facet (examiners, in this case). Generally speaking, high infit values (above 1.5) would suggest that an examiner's ratings were rather 'scattered', providing a confused picture about the placement of the examiner's ratings. Very small infit values (below 0.5) indicate only very small variation in the data, thereby providing little information from which to form clear and meaningful judgments about the examiner and their ratings.
Infit figures above 1.5 are highlighted in yellow, while Infit figures below 0.5 are highlighted in green. In the data and discussion below, all examiner names have been anonymised. As can be seen from the data in Tables 2a and 2b, examiner fit to the model was generally good; there were only one or two examiners who showed underfit (i.e., with a mean square of over 1.5) on each test.
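The screening rule applied to the tables can be expressed as a short sketch. The 0.5 and 1.5 limits are those cited in the text; the label 'overfit' for values below 0.5 is conventional Rasch terminology supplied here rather than a term used in the paper, and the examiner labels and infit values are invented for illustration.

```python
def classify_infit(infit, lower=0.5, upper=1.5):
    """Label an infit mean square against the practical limits of fit."""
    if infit > upper:
        return "underfit"  # ratings more scattered than the model expects
    if infit < lower:
        return "overfit"  # ratings vary less than the model expects
    return "acceptable"


# Hypothetical infit values for three anonymised examiners on one test.
examiner_infit = {"Examiner 01": 0.92, "Examiner 02": 1.71, "Examiner 03": 0.44}

flagged = {
    name: classify_infit(value)
    for name, value in examiner_infit.items()
    if classify_infit(value) != "acceptable"
}
print(flagged)
```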

Examiner Consistency across Tests
Having established that examiners broadly fit the model acceptably, the next step involves examining examiner consistency across tests. Table 3 presents the results of rank order correlations (via Spearman's rho) conducted on examiner person measures across the six tests. Correlations significant at the 0.01 level are highlighted in yellow, and those significant at the 0.05 level in green. While the A2 test appears to correlate with almost all tests, all tests correlate quite highly with at least two other tests. The implication of these correlations is that the rank order of the examiners is broadly consistent across tests: if an examiner is going to be strict on one test, it is quite likely that they will be strict on other tests.

Conclusion
This study has examined the issue of examiner severity and invariance across LanguageCert's six CEFR-linked IESOL Writing tests. The research question was whether examiner severity would be comparable within each test and across the six tests; i.e., examiners would be consistently severe on each test. If examiners are seen to be erratic in their severity at some levels but not at others, this may impact on fairness in terms of grades awarded to candidates.
An examination of 27 examiners standardised to mark LanguageCert's six CEFR-linked IESOL Writing tests illustrated that examiner fit to the Rasch model was generally good, a key background consideration.
From correlations run among the examiner person measures across all six tests, a rank order emerged indicating that examiners were broadly consistent across tests. Examiner person measures generally correlated highly with their 'partner' test: A1 with A2, C1 with C2, and B1 with B2 tests. While the A2 test correlated with almost all tests, all tests correlated quite highly with at least two other tests.
A major implication which arises regarding consistency is the following: if an examiner is going to be strict at one level, they will quite likely be strict at other levels, and such strictness can be compensated for. Given that LanguageCert examiners undergo careful training and standardisation, what the current study illustrates is that LanguageCert examiners may be seen to mark consistently and accurately across a range of ability levels.