Using Self-Assessments to Investigate Comparability of the CEFR and CSE: An Exploratory Study Using the LanguageCert Test of English

This paper reports on an exploratory comparability study between the Common European Framework of Reference for Languages (CEFR) and the China Standards of English (CSE). Equivalences are established via the LanguageCert Test of English (reading and language use) for the CEFR, and a comparable test of reading and language use produced by a top-tier Chinese university. In the study, a large sample of test takers took part, first sitting the two comparable tests of reading and language use, and subsequently completing a number of self-assessment Can-Do statements related to the CEFR and the CSE. Validity of the dataset was established by linking both tests and both sets of self-assessments to a single frame of reference using a third test whose robustness and values had been previously established. While there were some divergences in how the two frameworks aligned, most notably towards the lower ends of the scales, the correspondences which emerged between the CEFR and CSE frameworks were broadly in accordance with those reported in the other studies referenced in this paper. The current study therefore lays the groundwork for determining the correspondence between LanguageCert Tests, aligned to the CEFR, and the CSE.


Introduction
The current study is the first step in aligning LanguageCert's different tests, which are currently aligned to the CEFR, to other key frameworks or assessments, in this case the CSE. To frame the study, the following section presents detail on methods of establishing comparability between assessment instruments, such as Can-Do self-assessments. Background to the CSE and the CEFR is then presented, along with a description of studies which have investigated the correspondence between the two frameworks.

Self-assessment of Language Abilities
Over the past two decades, self-assessment has been shown to be of value in assisting learners to evaluate their language ability (Bailey, 1998). The benefits of self-assessment (SA) have been explored in a number of studies and shown to make worthwhile contributions to both learning and assessment. In the context of learning, for example, Butler (2018) illustrated the value of SA in the self-regulated learning process, Babaii et al. (2016) showed how SA aided self-awareness in learning, Dann (2002) showed its value in promoting learner autonomy, and De Saint-Leger (2009) demonstrated how SA was associated with learner confidence and hence performance.
In the area of language assessment, SA has been shown to offer a range of potential benefits. Bachman & Palmer (1996) demonstrated how SA permitted learners to assess themselves in an interactive, yet low-anxiety, manner. Oscarson (1989) showed how SA could help expand the range of assessment, emphasising that assessment should be the responsibility of both learners and teachers. Of relevance to the current study, Liu & Brantmeier (2019) reported on a study of young learners in China who were able to self-assess their abilities in reading and writing quite accurately. As outlined below, Peng et al. (2021) explored the alignment of the CSE and CEFR frameworks, in large part through the use of self-assessment descriptors. Jones (2014) presents a description and analysis of the large-scale use of 'Can-Do' self-assessment descriptors [Note 1] established in the 1990s to provide common levels of proficiency across European languages via the ALTE (Association of Language Testers in Europe) Framework. Jones concludes that, despite some variation across different educational systems in Europe, students of different languages were, on the whole, reasonably accurate in estimating their relative ability. The use of instruments such as Can-Do statements in self-assessment has been validated in a number of other studies (see e.g., Brown et al., 2014; Summers et al., 2019).

The CSE and the CEFR
For the past two decades, the CEFR has been accepted as illustrating standards of language ability by many stakeholders: policy makers, publishers, exam bodies and test developers (Deygers et al., 2018). Not only in Europe, but in many countries around the world (Little, 2007), the CEFR has become the common currency for specifying levels of language ability (Figueras, 2012). The CSE reflects an overarching notion of language ability, in which language knowledge and strategies co-function in performing a language activity. Its development attempts to pull together all the different English language curricula and assessment instruments in China into one overarching framework.
Figure 1: CEFR and CSE Levels

Jin et al. (2017) describe the development of the "Common Chinese Framework of Reference for English (CCFR-E): Teaching, Learning, Assessment", which began in 2014. The CCFR-E was finalised in 2018, being released as the "China Standards of English" (CSE). The CSE has three major level stages, each subdivided into three sublevels; Figure 1 illustrates these. Alderson (2017) discusses a range of studies exploring the CSE and its correspondence to the CEFR. This is supported by the discussion by Jin et al. (2017) and by research by Zhao et al. (2017) investigating the linking of College English vocabulary levels with the CEFR. Figure 2 presents a summary of the results of the different studies. Dunlea et al. (2019) describe a comprehensive study involving all four language skills that explored the relationship between the British Council's Aptis test and IELTS with the CSE. The methodology involved expert judgement of items against CSE and CEFR levels and the assignment of CSE descriptors against tasks. Following this, the proposed levels were field tested in an "external evaluation" exercise, where Chinese teachers rated their own students against the proposed matched levels. As Figure 2 below illustrates, CSE L2 appeared to correspond to CEFR A1, CSE L3 to A2, CSE L4 / L5 to CEFR B1, CSE L6 / L7 to CEFR B2, CSE L8 to CEFR C1, and CSE L9 to CEFR C2.

Previous CSE / CEFR Equivalence Studies
Peng and associates have undertaken a number of studies investigating correspondences between CEFR and CSE levels. (Level A0, it should be noted, denotes a level below CEFR A1.) Peng et al. (2021) report on a study attempting to establish level correspondences between CEFR and CSE levels using difficulty estimates of all published descriptors (467 for the CEFR and 1,051 for the CSE), derived from ratings by English language teachers and students. While there was close correspondence at the top and bottom ends of the scale, there was overlap in the middle levels. The different studies outlined in Figure 2 contribute to the level alignment between the CSE and the CEFR. As may be seen, while there is a degree of agreement in the correspondence between the studies, there are also divergences, which may result from a number of factors: the samples, the tests, and the judges used in the ratings.

Current Study
This section briefly outlines the background and make-up of the tests and the self-assessment ratings which test takers completed. The methodology employed in the current study differs from that used in the Dunlea et al. (2019) and Peng et al. (2021) studies. The principal methodology in the latter two involved the use of expert ratings. In the current study, a large sample of test takers took a live LanguageCert test, which was then calibrated in a single frame of reference with the self-assessment ratings.

Test material
In late 2020, approximately 2,500 Year 1 non-English major college students took a 65-item multiple-choice reading and language use test prepared by experts from the university involved in the current study. Three months later, this same set of students took a 53-item multiple-choice reading and language use test adapted from existing and previously validated LanguageCert Test of English (LTE) material (Coniam et al., 2021). The items in the LTE test used in the study were selected on the basis of representing the spectrum of difficulty across the six CEFR levels. Table 1 lays out the item difficulty levels generally adopted in LanguageCert assessments (Coniam et al., 2021). For analysis and calibration purposes, 100 has been taken as the mid-point of the scale. To this end, Rasch logit values are rescaled to a mean of 100 and a standard deviation (SD) of 20 (see Coniam et al., 2021).
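The rescaling of Rasch logit values onto a reporting scale with mean 100 and SD 20 can be sketched as follows. This is an illustrative reconstruction only: the function name and the sample logit values are assumptions, not taken from the study.

```python
# Sketch: linearly rescaling Rasch logit estimates onto a reporting scale
# with mean 100 and SD 20, as described for the LID scale.

def rescale(logits, target_mean=100.0, target_sd=20.0):
    """Map raw logit values onto a scale with the given mean and SD."""
    n = len(logits)
    mean = sum(logits) / n
    sd = (sum((x - mean) ** 2 for x in logits) / n) ** 0.5
    return [target_mean + target_sd * (x - mean) / sd for x in logits]

item_logits = [-1.2, -0.4, 0.0, 0.5, 1.1]   # hypothetical item difficulties
print([round(v, 1) for v in rescale(item_logits)])
# → [69.3, 89.8, 100.0, 112.8, 128.1]
```

Because the transformation is linear, relative distances between items (and persons) are preserved; only the origin and unit of the scale change.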
Appendix 1 provides a comparative analysis of the make-up of the two reading and usage tests. As may be seen, the CET test is slightly longer than the LTE test; also, all CET items are 4-option multiple-choice whereas the LTE items are 3-option multiple-choice. Despite these differences, the content of the two tests, and even the order in which the different sections appeared to test takers, exhibit a great deal of similarity.

Can-do self-assessment descriptors
Both the CEFR and the CSE contain large arrays, for all skill areas, of Can-Do descriptors (see e.g., https://www.cultofpedagogy.com/can-do-ell/ for examples of how such descriptors help classroom teachers understand what learners at different levels of proficiency should be able to do).
To reflect the focus of the current study, two sets of Can-Do self-assessment descriptors were assembled for reading and language use, one for each framework. A set of 22 Can-Do statements related to the CSE was compiled by the China university staff who designed the CET test used in the current study. Another set of 16 Can-Do statements related to the CEFR was compiled by members of the LanguageCert research and assessment team. All Can-Do statements were framed as Yes/No questions so that test takers rated themselves dichotomously (i.e., as can / cannot) on each statement. The relevant Can-Do statements may be found in Appendices 2 and 3.
The composite set of 38 items was then intermingled. This was intended to forestall respondents trying to guess where their own estimated ability level might lie.

Test and self-assessment profile administration
The first test (the CET) was administered in late 2020. In early 2021, the second test (the LTE) was administered. Immediately after the second administration, test takers completed both sets of Can-Do self-assessments. These were all presented bilingually, in English and Chinese.

Self-assessment can-do statements and research questions
Against the backdrop outlined above, the current study pursued two main Research Questions.
RQ1: To what extent can self-assessment Can-Do statements be validly used to establish correspondences between the CEFR and CSE frameworks?
RQ2: To what extent are correspondences between the CEFR and CSE frameworks in line with those reported in previous studies?

Statistical Analysis: Rasch Measurement
The method for gauging test fitness-for-purpose in the current study, and for linking the data (the two different tests and the self-assessments), involves the use of Rasch measurement, which is now briefly outlined.
The use of the Rasch model enables different facets to be modelled together, converting raw data into measures which have a constant interval meaning (Wright, 1997). This is not unlike measuring length using a ruler, with the units of measurement in Rasch analysis (referred to as 'logits') evenly spaced along the ruler. In Rasch measurement, a test taker's score is not derived solely from the raw score. Rather, the test taker's theoretical probability of success in answering items is gauged, with the resulting probabilistic score emerging from the calculations. While such 'theoretical probabilities' are derived from the sample assessed, they can be interpreted independently of the sample due to the statistical modelling techniques used. Measurement results based on Rasch analysis may therefore be interpreted in a general way (like a ruler) for other test taker samples assessed using the same test.
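For dichotomous items, the 'theoretical probability of success' referred to above follows the standard Rasch formulation: the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. A minimal sketch, with illustrative values not drawn from the study:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A test taker whose ability equals an item's difficulty has a 50% chance:
print(rasch_probability(1.0, 1.0))               # 0.5
# An ability one logit above the item's difficulty raises this to ~0.73:
print(round(rasch_probability(2.0, 1.0), 2))     # 0.73
```

Because only the difference (ability minus difficulty) enters the formula, persons and items sit on the same logit scale, which is what allows the direct person-item comparisons discussed below.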
Once a common metric is established for measuring different phenomena (test takers and test items in the current instance), test taker ability may be estimated independently of the items used, with item difficulties likewise estimated independently of the sample (Bond et al., 2020).
In Rasch analysis, test taker measures and item difficulties are placed on an ordered trait continuum. Direct comparisons between test taker abilities and item difficulties, as mentioned, may then be made, with results able to be interpreted with a more general meaning. One such general use involves the transfer of values from one test to another via anchor items. Anchor items are items common to both tests; they are invaluable aids for comparing students on different tests. Once a test, or scale, has been calibrated (e.g., Coniam et al., 2021), the established values can be used to equate different test forms.
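The anchor-item equating idea can be sketched as follows: under the Rasch model the two calibrations differ (to a first approximation) by a constant shift, estimated from the anchor items' difficulties on each scale. All values below are hypothetical, for illustration only.

```python
# Sketch: placing a new test's item difficulties onto an established scale
# via anchor items common to both tests.

def anchor_shift(anchor_new, anchor_established):
    """Mean difference between anchor item difficulties on the two scales."""
    diffs = [e - n for n, e in zip(anchor_new, anchor_established)]
    return sum(diffs) / len(diffs)

new_scale = [-0.5, 0.2, 0.9]      # anchor items as calibrated in the new test
established = [0.1, 0.7, 1.6]     # the same items on the established scale
shift = anchor_shift(new_scale, established)

# Applying the shift to every item places the new test on the old scale:
new_items = [-1.0, 0.0, 1.2]
print([round(d + shift, 2) for d in new_items])   # → [-0.4, 0.6, 1.8]
```

In practice, anchoring is usually done within the Rasch estimation itself (by fixing the anchor items' difficulties), but the constant-shift picture conveys why common items allow different test forms to be compared.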
In interpreting Rasch, the key statistic involves the 'fit' of the data, in terms of how well obtained values match expected values (Bond et al., 2020). A perfect fit of 1.0 indicates that obtained mean square values match expected values exactly. The accepted range of tolerance for fit is 0.5 to 1.5 (Lunz & Stahl, 1990).
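The mean-square fit statistics referred to here are conventionally computed from squared residuals between observed responses and model-expected probabilities: outfit is the unweighted mean of standardized squared residuals, while infit weights them by item information. A minimal sketch of this standard formulation, with hypothetical response and probability values:

```python
def fit_mean_squares(responses, probs):
    """Outfit and infit mean-square statistics for dichotomous responses
    given model-expected probabilities of success."""
    variances = [p * (1 - p) for p in probs]          # item information
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probs)]
    # Outfit: mean of squared residuals each standardized by its variance.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(probs)
    # Infit: information-weighted, so less sensitive to outlying responses.
    infit = sum(sq_resid) / sum(variances)
    return outfit, infit

outfit, infit = fit_mean_squares([1, 0, 1, 1], [0.8, 0.3, 0.6, 0.9])
print(round(outfit, 2), round(infit, 2))
```

Values near 1.0 indicate responses behaving as the model expects; values below 0.5 suggest overly deterministic (muted) responses, and values above 1.5 suggest noise, hence the tolerance range cited above.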

Data and frame of reference
To recap, there are four sets of assessment data in the current study: the 65-item CET test, the 53-item LTE test, 22 CSE-referenced Can-Do ratings and 16 CEFR-referenced Can-Do ratings. Since all four datasets were collected from the same test takers, the data configuration may be taken as a unified collection, in that all data are referenced to the same candidates and to their English language ability. The person links (Boone, 2016) in the four datasets constitute a coherent frame of reference (FOR), defined by Humphry (2006) as "compris[ing] a class of persons responding to a class of items in a well-defined assessment context." In order to calibrate the four datasets in the current study onto the LanguageCert Item Difficulty (LID) scale (see Table 1), a previously calibrated test (henceforth referred to as "Test 3") from the Coniam et al. (2021) study was incorporated into the data. As a subset of Test 3, the LTE test in the current study provides a set of item links (Boone, 2016). With sets of both person links and item links established, the LTE test could then be linked to Test 3. Following this, the other datasets in the study (the CET test and the two sets of self-assessments) could then be calibrated against Test 3 onto the LID scale. This resulted in all five assessment datasets being included in a single FOR.

Analysing within a single frame of reference
As mentioned, Test 3 was the anchoring frame, having been previously anchored to the LID scale. Against this backdrop, the composite analysis is presented in Figure 3 below.
In Figure 3, Column 2 contains the analysis of the five amalgamated datasets. Column 3 contains the 53-item LTE test, Column 4 the 65-item CET test, Column 5 the 22 CSE-referenced Can-Do ratings, and Column 6 the 16 CEFR-referenced Can-Do ratings.
To recap, item links in the overall dataset are established between the 53 items in the LTE test and Test 3. Person links are established via the two tests and the two sets of self-assessments. All five datasets may therefore be seen to be within an overall FOR: the composite analysis to the far left of the person-item map in Figure 3. Against the overall picture of calibration, which is centred at 100, the mid-point of B1, it may be seen that the means for the two tests are slightly higher than the overall mean. Tables 2 and 3 present fit and reliability details for the two tests. Tables 4 and 5 present fit and reliability details for the two sets of self-assessments. As Tables 4 and 5 show, the two sets of self-assessments fit the Rasch model: mean infit and outfit figures are within the 0.5 to 1.5 range, and reliability figures are again high. The means of the two sets of self-assessments are again comparable, although this time about a quarter of a logit below the overall mean of 100, both being around 95. This slightly lower value indicates that, on the self-assessments, test takers tended to slightly over-rate themselves, a not uncommon phenomenon (Kruger & Dunning, 1999; Dunning et al., 2004).
The differences between the item means of the Can-Do ratings and the LTE and CET assessment results are within half a logit (10 LID scale points): a difference which is generally accepted within Rasch measurement as being non-significant (Zwick et al., 1999). The conclusion that may be drawn is that test takers can be considered sufficiently objective in their self-assessments to permit tentative correspondences to be drawn between CSE and CEFR levels. The next section explores these correspondences.

Establishing correspondences between CSE and CEFR levels
Given that the two sets of self-assessments have been established as valid and broadly comparable, the current section presents a set of tables, one at each CEFR level, which incorporate Can-Do statements within corresponding CEFR and CSE levels. Tables are presented one at a time for each CEFR level, in line with the LID score ranges for the corresponding CEFR level. The tables are laid out such that the left-hand half of the table includes the detail for the CEFR level: the relevant Can-Do statement, the LID value assigned in the current single-FOR calibration, and the CEFR level for the Can-Do, as laid down in formal documentation. The right-hand half of the table then includes corresponding detail for the CSE: Can-Do statements, and their CSE level, which fall into the LID value range for the CEFR level.
Table 6 presents the joint analysis for CEFR level C1, the LID range for which is 131-150 scale points. Within the C1 CEFR LID range of 131-150, four CEFR C1 self-assessments were found, along with one CSE Level 7 self-assessment. The fit would appear to be CEFR C1 → CSE L7.
Table 7 presents the joint analysis for CEFR level B2, the LID range for which is 111-130 scale points. Within the B2 CEFR LID range of 111-130, three CEFR B2 self-assessments were found, along with six CSE self-assessments, of which two were at L5, two at L6 and two at L7. The B2 CEFR / CSE fit would appear to be broader, i.e., CEFR B2 → CSE L5-L7.
Table 8 presents the joint analysis for CEFR level B1, the LID range for which is 91-110. Within the B1 CEFR LID range, one CEFR B1 self-assessment was found, along with four CSE self-assessments, of which one was at L4, one at L5, and two at L6. The B1 CEFR / CSE fit would therefore also appear to be quite broad, i.e., CEFR B1 → CSE L4-L6.

Table 9 presents the joint analysis for CEFR level A2, the LID range for which is 71-90. Within the A2 CEFR LID range, three CEFR A2 self-assessments were found, along with seven CSE self-assessments, of which one was at L3, four at L4, and two at L5. The A2 CEFR / CSE fit would therefore appear to be mainly CEFR A2 → CSE L4-L5.

Table 10 presents the joint analysis for CEFR level A1, the LID range for which is 51-70. Within the A1 CEFR LID range, four CEFR A1 self-assessments were found, along with three CSE self-assessments, of which one was at L2, and two at L3. The broad A1 CEFR / CSE fit would appear to be CEFR A1 → CSE L3.
Finally, below CEFR A1, there was one fit between the CEFR and the CSE; Table 11 presents this.
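The level-by-level correspondences above all rest on the same LID score bands, which can be expressed as a simple lookup from a LID value to a CEFR level. A minimal sketch, using the band boundaries stated above (the function name and example values are illustrative, not from the study):

```python
# LID score bands for each CEFR level, as used in Tables 6-10.
CEFR_BANDS = [("A1", 51, 70), ("A2", 71, 90), ("B1", 91, 110),
              ("B2", 111, 130), ("C1", 131, 150)]

def cefr_level(lid_score):
    """Return the CEFR level whose LID band contains the given score."""
    for level, lo, hi in CEFR_BANDS:
        if lo <= lid_score <= hi:
            return level
    return "below A1" if lid_score < 51 else "above C1"

print(cefr_level(95))    # B1
print(cefr_level(134))   # C1
print(cefr_level(45))    # below A1
```

A calibrated Can-Do statement (CEFR- or CSE-referenced) is then assigned to whichever band its LID value falls into, which is how the CEFR/CSE pairings in the tables are read off.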
From the above set of tables, with the comparative fit of the CEFR and CSE levels, it is now possible to produce a tentative overall mapping of how the CEFR scale, as represented by the LTE, may be mapped against the CSE. Table 12 presents the match. It should be noted that there was insufficient data to calibrate CEFR level C2.
As can be seen from Table 12, and as might perhaps be expected, while there is not a one-to-one match between the levels in the two frameworks, as one moves up the scale there is a graduated fit between the CEFR and the CSE. Figure 4 below presents a reworking of Figure 2, which included the alignments proposed in the Dunlea et al. (2019) study [henceforth the 'Dunlea' study] and Peng and associates' (2021) studies [henceforth the 'Peng' studies], together with the alignments as they have emerged empirically in the current study.

The results of the current study can be seen to echo the mappings of the previous studies, although the mappings which have emerged suggest a slightly more lenient fit than that reported in other studies, as for example with CEFR C1 being located against CSE L7 in the current study as against CSE L7 / L8 in the Peng studies and CSE L8 by Dunlea. This is mirrored at the lower end of the scale, where the current study does not suggest direct one-to-one matches. There are a number of possible reasons for these divergences. A key difference is that the current study empirically matched levels against performance, as opposed to an expert-rater-focused methodology. Another may be attributable to the fact that only one skill, essentially reading, was explored in the current study, whereas the other studies examined all four skills. A third is that the sample was limited at the top end of the ability spectrum to C1-level test takers.

Conclusion
The current study pursued two Research Questions. The first asked whether self-assessment Can-Do statements may be validly used to establish correspondences between the CEFR and CSE frameworks. As was illustrated by a comprehensive analysis of both test and Can-Do self-assessment responses, respondents tended to slightly over-estimate their abilities on both the CEFR and the CSE. These over-estimations were minimal, however, in that mean values were only a quarter of a logit higher than might have been expected. Secondly, the over-estimations were consistent across the scales of both frameworks.
The second research question asked whether correspondences between the CEFR and CSE frameworks would be broadly in accordance with those proposed by previous studies. While there were some divergences, most notably towards the lower end of the scales, the correspondences proposed in the current study broadly echo those reported in previous studies.
A range of correspondences may well be expected from different studies exploring different assessment instruments. Difficulties in accurate alignment have been commented on by other researchers (Papageorgiou et al., 2015; North & Piccardo, 2018). Peng (2021) insightfully comments that "the CSE is a local standard with granular levels reflecting Chinese learners' requirements and progress [ …. ] while the CEFR is a framework for reference with broad bands of proficiency and is intended to be adapted or further developed for specific contexts and uses". In the current study, the assessment context has focused on reading and language use, whereas the Dunlea et al. (2019) and Peng et al. (2021) studies examined all four language skills; writing and listening were explored by Peng (2021) and Peng & Liu (2021) respectively.
From a wider, and methodological, perspective, the use in the current study of a single frame of reference to calibrate self-assessment ratings directly against performance adds to the armoury of tools available to assessment professionals in linking exercises such as those between two different tests, or by providing a larger perspective between two different assessment frameworks.
The approach adopted in the current study may be useful for other assessment situations, where Can-Do ratings may be incorporated at the end of an assessment session.This may even be done in a userfriendly manner where individual candidates rate subsets of Can-Do ratings, which are then linked via common items to cover a range of Can-Do aspects.
A limitation of the current study was that the investigation of test types was limited to reading and language use. Future studies will broaden this by extending the investigations conducted in the current study to other language skills.

Table 3
Summary Analysis: 65-item CET Test

Table 4
Summary Analysis: 22 CSE Can-Do Statements

Table 5
Summary Analysis: 16 CEFR Can-Do Statements

Table 12
CEFR / CSE Fit in LTE Study

Appendix 1 Component Analysis of CET and LTE Tests
Notes

1. The CEFR framework comprises descriptors laying out what a student can do in a particular skill when they have completed a given level. A descriptor for Reading at A2, for example, is: "I can understand short narratives and biographies written in simple words."

Can-Do Statements used in the Study
- … texts of personal interest (e.g. articles about sports, music, travel, etc.) written with simple words.
- … the main points of texts dealing with everyday topics (e.g. life, hobbies, sports) and obtain the information I need.
- … texts dealing with topics of general interest, such as current affairs, without a dictionary, and can understand multiple points of view.