Impact of test format on vocabulary test performance of EFL learners: the role of gender

This study aimed to examine the effect of test format on test performance by comparing Multiple Choice (MC) and Constructed Response (CR) vocabulary tests in an EFL setting. Also, this paper investigated the function of gender in MC and CR vocabulary measures. To this end, five 20-item stem-equivalent vocabulary tests (CR, and 3-, 4-, 5-, and 6-option MC) were administered to 243 (132 male and 111 female) pre-intermediate students. Results of the study revealed that MC tests were easier than CR. Results indicated a gender bias, in that, males scored better than females in all versions of MC tests while females outperformed males in CR. The findings implied that testers should consider the effect of test format while assessing vocabulary knowledge and use a combination of test formats (MC and CR) in vocabulary assessment to reduce gender bias and format effect.


Introduction
Assessment is an integral part of any educational system, and it plays an important role in English as a foreign language (EFL) contexts. One of the components of language that is challenging for test constructors is vocabulary, without which a learner cannot understand or communicate in a foreign language. Word knowledge, or vocabulary repertoire, is a fundamental component of language proficiency and an essential component of communicative competence, and it acts as a vital element for production and comprehension in a second language (Coady & Huckin, 1997; Richards & Renandya, 2002).
Numerous researchers have highlighted the importance of vocabulary knowledge in their writings. Wilkins (1972), for example, asserted that "while without grammar very little can be conveyed; without vocabulary nothing can be conveyed" (pp. 111-112). His assertion highlighted the importance of vocabulary and placed it above grammar. According to Luppescu and Day (1993), it is necessary for students to build a large repertoire of vocabulary when learning a language because people with large vocabularies are more proficient and competent than those with limited vocabularies. In other words, students with a good knowledge of vocabulary can communicate in a foreign language much better than low proficient students. Baker, Simmons, and Kame'enui (1998) emphasized that learning a language is mainly dependent on vocabulary and word knowledge; in other words, a high repertoire of word knowledge is required to be competent in a foreign language. Huyen and Nga (2003) and Zhao (2009) also highlighted that vocabulary (and its mastery) plays an important role in learning a second language, and without mastery over vocabulary, none of the other skills (listening, reading, speaking, and writing) are attainable.
Vocabulary is accordingly an indispensable component of language and of any language education program. Indeed, most researchers (Read, 2000; Zhao, 2009) agree that its mastery plays an important role in the process of language learning. Consequently, the practice and manner of its measurement become vital. In other words, examining the vocabulary knowledge of students correctly and providing accurate and relevant information about the process of language learning and teaching are imperative. Vocabulary knowledge is considered a psychological attribute or mental ability which cannot be measured or observed directly, and special techniques are needed to measure it. Psychological attributes or constructs are "hypothetical concepts -products of the informed scientific imagination of scientist who attempts to develop theories for explaining human behavior" (Crocker & Algina, 2006, p. 4). Constructs are not visible, and indirect methods are used to measure them; the process of observing constructs through indirect methods is called operationalization. In other words, it is a way of moving from the abstract level to the empirical level (Lewis-Beck, Bryman, & Liao, 2004).
A test is a clear example of the indirect method mentioned above, and it refers to the processes and procedures used by the tester to obtain information about the optimal performance of stakeholders or the typical performance of individuals (Crocker & Algina, 2006). Different kinds of tests, from multiple-choice to gap filling, have been used to assess vocabulary knowledge at different levels of proficiency; however, the current study focuses mainly on the selected response or Multiple Choice (MC) format and the Constructed Response (CR) format.

Literature review
As mentioned above, the present study aimed to investigate the impact of format of the test, MC and CR, on the performance of students in vocabulary tests in an Iranian context. According to Bachman (1990), test format or test method should not interfere with the construct being measured. We believe this issue is of utmost importance for test developers since a slight change in the performance of test takers due to format of test will color the results of test in either a negative or a positive way, and the results will not be the real measures of students' abilities anymore. So, the statement "one test fits all" may not be true in language testing.
In this regard, reviewing some studies conducted in this field would be valuable. The literature on MC and CR tests from a quantitative perspective has mainly focused on two issues: "(a) differences in construct or trait measured using multiple-choice and open-ended formats and (b) differences between test scores in multiple-choice and open-ended formats (i.e., the relative difficulty of test formats)" (In'nami & Koizumi, 2009, p. 221). Also, according to Nixon and Kennedy (2002), there are three major streams in comparisons of MC and CR test formats. The first and most prominent stream investigates whether MC and CR tests measure the same construct. The second stream deals with "how to scale or link MC and CR scores so as to create a single total score from one type of exam with those of another" (p. 959). The last stream, on which this study focuses, addresses whether tests in one format are more difficult than their equivalents in other formats. Nixon and Kennedy (2002) compared the scores of students in stem-equivalent MC and CR tests of economics and showed that students do indeed score much better in MC. Gender did not have an effect on the performance of students in their study. Hastedt and Sibberns (2005) compared MC and CR test formats in the Trends in International Mathematics and Science Study (TIMSS, a series of assessments of mathematics and science knowledge of students around the world) for 1995 and 1999. They observed only small differences between MC and CR scores, and based on this observation, the researchers suggested "using MC and CR items in international studies, because it guarantees that test takers are treated equally and fairly" (Hastedt & Sibberns, 2005, p. 159). Gender analysis suggested that females performed better in the CR item format while males performed better in the MC item format.
Famularo (2007) revealed that there were significant differences between MC and CR items and MC tests were found to be much easier than their CR counterparts.
Liu, Lee, and Linn (2011) explored the function of explanation multiple choice (EMC) and showed that EMC and MC items were easier than CR, but EMC items were harder than MC items. Hickson, Reed, and Sander (2012) used a data set composed of thousands of observations on individual students in economics classes at a public university. They found that instructors paid too much attention to writing CR questions that assess higher level learning, but actually all these efforts were in vain, since little difference was found between MC and CR scores.
Shaibah and van der Vleuten (2013) compared the scores of students in MC and Free Response Format (FRF) (a version of CR) in a gross anatomy course. A Rasch model was utilized, and analysis revealed a strong correlation between MC and FRF scores. Shaibah and van der Vleuten suggested that the MC test is a valid method which can be used as an alternative to FRF items. Moreover, their results showed that students scored better in recall MC tests than in FRF understanding items. Sangwin and Jones (2017) compared the performance of students in MC and CR tests in an online test and found that the overall score of students was higher in MC than in CR. Garner and Engelhard (1999) examined gender differences in the performance of students on MC and CR items. They found gender differences across the two formats: females performed relatively better than males in the CR format while males outperformed females in the MC item format. Weaver and Raptis (2001) investigated the performance of male and female students in nine introductory atmospheric and oceanic science exams over 7 years. The analysis of the performance of 295 male and 194 female students who participated in the study showed that there were no significant differences between the performance of male and female students. In a study conducted by Bacon (2003), gender differences in MC and short answer tests were studied. Results of the t test suggested that there were no significant differences between the performance of male and female students. Taylor and Lee (2012) compared the performance of male and female students in reading and math test items selected from state criterion-referenced tests with MC and CR items. Results of the study revealed that in both reading and math tests females did better in CR while males performed better in MC.
Reardon, Kalogrides, Fahle, Podolsky, and Zárate (2018) investigated the association between test item format and gender achievement gaps on math and English language arts tests in fourth and eighth grades. They found that MC and CR tests measuring the same underlying constructs may rank the performance of males and females differently. In other words, gender gaps were sensitive to item format, in that males did better on MC tests and females performed better in CR.
Although there are numerous studies addressing the effect of test format on the performance of test takers in different disciplines such as mathematics, psychology, and economics (e.g., Birenbaum & Tatsuoka, 1987; Hastedt & Sibberns, 2005; Simkin & Kuechler, 2005), few studies have been conducted in applied linguistics (AL) on the impact of test format on the performance of test takers in language tests in general and vocabulary tests in particular. For example, In'nami and Koizumi (2009) conducted a meta-analysis of the effect of test format on L1 and L2 reading, and L2 listening performance. Results of the study suggested that the MC test was easier than CR in L1 reading and L2 listening. Although multiple-choice formats were thus found to be easier than open-ended formats in those two areas, no format effect was observed in L2 reading.
Currie and Chiramanee (2010) compared the effectiveness of MC against CR items in the context of English language education in Thailand. Results of the study suggested that students achieved significantly higher scores in the MC test than in its stem-equivalent CR counterpart. On this basis, Currie and Chiramanee argued that MC and CR tests were not measuring the same constructs. In other words, "a more realistic implication to draw from these results would be that the M/C format had the effect of distorting the measurement of language based abilities which were used by the participants in answering C/R items" (p. 485). The researchers, however, do not provide further explanation or any justification of why MC items had a distorting effect on measurement but CR items did not.
Performance of test takers on language tests is affected by different sources of variance which Bachman (1990) grouped into four broad categories: communicative language ability (CLA), test method facets (TMF), personal attributes (PA), and random factors (RF). While the ultimate aim of language testing is to measure CLA, the outcome is often compromised by the other three (TMF, PA, and RF). Test makers have little or no control over random factors; however, they can exercise some control over test method facets and test-taker attributes. Test method facet or test format refers to the characteristics of tests or test tasks which are used to elicit information about test takers' knowledge about a matter (Bachman, 1990). In other words, Bachman believed that the performance of test takers on language tests was mainly the product of both an individual's language ability and the facets of the test method employed (Weir, Vidaković, & Galaczi, 2013). This study draws on Guttman's (1980) facet theory and Bachman's (1990) test methods framework and aims to investigate the "type of response" facet (i.e., CR vs. MC), which is a characteristic of the "format" of the "expected response".
This study mainly investigates the effect of test method facet (test format) and personal characteristics (gender) on vocabulary test performance. For this purpose, we decided to study the most important statewide high-stakes examination in Iran. This Iranian national university entrance exam is called "Konkur". Students in Iran sit for this MC exam (including a vocabulary test), and based on the results of the test, students are admitted to public universities. This test is a 4.5-h MC exam that covers all subjects taught in Iranian high schools, from math and science to Islamic studies and foreign languages. The exam is so high-stakes that students normally spend a whole year preparing for it. For the purpose of this study, we selected the English language section of this national test, which consists of MC vocabulary and grammar items, a cloze test, and a reading passage followed by MC comprehension items. More specifically, we selected only the vocabulary section (with four choices). For the parallel CR version, we omitted the choices presented; for the other MC versions, we added further choices or dropped a choice as required. Since this exam is the most important examination in Iran, affecting the future lives of more than a million students and their families, we decided to examine one part of it (the vocabulary section) in light of the facets proposed by Bachman. It should be mentioned that, to the best of the researchers' knowledge, no similar study has been conducted based on this specific exam in Iran.
This study was accordingly an attempt to bridge a gap in the field of language testing by delving into the effect of test method on performance in vocabulary tests in an EFL setting in Iran. Furthermore, since research findings on the relationship between test format and gender are contradictory and inconclusive, the current project was aimed at investigating the differential function of gender (if any) across CR and MC test formats. In this regard, the following research questions were proposed: Q1: Is there any significant difference between the performance of EFL students in MC and CR vocabulary tests? Q2: Is there any significant difference between the performance of male and female students in MC versus CR vocabulary tests?

Participants and setting
The participants in the current study were 258 (140 male and 118 female) fourth year high school students within the age range of 17-18. The participants attended public high schools in Iran, where as part of their compulsory education, they received two hours of English education every week.

Instruments
The following four instruments were utilized in this research study:

Proficiency test
In order to guarantee the homogeneity of the participants in terms of language proficiency, an adapted version of Key English Test (KET) for schools (updated in 2009) was utilized. Before being given to the main study students, the adapted KET was administered to 25 students similar to the target group and the KR-21 reliability of test was calculated to be 0.78.
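The KR-21 coefficient reported for the pilot tests can be computed from total scores alone. The following is a minimal sketch rather than the authors' actual procedure: the function name `kr21` and the sample scores are hypothetical, and the use of the population variance is an assumption (some textbooks use the sample variance instead, which changes the value slightly).

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21 for a dichotomously scored test.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    where k is the number of items, M the mean total score, and s^2 the
    variance of total scores (population variance here; an assumption).
    """
    k = n_items
    m = mean(total_scores)
    var = pvariance(total_scores)
    if k < 2 or var == 0:
        raise ValueError("KR-21 needs at least 2 items and non-zero score variance")
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


# Hypothetical total scores of four pilot students on a 20-item test.
pilot_scores = [4, 8, 12, 16]
print(round(kr21(pilot_scores, 20), 3))
```

KR-21 assumes that all items are of roughly equal difficulty, so it tends to give a slightly lower estimate than item-level coefficients when difficulties vary.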

Vocabulary Pre-test
This vocabulary test was selected from Cambridge Key English Test 4 Self Study Pack (KET Practice Tests) (2006) by Cambridge ESOL. The test consisted of 24 three-option MC vocabulary items; it was piloted with 24 students and had a KR-21 reliability of 0.83.

MC tests
Four versions of MC tests (constructed based on a CR test, see below) were used in the study. The CR and MC tests utilized in this study were stem-equivalent; in other words, all of them shared the same stems. The contents of these tests were based on the materials that the students studied and covered at school and matched their level of proficiency. The different versions of MC tests utilized in this study differed only in the number of options (which were based on the incorrect responses of students in the CR test). A 20-item 6-option MC test was constructed and piloted with 30 students similar to the target group, and its KR-21 was estimated to be 0.87. The frequencies of all words which acted as answers and distractors were checked against the Collins COBUILD Advanced Learner's English Dictionary (2006) and were found to be similar. After the 6-option test was administered to the pilot group, the least chosen distractors were omitted and 3-, 4-, and 5-option MC tests were constructed accordingly. These tests were piloted with 23, 25, and 30 students, and their KR-21 was found to be 0.79, 0.82, and 0.85, respectively. These four versions of MC tests were also reviewed by two English language teaching (ELT) professionals, two high school English teachers, and two native speakers before being used in the main study. They approved of the appropriateness of the tests, and some minor revisions (on wording) were applied to a few items based on their suggestions.
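The step of omitting the least chosen distractors to derive shorter option sets can be sketched as follows. This is an illustrative sketch only: the function `drop_least_chosen`, the option labels, and the pilot responses are hypothetical, and the tie-breaking rule (earlier options are kept first) is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter


def drop_least_chosen(options, key, pilot_responses, n_keep):
    """Return n_keep options for one item: the key plus the most chosen distractors.

    options         -- all options of the item, in presentation order
    key             -- the correct option
    pilot_responses -- options picked by pilot test takers for this item
    n_keep          -- number of options wanted in the shorter version
    """
    counts = Counter(pilot_responses)
    distractors = [o for o in options if o != key]
    # Rank distractors by how often they were chosen; the sort is stable,
    # so tied distractors keep their original order (an assumed rule).
    ranked = sorted(distractors, key=lambda o: counts[o], reverse=True)
    kept = set(ranked[: n_keep - 1]) | {key}
    # Preserve the original presentation order of the surviving options.
    return [o for o in options if o in kept]


# Hypothetical 6-option item whose key is "C".
options = ["A", "B", "C", "D", "E", "F"]
responses = ["C", "A", "A", "B", "D", "D", "D", "E", "C"]
for n in (5, 4, 3):
    print(n, drop_least_chosen(options, "C", responses, n))
```

Keeping the most attractive distractors means the shorter versions retain the options that pilot test takers actually found plausible.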

CR tests
The fourth instrument used in this study was a 20-item CR test which was adapted from a 4-option vocabulary MC test in entrance examinations in Iran. For preparing this test, different entrance examination tests for Bachelor of Arts (BA) and Bachelor of Science (BS) from different years (2008-2016) were reviewed by the researchers, and faulty and problematic items were changed, revised, or substituted after revision by an expert and after piloting. The final version of the test was reviewed by two ELT professionals, two high school English teachers, and two native speakers. It was piloted with 21 high school students similar to the target group, and its Cronbach's alpha was estimated to be 0.85. Exact-word scoring was utilized for scoring the test when it was used in the main study. Incorrect responses provided by test takers during the piloting of the CR test were used to construct appropriate distractors for the 6-option MC test.
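Exact-word scoring and Cronbach's alpha, both mentioned above, can be illustrated with a short sketch. The names `score_exact` and `cronbach_alpha` and all data are hypothetical. Ignoring letter case and surrounding whitespace when comparing answers is an assumption; the text says only that exact-word scoring was used. The population variance is likewise an assumed convention.

```python
from statistics import pvariance


def score_exact(responses, answer_key):
    """Score CR answers 1/0 by exact match with the keyed word.

    Case and surrounding whitespace are ignored (an assumption).
    """
    return [
        int(r.strip().lower() == a.strip().lower())
        for r, a in zip(responses, answer_key)
    ]


def cronbach_alpha(item_scores):
    """Cronbach's alpha from a matrix of item scores (one row per test taker)."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical three-item key and four test takers' written answers.
key = ["borrow", "crowded", "improve"]
answers = [
    ["borrow", "crowded", "improve"],
    ["Borrow ", "crowded", "increase"],
    ["borrow", "busy", "better"],
    ["lend", "full", ""],
]
matrix = [score_exact(row, key) for row in answers]
print(matrix)
print(cronbach_alpha(matrix))
```

With dichotomous items, alpha computed this way equals KR-20, which is why it can sit alongside the KR-21 values reported for the MC tests, although the two coefficients are not strictly interchangeable.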

Data collection
This study sought to determine the possible impact of test format on the performance of Iranian pre-intermediate fourth year high school students across gender. In order to conduct this study, ten intact classes (five male and five female) were chosen from different high schools in Urmia, Iran. To make sure that all the participants were homogeneous and of the same proficiency level, the researchers first administered an adapted KET for schools test (KR-21 reliability of 0.83). Based on the results of KET, six students were regarded as outliers and were excluded from the study, and later nine other students who did not participate in the vocabulary pre-test were omitted; consequently, the number of participants was reduced from 258 to 243. In this study, there were five parallel groups (based on their proficiency and vocabulary performance) for each gender: Each parallel group (male and its counterpart female group) received one test (either a CR format or one of four MC formats) on a random basis. That is, the first group of female participants and a parallel group of male students received one test in CR format. All the participants in these two groups had 15 min to answer 20 CR vocabulary items. The second parallel group of female and male students received MC vocabulary test with 3-options and the students in both groups had 10 min. to answer the stem-equivalent 3-option MC test. The CR group was given more time since they had to write their responses. The third, fourth, and fifth parallel groups of female and male students received MC vocabulary tests with 4-, 5-, and 6-options, respectively, through the same procedures (Table 1).
As shown in Table 2, after excluding the outliers and the students who did not participate in the vocabulary pre-test, the number of participants was reduced from 258 to 243 (132 male and 111 female).

Results
A series of independent samples t tests was conducted to compare the mean scores of groups that took the MC (3-, 4-, 5-, and 6-option) and CR test formats. Table 2 shows the descriptive statistics of the learners' vocabulary performance in 3-, 4-, 5-, and 6-option MC and CR tests.
As the means and standard deviations in Table 3 show, there are differences between EFL learners' performance in 3-option MC and CR tests. However, in order to get more accurate and reliable results, an independent samples t test was run, the results of which are displayed in Table 4.
So, in response to the first question about the differences in the performance of test takers in different test formats (3-, 4-, 5-, 6-option MC, and CR), it can be concluded that the performance of test takers in all versions of MC tests was significantly better than their performance in the stem-equivalent CR test (Table 5).
In order to examine the gender differences (second research question) in MC (3-, 4-, 5-, and 6-option) and CR test scores, a series of independent samples t tests was conducted. Table 4 indicates the means and standard deviations for males and females in MC and CR tests. Table 6 illustrates the results of the t test statistics. Overall, the results of the series of t tests indicate a significant difference between the performance of males and females in MC and CR formats. It can be stated that females performed better than males in the CR format while males performed better than females in the MC format.
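The independent samples t tests used in this section compare the mean scores of two separate groups. The following is a minimal sketch under stated assumptions, not the authors' analysis: the function `independent_t` and the scores are hypothetical, and the pooled-variance (Student) form is assumed because the text does not say whether equal variances were assumed. The sketch returns only the t statistic and degrees of freedom; obtaining a p value requires a t distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's independent samples t test with pooled variance.

    Returns (t, df). Equal population variances are assumed.
    """
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = mean(group_a), mean(group_b)
    v1, v2 = variance(group_a), variance(group_b)  # sample variances
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    se = sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Hypothetical scores (out of 20) for an MC group and a CR group.
mc_scores = [14, 15, 16, 17, 18]
cr_scores = [9, 10, 11, 12, 13]
t, df = independent_t(mc_scores, cr_scores)
print(f"t({df}) = {t:.2f}")
```

Because several t tests are run on related comparisons, a correction for multiple comparisons or a single factorial analysis would be a common alternative design; the text does not state whether either was applied.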

Discussion
In this study, we have examined the effect of test format and gender on the vocabulary performance of fourth year high school (pre-university) students.
Results of the analysis (Table 4) suggested a significant difference between the performance of test takers in MC and CR tests; in other words, findings revealed that vocabulary performance of test takers varied based on test format (MC/CR), and that their performance was remarkably better in MC test formats. Findings suggest that test takers performed relatively better in the selected response format than in the constructed or productive format in the context of the current study.
The findings of the current study are consistent with other response format studies that showed test takers perform relatively better in MC than in the stem-equivalent CR format (e.g., Currie & Chiramanee, 2010; Famularo, 2007; In'nami & Koizumi, 2009; Nixon & Kennedy, 2002). For example, Nixon and Kennedy (2002) carried out a study to compare the performance of test takers in stem-equivalent MC and CR tests. They found that test takers performed much better in MC items. Famularo (2007) also compared the scores of test takers in MC and CR items, and the findings indicated that the MC format was significantly easier than the CR version of the same test. They found that test takers benefit from test taking strategies and corrective feedback. The findings of the current study also confirm those reported by Currie and Chiramanee (2010), who compared the performance of test takers in MC and CR English structure tests. Their results indicated that scores of test takers were significantly better in MC than in the stem-equivalent CR test. Shaibah and van der Vleuten (2013) indicated that scores of test takers were better in MC. On the other hand, the studies done by Hastedt and Sibberns (2005) and Hickson et al. (2012) appear to contradict the findings of this study. Hastedt and Sibberns (2005) observed only small differences between the scores, and Hickson et al. (2012) found that item format did not affect the scores of test takers. The studies done by these researchers were related to different fields such as science, mathematics, and economics, and the differences in the characteristics, age, and knowledge of test takers could have led to different results.
The present study aimed to find the effect of test format on the performance of test takers. The comparison of mean scores suggested that test takers performed better in MC items. Based on previous studies (Cohen, 2012; Shohamy, 1984), the researchers believe that one of the reasons for this finding may be that test takers make use of different skills and processes in doing MC and CR items. While answering MC items, they just comprehend and select an option, whereas in doing CR items, they comprehend and produce an answer, which requires more processing (Shohamy, 1984). Furthermore, options in MC items provide additional information and cues for test takers, and by looking at the options test takers can remember and deduce the answer. Test taking strategies (TTS) refer to the processes that test takers have consciously selected in order to address language issues and item-response demands (Cohen, 2012). Test takers utilize a variety of strategies (such as facilitation and problem solving) to improve their scores in an exam. Facilitation strategies help test takers carry out a process (test tasks), and problem solving strategies are utilized when a problem arises. Also, test takers in doing MC items make use of a wide range of strategies such as elimination and test wiseness, which help them to select an option. Test wiseness strategies are a subcategory of construct-irrelevant strategies (Cohen, 2012) which assist test takers in MC items in finding the possibly correct answer among several distractors: stem-option cues, grammatical options, similar options, and item giveaway (Allan, 1992) are examples of test wiseness strategies utilized in MC items. All such skills, strategies, and cues facilitate answering MC items, increasing the chance of a better score.
Overall, the findings of this study suggest that employing MC and CR formats in tests of vocabulary is likely to create format-related noise or effects, and teachers are recommended to consider the cost and effectiveness of each format while choosing an appropriate format for measuring an intended construct like vocabulary. For example, in using the MC format because of its practicality, teachers must weigh it against the risk that the measurement of the construct is likely to be contaminated or affected by construct-irrelevant factors such as item format or guessing. It is recommended that teachers use a wide variety of formats such as CR, MC, matching, or cloze tests while measuring a construct like vocabulary to decrease the effect of format-related factors.
Findings of this study showed that males performed better in all versions of MC while females outperformed males in the CR test. In other words, we found that MC vocabulary items may favor male students while CR items may favor female students. It should be noted that the issue of gender and its effects on educational tests has long been a concern for testers (Brown & McNamara, 2004).
However, there are mixed results in the literature. For instance, Mauldin (2009), Simkin and Kuechler (2005), and Weaver and Raptis (2001) examined the performance of students in MC and CR tests and found no significant differences between male and female test takers. However, findings of this study suggested that males perform better in MC items while females do better in CR items. Similar to our results, Taylor and Lee (2012) found that males performed better in items that asked them to identify interpretations (answers) while females did better in items that asked them to write their own answers and interpretations.
The results of the current study are consistent with findings of earlier studies (Bolger & Kellaghan, 1990; Garner & Engelhard, 1999; Hellekant, 1994; Taylor & Lee, 2012). Findings of the current study also corroborate those of DeMars (2000), Hastedt and Sibberns (2005), and Reardon et al. (2018) in that females perform better in CR while males perform better in MC items. Moreover, the findings of this research support those of Taylor and Lee (2012). Their study indicated that in reading and math tests, females did better in CR while males performed better in MC items. Furthermore, the results discussed above show that the findings of the present research are in line with the claims of Elder (1998). She noted that females mostly perform better in CR than in MC items, which could be the result of their superior verbal ability and also of other factors which are unrelated to language abilities (e.g., handwriting). On the other hand, males perform better in the forced choice or MC test format.
We cannot determine the exact reasons for the difference in the measured gender gaps on tests with different item formats, but we feel that the differences are large enough to have meaningful consequences for students, especially in the Iranian context where the study was conducted. We hypothesize that one of the reasons for this gender difference in MC items may be related to different levels of confidence in males and females. Males are believed to be more confident and risk taking than females in doing MC items (Biria & Bahadoran Baghbaderani, 2015). While males trust the option they choose, females tend to doubt and change their answers several times, which results in losing time and becoming confused and more stressed in doing other MC items. In the CR format, on the other hand, females do not doubt or become confused, so they rely on the first answer that they believe is right.
The current study is subject to a number of limitations. One limitation relates to the fact that the current study only investigated the performance of test takers in MC and CR test formats, and due to operationalization issues the researchers could not include other item formats. Also, the number of participants and items utilized was limited (243 participants and 20 items). The researchers only analyzed the mean scores of the students on the tests, and further studies with other psychometric analyses need to be conducted. Also, the specific test taking strategies that the students used for answering the MC and CR questions were not studied.
Considering the limitations of this study, further studies need to be conducted with more item formats and more participants so as to be able to offer more generalizable findings and better alternatives for vocabulary test formats. Furthermore, more research is needed to investigate whether male and female students differ in their answering strategies for MC and CR questions. The results of the current study were mainly based on the adapted MC (and CR) vocabulary questions used in a statewide entrance examination in Iran, and the findings may be context specific. However, we believe that teachers and policy makers should consider format-related and gender-related differences in constructing MC and CR formats and develop test formats that are less biased toward a specific gender or use a wide range of formats to compensate for gender-related differences.
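To make the guessing component concrete, the score a test taker could expect from blind guessing alone on each MC version can be worked out directly. This is a sketch under a strong assumption that is not part of the study: every option is equally likely to be picked, which rarely holds when distractors differ in attractiveness. The function name `expected_chance_score` is hypothetical.

```python
def expected_chance_score(n_items, n_options):
    """Expected number of correct answers from purely random guessing.

    Assumes each of the n_options is equally likely to be chosen on every
    item (an idealization, not a finding of the study).
    """
    return n_items / n_options


# The 20-item tests in their 3-, 4-, 5-, and 6-option versions.
for n_options in (3, 4, 5, 6):
    score = expected_chance_score(20, n_options)
    print(f"{n_options}-option: {score:.2f} of 20")
```

A CR item offers no comparable chance baseline, which is one reason raw MC and CR scores are not directly comparable even when the stems are identical.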

Conclusions
The results of this study suggested that the performance of students was much better in the MC vocabulary test than in CR, and that they would have less difficulty in answering stem-equivalent MC items. The results of the study also showed gender differences in the performance of test takers in MC and CR formats; the findings suggested that females performed better in CR while males performed relatively better in the MC test format. Based on these findings, another implication for teachers, testers, and policy makers would be to take into account individual, and more specifically gender, differences while constructing and developing test items for stakeholders. As mentioned earlier, to reduce bias against any gender, test developers are recommended to use different formats.