Comparing Two Measures of L2 Depth of Vocabulary Knowledge Using the Association With Vocabulary Size

This study compared two tests of second language (L2) depth of vocabulary knowledge, namely the word association test (WAT) and vocabulary knowledge scale (VKS), with respect to their associations with vocabulary size. The same relationships were further examined separately for the five word-frequency bands of the vocabulary size test. To this end, 115 English as a Foreign Language (EFL) learners took the WAT, VKS, and Vocabulary Levels Test (VLT). Results of multiple linear regression analyses indicated that: (a) while both measures of vocabulary depth were predictive of the VLT, the WAT had a higher association with the dependent variable; (b) both the WAT and VKS were predictive of the high-frequency vocabulary, with the relationships being stronger for the WAT; (c) the WAT could significantly predict the mid-frequency vocabulary, whereas the VKS made no significant contribution; and (d) while the VKS was significantly associated with the low-frequency vocabulary, the WAT made no significant contribution to the prediction of this level. The findings are interpreted with reference to the suitability of both the WAT and VKS depending on the type of input, expected response, and desired frequency of the target words.


Introduction
Vocabulary knowledge has been recognized as one of the most important components of language learning without which no meaning can be conveyed and understood (Author1

However, as Qian (1998) argued, VKS assesses only one meaning of the prompt word coupled with its actual use and ignores multiple meanings or associations. Henriksen (1999) further confirmed this argument and noted that VKS only assesses the receptivity or productivity of the target words with no measurement of their different aspects. In addition, Schmitt (2010) listed the following limitations for this scale: (1) the first two stages of the scale are unverified; (2) the underlying knowledge constructs are inconsistent, jumping from form-meaning (categories I to IV) to production in context (category V); (3) the intervals between the categories are not consistent; (4) the metalinguistic judgement in categories II (I think I know the word) and III (I know the word) can be confusing for some learners since they are better at judging what they can do with the words; and, more importantly, (5) the simple sentences examinees write in category V cannot clearly show their productive knowledge of the target word. As Webb (2013) mentioned, in VKS, "it is possible [for test takers] to use a word correctly in a sentence without knowing its meaning" (p. 3). In this regard, Zhong (2016) suggested adapting the test to minimize the chance that test takers produce 'neutral' sentences like 'It is beautiful' or 'He is calm'.

The dimensional approach, on the other hand, tries to describe the mastery of various components of different words and considers the mastery of lexical networks of an individual word as important (Read, 1993). To assess such an aspect, WAT was designed and further revised by Read (1993, 1998). It assesses the depth of individual vocabulary knowledge through word association and the relationships between the words in the mental lexicon. This format developed from his earlier attempt to measure depth of vocabulary through an interview procedure in which the learners were asked to pronounce the words, provide an explanation, identify the domain, provide word associations, and suggest other forms of the word (Read, 1998). The first version of the test, designed in 1993, consisted of eight options for each target word, four of which were associated with the target word paradigmatically (synonym), syntagmatically (collocation), or analytically (component) (see Figure 2).

The 1998 version presents eight words in two boxes for each of 40 target words, all of which are adjectives. The examinees are required to select only four words associated with the target word from the two boxes (see Figure 3). The words in the left box are paradigmatically related to the target word and the ones in the right box are syntagmatically related. To reduce the guessing effect, the patterns of correct responses differ such that three formats are possible: two words from the right box and two from the left one; three from the right and one from the left; or three from the left and one from the right.

The merit of this test format is in its ability to tap different instances of meaning, collocation, and formulaic language (Schmitt, 2010). [...] three different aspects of vocabulary depth, namely concept and referents, form and meaning, and collocation, it does not provide separate scores for each of these aspects, and it is plausible that two test takers who actually differ in their depth of vocabulary dimensions receive the same score without being distinguished in terms of which aspect of vocabulary depth each knew. Akbarian (2010) highlighted that, because test takers identify nouns that collocate with the adjectives given as target words, the test taps knowledge of adjectives directly and nouns rather indirectly. Also, adverbs are indirectly focused on in WAT since almost all adverbs are related to their corresponding adjectives (Ishii, 2005). However, measuring depth of knowledge of verbs is taken for granted and not included in the test. In addition, as Milton (2009) and Read (1993, 1998) asserted, WAT is susceptible to guessing due to its receptive multiple-choice format, which can threaten the validity of the test. Test takers can easily choose some of the given words at random, which can make the score interpretation problematic since scores may not provide a true estimate of the test takers' depth of vocabulary knowledge. [...] mostly happen for scores 0-2 and not for scores 3-4 for each item. More specifically, they found that split scores, where test takers achieve 1, 2, or 3 out of the maximum 4 for each item, mostly resulted from no knowledge or partial knowledge of the target word, and consequently no clear interpretation can be reached for these scores. They also relate guessing in WAT items to its tendency to overestimate the test takers' actual knowledge of the target words and raise the question of whether test takers are successful in guessing even if they have no knowledge of the target words.
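The guessing concern can be made concrete with a short calculation. The sketch below assumes the 1998 format (eight options per item, four of them correct) and a hypothetical test taker who selects exactly four options blindly, with one point per correctly selected associate. The function name and the scoring rule are illustrative assumptions, not part of any published scoring procedure. Under these assumptions the item score follows a hypergeometric distribution.

```python
from math import comb

def wat_guess_distribution(options=8, correct=4, picks=4):
    """Probability of each item score when `picks` options are chosen blindly.

    Assumes one point per correctly selected associate (an illustrative
    scoring rule, not taken from the study).
    """
    total = comb(options, picks)  # number of equally likely selections
    return {
        k: comb(correct, k) * comb(options - correct, picks - k) / total
        for k in range(picks + 1)
    }

dist = wat_guess_distribution()
expected = sum(score * p for score, p in dist.items())
print(dist)      # probability of scoring 0, 1, 2, 3, or 4 on one item
print(expected)  # mean item score under blind guessing
```

Under these assumptions a blind guesser averages 2 of the 4 available points per item, and 68 of the 70 equally likely selections yield a split score of 1 to 3, which illustrates why such scores are hard to interpret.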

The interconnection between size and depth as a possible yardstick
Although measured with distinct instruments, depth and size of vocabulary have been found to be closely interrelated. Nurweni and Read (1999), in a study on the vocabulary knowledge of first-year students in an Indonesian university, concluded that the tests of size (word translation test) and depth of vocabulary (WAT) correlated highly with each other (r = .62). Qian (1999) [...] knowledge, and not vocabulary depth which is assessed using word association tasks. Put more simply, the relationship between these two dimensions could possibly mean that they are related to the same construct and, therefore, should not be seen as separate aspects (Vermeer, 2001). Another interpretation is that tests of depth of vocabulary knowledge, such as WAT, are not really tests of vocabulary depth; they are rather size tests masquerading as depth tests (Akbarian, 2010). This claim was also backed by Milton (2009), who asserted that the associative format is not successful in measuring depth of vocabulary, mainly because it is incapable of tapping into the quality of the associations test takers make.

The correlation between size and depth of vocabulary knowledge has been reported to be unclear for lower and higher frequency words. While there seems to be little difference between these two dimensions for higher frequency words, a gap has been reported between these aspects of vocabulary for lower frequency words (Schmitt, 2014). Shimamoto (2000), Noro (2002), and Henriksen (2008), for instance, found the relationship to be weaker for learners who had larger vocabularies and higher language proficiency.

In this study, we used the proposed model of Meara and Wolter (2004) and the interconnection between the two dimensions of vocabulary size and depth as a yardstick to identify the more suitable test of vocabulary depth. Additionally, the nature of this relationship was probed for higher and lower word-frequency levels of the vocabulary size test. The following research questions were thus addressed in this study: [...] English language teaching and English literature major students. Accordingly, the participants who scored between 30 and 47 in the test, i.e., B1 and B2 according to the Common European Framework of Reference (CEFR), were selected. The selected participants ranged from freshmen to juniors, included both male and female students, and were aged 18 to 25. The reason for selecting this sample is that, given the nature of the study, participants should have a good mental lexicon in terms of quantity and quality of

WAT-test of dimensional aspect of vocabulary depth
Developed by Read (1993), WAT measures the depth of vocabulary knowledge of the participants. The test comprises 40 items, each of which consists of one stimulus word, an adjective, followed by eight words in two boxes of four. The left and right boxes consist of the synonymous words and collocations of the stimulus words, respectively. The participants should choose four words that are related to the prompt word semantically. The four related words have been selected to represent three semantic relations, namely paradigmatic, syntagmatic, and analytic (Read, 1993). Read (1995) reported its reliability (KR-20, N = 94) as .93, and Nassaji (2006) and Qian (2002) found its split-half reliability to be .89.
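For readers unfamiliar with the KR-20 coefficient cited above, the following is a minimal sketch of how it is computed from dichotomous (0/1) item scores. It assumes the population variance of total scores, which is one common convention, and uses a toy data set invented purely for illustration. How WAT responses were dichotomized in the cited studies is not stated here, so this is not a reconstruction of their analysis.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a list of test takers, each a list of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    # Proportion of test takers answering each item correctly.
    p = [sum(person[i] for person in responses) / n_people
         for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Population variance of total scores (an assumed convention).
    total_var = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_var)

# Toy data: four hypothetical test takers, three items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(toy)
print(reliability)
```

The coefficient rises as the items covary more strongly, that is, as the variance of total scores grows relative to the sum of the item variances.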

VKS-test of developmental aspect of vocabulary depth
Developed originally by Paribakht and Wesche (1993), VKS was used to find out the participants' self-perceived level of the developmental aspect of depth of vocabulary knowledge. Participants indicate their level of knowledge about the target words on a Likert scale ranging from total unfamiliarity to the ability to use the words in context. The instrument has a high reliability estimate of .89 for content words and .82 for discourse connectives, as reported by Paribakht and Wesche (1997).

As VKS is a tool which in theory can be used with any set of words, and since the purpose of the present study is to compare VKS and WAT, the same prompt words in the latter test were utilized as the cue words for the former. [...] group consists of 6 cue words that should be matched with 3 definitions (see Figure 5). The test has been reported as reliable with a Cronbach alpha of .96 (Akbarian, 2008). [...] were encouraged to give as many answers as they could, even if they were not sure whether the given answers were correct (Read, 1993). As for the VLT, the participants were asked not to guess at words they did not know, but were advised to attempt an answer if they thought they might know it. The time allotted for each test was 30 to 45 minutes. The WAT, VKS, and VLT papers of the participants were scored following Nassaji (2006), Wesche and Paribakht (1996), and Schmitt et al. (2001), respectively. Multiple linear regression analyses were run using SPSS version 23.0 to find the contribution of WAT and VKS to VLT and the extent to which the high and low word-frequency bands were predicted by the two tests of vocabulary depth.

Results
Descriptive and reliability statistics
Table 1 presents a general profile of the descriptive statistics of the participants' scores on the WAT, VKS, VLT, and the four word-frequency bands of the VLT. As the table shows, the participants' scores on the three administered vocabulary tests and the sub-tests of the VLT had acceptable Cronbach's alpha reliability estimates.

Before running multiple regression analyses, the correlations among the variables were calculated. The results of the Shapiro-Wilk test indicated that, except for the scores on the VKS and VLT, the scores on the WAT and the four sub-tests of the VLT were not normally distributed (p < .05). Spearman correlation coefficients were therefore calculated for the sets of scores, the results of which are provided in Table 2. It shows that the correlations among all the variables were significant (p < .05), and the correlation between the VKS and WAT, as the predictor variables, was also significant (p < .05). However, multicollinearity was not a concern as the tolerance values were greater than 0.40 and the variance inflation factors (VIFs) were less than 2.5 (Field, 2009).
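The two collinearity diagnostics mentioned above are reciprocals of each other, so a VIF below 2.5 corresponds to a tolerance above 0.40. The sketch below illustrates this for a model with exactly two predictors (as with WAT and VKS), where tolerance reduces to 1 - r², with r the Pearson correlation between the predictors. The function name and the example value of r are hypothetical and are not taken from the study's tables.

```python
def collinearity_two_predictors(r):
    """Tolerance and VIF for a regression with exactly two predictors.

    r is the Pearson correlation between the two predictors; with only
    two predictors, the R-squared of one regressed on the other is r**2.
    """
    tolerance = 1 - r ** 2
    vif = 1 / tolerance
    return tolerance, vif

# Hypothetical predictor correlation, chosen only for illustration.
tolerance, vif = collinearity_two_predictors(0.5)
print(tolerance, vif)

# Largest |r| that keeps VIF below 2.5 (equivalently, tolerance above 0.40).
r_limit = (1 - 1 / 2.5) ** 0.5
print(r_limit)
```

In this two-predictor case, the stated thresholds would only be breached if the predictors correlated at roughly .77 or higher.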

Predictive ability of WAT and VKS in VLT
The contribution of the participants' WAT and VKS scores to VLT scores was examined through multiple linear regression analysis (using the stepwise method). The results, as shown in Table 3, revealed that two models emerged for this association. The first model, in which only the WAT was entered as the predictor variable, could explain about 23% of the variance in the VLT (F(1, 113) = 33.565, p < .001, R² = .229). The second model, where both WAT and VKS were entered as the explanatory variables, could explain 29% of the VLT performance (F(2, 112) = 23.052, p < .001, R² = .292). In other words, the addition of the VKS scores provided an additional 6% of predictive power, which was a significant change (p < .01).

The standardized beta weights also reaffirmed the strength of the association between the scores on the WAT and VLT in the first (β = .479, t = 5.794, p < .001) and second (β = .355, t = 4.005, p < .001) models. The VKS, however, made a smaller contribution to the prediction of the VLT scores (β = .279, t = 3.146, p < .01).
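As a consistency check on the reported statistics, the F values can be approximately recovered from the rounded R² values using the standard formulas for the overall F test and the F test of an R² change. The sketch below is an illustration, not the SPSS output itself; the function names are hypothetical, and small discrepancies are expected because the inputs are rounded to three decimals.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic for a model with k predictors and n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def f_change(r2_full, r2_reduced, n, k_full, k_added=1):
    """F statistic for the R-squared gained by adding k_added predictors."""
    gain = (r2_full - r2_reduced) / k_added
    return gain / ((1 - r2_full) / (n - k_full - 1))

n = 115  # number of participants reported in the study
f_model1 = f_from_r2(0.229, n, 1)       # WAT only
f_model2 = f_from_r2(0.292, n, 2)       # WAT and VKS
f_added = f_change(0.292, 0.229, n, 2)  # gain from adding VKS
print(round(f_model1, 2), round(f_model2, 2), round(f_added, 2))
# With one added predictor, the square root of F-change equals |t|
# for that predictor in the full model.
print(round(f_added ** 0.5, 2))
```

The recovered values agree closely with the reported F(1, 113) = 33.565 and F(2, 112) = 23.052, and the square root of the F-change is close to the reported t = 3.146 for the VKS.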

Predictive ability of WAT and VKS in high and low frequency vocabulary of VLT
The second research question of this study investigated the extent to which the WAT and VKS scores could predict the high and low word-frequency bands of the VLT. A series of multiple linear regressions (using the stepwise method) was run for this purpose. The results (see Table 4) indicated that, for the 2,000-word-frequency band of the VLT, two models emerged. In the first model, only the WAT was entered as the predictor variable, which could explain 15.5% of the variance in this sub-test of the VLT (F(1, 113) [...] Appraisal of the standardized beta further confirmed the significant associations between the WAT scores and the 2,000-word-frequency band (β = .394, t = 4.559, p < .001), the 3,000-word-frequency level (β = .490, t = 5.977, p < .001), and the 5,000-word-frequency level (β = .376, t = 4.310, p < .001). The links between the VKS scores and the 2,000-word-frequency band (β = .223, t = 2.354, p < .05) as well as the 3,000-word-frequency level (β = .305, t = 3.497, p < .01) were comparatively weaker. In contrast, while the WAT performance was the only variable associated with the 5,000-word-frequency level (β = .376, t = 4.310, p < .001), the VKS was the only format which could be linked with the 10,000-word-frequency band (β = .380, t = 4.368, p < .001).

Discussion
This study sought to identify the more suitable measure of vocabulary depth by using the yardstick of associations with VLT, a measure of vocabulary size. The results of multiple linear regression analyses for the scores of 115 EFL students indicated that the WAT was more strongly associated with the VLT scores, particularly the high- and mid-frequency bands. The VKS, however, had a comparatively weaker contribution to the prediction of the VLT scores, but its prediction of the low-frequency band of this test was unique.

The findings indicated that the interconnection between the two aspects of vocabulary size and depth was strong, as measured through the WAT and VLT, supporting previous studies (e.g., Akbarian, 2010; Gyllstad, 2007; Henriksen, 2008; Milton, 2009; Zareva, 2005). For instance, Akbarian (2010) used regression analysis and reported that the WAT could predict the variance in the VLT. This study also found that the links between the WAT and the higher frequency words of the VLT were stronger than those with the 10,000-word-frequency band. This lends some support to Schmitt's (2014, p. 941) conclusion that for higher levels of vocabulary size "there is often little difference between size and a variety of depth measures" while this association is weak for lower frequency bands of the VLT where "there is often a gap between size and depth, as depth measures lag behind the measures of size". Noro (2002) and Henriksen (2008) further reported a weaker correlation between the VLT and WAT for lower frequency words. The strong association between the two tests could be justified with reference to the findings of Meara and Wolter (2004), who reported that an increase in vocabulary size could lead to an increase in vocabulary depth, particularly for lower levels of language proficiency.

The results further indicated that the prediction of the VLT was mainly made by the WAT, while the VKS made a smaller contribution to the prediction of the VLT scores. This could be due to the different task format of the WAT, which employs matching items, while the VKS uses a scale that indicates knowledge subjectively. The objective matching format of the WAT is more compatible with the matching format of the VLT, and both reduce the guessing effect (Stewart, 2014). Therefore, the students' score on the WAT could be a more precise indication of their depth of vocabulary knowledge than the VKS, which is more subjective. The lower power of VKS than of WAT in predicting VLT implies that WAT can be regarded as a measure of depth of vocabulary that is more influenced by the size dimension of vocabulary knowledge, and hence, according to Meara and Wolter's (2004) model, it might be regarded as a better measure of depth of vocabulary in comparison with VKS.

The results for the second research question showed that while the WAT was more predictive of the high- and mid-frequency vocabulary (Schmitt & Schmitt, 2014), for the 10,000-word-frequency band of the VLT, the VKS was the only predictor variable. One explanation is that the partly receptive, partly productive nature of VKS can better capture knowledge of less frequent vocabulary than the WAT, which is only receptive. As mentioned previously rather implicitly, the first two columns of VKS measure receptive aspects of depth of vocabulary knowledge and the other three columns focus on the productive aspect. This special feature of VKS makes it measure depth of vocabulary both receptively and, for the most part, productively. In contrast, WAT is mainly a receptive measure of vocabulary depth dealing with making associations among the given words. The difference between these two depth of vocabulary tests in receptivity and productivity can be regarded as the cause of their different regression results. This provides empirical support for Read's (2004) proposal calling for distinguishing among different aspects of depth of vocabulary with different measures.

Taking the overall results into account, it can be claimed that although WAT was shown to be more predictive of a measure of vocabulary size, and hence a better measure of depth of vocabulary than VKS in this regard, each of these tests should be used depending on the purpose of measurement, i.e., whether to measure receptive or productive aspects of depth of vocabulary. Moreover, for tapping less frequent aspects of vocabulary depth, the VKS would be a more suitable option.

Conclusion
Building on Meara and Wolter's (2004) conceptualization of the relationship between vocabulary size and depth, the current study compared WAT and VKS in order to find the more appropriate measure of vocabulary depth by comparing their power to predict VLT scores, as a measure of vocabulary size. It can be concluded that although the WAT scores explain the variance in the VLT scores to a larger extent and could, therefore, be considered a more suitable test of vocabulary depth when we consider the association of size and depth as a yardstick (Meara & Wolter, 2004), the VKS should also be seen as a more subjective test of vocabulary depth that could tap into the more productive aspect of this dimension of vocabulary knowledge. The results shed light on the difference between WAT and VKS, reporting a low correlation between the two, which signifies that they cannot be used interchangeably for research and instruction purposes. Rather, they should be used for the purposes which correspond to the nature of their test item structure. In other words, vocabulary researchers can use VKS when they are exploring the role of depth of vocabulary in speaking and writing performance, as productive skills, especially if the focus of the investigation is less frequent words. Also, WAT can be used in probing the association between reading and/or listening comprehension, as receptive skills, and vocabulary depth. With this specification of the use of measures of depth of vocabulary knowledge, more precise results might be achieved in future vocabulary studies.

The results and conclusion of the present study need to be interpreted with caution, as there were some limitations, which lead to some suggestions for further research. First of all, similar to previous quantitative studies on WAT

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interest
The authors declare that they have no competing interests.

Funding
There is no funding for this research.