Structural Equation Modeling of Vocabulary Size and Depth Using Conventional and Bayesian Methods

In classifications of vocabulary knowledge, vocabulary size and depth have often been separately conceptualized (Schmitt, 2014). Although size and depth are known to be substantially correlated, it is not clear whether they are a single construct or two separate components of vocabulary knowledge (Yanagisawa and Webb, 2020). This issue has not been addressed extensively in the literature and can be better examined using structural equation modeling (SEM), with measurement error modeled separately from the construct of interest. The current study reports on conventional and Bayesian SEM approaches (e.g., Muthén and Asparouhov, 2012) to examine the factor structure of the size and depth of second language vocabulary knowledge of Japanese adult learners of English. A total of 255 participants took five vocabulary tests. One test was designed to measure vocabulary size in terms of the number of words known, while the remaining four were designed to measure vocabulary depth in terms of word association, polysemy, and collocation. All tests used a multiple-choice format. The size test was divided into three subtests according to word frequency. Results from conventional and Bayesian SEM show that a correlated two-factor model of size and depth with three and four indicators, respectively, fit better than a single-factor model of size and depth. In the two-factor model, vocabulary size and depth were strongly correlated (r = 0.945 for conventional SEM and 0.943 for Bayesian SEM with cross-loadings), but they were distinct. The implications of these findings are discussed.


INTRODUCTION
The structure of language ability is a focus of concern for second language (L2) assessment researchers. Research on this issue dates back to Oller (1983), who reported on the unitary (i.e., single-factor) structure of a university placement test comprised of sections on grammar, composition, vocabulary, phonology, and dictation or cloze tasks. The findings of Oller's study suggested that a single ability could be measured by a test consisting of multiple components of language ability. This finding was criticized by Bachman and Palmer (1983) and Farhady (1983), the whose results contradicted Oller and instead suggested that language ability consisted of multiple components. Research into the structure of language ability has continued, and numerous studies have made contributions to the issue (e.g., Shin, 2005; In'nami and Koizumi, 2012a;Sawaki and Sinharay, 2017;Yan et al., 2019). Such investigations have for the most part focused on tests that assess skills (e.g., four skills in Sawaki and Sinharay, 2017;speaking in Sawaki, 2007;writing in Bae et al., 2016; listening and reading in Yamashita and Shiotsu, 2017). If the intended test constructs accord well with an observed factor structure, this constitutes one piece of evidence for the validity of interpretations based on test scores and evidence for an inference in the validity argument (e.g., Chapelle et al., 2008;Kane, 2013).
This line of research, which should also include vocabulary constructs (i.e., the vocabulary knowledge and ability that tests are intended to measure and what is actually measured; see Chapelle, 1998), has hitherto been limited. The necessity of investigating the quality of vocabulary tests and their constructs has been emphasized by Schmitt et al. (2020), who stated that L2 vocabulary fields need step-by-step test development and validation of vocabulary tests to allow for the meaningful interpretation and application of test scores.
While vocabulary knowledge has been conceptualized in various ways, vocabulary size and depth have often been separately conceptualized (Schmitt, 2014). Although the two have been shown to be substantially correlated, how size and depth should be conceptualized is not clear (Yanagisawa and Webb, 2020). Since they are strongly related to one another, should they be considered a single construct? Or should they be treated as two distinct constructs of vocabulary knowledge? These questions regarding the factor structure of size and depth can be better examined via structural equation models that take into account measurement error. Although structural equation modeling (SEM) has been used in language testing, models can be more flexibly tested using Bayesian estimation within the framework of SEM. The current study reports on the uses of conventional and Bayesian SEM to examine the factor structure of size and depth of L2 vocabulary knowledge of Japanese adult learners of English.

LITERATURE REVIEW Defining Size and Depth
Many vocabulary researchers share the view that vocabulary knowledge can be classified into several components (e.g., Henriksen, 1999;Read, 2000;Meara, 2005;Daller et al., 2007;Milton, 2009;Schmitt, 2010;Nation and Webb, 2011;Webb and Nation, 2017;Nation, 2020). Of the several methods of classification, one in particular often used is the size and depth of vocabulary knowledge. It was proposed by Anderson and Freebody (1981) and is defined as follows: Size, or breadth, concerns a quantitative aspect related to knowledge of a word form and a primary meaning. This is also termed the formmeaning association. In contrast, depth involves a qualitative aspect associated with "how well a learner knows individual words or how well words are organized in the learner's mental lexicon" (Staehr, 2009, p. 579).
Size has garnered much more attention as a research target than depth (Schmitt, 2014;Qian and Lin, 2020;Yanagisawa and Webb, 2020). In contrast, depth covers a wide range of lexical dimensions and is difficult to define. According to Webb (2013), "there is no definition of vocabulary depth that is widely agreed upon" (p. 1657). One of the leading researchers in depth studies, Read (2004) classified depth into three aspects: precision of meaning, comprehensive word knowledge, and network knowledge. Schmitt (2014), in relation to relationships between size and depth, organized depth into seven aspects: "receptive versus productive mastery, knowledge of multiple word knowledge components, knowledge of polysemous meaning senses, knowledge of derivative forms (word family members), knowledge of collocation, the ability to use lexical items fluently, and the degree and kind of lexical organization" (p. 922). Nation (2013Nation ( , 2020) offered a comprehensive list of vocabulary knowledge by using three categories (i.e., Form, Meaning, and Use), each of which is further classified into three aspects: (a) Form: spoken, written, and word parts; (b) Meaning: form and meaning, concept and referents, and associations; and (c) Use: grammatical functions, collocations, and constraints on use (e.g., register, frequency). Each aspect has receptive and productive dimensions. Among them, "form and meaning, concept and referents, and collocation, " which Webb (2013) considers to be assessed by the Word Associates Format (WAF; Read, 1993Read, , 1998, seem to be the aspects studied most. While both size and depth are important for language use, size has been considered the primary aspect of vocabulary knowledge because of its importance in the form-meaning link for vocabulary use (e.g., Laufer et al., 2004;Webb, 2005;Schmitt, 2010). Given the centrality of size, indications of form-meaning knowledge are often interpreted as having the ability to use words in reading, listening, writing, and speaking, and even as having vocabulary depth such as derivatives and collocations. However, Kremmel and Schmitt (2016) argue that these interpretations are not justified based on their research.

Measuring Size and Depth
Size and depth have been measured using various formats. Size has been typically measured by means of a recognition (e.g., multiple-choice or matching) or recall (e.g., translation) format, in which the L2 target form or its meaning is presented and test takers select or supply the meaning or L2 form (e.g., Laufer et al., 2004). There exist many vocabulary size tests, such as the Vocabulary Levels Test (Nation, 1983;Schmitt et al., 2001) and the Vocabulary Size Test (Nation and Beglar, 2007). The following shows a sample item from the Vocabulary Size Test Beglar, 2007, JALT2007, p. 2), in which test takers are asked to select the most appropriate meaning of the written target word out of four written choices. poor: We are poor. ( * answer) a. have no money * b. feel happy c. are very interested d. do not like to work hard While the form-meaning association appears relatively simple to define and assess, research has shown that it is not: Size test scores are affected not only by the intended test construct but also by various factors such as differences in item formats and test takers' test-taking strategies (Gyllstad et al., 2015;Kremmel and Schmitt, 2016;McLean et al., 2020).
Still, measuring size is less complicated than measuring depth. Depth is a multifaceted construct, ranging from various aspects of vocabulary to lexical organization, resulting in varied test formats. Yanagisawa and Webb (2020) grouped various approaches to measuring depth into three categories: a developmental approach, a lexical network approach, and a components approach. The developmental approach considers depth as something expanding from zero to full knowledge and attempts to test on what stage learners are located. An example can be found in the Vocabulary Knowledge Scale in Wesche and Paribakht (1993, p. 30), which uses self-assessment and some production items. Test takers are asked to indicate their degree of knowledge of each target word using the following scale.
I. I don't remember having seen this word before. II. I have seen this word before, but I don't know what it means. III. I have seen this word before, and I think it means.____ (synonym or translation) IV. I know this word. It means.____ (synonym or translation) V. I can use this word in a sentence:____________. (If you do this section, please also do Section IV.) Scores vary according to the quality of written responses. For example, if the synonym or translation provided in III-IV by test takers is wrong, those who choose III gain a score of 2. Yanagisawa and Webb (2020) summarize validity issues related to the Vocabulary Knowledge Scale. These issues include the lack of empirical basis for the developmental scale structure and difficulty in interpreting total scores because the test assesses multiple aspects of vocabulary knowledge in different stages.
In the lexical network approach, depth is conceived as a lexical network in which words are associated in learners' mental lexicon, and indications of knowledge of word association are elicited in tests taking this approach. The WAF (Read, 1993(Read, , 1998 uses this approach and is possibly the most frequently used depth measure (Yanagisawa and Webb, 2020). In the following sample item from Read (1998, p. 46), test takers are asked to select four words related to the stimulus word out of eight options. In the box on the left, words that may have paradigmatic associations with the cue word (synonym or one element of the meaning) are presented, whereas the box on the right contains words that may have syntagmatic associations with the cue word (collocations). There are possibly one to three answers out of four in the left box and one to three answers in the right box, and four answers in total. sudden beautiful quick * surprising * thirsty | change * doctor noise * school While the WAF is relatively easy to administer and score, there are limitations: For example, this format taps limited aspects of the lexical network; this format allows test takers to use guessing strategies; studies using the WAF have modified test formats and scoring methods according to their research orientations, so the scores are not always comparable across studies (Yanagisawa and Webb, 2020); it is also rather difficult to interpret what its total scores mean because multiple aspects of vocabulary depth are combined (Webb, 2013).
The components approach handles different aspects of depth separately. Webb (2013) recommended this approach, stating that creating tests assessing each aspect separately would bring the field forward for more precise depth assessment and research. Using this principle, multiple measures have been developed. For example, Webb (2005) developed 10 tests that focus on five aspects (i.e., written form, form and meaning, association, collocation, and grammatical functions), each of which was assessed with receptive or productive (i.e., recognition or recall) formats. Tests focusing on written form assessed size, whereas those focusing on the other four aspects assessed depth. Nguyen and Webb (2017) developed a collocation test in a multiplechoice format in which test takers were required to choose "the word that co-occurred most frequently with the node word from four options" (p. 306). An example is shown below (p. 309).
advantage a. get b. give c. have d. take * Among the three approaches to measuring depth (i.e., the developmental, lexical network, and components approaches), Yanagisawa and Webb (2020) recommended the components approach most because of its transparency in what the test scores indicate. They suggested investigating a wider range of depth aspects by using separate tests. The current study responds to this call for research and develops tests separately focusing on three depth aspects: association, polysemy, and collocation.

Correlations Between Size and Depth
Numerous researchers have examined the relationship between size and depth in L2 vocabulary studies (e.g., Nurweni and Read, 1999;Mochizuki and Aizawa, 2000;Vermeer, 2001;Noro, 2002;Qian, 2002;Akase, 2005;Shimamoto, 2005;Ishii and Schmitt, 2009;Koizumi and In'nami, 2013;Kremmel and Schmitt, 2016;see Schmitt, 2014, for a comprehensive summary). They have been interested in exploring the degree to which size and depth are related and how constructs of size and depth can be conceptualized in L2 vocabulary assessment. In his seminal article on a critical review of studies on vocabulary size and depth, Schmitt (2014) posed the following questions: "Do size and depth behave as separate constructs, " "or are they essentially the same construct?" (p. 941). These questions underlie the research conducted and discussions held thus far. For example, Akbarian (2010) reported a strong simple (zeroorder) correlation (r = 0.864) between vocabulary size and depth among 112 Iranian learners of English. Size was measured using the Vocabulary Levels Test (Schmitt et al., 2001), whereas depth was measured using the WAF (Read, 1993). Strong correlations were also found in Vermeer (2001). He examined 25 L2 Dutch kindergarteners who took two size tests in which words were presented orally. In one test, they selected the picture option that showed the meaning of the word they heard; in the other, they described the meaning of the word presented. In the depth test, they were asked to express what they knew about the target word by answering the following questions: "What is a . . .?" "What does a . . . usually look like?" "What can you do with a . . .?" "What do you feel when you touch a . . .?" and "Can you tell us some more about a . . .?" (p. 224). Their responses to the depth test were evaluated in terms of the quality of the word association network. It was found that the size test scores strongly correlated with the depth test scores (r = 0.72-0.76), which led him to state that "there is no conceptual distinction between the two" (p. 231). Schmitt (2014) attributed these high correlations to overlapping constructs. He argued that "the depth test only tapped into deeper semantic knowledge of a single meaning sense, so all tests (both size and depth) were essentially various types of meaning tests" (p. 921). He added that if he had used the measures that assess broad aspects of depth, the correlations would not have been so strong.
This hypothesis has been supported by previous studies such as Schmitt and Meara (1997), which examined the relationships between size and depth (i.e., both receptive and productive aspects of word association and suffix knowledge) among 88 Japanese learners of English. Size was assessed by the Vocabulary Levels Test (Nation, 1983). Receptive word association and suffix knowledge were assessed by requiring test takers to select the correct suffixes and words associated with a target word. Productive word association and suffix knowledge were assessed by requiring test takers to write every suffix that they thought could be added to the stimulus word as well as three word associations prompted by the stimulus. The simple correlations between size and depth aspects ranged from low to moderate (r = 0.27-0.62).
Findings from previous studies suggest that size and depth are correlated but that the strength of correlations varies from weak to strong across studies. Schmitt (2014) has attributed this variation mainly to different types of depth assessed and instruments used and to different L2 proficiency levels of test takers. He also pointed out that many depth tests may have problems related to reliability and validity. Since correlation coefficients are lowered in tests with low reliability, measurement error may partly explain the differing strengths of the relationships between size and depth across studies. One way to more accurately estimate correlation coefficients while addressing measurement error is to use SEM. SEM has been used to examine the factor structure of language ability by testing the fit of models to data. Ability and measurement error are modeled separately so that the relationships between abilities can be more precisely examined while separately estimating the impact of measurement error (see In'nami and Koizumi, 2011;Winke, 2014;Ockey and Choi, 2015, for SEM in an L2 assessment field).
In vocabulary studies, the factor structure of size and depth can be modeled using SEM in two ways. First, in a single-factor model, both size and depth measures (i.e., observed variables) are hypothesized to reflect one vocabulary factor (size and depth combined). If this model is the most likely, the distinction of size and depth is not very important, as size and depth assess the same vocabulary knowledge. Second, in a correlated factor model, size and depth factors are hypothesized to be correlated with one another. Even when they are correlated very highly, they should be treated separately, as doing so better explains the data.
A few previous studies examined a factor structure of the L2 vocabulary size and depth of L2 learners by modeling both size and depth as latent factors and comparing fit statistics across multiple models: Tannenbaum (2008) targeted first language (L1) users, and Kieffer and Lesaux (2012) targeted L1 and L2 users and analyzed a combined sample. To our knowledge, Vafaee (2016) is the only study that focuses on L2 learners' vocabulary factor structure. The authors are aware that several studies used SEM but did not model size or depth as a separate latent factor (Tseng and Schmitt, 2008;Zhang, 2012;Koizumi and In'nami, 2013), or one study (Tseng, 2011, as cited in Schmitt, 2014) did not provide sufficient information for review. Vafaee (2016) examined the relationship between size and depth of 263 lower-intermediate to advanced Persian learners of English. In the size test, test takers listened to a word and non-defining sentence once and selected from four choices of L2 meanings provided on the answer sheet (i.e., an aural version of the Vocabulary Size Test; Nation and Beglar, 2007). The test was divided into four sections according to the frequency of target words, and these four sections were used as indicators of vocabulary size (α = 0.67-0.84). In addition, an aural test of depth was created by adapting the WAF (Read, 1993(Read, , 1998. Test takers listened to the target word and options and were required to choose a synonym or collocation in relation to the target word. Results of synonym and collocation were separately scored, with synonym and collocation forming two indicators of depth (α = 0.92-0.93). There were moderate simple correlations between size and depth indicators (r = 0.64-0.77). A single-factor model with six indicators of size and depth was compared to another model (size and depth were separately modeled and correlated). The latter model (a correlated twofactor model) fit the data better than the single-factor model, with size and depth highly correlated (r = 0.94). However, the results of Vafaee (2016) may have been affected by (a) measures used to assess size and depth, (b) aspects assessed by depth tests, or (c) other features, such as participants' L1 and L2, or L2 proficiency levels. Regarding (a), Vafaee (2016) used aural versions of the Vocabulary Size Test and the WAF. Regarding (b), the research focused on synonym and collocation, as measured by the WAF. Although an aural version of the WAF was developed for the research, issues related to WAF test interpretation and use mentioned in the Literature Review apply to this research as well. Regarding (c), the participants were Persian learners of English at lower-intermediate to advanced levels. In order to know to what extent the findings of Vafaee can be generalized beyond contexts, further research is needed to examine a factor structure of size and depth with different types of measures addressing different aspects of the vocabulary knowledge of various target participants. Thus, this study examines a factor structure of size and depth, targeting beginner to intermediate Japanese learners of English, using diverse measures of depth.

CURRENT STUDY
To examine the relationship between the size and depth of L2 vocabulary knowledge, the following research question Frontiers in Psychology | www.frontiersin.org is investigated in the context of L1 Japanese adult learners who studied English as a foreign language at beginner to intermediate levels.
Research question: Which factor structure of the size and depth of vocabulary knowledge explains the data better, a singlefactor or a correlated two-factor model?
The current study expands on the findings of previous studies in five ways: First, we include three aspects of depth for analysis: word association, polysemy, and collocation. As described in the Literature Review, many previous studies, including Vafaee (2016), have used the WAF, which basically targets synonym and collocation; we increase the number of the aspects of vocabulary depth measured from two to three by employing more tests. We intentionally select word association, polysemy, and collocation, which are more closely related to a form-meaning link than are other depth aspects such as word parts, to rigorously examine the separability of size and depth constructs. Second, we use four separate depth tests by taking the components approach. Third, unlike studies that used simple correlations or regressions to investigate the relationship between size and depth (e.g., Vermeer, 2001;Akbarian, 2010), we use SEM to empirically identify the structure that best fits the data. The use of SEM should more clearly elucidate the relationship in question, with the measurement error of the instruments examined separately. Fourth, we explicitly compare a single-factor model with a correlated two-factor model using SEM. The identification of a best-fitting model of size and depth in comparison to competing models would have strong implications for vocabulary theory and practice. Fifth, we use both conventional and Bayesian SEM. In conventional SEM (and particularly in confirmatory factor analysis), the relationships between observed variables and factors are modeled by specifying paths between the two. Specifying no path indicates that no such relationship is hypothesized. According to Muthén and Asparouhov (2012), this is a very strong assumption and may not reflect researchers' theories or hypotheses since it is highly unlikely that no relationship exists between observed variables and factors. They have stated that it would be more sensible to model near-zero relationships with some variability between these observed variables and factors. Yet, conventional SEM does not allow researchers to specify such models. This is possible in Bayesian SEM, where degrees of a relationship can be specified using prior information (i.e., priors) based on theory and previous studies. This allows for more flexible testing of models by enabling researchers to specify major and minor loadings, namely those expressed as near-zero cross-loadings and correlations between residuals (i.e., measurement error).

METHOD Participants
In 2012, a total of 255 adult learners (18 or older) took vocabulary tests as part of their L2 English courses or as volunteers. Originally, 257 students took the tests, but 2 were found not to have taken the test seriously, so these 2 students were not included. Of these 255 test takers, 239 were undergraduates at nine Japanese universities; 9 were graduate students at four Japanese universities; and 7 were professionals who used English frequently. The undergraduate and graduate participants attended national or private universities and majored in various subjects. Other information such as gender and age was not available, but it is reasonable to assume that most participants were 18-22 years old and studied English as a foreign language for at least 6 years at the secondary school level. This is because most undergraduates in Japan are in this age range and have similar English-learning experience.
They took five vocabulary tests (see Instruments and Procedures below) and provided scores obtained in 2012 or earlier for the TOEIC (Test of English for International Communication R ) Listening and Reading Test. The distribution of participants' TOEIC scores (M = 514.84; SD = 181.17; Min = 205; Max = 985) resembled the distribution of all Japanese test takers for the TOEIC test (M = 520; SD = 180; reported in Educational Testing Service, 2019). Most participants were estimated to possess an A2 level proficiency of the Common European Framework of Reference (CEFR; Listening = 62.75%; Reading = 50.20%), based on their TOEIC Listening and Reading Test scores and a conversion table (Tannenbaum and Wylie, 2013).

Instruments and Procedures
The intent of the study's five vocabulary tests was to measure vocabulary size (one test) and vocabulary depth (four tests). We used a multiple-choice format with four or five options (see Table 1 for examples and Appendix A in Supplementary Material for all the test items). All the tests employed a discrete, selective, context-independent format (Read, 2000). Words used in the size and depth tests were different across tests.
The tests were developed for research, using lemma as a basis of counting units (see Vilkaitė-Lozdienė and Schmitt, 2020, for its appropriateness). They were constructed using the JACET8000 vocabulary list, a word list specifically adapted to Japanese learners of English [JACET (Japan Association of College English Teachers) Basic Word Revision Committee, JACET Basic Word Revision Committee, 2003]. This list was compiled using the British National Corpus (BNC) and subcorpora based on material that Japanese learners of English are likely to encounter, such as in textbooks for secondary schools. We considered using the word list matching the target learners' learning context to be appropriate for measuring their vocabulary (Nation and Sorell, 2016). Readers can refer to Appendices B, C in Supplementary Material for information on word frequency. The JACET8000 vocabulary list was later updated (JACET Basic Word Revision Committee, 2016; see the older and latest version lists 1 ). All the tests were piloted and revised before the validity of interpretations based on the scores of each test was examined and reported in Mochizuki et al. (2014).
The JACET8000 Vocabulary Size Test was intended to assess lexical knowledge of L2 written forms and the primary meanings (the first definition that appears in dictionaries) of up to 8,000 lemma. Test takers were required to select an L2 form that (1) manipulate (2) mention* (3) minister (4) moderate * = answer. [] = explanation to readers of this article, which did not appear on the tests. The instructions and explanations were written in Japanese to ensure participants fully understood the procedures.
corresponded to a meaning provided in L1 Japanese. There were 40 items in total, with 5 items for each 1,000-lemma level. The 40 items were divided, according to word frequency, into three subtests of 15 (levels 1,000-3,000), 15 (4,000-6,000), and 10 items (7,000-8,000). The second through the fifth vocabulary tests assessed three aspects of depth of vocabulary knowledge: word association, polysemy (two formats), and collocation. Stimulus words presented in each test were selected from the 1,000-to 3,000lemma levels (Polysemy Test 2) or from the 1,000-to 2,000lemma levels (the other three tests). All correct options but one (Word Association Test, No. 12) were within 3,000-lemma levels (see Appendix B in Supplementary Material).
In the 30-item Word Association Test, test takers were required to choose which L2 word was associated the most strongly with the L2 word provided. To construct this test, Mochizuki et al. (2014) asked Japanese learners of English with low to high proficiency to write three to five English words related to stimulus words (e.g., sky). They then selected (a) a word association as an answer that distinguished low-and highlevel learners and (b) distractor word associations that did not distinguish between the two levels.
There were two polysemy tests. The first (Polysemy Test 1, 20 items) asked test takers to select two frequent meanings of an L2 polysemous word (including homographs). They were selected from the following lists of polysemous words (Gorfein et al., 1982;Twilley et al., 1994;Seto, 2007).
The other polysemy test (Polysemy Test 2) required test takers to choose an English word with the same meaning in Japanese. The stimulus words were selected from words that had at least three meanings displayed in the Collins COBUILD Advanced Learner's English Dictionary. The definition that appeared third in the dictionary was selected. Of the 20 items originally on the test, only 17 of the items were used for analysis because the remaining 3 items were found to assess knowledge of the first definition, which overlapped the concept of vocabulary size. It should be noted that the Size Test and Polysemy Tests 1 and 2 all assessed relationships between L2 form and L1 meaning but differed in their constructs in that the Size Test assessed primary meanings with higher frequency, Polysemy Test 1 assessed two frequent meanings, and Polysemy Test 2 assessed a less frequent meaning.
Finally, in the 20-item Collocation Test, test takers were required to select two L2 words that co-occurred with the L2 word provided. The collocation was either of an adjective + noun type or of a noun + noun type; these were selected from the Longman Dictionary of Contemporary English (5th ed.) and Oxford Collocations Dictionary for Students of English. Distractors were selected from words that least collocated with the stimulus word, and this assumption was confirmed by asking two experienced Japanese teachers of English and an English native speaker. We also examined the items by mutual information (MI) scores using the Corpus of Contemporary American English (COCA) and BNC, accessed through English-Corpora.org 2 , and found no major problems (see Appendix C in Supplementary Material for details). As collocation is a part of word association, the Collocation Test and Word Association Test partially overlap the constructs. However, we intended to assess wider areas of depth of vocabulary knowledge instead of avoiding the overlaps.
For the Polysemy 1 and Collocation Tests, one point was awarded when two correct options were selected, whereas for the Size, Word Association, and Polysemy 2 Tests, one point was awarded when the correct option was selected.

Analysis
The structure of the size and depth of vocabulary knowledge was examined by testing two variants of models that hypothesized the relationships among variables as single-factor or correlated two-factor models. These models are presented in Figures 1,  2. In each figure, the rectangles represent observed variables, the ovals represent latent factors, and the circles represent measurement errors or residuals. Models 1 and 2 were built based on the structures of vocabulary knowledge discussed in the literature (e.g., Vafaee, 2016). Model 1 had three indicators of size and four indicators of depth. Both size and depth were hypothesized as a single factor of vocabulary knowledge. In Model 2, the same indicators were used to hypothesize correlated but separate factors of size and depth (see details of the models below).
The observed variables in this study were composite scores aggregated using item-level dichotomous data. The unidimensionality of each observed variable was examined and confirmed before the aggregation (e.g., Little et al., 2002Little et al., , 2013Meade and Kroustalils, 2006).
After a preliminary analysis of score distribution and reliability, conventional and Bayesian SEM was conducted using Mplus (Version 8.3; Muthén andMuthén, 1998-2019; see Appendices D-H in Supplementary Material for Mplus codes used). There were no missing values. For scale identification, one loading from a factor was fixed to 1.00. Observed variables were standardized to ease interpretation of priors (Muthén and Asparouhov, 2012). For conventional SEM, the data were univariately normally distributed, as judged by the skewness and kurtosis values of |3.30| (the z score at p < 0.01; e.g., Tabachnick and Fidell, 2014) and histograms. The data were multivariately non-normal according to Mardia's multivariate normality test available in an R package, MVN (Korkmaz et al., 2019). To account for such multivariate non-normality, a maximum likelihood estimation with a robust standard errors method was employed for estimation. Models were judged using fit indices: a comparative fit index (CFI) of 0.90 or above (Arbuckle and Wothke, 1995), a root mean square error of approximation (RMSEA) of 0.08 or below, and a standardized root mean square residual (SRMR) of 0.08 or below (Hu and Bentler, 1999). The Akaike information criterion (AIC) and chi-square difference tests were used to compare models (see Mplus, 2019). With statistical non-significance, a more parsimonious model with fewer parameters to estimate (usually, a model with a greater number of degrees of freedom) was selected. Model fit and statistical criteria were used with substantive interpretability to evaluate each model.
For Bayesian SEM, Models 1 and 2 were examined by specifying a series of priors (i.e., prior parameter distributions). Bayesian SEM can include two types of priors: non-informative and informative. They differ in the degree of specification imposed on the models, with informative priors specifying the particular distribution of parameters, as compared to non-informative priors, which do not specify such particular distributions. Our analyses using conventional and Bayesian SEM were also different in terms of crossloadings and residual correlations. As mentioned above, Bayesian SEM can include not only major loadings but also cross-loadings and residual correlations that have small, non-major effects on the model (expressed as a dotted line) by specifying informative priors. Specifying approximate zeros is more realistic than specifying exact zeros (e.g., de Bondt and van Petegem, 2015).
In Model 1a, non-informative priors were specified for factor loadings, with normally distributed priors with a mean of zero and infinite variance, and for observed variable variances, with inverse gamma distribution priors with infinite means and variances. In Model 2a, non-informative priors were additionally specified for factor (co)variance(s), with an inverse-Wishart distribution prior with a mean of zero and the degree of freedom of the model. These specifications were the softwaredefault settings of Mplus. In Model 2b, informative priors were additionally specified for cross-loadings, with normally distributed priors with a mean of zero and a variance of 0.01. A variance of 0.01 results in 95% cross-loading limits of ± 0.20 (Muthén and Asparouhov, 2012). This means that factor loadings vary in size between ± 0.20, although their means are zero. For example, this permitted the modeling of the small, non-major effects of vocabulary depth on vocabulary size. In Models 1c and 2c, informative priors were additionally specified for residual covariances, with inverse-Wishart distribution priors with a mean of zero and the degree of freedom of the model.
Model convergence was judged using (a) potential scale reduction (PSR) values and (b) Bayesian posterior parameter trace plots showing little change at each iteration. For (a), the value should be 1.0 at convergence, but values of less than 1.1 are considered acceptable. Model fit was assessed using posterior predictive p values of near 0.5, with 95% confidence intervals (CIs) that should be symmetric and center around 0. The models were compared using the deviance information criterion and Bayesian information criterion. For details on these criteria for convergence, model fit, and model comparison (see Muthén and Asparouhov, 2012;de Bondt and van Petegem, 2015;Norouzian et al., 2018). Table 2 shows that internal consistency was high for all vocabulary tests (α = 0.74-0.88), except for the Word Association Test (α = 0.56). Pearson product-moment correlations between size and depth indicators ranged from small to moderate (0.297-0.793), which were lower than the criterion for concerns about multicollinearity (r≥ 0.9; Tabachnick and Fidell, 2014). Note that correlations between Polysemy 1 and Polysemy 2 and between Association and Collocation were not high (r = 0.498, 0.297). This suggests that these tests measure rather different, marginally overlapping constructs.

Descriptive Statistics
While detailed analysis is conducted using SEM, results of simple correlations between size and depth are reported for the sake of comparison with previous studies. Correlations of Size with Association, Polysemy 2, and Collocation were moderate, ranging from 0.422 to 0.530. These correlations were considered similar because their 95% CIs overlapped with each other. The correlation between Size and Polysemy 1 was strong (r = 0.814), and its 95% CI did not overlap with the CIs of the correlations of Size with Association, Polysemy 2, or Collocation. Differences between Polysemy 1 and 2 in relation to their correlations with Size arose mainly due to minor differences between constructs. Polysemy 1 assessed the knowledge of primary and secondary meanings, whereas Polysemy 2 assessed the knowledge of only the secondary meaning. Table 3, Models 1 and 2 fit the data well (e.g., SRMR = 0.024 and 0.018, respectively). A comparison of these two models revealed Model 2 to be the best model to represent the structure of vocabulary knowledge for the current data, as shown by a lower AIC (4,113.674 vs. 4,107.574 for Models 1 and 2, respectively) and the significant result produced by a chi-square difference test (the chi-square difference between the two models was 5.806, exceeding the critical value of 3.841 at p < 0.05). The standardized parameter estimates [see the column "Conventional SEM" (Model 2) in Table 4] show that each vocabulary factor was, overall, well explained by the tests (vocabulary size: 0.832 for Size123 to 0.899 for Size456; vocabulary depth 0.496 for Collocation to 0.901 for Polysemy 1). The vocabulary size and depth factors were highly correlated (r = 0.945). Thus, size and depth are considered to be separate but closely related. Table 5 shows the results for Bayesian estimation. Models 1a, 2a, and 2b converged, whereas Models 1c and 2c did not. For example, Model 1a had a PSR value of 1.001, which was very close to 1.0 and less than 1.1. Bayesian posterior parameter trace plots, although not reported here, showed a stable, horizontal band for the parameter in question. These results suggest the convergence of the parameters in the model. On the other hand, Models 1c and 2c failed to converge. For example, Model 1c had a PRS value of 5.463, which considerably exceeded 1.1. Bayesian posterior parameter trace plots, although not reported here, showed a widely fluctuating, horizontal band for the parameter in question. These results suggest that Models 1c and 2c displayed poor convergence for their parameters. N = 255. Size123 = size with 1,000-3,000 levels; Size456 = size with 4,000-6,000 levels; Size78 = size with 7,000 and 8,000 levels. [] = 95% confidence intervals (CIs). All correlations were statistically significant, p < 0.01. (1) Size was a composite of (1a) Size123, (1b) Size456, and (1c) Size78 and was not analyzed using not analyzed using structural equation modeling (SEM). Instead, these three variable were analyzed using SEM. A comparison of converged models -Models 1a, 2a, and 2b -revealed that Models 2a and 2b were statistically equally likely, which is supported by similar values for model fit indices. Nevertheless, Model 2b was considered to best represent the structure of vocabulary knowledge for the current data, given that it was sensible to specify small-variance cross-loadings: The role of vocabulary size was very small but not zero in the vocabulary depth tests, and the role of vocabulary depth was, likewise, very small but not zero in the size test. The standardized parameter estimates [see the column "Bayesian SEM" (Model 2b) in Table 4] show that each skill factor was, overall, well explained by the tests (vocabulary size: 0.817 for Size123 to 0.900 for Size456; vocabulary depth: 0.518 for Collocation to 0.877 for Polysemy 1). The vocabulary size and depth factors were highly correlated (r = 0.943). Cross-loadings were very close to zero. This shows that the size and depth measures were successful in assessing separate constructs. In conclusion, as with the results from conventional SEM, Bayesian SEM showed that both size and depth are separately modeled but closely related.

DISCUSSION
To examine the relationship between vocabulary size and depth for low-to intermediate-level Japanese learners of English using conventional and Bayesian SEM, the following research question was addressed: Which factor structure of the size and depth of vocabulary knowledge explains the data better, a single-factor or a correlated two-factor model? Vocabulary knowledge was modeled as a single factor (i.e., vocabulary size and depth as one entity) and as two correlated factors (i.e., vocabulary size and depth as separately conceptualized), and model fit was compared. The results of conventional SEM showed that the correlated two-factor model best explained the data. The results from Bayesian SEM showed that the bestfitting model was the correlated two-factor model with very small cross-loadings. For both models, vocabulary size and depth factors were highly correlated (r = 0.945 for conventional SEM and 0.943 for Bayesian SEM). Thus, both size and depth are closely related yet two separate constructs. It is worth recalling that the structure of language ability has been an important research area, and the current results of having two lexical components strongly correlated to each other would add to the existing literature of the multicomponential nature of language ability.
The strong relationship between size and depth is consistent with a previous study using SEM (Vafaee, 2016). The adoption of the correlated two-factor model over the single-factor model in the current study as well as in Vafaee (2016) suggests that, even with very high correlations of 0.9 or above (r = 0.943-0.945 in the current study; r = 0.94 in Vafaee, 2016), distinguishing the two factors better explains the data than analyzing them as one factor. This means that size and depth should be distinguished conceptually and statistically. In other words, a person who knows more words (i.e., vocabulary size) tends to have a deeper vocabulary knowledge (i.e., vocabulary depth), but researchers and practitioners should still consider size and depth separately. It should be recalled that Vafaee (2016) and the current study differ in (a) the measures used to assess size and depth, (b) the aspects assessed by depth tests, and (c) the participants' L2 proficiency levels and their L1: Vafaee (2016) used an aural size test of meaning recall and an aural depth test (the WAF) of selecting synonyms and collocations, whereas we used a written size of form recognition and four depth tests focusing on association, polysemy, and collocation. Vafaee's (2016) participants were Persian learners of English at lower-intermediate to advanced levels, whereas the current study's participants were Japanese learners of English at low to intermediate levels. These differences may suggest some degree of generalizability regarding the factor structure of size and depth.
However, both studies involved L2 adult learners of English as a foreign language and multiple-choice formats of assessing size and depth. The relatively similar type of learners and the use of the same formats for measuring size and depth may have produced similar results across studies. Additionally, there was an overlap in the assessed depth aspects, with collocation tested in both studies. Synonym in Vafaee (2016) and "association and polysemy" in the current study are also related to meaning and are more similar to size (defined as knowledge of a word form and a primary meaning) than are other aspects of depth such as word parts. Vafaee (2016) and the current study showed that size and depth can be separately modeled, even when depth is operationalized as something similar to size. Thus, we can assume that when depth is operationalized as something more different from size, size and depth can also be separately modeled, with different degrees of correlations expected between size and depth, but this requires empirical research. Note that the results in the final models suggest that association, polysemy, and collocation measures primarily assess depth, not size: In Model 2 in conventional SEM, the three depth aspects loaded on the depth factor only (β = 0.496-0.901); in Model 2b in Bayesian SEM, the three depth aspects loaded on the depth factor to a large degree (β = 0.518-0.877) and on the size factor to negligible degrees (β = -0.022-0.022).
The use of SEM can help clarify the latent relationships between size and depth, but it is usually difficult to compare the SEM results with previous studies using simple correlation. We suggest three ways for effectively using previous study results. First, it is possible to model relationships using SEM when previous studies report means and SDs of the variables, and all correlations between them (see In'nami and Koizumi, 2010Koizumi, , 2012bVafaee and Kachinske, 2019), to examine how the latent factors of size and depth are correlated. However, this is often difficult because of the lack of reports of necessary statistics (Larson-Hall and Plonsky, 2015; see also Kline, 2016, p. 65, for cases where summary statistics are not enough and raw data are required) and the research design of previous studies. For example, previous studies often used only one measure of size, and it is, therefore, difficult to model it as a size factor.
The second and third methods use simple correlations and other descriptive statistics. Simple comparisons between correlation coefficients are not very productive because they are often affected by measurement error and sample size. In the second method, using reliability coefficients reported by previous studies, researchers can estimate the strength of correlation coefficients in the case of perfect test reliabilities (i.e., with no measurement error), using a formula for correcting for attenuation (Glass and Hopkins, 1996): (Correlation coefficient between the first and second tests)/( √ [reliability coefficient of the first test] × √ [reliability coefficient of the second test]).
For example, the correlation between Size and Collocation was r = 0.422, with the reliability of the two tests being α = 0.91 and 0.77, in the current study. The corrected correlation is 0.575 (i.e., 0.422/[ √ 0.91 × √ 0.77]), which is higher than the original value, 0.422. Thus, the use of this method of correcting for attenuation allows researchers to examine relationships while at the same time accounting for measurement error. This concept is similar to the one used in SEM (Hancock and Schoonen, 2015). The  second method is relatively simple but is often difficult to execute because reliability results for all the variables are not always reported (Larson-Hall and Plonsky, 2015). Table 6 summarizes the previous studies reviewed in the Literature Review as well as the current study. It shows that the three studies using simple correlations (i.e., Schmitt and Meara, 1997;Vermeer, 2001;Akbarian, 2010) did not report reliability sufficiently, and this hampers the use of the second method. The third method uses 95% CIs of correlation coefficients. The use of CIs allows researchers to view sampling statistics (i.e., correlation coefficients) as values that fluctuate. In fact, researchers are encouraged to report CIs along with the point estimates (e.g., means and effect sizes; Norris et al., 2015;American Psychological Association, 2020). Although CIs are not always reported (Larson-Hall and Plonsky, 2015), CIs of correlation coefficients can be calculated using free online calculators 3 . The information of 95% CIs shows that if similar studies are conducted many times, 95% of those CIs will capture the population correlation. If 20 studies are conducted on the same topic, 19 (20 * 0.95) of those CIs will capture the population correlation. Considering CIs along with the point estimates allows researchers to interpret results more accurately. For example, correlations between Size and Collocation (r = 0.422) and between Size and Association (r = 0.530) appear different, but in fact they are not, considering the substantial overlap of their 95% CIs (0.315 and 0.518 for the former; 0.435 and 0.613 for the latter; see Table 2). Table 6 shows 95% CIs of the correlations of previous studies and the current study, suggesting a similarity of relationships. The results can be quite a contrast when compared with the results using correlation coefficients only. For example, Vermeer (2001) has a wide CI (e.g., 0.455, 0.868) because the number of participants is small (n = 25), and the lower end of the CI (0.455) is very close to the upper end of the CI in Schmitt and Meara (1997;0.065, 0.453). Schmitt (2014) suggested that different degrees of correlations between size and depth may be derived from different measures and different depth aspects measured, but different sample sizes and resulting measurement error may also be other factors.
3 https://www.psyctc.org/stats/R/CI_correln1.html; http://vassarstats.net/rho.html Future research should consider using the abovementioned three methods, especially CIs, for comparing previous studies. Additionally, other methods that would provide more precise estimates would be (a) a bootstrapping method (McLean et al., 2020), which is useful when primary data are available, and (b) meta-analysis, which can systematically integrate previous studies while taking sample size and measurement error into account (e.g., Plonsky and Oswald, 2015;In'nami et al., 2020). When researchers obtain a matrix of meta-analyzed correlation coefficients through (b), they can more rigorously examine relationships of size and depth using meta-analytic SEM (Cheung, 2015).
From a methodological viewpoint, it is important to note that for the best-fitting correlated two-factor model, the vocabulary size and depth factors were highly correlated (r = 0.945 for conventional SEM and 0.943 for Bayesian SEM with cross-loadings). Recall that Bayesian SEM was used in the current study to more flexibly examine the factor structure of vocabulary size and depth by specifying cross-loadings and residual correlations. Obtaining similar factor structures with similarly high correlations between vocabulary size and depth for conventional and Bayesian SEM indicates the robustness of such structures and correlations. This was revealed only after comparing the findings from conventional and Bayesian SEM approaches. It should be noted that this does not mean that conventional SEM is sufficient and the use of Bayesian SEM is redundant. Bayesian SEM allows for more varied specifications of parameter distribution that were not used in the current study. This advantage of Bayesian SEM should be best employed in future vocabulary studies.

CONCLUSION
We examined the factor structure of the size and depth of vocabulary knowledge with five tests focusing on size and depth (association, polysemy, and collocation). We found that a correlated two-factor model explained the data better than a single-factor model. As introduced in the Literature Review, Schmitt (2014) asked, "Do size and depth behave as separate constructs, " "or are they essentially the same construct?" (p. 941). Our answer to these questions based on the findings of the current study is affirmative for the first question: Size and depth can be considered separate constructs, even when depth is measured by tests assessing aspects related to meaning and more similar to size.
Our study highlights the importance of distinguishing size and depth as two correlated but separate aspects of L2 vocabulary knowledge. This finding has implications for practice and theory. First, for L2 vocabulary assessment, if the purpose of the tests is to assess overall vocabulary knowledge, both size and depth should be included in tests to minimize construct underrepresentation of vocabulary knowledge. Given that vocabulary knowledge consists of size and depth, the inclusion of both aspects should maximize and best represent the construct of vocabulary knowledge (see Tseng and Schmitt, 2008;Zhang, 2012;Koizumi and In'nami, 2013;Vafaee and Suzuki, 2020; for examples). Further, in scoring and interpreting tests that include size and depth, separate scoring and interpretation of size and depth is justified, even though the two sections can be combined into a total score, given the high correlation between size and depth. When test developers or users conduct a validation study of their vocabulary tests that include size and depth, they should consider modeling these two when possible so as to examine the factor structure of their tests. For example, they can model a size factor using different frequency band scores and a depth factor using different section scores assessing depth. Moreover, in L2 instruction, teachers need to consider enhancing both size and depth as possible instructional goals and design, and allocate tasks for increasing these lexical components in a course (see Nation, 2013;Webb and Nation, 2017;Newton, 2020, for suggestions on effective learning and teaching in class).
Second, for theory building, the use of SEM for model construction and testing is helpful in empirically investigating relationships by considering measurement error. This allows researchers to examine the relationship between constructs of interest while separately estimating the impact of measurement error. This is how the close relationship (more than r = 0.9) between vocabulary size and depth was revealed in the current study. It should be recalled that simple correlations between size and depth measures were not that strong (0.422-0.793), which clearly shows one of the strengths of SEM. Taking this step further, minor cross-loadings and minor residual correlations were specified by Bayesian SEM. This would not have been possible with conventional SEM. These features of Bayesian SEM should help construct more realistic models to test specific hypotheses.
We have expanded previous studies and examined three depth aspects (association, polysemy, and collocation) by using separate tests, not the WAF. However, our results may be limited, as we targeted only Japanese adult learners of English at low to intermediate levels. What is needed are studies with different types of learners of diverse L1 and L2.
Further, we used only one measure of size (with three indicators at different levels of word frequency) and four measures of depth. There are other aspects that are important but were not examined (e.g., spoken forms, word parts, grammatical functions), and future research should include measures of size and depth by using various instruments (e.g., Godfroid, 2020, for offline and online measures to cover Nation, 2020, vocabulary elements) to examine their relationships. Tests should be developed, and test validation should be conducted, by following the principles summarized by Schmitt et al. (2020).
Specifically, to improve measures in the current study, the following three points are stated. First, we used only the multiplechoice format. A problem with this format is that it allows guesswork and overestimates the scores. Gyllstad et al. (2015) showed that takers of multiple-choice tests make educated (e.g., using the knowledge of word family) as well as blind guesses. For a more valid measurement, recall and other formats should also be employed. Second, the vocabulary size test may have had a low sampling rate. According to Gyllstad et al. (2015), 30 words (taken from a pool of 1,000 words) function with greater precision than 10 words in a vocabulary size test. In the current test, we included a maximum of 40 items in the test battery (a total of 130 items to be attempted in 80 min) keeping in mind the issue of test takers' concentration. Future research should consider how to manage the balance between the assessment need for including more items -so as to have a representative sample of vocabulary -and the practical need to reduce the items (Gyllstad, 2020). Third, some items in depth tests may need improvement. In particular, the collocation test should use a standard recently employed for selecting collocations, such as a minimum frequency of 10-50 in a corpus and an MI score of 3.00 or more (Kremmel and Schmitt, 2016;Nguyen and Webb, 2017).
In order to provide researchers with a comprehensive picture of relationships between size and depth, the following are required: a wider range of participants, and size and depth measures with larger item size, better quality, and wider focus. Comparisons of size-depth relationships across different L2 proficiency groups using multi-sample SEM (In'nami and Koizumi, 2012a;Zhang et al., 2014) would further elucidate intricate relationships. It would also be possible to model various aspects of depth separately as latent factors to specify more precise models of size and depth (see Stewart et al., 2013, for an attempt at modeling spoken vocabulary knowledge, polysemous word knowledge, and contextual word knowledge). The current study's insight into the highly correlated but distinctive nature of size and depth should help researchers advance the understanding of the structure of vocabulary knowledge.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
RK conceptualized, designed, and conducted the study, organized the data, performed the conventional SEM, and wrote the first draft of the manuscript and revised it throughout. YI performed the Bayesian SEM, revised the manuscript and added sections to the manuscript. RK and YI contributed to manuscript revision and read and approved the submitted version.