The hidden depths of new word knowledge: Using graded measures of orthographic and semantic learning to measure vocabulary acquisition

We investigated whether the presence of orthography promotes new word learning (orthographic facilitation). In Study 1 (N = 41) and Study 2 (N = 74), children were taught 16 unknown polysyllabic words. Half of the words appeared with orthography present and half without orthography. Learning assessments captured the degree of semantic and orthographic learning; they were administered one week after teaching (Studies 1 and 2), and, unusually, eight months later (Study 1 only). Bayesian analyses indicated that the presence of orthography was associated with more word learning, though this effect was estimated with more certainty for orthographic than semantic learning. Newly learned word knowledge was well retained over time, indicating that our paradigm was sufficient to support long-term learning. Our paradigm provides an example of how word learning studies can look beyond simple accuracy measures to reveal the cumulative nature of lexical learning.

group); the remaining children did not receive these instructions (incidental group). Finally, we investigated the impact of spelling-sound consistency by including words that varied continuously on a measure of pronunciation variability (after Mousikou, Sadat, Lucas, & Rastle, 2017; see Method for more details). The quality of lexical representations was measured in two ways. A cuing hierarchy was used to elicit semantic knowledge from the phonological forms, providing a fine-grained measure of semantic learning. First, participants were asked to provide a definition. If their response was incorrect, they were given part of the definition (a cue) and asked to provide the rest. If their response was still incorrect, they were asked to select the definition from a choice of four. A spelling task indexed the extent of orthographic learning for each word. We sought to make the experimental paradigm as naturalistic as possible. Therefore, real words were taught, using an instruction and assessment approach adapted from standard educational and speech and language therapy practice. In Study 1, we measured knowledge of newly learned words at two intervals: one week and eight months after teaching. Longitudinal studies of word learning are rare and this is the first longitudinal investigation of orthographic facilitation. Study 2 extended the same experimental paradigm to a larger and more varied sample.

Study 1
Forty-one children aged 9-10 years completed the word learning task, followed by semantic and orthographic assessments one week after learning (Time 1), and eight months later (Time 2). Given the paucity of longitudinal data in word learning and orthographic facilitation research, we did not make predictions about the influence of time. We addressed three research questions:

1. Does the presence of orthography promote greater word learning? We predicted that children would demonstrate greater orthographic learning for words that they had seen (orthography present condition) versus not seen (orthography absent condition). We anticipated that orthographic facilitation might also be observed for semantic learning (Colenbrander et al., 2019).

2. Will orthographic facilitation be greater when the presence of orthography is emphasised explicitly during teaching? We expected to observe an interaction between instructions and orthography, with the highest levels of learning when the orthography present condition was combined with explicit instructions. However, this prediction was tempered by one study showing that this was not the case in younger children (Chambre et al., 2017).

3. Does word consistency moderate the orthographic facilitation effect? For orthographic learning, we expected that the presence of orthography might be particularly beneficial for words with higher spelling-sound consistency, with learning highest when children saw and heard the word and these codes provided overlapping information. For semantic learning, we reasoned that if the effect of orthography on semantic learning is driven by a beneficial effect of orthography on the learning of phonology (Ricketts et al., 2009; Rosenthal & Ehri, 2008), then orthographic facilitation will be greatest for word forms with more consistent spelling-sound mappings.

Method
Participants.
Participants were 41 children from one socially-mixed school in the South-East of England (Mage = 9.95, SD = .53, 24 female). All spoke English as a first language, and none had any recognised special educational need. Based on these responses, the original list of 50 words was ranked in order of words least well known by respondents. Two lists of eight words were then selected that could be matched for counterbalancing purposes. Words were matched exactly in pairs for number of morphemes (range = 1-2 morphemes) and syllables (range = 2-4 syllables) and the items in each pair were allocated to separate lists. Item lists were also matched closely (all Fs < 1) for adolescent survey ratings, number of letters (range = 6-11 letters), number of phonemes (range = 4-10 phonemes) and our measure of spelling-sound consistency (see below). Only one word in each list started with a vowel and initial consonants appeared a maximum of once in each list to avoid confusion amongst words. Care was taken to make sure that word meanings were not overlapping.
Spelling-sound consistency relates to the frequency with which letters correspond to sounds and vice versa. Spelling-sound consistency has been conceptualised carefully for monosyllabic words (Kessler & Treiman, 2001), but there is no consensus on how to capture consistency for polysyllabic words. We indexed consistency at the whole-word level using H, with greater magnitude indicating greater inconsistency (pronunciation variability).
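To make the direction of the measure concrete: assuming H is computed as Shannon entropy over the relative frequencies of the alternative pronunciations recorded for a word (one common formulation of the H statistic; the function name and input format below are our own illustration, not the authors' materials), a minimal sketch is:

```python
import math
from collections import Counter

def pronunciation_h(pronunciations):
    """Shannon entropy (H, in bits) over the relative frequencies of the
    pronunciations produced for one word. H = 0 when every speaker gives
    the same pronunciation; H grows as pronunciations become more variable."""
    counts = Counter(pronunciations)
    n = len(pronunciations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A word pronounced identically by everyone is maximally consistent (H = 0);
# an even split across two pronunciations gives H = 1 bit.
```

Under this formulation, a larger H corresponds to greater pronunciation variability, matching the direction described above.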

Experiment: procedure.
The experimental procedure is summarised in Figure 1. A pre-test was conducted to establish participants' knowledge of the stimuli. Then, each child was seen for three 45-minute sessions to complete training (Sessions 1 and 2) and post-tests (Session 3). Sessions were spaced one week apart to emulate the pace of topic-related vocabulary learning in school, and to allow for spaced teaching (Carpenter, Cepeda, Rohrer, Kang, & Pashler, 2012). This also enabled newly learned words to be consolidated during sleep (Henderson, Weighall, & Gaskell, 2013). The intended gap of seven days between sessions was achieved for most participants between Sessions 1 and 2 (56%; M = 7.37 days, SD = 1.09, range = 6-12 days) and Sessions 2 and 3 (71%; M = 7.00, SD = 0.55, range = 6-8).
Post-tests were then re-administered approximately eight months later at Time 2 (M = 241.58 days from Session 3, SD = 6.10).
All instructions, stimuli and feedback were pre-recorded by a native speaker of English and presented to participants via the E-prime 2.0 programme (Schneider, Eschman, & Zuccolotto, 2012a, 2012b). Instructions, feedback and orthography (where relevant) were also presented in written form on the screen. E-prime was used to randomise the order of presentation and record the accuracy of responses. Presenting information in this way also allowed us to ensure that the experiment was presented as intended, pronunciations were standard across children and exposures, and all children had the same opportunity to learn.
The second author conducted all experimental sessions with all children. Three children were excluded due to an administration error. They are not referred to in the participants section, nor are they included in any tables, figures or analyses.
--- Figure 1 about here ---

In the explicit condition only, the prompt 'for some of the activities, you will see the word written on the screen. You might find this helpful' was given once prior to each semantic-phonological training block. The pre-test provided one exposure to each phonological form; training provided a further 24 exposures. Children were exposed to word definitions 10 times and, for words in the orthography present condition, to orthography four times.
The phonological training block familiarised children with the new phonological forms. In an initial set of trials participants heard and repeated each word once (e.g., 'repeat epigram'). In the second set of trials they heard each word and then repeated it whilst simultaneously tapping out its syllables to draw attention to the phonological structure of the word (e.g., Lundberg, Frost, & Petersen, 1988; Yopp & Yopp, 2000). This allowed for four exposures to the phonological form per session (eight over training).
In the phonology-semantic block (see Figure 1), children completed five activities with each word, taking one word at a time: 1. repeat it (e.g., 'repeat epigram'); 2. listen to the word with its three-word definition (e.g., 'listen carefully / you don't need to do anything / epigram is a witty remark'); 3. listen to the word in sentence context (e.g., 'listen carefully / you don't need to do anything / Ed knew how to use a good epigram to keep his friends entertained over dinner'); 4. repeat the word with its definition (e.g., 'repeat after me / epigram is a witty remark'); and 5. repeat the word and definition again, replacing the middle word of the definition (an adjective) with a synonym (e.g., 'repeat it, but this time change the middle word to a different word that means the same thing / epigram is a witty remark'). All definitions followed a determiner-adjective-noun structure and included simple vocabulary to ensure understanding. Sentence contexts (15-16 words) included the target word and provided additional cues to meaning. All definitions and sentence contexts appear in the Appendix. Repetition trials were included to engage children in the task and the synonym substitution was included to encourage them to actively process the meaning of the word. For words trained in the orthography present condition, the orthographic form appeared during the passive activities: 2) listen to the definition; 3) listen to the word in sentence context.
The two phonology-semantic blocks allowed for 16 exposures to each phonological form, 10 exposures to the definition and, for orthography present items, four exposures to the orthographic form.

Semantic post-test.
The semantic post-test assessed knowledge for the meanings of newly trained words. We took a dynamic assessment or cuing hierarchy approach (Hasson & Joffe, 2007), providing children with increasing support to capture partial knowledge and the incremental nature of acquiring such knowledge (Dale, 1965).
Each word was taken one at a time and children were given the opportunity to demonstrate knowledge in three steps: definition, cued definition, recognition. In the definitions step, each child was asked, 'what does [word] mean?' If they were able to provide the target definition or a close approximation, the next word was presented. If not, they were given a semantic cue, using a set format: 'it is a type of [noun]. Can you tell me what type?' If the child provided the target adjective or a close synonym the next word was presented. Otherwise, the child was asked to select the correct definition from an array of four, comprising the target definition and three distractors.
For the recognition step, the distractors were identical to the target definition with the exception of the adjective, which was substituted with a plausible alternative (e.g., for epigram, target definition: 'a witty remark'; distractors: 'an unfunny remark', 'a kind remark', 'an indignant remark'). Adjectives were not used more than once across target and distractor definitions, and distractor adjectives that were similar in meaning to the target were avoided. Where possible, one distractor adjective was opposite in meaning to the target adjective (e.g., 'unfunny' for 'witty'). The four multiple-choice options for each word were presented on the screen in a grid format until a response was made. Position was randomised and participants heard each option once in order: top left, top right, bottom left, bottom right.
The semantic post-test score captured depth of semantic knowledge for the newly learned words. A score of three was allocated for a correct response in the definition task, two for a correct response in the cued definition task, one for a correct response in the recognition task, and zero if the item was not correctly defined or recognised. For this measure, the maximum score was 48 (24 per orthography condition). Reliability (Cronbach's α) was calculated for a binary score (1 = definition or cued definition, otherwise 0), and was acceptable (α = .71).
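The 3-2-1-0 scoring rule can be expressed as a small function (a Python sketch for illustration; the argument names are our own labels for the three steps described above, not the authors' code):

```python
def semantic_score(definition_ok, cued_definition_ok, recognition_ok):
    """Score one item on the cuing hierarchy: 3 points for a correct
    definition, 2 for a correct cued definition, 1 for correct
    recognition from four choices, 0 otherwise."""
    if definition_ok:
        return 3
    if cued_definition_ok:
        return 2
    if recognition_ok:
        return 1
    return 0

# Total score per child: the sum over the 16 items
# (maximum 48, 24 per orthography condition).
```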

Orthographic post-test.
This post-test was included to ascertain the extent of orthographic knowledge after training. Children were asked to spell each word and their spelling productions were transcribed so that they could be scored. Responses were scored using a Levenshtein distance measure, using the stringdist library (van der Loo, 2019) in R (R Core Team, 2018). This score indexes the number of letter deletions, insertions and substitutions that distinguish between the target and the child's response. For example, the response 'epegram' for target 'epigram' attracts a Levenshtein score of 1 (one substitution).
Thus, this score gives credit for partially correct responses, as well as entirely correct responses. The minimum (best possible) score is 0, with higher scores indicating less accurate responses.
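The scoring above used the stringdist library in R; the same metric can be sketched in Python (our own implementation of the standard Levenshtein algorithm, shown purely for illustration):

```python
def levenshtein(response, target):
    """Minimum number of single-letter insertions, deletions and
    substitutions needed to turn `response` into `target`
    (Wagner-Fischer dynamic programming, two rows at a time)."""
    prev = list(range(len(target) + 1))
    for i, r in enumerate(response, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a letter from the response
                curr[j - 1] + 1,          # insert a letter
                prev[j - 1] + (r != t),   # substitute (free if letters match)
            ))
        prev = curr
    return prev[-1]

# 'epegram' -> 'epigram' requires one substitution (e -> i), so the score is 1.
```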

Results
Analysis data and code are shared through an OSF repository accessible at: https://osf.io/e5gzk/?view_only=a43914620dae4cc1b56bf3c15ef8d6c6. We fitted Bayesian mixed-effects models using the brms (Bayesian regression models using 'Stan') library (Bürkner, 2017, 2018) in R (R Core Team, 2018). We adopted Bayesian rather than frequentist methods for three reasons. First, Bayesian approaches are highly flexible, enabling us to model the sequential and categorical nature of the semantic post-test responses. Second, while it is recommended that mixed-effects models fully take into account random effects (i.e. a maximal random effects structure; Barr, Levy, Scheepers & Tily, 2013), convergence issues are common in frequentist models (Meteyard & Davies, 2020). Bayesian models will typically converge to accurate effect estimates for any sample (Liddell & Kruschke, 2018).

Third, as we discuss below, Bayesian analyses allowed us to combine data sets in Study 2 without risk of elevating Type 1 error (Kruschke, 2013).
More generally, Bayesian models are scientifically advantageous because they yield accurate representations of the posterior distribution. For each parameter (including fixed and random effects), Bayesian models generated a probability distribution representing the differing probabilities of each potential value of the coefficient for an effect. This means that we were able to report the most probable value of the estimate for an effect, given the posterior distribution, data and model assumptions. In tables summarising our models (Tables 2-3), we report each estimate, along with its 95% credibility interval (lower and upper bound). The credibility interval indicates the range within which we can suppose that the "true value" of a parameter lies (see OSF: word-learning-supplementary_2020-09-30.pdf for a graphical illustration of this). In tables we also report the proportion of the distribution that sits either above or below 0, depending on the direction of the effect. That proportion indicates the probability of an effect in that direction. Where the lower and upper bounds of the credibility interval cross zero, the direction and the magnitude of effects are estimated with less certainty. To allow for comparison, equivalent frequentist models with p values are included in Supplementary Materials, though these were subject to convergence issues (see OSF: word-learning-supplementary_2020-09-30.pdf for details).
In the semantic post-test, participants worked their way through three steps, only progressing from one step to the next if they provided an incorrect response or no response. Given the sequential nature of this task, we analysed data using sequential ratio ordinal models (Bürkner & Vuorre, 2019). In sequential models, the probability that a response falls into a given category (out of k ordered categories) is modelled conditional on the response not having fallen into any of the preceding categories, given the linear sum of predictors. We estimate the k-1 thresholds and the coefficients of the predictors.
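How category probabilities arise in a sequential (stopping-ratio) model with a logit link can be illustrated as follows; the thresholds and predictor value below are invented for illustration and are not the fitted estimates:

```python
import math

def inv_logit(x):
    """Standard logistic function, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sequential_probs(eta, thresholds):
    """Category probabilities for K = len(thresholds) + 1 ordered
    categories under a sequential model: at each threshold tau_k, the
    response 'stops' in category k with probability inv_logit(tau_k - eta),
    conditional on not having stopped in an earlier category."""
    probs = []
    remaining = 1.0
    for tau in thresholds:
        p_stop = inv_logit(tau - eta)   # P(stop here | got this far)
        probs.append(remaining * p_stop)
        remaining *= 1.0 - p_stop
    probs.append(remaining)             # mass left for the top category
    return probs

# With four ordered response categories (scores 0-3), three thresholds are
# estimated; a larger eta shifts probability mass towards higher
# (deeper-knowledge) categories.
```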
Orthographic post-test performance was scored using a Levenshtein distance measure where 0 corresponds to an accurate response and higher scores indicate less accurate responses.
Because, for any response, the distance corresponds to the frequency of edits made, and because there is no upper limit to the potential number of edits, this outcome variable can be treated as count data and analysed under the assumption that values stem from a Poisson probability distribution (Gelman & Hill, 2007). This approach allowed us to estimate the effects of potential influences on scores, whilst allowing that many responses may be partially correct to varying degrees.
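Under this assumption, the probability of observing a given edit distance is given by the Poisson probability mass function; a minimal sketch (illustrative only, not part of the authors' analysis code):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson-distributed count with rate lam: the assumed
    distribution of Levenshtein scores given a predicted error rate."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# A low predicted rate concentrates probability on small edit distances
# (accurate spellings); higher rates spread mass over larger distances.
```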
For the semantic and orthographic models, we took a hypothesis-driven approach, estimating the fixed effects of time (Time 1 vs. Time 2), orthography (absent vs. present), instructions (incidental vs. explicit) and consistency (standardized H), as well as the interaction between orthography and instructions and the interaction between orthography and consistency. Levels of the three binary fixed effects were sum coded, with orthography as -1 (absent) vs. +1 (present), instructions as -1 (incidental) vs. +1 (explicit), and time as -1 (Time 1) vs. +1 (Time 2). Consistency (H), as a numeric predictor variable, was standardized to z scores before entry into the models. Models were specified to include maximal random effects (after Barr et al., 2013).

Estimates for the semantic model are reported in Table 2. At Time 2, there were fewer higher-category (definition and cued definition) responses and more recognition (category 1) responses, compared to Time 1. Importantly though, at Time 2, our estimates reveal good retention of knowledge about each word, as reflected in the high probability of recognition responses. There was some evidence that instructions influenced performance, with higher responses in the explicit than the incidental condition. The credibility intervals for instructions, consistency and the interactions (see Table 2) show that the evidence was not sufficient to resolve the magnitude or the direction of these effects.

Estimates for the orthographic model are also reported in Table 2; recall that higher scores correspond to less accurate responses. Spelling productions were more accurate for items taught in the orthography present condition, compared to the orthography absent condition. Other effects were estimated with less certainty.

Discussion
Phonological forms and meanings for sixteen polysyllabic words were taught, with half of the words taught with orthography present, and half without orthography. We measured learning for semantic and orthographic information just after teaching (Time 1), and eight months later (Time 2). We analysed our data using Bayesian mixed-effects models.
In relation to our hypotheses, there was evidence for orthographic facilitation, with more accurate spelling responses for words that had been taught with orthographic support than those taught without. In comparison, the orthographic facilitation effect was estimated with less confidence for our semantic learning measure. Stronger effects of orthography on the learning of orthographic rather than semantic information are congruent with previous findings (Colenbrander et al., 2019). We did not observe the hypothesised interactions between orthography and instructions, or between orthography and consistency. An advantage of using Bayesian models was that they allowed us to estimate the magnitude of effects so that we can quantify confidence about our findings, instead of using the significant/nonsignificant dichotomy. There was uncertainty in the estimation of the orthographic facilitation effect for semantic learning, and little confidence in the hypothesised interactions for both orthographic and semantic learning. This uncertainty could reflect limited power or minimal individual differences, and Study 2 set out to explore this possibility. Further discussion of Study 1 findings is included following Study 2, in the General Discussion below.

Study 2
Study 1 provided evidence for orthographic facilitation, though the effect was estimated with more certainty for orthographic than semantic learning. Analyses did not support the hypothesised interactions between orthography and consistency, or between orthography and instructions. In Study 2, the Study 1 sample was combined with an older sample of children (total N = 74) in order to increase diversity within the sample, and provide more power for analyses. Increasing sample size and then re-running analyses does not increase the Type 1 error rate in Bayesian analyses in the way that it does for more traditional significance testing (Kruschke, 2013). The research questions and hypotheses were the same as for Study 1 except that longitudinal analyses were not possible for Study 2. Participant characteristics are summarised in Table 1. The same exclusionary criteria and ethics procedures were used.

Materials and procedure.
These were identical to Study 1. For the background measures, one child from the additional older age group did not complete the TOWRE. For the experiment, there were now 37 participants completing each condition (explicit and incidental) and, for most, there was a 7-day gap between Sessions 1 and 2 (76%; M = 7.20 days, SD = .83, range = 6-12 days) and Sessions 2 and 3 (76%; M = 7.43, SD = 1.81, range = 6-17). Four children, including the three described for Study 1, were excluded due to an administration error. After the pre-test, a further 22 participants were excluded, including the seven described for Study 1, because they knew dormancy (n = 11), syncopation (n = 8), accolade (n = 5), cataclysm (n = 4), nonentity (n = 2) and debacle (n = 1). Excluded participants are not referred to in the participants section, nor are they included in any tables, figures or analyses. Reliability for the semantic and orthographic post-tests was acceptable for this larger sample (semantic: Cronbach's α = .72; orthographic: Cronbach's α = .74).

Semantic and orthographic post-tests.
We analysed post-test data to test our hypotheses and establish whether the magnitude of our effects would increase with a larger and more varied sample. Models were identical to those used for Study 1 but without the effect of time, including fixed effects of orthography, instructions, consistency, orthography x instructions and orthography x consistency and a maximal random effects structure (see Table 3). Compared to Study 1, the effect of orthography on semantic learning was estimated with more certainty (P = .93 vs. .86), indicating a trend for higher quality semantic responses when orthography was present, rather than absent (for marginal effects plots, see top panel of Figure 3). The increased probability is also consistent with the notion that the presence of orthography influences semantic learning, but that this effect is small and our study was underpowered to detect it. Other effects were estimated with uncertainty, as for Study 1.
Findings for the orthographic post-test also replicated Study 1 (for marginal effects, see bottom panel of Figure 3). There was clear evidence for more accurate spelling patterns when orthography was present rather than absent but other effects were not supported.

General Discussion
Children were taught phonological forms and meanings for 16 unknown polysyllabic words. Half of the words were taught with orthographic forms available, and the remaining words were taught without orthographic forms. Fine-grained measures of semantic and orthographic learning were used to ascertain lexical quality for the newly learned words. In line with our predictions, we observed orthographic facilitation: children were more likely to learn words that they had seen during training. This effect was robust for orthographic learning but less clear for semantic learning. We did not find evidence for our hypothesised interactions: that orthographic facilitation would be moderated by consistency or the instructions that children received. Particularly novel was the longitudinal aspect of our study. Post-tests were administered one week after the end of teaching (Studies 1 and 2), and eight months later (Study 1 only), and analyses showed that over this time frame knowledge was well retained and orthographic facilitation effects endured.

Orthographic facilitation for word learning
The presence of orthography resulted in more accurate spelling responses and shifted the weighting of semantic responses towards deeper semantic knowledge. For orthographic learning, this effect was robust. For semantic learning, it was less clear, though it was estimated with high probability, especially in Study 2, where analyses were better powered.
In a systematic review, Colenbrander et al. (2019) concluded that effects on orthographic learning are strong and consistent whereas effects on semantic learning can be nonsignificant (e.g., Chambre et al., 2017) or range from small to large (e.g., Rosenthal & Ehri, 2008; Ricketts et al., 2009). Colenbrander et al. concluded that the magnitude of the semantic learning effect could not readily be explained by differences in the teaching or assessment approach used in the studies. They called for further research. Indeed, many factors will determine whether an individual can learn a new word meaning, such as the learning context (e.g., in the classroom, in conversation, while reading, background noise), word characteristics (e.g., whether the word has multiple meanings, or whether its meaning is more concrete or abstract) and individual differences (e.g., pre-existing knowledge). It might be that in some cases the presence of orthography exerts only a small influence relative to these other forces.
However, this effect may still be important. Consistently encountering orthography with phonology and semantics may lead to subtle changes in lexical quality that promote reading comprehension (e.g., Perfetti & Hart, 2002). Furthermore, presenting orthography whilst teaching is a strategy that many teachers already use, and it is low cost in terms of time and resources (Ricketts et al., 2015). Even a small effect on learning words on one or two occasions in the classroom can accumulate over the many encounters with words that occur during each hour, each day, each year, resulting in a large effect across words, learning opportunities and development.
We hypothesised that the presence of orthography might be more beneficial to learning if it was explicitly emphasised. However, telling participants that orthography would be present for some items did not influence orthographic facilitation. Therefore, it seems that when orthography was there, children attended to it, even when their attention was not explicitly directed to it (see also Chambre et al., 2017). It is worth noting that our instructions were not very directive and placing more emphasis on processing the orthographic form might influence orthographic processing (see Chambre et al., 2020).

The role of consistency in word learning and orthographic facilitation
In this study, we deliberately characterised the spelling-sound consistency of words to see if this would moderate the orthographic facilitation effect. In so doing, we aimed to test a key mechanistic account of orthographic facilitation: that the presence of orthography confers an advantage on word learning via its impact on phonology. We reasoned that if this is the case, orthographic facilitation should be greater for more consistent items, where orthography is a more reliable cue to phonology. However, our models did not support an interaction between orthography and consistency. Our findings indicate that the presence of orthography promoted orthographic learning, and to a lesser degree semantic learning, irrespective of item-level consistency. Notably, whilst our findings resonate with some previous studies, it may be premature to conclude that the impact of orthography on word learning is not moderated by consistency. We hypothesised that orthographic facilitation would be greater when orthography-phonology mappings are more consistent. However, the opposite could also occur. Inconsistency may render items more salient, with inconsistent items attracting more attention than consistent items and therefore driving greater orthographic facilitation. Preliminary evidence for this idea comes from a study showing that less 'wordlike' stimuli can be more readily learned than more 'wordlike' forms (Storkel, Armbruster & Hogan, 2006). Another possibility relates to the orthographic skeleton proposal (Wegener et al., 2017), which suggests that when children hear a novel word, some orthography is activated on the basis of what they know about spelling-sound mappings. With this in mind, orthographic learning for consistent items in the orthography absent condition could already be quite high, with little room for improvement.
Therefore, the presence of orthography might be particularly beneficial for more inconsistent words with spelling patterns that would be harder to infer from phonology.
There are other more methodological reasons for remaining tentative about our consistency findings. First, the effect of orthography was limited for semantic learning. If this reflected insufficient statistical power, this may also have constrained any interactions.
Second, there was not much variation in our consistency measure across items (see Rastle et al., 2011). A study that included a greater number of words and therefore a greater range of consistency would be useful, as would further exploration of the appropriate way to capture consistency in multisyllabic words. In our study, we captured consistency from orthography to phonology (variation in pronunciation), though in English this is not the same as phonology-orthography consistency, and the latter will be more important in underpinning spelling generation. Further, as in the study of monosyllabic consistency, it would be beneficial to consider more carefully the locus of inconsistency (e.g., vowel vs. consonant) and how consistency can be conditional on context (Kessler & Treiman, 2001).

Lexical learning over time
In Study 1, children completed post-tests one week after teaching ended, and eight months later. Tracking learning of specific words over more than a few days or weeks is extremely unusual (for an exception, see Gellert & Elbro, 2013) and our findings are quite striking: our paradigm supported lexical quality that was well maintained over time.
Orthographic knowledge did not degrade with time and semantic knowledge was well retained, despite no intervening teaching. It is possible that children were exposed to these words in the interim. However, our pilot data (see Method) showed that older adolescents knew little about these words, indicating that this is unlikely. As a cautionary measure, teachers were not given the list of words until after data collection was complete. Notably, semantic responses indicated deeper knowledge of meaning one week after learning, compared to eight months later. Nonetheless, at both time points children exhibited semantic knowledge about many words that was durable and at least sufficient to support recognition of the correct definition. This level of knowledge may well support a range of language processing tasks. For example, even minimal semantic knowledge of debacle, when combined with other knowledge and skills, could allow for the successful comprehension of a text that includes this word.

The importance of using fine-grained outcome measures
Our measures of learning were novel in going beyond simple accuracy to capture knowledge in a fine-grained manner. For orthographic learning, we administered a spelling task, widely argued to be a precise measure of orthographic representations (cf. Andrews, Veldre, & Clarke, 2020). Rather than analysing binary accuracy, as is typical, we gave credit for partially correct responses, indexing the distance between spelling responses and targets. The semantic post-test followed a 'cuing hierarchy' (Hasson & Joffe, 2007) or 'dynamic assessment' approach, providing progressively greater support for performance so as to capture the depth of knowledge learned.
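To illustrate how such graded measures can be operationalised, the sketch below implements a partial-credit spelling score based on Levenshtein edit distance, alongside an ordinal coding of the cuing hierarchy. This is a minimal illustration under our own assumptions (the normalisation choice and level labels are ours), not the exact scoring scheme used in the study:

```python
from enum import IntEnum


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def spelling_score(response: str, target: str) -> float:
    """Graded spelling score in [0, 1]: 1 = exact spelling, 0 = no overlap.
    Normalised by the longer string so partial attempts earn partial credit.
    (The normalisation is an illustrative assumption.)"""
    response, target = response.lower().strip(), target.lower().strip()
    denom = max(len(response), len(target)) or 1
    return 1 - levenshtein(response, target) / denom


class SemanticLevel(IntEnum):
    """One possible ordinal coding of the cuing hierarchy: higher values
    reflect success with less support."""
    NONE = 0         # no correct response at any step
    RECOGNITION = 1  # selected the definition from four choices
    CUED = 2         # completed the definition given a partial cue
    DEFINITION = 3   # produced the definition unprompted
```

On this scheme, a near-miss such as "debackle" for target "debacle" scores 0.875 rather than simply counting as an error, and semantic responses can be analysed as ordered levels rather than as binary accuracy.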
These measures allowed us to look below the 'tip of the iceberg' and capture the partial knowledge that may lie beneath a simple correct or incorrect classification. Lexical learning must be incremental, and our measures capture that. By taking this approach, we were able to observe the way that time, and to some extent orthography, changed the contribution of correct recognitions and cued definitions to responses. Measuring simple accuracy would have obscured these effects. In addition, had we used definition accuracy as our outcome, we would have concluded that our paradigm did not teach semantic information, as there were very few correct definition responses. Indeed, our learning task was challenging. Though we provided more teaching than is usual, we taught 16 complex forms with richer meanings than are typically presented in the field (for a review, see Colenbrander et al., 2019). By measuring partial knowledge, it was clear that our paradigm was sufficient to support substantial learning: cued definition and recognition responses together accounted for 80% of responses. This sensitivity in measurement recommends our approach to future research and brings it closer to practice. It is important to know how close children are to knowing word forms and meanings, not just whether they know them or not.

Strengths and limitations
In order to maximise the relevance of this study to practice, we drew heavily on educational and speech and language therapy expertise, discussing our methods with school teachers and speech and language therapists. We adopted an unusually naturalistic approach, teaching real words over multiple sessions and carefully selecting words that were just beyond the reach of our participants. This was balanced with idealised learning conditions, where teaching was one-to-one and distractions were minimised. As discussed above, our outcome measures were sensitive to the incremental nature of learning. Our approach was also evidence informed. We aligned our teaching and assessment approach with memory and learning research that highlights the importance of spacing (Carpenter et al., 2012) and sleep-related consolidation (e.g., Henderson et al., 2013).
One clear limitation of our study is sample size, an issue that plagues learning and longitudinal research because such research is costly and resource intensive. Given that the effect of orthography might be small for semantic learning, or in the real world where learning takes place amongst distractions, larger studies are particularly warranted. As discussed above, our measure of consistency would benefit from further consideration. Finally, for our measure of semantic learning, we provided the phonological form and requested information about meaning. Given the link between orthography and phonology, it may be that orthographic facilitation is greater for tasks that require phonological output. There is some evidence for this (Miles, Ehri, & Lauterbach, 2016; Ricketts et al., 2015, though see Colenbrander et al., 2019), and a large body of evidence supports orthographic facilitation for phonological form learning (e.g., Ehri & Wilce, 1979; Reitsma, 1983). We did not measure phonological learning separately but rather sought to 'pre-train' phonological forms so that we could focus on the learning of semantics and phonology-semantic mappings. Had we measured semantic learning using tasks that require production of the phonological form, or measured phonological learning separately, we would likely have observed stronger orthographic facilitation effects.

Conclusion
In conclusion, the presence of orthography promoted higher quality lexical representations, particularly in terms of orthographic learning. We did not find evidence that the presence of orthography was more beneficial when it was made explicit, suggesting that the effect of orthography was somewhat automatic. Consistency did not influence orthographic facilitation either, and further empirical work is needed to specify how orthography exerts its influence on vocabulary acquisition. Our study provides novel evidence that relatively short learning paradigms can lead to lexical knowledge that is well retained over an extended time frame. In addition, it highlights the importance of using measures of learning that probe the incremental nature of word knowledge, rather than crude accuracy measures that might mask learning. Future studies that capture the incremental nature of word learning will not only inform theory, but will also resonate with vocabulary teaching practice, where even small changes in knowledge may be important for boosting spoken and written language processing.