Validity and sensitivity of the phonics screening check: implications for practice

Background: Introduced in June 2012, the phonics screening check aims to assess whether 6-year-old children are meeting an appropriate standard in phonic decoding and to identify children struggling with phonic skills. Aims: We investigated whether the check is a valid measure of phonic skill and is sensitive in identifying children at risk of reading difficulties. Sample: We obtained teacher assessments of phonic skills for 292 six-year-old children and additional psychometric data for 160 of these children. Methods: Teacher assessment data were accessed from schools via the local authority; psychometric tests were administered by researchers shortly after the phonics screening check. Results: The check was strongly correlated with other literacy skills and was sensitive in identifying at-risk readers. So too were teacher judgements of phonics. Conclusions: Although the check fulfils its aims, we argue that resources might be better focused on training and supporting teachers in their ongoing monitoring of phonics.

instruction is more effective when delivered in the context of a broader literacy curriculum, rather than in isolation (Camilli, Vargas & Yurecko, 2003; Stuebing, Barth, Cirino, Francis & Fletcher, 2008). These research findings are reflected in the current practice of most schools in England, following the Independent Review of the Teaching of Early Reading (Rose, 2006). Amongst the review's key recommendations was that phonics should be taught as the primary approach to learning to read and write and that such teaching should be embedded within a broad language and literacy curriculum (p. 70).
An additional recommendation following the Rose Review was that teachers assess children's phonic skills throughout the first 3 years of formal education (ages 4-7 years). Where implemented, this is typically achieved by teachers tracking children's progress through a series of developmental 'phonic phases', which are outlined in the teaching handbook Letters and Sounds (Department for Education and Skills, 2007). The phases move from demonstrating attention to sounds in the environment (Phase 1) to confident and fluent use of grapheme-phoneme correspondences for reading and spelling unfamiliar words (Phase 6). Moreover, although there are concerns about the ability of teachers to judge the reading skills of their pupils (see Madelaine & Wheldall, 2010, for a review), a number of UK studies have shown that teachers well versed in phonic strategies and monitoring procedures can provide reliable estimates of children's reading abilities as measured by objective tests (Snowling, Duff, Petrou, Schiffeldrin & Bailey, 2011; Snowling, Hulme, Bailey, Stothard & Lindsay, 2011; Snowling et al., 2009).
Despite the focus of government policy on the implementation of systematic phonics in recent years, the proportion of pupils leaving primary school with the expected level in English had stalled at around 80% (Department for Education, 2010). In response to this, in 2012, the UK coalition government introduced a statutory check on early reading progress: the phonics screening check. The stated purpose of the check is 'To confirm whether individual pupils have learnt phonic decoding to an appropriate standard'. Consequently, 'Pupils who have not reached this standard at the end of Year 1 should receive support from their school to ensure they can improve their phonic decoding skills' (Department for Education, 2012a). Thus, the check forms part of an aim to identify and help struggling readers early in their literacy development (Department for Education, 2010).
The phonics screening check was administered in all maintained schools in England for the first time in June 2012. The check comprises 40 items: 20 real words and 20 pseudowords. All items are phonically regular, ranging from items with three letters in a consonant-vowel-consonant format (e.g., the pseudoword pib) to two-syllable words containing consonant clusters and vowel digraphs (e.g., the word portrait). To 'meet the standard', children had to read at least 32 of the items correctly without support or prompting. Figures show that 58% of pupils met this standard in 2012, and 69% in 2013 (Department for Education, 2012b, 2013b). Children's scores are reported to their parents/guardians and to the Department for Education, and each school's results can be used as evidence within school inspections.
The phonics screening check has been met with considerable controversy. Educators have questioned its necessity, voicing concerns about whether the check will add any valuable information to what teachers already know about their pupils' progress (e.g., National Union of Teachers, 2012). There have also been objections to the statutory nature of the check, with concerns about the resource implications of mandatory testing and the negative consequences when such tests become 'high-stakes' (e.g., Association of Teachers and Lecturers, 2011; Brooks, 2010). Indeed, a survey of nearly 3000 teachers, conducted after the administration of the check but before its results, reported that 87% of respondents did not agree with the statutory implementation of the check and thought that it should be discontinued (ATL/NAHT/NUT, 2012).
In this paper, we ask two critical questions about the phonics screening check: First, is the check valid (i.e., does it function as a useful measure of decoding skills); and second, is the check sensitive (i.e., can it identify children who are showing early signs of being at risk of a reading difficulty)? We also consider whether, given our findings, it is necessary. Answering the first question is relatively straightforward: We can investigate how well scores on the check correlate with reading skills measured by objective tests. The second is more difficult because there is no 'gold standard' for the identification of reading difficulty, and any cut-off between 'impaired' and 'normal' reading is arbitrary. The recently published American Psychiatric Association diagnostic manual (American Psychiatric Association, 2013) suggests low achievement is indicated by a score of at least 1.5 standard deviations (SDs) below average, translating to a standard score of 78 (p. 69). However, most agencies would contend that this is much below the desirable level of literacy for the UK population. We propose that an attainment level of 1 SD below the age norm (a standard score of 85) would be sufficient to place a child 'at risk' of significant underachievement, and hence, we use this as a cut-off against which to determine the sensitivity of the check.
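The relationship between the cut-offs quoted above and standard-deviation units can be illustrated with a short sketch. It assumes the conventional standard-score scale with a mean of 100 and an SD of 15; the SD is not stated explicitly in the text but is implied by "1 SD below the age norm (a standard score of 85)". The function names are hypothetical and chosen only for illustration.

```python
# Sketch only: converts SD units to standard scores on an assumed
# mean-100, SD-15 scale. Function names are illustrative, not from the paper.

POPULATION_MEAN = 100.0
POPULATION_SD = 15.0  # assumption: implied by "1 SD below" = standard score of 85


def standard_score(z: float) -> float:
    """Convert a z-score (SD units relative to the age norm) to a standard score."""
    return POPULATION_MEAN + POPULATION_SD * z


def at_risk(score: float, cutoff_z: float = -1.0) -> bool:
    """Flag a standard score that falls below the chosen cut-off."""
    return score < standard_score(cutoff_z)


if __name__ == "__main__":
    print(standard_score(-1.0))  # the 'at risk' cut-off proposed in this paper
    print(standard_score(-1.5))  # the stricter 1.5 SD criterion
    print(at_risk(84.0), at_risk(85.0))
```

On this scale, 1.5 SDs below the mean is 77.5, which corresponds to the rounded value of 78 quoted above, and the proposed 1 SD criterion gives exactly 85.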
To answer the preceding questions, we analysed attainment data from eight primary schools in York, selected to be representative of their particular local authority; these data included phonics screening check results and teacher assessments of children's phonic phase attainments. A subsample of children from these schools was seen on an individual basis to obtain psychometric data from tests of reading and associated skills, which were then related to the teacher's assessments.

Participants
Eight primary schools participated in this study. They were chosen from 18 that volunteered and were selected to be representative of their local authority. The sample included schools in city centre and suburban settings, with catchment areas of varying socioeconomic status (SES). School-based assessment data were collected from 292 Year 1 children (5- to 6-year-olds); 160 of these children were also seen individually by researchers for more in-depth testing (mean age = 6 years 4 months [SD = 4 months]). No exclusion criteria were applied. Parental consent was obtained for all children to take part in the individual testing sessions. Background information was available for all children with respect to SES (measured by receipt of free school meals [FSM] and the Index of Multiple Deprivation [IMD] based on home postcode information), whether they had English as an additional language (EAL) and whether they were on the Special Educational Needs (SEN) register. For the full sample, 15% received FSM, 12% were on the SEN register and 6% had EAL, and the average IMD score was 15.32 (11.98), where a higher score reflects greater deprivation. For the subsample, 10% received FSM, 8% were on the SEN register, 5% had EAL and the average IMD score was 13.32 (10.93). Averages for state-funded primary schools in England for 2011 were 15% FSM (Department for Education, 2011a), 19% SEN (Department for Education, 2011b) and 17% EAL (Department for Education, 2011a); and the average IMD in 2010 was 21.67 (Department for Communities and Local Government, 2011).

Materials
School-based assessments
• Phonic phases. Introduced in the Rose Review (Rose, 2006), this hierarchical scale of six phases provides teachers with descriptors of phonic knowledge and skills, from the earliest attention to sounds in the environment to decoding and spelling multi-syllabic words with alternative grapheme-phoneme correspondences. Teachers assess each pupil against the phases, making a 'best-fit' judgement of the phase that most accurately describes the pupil's current attainment in phonics. The expected level of attainment for children at the end of Year 1 is phonic Phase 5. Judgements for the current study were recorded in May 2012.
• Phonics screening check. As described earlier, this is a 40-item test, comprising 20 real words and 20 pseudowords, administered one to one by a teacher with no prompting or help. The pupil's responses are scored as correct or not (internal reliability, α = .96; Standards & Testing Agency, 2012).
• Spelling. A spelling task was group administered by Year 1 teachers to all children in their class. This was adapted from the British Ability Scales II (Elliot, Smith & McCulloch, 1997). There were 14 items: on, and, go, sit, was, home, play, that, are, well, bird, boat, friend and know. Children were required to spell to dictation: Each word was said in isolation, embedded in a sentence and repeated in isolation. Children were not credited for reversed letters.

Individual assessments
• Word reading. The Single Word Reading Test from the York Assessment for Reading Comprehension (YARC; Snowling et al., 2009) was administered to assess word reading accuracy. This test comprises 60 words of increasing difficulty and includes both regular and exception words. Testing was discontinued after five consecutive errors (internal reliability, α = .98).
• Reading comprehension. Reading comprehension was assessed via the YARC Passage Reading test (Snowling et al., 2009). All children were administered the Level 1 passage, which was devised to be suitable for children in Year 1. Children were timed while they read the passage aloud. They were then asked eight open-ended questions about the story. Measures of prose reading accuracy, rate and comprehension were derived. Reading was discontinued if more than 15 errors were made while reading the passage aloud. In these instances, an accuracy score of 16 errors was assigned, timing was not recorded and the comprehension questions were not administered, yielding a score of 0 (reliability coefficient, α = .64).
• Nonword reading. The Graded Nonword Reading Test (Snowling, Stothard & McLean, 1996) was used to measure decoding skills. Children were presented with nonwords that increased in difficulty from 'hast' to 'sloskon' and were asked to read them aloud. There were five practice items and 20 test items, and the task was discontinued after six consecutive errors (internal reliability, α = .96).
• Phonological awareness. Phonological awareness was measured using the Sound Deletion subtest from the YARC. Children were presented with spoken words and corresponding colour pictures, asked to repeat the word and then to delete a sound (e.g., say 'sheep' without the 'sh'). Children completed all 12 items, which tapped deletion of syllables and phonemes in initial, medial and final positions (internal reliability, α = .93).
• Expressive Vocabulary. The Expressive Vocabulary subtest from the Clinical Evaluation of Language Fundamentals IV (Semel, Wiig & Secord, 2003) was used to assess production of single words. Children were asked to name pictures that increased in difficulty. There were 27 items, and the task was discontinued after seven consecutive errors (internal reliability, α = .85).
• Mathematics. The One Minute Addition Test and One Minute Subtraction Test provided measures of children's mathematical skills (Westwood, Harris-Hughes, Lucas, Nolan & Scrymgeour, 1974). Children had 1 minute to complete as many additions as possible, followed by 1 minute for subtractions. There were 30 items on each test. Children were credited for reversed numbers (test/re-test reliabilities, α = .91 and .90; K. Moll, personal communication, 23 January 2013). Children's scores on the two tests were summed to form a combined mathematics score.

Procedure
Children were tested individually for one 30-minute session in the 3 weeks after the national phonics screening check took place in 2012. The tasks were administered in the following order: Expressive Vocabulary, Single Word Reading Test, Sound Deletion, Graded Nonword Reading Test, One Minute Addition and Subtraction Tests and Passage Reading. The spelling task was administered by teachers on a group basis also within this 3-week period.

Results
The attainments of the full sample and subsample on measures of language, literacy and maths are summarised in Table 1. As a group, the subsample's performance on all standardised reading measures was above the population mean of 100 but within the average range. The distribution of the scores on the phonics screening check from our full sample is shown in Figure 1. This shows that the data are not normally distributed. Notably, although there was a long tail in the distribution, some 19% of the sample scored between 30 and 33, with a sudden jump in frequency for those attaining scores of 32-33 compared with 30-31 on the test. With respect to teacher judgements of phonic skill, the number of children distributed across each phase was as follows: Phase 1 = 2, Phase 2 = 15, Phase 3 = 28, Phase 4 = 88, Phase 5 = 111 and Phase 6 = 47.

Is the phonics screening check valid?
The validity of an instrument refers to whether it measures what it claims to. This was assessed by correlating the phonics screening check with teacher judgements of phonic skill and with various standardised measures (Table 2). Owing to the nonnormal distribution of the phonics screening check (Figure 1), Spearman's correlations were run. Only a limited number of variables were collected for the full sample. However, the pattern of correlations between these variables and the phonics screening check is very similar across the full sample and subsample. Thus, reporting will focus on the correlation coefficients from the subsample.
The phonics screening check correlates strongly with teacher judgements of phonic phases (r = .72) and with standardised measures of reading accuracy (nonword reading, single-word reading and prose reading accuracy, r's = .75-.83) and spelling (r = .72). It also correlates well with phoneme awareness, prose reading rate and comprehension (r's = .57-.68). In contrast, there are more moderate correlations between the phonics screening check and vocabulary and maths (r's = .45), indicating that the check is more specific to the domain of literacy and does not simply measure general abilities. Thus, the phonics screening check shows convergent and discriminant validity.

Is the phonics screening check sensitive?
A score of 32 out of 40 was set as the threshold for the phonics screening check in 2012 and 2013; that is, children at or above this level are said to have met the required standard of phonic decoding (Department for Education, 2012b, 2013b). Nationally, 58% of pupils met this standard in 2012 (Department for Education, 2012b). In the current sample, 194 of 292 children met the standard (66%); in the subsample, 115 of 160 met the standard (72%). As with any new screening test, it is important to compare its ability to identify risk for difficulties with that of more established approaches. Thus, the classification function of the phonics screening check was compared with that of standardised measures of reading, and a routine teacher assessment (phonic phase judgements). In terms of the standardised measures, the subsample was performing in the high-average range. On these standardised measures, we defined risk of a reading difficulty as a score of more than 1 SD below the population mean (i.e., a standard score of <85) on one of two tests: single-word reading or prose reading rate (see Note 1). In this way, 16 out of 160 children (10%) met the criteria for being at risk of reading difficulties (nine children based on single-word reading accuracy and seven (non-overlapping) children based on prose reading rate). With respect to teacher judgements of phonic phases, risk of a reading difficulty was defined as scoring less than the expected level for children at the end of Year 1 (i.e., <5; see Note 2). This method identified 133 children (46%) in the full sample and 62 children (39%) in the subsample. Table 3 shows the classification of children according to whether or not they 'miss' the expected standard on the phonics screening check (<32) as a function of their risk of reading difficulties (i.e., whether or not they 'miss' the expected standard on standardised tests of reading [<85] or on the teacher assessment of phonic phases [<Phase 5]).
Sensitivity gives a measure of those who are identified as at risk of a reading difficulty on the basis of the standardised reading tests or phonic phases and who are also identified as being at risk on the phonics screening check (a raw score of <32); that is, the rate of true positives. Specificity gives a measure of those who are not defined as at risk of a reading difficulty on the basis of the standardised reading tests or phonic phases, nor on the phonics screening check (i.e., the rate of true negatives).
Notes (Table 2): Correlations are above the diagonal for the subsample (n's = 117-160) and below the diagonal for the full sample (n's = 216-291). For prose reading accuracy and rate, a higher raw score represents poorer performance. ***p < .001, **p < .01, *p < .05.

When the classification function of the phonics screening check (i.e., its ability to detect risk for reading difficulty) is compared with that of standardised tests of reading, the phonics screening check identifies all but two of the 16 children considered at risk of reading difficulties; its sensitivity is .88 (Table 3). However, it also identifies 31 children who were not classified as at risk (its specificity is .82). This suggests the phonics screening check is good at identifying children who are truly at risk of reading difficulties but slightly overestimates the number of children at risk. In comparison, when the classification function of the phonics screening check is compared with that of teacher judgements of phonic phases, the check shows reduced sensitivity in identifying risk (.60 for the subsample and .61 for the full sample), but increased specificity (.92 for the subsample and .90 for the full sample). Finally, we evaluated how effectively teachers might identify risk for reading difficulty, compared with identification through standardised reading tests. This was carried out by comparing rates of classification of risk from the phonic phases to those from the standardised reading tests. The phonic phases identify all but 1 out of 16 children who are at risk of reading difficulties (its sensitivity is .94) but also identify 47 children who are not deemed at risk (its specificity is .67). This suggests that when using phonic phases, teachers are very good at identifying children truly at risk of reading difficulties but that the number of children at risk is overestimated, more so than by the phonics screening check.
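The sensitivity and specificity calculations described above can be sketched as follows. This is an illustration rather than the analysis code used in the study. The cell counts for the phonic phases are taken from the text (15 of 16 at-risk children identified, 47 children flagged who were not at risk); the number of children not at risk is assumed to be 160 minus 16, which presumes complete data for the whole subsample. For the phonics screening check, only sensitivity is computed, because the full cell counts are in Table 3, which is not reproduced here. The class name `Confusion` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 classification table: screening result versus reference criterion."""
    true_pos: int   # at risk on the reference, flagged by the screen
    false_neg: int  # at risk on the reference, missed by the screen
    false_pos: int  # not at risk on the reference, flagged by the screen
    true_neg: int   # not at risk on the reference, not flagged

    @property
    def sensitivity(self) -> float:
        """Proportion of at-risk children that the screen identifies."""
        return self.true_pos / (self.true_pos + self.false_neg)

    @property
    def specificity(self) -> float:
        """Proportion of not-at-risk children that the screen leaves unflagged."""
        return self.true_neg / (self.true_neg + self.false_pos)


AT_RISK = 16
NOT_AT_RISK = 160 - AT_RISK  # assumption: all 160 children have complete data

# Phonic phases (<Phase 5) against standardised reading tests (<85).
phases = Confusion(true_pos=15, false_neg=1,
                   false_pos=47, true_neg=NOT_AT_RISK - 47)

# Phonics screening check (<32): 14 of the 16 at-risk children identified.
check_sensitivity = 14 / AT_RISK

if __name__ == "__main__":
    print(f"phases sensitivity = {phases.sensitivity:.2f}")
    print(f"phases specificity = {phases.specificity:.2f}")
    print(f"check sensitivity  = {check_sensitivity:.3f}")
```

Under these assumptions the phonic-phase figures come out at .94 and .67, matching the values reported above, and the check's sensitivity is 14/16 = .875, reported as .88.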

Is the phonics screening check necessary?
We were also interested in whether the phonics screening check added any value to what teachers who were well trained in phonic teaching and assessment already know about children's literacy skills, on the basis of their routine assessments. The strength of the correlations that the phonics screening check and the phonic phase judgements had with all the literacy measures was compared (from Table 2). There were no significant differences in how the assessments correlated with spelling, single-word reading, prose reading accuracy or prose reading rate (differences in r = .00-.06, Z's = −1.43 to 0.00, p's = .152 to <.001). However, the phonics screening check correlated more strongly with nonword reading than did phonic phase judgements (difference in r = .12, Z = 3.24, p = .001); but the phonic phase judgements correlated more strongly with reading comprehension than did the phonics screening check (difference in r = −.09, Z = −1.99, p = .047).
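As a rough check on how the Z statistics above relate to their p values, the sketch below converts a Z value to a two-tailed p value under the standard normal distribution. The text does not state which test for comparing dependent correlations produced the Z values, so only the final Z-to-p step is shown; the function name `two_tailed_p` is hypothetical.

```python
import math


def two_tailed_p(z: float) -> float:
    """Two-tailed p value for a standard normal Z statistic.

    Uses the identity P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    for z in (3.24, -1.99, -1.43):
        print(f"Z = {z:+.2f} -> p = {two_tailed_p(z):.3f}")
```

For Z = 3.24 and Z = −1.99 this gives p values that round to .001 and .047, in line with the figures reported for nonword reading and reading comprehension.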

Discussion
In June 2012, the UK government introduced a new statutory assessment to assess the phonic decoding skills of Year 1 pupils. The aims of this phonics screening check are to ascertain whether pupils' phonic skills are at an appropriate standard, to identify children who may be struggling in this area and ultimately to raise literacy standards. We obtained the results of the phonics screening check from a sample of Year 1 pupils in one local authority, together with teacher assessments of phonic phases and psychometric test data. In light of the various criticisms levelled at the phonics screening check, we assessed whether the check was valid and sensitive; we also considered its necessity.
The validity of an instrument refers to whether it measures what it claims to measure. Our analyses show that the phonics screening check is a highly valid measure of children's phonic skills. The check showed convergent validity by correlating strongly with other measures of phonic skills (e.g., teacher judgements of phonic ability and psychometric tests of nonword reading and spelling) and with broader measures of reading (e.g., single-word reading accuracy, prose reading accuracy and comprehension). It also demonstrated discriminant validity, by showing weaker correlations with more distal skills (e.g., vocabulary and maths).
Furthermore, the phonics screening check seemed to be sensitive with respect to identifying children at risk of reading difficulties. We compared the check's ability to identify struggling readers (those who scored <32) with identification rates based on preexisting standardised tests of reading accuracy and fluency (for which we defined risk of a reading difficulty as a standard score of <85, that is, 1 SD below the age-expected mean). It is worth noting that caution should be applied when interpreting these analyses, which have involved dichotomising continuous data and identifying a group of putative poor readers when dyslexia is essentially a dimensional disorder (Pennington, 2006; Rose, 2009). Note also that categorisation according to the standardised reading tests only identified a small number of at-risk readers (n = 16, 10%), potentially limiting the extent to which our findings can be generalised, and that sensitivity and specificity rates are intrinsically linked to where the cut-point between 'typical' and 'at-risk' reading is drawn. Nevertheless, the phonics screening check classified most of the same children as at-risk readers as the reading measures had done, resulting in a high sensitivity of 88%. The check also had an acceptable specificity of 82%. This lower figure demonstrates that the check slightly overestimated the prevalence of at-risk readers; however, this might be considered a good property for a screening instrument, given that it is preferable to overidentify the children who might need additional help in order to catch all those who certainly do.
Taken together, these observations lead us to conclude that the phonics screening check is a valid instrument for measuring word-level reading ability and sensitive in identifying young children at risk of a reading difficulty, although this needs to be verified by follow-up data. However, it is important to note that analysis of the distribution of scores on the phonics screening check raises some doubts about its integrity. Notably, there was a sudden spike in the frequency of scores at 32, the threshold for meeting the standard. This trend is accentuated in the national data obtained in both 2012 and 2013 (Department for Education, 2012b, 2013b). Reasons suggested for this unusual distribution include the release of the score for meeting the standard prior to the test administration and subsequent marking up of scores that fell on the borderline of meeting the standard (e.g., Bishop, 2012; Standards & Testing Agency, 2012; Townley & Gotts, 2013). If this interpretation is correct, it questions the objectivity of the instrument and the utility of the national data for tracking standards over time (Bishop, 2013). It is noteworthy that the threshold for meeting the standard in 2014 will not be released until after the check has been administered (Standards & Testing Agency, 2013, p. 17).
A survey of nearly 3,000 teachers (ATL/NAHT/NUT, 2012) found that 91% thought the phonics screening check failed to tell them anything they did not know already about children's reading abilities. Conversely, the Department for Education (2013a) maintains that schools with good reading assessment systems benefited from the phonics screening check because it helped to identify children whose difficulties with phonics were hidden by strong sight word reading skills. This argument implies that teacher assessments of phonic skills may not be well correlated with objective assessments of sight word reading. However, our results showed that well-trained teachers were as good at judging children's reading skills as was the phonics screening check, in so far as both tended to correlate to the same extent with standardised reading measures, notably with tests of sight word reading and prose reading accuracy. There was one significant difference in favour of the phonics screening check, such that it correlated more highly with a standardised test of nonword reading. This might be a reflection of the fact that both these tasks are highly focused on the reading aloud of fully decodable items. In contrast, the phonic phases correlated more strongly with reading comprehension, which suggests that, although the assessment entirely focuses on phonic skills, the phonic phase judgements are also capturing broader aspects of literacy.
Furthermore, teachers' judgements showed at least as good sensitivity in terms of identifying children who were defined operationally from standardised reading measures as at risk of reading difficulties (94% sensitivity of phonic phases, compared with 88% sensitivity of the phonics screening check). However, the phonics screening check was less likely to overestimate the prevalence of at-risk readers (82% specificity of the phonics screening check compared with a 67% specificity of the phonic phases). Care must be taken when interpreting this apparent tendency of teachers to overidentify risk of reading difficulties. First, we chose to define risk as not yet attaining the phonic phase expected by the end of Year 1 (i.e., not attaining phase 5). Because teacher judgements were recorded in May 2012, it is possible that more children would have been judged to have reached Phase 5 by the end of that summer term (July 2012). Second, this cut-off point had been imposed on the data at the point of analysis; sensitivity of teacher judgements may well have been higher had teachers been asked explicitly to state which pupils they considered to be at risk.
On balance, given the strong stability of reading once the system is set up (Lervåg & Hulme, 2009), we concur that a rigorous assessment of phonic skill is important for early identification of children at risk of reading difficulties. However, we argue that when teachers are well educated about the cognitive mechanisms involved in reading, and they have training in the teaching and assessment of phonics, their judgements are sufficient for this purpose, and a mandatory phonics screening check is not necessary. Although a likely counterargument is that well-monitored phonics teaching is not systematically implemented on a national level, we believe that the use of resources to better equip teachers to conduct ongoing phonic assessments would be more cost-effective, not least because this would place them in the best position to intervene before reading difficulties set in.
There are some additional limitations to acknowledge. First, the sample size for our study was relatively small for the validation of a national screening measure. Second, the focus on a local authority in which teachers have been well educated in order to implement the phonics strategy and in which pupils were highly attaining may have introduced bias. Third, as teacher judgements of children's phonic phases influence the level at which they are taught, an underestimate of phonic ability could constrain children's phonic development and in turn their performance on the check. It might be for this reason that teacher judgements and performance on the phonic check correlate well. The risk of this is reduced, however, where teachers have been well trained in phonic assessment and teaching, as in this study. These limitations aside, our main point is to emphasise what can be achieved by teachers who are properly trained and empowered to implement systematic phonics in their classrooms as well as to accurately and consistently monitor their pupils' progress.

Conclusion
We have shown that the new phonics screening check is a valid measure of phonic skills and is sensitive in identifying children at risk of reading difficulties. Its slight tendency to overestimate the prevalence of at-risk readers (as compared with standardised tests of reading accuracy and fluency) is arguably a favourable property for a screening instrument.
We agree that early rigorous assessment of phonic skills is important for the timely identification of word reading difficulties. However, combining our observations about the integrity of the national phonics screening check data with our findings that teachers perform reliable and sensitive assessments of phonics progression, we argue in favour of using resources to continue to train and support teachers in the knowledge, assessment and teaching of early literacy skills.
Notes
1. Note that prose reading rate scores are not calculable for children with the poorest prose reading accuracy scores; children who made >15 reading errors were not given a time score. Thus, our estimation of the number of children at risk is conservative.
2. With respect to the school-based assessments, we have taken age-related expectations as determined by the Department for Education as cut-off points and used these to distinguish between those considered at risk or not at risk of a reading difficulty. However, it is important to note that the Department for Education guidelines do not specify that scores below the cut-offs constitute a reading difficulty.