How stable is program quality in child care centre classrooms?

In the Early Childhood Education and Care (ECEC) sector there is a move to reduce oversight costs by reducing the frequency of quality assessments for providers who consistently score highly over time. However, virtually nothing is known about the stability of ECEC quality assessments over time. Using a validated measure of overall classroom quality, we examined stability of quality in a sample of over 1000 classrooms in licensed child care centres in Toronto, Canada over a 3-year period. Multilevel mixed-effects linear regression analyses revealed substantial instability across all types of ECEC centres, although publicly operated centres were somewhat more stable and tended to have higher quality scores. We also found substantial variance between classrooms within ECEC centres. None of the structural, child/family and neighbourhood characteristics we examined were significantly related to stability of quality ratings. The lack of stability found in our sample does not support the use of a risk-based approach to quality oversight in ECEC. The large within-centre variance in classroom quality suggests that all classrooms within a centre should be assessed individually. Furthermore, classroom level scores should be posted when scores are made public as part of accountability systems. Future research should, in addition to the administrative data used in our study, explore how factors such as educator training, participation in program planning, reflective practices and ongoing learning might improve stability of quality over time.


Accountability mechanisms in the ECEC sector
While the consensus about the need for quality and for monitoring of ECEC services is clear (OECD, ), as described above, definitions of quality vary. Furthermore, the actual initiatives undertaken by individual jurisdictions vary greatly. Licensing requirements usually cover structural components, such as staffing ratios, qualifications and group sizes, as well as health and safety requirements. However, the possession of a license generally means having met the minimum acceptable standard of service at the time of the inspection. In many countries outside the USA and Canada, additional requirements may also include program planning, curriculum implementation, financial and human resource management and working conditions (OECD, 2015). Practices in the USA and Canada appear to differ from those in most developed countries as a result of the de facto separation of mandatory licensing from ongoing program assessment and supports.
Compared to other developed countries, including Canada, licensing regulations that define the minimum acceptable standards in most US states tend to be weaker (Karoly, 2014;Perlman et al., 2019). According to Child Care Aware (2013) in 31 out of 50 states the minimum qualification for a lead ECEC educator was a high school diploma or less; 5 years later the number decreased marginally to 29 (Whitebook et al., 2018). However, it is important to note many individual ECEC services or types of programs (e.g., publicly funded preschool programs) require much higher levels of educational qualification. To improve quality standards, many state and local administrations have added QRISs as another layer of voluntary oversight. QRISs involve in-depth assessments of ECEC providers that serve multiple goals, including giving ECEC providers useful quality improvement feedback, using ratings for accountability purposes and enabling parents to make more informed decisions for their children. While participation may be required to maintain eligibility for state funding, in most states less than 50% of child care centres participate in their local QRIS (The Build Initiative & Child Trends, 2018). Minimum ratings for a QRIS tier/level are often established by local consensus. In many instances, an observational assessment that captures, among other things, the quality of educator/child interaction, is required only for the higher QRIS tiers (The Build Initiative & Child Trends, 2018). These are often captured using one of the measures in the suite of measures referred to as Environmental Rating Scales (ERSs, e.g., the Early Childhood Environmental Rating Scale-3 and the Infant Toddler Environmental Rating Scale-Revised) (Sylva et al., 2006;Vermeer et al., 2016).
ERS scores, together with structural indicators, such as educator levels of formal education, are used to create a composite score reflecting different tiers of quality that are usually reported on a scale of one to five stars. The QRIS ratings in the USA are usually valid for 1-3 years, with 3 years being the most common duration (The Build Initiative & Child Trends, 2018). According to the OECD Starting Strong IV study (2015) the frequency of monitoring practices can vary from several times per year (Luxembourg and Mexico) to annual (e.g., Japan, Chile, Mexico, Netherlands), to every 2 or 3 years (e.g., Ireland, France, Korea, Belgium).
Many researchers (Blau, 2007; Gorry & Thomas, 2017; Hotz & Xiao, 2011; Loeb et al., 2004; Scarr, 1998) argue that regulations unnecessarily increase the burden on operators and reduce access to ECEC services for low-income families in particular. Blau (2007) also finds that regulations negatively affect ECEC workers' wages, possibly contributing to lower program quality. This view argues for reducing regulatory burden as a way of reducing costs to users (Gorry & Thomas, 2017) and promoting creativity in service delivery. One way that has been proposed for reducing the burden of oversight is to decrease the frequency of oversight visits.
The cost of assessments is deemed to be too high and different state administrations have explored different ways to reduce those costs. For example, some allow for selecting a smaller number of classrooms to be assessed within a centre (Tout et al., 2010). However, this approach is problematic because the limited research that exists on this issue suggests that there is substantial variability in quality between classrooms within centres (Karoly et al., 2013;Perlman et al., 2019;Sabol et al., 2019). Another approach that is gaining in popularity is reducing the frequency of assessments. An increasing number of US jurisdictions engage in "differential monitoring", both in licensing and in QRISs, based on previously assessed levels of compliance (The National Center on Early Childhood Quality Assurance, 2015). This means longer lags between visits for providers who have a consistently high track record in terms of regulatory assessments. Similarly, in many countries including Australia, England, and New Zealand, the frequency of assessment visits is based on previous ratings and risk assessments (OECD, 2015).
The differential monitoring approach assumes that ECEC providers maintain a stable level of quality across time and that, therefore, a longer interval between visits to high functioning ECEC settings will still accurately capture ECEC provider performance. While this may allow a rechanneling of oversight resources to higher risk programs, to our knowledge this assumption of stability, which underlies increasing lags between assessments for licensing and QRISs, has yet to be empirically tested.

Toronto's QRIS
In Ontario, where the current study took place, the provincial government provides the majority of funding as well as the regulatory framework, including control over staffing, group sizes and issuance of licenses for all types of ECEC services. The government has recently introduced a tiered licensing system based on the approaches advocated by the National Association for Regulatory Administration (NARA).
Upper tier municipalities (counties and regions) are designated as local service managers of the ECEC system in Ontario, of which licensed centre-based care is only one component. The City of Toronto is the largest municipality in Canada with approximately 170,000 children below the age of six. At the beginning of 2014, the City of Toronto reported having 41,646 licensed child care spaces in 852 centres. Approximately 70% of these centres were eligible to provide care for subsidized children and represent the frame of our study. Child care subsidies (24,264 in January 2014) (City of Toronto Children's Services, 2014) are portable vouchers accepted by any eligible provider. To be eligible for a subsidy, parents must be either working or studying full-time and meet income requirements. The value of the voucher is set by the actual costs of delivering care reported by each specific ECEC provider. This means that in Toronto, the cost of child care should not play any role in selection of the child care centre and organizational characteristics of the service provider for parents receiving a child care subsidy.
The City of Toronto operates a QRIS that, together with other components, includes mandatory annual AQI assessments for all programs eligible for placement of subsidized children. The AQI was developed in Toronto. To be eligible to provide care for children receiving a subsidy, providers must score a minimum of 3 out of 5 on the AQI. The AQI was developed to be relatively efficient (it takes approximately 90 min to administer, compared with 3-5 h for other frequently used ERSs). The preschool version of the AQI that is used in the current study has been shown to correlate significantly with other measures of ECEC classroom quality (e.g., it is correlated with the ECERS-R at 0.61, p < 0.01). Following each assessment, the results are reviewed with the centre supervisor and posted on the centre's notice board. Centre staff are encouraged to discuss the visit results as part of ongoing program development. Results are also used as a basis for tailored quality improvement supports provided by City of Toronto coaches/consultants. Finally, results are posted online to enable parents to use ratings when selecting ECEC providers for their children. Given the policy context of this research (i.e., the trend towards reducing the frequency of quality assessments) we set out to examine whether certain centre and classroom characteristics may predict higher/lower stability over time.

Factors that might impact stability of classroom quality
In the absence of published literature on the stability of quality ratings, and because the assumption behind tiered licensing and quality assessments links quality to stability, we selected a number of age group and centre characteristics for inclusion in the analysis because we hypothesize that they may mediate quality, as well as the stability of quality ratings. These include centre type (auspice and organizational status), neighbourhood status, percentage of service delivered by educators with a qualification in Early Childhood Education, hourly wages for early childhood educators and centre supervisors, program size, presence of other age groups, proportion of subsidized children and proportion of subsidized children who come from single-parent families. These variables have been examined in the context of studies on quality; here we focus on their implications for the stability of quality over time. The rationale for inclusion of each of the covariates is discussed in more detail below.
In Canada, as in many other countries, there is an ongoing debate about the differences in program quality related to program auspice (Brennan et al., 2012; Cleveland, 2008; Cleveland & Krashinsky, 2009; Penn, 2011; Lloyd & Penn, 2012; Mitchell, 2012; Morris & Helburn, 2000; Moss, 2012; Paull, 2012; Sosinsky & Kim, 2013; Sosinsky et al., 2007). Although the majority of researchers find that there are some differences in quality, these are often confounded by market conditions (Cleveland & Krashinsky, 2009) or neighbourhood conditions (Small et al., 2008; Sosinsky et al., 2007). Based on previously published research in Canada (Cleveland, 2008) we expect non-profit centres to show greater stability in terms of quality over time. Furthermore, provincial and municipal funding and system management policies in Toronto are applied differentially to commercial, non-profit and public programs. As a result, the type of centre is used as a control (stratifying) variable. We also distinguish between single centre and multiple centre organizations, expecting that multi-site organizations may be more stable over time since they may have better policies and procedures in place to streamline their operations and service delivery. We hypothesize that larger, multi-site operations would exhibit smaller standard deviations in AQI scores across time and, presumably, higher AQI scores.
Neighbourhood status has been found to relate to the type and quality of available child care (Bassok & Galdo, 2016;Burchinal et al., 2008;Hatfield et al., 2015;Vandenbroeck & Lazzari, 2014). The actual mechanisms of neighbourhood effects are not clear and are empirically difficult to prove (Galster, 2012;Galster et al., 2011;Ham et al., 2012;Manley et al., 2013). Some have even questioned whether they exist within the Canadian context at all (Oreopoulos, 2008). Nonetheless, given past findings, we control for neighbourhood effects by deploying the Child and Family Inequity Score (CFIS), described in detail below.
Educator qualification is an important contributor to the quality of care (Arnett, 1989;Manning et al., 2019;Whitebook et al., 2001). Ontario's child care regulations require at least one staff with a minimum 2-year degree or diploma in early childhood education (ECE) in every preschool classroom. However, many ECEC programs operate with a higher number of ECEs in all classrooms. We expect that the higher levels of training will be associated with greater stability. Similarly, better remunerated staff provide higher levels of care (Schleicher, 2019). We expect that higher rates of pay would be associated with higher rates of stability in quality across time.
From the authors' ongoing work with City of Toronto administrative data on an unrelated study, we learned that the majority of subsidized children who start care as infants remain enrolled at least until they reach kindergarten age. Child care centre programs that serve infants in effect generate their toddler and preschooler enrollment from the children who started in the centre as infants. Given what the literature suggests about the positive effects of early child care enrolment (Sylva et al., 2011), we hypothesize that, besides the individual child effects, the preschool programs potentially composed of children who were enrolled as infants should exhibit higher and more stable quality scores.
Every centre in our database accepts subsidized children as a condition of its contract with the City of Toronto. However, the proportions of subsidized children and the proportions of subsidized children who live in single-parent families vary greatly between centres, primarily as a result of the neighbourhood status. Centres with a higher proportion of children from low-income, predominantly single-parent families experience a higher level of enrolment turnover due to changes in families' subsidy eligibility. We theorize that these family characteristics will negatively affect the quality and stability of quality in centres with a high proportion of subsidized and single-parent families.

The research questions
Our primary goal is to examine the stability of preschool classroom quality over time. Our secondary goal is to test whether specific classroom characteristics might predict stability, enabling the identification of classrooms that might be good targets for less frequent and, therefore, less costly oversight through longer lags between quality assessments.
To do this we examine classroom level stability in quality over a 3-year period. We use the population of centres that are part of the City of Toronto's QRIS using the AQI. Specifically, our research questions are:
1. Are classroom quality scores stable across the 3-year period?
2. Are quality scores in some programs more stable than others over time?
a. Is stability related to program quality? Specifically, we expect that higher quality classrooms would be more stable over time.
b. If higher quality classrooms are more stable, what are their distinguishing characteristics? We expect that covariates associated with high quality (specifically auspice, neighbourhood, proportion of educators with ECE degrees, hourly wage rate, presence of infant classrooms, proportion of children receiving a subsidy, and proportion of subsidized children in single-parent families) would also be associated with stability.
c. Are higher quality classrooms sufficiently stable to reduce the frequency of assessments?

Data
The present study includes all preschool child care classrooms in the City of Toronto that were part of the City of Toronto's QRIS every year between 2014 and 2016. The municipal government maintains an extensive administrative database which includes budget information, staffing, public and subsidy fees, as well as data on adherence to performance standards measured using the AQI. Any program interested in delivering subsidized child care in Toronto must be part of the City's QRIS. As a result, our study consists of the entire population of centres that took part in this system. The author requested and obtained permission to access the data in raw form from the City of Toronto Children's Services. In 2014, the 1st year of the data utilized in this study, 70.3% of preschool programs, 73.9% of toddler programs and 82.8% of infant programs in Toronto participated in the City's QRIS. The remaining centres either did not wish to provide access to subsidized children or were deemed ineligible by either City of Toronto policy on restricting growth of the commercial child care sector or a Council approved Child Care Service Plan. The final study frame consisted of 501 centres with 1019 preschool classrooms. Table 1 shows the number of centres and classrooms for which 3 years of data are available for the analysis presented in this article. The table distinguishes between different types of operators based on auspice and whether the operator owns multiple sites.

AQI-the assessment for quality improvement initiative
The preschool version of the AQI is a measure of overall quality consisting of 31 items (see the list of individual items in Appendix 1). Similar to the ECERS-3, the scoring system requires that all sub-item scores on one level meet the standard before moving to the next higher level. The original validation study (Perlman & Falenchuk, 2010) found a one factor solution in which the mean of all individual items is taken as the reported score. It also found that a factor comprised of the mean score of items related to the quality of teacher-children interactions can be calculated and used as a stand-alone factor. The Spearman correlation between the measure and the ECERS-R was r = 0.61, p < 0.01. The Spearman correlations for the CLASS subscales of Emotional Support, Classroom Organization and Instructional Support were r = 0.39, p < 0.01; r = 0.36, p < 0.01; and r = 0.47, p < 0.01, respectively. The current version of the AQI is measured on a five-point scale where scores of 1 and 2 represent inadequate quality, a score of 3 meets the City's minimum standards to maintain a service contract, and individual item scores of 4 and 5 exceed minimum expectations. Any items with scores below three identified during the assessment are subject to a remediation order and further sanctions if the identified problems persist.
AQI assessments are conducted unannounced annually by trained observers who are randomly assigned to ECEC centres. All classrooms within a centre are assessed by the same rater. Interrater reliability is established every 4 months and assessors must achieve 80% or higher agreement with gold standard expert ratings. The average interrater agreement for 2014, 2015 and 2016 was 94%, 96% and 92%, respectively. During this period individual raters' percent agreement scores ranged from 81 to 100%. For publication on the City of Toronto website the AQI scores are aggregated to the age group level; however, classroom level scores were available for analysis in our data set. In this paper we focus on the cross-sectional and longitudinal characteristics of total AQI ratings.
Descriptive statistics for the cross-sectional data for each of the 3 years are presented in Table 2. The overall mean AQI value for all classrooms increased over the 3 years. Given that the effective range of acceptable scores is between three and five, we interpret the standard deviations as large. Finally, we note that the mean scores differ between the individual centre types, as does the rate of change over time.

Centre type
Because provincial and municipal funding and system management policies are applied differentially to commercial, non-profit and public programs, the type of centre is used as a control (stratifying) variable. It is defined by the combination of auspice and the number of centres operated by individual service providers. Auspice is defined as commercial, non-profit, or publicly operated. Any organization that comprises three or more preschool sites is categorized as a multi-site operator.

The Child and Family Inequity Score (CFIS)
CFIS is an index developed by the City of Toronto in co-operation with representatives of community and post-secondary institutions. The index is composed of the following items: incidence of children in low-income families, female education unemployment rate, lack of affordable housing and proportion of families with English as a second language. The individual items are assigned weights by consensus and calculated for each of Toronto's 140 neighbourhoods. In this sample CFIS scores range from −1.5 to 2.14, with lower values representing more child- and family-friendly and affluent neighbourhoods.

Centre and staff characteristics
The number of preschool spaces in each centre, as well as the presence of infant and kindergarten age groups, are used as indicators of the size of each centre. Staff characteristics include the percentage of care hours delivered by trained Early Childhood Educators (ECEs) with a minimum 2-year post-secondary degree, and the average ECE and centre supervisor hourly wages. This information is only available at the program level (i.e., the same score is used for all classrooms that serve preschool aged children within a centre).

Child and family demographics
The proportion of subsidized children in this study is extracted from the administrative database for 2015 to match the centre profile and neighbourhood data. We include the proportion of subsidized children in single-parent families as an additional proxy for low family income. As was the case for the educator level variables, this information was only available at the program level.

Analytic approach
We begin with a cross-sectional analysis of mean AQI scores for each of 3 years by individual centre type. Because of the unequal group sizes and variances, where appropriate we analyze individual centre types separately as opposed to including centre type as a predictor.
To answer our first research question, we use Stata 15 (Stata Statistical Software: Release 15, 2017) to build a growth model specifying a random slope and a random intercept to partition within and between variances over the 3-year period. The null model consists of annual observations of AQI scores for each preschool classroom within a given centre. We test the model fit by examining the intraclass correlations and residuals. We then build a separate null model for each centre type to assess their individual within and between classroom variances.
To further analyze the characteristics of classrooms with stable scores, we compute for each classroom the maximum difference between its highest and lowest score and define as stable those classrooms whose difference lies between zero and one standard deviation below the mean difference. We then conduct a visual analysis to investigate the distribution of stable classrooms across the range of AQI scores.
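The stability rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Stata code: the function names and the toy scores are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def max_score_difference(scores):
    """Gap between the highest and lowest annual score of one classroom."""
    return max(scores) - min(scores)


def classify_stable(classroom_scores):
    """Flag classrooms whose max-min difference is below mean minus 1 SD.

    classroom_scores maps a (hypothetical) classroom id to its list of
    annual AQI scores. Returns the threshold and a dict of boolean flags.
    Assumption: the SD is the sample standard deviation of the differences.
    """
    diffs = {cid: max_score_difference(s) for cid, s in classroom_scores.items()}
    values = list(diffs.values())
    threshold = mean(values) - stdev(values)
    flags = {cid: diff < threshold for cid, diff in diffs.items()}
    return threshold, flags


# Toy data for illustration only (not taken from the study)
toy_scores = {
    "A": [4.0, 4.0, 4.0],
    "B": [3.5, 4.0, 4.5],
    "C": [3.0, 4.0, 4.2],
    "D": [4.1, 4.3, 4.6],
}
threshold, flags = classify_stable(toy_scores)
```

In this toy example only classroom "A", whose score never changes, falls below the threshold and is flagged as stable.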
An OLS regression with the maximum score difference as the dependent variable and the full set of independent variables described above is performed to answer RQ 2-a. To validate our conclusion, we also run a separate logistic regression for each centre type with AQI stability as the dependent variable. Finally, to answer RQ 2-c we calculate the percentage of classrooms in each centre type that retained a position in the top quartile of AQI scores (at or above the 75th percentile) in each year.
Table 3 provides the average values for the covariates taken from year zero (2014) of the study, with the exception of the CFIS, which was based on 2016 Canadian census data. As can be seen in Table 3, the average values of individual covariates vary substantially for the different types of centres (see Appendix 2 for statistical comparisons of their means). These differences, along with findings that the different centre types had unequal variances, indicate that centre types need to be examined separately in this sample. Pairwise comparison of means finds that the CFIS values for municipal and commercial centres of either type are not significantly different from each other. Non-profit centres, both single and multi-site, are primarily located in neighbourhoods with significantly lower CFIS values (i.e., more affluent neighbourhoods). At the same time, the mean CFIS of multi-site, non-profit centres is significantly higher than that of single site non-profits (F(1,365) = 8.49, p < 0.01). Although there is a large correlation (r = 0.582, p < 0.001) between CFIS and the proportion of subsidized children in the preschool age group, the proportion of children receiving a subsidy is included because it reflects the characteristics of the actual children in each centre. The histograms presented in Fig. 1 demonstrate the substantial differences in distributions of subsidized children, which are not apparent when all centre types are combined.
Appendix 2 presents the results of mean comparisons for all covariates, including significance levels adjusted for unequal group sizes and unequal variances.

Description of the data
Three hundred and seventy-one (36.4%) out of the 1019 classrooms are in centres that provide service to the infant age group. The presence of younger age groups is associated with different levels of AQI scores in classrooms that serve preschool aged children. Specifically, using a t test with the unequal variances option we find that these classrooms (M = 4.19, SD = 0.41) were rated significantly higher than classrooms in centres without infants (M = 3.99, SD = 0.40), t(1) = −7.22, p < 0.001.

Can classroom AQI scores be aggregated in centres with multiple preschool classrooms?
The way the AQI is used as part of the City of Toronto's accountability system involves quality aggregation across classrooms that serve children of the same age within centres. These aggregations make assumptions about homogeneity in classroom quality that have received only limited attention from researchers (Karoly et al., 2013;Pauker et al., 2018).
To determine whether it is appropriate to combine across preschool classrooms within a centre, we examine the variability in classroom quality within centres. In the 345 centres that had more than one preschool classroom, the mean range between the lowest and highest score is 0.32 with a standard deviation of 0.25. A decomposition of variance into within-centre (between classrooms) and between-centre components shows only a moderate level of intraclass correlation between classrooms (ICC = 0.597, SE = 0.030, 95% CI [0.536, 0.654]). Not surprisingly, the mean range of values increases with the number of rooms. However, even centres with only two rooms have an average range of 0.26. A range of 0.32 represents 16% of the acceptable range between 3.00 and 5.00.
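The within-centre range and its share of the acceptable score band can be computed as in the following minimal Python sketch. The function names, centre identifiers and toy scores are hypothetical; only the 3.00 to 5.00 band and the 0.32 mean range come from the text.

```python
from collections import defaultdict

# Acceptable AQI band as described in the text
ACCEPTABLE_MIN, ACCEPTABLE_MAX = 3.0, 5.0


def within_centre_ranges(rows):
    """Return {centre_id: highest - lowest classroom score}.

    rows is an iterable of (centre_id, classroom_score) pairs. Centres
    with a single preschool classroom are skipped because no range exists.
    """
    by_centre = defaultdict(list)
    for centre_id, score in rows:
        by_centre[centre_id].append(score)
    return {c: max(s) - min(s) for c, s in by_centre.items() if len(s) > 1}


def share_of_acceptable_range(score_range):
    """Express a score range as a fraction of the acceptable band."""
    return score_range / (ACCEPTABLE_MAX - ACCEPTABLE_MIN)


# Toy data for illustration only (not taken from the study)
toy_rows = [("c1", 4.0), ("c1", 4.5), ("c2", 3.8),
            ("c3", 4.2), ("c3", 4.2), ("c3", 3.9)]
ranges = within_centre_ranges(toy_rows)

# The mean range of 0.32 reported in the text, as a share of the band
share = share_of_acceptable_range(0.32)
```

Applied to the reported mean range of 0.32, the second function reproduces the 16% figure quoted in the text.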

Descriptive results: AQI scores by type of centre and year
Given this level of heterogeneity in the quality scores of classrooms within centres, we analyzed stability across time for individual classrooms. Average AQI scores were comparable across centre type with one exception. Using a method that adjusts for unequal variances and unequal sample sizes, a pairwise comparison of AQI means between individual centre types (Table 4) reveals that only publicly operated centres score consistently higher than the other centre types. Multi-site commercial operators show increasing, statistically significant differentiation from the non-profit and single commercial centres over the study period.

RQ 1-Are classroom quality scores stable across the 3-year period?
To avoid potential problems of different group sizes and unequal variances, we estimated the multilevel model separately for each individual centre type as well as for the entire sample. The results of all estimations, including the intraclass correlations, are presented in Table 5.
The fixed effects part of the model across all classrooms shows an intercept of 4.05 with a slope of 0.04; in the random effects part of the model the variances for year and intercepts are displayed together with the residual variance. The total variance of random effects is used as a numerator in the calculation of intraclass correlation (ICC) with the denominator being the total variance of random effects plus residual variance.
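The ratio just described can be written compactly. The symbols are our own notation rather than the article's: \sigma^2_{u} stands for the total variance of the random effects and \sigma^2_{e} for the residual variance.

```latex
\[
\mathrm{ICC} \;=\; \frac{\sigma^2_{u}}{\sigma^2_{u} + \sigma^2_{e}}
\]
```

Under this definition, an ICC close to 1 would mean that almost all variation lies between units (and scores are therefore persistent over time), whereas a value near 0.5 means that about half of the variation is left in the residual term.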
The ICC values of 0.457 for the age group level and 0.518 for the combined age group and classroom levels suggest that slightly more than half of the total variance originates between classrooms, while the remainder represents the within-classroom variance. In other words, the chance of accurately predicting the next score is only slightly better than 50% (see Note 2), thus allowing us to answer Research Question 1 in the negative. Focusing on individual centre types reveals a range of annual growth from 0.026 AQI points in single-site commercial centres to 0.083 in multi-site centres. Similarly, the joint age group-classroom intraclass correlations range from as low as 0.279 in public programs to 0.528 in single-site non-profit programs, further confirming that large within-classroom variances negate the possibility of safely predicting the AQI scores in succeeding years. The magnitude of the residual variances suggests possible issues with the estimation itself. To begin with, although the residuals are approximately normally distributed, a plot of residuals against fitted values (Fig. 2) reveals a lack of random distribution around zero. Positive residuals indicate that the fitted value underestimates the actual value, while negative residuals indicate overestimation. Because positive residuals are bounded above (the fitted value plus the residual cannot exceed 5.00), the plot shows a reduction in positive residuals around the 4.50 level of the AQI. At the same time, the strong correlation between residuals and fitted values (r = 0.778, p < 0.001) suggests that the linear estimation process does not represent the actual AQI trajectories well.
Note 2: [...] (Rabe-Hesketh & Skrondal, 2008) are very low, as is the case for publicly operated centres, where both random effects variances are low (especially for the AQI score) and the residual variance is less than one half of that for all classrooms combined. As can be seen from Table 4, the AQI scores of publicly operated centres have a substantially smaller standard deviation than other programs. This, combined with higher intercept scores, leads to lower variances of the intercept; in this case 0.0160 for public centres vs 0.0837 for all centres combined. Thus, despite having the lowest variance of residuals, public centres also have the lowest ICC scores.
RQ 2-Are quality scores in some programs more stable than others over time?

RQ 2a-Is stability related to program quality?
To determine the stability of scores we calculate the absolute difference between the highest and lowest scores for each classroom in the 3-year period. The mean difference between the highest and lowest score for all rooms is 0.49 (Table 6), with municipal centres' mean difference being significantly lower than that of the other centre types. All classrooms with difference values lower than the mean minus 1 SD (0.22) are then deemed to be "stable". Using this approach, 192 or 18.8% of the 1019 classrooms are deemed stable. This percentage varies by type of centre, ranging from a low of 16.2% for non-profit single site centres to a high of 32.2% for the municipal programs. Notably, the absolute difference values tell us nothing about the direction of change; of all 1019 classrooms only 13% improved their AQI score in each year, while 7% declined every year. The remaining 80% experienced a variety of patterns that included growth, reduction or stability of AQI scores.
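Because the absolute difference ignores direction, a separate classification of each classroom's trajectory is needed. The sketch below is a hypothetical Python illustration of such a classification; the function name, the labels and the treatment of unchanged scores as "mixed" are our assumptions.

```python
def change_pattern(scores):
    """Classify a sequence of annual scores by direction of change.

    Returns "improved" if every year-on-year step is an increase,
    "declined" if every step is a decrease, and "mixed" otherwise
    (including years with no change). Labels are illustrative only.
    """
    if len(scores) < 2:
        raise ValueError("at least two annual scores are required")
    steps = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if all(step > 0 for step in steps):
        return "improved"
    if all(step < 0 for step in steps):
        return "declined"
    return "mixed"


# Toy trajectories for illustration only (not taken from the study)
patterns = [change_pattern(s) for s in ([3.8, 4.0, 4.3],
                                        [4.5, 4.2, 4.0],
                                        [4.0, 4.4, 4.1])]

# Share of classrooms deemed stable, from the counts given in the text
stable_share = round(192 / 1019 * 100, 1)
```

Two classrooms with the same maximum difference can thus follow opposite paths, which is why the size and the direction of change are reported separately.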
As shown in Fig. 3, there is no easily discernible relationship between the maximum score difference and the AQI at year 0 for stable classrooms; in other words, classrooms with stable scores can be found across the whole range of AQI scores. However, the Pearson correlation between the year 0 score and the maximum difference is negative, although weak, at r = −0.2703, p < 0.0001. We confirm this finding by plotting the results of separate logistic regressions with stability as the dependent variable and AQI score at year 0 for each centre type as the independent variable. While the probability of any classroom having stable scores increases with its initial (year 0) score, it never reaches even a 40% level (Fig. 4).
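For readers who wish to reproduce this kind of check on their own data, the Pearson correlation between the year 0 score and the maximum score difference can be computed as in the sketch below. The data are hypothetical and the function name is an assumption made for this illustration; the sketch does not compute a p-value.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical values (not data from the study).
year0_score = [2.5, 3.0, 3.5, 4.0, 4.5]
max_difference = [0.7, 0.4, 0.6, 0.3, 0.4]
r = pearson_r(year0_score, max_difference)
```

A negative value of r indicates that classrooms with higher initial scores tend to show smaller differences between their highest and lowest scores.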

RQ 2b-If higher quality classrooms are more stable, what are their distinguishing characteristics?
Multiple regression analyses for the full sample as well as for individual centre types are used to test whether the difference in classroom scores over the 3-year period could be predicted from the covariates used in this study. The results reveal that the program characteristics explain only 2% of the variance (adj. R² = 0.0233, F(9, 1006) = 3.69, p < 0.001) for the full sample and non-significant results for individual centre types. The full results are presented in Appendix 4.

RQ 2c-Are high quality classrooms sufficiently stable to reduce the frequency of assessments?
To answer this question, we identify the 25% of classrooms that scored highest on the AQI in year 0 and track the changes in their AQI score to year 1 and then from year 1 to year 2. However, because we are interested in the stability of the scores for the purpose of reducing the frequency of oversight assessments, it is illustrative to focus on the number of classrooms that manage to retain their membership in the top 25% in each of the 3 years. The distribution of top scoring classrooms in year 0 among the types of centres is shown in Table 7. The proportions of high scoring programs range from 15.9% for single site commercial operators to 47.7% for municipally operated programs. Over the 3-year span less than 7% of all 1019 classrooms manage to remain in the top quartile of AQI scores.
The percentage of programs that consistently maintain their top ranking is shown in the last row of the same table. Of all the 254 classrooms that are in the top 25% in year 0, only 27.6% (70) remain in that group in each of the following 2 years. Across the types of centres, the rate ranges from a low of 14.3% in the single site commercial programs to a high of 54.8% for the municipal programs. Even the much higher retention level in municipal programs is not sufficient to exempt these programs from annual assessment.
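The retention calculation described above can be sketched as follows. This is an illustrative sketch only: the classroom identifiers and scores are hypothetical, the function names are invented for this example, and the handling of ties at the cut-off (all tied classrooms are included) is an assumption.

```python
def top_quartile_ids(scores_by_id):
    """Return ids of classrooms scoring at or above the top-25% cut-off."""
    ordered = sorted(scores_by_id.values(), reverse=True)
    n_top = max(1, len(ordered) // 4)
    cutoff = ordered[n_top - 1]
    return {cid for cid, score in scores_by_id.items() if score >= cutoff}


def retention_rate(yearly_scores):
    """Share of the year 0 top group that stays in the top group every year.

    yearly_scores is a list of dicts, one per year, mapping id -> score.
    """
    start = top_quartile_ids(yearly_scores[0])
    kept = set(start)
    for year in yearly_scores[1:]:
        kept &= top_quartile_ids(year)
    return len(kept) / len(start)


# Hypothetical scores for eight classrooms over 3 years (not study data).
years = [
    {"a": 4.8, "b": 4.7, "c": 4.2, "d": 4.0, "e": 3.9, "f": 3.5, "g": 3.2, "h": 3.0},
    {"a": 4.6, "b": 4.1, "c": 4.7, "d": 4.0, "e": 3.8, "f": 3.6, "g": 3.3, "h": 3.1},
    {"a": 4.9, "b": 4.8, "c": 4.0, "d": 4.1, "e": 3.7, "f": 3.4, "g": 3.5, "h": 3.2},
]
rate = retention_rate(years)
```

In this toy example only classroom "a" stays in the top group in all 3 years, giving a retention rate of 0.5. The figures reported above follow the same logic: 70 of 254 classrooms is about 27.6%, and 70 of 1019 is just under 7%.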

Stability of classrooms scores over a 3-year period
One of the main motivations for this study was to empirically test the degree to which program quality is stable over time. Using multi-level modelling we establish that the within classroom variances are almost as high as the between classroom variances. In practical terms this means that it is almost impossible to accurately predict the next year's score from the current year. When focusing on centres that scored in the 75th percentile or higher in year 0, we find substantial differences in their ability to maintain the high ranking according to the centre type (Table 7); however, even within the highest scoring municipal sector, only 54.8% managed to retain that ranking over the 3-year period. We also find stable scores at all levels of quality (Fig. 3). This, of course, is problematic as classrooms at lower levels of quality should focus on continuing improvement. Finding stability at all levels of quality also helps to explain the lack of associations between stability and the structural characteristics usually associated with program quality.
On the other hand, the programs at the high end of the scale are expected to maintain their ratings over time. We have defined high quality scores as belonging to the 75th percentile or higher in year 0. Contrary to our expectations, less than 28% of classrooms managed to remain in the top category every single year in the 3-year period. This finding leads us to reject the suggestion that the frequency of assessments can be reduced on the basis of belonging to the top scoring programs. In a post hoc analysis we employ the same approach to analyze classrooms with scores in the 90th percentile and above; only 15% of classrooms manage to maintain their place in that category in each of 3 years. Therefore, even among exceptionally strong programs, instability is very high.
There are significant differences based on the type of the centre in the membership in the top-quality groups as well as in the rate of remaining in the group over 3 years; we address these differences below.

Centre type
Based on the differences in centre characteristics, we expected differences in levels of quality and stability, albeit tempered by the strong system management role provided by the City of Toronto. Compared to both types of commercial centres, on average, the non-profit sector pays significantly higher wages, operates with a higher proportion of ECE trained staff, is located in more affluent neighbourhoods and serves a lower proportion of subsidized children (Appendix 2). Staff characteristics of the public sector are comparable to the non-profit sector, while the child and family demographics, and neighbourhood characteristics are similar to those of the commercial sector centres. A comparison of the quality scores across the five centre types showed no significant differences between the commercial and non-profit centres in year 0 and a significant difference between the public centres and the rest. The public centre advantage remains in the following 2 years, while the commercial multi-site sector improved its score enough to separate itself from the other centre types.
To understand the similarity of scores between commercial and non-profit providers it is important to note that the commercial centres in this sample have all been part of the City's QRIS for decades. The relatively tight enforcement of the standards and supports within the City's QRIS may explain the relatively high performance of the commercial centres (Cleveland, 2008). Since the early 1990s the City of Toronto has had a policy of not entering into any new contracts with commercial operators as well as eliminating profit as a component of approved operating budgets. All ECEC operators with a purchase of service contract with the City of Toronto are under a non-distribution constraint, making them effectively "commercial, entrepreneurial" non-profits (Bushouse, 1999; Hansmann, 1980). At the same time, the operators officially designated as non-profit receive higher operating grants, making it possible for them to pay higher wages or hire more than the legislated minimum of educators with relevant educational backgrounds. The variability in AQI scores within the commercial and non-profit sectors suggests that more attention should be paid to supporting quality of service than to whether the centre falls into the commercial or non-profit category.
We categorize centres into single and multiple site operators to explore whether being a part of a larger organization contributes to higher consistency of practices and standards of operations. A more consistent operation would be expected to exhibit smaller standard deviations and, presumably, higher AQI scores. However, a closer examination of several multi-site agencies reveals no consistent relationship between a higher level of stability and quality scores. Although the public centres exhibit a substantially higher level of stability (Table 4 and Fig. 4), they are still well below the rate that would allow for reduced frequency of quality assessments.

Classroom or age group level?
Aggregation of scores across classrooms to the centre level is adopted in many QRIS, including the one currently operated by the City of Toronto. However, we find that in many cases there are substantial differences (M = 0.32, SD = 0.25) between individual classrooms in centres. Because all of a centre's classrooms are assessed by one, and only one, trained observer, the issue of inter-rater reliability does not apply in any given year. Nevertheless, above and beyond the analysis presented in this paper, the aggregation of individual scores has some serious implications. First, it potentially misleads users about the program quality of their child's classroom; in this study the difference of 0.32 on the AQI scale represents a 16% difference. Second, any substantial difference should give rise to questions about program supervision and management practices. Finally, it supports the recommendation that, rather than a sample within the centre, all classrooms should be assessed on a regular basis.
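The following sketch illustrates how a centre-level average can mask differences between classrooms. The scores are hypothetical, the function name is invented for this example, and measuring the within-centre difference as the gap between the highest and lowest classroom score is an assumption, since the exact definition used for the reported mean difference is not restated here.

```python
from statistics import mean


def centre_summary(classroom_scores):
    """Summarise one centre: aggregated mean, spread, and per-room gaps.

    classroom_scores maps a classroom label to its AQI score.
    """
    centre_mean = mean(classroom_scores.values())
    spread = max(classroom_scores.values()) - min(classroom_scores.values())
    gaps = {room: score - centre_mean for room, score in classroom_scores.items()}
    return centre_mean, spread, gaps


# Hypothetical centre with three classrooms (not data from the study).
centre = {"infant": 4.4, "toddler": 3.9, "preschool": 4.1}
centre_mean, spread, gaps = centre_summary(centre)
```

Here the published centre score would be about 4.13, although the toddler room scores roughly 0.23 below it and the infant room roughly 0.27 above it, information that is lost when only the aggregate is posted.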

Cost of child care quality assessments in City of Toronto
The cost of assessments is covered by the municipality and at the time of data collection was approximately 8 cents per child per day (A. Hepditch, personal communication, January 8, 2020). This represents less than 0.02% of the median price of a preschool space in Toronto.

Limitations
This study suffers from several limitations. One limitation is that the study data come from the City of Toronto, which is a high demand market area (i.e., a seller's market); this is demonstrated by the growth in commercial and non-profit operators who opt to remain outside the subsidy system, primarily in affluent areas of the city. Results from this study are primarily generalizable to localities with a similar market profile and level of oversight and program support.
The administrative data used in this study were collected for accountability and performance improvement of programs that serve various proportions of subsidized and full-fee children. Although these programs represent over 70% of preschool services in Toronto, no conclusions should be drawn about the quality of the non-funded programs or programs that were established in 2015 or later. However, it is important to note the 70% coverage rate is relatively high compared to other studies.
Another limitation of the administrative data we use is that information that would be valuable in exploring program quality and stability is simply not available. Under the provisions of protection of privacy legislation, the municipality is not allowed to collect data on ethnic background, language spoken at home and parent education for the purpose of determining eligibility for subsidized child care. Instead we rely on neighbourhood information (CFIS), which is more distal to the actual service than information about the parents of children enrolled in the centre. Another limitation of our data is that the three datapoints available to us did not enable us to test for curvilinear patterns in the data. It is important that future studies with longer-term follow-up of quality test for such patterns.
The information about the proportion of ECEs in each classroom, their hourly pay, and the number of children receiving a subsidy and who came from single-parent households is only available as an aggregate across all preschool aged classrooms in each centre.
Finally, the issue of human error in measurement is an important one to consider. Although the program observers are regularly tested and have high levels of interrater agreement, exceeding 90% across raters and time, an interrater agreement rate of 100% is generally not feasible. This means that some level of disagreement exists between raters and across time, and this likely explains some of the fluctuations in scores observed in this study. One way we reduced the role of measurement error/noise in our analyses is that we did not consider very low levels of fluctuation as reflective of instability. In general, the question of which levels of fluctuation are significant requires further study, and it will be important to include measures of child wellbeing as a way to determine which levels of fluctuation in quality are meaningful.
Two different findings are noteworthy and merit further investigation. First, there is no evidence that program auspice can be used in predicting stability of quality scores. Even though as a group, publicly operated programs had significantly higher AQI scores and lower variability, a small number of those programs fell below the expected quality and stability range.
Second, further investigation is needed to identify factors underlying the stability of quality, or the lack thereof. The covariates (centre, user and neighbourhood characteristics) used in this study shed little light on this question. Future research should examine these variables at the classroom level as well as explore additional variables, such as educator training, staff retention, participation in program planning, reflective practices and ongoing learning, that might improve stability of quality over time. Because the data available to us consist of annual assessments, we were not able to test the extent to which quality varies within the year.

Conclusion
We set out to provide empirical evidence of the stability of quality ratings over a 3-year period, and to investigate whether quality assessments can be carried out less frequently than annually. Our findings do not support such a change. In fact, our findings suggest that the frequency of assessments should not be reduced because attaining a high score in any given year is not a guarantee of doing so again in subsequent years. Furthermore, the chance of remaining in the top quartile scoring group over the 3-year period is less than 28%. In addition, if we accept that one of the purposes of any QRIS is to provide evidence of effective program intervention and supports, then such evidence has to be available in a timely manner. Reducing the frequency of independent assessments will only make it difficult to identify critical issues, devise corrective strategies and provide required program supports. Finally, if the information resulting from QRIS or, more specifically, quality assessments is to be useful to parents in making their child care choices, then it has to be current. Together, these highlight the need to maintain the frequency of quality assessments, conducted annually at a minimum, as part of ECEC quality oversight regimes.