When Promising Interventions Fail: Personalized Coaching for Teachers in a Middle-Income Country

IZA DP No. 15021, January 2022

Children in developing countries have deep deficits in math and language. Personalized coaching for teachers has been proposed as a way of raising teacher quality and child achievement. We designed a coaching program that focused on one aspect of teacher quality, teacher-child interactions, that researchers in education and psychology have argued is critical for child development and learning. We implemented the coaching program in Ecuador, with 100 1st grade teachers randomly assigned to treatment and 100 to control. Coaching improved the quality of teacher-child interactions but reduced child achievement. Our results underline the importance of evaluating new forms of professional development for teachers, even those that follow best practice, before these interventions are taken to scale.

JEL Classification: I20


Introduction
Teachers are the most important input into the production of learning within schools. Improving teacher quality is a central goal of policymakers in developed and developing countries. In this paper, we evaluate an innovative program that sought to improve teacher effectiveness in a sample of schools in Ecuador, a middle-income country in South America.
Our paper is motivated by four observations from the recent literature on teachers. First, there is substantial variation in teacher quality, even within the same schools, and these differences have important consequences for achievement, college attendance, and labor market outcomes. However, there is considerable uncertainty about policies that can increase the effectiveness of current teachers; see Jackson, Rockoff, and Staiger (2014) and Fryer (2017) for a discussion focused on the U.S., and Evans and Popova (2016), Ganimian and Murnane (2016), and Glewwe and Muralidharan (2016) for evidence from developing countries.
Second, most countries, both developed and developing, spend substantial resources to improve teacher quality. The U.S., for example, spends an estimated $18 billion a year on in-service training for teachers (Education Next 2018). A recent paper (Loyalka et al. 2019) reports that between 2012 and 2017, India's national government allocated US $1.2 billion to teacher professional development programs, and that teachers in Mexico spend an average of 23 days in professional development each year. Yet much of this in-service training is thought to be ineffective because it does not give teachers actionable guidance on how to improve their teaching practices (Popova, Evans, and Arancibia 2016; Popova et al. 2018).
Third, given the perceived shortcomings of traditional in-service training for teachers, policymakers have begun to experiment with alternative approaches, in particular coaching by expert teachers. Coaching has been identified as a promising way of improving classroom quality and learning for young children (Yoshikawa et al. 2013; Evans and Popova 2016), and a recent meta-analysis of coaching programs in the U.S. and other developed countries finds pooled effect sizes of 0.49 SDs on instruction and 0.18 SDs on achievement (Kraft, Blazar, and Hogan 2018). The Biden Administration's American Families Plan, which aims to provide universal preschool to all 3- and 4-year-old children, specifies that this must be accompanied by regular "job-embedded coaching for teachers" (Weiland and Yoshikawa 2021, p. 1). However, the evidence base from developing countries on these more innovative, personalized forms of in-service training for teachers is still thin.
Finally, much recent literature in education and child psychology has emphasized the importance of interactions between teachers and children, especially in preschool and the early years of elementary school. Indeed, as Perlman et al. (2016) write, the focus on interactions is "driven by some of the most fundamental theories of developmental psychology", including "attachment theory, Ecological Systems theory's focus on the child's interactions with his/her most immediate environment, and Vygotsky's emphasis on learning through social exchanges by supportive 'experts'". As a result, a handful of pilots that seek to improve the quality of teacher-child interactions have been implemented, especially in the U.S.
With these insights in hand, we designed, implemented, and evaluated a coaching program for 1st grade teachers in Ecuador. The program was directly based on two coaching programs for teachers in the U.S., Making the Most of Classroom Interactions (MMCI) and My Teaching Partner (MTP). We worked closely with the creators of these programs at the University of Virginia and with officials at the Ministry of Education in Ecuador to adapt them to the Ecuadorean context.
The coaching intervention we study provided 1st grade teachers with bi-weekly, personalized coaching. It had what have been argued to be critical elements for success: It was focused on a particular determinant of learning, namely the nature and quality of teacher-student interactions and how to improve them; it was semi-structured, using a curriculum that gave teachers a framework to think about classroom quality, but also providing concrete recommendations to improve classroom practices; and it was personalized to what coaches observed in the classrooms of individual teachers.
Coaches were teachers who had been nominated by headmasters and fellow teachers in their schools. They received two weeks of full-time training before the intervention began and were then taken out of their classrooms for a year so that they could work full-time as coaches. The program worked on a two-week cycle and lasted for a full school year. Each teacher in the treatment group received 13-14 personalized feedback sessions from her coach. Earlier research from the MTP coaching program in the U.S. suggests that 8-12 biweekly coaching cycles are the necessary dosage to change teacher behaviors (Downer et al. 2009; Pianta et al. 2014).
We evaluate the impact of the coaching program on the quality of teacher-child interactions, and on achievement in math and language. To measure interactions, we filmed teachers teaching for a full day, and coded the video footage with a much-used classroom observation tool, the Classroom Assessment Scoring System (CLASS; Pianta, LaParo, and Hamre 2007). The CLASS measures the quality of teacher-child interactions in three broad domains: Emotional Support, Classroom Organization, and Instructional Support.
We first show that, at the end of 1st grade, teachers in the treatment group had higher-quality interactions with their children, by about 0.26 SDs. We find, however, that these improvements in classroom quality did not translate into higher achievement: The point estimates from regressions of test scores on an indicator of random assignment to coaching are generally negative, in some cases significantly so.
We discuss various possible explanations for our results. First, as we show, coaching primarily improved Emotional Support, a dimension of quality that is only weakly correlated with achievement. Thus, the coaching intervention improved aspects of teaching practices that were unlikely to raise test scores, at least in the short run. Second, coaching may have taken time away from other activities carried out by teachers (like lesson planning). Third, it may be that encouraging teachers to teach in a way that was unfamiliar to them improved teacher-child interactions but disrupted the learning process. Possibly, more time is needed for teachers and children to adjust to a new pedagogical approach.
Our paper contributes to a literature on professional development for teachers. It shows that more innovative (but also more expensive) forms of professional development, like coaching, may not always have the expected results, even when they have a strong foundation in theories of child learning, are carefully designed, and are faithfully implemented.
The coaching program we analyze required teachers to be responsive and adjust to the needs of individual children in their classroom. Pedagogical approaches that are more child-centered are easier to implement where classroom sizes are relatively small and teacher capabilities are high. However, this is rarely the case in developing countries: While average class size in elementary school in OECD countries is 15, it was 31.5 in the 1st grade classrooms in our sample, and is 50 or more, on average, in nine countries in Sub-Saharan Africa. We speculate that, under these circumstances, coaching may be more effective when it focuses on a particular child outcome, gives teachers specific guidance on the steps needed to improve that outcome, and provides teachers with complementary material to guide them as they adjust their classroom practices.
To illustrate this point, we contrast our results with those from two recent evaluations of coaching programs in developing countries. Yoshikawa et al. (2015) analyze Un Buen Comienzo, a coaching program for pre-k and kindergarten teachers in Chile. The program focused on the quality of teacher-child interactions but gave teachers little guidance on specific instructional or classroom practices. The authors show that Un Buen Comienzo improved teacher-child interactions, as measured by the CLASS, but did not raise child development or achievement. Cilliers et al. (2020) analyze a coaching intervention (as well as, separately, more traditional in-service training) for 1st and 2nd grade teachers in a sample of South African schools. The program sought to change how reading was taught to young children. Specifically, it encouraged teachers to switch from reading out loud in front of the class to group reading by the children themselves. The coaching intervention also provided fully scripted lesson plans, encouraged teachers to group children by ability, and promoted frequent assessments of children. Like our paper, and like Yoshikawa et al. (2015), Cilliers et al. (2020) test for changes in classroom practices, but these were practices related to reading (for example, the frequency of group reading) rather than teacher-student interactions more broadly defined, as measured by the CLASS. The authors show that the intervention changed classroom practices and significantly improved reading outcomes.
Our paper also adds to a recent literature in economics on the challenges inherent in replication and scale-up of promising interventions (see the collection of papers in List, Suskind, and Supplee 2021, which focus on interventions for young children; also, Al-Ubaydli, List, and Suskind 2017; Banerjee et al. 2017).
The rest of the paper proceeds as follows. We describe the setting, the coaching intervention, and data in section 2. Section 3 presents results on the cross-sectional association between the CLASS and achievement. We describe our identification strategy in section 4. Our main results on the effects of coaching are in section 5, and we conclude in section 6.

A. Setting and intervention
Ecuador is a middle-income country in South America. The elementary school cycle runs from kindergarten to 6th grade. The overwhelming majority of children attend public (rather than private) schools. Enrollment in elementary school is essentially universal. The key educational challenge in elementary school is quality: On an international test of 3rd graders, 38 percent of children in Ecuador had the lowest of the four levels of performance on math, very similar to the average for the 15 countries in Latin America that participated in the test (40 percent), but substantially more than higher-performing countries like Costa Rica (18 percent) or Chile (10 percent) (Berlinski and Schady 2015).
Our study took place in 198 schools in the province of Pichincha, the most densely populated province in the highlands region of Ecuador. We selected a sample of 10 coaches through a process that included nomination by headmasters and peers, and performance on tests that assessed how suitable teachers would be as coaches.
Specifically, the selection process had three stages. First, the Ministry of Education identified 115 potential coaches in 33 schools in Pichincha province; all potential coaches were tenured and had worked as teachers in kindergarten through 3rd grade in the last 5 years. Next, we had these teachers take a test ("Ideas about Children") that was meant to assess whether they were receptive to the notion that children, not teachers, should be at the center of the learning process. At this stage, we also asked all other teachers in the 33 schools to name 3 teachers they would go to if they needed guidance on some aspect of teaching practice and calculated the proportion of all votes cast in a school that were cast for each potential coach. Similarly, we asked school principals to identify the 3 teachers in their school who had the most potential to serve as coaches. On the basis of these three pieces of data (score on the test, recommendations from peer teachers, and recommendation by the principal), we calculated an aggregate score for each potential coach.
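As an illustration of the peer-nomination step, the following minimal sketch computes, for one school, the share of all votes cast that went to each potential coach. The function name, the ballot format, and the teacher names are hypothetical. The text does not say how the vote share was combined with the test score and the principal's recommendation, so the sketch stops at the vote share.

```python
from collections import Counter

def vote_shares(ballots, candidates):
    """Share of all votes cast in one school that went to each candidate.

    ballots: list of lists, each naming up to 3 teachers (hypothetical format).
    candidates: names of the potential coaches in that school.
    """
    counts = Counter(name for ballot in ballots for name in ballot)
    total = sum(counts.values())
    return {c: counts[c] / total for c in candidates}

# Made-up ballots from three teachers in one school (not study data).
ballots = [["Ana", "Bea", "Carla"], ["Ana", "Bea", "Dora"], ["Ana", "Carla", "Dora"]]
shares = vote_shares(ballots, ["Ana", "Bea"])
```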
We invited 24 teachers with the highest scores to a 3-day training on the CLASS, and scored them on class participation, comprehension of the material, and fidelity in scoring a sample of videos according to the CLASS. With this information, we selected 10 coaches, all of whom accepted the offer of employment for a year.
Once coaches had been selected, we matched each coach to the 20 elementary schools that were closest to her place of residence (excluding her own school) and, within these schools, we randomly assigned 10 schools to treatment and 10 to control. In each treated school, the coach was assigned one 1st grade teacher. Control schools continued business as usual. In total, there were 100 1st grade teachers in the treatment group, and 98 in the control group.
Coaches received two weeks of training administered by expert CLASS trainers, some of whom had also participated in the MMCI and MTP programs that were the basis of the Ecuador coaching pilot.
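The block randomization described above (10 of each coach's 20 matched schools assigned to treatment) can be sketched as follows. This is a minimal illustration, not the procedure actually used: the school identifiers, the seed, and the function name are assumptions. By construction it yields 100 treated and 100 control schools, whereas the realized control group had 98 teachers.

```python
import random

def assign_block(school_ids, n_treated=10, rng=None):
    """Randomly split one coach's block of schools into treatment (1) and control (0)."""
    rng = rng or random.Random()
    treated = set(rng.sample(school_ids, n_treated))
    return {s: (1 if s in treated else 0) for s in school_ids}

# Hypothetical identifiers: 10 coaches, each matched to her 20 nearest schools.
rng = random.Random(2022)  # fixed seed (an assumption) so the draw is reproducible
blocks = {coach: [f"c{coach}_s{j}" for j in range(20)] for coach in range(10)}
assignment = {coach: assign_block(schools, 10, rng) for coach, schools in blocks.items()}
```

Randomizing within each coach's block guarantees that every coach has the same number of treated and control schools, which is why the regressions later include block fixed effects.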
In turn, teachers assigned to the coaching treatment received one week of training on the general framework of the CLASS, with a focus on the importance of teacher behaviors and teacher-student interactions.
The coaching pilot was semi-structured. It followed a two-week cycle. Every cycle focused on a particular topic, for example, how to give feedback to children. Each teacher in the treatment group was recorded teaching for a full day every two weeks. The coach reviewed the video and looked for specific moments that showed desirable behaviors, as well as those that could be improved, focusing mainly, but not exclusively, on the topic for that cycle. She also compared these video clips with selected videos from a large "library" of videos from Ecuador that had been prepared for this purpose. Coach and teacher then had an in-person meeting, viewed the video and the relevant clips from the library together, and agreed on concrete actions that the teacher could take in her classroom.

B. Data
We use the CLASS (Pianta, LaParo, and Hamre 2007) to measure the quality of teacher-child interactions. The CLASS is based on developmental and education theories that argue that the daily interactions between teachers and children are the "primary engine" for child development and learning in preschool and early elementary school (Leyva et al. 2015).

The CLASS measures teacher behaviors in three broad domains: Emotional Support, Classroom Organization, and Instructional Support. The behaviors that coders are looking for in each dimension are quite specific; Appendix Table A1 in Araujo et al. (2016) gives an example. For each of these behaviors, the CLASS protocol gives coders concrete guidance on whether the score given should be "low", "medium", or "high".
The CLASS has been widely used both for research and policy purposes in the U.S. Head Start grantees need a minimum score on the CLASS to be re-certified for funding, and several states have integrated the CLASS into their quality rating and improvement systems. The CLASS has also been used as a measure of teacher quality in research on several Latin American countries, including Chile (Yoshikawa et al. 2015; Bassi, Meghir, and Reynoso 2020), Ecuador (Araujo et al. 2016; Campos et al. 2020), and Peru (Araujo, Dormal, and Schady 2019).
We carefully followed CLASS protocols to code the videos of treated and control teachers recorded at the end of 1st grade. Specifically, each day of film was cut into 20-minute segments. We took the first four segments, and each segment was coded by two separate coders. Coders were blinded to treatment status. The correlation in the scores given by different coders is high: the inter-coder reliability ratio is 0.84, on average. On the other hand, there is more variation in the CLASS scores given to different segments from the same day: the inter-segment reliability ratio between the 1st (earliest) and [...]
In Table 1, we summarize the characteristics of schools (Panel A), teachers (Panel B) and children (Panel C) in our sample. Panel A shows that the average school in the sample had 4.7 teachers between kindergarten and 3rd grade. These values, as well as the proportion of teachers in different pay grades in the salary scale, are similar in treatment and control schools. Panel B shows that essentially all (98 percent) of teachers are women, and most (90 percent) are tenured. On average, teachers have almost 16 years of experience and 31 children in their classrooms. Panel C shows that half of the children in the sample are girls, and average age (measured at the end of the grade, when children took the achievement tests) is 7 years. There are small differences in child age and gender by treatment status: Children in the treatment group are about 0.7 months older than those in the control group and are 4 percentage points more likely to be female. In our estimates, we control for all school, teacher, and child characteristics in Table 1.
We provide further details on the CLASS in the Appendix. Figure A1 graphs univariate densities of the distribution of CLASS scores for teachers in the control group, by domain. The figure shows that CLASS scores are highest in Classroom Organization, with teachers distributed in the "medium" and "high" parts of the distribution; somewhat lower in Emotional Support, with most teachers in the "medium" range; and lowest in Instructional Support, where all teachers have "low" CLASS scores. There are clearly floor effects for Instructional Support, but not the other domains. On average, the CLASS scores in this sample are somewhat higher than those found in a nationally representative sample of kindergarten classrooms in Ecuador, but substantially lower than those generally found in U.S. settings (Araujo et al. 2016). Table A1 shows that CLASS scores across domains for the same teachers are positively correlated, with correlations that range from 0.43 for Emotional Support and Instructional Support to 0.69 for Emotional Support and Classroom Organization. The fact that the correlations between teacher scores on the different domains of the CLASS are far from unity likely reflects a combination of factors: Different teachers may genuinely excel in the behaviors in different domains, but measurement error would also tend to reduce the magnitude of the correlations.
Within each classroom, we used a random number generator to select a sample of 20 children who would be tested at the end of the year. To measure achievement, we applied two language and two math tests. The language tests were a test of letter and word recognition and a test of receptive vocabulary, while the math tests were a test of number recognition and a test of simple addition and subtraction. To measure receptive vocabulary, we use the Test de Vocabulario en Imágenes Peabody (TVIP) (Dunn et al. 1986), the Spanish-language version of the much-used Peabody Picture Vocabulary Test (PPVT). The TVIP has been used widely to measure development among Latin American children; see Paxson and Schady (2007) for a comparison of vocabulary scores between children in Ecuador and the U.S., and Schady et al. (2015) for evidence on levels and socioeconomic gradients in the TVIP in five Latin American countries, including Ecuador. The other three tests were taken from the Woodcock-Johnson battery of achievement tests (Woodcock and Muñoz-Sandoval 1996) and have been applied in other evaluations of interventions in settings similar to ours, including in Ecuador (Paxson and Schady 2011; Araujo et al. 2016).
Unsurprisingly, performance on the four tests is positively correlated in our data. The lowest correlation, 0.34, is between vocabulary and number identification, and the highest, 0.64, is between number identification and basic arithmetic. As with the CLASS, the magnitude of the correlations likely reflects that different tests measure different dimensions of knowledge or achievement, but also measurement error in the tests.

Cross-sectional associations between CLASS and achievement
To motivate our analysis, we first calculate the associations between the CLASS and child test scores.
Panel A of Table 2 reports the results from regressions of overall achievement on the CLASS, with the sample limited to children in the control group. Column (1) shows that a 1 SD increase in the CLASS is associated with a 0.097 SD increase in test scores (p-value: <0.001). In the following columns we report the association between individual domains of the CLASS and achievement. These columns, and in particular the specification in column (5), which includes all three CLASS domains at the same time, show that Classroom Organization is most strongly associated with test scores. In contrast, the association between Emotional Support and achievement is not significant once other CLASS domains are included as controls.
To further explore the associations between the CLASS and achievement, we next make use of data from another experiment, carried out in a different sample of 200 schools in Ecuador (see Araujo et al. 2016; Campos et al. 2020). In that experiment, children were randomly assigned to classrooms within schools in kindergarten and were then reassigned to different classrooms in every grade between 1st and 6th grades. Much as in the coaching pilot, teachers were filmed, and the video was used to calculate teacher CLASS scores. At the end of each grade, children were given a large battery of age-appropriate tests in math and language. In 1st grade, these tests included the same math and language tests as were applied in the coaching pilot (as well as other tests). For the calculations we report below, we limit the sample to teachers and children in 1st grade and use only those tests that were also applied in the coaching pilot, scored and aggregated in the same way.
We first use these data to run similar regressions to those reported in Panel A. The results in Panel B show that the association between the CLASS and achievement in these data is similar to that which we estimate in the control group of the coaching pilot: A 1 SD increase in the CLASS is associated with a 0.082 SD increase in achievement. When we look at different domains of the CLASS simultaneously, only Classroom Organization is consistently and significantly associated with test scores.
The results in Panels A and B of Table 2 are associations, not necessarily causal effects. Indeed, if better-off children attend higher-quality schools, as seems likely, the coefficients from these cross-sectional regressions would overstate the importance of teacher behaviors for achievement. In the multigrade experiment, however, there was random assignment of children to classrooms within schools. As a result, estimates that use only the within-school, cross-classroom variation in the CLASS and achievement are more likely to have a causal interpretation (see Campos et al. 2020 for a detailed analysis of these data).
In Panel C of Table 2, we report the results from regressions of achievement on the CLASS, including school fixed effects. In these school fixed effects regressions, teacher CLASS scores should not be correlated with the observable or unobservable characteristics of the students in their classrooms, although the CLASS could still be correlated with other teacher attributes that affect test scores, as discussed in Araujo et al. (2016). The coefficient on the CLASS in these fixed effects regressions is somewhat smaller than in Panel B: 0.062, rather than 0.082. Importantly, however, the last column of Panel C shows that, much as is the case in the results without school fixed effects, only Classroom Organization, not Emotional Support or Instructional Support, consistently predicts achievement.

Identification strategy
To estimate the impacts of the coaching program on teaching practices, we run regressions of the following form:

$$CLASS_{tsb} = \alpha_b + \beta T_{sb} + X_{tsb}'\gamma + Z_{sb}'\delta + \varepsilon_{tsb} \qquad (1)$$

The dependent variable, $CLASS_{tsb}$, is the CLASS (or one of its domains) of teacher t in school s and block b, where the blocks refer to the groups of 20 schools that were the basis for the block randomization; $\alpha_b$ is a set of block fixed effects; $T_{sb}$ takes on the value of 1 for teachers (in schools) randomly assigned to the coaching intervention, 0 otherwise; $X_{tsb}$ and $Z_{sb}$ are the teacher and school controls in Table 1; and $\varepsilon_{tsb}$ is the regression error term. The coefficient of interest is $\beta$. To test for the possibility of differences in effects at the top and bottom of the distribution, we also run regressions in which the dependent variable is an indicator variable which takes on the value of 1 if the CLASS is below the 10th or 20th percentiles, or above the 80th or 90th percentiles, respectively (four separate regressions).
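The core of this specification, a treatment indicator plus block fixed effects, can be sketched in a few lines. The example below is a simplified illustration that omits the teacher and school controls and uses made-up scores. It relies on the standard result that OLS with group fixed effects is equivalent to OLS on variables demeaned within each group. All function and variable names are hypothetical.

```python
from collections import defaultdict

def demean_by_block(values, blocks):
    """Subtract the block mean from each value (the within transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, b in zip(values, blocks):
        sums[b] += v
        counts[b] += 1
    return [v - sums[b] / counts[b] for v, b in zip(values, blocks)]

def treatment_effect(y, treat, blocks):
    """OLS coefficient on treat with block fixed effects and no other controls."""
    y_dm = demean_by_block(y, blocks)
    t_dm = demean_by_block(treat, blocks)
    num = sum(t * v for t, v in zip(t_dm, y_dm))
    den = sum(t * t for t in t_dm)
    return num / den

# Made-up standardized CLASS scores for two small blocks (not study data).
blocks = ["A", "A", "A", "A", "B", "B", "B", "B"]
treat = [0, 0, 1, 1, 0, 0, 1, 1]
y = [0.0, 0.2, 0.3, 0.5, 1.0, 1.2, 1.2, 1.4]
effect = treatment_effect(y, treat, blocks)
```

With equal numbers of treated and control units in every block, as in this toy data, the estimate is the simple average of the within-block differences in means (here 0.3 in block A and 0.2 in block B).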
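We proceed in a comparable fashion to estimate treatment effects on child achievement. The estimating equation is now:

$$Ach_{itsb} = \alpha_b + \beta T_{sb} + X_{tsb}'\gamma + Z_{sb}'\delta + W_{itsb}'\theta + \varepsilon_{itsb} \qquad (2)$$

where $Ach_{itsb}$ is achievement on a given test or total achievement for child i taught by teacher t in school s and block b, and $W_{itsb}$ consists of child gender and age and its square. The other terms are block fixed effects ($\alpha_b$), the indicator for random assignment to coaching ($T_{sb}$), and the teacher and school controls in Table 1 ($X_{tsb}$ and $Z_{sb}$). Here too we run regressions in which the dependent variable is total achievement or achievement on one of the four tests. As with the CLASS, we also test for the possibility of differences in effects at the top and bottom of the distribution.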
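Finally, to gain some understanding of the mechanisms whereby coaching affects achievement, we run regressions of the following form:

$$Ach_{itsb} = \alpha_b + \beta T_{sb} + \lambda CLASS_{tsb} + X_{tsb}'\gamma + Z_{sb}'\delta + W_{itsb}'\theta + \varepsilon_{itsb} \qquad (3a)$$

as well as:

$$Ach_{itsb} = \alpha_b + \beta T_{sb} + \lambda CLASS_{tsb} + \pi \left(T_{sb} \times CLASS_{tsb}\right) + X_{tsb}'\gamma + Z_{sb}'\delta + W_{itsb}'\theta + \varepsilon_{itsb} \qquad (3b)$$

Here $\beta$ is the coefficient on the treatment indicator $T_{sb}$, $\lambda$ is the coefficient on the CLASS, $\alpha_b$ are block fixed effects, and the remaining terms are teacher, school, and child controls. Equation (3a) is in the spirit of standard mediation analysis. It estimates, under strong assumptions, whether any effect of coaching on achievement can be accounted for by its effect on teacher-student interactions. In equation (3b) we add the interaction between the CLASS and treatment; see Imai, Tingley, and Yamamoto (2013) and Huber (2020) for discussion. In this regression, the coefficient on the interaction, $\pi$, estimates whether the slope of the association between the CLASS and achievement is different in treatment and control groups.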
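To make the two mediation-style specifications concrete, the sketch below fits both by ordinary least squares on a small, noiseless toy dataset. The data are constructed for illustration only: achievement is generated from the treatment, CLASS, and interaction coefficients reported later in the text for the overall CLASS (-0.083, 0.103, and -0.105), and treated teachers are given CLASS scores 0.26 higher. Fixed effects and controls are omitted, and the solver and all names are assumptions rather than the authors' code.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy, noiseless data (not study data).
beta, lam, pi = -0.083, 0.103, -0.105
rows = []
for t in (0, 1):
    for c in (-1.0, 0.0, 1.0, 2.0):
        c_obs = c + 0.26 * t  # treated teachers have higher CLASS scores
        ach = beta * t + lam * c_obs + pi * t * c_obs
        rows.append((t, c_obs, ach))

y = [a for _, _, a in rows]
X_3a = [[1.0, t, c] for t, c, _ in rows]         # constant, treatment, CLASS
X_3b = [[1.0, t, c, t * c] for t, c, _ in rows]  # adds treatment x CLASS
coef_3a = ols(X_3a, y)
coef_3b = ols(X_3b, y)
```

Because the toy data contain no noise, the interacted specification recovers the generating coefficients exactly, while the specification without the interaction forces a single CLASS slope on both groups.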
In all regressions, we normalize the CLASS and each of its domains to have zero mean and unit standard deviation. We follow the same procedure with the individual achievement tests, and also calculate a measure of total achievement, which gives one-quarter weight to each test. As with the individual tests, total achievement is normalized to have zero mean and unit standard deviation. All regressions are estimated by OLS. Standard errors in (2) and (3) adjust for clustering at the school level.
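The normalization and aggregation steps can be sketched as follows. This is a minimal illustration with made-up raw scores. It assumes standardization over the full sample using the population standard deviation; the text does not specify the reference group or the variance formula, and the test names are hypothetical labels.

```python
from statistics import mean, pstdev

def standardize(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def total_achievement(tests):
    """tests: dict mapping test name -> list of raw scores, same child order in each list."""
    z = [standardize(scores) for scores in tests.values()]
    # Equal weights: with four tests, each gets one-quarter weight.
    composite = [mean(child) for child in zip(*z)]
    return standardize(composite)

# Made-up raw scores for five children (not study data).
tests = {
    "letter_word": [10, 12, 9, 15, 11],
    "vocabulary": [40, 55, 38, 60, 47],
    "number_id": [5, 7, 4, 9, 6],
    "arithmetic": [3, 6, 2, 8, 4],
}
total = total_achievement(tests)
```

The composite is standardized a second time because an average of four imperfectly correlated z-scores has a standard deviation below one.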

Impacts of coaching on teacher-student interactions and achievement
To motivate our regression results, in Figure 1 we graph the univariate densities of the CLASS (Panel A) and achievement (Panel B) in treatment and control groups. Panel A shows that, relative to the control group, the distribution of CLASS scores for teachers who received coaching is shifted to the right. The biggest difference between the two distributions appears to be in the right tail. Panel B shows a smaller difference in achievement between treatment and control groups, although the distribution of test scores of children in classrooms where the teacher received coaching appears to be shifted to the left.
Our main results are in Table 3. Panel A, which corresponds to equation (1), shows that teachers who were randomly assigned to receive coaching had higher end-of-grade overall CLASS scores, with an effect size of 0.26 SDs (p-value: 0.06). Columns (2) through (5) show that the effects of coaching are concentrated at the top of the distribution: Teachers who received coaching are not significantly less likely to be below the 10th percentile of the distribution (coefficient of -0.034, with a standard error of 0.041), but are significantly more likely to be above the 90th percentile (coefficient of 0.089, with a standard error of 0.041). Columns (6) [...] where teachers received coaching are significantly more likely to be below the 10th percentile of the distribution of achievement (coefficient of 0.017, with a standard error of 0.010), but are no less likely to be above the 90th percentile (coefficient of -0.007, with a standard error of 0.010). Columns (6) through (9) show that the coaching effects are larger for language than for math. However, they are not driven by a single test; rather, there are negative, significant (or borderline significant) impacts of coaching on three of the four individual tests we applied.
In sum, the results to this point show that coaching significantly improved the quality of teacher-child interactions but, surprisingly, significantly lowered test scores, especially at the bottom of the distribution. We now turn to possible explanations for this result.
We begin by discussing two features of the way in which the data for the CLASS were collected.
First, teachers were filmed towards the end of the academic year, so the changes in behaviors we observe among teachers who received coaching may be relatively recent-too recent to affect child achievement.
Second, it is possible that teachers in the treatment group, who knew what behaviors were "expected" of them, "acted" for the camera on the unannounced day on which they were filmed, a Hawthorne effect. Both of these features of the data could mean that we may overstate the extent to which the coefficient on the CLASS in equation (2) reflects true changes in teacher behaviors over the course of the school year. On their own, however, they cannot account for the significantly negative effect of coaching on test scores. Rather, our findings suggest that the coaching intervention disrupted learning in some way, and that this disruption was not picked up by the CLASS.
To further explore this, we first turn to estimates of equation (3a). Columns (1) through (5) of Table 4 show, unsurprisingly given the positive effect of coaching on the CLASS and the negative effect on test scores, that including the CLASS as an additional ("mediating") variable in the achievement regression increases the negative coefficient on coaching in absolute value (in the regression in which we control for the overall CLASS the coefficient on treatment is -0.079, with a standard error of 0.034).
In columns (6) through (10), we include the interactions between the CLASS and treatment, as in equation (3b). These results show there are remarkable differences between treatment and control groups in the association of the CLASS and test scores. Column (6) shows that at all levels of the CLASS, children in classrooms in which the teacher received coaching had lower achievement: the estimate of the coefficient on treatment ($\beta$) is -0.083 (with a standard error of 0.034), very close to the value in column (1). The results in column (6) show, however, that there is also a difference in slopes. Indeed, the association between achievement and the CLASS in the treatment group, given by the sum of the coefficient on the CLASS ($\lambda$; 0.103, with a standard error of 0.027) and the coefficient on the interaction between the CLASS and treatment ($\pi$; -0.105, with a standard error of 0.034), is essentially zero. In other words, in the treatment group, children in classrooms of high-CLASS teachers did not learn more than those in classrooms of low-CLASS teachers.
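The slope comparison in column (6) can be written out explicitly. Writing $\lambda$ for the coefficient on the CLASS and $\pi$ for the coefficient on its interaction with treatment (symbols used here for illustration), the implied association between the CLASS and achievement in each group is:

```latex
\begin{aligned}
\text{Control group } (T = 0):&\quad \frac{\partial\, Ach}{\partial\, CLASS} = \lambda = 0.103 \\
\text{Treatment group } (T = 1):&\quad \frac{\partial\, Ach}{\partial\, CLASS} = \lambda + \pi = 0.103 - 0.105 = -0.002
\end{aligned}
```

A 1 SD higher CLASS score is thus associated with roughly 0.1 SD higher achievement among control classrooms, but with essentially no difference among coached classrooms.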
Much the same pattern can be observed for individual domains of the CLASS: In columns (7) through (9), estimates of the coefficient on treatment are always negative and significant, as are estimates of the coefficient on the interaction between treatment and the CLASS domain. For example, in the case of Emotional Support, the coefficient on treatment is -0.076 (with a standard error of 0.034), and the coefficient on the interaction is -0.070 (with a standard error of 0.035). 8
What could account for these effects of coaching? Coaching may have crowded out time that teachers spent on other tasks. Teachers and coaches only met for one hour every two weeks, at the end of the school day, but coaches also gave teachers "assignments" they were meant to complete by their next meeting. It is possible that time on meetings and assignments was time that the teachers would otherwise have spent on other teaching-related activities, such as lesson planning.
It is also possible that changes in teaching practices disrupted learning, at least temporarily. Indeed, it is likely that teachers who took the coaching intervention most seriously spent more time preparing assignments and made more changes to their in-class behaviors than other teachers. These teachers may also have had the largest disruptions in their classrooms, and this, in turn, could account for the fact that the CLASS does not predict achievement among children in the treatment group.
We conclude with two important caveats to our results. First, it could be that it takes time for teachers and children to adapt to a new pedagogical approach: teachers who received coaching could be less effective initially, but more effective eventually. In that case, we might expect to see improvements in achievement when teachers in the treatment group teach the next cohort of 1st grade students, perhaps especially if these teachers had received coaching for a second year.
Second, even if coaching lowered achievement in the short run, changes in teacher behaviors may have led children to acquire other skills. For example, higher levels of Emotional Support could raise child self-esteem or foster a growth mindset. Plausibly, improvements in unobserved child skills, in turn, could raise child test scores in subsequent years. Indeed, a number of interventions in early childhood have yielded zero or very small effects on cognition and achievement, but have resulted in substantial improvements in outcomes in adulthood. See Garcia et al. (2020), Conti, Heckman, and Pinto (2016), and Heckman, Pinto, and Savelyev (2013) for an analysis of two influential programs in the U.S., and Chetty et al. (2011), who find that the effects of higher classroom quality on test scores in early elementary school fade out quickly, but lead to improvements in wages and other outcomes in adulthood. Unfortunately, we cannot investigate these possibilities because we do not have data on outcomes other than achievement, or on achievement in subsequent years.

Conclusion
Learning outcomes of young children in many developing countries are low. By the middle of elementary school, a large fraction of children cannot read, and cannot do basic math operations like single-digit addition. Four out of five students in Mozambique and Nigeria cannot read a simple word of Portuguese and English, respectively, after more than three years of compulsory language learning. In India, only one in four 4th grade students manages tasks, such as basic subtraction, that are part of the curriculum for 2nd grade (Bold et al. 2019), and 31 percent of children in 3rd grade cannot recognize basic words (Kremer, Brannen, and Glennerster 2013). In Latin America, two-thirds of children do not achieve the minimum levels of literacy expected for their age (Busso et al. 2017). Researchers and policymakers have struggled for decades to find ways of raising achievement, including with policies that seek to improve the skills of teachers who are currently in service. Interventions that provide teachers with practical, personalized tools to improve teaching practices, including teacher coaching programs, have been identified as particularly promising.
In this paper, we analyze one such program in Ecuador. The intervention we study provided 1st grade teachers with bi-weekly, personalized coaching using a semi-structured curriculum. The content of the program drew on theories in developmental psychology and best practice, and was faithfully implemented. It had the elements of what has been identified as best practice: The World Bank World Development Report on education, for example, states that "to be effective, teacher training needs to be individually targeted and repeated, with follow-up coaching, often around a specific pedagogical technique" (World Bank 2018, p. 131).
After one year, coaching had a modest, positive effect on the quality of teacher-student interactions, albeit primarily in one domain, Emotional Support. On the other hand, the program did not raise achievement. Indeed, as we show, the program had a significant negative effect on test scores, especially at the bottom of the distribution. It is possible that the meetings between coaches and teachers crowded out other valuable teacher activities, like lesson planning, or that the changes in pedagogical practices introduced by coaches disrupted learning, at least in the short run.
Recent research on preschool in the U.S. suggests that coaching for teachers may be most effective when it focuses on a single domain of learning (math, language, literacy, or socio-emotional skills) rather than on improving classroom quality more broadly (see the discussion and references in Weiland and Yoshikawa 2021). Our results, as well as those from coaching programs for teachers implemented in Chile and South Africa (Cilliers et al. 2020), are consistent with this observation. We also speculate that, in settings where teacher quality may be low and the number of students per teacher is high, coaching programs may have to be highly structured, giving teachers concrete tools like scripted lessons and learning assessments, to guide them through implementation. We believe that this is an important area for future research.
Personalized coaching tends to be expensive, especially if it is in-person, rather than web-based. The intervention we study cost well over 10 percent of the salary of the average teacher. 9 In order to have a positive benefit-cost ratio, coaching programs would likely need to have substantial, positive effects on achievement in the short run, or wages in the long run. The results in this paper, and our read of the evidence from other developing countries, suggest that caution is in order before coaching programs for teachers of young children, perhaps especially programs that focus on improving teacher-child interactions, are taken to scale. They also underline the difficulties inherent in translating policies from one setting, like the U.S., to very different settings.
Notes: Regressions of achievement on the CLASS or its domains. Panel A refers to the control group in the coaching experiment. Panels B and C refer to 1st grade data from the multi-grade experiment analyzed in Araujo et al. (2016) and Campos et al. (2020). Regressions in Panel C include school fixed effects, those in Panel B do not. Standard errors in all regressions are corrected for clustering at the school level. *, **, and *** refer to significance at the 10 percent, 5 percent, and 1 percent levels, respectively.
(2) in text. Sample sizes are 3,751. All regressions include fixed effects for blocks (10 fixed effects), and all controls in Table 1. Standard errors in Panel B are corrected for clustering at the school level. *, **, and *** refer to significance at the 10 percent, 5 percent, and 1 percent levels, respectively.
Notes: Regressions of test scores on treatment, the CLASS score (or one of its domains), and the interaction between treatment and the CLASS score (or one of its domains). Sample sizes are 3,751. Specifications (1) through (5) refer to equation (3a) in the text, and specifications (6) through (10) refer to equation (3b). All regressions include fixed effects for blocks (10 fixed effects), and all controls in Table 1. Standard errors are corrected for clustering at the school level. *, **, and *** refer to significance at the 10 percent, 5 percent, and 1 percent levels, respectively.