Effects of innovative science and mathematics teaching on student attitudes and achievement: A meta-analytic study

Many teaching approaches have been tried to improve student attitudes and achievement in science and mathematics education. Achievement effects have been synthesized, but a systematic overview of attitude effects is missing. This study provides a meta-analytic review based on 56 publications (1988–2014), reporting 65 independent experiments that investigated the effects of teaching approaches on student attitudes in primary or secondary science or mathematics education. Five types of teaching approaches were distinguished: inquiry-based, context-based, computer-based, collaborative learning strategies, and extra-curricular activities. Since many different attitude outcomes were distinguished and attitudes were assessed at different levels of granularity, we did separate analyses for specific and more global outcomes. Outcomes were not significantly different for different educational approaches. When taking all interventions together, significant effects were found for General Attitude (n = 60; d = 0.35), General Interest (n = 20; d = 0.22), and Career Interest in Science (n = 4; d = 0.40). The effects were significantly weaker for studies with older students. Analysis of achievement outcomes yielded a significant and large overall effect (n = 40; d = 0.78), again with no significant differences between teaching approaches. Although the positive effects might be partly due to novelty, the current findings do counter skepticism about the learning outcomes of interest-oriented teaching approaches.

In their primary grades, many children demonstrate vivid interest in science and mathematics, but their attitudes decline in the middle grades (Frenzel, Goetz, Pekrun, & Watt, 2010; George, 2006; Kind, Jones, & Barmby, 2007; Osborne, Simon, & Collins, 2003; Potvin & Hasni, 2014a). Interest levels vary by topic and by gender, but the general pattern is quite similar for mathematics and most of the sciences, and the decline occurs in most countries that have attained a certain level of wealth (Osborne et al., 2003; Sjøberg & Schreiner, 2010). The lack of interest in science and mathematics has been on the educational and political agenda for a long time for various reasons, such as the need for a scientifically literate public; the need for employees with a STEM (Science, Technology, Engineering and Mathematics) background; and the need of science itself to gain public support (e.g., AAAS, 1993; European Commission, 2004; Lacey & Wright, 2009; OECD, 2003, 2008; Rocard, Csermley, Jorde, Walberg-Henriksson, & Hemmo, 2007; WRR, 2013).
Many educational innovations, including context-based teaching, inquiry-based teaching, and ICT usage, have been proposed, both in science and mathematics education, to foster positive attitudes, but there is little systematic evidence about which educational approaches are effective in promoting interest, attitude, and motivation (Fortus, 2014; Osborne et al., 2003). Moreover, critics have been concerned that such interest-oriented approaches will come at the cost of achievement outcomes (e.g., Kaminski, Sloutsky, & Heckler, 2008; Kirschner, Sweller, & Clark, 2006; Klahr & Nigam, 2004). The purpose of the current study is to gain insight into the magnitude and the robustness of these effects through meta-analytic synthesis of experimental and quasi-experimental studies that investigate the effects of innovative teaching approaches on students' attitudes toward and achievement in science or mathematics.

Background
On the basis of theoretical and experimental studies, attitudes, motivation, and interest are regarded as important determinants of the quality and depth of learning processes, student persistence, and study choice (Hidi & Renninger, 2006; Maltese & Tai, 2011; Pintrich, Marx, & Boyle, 1993; Vansteenkiste, Lens, & Deci, 2006). Some authors even proposed that affective learning outcomes might be more important to further learning than cognitive ones (e.g., Maltese & Tai, 2011). Although the direct correlations between attitude and achievement as observed in empirical survey studies are too weak to warrant such claims (Ma & Kishor, 1997; Singh, Granville, & Dika, 2002), it is clear that affective outcomes are important, especially in the long run.
Research on the effect of teaching approaches on attitudes in science and mathematics education has a long and somewhat cumbersome history. In their seminal review about attitudes in science education, Osborne et al. (2003) conclude that "even a cursory examination of the domain reveals that one of the most prominent aspects of the literature is that 30 years of research into this topic has been bedeviled by a lack of clarity about the concept under investigation" (p. 1053).
This lack of clarity is reflected in the common practice that researchers will use their own newly developed instruments to assess attitudes, rather than a well-validated existing one (Blalock et al., 2008). Moreover, with regard to intervention studies, Osborne et al. conclude that, although the science education literature may contain thousands of reports of interventions designed to change attitudes, effect sizes are marginal and most findings cannot be generalized beyond the particular study.
The key problem according to Osborne et al. is that studies compare student outcomes between curricula without undertaking an analysis of the ways in which the curricula differ. Although they admit a few positive exceptions (the use of lifeworld contexts does seem to have consistent positive effects), their sobering conclusion is that teacher variables are more important anyway, and that research may better turn away from curriculum variables.
Nevertheless, over the last ten years, several review studies have documented the effects of innovative teaching approaches on attitude and/or achievement. In a systematic review of 17 experimental studies assessing the effects of context-based and Science Technology and Society (STS) approaches in science education, Bennett, Lubben, and Hogarth (2007) found a positive effect on attitude, but no consistent effect on achievement. Likewise, a systematic review by Potvin and Hasni (2014b), based on 97 intervention studies, which covered a broader range of educational approaches such as inquiry-based, collaborative work, ICT-usage, and out-of-school activities, found a modest positive effect on attitude/motivation/interest (which were taken together in their analyses). In both reviews, all kinds of study designs were included, and there is little information about the size of the effects. With regard to achievement, meta-analyses have been conducted by Schroeder, Scott, Tolson, Huang, and Lee (2007), and by Furtak, Seidel, Iverson, and Briggs (2012). Schroeder et al. included 61 studies covering eight types of teaching innovations. They found a substantial overall effect on achievement (d = 0.67 ± 0.01), with no significant differences between the approaches. Furtak et al. specifically focused on the achievement gains for inquiry-based education. They included 37 studies, and found a substantial positive effect on achievement (d = 0.50 ± 0.01), with markedly higher values for teacher-led inquiry than for student-led inquiry. All of the above reviews had their focus on science education. The evidence in mathematics seems to be sparser, with the available review studies focusing on other influencing factors such as teacher behavior or classroom atmosphere (e.g., Middleton & Spanias, 1999). A notable exception is Rosen (2009), who reports substantially positive effect sizes for 'problem posing', both on achievement and on attitudes towards mathematics.
To sum up, although the general tendency across studies seems to be that innovative teaching approaches do have positive effects, there is little clarity about what interventions cause effects on what outcome, and under what conditions.

Affective outcomes: attitude, interest and motivation
In educational research, the influence of affective factors on learning is being addressed from at least three largely separate research traditions: attitude, interest, and motivation research. Several authors have attempted to bring unity by synthesizing perspectives (e.g., de Brabander & Martens, 2014; Eccles & Wigfield, 2002; McLeod, 1992; Van Aalderen Smeets, Walma van der Moolen, & Asma, 2012), but thus far this has not led to a widely used common framework. Yet, in order to compare the effectiveness of interventions, and to synthesize findings across studies as we do in our review, we need a conceptual framework to decide whether two measurements map on the same underlying construct, or whether they measure different constructs. Some authors have opted for rather fine-grained frameworks (e.g., Gardner, 1995), which allow for highly specific conclusions, but limit the opportunities to compare outcomes across a wider sample of studies; others took together all interest, motivation, and attitude measures in a single construct (e.g., Potvin & Hasni, 2014b), leading to a larger pool of studies but fuzzier interpretations. In order to find an optimum between these extremes, we briefly review the underlying psychological theories.
According to commonly held definitions in psychology research, an attitude is "a summary evaluation of a psychological object captured in such attribute dimensions as good-bad, harmful-beneficial, pleasant-unpleasant, and likable-dislikable" (Ajzen, 2001, p. 28), or alternatively "a psychological tendency that is expressed by evaluating a particular entity with some degree of favor or disfavor" (Eagly & Chaiken, 1993, p. 1). It should be noted that in science education, the term attitude has also been used to refer to scientific attitudes, such as a critical stance, or a demand for verification (e.g., Osborne et al., 2003). Although the attainment of scientific attitudes is an equally important goal for science education, they fall beyond the scope of the current review.
The field of attitude research has long been plagued by the finding that attitudes are only poor predictors of actual behavior, even to the point that researchers called for abandoning the concept of attitude (e.g., Wicker, 1969). This lack of predictive power has been explained from the observations that a) attitude is a multidimensional construct (e.g., Gardner, 1975), b) any concrete situation involves many particularities that are not accounted for by the overall attitude assessment (e.g., Ajzen, 1991, p. 185), and c) in addition to the attitude toward the object or behavior, behavior is also influenced by subjective norms and perceived behavioral control (Ajzen, 1985).
Subjective norms refer to the opinions significant others are supposed to hold about the behavior, and whether one would like to be seen as someone who engages in this particular behavior. With regard to the science and mathematics domains: if science is regarded as "for nerds only", this will turn off students who do not want to be regarded as nerds. In science education research, this has been operationalized in a scale assessing the perceived Normality of Scientists. Perceived behavioral control, or self-efficacy, is an expectancy about one's capabilities to learn or perform a given task (Bandura, 1977). In science and mathematics education research, similar constructs are variously being addressed by names such as Science Self Concept or Math Anxiety (reverse coding). Unlike measures for general self-esteem, self-efficacy toward a specific behavior has been a strong predictor of actual behavioral choices, and perseverance. These elements have been incorporated in Ajzen's theory of planned behavior (1991) which became a successful model for predicting human behaviors in many areas.
Like attitude theories, interest theories primarily focus on person-object relations. According to a widely held definition by Hidi and Renninger (2006), interest refers to "the psychological state of engaging or the predisposition to reengage with particular classes of objects, events, or ideas over time" (p. 112). This definition encompasses short-term situational interest (state) as well as long-term personal interest (disposition). In the present context our focus will be on personal interest as an educational outcome. As with attitude, the general developmental pattern for interest seems to be a "downward trend … plateaued in later years, with high variability in mean levels, but little variability in the shape of the growth trajectories" (Frenzel et al., 2010, p. 507). However, as can be observed in young children as well as in scientists, the onset of a personal interest can be a quite idiosyncratic event, and the object of interest can be very narrow (e.g., American Scientist, 2012; Chi & Koeske, 1983; DeLoache, Simcock, & Macari, 2007).
Motivational theories also attempt to answer the question about what gets an individual moving (energization) and toward what activities or tasks (direction; Pintrich & Schunk, 2002). Compared to attitude and interest research, many motivation theories are more focused on external factors that influence energization for the task at hand (rewards, punishments), and other constructs are domain overarching personal traits (mastery goal orientation versus performance goal orientation). These are beyond the scope of the current review.
Other motivation theories take a more long-term perspective on what guides the direction of motivation (goal orientations, self-determination theory). A leading theory in this field is the Self Determination Theory (SDT, Ryan & Deci, 2000), which identifies the basic psychological needs for Competence, Autonomy, Relatedness as ultimate drivers of behavioral choices. People will tend to engage in behaviors that are likely to help satisfy these needs, and avoid behaviors that threaten the fulfillment of these needs (failure avoidance). SDT also offers a developmental perspective: Motivation can develop from externally driven to autonomous and identified for activities that successfully contribute (competence) toward self-selected (autonomy) and meaningful (relatedness) goals. SDT proposes a continuum from pure externally driven behavior (avoid punishment) to autonomous internally driven behavior. Such autonomous behavior could be driven by intrinsic motivation (just for the joy of it, which is very similar to interest, cf. Eccles & Wigfield, 2002), or it could contribute to other goals valued by the actor (becoming an engineer, being valued by the community). Both intrinsic motivation and the career perspective resemble forms of personal interest that are regularly being assessed in science and mathematics education research. The desire to be valued by the community has close relation to the concept of subjective norm in Ajzen's (1991) theory of planned behavior.

Multiple grain sizes – toward a conceptual framework
One proposed remedy for the poor predictive validity of attitudes is the aggregation of specific behaviors across occasions. By aggregating different behaviors, observed on different occasions and in different situations, other sources of influence tend to cancel each other out, with the result that the aggregate represents a more valid measure of the underlying behavioral disposition, and a more powerful predictor of general behavior ( 2007). By contrast, if the aim is to predict a specific behavior, it is important to assess attitudes specific to this particular behavior, rather than attitudes toward a broad behavioral category (Ajzen, 1991).
For attitudes toward science and mathematics, this would imply that doing science in school, doing practicals, doing science as a leisure activity, choosing a science career, and support for science as a societal phenomenon should all be regarded as different attitude objects. On the other hand, Kind et al. (2007), in factor analyses on questionnaire data from 932 middle school students, found that questions about "learning science in school", "learning science outside school", and "future participation in science" would all load on a single factor, which they termed general interest in science. Likewise, some authors point to the difference between the cognitive (beliefs), affective (emotional responses) and conative (behavioral intentions) components of attitude. Even if students agree that science is important, this opinion may remain unrelated to students' feelings about doing science themselves. Evidence is mixed as to whether these components should be accounted for separately in order to predict behavior. For instance, Francis and Greer (1999) claimed a one-dimensional structure for their attitude toward science instrument, but Walker, Smith, and Hamidova (2013), using the same instrument with a new sample of students, found a two-factor structure, with the items neatly split up between beliefs about the relevance of science, and interest in doing science oneself (affective and conative).
Given the great diversity in conceptual frameworks, the broadly targeted nature of many educational interventions, and the diffuseness of the targeted behavior, we consider that both the fine-grained and the more global perspective may have their merits, and we cannot justify a single level of analysis. Therefore, in this study we opt for a hierarchical framework (Fig. 1).
In line with the theory of planned behavior and attitude research, and further informed by an initial sample of about 50 studies, we identified two clusters of cognitive beliefs about the relevance of science or mathematics, namely personal relevance and societal relevance. Typical items would be "science helps me in dealing with everyday problems" and "science contributes to our wealth", respectively. Some studies combine both kinds of items in a single scale, in which case they could be said to assess general relevance. Likewise, interest (including intrinsic motivation) was being assessed in three different contexts, namely classroom, leisure, and career. The three constructs taken together could be said to represent general interest (cf., Kind et al., 2007). Furthermore, several studies assess self-efficacy toward the specific domain at hand and/or the normality of scientists, which is taken as indicative of subjective norms/relatedness. Finally, a group of studies either combine items from different categories in a single overall measure, or employ a more fine-grained instrument while only reporting an overall score. In line with the idea that an overall attitude would predict overall behavior, these studies could be regarded as assessing the overall attitude.

Achievement
While affective outcomes are important, they should not come at the cost of the cognitive learning outcomes. Achievement may seem more straightforward to conceptualize, but achievement outcomes pose their own challenges. Most notably, and in science even more than in some other school subjects, achievement tests are topic- and course-specific, and the majority of studies use achievement measures specifically constructed for the task at hand. As a consequence, the quality of instruments is hard to assess, and in many studies, the experimenter-made measures are more aligned with the content taught in the experimental condition than with the content in the control condition (e.g., inquiry skills are needed on the test, or the questions are posed in context-rich format). On the other hand, one may argue with some ground that standardized tests are often more aligned with the outcomes of the traditional curriculum. Slavin and Madden (2011), focusing on mathematics and reading studies reviewed in the U.S. Department of Education's What Works Clearinghouse (WWC), found that measures that are "inherent" to the treatment (covering content not taught in the control group) are associated with effect sizes that are much higher compared to measures of the curriculum taught in experimental as well as control groups (d = 0.45 vs. d = −0.03). By contrast, Schroeder et al. (2007) in their meta-analysis, where 47 out of 62 achievement tests were locally constructed, found no difference in outcome (d = 0.73 and d = 0.75, respectively).

Teaching approaches
Over the past decades many teaching approaches have been proposed to foster positive attitudes toward science and mathematics. Although each approach has its own unique features, which could be characterized on many dimensions, a few broad categories tend to be distinguished in research as well as in public debate (e.g., Potvin & Hasni, 2014b; Schroeder et al., 2007). To categorize the studies in their sample Schroeder et al. drew on a typology of teaching strategies by Wise (1996). Taking this typology as a starting point we encountered the following types of interventions in our study:
Context-based teaching (con). Many curriculum reforms have focused on the use of contexts and applications of science and mathematics. Although the intensity and the role of context use vary across implementations, one aim of all context-based curricula is that students will experience the relevance and applicability of the science content in society and in their personal life worlds (Gilbert, 2006). Many studies on context-based interventions report gains in students' attitudes to science and technology, with learning gains similar to those of conventional approaches (cf., Bennett et al., 2007).
Inquiry-based learning (ibl). In IBL, students get involved in inquiry activities in order to find answers to learning questions. Implementations differ in the amount of teacher guidance, but students get (partial) responsibility for devising the research questions and methods, and interpreting the results. Thus, inquiry-based learning is supposed to promote engagement and ownership, and a more "human" view of science as knowledge in the making. Many studies on inquiry-based curricula report positive effects on attitudes as well as achievement (Furtak et al., 2012; Gibson & Chase, 2002).
ICT-rich learning environments (ict). ICT-rich teaching approaches include (individualized) computer-based instruction, games, feedback, interactive quizzes, computer-based labs, simulations and robotics. It would be hard to propose a common mechanism by which ICT usage would lead to more positive attitudes, but many studies on ICT-based interventions report such gains (Potvin & Hasni, 2014b). Proposed mechanisms include that students enjoy working with computers, students feel safer to experiment and make mistakes, and/or students appreciate the (quick) feedback.
Collaborative learning (coll). Collaborative and cooperative teaching approaches, such as project-based work, discussion, "jigsaw" or peer feedback, tend to enhance social interaction and relatedness between learners, and quite often involve an increased ownership of the content to be learnt. Literature reports positive effects on students' motivation, increased self-confidence and satisfaction, and more positive attitudes toward the subject matter (Lazarowitz & Hertz-Lazarowitz, 1998).
Extracurricular (ext). Extracurricular activities are not part of the standard lesson plan or classroom environment, yet are part of, or are strongly linked to, the school program. Examples include field trips, mobile science labs, summer camps, guest lectures, and visits to science centers. These types of interventions usually aim at raising curiosity (interest), encounters with role models/scientists (normality) and relatedness. Literature suggests that extracurricular activities can provide learners valuable and particularly motivational opportunities to learn science (Rennie & McClafferty, 1996).
It is worth noting which categories of teaching approaches (according to the Wise/Schroeder typology) remained empty in our research. In our final set, we did not find any research on Questioning Strategies, Focusing Strategies, Hands-on Strategies, Assessment Strategies, or Direct Instruction. It seems that the affective effects of these more cognitively oriented approaches have received little attention so far in research on attitudes.
Beyond the type of educational approach, many other factors will influence the affective outcomes of education (e.g., Myers & Fouts, 1992; Middleton & Spanias, 1999). It is desirable to include such influencing factors in the analyses, both because it can be informative to study their effects, and because inclusion of such factors can enhance the power of analyses by accounting for what otherwise would be unexplained noise. However, many relevant background characteristics are not routinely reported. With respect to the sample of participants, we expected that grade level and age, SES and gender could affect the outcomes. With respect to the intervention, we expected that more extensive interventions, and better prepared teachers (Slavin, Lake, Hanley, & Thurston, 2014) would have stronger effects. Furthermore, although our general question is about attitudes in science and mathematics, there are significant differences in popularity between biology and for example physics (cf., Osborne et al., 2003), and it was expected that the effectiveness of an educational approach may depend on the domain at hand.
A possible caveat when talking about innovative teaching approaches is that they are always innovative relative to a, mostly implied, "regular" practice, and this regular practice can vary across time and space. Context-based teaching, for instance, has become more or less mainstream in many countries nowadays, but not in others. To be included in our analyses, studies had to have a complete comparative design, so the teaching intervention is well-defined as the difference between the intervention and the control conditions. However, when synthesizing across studies, the definition of "regular" teaching might differ across studies and, given a lack of information and a limited number of studies, there is no way to control for that. This implies that our analysis has to rely on a linear additive effects model, ignoring the fact that innovations might work out differently for different regular teaching practices.
The purpose of the current review is to find out how the respective innovative teaching approaches in science and mathematics education affect (components of) attitude and achievement as defined in the previous sections. Specific questions are: 1) What is the impact of innovative teaching approaches on student attitudes? 2) What is the impact of innovative teaching approaches on student achievement? 3) Do the answers to questions 1 and 2 differ depending on duration of the intervention, grade level, preparedness of the teachers, type of innovative approach, or school subject? 4) Is there a tradeoff between promoting positive attitudes and promoting achievement?

Literature search process
The literature on attitudinal outcomes of science and mathematics education was searched broadly, canvassing the Web of Science, ERIC, PsycINFO, and Scopus databases from 1988 to 2014. As a quality criterion, we decided to limit our search to peer-reviewed literature. In a stepwise process we defined a query consisting of Attitude × Domain × Design keywords (for the full query, see the Methods supplement accompanying the online article). Since the initial query yielded over 15,000 publications, we refined the query by adding filters to limit the number of search results without excluding any usable studies. The first 100 hits in each database were used to verify that these additional filters did not result in loss of relevant publications. After deduplication, 6066 unique publications remained. The search in Scopus and the Web of Science yielded many more studies than in ERIC and PsycINFO, presumably because many studies on interest have been published in science and mathematics education journals, rather than educational or psychological journals (see Fig. 2).
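The structure of such a search can be illustrated with a minimal Python sketch. It assumes that "Attitude × Domain × Design" means that the keyword groups are combined with AND while the terms within a group are combined with OR, and that deduplication compares a normalized title plus publication year. The keyword lists, the record fields, and both helper functions are hypothetical; the actual query is given only in the Methods supplement.

```python
import re


def build_query(*keyword_groups):
    """Combine keyword groups with AND; terms inside a group are OR-ed."""
    clauses = []
    for group in keyword_groups:
        # Quote multi-word terms so they are searched as phrases.
        terms = [f'"{t}"' if " " in t else t for t in group]
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


def deduplicate(records):
    """Keep the first record for each (normalized title, year) pair."""
    seen = set()
    unique = []
    for record in records:
        title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
        key = (title, record["year"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical keyword groups, for illustration only.
attitude = ["attitude*", "interest", "motivation"]
domain = ["science", "mathematics"]
design = ["experiment*", "control group"]
query = build_query(attitude, domain, design)

# Hypothetical hits from two databases that describe the same publication.
hits = [
    {"title": "Effects of Inquiry on Attitudes.", "year": 2010},
    {"title": "effects of inquiry on attitudes", "year": 2010},
]
unique_hits = deduplicate(hits)
```

Matching on title and year alone is a deliberately simple choice; a real pipeline would probably also compare DOIs or author lists.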

Classification and selection
The classification and selection process was facilitated by organizing all publications, metadata and coding information in a database, such that we could keep track of each publication throughout the entire process. All authors participated in the coding team.
Step 1: Surface level screening inclusion and exclusion criteria. The purpose of the first step was to efficiently exclude a substantial part of studies that did not meet our inclusion criteria. This surface level screening was mostly done on the basis of reading titles and abstracts. Only in the few cases where essential study characteristics were not evident from title and abstract was the full text consulted. Inclusion criteria were (for coding guidelines see the Methods supplement accompanying the online article): 1. Curriculum-related intervention in science (physics, chemistry, biology) or mathematics education; 2. Participants are students in general primary or secondary education (Grades 1–12 in the US system); 3. Domain-specific attitude, interest, or motivation were assessed quantitatively; 4. The study design was experimental or quasi-experimental (control group and pre-posttest design).
After this screening 533 publications remained.
Step 2: Design check and categorization. In this step, full texts were consulted to verify that all statistical information required for inclusion in a meta-analysis was reported in the publication. For the purpose of the present report, only studies with an experimental or quasi-experimental design were included, and they had to report means, standard deviations, and sample sizes for both treatment and control, and for both pre- and posttest, or equivalent information. After this step 152 publications remained.
Step 3: Quality of attitude measurement and categorization of interventions. This step involves a more in-depth reading of the full text to assess the quality and type of the measurement(s) and the type of intervention. In order to be included, an attitude measurement scale had to assess domain-specific attitudes (e.g., "learning physics is fun", rather than "learning is fun", or "this lesson was fun"), and the quality of the instrument had to be validated in a previous publication, or reliability statistics and at least three sample items per scale had to be reported in the publication at hand. If a publication had at least one acceptable attitude scale, both the attitude and the intervention were categorized according to our framework, and statistical data were extracted. In some cases it turned out that the reported information did not allow for the computation of effect sizes after all. If the study reported achievement outcomes, these outcomes were also extracted. Moreover, study level moderating variables were extracted in the same pass. Unfortunately, many relevant moderating variables are not regularly reported, and information that is reported can be in many incompatible formats. The moderating variables that were reported sufficiently often to be included in our further analyses were: domain (biology, chemistry, physics, mathematics, or general science); time interval (days from first to last activity); whether the teachers had been trained on the approach; and grade level of the students. Since grade-level systems differ across countries, grade levels for non-US studies were converted to the grade levels for the corresponding age group in the US system (i.e., grade 1 starts at the age of 6).
In 11 eligible publications, multiple (independent) experimental conditions were compared. For eight of these (Alexander, Fives, Buehl, & Mulhern, 2002;Cebesoy & Akinoglu, 2012;Eskander, Bayrami, Vahedi, & Ansar, 2013;Hong & Lin-Siegler, 2011;Isiksal & Askar, 2005;Kara & Yes¸ilyurt, 2008;Ke & Grabowski , 2007;Tarim & Akdeniz, 2008) the different treatment conditions were assigned to the same intervention type in our analysis. Therefore, data for the multiple conditions were pooled into a combined estimate of the pretest and posttest means and standard deviations (Higgins & Green, 2011, Chapter 7). One study (Hwang, Shi, & Chu, 2011a) had one experimental condition and two control groups. However, the experimental and one of the control groups were both exposed to the same educational approach, although in a slightly different manner. Therefore, we pooled the data of the experimental group with one of the control groups. In two publications the multiple treatment conditions were so different that we analyzed them separately (Cinici, Sozbilir, & Demir, 2011;Peş man and € Ozdemir, 2012). In these cases, the control group was split in two, and each experimental condition was paired with half the control group data. In three publications (Barnett et al., 2004;Akçay, Yager, Iskander, & Turgur, 2010;Rosen, 2009) an experiment was done twice with different groups of participants, for instance, one in elementary and one in secondary school. We included these experiments separately in our analysis. Finally, the publication by Chiu (2011) reported on girls and boys separately. Since our research question does not make this distinction we included the combined results from boys and girls in our analyses.
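The pooling of two conditions into one combined group can be sketched as follows. The sketch assumes the standard formula for combining the sample size, mean, and standard deviation of two groups, as commonly given in systematic-review handbooks; the exact procedure used by the authors is not spelled out in the text. The function names are hypothetical, and splitting a control group by halving its sample size while keeping its mean and standard deviation is likewise an assumption about how "half the control group data" was implemented.

```python
import math


def combine_groups(n1, m1, sd1, n2, m2, sd2):
    """Pool two groups into one; returns (n, mean, sd) of the merged group."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Within-group sums of squares plus the between-group contribution.
    ss = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
          + (n1 * n2 / n) * (m1 - m2) ** 2)
    return n, mean, math.sqrt(ss / (n - 1))


def split_control(n, mean, sd):
    """Split one control group into two halves with unchanged mean and sd."""
    first = n // 2
    return (first, mean, sd), (n - first, mean, sd)


# Check against raw data: [1, 2, 3] and [4, 5, 6] merged into [1, ..., 6].
n, mean, sd = combine_groups(3, 2.0, 1.0, 3, 5.0, 1.0)
half_a, half_b = split_control(41, 10.0, 2.0)
```

The combined standard deviation equals the one that would be obtained from the merged raw scores, which is why the between-group term is needed in addition to the two within-group terms.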
After this step, 56 publications with 65 independent experiments were left for further analysis. Table 1 presents an overview of experiments by domain and intervention types. The interventions labelled "Other" could be further subdivided into metacognitive learning strategies (4×), use of representational formats (3×), and conceptual change approach (1×). (For an overview of individual experiments with study characteristics, see the Methods supplement accompanying the online article).

Quality of the coding process
In order to ensure the quality of the coding process, the team met regularly to discuss the interpretation of the code book rules for problematic cases. Consistency of the coding in steps 2 and 3 was verified by having one of the authors, who had not been involved in the initial coding for those steps, recode a randomly selected set of about 10% of the publications in each step. Interrater agreement, as expressed in Cohen's Kappa (Cohen, 1968), was found acceptable (Table 2).
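For readers unfamiliar with the statistic, the following Python sketch computes the unweighted form of Cohen's Kappa from two coders' decisions. It is an illustration only: the function name is hypothetical, and the text does not state whether the unweighted or the weighted variant was used.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters coding the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if p_e == 1:
        return float("nan")  # undefined when both raters use one category only
    return (p_o - p_e) / (1 - p_e)
```

Kappa equals 1 for perfect agreement and 0 when agreement is no better than chance.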

Statistical analysis
For each study the effect size per outcome variable was calculated as the standardized difference in mean change for the intervention and control condition:
$$d = c_p \, \frac{(M_{post,T} - M_{pre,T}) - (M_{post,C} - M_{pre,C})}{SD_{pre}}$$
with $M$ being the mean at pretest (pre) or posttest (post) for the treated (T) or control group (C). $SD_{pre}$ is the pooled pretest standard deviation for the intervention and control condition, and $c_p$ the correction factor for small sample sizes (Morris, 2008). Standard errors for the effect sizes were calculated as proposed by Morris (2008). We ran separate random-effects meta-analyses per outcome variable using inverse variance weighting to account for differences in sample size across studies. All calculations for effect sizes and standard errors were carried out in R 3.1.0 (code available upon request) and all meta-analyses were done with the Metafor package (Viechtbauer, 2010). In order to detect whether publication bias had skewed our sample, we looked at the variation in study outcomes as a function of sample size/standard errors. In an unbiased sample, studies with a large error are expected to have more varied outcomes than studies with a small error, but the mean outcomes should not be systematically different for small and large error studies. If accidental negative outcomes remain unpublished, or accidental positive outcomes get published, this would be expected to occur more frequently for the large error studies, which in turn would lead to systematically more positive mean outcomes for large error studies than for small error studies. Whenever a significant meta-analytic result was found for a specific outcome variable in a set of at least ten studies, we checked whether this kind of distortion had occurred using funnel plots and regression tests from the Metafor package in R (Egger, Smith, Schneider, & Minder, 1997; Viechtbauer, 2010). (For additional details about the analysis, see the Methods supplement accompanying the online article).
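The two computational steps described here can be sketched in Python as follows. This is not the authors' R code: the function names are hypothetical, and the pooling function uses the DerSimonian-Laird estimator of between-study variance as a simple stand-in, whereas the estimator actually used in the Metafor analyses is not stated in the text.

```python
import math

def morris_d(m_pre_t, m_post_t, m_pre_c, m_post_c, sd_pre_t, sd_pre_c, n_t, n_c):
    """Standardized difference in mean change (pretest-posttest-control design)."""
    df = n_t + n_c - 2
    # Pooled pretest SD across the treated and control groups.
    sd_pre = math.sqrt(((n_t - 1) * sd_pre_t**2 + (n_c - 1) * sd_pre_c**2) / df)
    # Correction factor for small sample sizes.
    c_p = 1 - 3 / (4 * df - 1)
    change_diff = (m_post_t - m_pre_t) - (m_post_c - m_pre_c)
    return c_p * change_diff / sd_pre

def pool_random_effects(effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird, assumed)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0
    w_re = [1 / (v + tau2) for v in variances]
    estimate = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return estimate, se, tau2, q
```

Each study's variance is the square of its standard error; a 95% confidence interval follows as the estimate plus or minus 1.96 times the returned standard error.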

Results
We will first report the overall meta-analyses for the effect of innovative teaching approaches on overall attitude and achievement. Next, we will zoom in on the specific attitude components, and on differential effects for specific types of teaching approaches.

Overall effects of innovative approaches on overall attitude and achievement
For 65 independent experiments the effects of an intervention on overall attitude were examined. Of these, 39 experiments reported only a single outcome for "overall attitude"; the other 26 experiments reported separate outcomes for specific subscales, which were combined into a single outcome per experiment for the present analysis. The forest plot in Fig. 3 shows the estimated effect size per experiment and the overall effect size and 95% CI over the 65 experiments. The results show an overall significant and large treatment effect, with d = 0.58 [0.37; 0.79], and p < 0.0001. However, the sample exhibited substantial heterogeneity, Q(df = 64) = 200.83, p < 0.0001. The experiment by Prokop, Tuncer, and Kvasničák (2007) and the four experiments by Akçay et al. (2010) strongly deviate from the general pattern. After removal of these studies, the effect remains significant, d = 0.35 [0.24; 0.47], p < 0.0001, and no significant heterogeneity remains (Q(df = 59) = 54.78, p = 0.63).
With regard to achievement, 42 experiments out of 65 measured achievement in addition to overall attitude. Since content and learning aims differ per study, there is no standard measurement instrument, and all we can draw conclusions about when synthesizing across studies is a "general" effect on achievement. For the 42 experiments, we found a significant and large effect on achievement, d = 1.07 [0.68; 1.47], p < 0.001 (Fig. 4). However, the sample exhibited substantial heterogeneity, Q(df = 41) = 199.29, p < 0.0001. Two studies (Acar Sesen & Tarhan, 2013; Prokop et al., 2007) were considered outliers and removed from the statistical analysis. Without these two studies, the effect remained significant and substantial (d = 0.78 [0.60; 0.97], p < 0.001), and some heterogeneity was left, Q(df = 39) = 59.15, p = 0.02. The latter implies that the sample of studies cannot be regarded as taken from a single normally distributed population of studies; part of the variance is likely to be accounted for by differences in underlying study characteristics. However, at I² = 34%, the heterogeneity is small enough to regard the overall effect as a meaningful outcome (Higgins & Green, 2011).
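The reported I² follows directly from Q and its degrees of freedom, as the short Python check below shows. It is a minimal sketch using the common definition I² = (Q − df)/Q, floored at zero; the function name is hypothetical.

```python
def i_squared(q, df):
    """Percentage of total variability attributed to heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Achievement outcomes after removal of the two outliers.
print(round(i_squared(59.15, 39)))  # 34
```

For the attitude outcomes after outlier removal (Q = 54.78, df = 59), Q is smaller than df, so the same function returns 0, in line with the absence of significant heterogeneity.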
Next, we verified whether there was an association between overall attitude effect and achievement effect at the study level. The correlation between the two was found nonsignificant, r = 0.26, N = 42, p = 0.091. After removal of the outliers (Acar Sesen & Tarhan, 2013; Prokop et al., 2007), the outcome remains nonsignificant, r = 0.17, N = 40, p = 0.30.

Effects of type of approach, domain, and background variables
To further assess causes behind these effects we performed a meta-regression with type of teaching approach, domain, timeframe (days from start to end of the intervention), grade level, and the presence of teacher training as predictor variables. The variance in each of the predictor variables was sufficient to make the analysis meaningful (timeframe:  imputed the average grade level of the other studies. For the five studies (which comprised six experiments) that did not report whether teacher training had taken place (Ke & Grabowski, 2007; Pilli & Aksu, 2013; Rosen, 2009; Uzuntiryaki & Geban, 2005; Yang & Tasi, 2010) we assumed absence of teacher training.
In Table 3 the result of the meta-regression analysis is presented for the set of studies excluding the experiments by Akçay et al. (2010). The data provide no evidence for one approach being more effective than the others and no evidence for differences between domains; teacher training and timeframe likewise do not explain significant amounts of variance. The negative effect of grade level is significant, however, indicating that the effectiveness of interventions decreases for older students. A similar meta-regression was conducted for achievement, but no significant effects were found.

Effects on specific attitude constructs
Next, we analyzed subsets of studies separately per subscale. No studies reported on personal relevance as a separate construct. For all other subscales an effect size could be computed. In all cases the data from Prokop et al. (2007) were removed as outliers. The resulting effect sizes are presented in Table 4 (forest plots are presented in the Results supplement accompanying the online article). The effects for General Interest and Career Interest are significant and positive. However, there are no significant differences between any of the subscales, so there is no ground to declare one of the effects larger than the others.

Publication biases
The findings in a meta-analysis might be distorted by publication bias. For each significant result obtained in the previous analyses we used funnel plots and regression analysis to assess whether the distribution of outcomes suggested a publication bias (plots are presented in the Results supplement accompanying the online article). After removal of the outliers we had already identified in our previous analyses, the remaining results are well in line with the expected funnel shape, and all regression tests were nonsignificant. From these analyses we find no indications for publication bias.
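The idea behind the regression test can be sketched with the classic formulation by Egger et al. (1997): each study's standardized effect (effect divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept that differs from zero signals funnel plot asymmetry. The Python code below is an illustration with a hypothetical function name; the regression test as implemented in Metafor may use a different model specification, so this is not necessarily the exact variant run for this study.

```python
import math

def egger_intercept(effects, std_errors):
    """OLS intercept (and its SE) of effect/SE regressed on 1/SE."""
    k = len(effects)
    if k < 3 or k != len(std_errors):
        raise ValueError("need at least three studies with matching errors")
    x = [1 / s for s in std_errors]                    # precision
    y = [d / s for d, s in zip(effects, std_errors)]   # standardized effect
    mx, my = sum(x) / k, sum(y) / k
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with k - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = math.sqrt(rss / (k - 2) * (1 / k + mx**2 / sxx))
    return intercept, se_intercept
```

The ratio of the intercept to its standard error can be compared against a t distribution with k − 2 degrees of freedom; a nonsignificant result, as reported here, gives no indication of asymmetry.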

Conclusion and discussion
The purpose of this study was to describe the effects of innovative teaching approaches on student attitudes and achievement in science and mathematics education. Based on previous reviews, and the interventions we found in the current sample of studies, we distinguished five types of commonly advocated educational approaches: context-based, inquiry-based, ICT-enriched, collaborative, and extra-curricular. From a theoretical stance, we consider attitude a multidimensional construct comprising: perceived relevance (personal and societal), interest (school, leisure, career), self-efficacy, and normality of scientists. However, many experimental studies report a global aggregate measure rather than scores for each specific subscale, and it would imply a substantial loss of information if we would exclude all these studies. Moreover, an "overall attitude" construct may have its own merits since study behavior and study choice are influenced by a large number of different attitudes. Therefore, in this review we adopted a hierarchical framework to allow both fine-grained and global analyses.
In  Hattie (2009). For the other attitude components uncertainties were large, which might be partially explained by the small number of studies per specific attitude construct.
In answer to the second research question, about the effects of innovative teaching approaches on achievement, we found a significant and large effect (d = 0.78 [0.60; 0.97]). This value is considerably larger than the effects for attitude, and well above the typical value of 0.4 for achievement as reported by Hattie (2009). However, in interpreting the effect size for achievement, it should be taken into account that achievement measures were not subjected to the same rigorous criteria we applied to the attitude scales. Standardized achievement tests are rare in this corpus of studies, so each study has its own achievement test specifically designed for the content at hand, and it might be that in some cases the achievement test was biased toward the innovative group (see Slavin & Madden, 2011).
To answer the third research question about the differential effects of different types of interventions, we conducted a meta-regression with type of teaching approach, duration, teacher preparation and domain as independent variables. The overall attitude construct was the only one with a sufficient number of studies to perform this analysis. We found a significant negative effect for grade level, which implies that the attitudes of older children are harder to influence. There were no significant effects for type of teaching approach, duration, teacher training, or domain.
To answer the final research question about a tradeoff between promoting attitude and promoting achievement, we tested whether studies that were more successful in promoting positive attitudes would also be more successful in promoting achievement. However, no correlation was found between the effect of an intervention on attitude and its effect on achievement.
We had expected different effects for different types of teaching approach and stronger effects for longer lasting interventions and interventions with teacher training, but no such effects were found. The fact that we did not find any significant differences between approaches could be interpreted in several ways. We might conclude that any innovation is good, or that any positive effects are just novelty effects. In our view, a more likely interpretation is that, rather than type of approach as defined in this study, it is the quality of the content and the implementation that matters. A context can be dull or inspiring; productive for content learning or just superficial; an inquiry assignment can be confusing or stimulating; and a teacher can amplify the message of an approach or interfere with it. Such features might be essential for the effects of a teaching approach, and they certainly are essential for everyone who intends to replicate the experiment, or implement the proposed approach. As a further complication, the effect of a teaching innovation might be different for different baseline situations. If such interactions are strong, the entire concept of generalizability would be at stake. Our relatively homogeneous data do not point in this direction, but at least more detailed information about the teaching in the "regular" or "control" conditions would be needed. Unfortunately, with the notable exception of Girod, Twyman, and Wojcikiewicz (2010), details about the teaching in either condition were seldom reported in the studies we encountered. Likewise with regard to duration and teacher preparation, many studies provide only limited information, so timeframe had to be used as a proxy for the extensiveness of the intervention, and for teacher preparation, all studies with some form of teacher preparation were coded as "yes". However, it should be expected that teacher training needs to be substantial in order to be measurably effective in student achievement and attitude (Hofstein, Kesner, & Ben-Zvi, 1999).
Some studies reported results much more positive than the rest. They were regarded as outliers and removed from the statistical analysis. A deviant result could have many reasons, but we still consider it useful to analyze these studies in more depth to see whether the particular conditions or interventions could offer a likely explanation for the deviant outcomes. The intervention by Akçay et al. (2010), conducted in Turkey, involved a comprehensive student-centered STS curriculum. They ran four parallel experiments in grades 6 to 9. The timespan of the intervention was 180 days, which is considerably longer than the average of the other studies in our sample (94 days) but not extreme (the longest intervention in the sample lasted 2 years). What makes the Akçay et al. study really stand out is the extensive teacher training program (4-week summer camp, participation in the design, several 3-day return meetings etc.).
At the other extreme of the time scale, Prokop et al. (2007), in an experiment with 6th graders in Slovakia, found strong and lasting effects on all types of interest in biology for a one-day (10 h) field trip in which students visited different ecosystems, captured animals and plants, identified their species, and wrote their diaries. Although the time span is very short, it seems that this intensive immersion in nature left a lasting impression on many students. It may also be a special affordance of the ecology domain that it lends itself to such an approach.
Unfortunately, we had to exclude many studies because basic information was missing: instruments were used without traceable validity information, or means and standard deviations were partially missing. The same holds for several moderating variables. Many authors before us have ended their studies with the wish that future researchers will choose among the few well-established attitude instruments, rather than keep coming up with new home-baked instruments. Furthermore, as a minimal set of study characteristics, we would propose that studies need to report composition of the class in terms of age, gender, cultural background, and SES; grade level; educational track; time frame and actual duration of student activities; time frame and duration of teacher preparation. However, we are convinced that such statistical parameters are not enough, and not even the most important information to predict the effectiveness of educational innovations. We would like to join Furtak et al. (2008) in their call for a fuller description of educational interventions, such that the reader would be able to replicate the intervention or at least judge both the type and quality of the intervention herself.
With regard to the practice of teaching, recommendations can only be tentative. First, since younger children have more positive attitudes toward science and mathematics, and since our findings indicate that the attitudes of younger children are more malleable, it seems worthwhile to start early on with fostering and maintaining positive attitudes, rather than to try to repair them after the decline in attitude has begun. This is especially true because interest at the onset of secondary school has been found to be the major predictor of interest at the end of school (Sadler, Sonnert, Hazari, & Tai, 2012). Second, since we found modest positive effects on overall attitude, and positive effects on achievement, with no significant difference between approaches, we conclude that there is no evidence that should discourage teachers from investing their best efforts in implementing high quality innovative approaches that suit their learning aims and that appeal to them most.