Integrating Research into Instructional Practice: The Use and Abuse of Meta-Analysis

This article sketches a broad picture of meta-analysis, a technique for quantitatively summarizing research studies. Its overall purpose is to guide researchers and practitioners working in the instructional media and technology field in future research design and decision making for instructional development. The article has five main sections: 1) a general introduction to the other sections, describing the place of meta-analysis within the educational technology field; 2) a discussion of the reasons for and the nature of integrative review as a whole, plus details of some of the objections in principle and in practice to quantitative synthesis; 3) a description of the process of conducting a meta-analysis along with a discussion of major methodological issues; 4) an example of three meta-analyses produced on one instructional treatment (i.e., mastery learning), and an explanation of the different results obtained in each case; and 5) a table of 26 meta-analyses on studies of instructional variables along with guidance on how to read and interpret them.


INTRODUCTION
In 1983 Richard E. Clark stunned many people in the media and technology field by declaring that instructional media have no more effect on student learning and achievement than a delivery truck has on the quality of goods it transports to market. Both, he argued, are essentially neutral carriers of their respective contents. His claim extends from televised instruction through to more recent applications of computer-based learning.
Canadian Journal of Educational Communication, Vol. 19

Clark's characterization of television as a neutral medium did not come as a particular shock to most, because of the results of experiments in the 1950s and 1960s in which findings of no significant difference for TV treatments abounded (Saettler, 1968). But to challenge the literature of computers in education (see Clark, 1985a, 1985b) was to contradict both intuition and the prevailing research evidence. A flurry of comments and counter-comments in the literature (e.g., Petkovitch & Tennyson, 1984) and at conferences followed Clark's article for several years.
Clark's claim was based in part on an evaluation of several meta-analyses that have appeared in recent years on the effectiveness of computer-based instruction. In particular he argued that these quantitative summaries (see Table 2 at the end of this article for references) were fundamentally flawed, because a variety of experimental artifacts (among them the novelty effect associated with the treatment itself) had not been factored out of the results.
This article is about meta-analysis and its usefulness to practitioners for planning and predicting the outcomes of instructional treatments and to researchers for conceptualizing future research efforts. Meta-analysis, also referred to as quantitative synthesis, is a general set of procedures for combining the results of many individual research studies addressing a single question (Glass, 1976, 1978). The technique has grown out of a need in the social sciences to capture the essence of ever expanding research literatures and to provide definitive answers, in terms of the magnitude of effectiveness, to the bigger questions posed by theoreticians and practitioners. In addition, meta-analysis attempts to circumvent the subjectivism commonly associated with narrative forms of literature review and the limitations ascribed to the box score or vote count technique (Kavale, 1984). However, meta-analysis is not without its critics. Researchers disagree about both the underlying premises of the technique and the procedural issues relating to its implementation.
This article examines meta-analysis as a technique for reviewing literature, with a particular focus on the literature of instructional techniques, methods and strategies. Some of the main issues on both sides of the "meta-analytic debate" are examined, for the purpose of judging its usefulness to educational technology and specifically its potential as a tool for designing instruction.
A case where great controversy has arisen over meta-analytic findings will then be reviewed in detail: the debate for and against the use of mastery learning. Finally, some guidance for reading and interpreting meta-analyses will be provided. An appendix to this article includes references to meta-analytic studies of instructional variables and strategies that are likely to be of interest to the educational technologist.

Why Integrate Research Studies?
It has long been recognized that the result of a single research study by itself is far from conclusive, even when the finding supports the hypotheses under consideration. Therefore, it has been common practice for researchers to review the literature of all such studies, whenever enough are available. It is not uncommon, in fact, to see the same question asked and answered in reviews every couple of years, as new studies add to the weight of evidence that can be brought to bear on a particular question.
The value of integrative reviews stems from limitations that are inherent in the research process itself. Since few studies of educational phenomena and even fewer studies of instructional methods actually draw subjects at random from a population, integrative reviews of many similar studies serve to provide greater coverage of the population. The need for wider coverage is increased when one realizes that individual samples suffer from the same problems of error that are involved in testing the null hypothesis within a study. Even when a treatment effect is absent, about five findings of significant differences out of every one hundred studies run can be expected (i.e., when α equals .05), simply as a result of chance. Integrative reviews, therefore, provide a means of overcoming the effects of chance fluctuation within samples, leading to a more generalizable conclusion concerning an effect.
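The chance-finding argument can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: scores are drawn from a normal distribution with a known standard deviation of 1, both groups come from the same population (no true effect), and a two-tailed z-test is used. The function name and parameters are hypothetical and do not come from the article.

```python
import random
from statistics import NormalDist, mean


def count_chance_findings(n_studies=100, n_per_group=30, alpha=0.05, seed=0):
    """Count 'significant' two-group comparisons when the true effect is zero."""
    rng = random.Random(seed)
    # Two-tailed critical value (about 1.96 when alpha = .05).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference between two means, assuming sigma = 1.
    se = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_studies):
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(treatment) - mean(control)) / se
        if abs(z) > z_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_chance_findings(), "of 100 null studies reached significance")
```

Over many simulated studies the proportion of spurious "significant" results should settle near α, which is why a handful of positive findings in a large literature is weak evidence on its own.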

Methods of Integrating Findings
Light and Smith (1971) provide a typology for categorizing most reviews of research in the social sciences. The first type of review involves listing or describing factors which have produced significant differences in at least one study. The style of this type of review is primarily narrative. In the second type only studies that support a particular point of view are presented. Most of the brief reviews of literature at the beginning of research articles are of either the first or the second type. A third type involves summarizing the findings of many studies using what has come to be called the vote count or box score technique. A simple count of studies reporting positive, negative or no significant results is conducted and a verdict is reached when a plurality of votes exists. The last type, reviews in which effect sizes are aggregated across many studies, is the category in which meta-analysis resides.

The third and the fourth types of reviews are both quantitative in nature. The box score or vote count technique, however, has been criticized because it fails to take into account the effects of differential sample size on the sensitivity of the null hypothesis test. Larger samples require smaller mean differences to establish significance than smaller samples, although they are given equal credence in this technique. Box score analysis also does not take into account the magnitude of differences or relationships, or the quality of the study. Meta-analysis, the subject of this article, was developed by Glass (1976, 1978) to overcome the difficulties inherent in descriptive reviews and the problems associated with using statistical indices (e.g., r) to reflect differential treatment effects or relationships among variables.
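The sample-size criticism of vote counting can be made concrete. The following sketch assumes two equal-sized groups, a known standard deviation of 1 and a two-tailed z-test; under those assumptions it computes the smallest standardized mean difference that would reach significance. It also includes a bare-bones vote count. All names are hypothetical and chosen for illustration only.

```python
from statistics import NormalDist


def smallest_significant_difference(n_per_group, alpha=0.05):
    """Smallest standardized mean difference that reaches significance
    in a two-tailed z-test with two groups of n_per_group subjects each."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit * (2 / n_per_group) ** 0.5


def vote_count(outcomes):
    """Tally studies by reported outcome: 'positive', 'negative' or 'none'."""
    tally = {"positive": 0, "negative": 0, "none": 0}
    for outcome in outcomes:
        tally[outcome] += 1
    return tally


if __name__ == "__main__":
    for n in (20, 200):
        print(n, round(smallest_significant_difference(n), 3))
    print(vote_count(["positive", "none", "none", "negative"]))
```

A study with 200 subjects per group registers a "positive vote" for a difference roughly one third the size needed with 20 per group, yet both votes count equally in a box score.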

Objections to Quantitative Synthesis
Objections in principle. Complaints about quantitative synthesis range from the purely philosophical to the purely methodological. On the one hand, there are the arguments raised by advocates of qualitative and naturalistic approaches to enquiry (e.g., see Guba, 1979). Guba's objections in principle to "reduction by numbers" apply doubly to quantitative synthesis, since the distillation of many studies removes the researcher one step further from the "texture" of the original setting. This, of course, is an objection that cannot be overcome by improving the theory or practice of quantitative synthesis. Rather, one must accept or reject this argument based on other criteria which have been laid out and vigorously defended by both sides.
The second major objection applies not to quantitative analysis in general, but to quantitative synthesis in particular. Eysenck (1984) argues that what is lost when narrative review is transformed to quantitative review is the exercise of scientific judgment over what is, in nearly all areas of research, a complex set of interacting variables. He states that "No simple addition of diverse and incommensurate studies can serve the purpose of drawing meaningful conclusions from heterogeneous and complex data. That requires experience, knowledge and the intangible quality we call good judgment" (p. 47).
Evidence that paints a rather different picture of actual review practice is supplied by Jackson (1980). He examined a random sample of narrative reviews from the social sciences and found that decision rules were so often unstated in these reviews that it was difficult to describe them, much less evaluate their quality. Meta-analysis is often touted as the antidote to the subjectivism that appears to be endemic to the process of describing research outcomes verbally.

Objections in practice.
Hardly anyone from within the quantitative research community argues about the need to synthesize the results of large bodies of research literature. It is readily acknowledged that when the literature exceeds even a dozen studies, the ability of reviewers to capture its essence in narrative form is diminished. Quantitative synthesis, then, as a principle for dealing with substantial literature bases is not challenged. In addition, it is generally acknowledged that there is nothing objectionable in the statistical underpinnings of meta-analysis, given that they derive from the wealth of statistical experience that has developed over many decades. It appears, then, that the objections arising from the research community derive more from the practice of meta-analyzing than from the principle of meta-analyzing. Slavin (1984) and others have argued that one of the chief problems inherent in much of the meta-analytic literature that has appeared since its introduction by Glass is the uncritical combining of studies that have little more in common than the underlying question (is the treatment better than the control?). It is not surprising that many practitioners have adopted this strategy and that the literature reflects this tendency, since Glass had originally suggested subjecting all available studies to meta-analysis.
In large measure, this point is at the heart of Clark's objections to the meta-analyses on computer-based learning. In this particular case, according to Clark, the unconsidered effects of treatment artifacts produced an overestimation of effect size for this medium of instruction, although it could have just as easily gone the other way. In any case it is generally agreed that lumping all possible forms of treatment and methodological variations into one analysis probably leads to more confusion than clarity.
It has been argued, moreover, that one of the strengths of meta-analysis, its tendency towards summary conclusions, is also one of its weaknesses (Guskin, 1984). Since the research question being raised in a meta-analysis is often dichotomous or at best one of simple relationship, the variety of more complex findings that may have appeared in the original articles is reduced. The fear has been expressed that consumers of meta-analyses may come away with nothing more than unqualified statements such as "computer-based instruction is better than traditional lecture-based methods" or "the correlation between prior achievement and instructional support is moderately high and positive". It is certainly arguable, however, that for consumers who would not bother to digest more subtle forms of summaries, a simplistic view of the state of a research literature is better than no impression at all.
In the following sections, procedures for conducting a meta-analysis are described. Issues related to each procedure will be discussed to highlight both the potential and the problems associated with the technique.

Defining the Scope of the Analysis
A first important decision to be made after a general area of research has been identified is how extensively the search will be conducted and what descriptors will be used in reviewing the literature. While this sounds like a relatively straightforward process, it is usually not. Often this step involves making literally dozens of a priori decisions about what will be included (and not included) in the meta-analysis.
Each decision will narrow the field of search, as well as the number of studies identified and the population to which generalizations can be made.
For example, in a synthesis of mastery learning studies one might consider features such as: a) how far back in time the review will go; b) the grade level of subjects; c) the subject matter tested; d) the duration of treatment; e) whether self-paced or group-based treatments are used; f) the type and quality of the dependent variable; and g) a host of experimental design characteristics (e.g., internal and external validity). Carlberg and Walberg (1984) point out trade-offs in: a) narrowly focusing the synthesis to exclude relevant variations in treatments (high fidelity/limited conclusions); and b) making the scope of inclusion so broad that marginally relevant and/or bad research is analyzed (low fidelity/more robust conclusions).
Advice on both sides of this issue has been offered in the literature of meta-analysis. Glass, McGaw and Smith (1981) argue for the widest inclusion criteria possible in order to reduce the effects of reviewer bias in the selection process. Eysenck (1978) has criticized this approach as "garbage in, garbage out".
As a way of accommodating this criticism, Slavin (1986, 1987) has proposed an approach called "best-evidence synthesis". This approach is based on the legal notion that ". . . the same evidence that would be essential in one case might be disregarded in another because in the second case there is better evidence available" (1986, p. 6). In the case of research review this means that only the best quality literature should be used in judging the general state of a research question: those studies which are high in methodological rigor and best manifest the characteristics under study. In the absence of studies of better quality, this could involve having to use less well designed studies, which would nevertheless comprise the best evidence available. Objections to the use of this approach have been raised by Guskey (1987), who counters that the "best" in best-evidence synthesis is itself subjective and does not necessarily eliminate bias from the review.

Abrami, Cohen and d'Apollonia (1988) take a middle approach, between that advocated by Glass and that advocated by Slavin:

. . . we urge greater care in describing the inclusion criteria and in detailing the reasons for excluding individual studies. But we also consider that reviews sometimes go beyond describing the substance of the literature to consider the methodological problems and generalizability concerns that distinguish the best evidence from other evidence. Reviews may thus contribute to knowledge in an area through the analysis of study weaknesses as well as strengths. Such a contribution cannot be made through only the analysis of best evidence (p. 164).

Reviewing the Literature
Once inclusion criteria have been established, the approach to locating studies for review is not substantially different from that used in other forms of integrative review. Primary studies may be located from a variety of sources, some of which are accessible through computer-generated searches. Most meta-analyses include the literature from relevant journals in the field. Others include theses and dissertations, conference presentations, technical reports and in-house manuscripts, chapters in books and monographs and other documents referred to categorically as "fugitive material".
Even when inclusion and exclusion criteria have been soundly determined, there remains the thorny problem of actually sorting studies by the established criteria. This process is by no means straightforward, as Abrami, Cohen and d'Apollonia (1988) have demonstrated using data from the literature on the validity of student ratings of instructors. They found that even when inclusion criteria were very clearly specified, seven expert raters had an average comprehensiveness index of only .58 (i.e., ratio of correctly included studies to incorrectly included or excluded studies), with individual indices ranging from .13 to .88. They make recommendations for enhancing the agreement among raters, along with suggestions for improving meta-analysis methodology at four other stages in the process.

Identifying Variables for Study
Unless the researcher is very familiar with the primary literature under study, it is advisable to select a sample of studies for the purpose of determining variables that will be subsequently coded for analysis. The purpose of this exercise is to determine which variables, in addition to the primary distinction under study, have been most commonly reported in the literature. These additional variables may serve to aid in generalization or may actually form the basis for tests of significance in their own right. In the following sections these variables, under commonly encountered headings, are discussed.
Demographic variables. Among other things, these include variables related to the nature of the experimental sample under study (e.g., sex, grade level, SES).
Treatment variables. Included in this category are characteristics of the treatment condition, for instance, type of treatment, duration of treatment, location of treatment and experimenter characteristics.
Design variables. Variables falling into this category are those associated with the nature and quality of the experimental manipulation. Examples include presence of experimental control, randomization and selection, presence of pretest, nature of dependent measures, and specific threats to internal and external validity.

Once these variables have been identified, they are coded for each study using a scheme that is similar to that shown in Figure 1 (see page 178).
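One way to picture such a coding scheme is as a fixed record filled in for each primary study. The sketch below is a hypothetical illustration, not a reproduction of Figure 1: the field names and the example values are assumptions chosen to mirror the three headings above.

```python
from dataclasses import dataclass, asdict


@dataclass
class CodedStudy:
    """One coded primary study (all field names are illustrative assumptions)."""
    study_id: str
    # Demographic variables
    grade_level: str
    ses: str
    # Treatment variables
    treatment_type: str
    duration_weeks: float
    # Design variables
    random_assignment: bool
    has_pretest: bool
    standardized_measure: bool
    publication_source: str


# A made-up record, used only to show how a study would be coded.
example = CodedStudy(
    study_id="hypothetical-001",
    grade_level="secondary",
    ses="mixed",
    treatment_type="group-based mastery learning",
    duration_weeks=6.0,
    random_assignment=True,
    has_pretest=True,
    standardized_measure=False,
    publication_source="journal article",
)

if __name__ == "__main__":
    print(asdict(example))
```

Keeping every study in the same record shape makes it straightforward to group effect sizes later by any coded characteristic.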

Calculating Effect Size
The estimate of the strength of a treatment, called an effect size, is calculated using a relatively simple procedure. For difference questions, the means of treatment and control groups are ascertained, and the control mean is subtracted from the treatment mean. Naturally, if this difference has a positive sign, it indicates that treatment subjects have outperformed control subjects, while a negative sign indicates the reverse. This raw difference is not enough, however. It must be standardized so that other studies investigating the same variable may be averaged with it. The meta-analytic researcher accomplishes this by dividing the raw difference by an estimator of σ, the standard deviation of the control group (for Glass). Other formulae for deriving effect sizes in studies that do not contain some of the elements listed above have been presented by McGaw and Glass (1980). Formulae are also available for obtaining effect sizes when transformed scales are used (e.g., gain scores), when factorial designs are used or when dependent measures have been adjusted by a covariate. The standardized difference is a z-score of sorts*: a standardized metric which represents the number of standard deviations by which the treatment condition has outperformed the control condition (or underperformed, if the sign is negative). All of the effect sizes in the study are then averaged (to produce a mean effect size) or the median of the distribution is reported. Figure 2 (see page 179) shows how the difference between the two theoretical distributions of control and treatment may be shown graphically, and then represented as an actual distribution of effect sizes.
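The basic calculation can be sketched in a few lines. The example below assumes raw scores are available for both groups and standardizes by the control group's sample standard deviation, as described above for Glass; the function name and the scores are hypothetical.

```python
from statistics import mean, median, stdev


def glass_effect_size(treatment_scores, control_scores):
    """(treatment mean - control mean) / control-group standard deviation."""
    raw_difference = mean(treatment_scores) - mean(control_scores)
    return raw_difference / stdev(control_scores)


if __name__ == "__main__":
    # Made-up scores for three hypothetical studies.
    studies = [
        ([12, 14, 16], [8, 10, 12]),
        ([55, 60, 65, 70], [50, 58, 62, 70]),
        ([30, 31, 35], [29, 33, 37]),
    ]
    effect_sizes = [glass_effect_size(t, c) for t, c in studies]
    print("effect sizes:", [round(es, 2) for es in effect_sizes])
    print("mean effect size:", round(mean(effect_sizes), 2))
    print("median effect size:", round(median(effect_sizes), 2))
```

A positive value means the treatment group outperformed the control group; a negative value means the reverse.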
*Z-scores are calculated within a distribution of raw scores using the following formula: $z = (\text{score in the distribution} - \text{distribution mean}) / \text{distribution standard deviation}$. Since the distribution of z-scores is in unsquared deviation units, its mean is always 0 and its standard deviation is always 1.0. This is not the case with an effect size distribution, since the mean and standard deviation for each study included are different. The mean of the distribution may be either positive or negative and represents the average standardized difference between sample means.
At this stage, the homogeneity of the effect size distribution is considered.
Homogeneity of effect size is the assumption that the variance surrounding an effect size mean is small enough for the mean to be considered a valid representation of the phenomenon in the population, and is similar to the assumption of homogeneity of variance in experimental design. The test is accomplished by comparing a calculated sum of squares to the chi-square distribution (with k - 1 degrees of freedom, where k is the total number of effect sizes).
In some meta-analyses that are published, this test is either not performed or not reported, although it is critical to interpreting the average effect size.
The finding of an effect size of .94 with a standard deviation of 1.91 in a mastery learning meta-analysis (Lysakowski & Walberg, 1982) is probably an example of too much variability, considering the magnitude of the mean. Similarly, Guskey and Pigott (1988) reported a violation of homogeneity for mastery learning, χ² = 759.5 (df = 77), p < .001, and avoided calculating a measure of central tendency for the set of mastery learning studies.
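The sum of squares used in the homogeneity test can be sketched as follows. This is a minimal illustration that assumes the inverse-variance weighted form associated with Hedges and Olkin (1985), with the large-sample variance of a standardized mean difference and k - 1 degrees of freedom; the function names are hypothetical, and the resulting Q would still need to be compared with a chi-square table.

```python
def effect_size_variance(d, n_treatment, n_control):
    """Large-sample variance of a standardized mean difference."""
    n_total = n_treatment + n_control
    return n_total / (n_treatment * n_control) + d ** 2 / (2 * n_total)


def homogeneity_q(effect_sizes, variances):
    """Return (Q, degrees of freedom, weighted mean effect size)."""
    weights = [1 / v for v in variances]
    weighted_mean = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)
    # Weighted sum of squared deviations from the weighted mean.
    q = sum(w * (d - weighted_mean) ** 2 for w, d in zip(weights, effect_sizes))
    return q, len(effect_sizes) - 1, weighted_mean


if __name__ == "__main__":
    # Made-up effect sizes and group sizes for four hypothetical studies.
    ds = [0.2, 0.9, 1.6, -0.1]
    ns = [(30, 30), (25, 28), (40, 35), (50, 50)]
    vs = [effect_size_variance(d, nt, nc) for d, (nt, nc) in zip(ds, ns)]
    q, df, d_bar = homogeneity_q(ds, vs)
    print(f"Q = {q:.2f} on {df} df, weighted mean = {d_bar:.2f}")
```

A Q value far above its degrees of freedom, as in the Guskey and Pigott result quoted above, signals that a single average effect size would misrepresent the set of studies.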

Using Inferential Statistics
If homogeneity of effect size is violated, it is recommended (Hedges & Olkin, 1985) that effect sizes be separated into subsets by coded characteristics until homogeneity is achieved. While similar to the statistical procedure just described, this test is equivalent to the F-test among groups in a one-way experimental design. Guskey and Pigott followed this procedure for all studies selected for inclusion, on subject area, grade level of students and duration of study. For those studies which reported them, program characteristics, gender, initial ability level of the students and extent of teacher training were investigated in an attempt to isolate models of study characteristics that would explain the lack of homogeneous findings.
Even when the homogeneity assumption is met, most meta-analyses report tests of significance across coded variables to enhance the findings and explore other dependencies that may exist in the data. Let's say, hypothetically, that the average effect size for a study is .60, but when the sample is categorized by sex, women (ES = .80) improve more than men (ES = .60). This suggests that women may be affected by the treatment more than men. In a sense it is an interaction term relating sex, as an independent variable, to the average difference between treatment and control. The test of significance is analogous to ANOVA in that total variation is partitioned into between-class and within-class components for the purposes of comparison (Hedges & Olkin, 1985). One should always resist a causal interpretation of comparisons like this, however, since no random assignment to treatments is involved.
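The ANOVA-like partition can be sketched with the same weighted sums of squares used for the homogeneity test. The code below is an illustration under assumptions: inverse-variance weights, hypothetical function names, and made-up effect sizes grouped by a coded variable. It is not a reproduction of any analysis cited in this article.

```python
def weighted_mean(pairs):
    """Inverse-variance weighted mean of (effect size, variance) pairs."""
    weights = [1 / v for _, v in pairs]
    return sum(w * d for w, (d, _) in zip(weights, pairs)) / sum(weights)


def weighted_ss(pairs, center):
    """Weighted sum of squared deviations of effect sizes from center."""
    return sum((d - center) ** 2 / v for d, v in pairs)


def partition_q(groups):
    """Split total variation into between-class and within-class parts.

    groups maps a class label to a list of (effect size, variance) pairs.
    """
    all_pairs = [pair for pairs in groups.values() for pair in pairs]
    grand_mean = weighted_mean(all_pairs)
    q_total = weighted_ss(all_pairs, grand_mean)
    q_within = sum(weighted_ss(p, weighted_mean(p)) for p in groups.values())
    # Between-class term: each class mean weighted by its total weight.
    q_between = sum(
        sum(1 / v for _, v in p) * (weighted_mean(p) - grand_mean) ** 2
        for p in groups.values()
    )
    df_between = len(groups) - 1
    df_within = len(all_pairs) - len(groups)
    return q_total, q_between, q_within, df_between, df_within


if __name__ == "__main__":
    # Made-up effect sizes, each with variance 0.05.
    data = {
        "women": [(0.9, 0.05), (0.7, 0.05)],
        "men": [(0.5, 0.05), (0.3, 0.05)],
    }
    print(partition_q(data))
```

The between-class component is the part that would be referred to a chi-square distribution to judge whether the classes differ; as the text cautions, a significant result still does not license a causal reading.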
An actual example of this comes from a study of teacher feedback on homework assignments (Paschal, Weinstein & Walberg, 1984). Homework was found to be more effective in the fourth and fifth grades for improving achievement than in upper elementary or high school. Also, when graded versus ungraded homework was compared, a substantial difference emerged. Graded homework produced an effect size of .80, while ungraded homework influenced achievement by only .36 standard deviations. Both of these characteristics of the sample produced significant differences when tested using the procedures outlined above.
In addition, researchers would be interested in whether there is a difference among a variety of methodological and demographic aspects of the studies under consideration. This amounts to searching for bias in the variables coded under threats to internal and external validity, in publication sources (such as articles, book chapters, dissertations and ERIC documents) and in other variables that may reside concomitantly with treatments. Not surprisingly, higher effect sizes are often found for published over non-published works, since journals usually accept studies that apply more rigorous methods and often reject studies reporting no significant differences.
Where a quantitative scale is involved, regression analysis can be used to test the relationship between increasing or decreasing levels of some continuous independent variable and its accompanying dependent variable. A good example of this comes from Glass and Smith (1979). They investigated the effects of differing class sizes (i.e., number of students being taught at a time) and the cognitive achievement associated with them. Eighty studies were gathered and class size (a quantitative scale) was regressed against achievement (also a quantitative scale). Results indicated that achievement differed by .50 standard deviations between a class size of 1 (i.e., individualized tutorial instruction) and a class size of 40, with the smaller class sizes showing the higher achievement. However, the relationship was not completely linear. The greatest change occurred between class sizes of 1 and 20, beyond which the curve flattened into almost a straight line. This suggests that for class sizes over 20, further changes in class size make little difference to individual achievement. This study was one of the first large-scale meta-analyses and its results have been widely discussed as an example of both good and bad (e.g., see Slavin, 1984) meta-analytic practice.
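A regression of this kind can be sketched with ordinary least squares. The example below is a hedged illustration only: the data points are made up (they are NOT values from Glass and Smith, 1979), and fitting achievement against the logarithm of class size is an assumption adopted here to capture a curve that flattens at larger class sizes.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical values for illustration: achievement (in standard deviation
# units) is highest in the smallest class and levels off beyond about 20.
class_sizes = [1, 5, 10, 20, 30, 40]
achievement = [0.50, 0.28, 0.18, 0.08, 0.03, 0.00]

slope, intercept = fit_line([math.log(c) for c in class_sizes], achievement)

if __name__ == "__main__":
    for size in (1, 20, 40):
        predicted = intercept + slope * math.log(size)
        print(size, round(predicted, 2))
```

Because the predictor is logarithmic, the fitted change between class sizes of 1 and 20 is much larger than the change between 20 and 40, which is the pattern described in the text.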

Interpretation and Reporting
When they are completed, meta-analyses, unlike the individual samples summarized within them, are thought to approximate the population of subjects from which the original studies were drawn. In fact, meta-analyses often include literally tens of thousands of subjects, assumed to have been originally drawn from the same population before random assignment. When treatment effects are present, two populations are actually involved, one treated and one untreated. The effect size estimates the standardized difference between these populations.
Figure 2 shows this comparison in graphic form. An effect size of 1.0 means that the treatment population has outperformed the control population by one population standard deviation. Often the effect size is converted to a percentile rank to enhance interpretability. An effect size of 1.0 is equivalent to the 84th percentile in a normal distribution, meaning that the average treatment condition subject is above 84% of subjects in the control condition.
Since one of the purposes of meta-analytic studies is to allow for comparison among potentially useful instructional treatments, some additional form of interpretation of average effect size is desirable. A non-technical interpretation of low, medium and high effect sizes has been suggested by Cohen (1969). Small effect sizes (e.g., .20 or the 58th percentile) are similar to those associated with comparisons among the heights of 15 and 16 year old girls. Medium effect sizes (e.g., .50 or the 69th percentile) would be similar to differences between 14 and 18 year old girls. Large effect sizes (.80 or the 79th percentile) are of the order of magnitude of differences in IQ between holders of Ph.D. degrees and the average college freshman.
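The percentile conversions quoted above follow from the standard normal distribution, assuming control-group scores are normally distributed. A minimal sketch (the function name is hypothetical):

```python
from statistics import NormalDist


def effect_size_to_percentile(effect_size):
    """Percentile of the control distribution reached by the average
    treated subject, assuming normally distributed scores."""
    return NormalDist().cdf(effect_size) * 100


if __name__ == "__main__":
    for es in (0.20, 0.50, 0.80, 1.00):
        print(es, round(effect_size_to_percentile(es)))
```

An effect size of 0 corresponds to the 50th percentile, that is, no advantage for the average treated subject over the average control subject.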

Description of Mastery Learning
Although the concept of mastery learning has existed since the 1920s, it became a mainstream instructional strategy primarily as a result of work by Bloom (1968) and Block and Anderson (1975). In its simplest form mastery learning is ". . . a test about what the student was supposed to learn; a test not for grading or judging, but rather to see what the student has learned and what he or she needs to learn. The students are then given some help" (Bloom quoted in Koerner, 1986, p. 60). There are two primary forms that have grown out of this basic notion: a) group-based mastery learning; and b) personalized system of instruction (PSI/Keller Plan). PSI is an individualized form of mastery learning.
Supporters of mastery learning claim that the method will produce significantly higher achievement results, given the same objectives, the same materials, and the same amount of time allocation as standard instructional models. In group-based mastery instruction, teachers determine the pace of instruction, while in PSI the student controls the pace. In addition, it is argued that learning achievement will be dramatic: 90% of the learners will be able to achieve at a learning level of 85% or higher. Effectively, this would change the normal distribution of learning outcomes produced by standard instruction into one that is highly negatively skewed (Bloom, 1984). Guidance is individualized and focused on what has not been achieved.
Over the years many studies have been conducted to test these claims in the contexts of both group-based settings and PSI. The first review of literature (Block & Burns, 1976), conducted on both group-based and PSI studies, concluded that the mastery approaches described result in higher achievement and positive affective outcomes. However, the cognitive results were not as dramatic as the supporters of mastery learning had claimed. In the 1980s three meta-analytic studies of group-based mastery learning were conducted, each reaching dramatically different conclusions about the state of research that underlies this instructional technique and the magnitude of treatment effects. These studies are summarized in Table 1 (see pages 183 and 184), and their features and issues related to them are discussed.

Three Meta-Analyses on Mastery Learning
The first major review of research (Lysakowski & Walberg, 1982) was a quantitative summary of three of the four fundamental ingredients of quality instruction: cues, participation and corrective feedback. Reinforcement, the fourth element, had been reviewed previously. The reviewers concluded that the average effect for all three components was .97, and that the effect for feedback and correction, the element most commonly associated with mastery learning, was .94. Clearly, this was dramatic evidence that mastery learning had achieved the potential that had all along been claimed for it.
Five years later, Slavin (1987) published a meta-analytic study of mastery learning that all but refuted the major claims made by Bloom and others. Using a technique he developed, called "best-evidence synthesis," he showed that the results of mastery learning are considerably smaller in subsets of studies embodying: a) the "strong claim," that mastery will outperform the control group when they have the same objectives, the same materials and the same amount of time and when learning is measured with standardized instruments; b) the "curricular focus claim," that mastery learning focuses teachers on particular curricula and students on the attainment of particular objectives; and c) the "extra time claim," that mastery learning is an effective use of additional time and instructional resources to bring all students to an acceptable level of achievement. In addition, Slavin used only studies that he considered methodologically rigorous and in which the mastery learning treatment lasted four weeks or longer. Evidence for the "strong claim" produced a median effect size of .04 (essentially 0). Studies representing the "curricular focus claim" had a median effect size of .26, and the median for those representing the "extra time claim" was .31.
The most recent meta-analysis of mastery learning studies was conducted by Guskey and Pigott (1988). A subset of articles related only to elementary and secondary classrooms had been published previously by Guskey and Gates (1986). The Guskey and Pigott meta-analysis included a larger number of studies (n = 46) and arrived at a somewhat surprising conclusion: the variability of effect sizes for methodologically sound studies of group-based mastery learning was too great to compute an average effect size estimate. Attempts to derive models of measured variables that explained this heterogeneity were generally unsuccessful, although some trends were noted (e.g., higher effects in some subject matters).
Notes to Table 1:
Mean undetermined because of heterogeneity of studies.
Median interval = .6-.7. None indicated.
1 "Strong Claim" comes from studies where mastery and control groups have the same objectives, the same materials and the same amount of time, with standardized measures.
2 "Curricular Focus Claim" (weak claim) comes from studies where teachers use a particular curriculum and students learn a particular set of objectives.
3 "Extra Time Claim" (weak claim) comes from studies where additional time and instructional resources were required to bring all students to an acceptable level of achievement.

Some Reasons for the Differences
There are several explanations for the discrepancies among these meta-analyses, and they help to demonstrate some of the characteristics of meta-analyses in general. First, meta-analyses conducted at different points in time, especially when research productivity in the area is high, are bound to produce different results. More refined methods of study, better research designs, sensitivity to criticisms of previous research studies and a host of other considerations can affect the results achieved through meta-analysis from one era to another.
Second, the selection of studies for inclusion, even when the same study pool is available, can dramatically affect the results that different researchers obtain. Smith and Glass (1977) and others argue for the inclusion of all available studies that meet the minimum criterion of a metric of comparison, while others have claimed that mixing apples and oranges clouds the issue under study rather than elucidating it (Slavin, 1984). In the meta-analyses under consideration, the earliest effort set wide inclusion criteria that admitted many studies that were not included in the subsequent articles. Note in Table 1 that the overlap of Lysakowski/Walberg with Slavin and with Guskey/Pigott is 0% and 10%, respectively. Between Slavin and Guskey/Pigott the overlap is considerably higher. In this latter case, however, it is far less than it could be because of the restrictive conditions set by the "best-evidence synthesis."
One finding by Guskey and Pigott (also present, though not discussed, in Lysakowski and Walberg) reveals something important about the outcomes of meta-analytic studies. Note in Table 1 that for an effect size of .94, Lysakowski and Walberg report a standard deviation of 1.91. This describes a very flat, widely dispersed distribution that cannot be thought of as homogeneously summarized by a single effect size. In sampling an essentially different literature of mastery learning studies, Guskey and Pigott found the same thing. The standard deviation of the distribution of effect sizes should be considered an important piece of information, and effect sizes with a high standard deviation or standard error of the mean should not be taken at face value, because they may not be statistically significant.
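To make the point about spread concrete, the following minimal Python sketch asks what share of individual effect sizes would be expected to fall below zero, given the mean of .94 and standard deviation of 1.91 reported by Lysakowski and Walberg. The normal shape of the distribution is an assumption made here purely for illustration, and the function name is hypothetical; the reviewers did not report this calculation.

```python
from statistics import NormalDist


def share_below_zero(mean_es, sd_es):
    """Proportion of effect sizes expected below zero.

    ASSUMPTION: effect sizes are normally distributed around mean_es
    with standard deviation sd_es. The source does not state this.
    """
    return NormalDist(mu=mean_es, sigma=sd_es).cdf(0.0)


# Mean and standard deviation for feedback and correction (Table 1).
share = share_below_zero(0.94, 1.91)
print(f"Expected share of negative effect sizes: {share:.0%}")
```

Under that assumption, roughly three effect sizes in ten would favour the control group, which is why a mean of .94 by itself says little about what a given application of mastery learning will achieve.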
The variability, both within and among mastery learning studies, probably says little about the pedagogical soundness of its best applications, but instead bespeaks the implementation and methodological problems that continue to plague it. Variations in practice abound and there exist thorny research design problems which have not yet been fully addressed. These include dealing with time to mastery, equilibrating mastery and non-mastery treatments, dealing with the skewed distributions that invariably result from good mastery applications and establishing a sound rationale for using either standardized or well constructed locally produced instruments. As the technology of mastery learning improves and researchers become sensitive to methodological problems that are peculiar to mastery investigation, the variability in mastery studies should subside.

Comments for Practitioners and Researchers
Conscientious practitioners are always searching for support for the design of quality instructional programs. This might come from previous successes, from the analysis of costs and benefits, or from the research literature. Meta-analysis seems a reasonable tool for drawing on the last of these.
Table 2 (see pages 187 through 191) lists 26 meta-analyses of instructional variables divided for convenience into categories: instructional media, text design features, classroom processes, feedback and correction and social aspects of learning. The references to these studies are provided in the appendix. These studies represent a potentially valuable resource for the practitioner and researcher alike.
Notes to Table 2:
• Included were still projection, film, multimedia, CCTV, ETV, observation and feedback.
• Technologies reported were programmed instruction, visually-based instruction, multi-media, CAI and audio-tutorial.
• Technically not a meta-analysis. ES is based on conversion of percentage of increase of treatment over control to standardized units.
• Reported for studies where repeated questions (either pre- or post-) were used with repeated criterion questions.
• ESs for all studies reviewed were considered too heterogeneous to derive a mean effect size.
• "Extra Time Claim" (see Mastery Learning example).
In spite of the apparent flaws in the practice of meta-analysis, it remains the single most powerful tool for summarizing studies in an era of rapidly [...] For the researcher, meta-analysis represents a means for focusing thought on the large questions and a heuristic for designing future studies that take the smaller questions into account. For the practitioner in the media and technology field, meta-analysis is a means for making broad decisions about the implementation of new programs and the design of instructional products. Effect size is the metric for predicting what might happen in a new circumstance if a particular instructional variable were implemented. It tells roughly how many standard deviations of additional achievement would be expected over groups that do not receive the variable. However, it behooves practitioners and researchers alike to heed the warnings of the critics of meta-analysis practice. The following suggestions may aid readers in using the information contained in meta-analyses to support their instructional decisions.
1. Achieving a common definition - While it may seem self-evident, consumers of meta-analyses should make certain that their definition and that of the author are in agreement and that the studies reported in the meta-analysis are examples of the conceptual definition under consideration. For instance, a designer searching for pre-instructional activities for textbook design should realize that the studies reported under the rubric of advance organizers will not include other design features that might be commonly associated with advance organization, such as outlines, abstracts, introductions and overviews. The technical definition created by Ausubel and tested in meta-analyses does not include these features.
2. Achieving a common circumstance - Meta-analyses often summarize studies across a wide variety of instructional or educational circumstances (e.g., grade levels, SES levels, geographical boundaries). Consumers of meta-analyses should be aware of these circumstances and, if necessary, base their conclusions on subsets within the meta-analysis that fit their own needs. There is a danger in this, however. When the studies in a meta-analysis are subdivided, the resulting number of studies per subset is often quite small, frequently fewer than 5. It is more difficult to base a firm judgment on a smaller data set than on a larger one, because with fewer studies a test has less power and runs a greater risk of a Type II error (failing to reject a null hypothesis that is false). Naturally, the variability among studies within a subset should be of concern, as well as the mean.
3. Achieving an overall understanding - Of note in Table 2 is the fact that some areas have been investigated several times. This is partly because the state of evidence is always advancing. More recent meta-analyses supplant older ones in characterizing the field more fully. However, in some cases, a meta-analysis may be repeated to reconsider an earlier finding or to incorporate a new methodological or conceptual application into the state of the art. The meta-analysis of the mastery learning literature by Slavin (1987) is a good example of the latter case. The "best-evidence synthesis" represents a new conception of how inclusion and exclusion criteria should be set, and as a result a dramatically different impression of the field emerged.
It is surprising that follow-ups of the Kulik et al. media studies of the early 1980s have not been attempted, particularly after Clark's 1983 attack on their validity. One would have expected a response to determine whether Clark's assertion accurately represented the overall literature, given that his findings were based on only a partial selection of these studies. We can only speculate that calls by Salomon and Clark (1979), Bernard (1986) and others to stop asking gross media questions, which compare a media treatment to a control group, have been heeded. Unfortunately, literature concerning the nuances of within-media comparisons does not abound, reducing the likelihood of additional meta-analyses in the media area.
Since new meta-analyses can appear for either of the reasons mentioned above, to achieve a complete understanding of a field of inquiry, it is important to become familiar with all of the meta-analyses that have been conducted, not just the most recent ones.
Another point of importance here concerns the limitations of meta-analysis for drawing specific conclusions about when or under what exact circumstances a particular technique or medium should be applied. Meta-analysis is far too global to aid in the fine-grained analysis of instructional problems. In addition, it has seldom been used to address instructional treatments that are continuous or incrementally applied (e.g., varying degrees of feedback) or where variations in type of a common strategy (e.g., type of questioning) are examined. Therefore, meta-analysis is most useful as a tool for making the larger instructional decisions. The designer must look to more specific studies of instructional treatments and/or conduct local evaluation studies on prototype materials in order to gain insight into particular aspects of developing instruction.
4. Achieving a statistical understanding - There are several important points here. One, the mean effect size that is reported may not accurately reflect the underlying population parameter. If a test of homogeneity of effect size is not provided, look carefully at the magnitude of the standard deviation if it is given. Interpretation of this statistic may be supplemented by the histograms that are often included, similar to the one pictured at the bottom of Figure 1 (stem-and-leaf diagrams are also common). These provide a visual sense of the distribution of effect sizes and the variability among them. Two, while tests of significance within the distribution of effect sizes and between subsets of demographics are important, they can be misleading. When sample size is small, the power of the test is low, reducing the probability that differences will be detected even when they are present in the population. When sample size is large, even a relatively small effect size may exceed the critical value necessary to reject the null hypothesis. Three, in interpreting the effect size, reference to the percentile rank and to non-technical descriptions of the meaning of effect sizes is invaluable.
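The second and third statistical points can be illustrated with a short Python sketch. It converts an effect size into the percentile rank of the average treated learner within the control distribution, and it shows how the same mean effect size can pass or fail a significance test depending on the number of studies. The function names, the numbers of studies (5 and 50), the normality of scores and the use of a simple z test in place of the tests a published meta-analysis would report are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_rank(effect_size):
    """Percentile of the average treated learner in the control group.

    ASSUMPTION: achievement scores are normally distributed.
    """
    return 100 * STD_NORMAL.cdf(effect_size)


def two_sided_p(mean_es, sd_es, k):
    """Two-sided p-value for H0: mean effect size = 0.

    ASSUMPTION: a z test with standard error sd_es / sqrt(k),
    where k is the number of effect sizes (a simplification).
    """
    z = mean_es / (sd_es / sqrt(k))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# Effect sizes quoted in the mastery learning example.
for es in (0.94, 0.31, 0.26, 0.04):
    print(f"ES = {es:.2f} -> percentile rank {percentile_rank(es):.0f}")

# Same mean and SD, hypothetical numbers of studies.
for k in (5, 50):
    print(f"k = {k:2d} studies -> p = {two_sided_p(0.94, 1.91, k):.3f}")
```

Read this way, an effect size of .94 places the average treated learner at about the 83rd percentile of the control group, while .04 leaves that learner at about the 52nd. With the hypothetical subset of 5 studies the mean of .94 is not distinguishable from zero at the .05 level, whereas with 50 studies it is.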

Comments for Researchers
Meta-analysis represents both a means for estimating the effects of instructional treatments in practice and a heuristic for designing future research studies. This latter function may be accomplished in two ways. First, meta-analyses can sensitize researchers to issues of design and methodology that will admit or not admit their studies to scrutiny in future attempts to synthesize the literature. Second, the ancillary analyses contained in most meta-analyses can aid researchers in identifying the sources of data and variables that are likely to interact with the major question that is being addressed. If these suggestions sound like prescriptions for conformity, that is exactly how they are intended. Progress in the science of instruction is, to some degree, predicated on the presence of high quality replications in order for the larger questions to be answered.
However, it should be recognized that there are limitations to meta-analysis as a heuristic for research. Meta-analysis is a retrospective approach which derives its strength from the weight of past efforts. It is therefore unlikely that new developments, those that will qualitatively extend beyond present practice, will emerge from this technique. Meta-analysis will never be a substitute for insight and creativity in the conduct of primary research or the development of new instructional methods. In short, as a technology of quantitative synthesis, meta-analysis should never substitute for the kind of in-depth exploration and complex thinking that characterize productive scientific enquiry.

CONCLUSION
In this article we have sketched a broad picture of the nature of meta-analysis and its potential for informing researchers about the overall effectiveness of variables in a given field and for aiding media and technology practitioners in making decisions concerning larger instructional development issues. We have discussed both the philosophical and practical objections to meta-analysis and have described the process of doing a meta-analysis in some detail. Clearly, all of the many issues that have arisen over the last 15 years cannot be catalogued here. However, the core issues that have been represented, together with the references, provide ample fodder for further consideration.
Figure 1. Cross-Sections of a Simplified Coding Scheme.
ES = Glass's ES, (M_t - M_c) / SD_c; d = Cohen's d, (M_t - M_c) / SD_pooled; r = Pearson Product-Moment Correlation. All effect sizes are from achievement data. SD refers to standard deviation and SE refers to the standard error of the mean. Unless otherwise indicated, effect sizes shown are for the overall research question only.
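As a worked illustration of the two standardized mean differences named in this note, the sketch below computes Glass's ES (difference in means divided by the control group standard deviation) and Cohen's d (difference divided by the pooled standard deviation). The scores are invented for the example, the function names are hypothetical, and the pooled standard deviation uses the usual degrees-of-freedom weighting, which is an assumption since the note does not spell it out.

```python
from math import sqrt
from statistics import mean, stdev


def glass_es(treatment, control):
    """Glass's ES: mean difference over the control-group SD."""
    return (mean(treatment) - mean(control)) / stdev(control)


def cohens_d(treatment, control):
    """Cohen's d: mean difference over the pooled SD.

    ASSUMPTION: the pooled SD weights each sample variance by n - 1.
    """
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Invented achievement scores, for illustration only.
treatment_scores = [12, 14, 16, 18, 20]
control_scores = [11, 12, 14, 16, 17]

print(f"Glass's ES = {glass_es(treatment_scores, control_scores):.2f}")
print(f"Cohen's d  = {cohens_d(treatment_scores, control_scores):.2f}")
```

The two indices differ whenever the groups have unequal spread, which is one reason effect sizes drawn from different meta-analyses are not always directly comparable.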
This process is necessary even when flawed because of the impossibility of keeping up with even a fraction of the literature by researchers and practitioners alike.

TABLE 1
Comparison of Three Meta-Analytic Studies of Group-Based Mastery Learning

TABLE 2
Summary of Meta-Analytic Studies of Instructional Variable Effects on Student Achievement