Systematic review of the Hawthorne effect: New concepts are needed to study research participation effects

Objectives: This study aims to (1) elucidate whether the Hawthorne effect exists, (2) explore the conditions under which it operates, and (3) estimate the size of any such effect.

Study Design and Setting: This systematic review summarizes and evaluates the strength of available evidence on the Hawthorne effect. An inclusive definition of any form of research artifact on behavior using this label, and without cointerventions, was adopted.

Results: Nineteen purposively designed studies were included, providing quantitative data on the size of the effect in eight randomized controlled trials, five quasiexperimental studies, and six observational evaluations of reporting on one's behavior by answering questions or of being directly observed while aware of being studied. Although all but one study was undertaken within the health sciences, study methods, contexts, and findings were highly heterogeneous. Most studies reported some evidence of an effect, although significant biases are judged likely because of the complexity of the evaluation object.

Conclusion: Consequences of research participation for the behaviors being investigated do exist, although little can be securely known about the conditions under which they operate, their mechanisms of effects, or their magnitudes. New concepts are needed to guide empirical studies.


Introduction
The Hawthorne effect concerns research participation, the consequent awareness of being studied, and possible impact on behavior [1-5]. It is a widely used research term. The original studies that gave rise to the Hawthorne effect were undertaken at the Western Electric telephone manufacturing factory at Hawthorne, near Chicago, between 1924 and 1933 [6-8]. Increases in productivity were observed among a selected group of workers who were supervised intensively by managers under the auspices of a research program. The term was first used in an influential methodology textbook in 1953 [9]. A large literature and repeated controversies have evolved over many decades as to the nature of the Hawthorne effect [5,10]. If there is a Hawthorne effect, studies could be biased in ways that we do not understand well, with profound implications for research [11].
Empirical data on the Hawthorne effect have not previously been evaluated in a systematic review. Early reviews examined a body of literature on studies of school children and found no evidence of a Hawthorne effect as the term had been used in that literature [12-14]. The contemporary relevance of the Hawthorne effect is clearer within the health sciences, in which recent years have seen an upsurge in applications of this construct in relation to a range of methodological phenomena (see examples of studies with nonbehavioral outcomes [15-17]).
Ethics statement: Ethical approval was not required for this study. Competing interests: No authors have any competing interests. Authors' contributions: J.M. had the idea for the study, led on study design, data collection, and data analyses, and wrote the first draft of the report. J.W. assisted with data collection and analyses. All three authors participated in discussions about the design of this study, contributed to revisions of the report, and approved the submission of the final report.

What is new?
Most of the 19 purposively designed evaluation studies included in this systematic review provide some evidence of research participation effects.
The heterogeneity of these studies means that little can be confidently inferred about the size of these effects, the conditions under which they operate, or their mechanisms.
There is a clear need to rectify the limited development of study of the issues represented by the Hawthorne effect as they indicate potential for profound biases.
As the Hawthorne effect construct has not successfully led to important research advances in this area over a period of 60 years, new concepts are needed to guide empirical studies.

There are two main ways in which the construct of the Hawthorne effect has previously been used in the research literature. First, there are studies that purport to explain some aspect of the findings of the original Hawthorne studies. These studies involve secondary quantitative data analyses [1,18-20] or discussions of the Hawthorne effect, which offer interpretations based on other material [4,10,21,22]. The Hawthorne effect has also been widely used without any necessary connection to the original studies and has usually taken on the meaning of alteration in behavior as a consequence of its observation or other study. In contrast to uses of the term in relation to the original Hawthorne studies, methodological versions of the Hawthorne effect have mutated in meaning over time and across disciplines and have been the subject of much controversy [1,2,4,23,24]. This diversity means that certain aspects of the putative Hawthorne effect, for example, novelty [25], are emphasized in some studies and absent in many others.
There is a widespread social psychological explanation of the possible mechanism for the Hawthorne effect as follows. Awareness of being observed or having behavior assessed engenders beliefs about researcher expectations. Conformity and social desirability considerations then lead behavior to change in line with these expectations. Chiesa and Hobbs [5] point out that just as there are different meanings given to the purported Hawthorne effect, there are also many suggested mechanisms producing the effect, some of which are contradictory. In all likelihood, the most common use of the Hawthorne effect term is as a post hoc interpretation of unexpected study findings, particularly where they are disappointing, for example, when there are null findings in trials.
The aims of this systematic review were to elucidate whether the Hawthorne effect exists, explore under what conditions, and estimate the size of any such effect, by summarizing and evaluating the strength of evidence available in all scientific disciplines. Meeting these study aims contributes to an overarching orientation to better understand whether research participation itself influences behavior. This inclusive orientation eschews restrictions on participants, study designs, and precise definitions of the content of Hawthorne effect manipulations.

Methods
The Hawthorne effect under investigation is any form of artifact or consequence of research participation on behavior.
Studies were included if they were based on empirical research comprising either primary or secondary data analyses; were published in English-language peer-reviewed journals; were purposively designed to determine the presence of, or measure the size of, the Hawthorne effect, as stated in the introduction or methods sections of the article (or before the presentation of findings if the report was not organized in this way); and reported quantitative data on the Hawthorne effect on a behavioral outcome, either in observational designs comparing measures taken before and after a dedicated research manipulation or between groups in randomized or nonrandomized experimental studies. Behavioral outcomes incorporate direct measures of behavior and also the consequences of specific behaviors. Studies that described their aims in other ways but also referred to the Hawthorne effect as an alternative conceptualization of the object of evaluation were included, as were studies with other primary aims, such as the evaluation of an intervention in a trial, in which assessment of the Hawthorne effect was clearly stated as a secondary aim, for example, through the incorporation of control groups with and without Hawthorne effect characteristics. Studies were excluded if they were unpublished or in the grey literature, on the grounds that it is not possible to assess this literature systematically in an unbiased manner. Discussion articles and commentaries were not considered to constitute empirical research. Studies were also excluded if they referenced or used the term Hawthorne effect only incidentally, described it as a design feature or as part of the study context, or invoked it as an explanation for study findings. Studies of the Hawthorne effect that incorporate nonresearch components, including cointerventions such as feedback, hamper evaluation and were also excluded, as were reanalyses of the original Hawthorne factory data set by virtue of nonresearch cointerventions such as managerial changes (see Ref. [8] for a detailed history of the studies).
Studies were primarily identified in electronic databases. In addition, included studies and key excluded references were backward searched for additional references and forward searched to identify reports that cited these articles. Experts identified in included studies and elsewhere were contacted. The most recent database searches took place on January 3, 2012. The term "Hawthorne effect" was searched for as a phrase as widely as possible. If the database permitted, the term was searched for in all fields, as was the case for Embase, CINAHL, ERIC, and others. In the Web of Knowledge search, which uses the Web of Science, MEDLINE, and BIOSIS Previews databases, "Hawthorne effect" was entered into the "topic" field. The term was also searched for in "keyword" fields for databases such as NCJRS, LLBA, APPI, and others. The use of this term as the core object of evaluation negated the need for a more complex search strategy.
Hits from the database searches were downloaded into EndNote software (Thomson Reuters), where duplicates were removed. Screening of titles and abstracts was undertaken by the second author or a research assistant. After a further brief screen of full-text articles, potential inclusions were independently assessed before being included. Data extracted are summarized in the tables presented here, which also contain information on risk of bias in individual studies. Binary outcomes were meta-analyzed in Stata, version 12 (StataCorp), with outcomes pooled in random-effects models using the method of DerSimonian and Laird. The Q and I² statistics [26] were used to evaluate the extent and effects of heterogeneity, and outcomes are stratified by study design. Formal methods were not used to assess risk of bias within and across studies, and narrative consideration is given to both. We did not publish a protocol for this review.
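For readers unfamiliar with the pooling method, the DerSimonian and Laird random-effects calculation used here can be sketched as follows. This is an illustrative implementation only, not the authors' Stata analysis; the function name `meta_analysis_dl` and the example 2x2 tables are hypothetical.

```python
import math

def meta_analysis_dl(studies):
    """DerSimonian-Laird random-effects pooling of odds ratios.

    `studies` is a list of 2x2 tables (a, b, c, d): events and non-events
    in the exposed group, then events and non-events in the control group.
    Returns the pooled OR with a 95% CI, plus Q, I^2, and tau^2.
    """
    # Log odds ratio and its (Woolf) variance for each study
    y = [math.log((a * d) / (b * c)) for a, b, c, d in studies]
    v = [1 / a + 1 / b + 1 / c + 1 / d for a, b, c, d in studies]

    # Fixed-effect (inverse-variance) weights give Cochran's Q
    w = [1 / vi for vi in v]
    y_fe = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fe) ** 2 for wi, yi in zip(w, y))
    k = len(studies)

    # DL moment estimate of between-study variance, truncated at zero
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights and pooled log odds ratio
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    # I^2: proportion of variability attributable to heterogeneity
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {
        "or": math.exp(mu),
        "ci": (math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se)),
        "Q": q, "I2": i2, "tau2": tau2,
    }
```

Because the pooled log odds ratio is a weighted average of the study estimates, it always lies between the smallest and largest study odds ratio, which is a useful sanity check on any pooled result.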

Results
Nineteen studies were eligible for inclusion in this review [27-45]. The PRISMA flowchart summarizing the data collection process is presented in Fig. 1. The design characteristics of included studies, along with brief summaries of outcome data and observations on the most likely sources of bias, are presented separately for randomized controlled trials (RCTs), quasiexperimental studies, and observational studies in Tables 1-3, respectively. All included studies apart from one [27] have been undertaken within the health sciences. All observational studies were studies of the Hawthorne effect on health-care practitioners, as were two of the quasiexperimental studies [37,39]. Although none of the randomized trials evaluated possible effects on health-care practitioners, the study by Van Rooyen et al. [28] was undertaken with health researchers. Four of the five quasiexperimental studies (all except Ref. [36]) used some form of quasirandomized method in constructing control groups. Heterogeneity in operationalization of the Hawthorne effect for dedicated evaluations, in study populations, in settings, and in other ways is readily apparent in Tables 1-3. Fourteen of the 19 included studies report evaluations of effects on binary outcome measures. These data are presented in Fig. 2. The first six studies presented in Fig. 2 comprise six of the seven (not including Ref. [32]) evaluations of the effects of reporting on one's behavior by answering questions either in interviews or by completing questionnaires. All other studies evaluate being directly observed and/or the awareness of being studied in various ways, apart from one study that combines both types of Hawthorne effect manipulation [33].
As a result of heterogeneity in definitions of the Hawthorne effect (reflecting the inclusion criteria), findings from meta-analytic syntheses should be treated with caution. Explorations of the extent and effects of heterogeneity are presented in Table 4. Pronounced effects of statistical heterogeneity are reflected in the I² statistics for two of the three study designs (RCTs and observational studies) and also when attention is restricted to the eight studies of being observed or studied and to the subset of six studies of answering questions, and overall. When the one interview study (of preelection interview effects on voter turnout [27]) is removed, however, to leave five studies of the effects of self-completing questionnaires on health behaviors, statistical heterogeneity is markedly attenuated.
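For reference, the heterogeneity statistics reported in Table 4 follow the standard definitions: Cochran's Q is the weighted sum of squared deviations of the study estimates from the fixed-effect pooled estimate, and I² expresses the proportion of the variability in effect estimates attributable to heterogeneity rather than chance:

```latex
Q = \sum_{i=1}^{k} w_i \left( \hat{\theta}_i - \hat{\theta}_{\mathrm{FE}} \right)^2,
\qquad
I^2 = \max\!\left( 0,\ \frac{Q - (k-1)}{Q} \right) \times 100\%,
```

where $\hat{\theta}_i$ is the log odds ratio in study $i$, $w_i = 1/\operatorname{var}(\hat{\theta}_i)$ is its inverse-variance weight, $\hat{\theta}_{\mathrm{FE}}$ is the fixed-effect pooled estimate, and $k$ is the number of studies.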
Bearing these explorations of heterogeneity in mind, effect estimates provide a confidence interval (CI) including unity for the five trials alone [odds ratio (OR), 1.06; 95% CI: 0.98, 1.14] and for the six studies of answering questions (OR, 1.07; 95% CI: 1.00, 1.15). They reach statistical significance in relation to the five studies of self-completing health questionnaires (OR, 1.11; 95% CI: 1.00, 1.23). The pooled estimate for the five quasiexperimental studies is similar to that for the five trials and is not statistically significant (OR, 1.07; 95% CI: 0.99, 1.17), whereas those for the four observational studies (OR, 1.29; 95% CI: 1.06, 1.30) and the eight studies of being observed (OR, 1.21; 95% CI: 1.03, 1.41) are larger and statistically significant. The overall odds ratio, without any weighting for study design, was 1.17 (95% CI: 1.06, 1.30).
Quantitative outcome data were presented in three of the other five studies: two identifying between-group differences [29,32] and one not [28]. The large effect in the study by Feil et al. [29] is noteworthy. In the remaining two studies, continuous measures of effect were not reported in the form of mean differences and were complex to interpret, although both reported statistically significant Hawthorne effect findings [40,43]. Continuous outcomes were also evaluated for two studies included in Fig. 2, with both finding evidence of statistically significant effects [33,35]. Of the 19 studies, therefore, 12 provided at least some evidence of the existence of a Hawthorne effect, however this was defined, to a statistically significant degree. Small sample sizes appeared to preclude between-group differences reaching statistical significance in two studies [31,36]. In five studies, it was judged clear that there were no between-group differences that could represent a possible Hawthorne effect [28,37-39,45].

Discussion
The Hawthorne effect has been operationalized for study as the effects of reporting on one's behavior by answering questions, being directly observed, or otherwise made aware of being studied. There is evidence of effects across these studies, and inconsistencies in this evidence. We explore heterogeneity in targets for, and methods of, study as well as in findings. We will begin by examining the evidence base for each of the study designs, before considering the limitations, interpretation, and implications of this study.
The RCTs tend to provide evidence of small statistically significant effects. There are also studies that showed no effects, and two studies that provided evidence of large effects [29,34]. The study by Feil et al. [29] used a strong manipulation, incorporating a placebo effect, which is not usually considered to be a Hawthorne effect component, in addition to research- and trial-specific participation effects. In both this study and the one undertaken by Evans et al. [34], which also found a large effect, small numbers of participants were involved. The diversity in the content of the manipulations in these studies is emphasized. When one considers the RCT data from the five studies contributing to the meta-analysis [27,30,31,33,34] alongside the three studies that did not, two of which produced statistically significant effects on continuous outcomes [29,32], it seems that overall, there is evidence of between-group differences in the RCTs. These between-group differences cannot, however, be interpreted as providing consistent or coherent evidence of a single effect. The same could be said of the diversity of the contents of the manipulations in the quasiexperimental studies, and the picture is made more complex by greater variability in study design features, particularly in relation to allocation methods. Overall, they produce mixed evidence, with a between-group difference in one study [35], a noteworthy difference in a small underpowered study [36], and no evidence of between-group differences in the other three studies [37-39]. It is difficult to draw any conclusion from the included studies with quasiexperimental designs.
Although their design precludes strong conclusions because of the likelihood of unknown and uncontrolled biases, the heterogeneity of findings in the observational studies is interesting. Two studies produce identical point estimates of effects [41,42], whereas the other two estimates are quite different [44,45]. These data suggest that the size of any effects of health-care practitioners being observed or being aware of being studied probably depends very much on what exactly they are doing. Perhaps this is not at all surprising, although it does further undermine the idea that there is a single effect that can be called the Hawthorne effect. Rather, the effect, if it exists, is highly contingent on task and context. It is noteworthy that the three other studies with health professionals, all using control groups, show no effects [28,37,39].

This is an "apples and oranges" review. This approach was judged appropriate, given the current level of understanding of the phenomena under investigation. This design decision does, however, entail limitations in the form of important differences between studies, including in the operationalization of the Hawthorne effect and in exposure to varying forms of bias. The observed effects are short term when follow-up is involved, with only the studies by Murray et al. [35] and Kypri et al. [32] demonstrating effects beyond 6 months. Both these studies involved repeated prior assessments, and both provide self-reported outcome data. Self-reported outcomes do not appear obviously more likely than objectively ascertained outcomes to show effects. The forms of blinding used are often tailored to the nature of the study, making performance bias prevention difficult to evaluate across the studies as a whole.
By design, we have excluded studies that defined the object of evaluation to incorporate nonresearch elements as occurred in the original studies at Hawthorne [8]. We may have missed studies that should have been identified, although this is unlikely if use of the Hawthorne effect term was in any way prominent. Studies that have been missed may be more likely to be older and from nonhealth literature. An alternative design for this study might have eschewed this label and sought instead to synthesize findings on studies of research participation and/or awareness of being studied. Although potentially attractive, this course of action would have involved considerable difficulties in identifying relevant material and would risk losing the main focus on the Hawthorne effect. Similarly, we might have also included studies with cognitive and/or emotional outcomes in which effects might be greater [46], rather than focusing on behavioral outcomes. This possibility may be appropriate for evaluation in the future. Although we have sought to make our explorations of the heterogeneity of included studies as informative as possible, our analyses might be seen as excessive data fishing. These are, however, clearly presented as post hoc analyses after examination of high levels of heterogeneity, and the study by Granberg and Holmberg [27] is distinct from the other four questionnaire studies in a range of ways including the behavior being investigated.
Heterogeneity in the operationalization of the Hawthorne effect makes the data in this review challenging to interpret, yet it does appear that research participation can and does influence behavior, at least in some circumstances. The content and strength of the Hawthorne effect manipulations vary in these primary studies, and so too do the effects, although it is not possible to discern any form of dose-response relationship. The manipulation by Evans et al. [34] may appear in some respects weak, for example, being an online questionnaire, and in others potentially strong, examining decision making in relation to uptake of a cancer test, which was the study outcome. Although weak uses of the Hawthorne effect term in the wider literature mean that it is not very informative for interpreting the data from this study, outcomes may be considered in relation to the prevailing idea about the core mechanism of the Hawthorne effect: that conformity to perceived norms or researcher expectations drives change. Many, but not all, of the studies with positive findings appear broadly consistent with this account, although so too do many of the studies with negative findings, and it is not clear why this is so. The study by Murray et al. [35] examined adolescent smoking at a time when the prevalences of both nonsmoking and regular smoking were approximately one-quarter in this sample, so it is unclear what the norms or perceptions of researcher expectations may have been. This study exemplifies the literature as a whole in being principally concerned with the possible existence of a Hawthorne effect rather than being designed to test the hypothesized mechanism.
There are other possible mechanisms of effect that have also not been evaluated. For example, regardless of perceptions of norms or researcher expectations, the content of the questions asked may itself stimulate new thinking. In the studies by Evans et al. [34] and O'Sullivan et al. [30], patients may well not have previously considered what was being enquired about, and this may have been an independent source of change. Concerns about biases being introduced by having research participants complete questionnaires have existed for more than 100 years, before the Hawthorne factory studies took place and approximately 50 years before the Hawthorne effect term was introduced to the literature (see Ref. [47] for an early history of these issues). Given how long the Hawthorne effect construct has been the predominant conceptualization of these phenomena [48], it appears that this construct is an inadequate vehicle for advancing understanding of these issues. Alternative long-standing conceptualizations of these problems, such as demand characteristics within psychology, have also yielded disappointingly underdeveloped research literatures [49-51]. This state of affairs points toward an obvious need for further study of whether, when, how, how much, and for whom research participation may impact on behavior or other study outcomes.
Further studies will be assisted by the development of a conceptual framework that elaborates possible mechanisms of effects and thus targets for study. The Hawthorne effect label has probably stuck for so long simply because we have not advanced our understanding of the issues it represents. We suggest that unqualified use of the term should be abandoned. Specification of the research issues being investigated or described is paramount, regardless of whether the Hawthorne label is seen to be useful or applicable in any particular research context. Perhaps use of the label should be restricted to evaluations in which conformity and social desirability considerations are involved, although it is striking how hostile social psychology has been to this construct [2]. So, what can be said about priority targets for further study on the basis of this systematic review, and what concepts are available to guide further study?
Decisions to take part in research studies may also be implicated in efforts to address behavior in other ways, so that research participation interacts with other forces influencing behavior. It is also possible, if not likely, that these relatively well-studied types of data collection (completing questionnaires and being observed) are part of a series of events that occur for participants in research studies that have potential to shape their behavior, from recruitment onwards. Giving attention to precisely what we invite research participants to do in any given study seems a logical precursor to examination of whether any aspect of taking part may influence them. Phenomenological studies, which ask participants about their experiences, would seem to be useful for developing new concepts. If individual study contexts are indeed important, we should expect to see effects that vary in size across populations and research contexts, and perhaps also with multiple mechanisms of effects. The underdeveloped nature of these types of research questions means that it may be unwise to articulate advanced conceptual frameworks to guide empirical study. We propose "research participation effects" as a starting point for this type of thinking. Although descriptive, it also invites investigation of other aspects of the research process beyond data collection, which may simply be where research artifacts emanating from both social norms and other sources are most obvious.
We conclude that there is no single Hawthorne effect. Consequences of research participation for behaviors being investigated have been found to exist in most studies included within this review, although little can be securely known about the conditions under which they operate, their mechanisms of effects, or their magnitudes. Further research on this subject should be a priority for the health sciences, in which we might expect change induced by research participation to be in the direction of better health and thus likely to be confounded with the outcomes being studied. It is also important for other domains of research on human behavior to rectify the limited development of understanding of the issues represented by the Hawthorne effect as they suggest the possibility of profound biases.