Introduction

Scientific progress depends on reproducible results. A single finding from a single research group might reflect idiosyncrasies in subject populations or research methods—or simply random chance. When many research teams replicate a given finding many times, however, we can be relatively confident that the result is robust.

It is troubling, then, that a substantial proportion of scientific claims have proven, upon further investigation, to be difficult to replicate. Drug development firms regularly perform in-house replications of published research findings before launching full-scale commercialization efforts. Recent reports by scientists at Bayer (Prinz, Schlange, & Asadullah, 2011) and Amgen (Begley & Ellis, 2012) claimed that 79 % and 89 % of high-impact findings, respectively, could not be replicated. Similarly, Ioannidis (2005a) found that the findings of 16 % of landmark clinical trials could not be replicated by subsequent studies; effect sizes of another 16 % of studies were revealed to be dramatically smaller upon replication.

Although the most detailed studies of reproducibility have focused on medical research, functional neuroimaging studies also appear to be vulnerable to inflated effect sizes and false positive results. Wager, Lindquist, and Kaplan (2007) used the statistical thresholds and number of comparisons reported across a sample of 195 fMRI studies of long-term memory to estimate the false positive rate due to chance. Using this approach, these authors estimated that about 17 % of reported activation peaks were likely false positives. Similarly, Wager, Lindquist, Nichols, Kober, and Van Snellenberg (2009) showed that between 10 % and 40 % of activation peaks reported in individual studies did not overlap with activation maps derived from meta-analyses, also likely reflecting false positive results.
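
The logic behind such threshold-based estimates can be illustrated with a toy calculation. The sketch below is a simplification that treats comparisons as independent and uses made-up numbers, not Wager et al.'s actual procedure; it shows how many suprathreshold voxels would be expected by chance alone given an uncorrected threshold and a typical number of whole-brain comparisons.

    # Toy illustration of threshold-based false positive estimation
    # (hypothetical numbers; comparisons treated as independent).
    alpha = 0.001            # uncorrected per-voxel threshold
    n_comparisons = 50000    # assumed number of voxels tested in a whole-brain analysis
    expected_false_positives = alpha * n_comparisons
    print(expected_false_positives)  # 50.0 voxels expected to pass threshold by chance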

These estimates of false positive rates (or, equivalently, rates of irreproducible results) are likely to be conservative. Estimates derived from reported statistical thresholds assume that false positives are caused only by chance. However, so-called questionable research practices (e.g., omitting conditions or measures, concealing negative results, sequential hypothesis testing) likely make considerable contributions to false positive rates as well (John, Loewenstein, & Prelec, 2012). And while estimates drawn from meta-analyses are agnostic about the origins of false positives, this approach is highly sensitive to the quality of the constituent experiments: If a set of experiments is biased, a meta-analysis of these experiments will be biased as well.

Altogether, although evidence on reproducibility in functional neuroimaging research remains sparse, preliminary studies suggest that a substantial proportion of published research findings—a far greater proportion than would be expected, given nominal false positive rates—cannot be replicated. Why might this be the case? Many factors contribute to elevated false positive rates, including small sample sizes (Button et al., 2013; Yarkoni, 2009), flexible analysis procedures (Carp, 2012a), and failures to publish negative results (Ioannidis, 2005b; John et al., 2012). The risks posed by these practices have received increasing attention in recent years. However, less has been said about perhaps the most basic requirement of reproducible research: complete and clear description of experimental methods and results.

Why reporting matters

Incomplete or ambiguous descriptions of experimental details pose two major challenges to the goal of reproducible research. First, independent investigators cannot attempt to replicate published studies unless they can determine exactly what the results of those studies were and how those results were obtained. For example, while Brown and Braver (2005) claimed that activation in the anterior cingulate cortex (ACC) is sensitive to the likelihood of committing an error, Nieuwenhuis, Schweizer, Mars, Botvinick, and Hajcak (2007) reported no relationship between ACC activation and error likelihood. Does this discrepancy reflect a failure to replicate or a subtle difference between experimental protocols? Unfortunately, because of incomplete reporting on the part of both research groups, there is no way to know. For example, Brown and Braver did not describe which software packages were used to analyze imaging data; Nieuwenhuis and colleagues did not describe how intervals between trials and trial phases were jittered; and neither study described critical methodological details such as the source or target images used in spatial normalization or how spatial smoothing kernels were selected. Although both groups cited cross-study differences in participant personality as a potential explanation for divergent results, neither group reported the personality characteristics of their samples.

Incomplete methods reporting can also elevate false positive rates by masking suboptimal research practices and outright errors. For example, inadequate statistical power increases the risk of false positive results (Button et al., 2013; Yarkoni, 2009). However, without access to detailed effect sizes or power calculations, reviewers cannot assess statistical power. Thus, incomplete reporting of these parameters increases the likelihood that underpowered and irreproducible studies will be published. Similarly, using suboptimal analysis procedures (such as global mean scaling or fixed-effects modeling) or questionable research practices (such as sequential statistical testing or omission of nonsignificant tests) is a red flag for reviewers. But if these details are not reported, reviewers and readers may simply not notice. Subsequent studies using more appropriate methods may then fail to replicate the original findings. In other words, incomplete reporting may permit the publication of incorrect results that cannot be reproduced.
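
The kind of power calculation that reviewers need is not difficult to report. The sketch below is a minimal example using the statsmodels library; it assumes a simple two-group comparison and a hypothetical medium effect size and is not drawn from any particular study.

    # Hypothetical prospective power calculation for a two-sample t test.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(
        effect_size=0.5,  # assumed Cohen's d (illustrative value)
        alpha=0.05,       # two-sided significance threshold
        power=0.80,       # desired statistical power
    )
    print(round(n_per_group))  # roughly 64 participants per group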

Studies of research reporting in various scientific domains suggest that comprehensive descriptions of experimental procedures and results are hard to come by. For example, Chan and Altman (2005) showed that fewer than half of published randomized controlled trials reported critical methodological details like power calculations, definition of primary outcome measures, and handling of subject attrition. The functional neuroimaging literature is no exception to the overall trend of ambiguous or incomplete reporting. In a survey of over 200 recently published fMRI studies (Carp, 2012b), I found that many studies did not report important details of experimental design, data acquisition, and data analysis. Although standardized reporting guidelines for fMRI studies have been developed (Poldrack et al., 2008), none of the studies in this sample fully adhered to those guidelines—even though most of them were published after the release of the guidelines.

How to improve reporting

Improve natural language descriptions

Why don’t studies describe methods and results more thoroughly? Scientific norms likely contribute to this problem. By convention, most published fMRI studies provide a brief, high-level summary of methods and results. When preparing new manuscripts for publication, authors follow the template established by previous studies. Likewise, journal editors and reviewers hold submitted manuscripts to the standard set by previous reports. In this way, incomplete reporting practices are reinforced, sustained not by malicious intentions but, instead, by simple inertia. On this account, efforts to improve reporting quality should focus on cultural change: Neuroscientists should be encouraged to provide more detailed descriptions of methods and results in published studies.

As was described above, researchers have promulgated standardized reporting guidelines for functional neuroimaging studies (Poldrack et al., 2008). More recently, Ben Inglis has developed analogous guidelines for describing acquisition parameters used in fMRI research (http://practicalfmri.blogspot.com/2013/01/a-checklist-for-fmri-acquisition.html). However, the mere publication of such guidelines does not guarantee that researchers will adhere to them (Carp, 2012b). To persuade neuroscientists to abide by the guidelines laid out by their peers, the field might draw from lessons learned in the clinical trials literature. As in the neuroimaging literature, clinical trials researchers have developed guidelines for methods reporting (Bossuyt et al., 2003; Moher, Schulz, Altman, & the CONSORT Group, 2001; von Elm et al., 2007). However, researchers in the clinical trials literature have gone one important step further: Many journals that publish clinical trials have formally adopted these guidelines and (at least nominally) require authors to comply with them. Although descriptions of methods and results in the clinical trials literature remain incomplete (Smidt et al., 2006), these measures do appear to modestly improve reporting quality (Plint et al., 2006). Thus, adoption of standardized reporting recommendations by journals, professional societies, or funding agencies would likely bring about a parallel improvement in methods reporting in the neuroimaging literature.

Incomplete reporting is also likely driven by another pragmatic constraint: Some journals impose space limits on method sections or on overall article length. Meanwhile, a standards-compliant description of the methods for a typical neuroimaging study might run to 1,200 words or more (http://www.fmrimethods.org/index.php/Event-related_design_example). Thus, even if authors wish to describe their research comprehensively, space limits may not permit them to do so. Again, however, a handful of journals and professional societies have set a positive example that the rest of the field would do well to follow. For example, Nature Publishing Group recently lifted space limitations on the methods section (http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852). In tandem, Nature has announced plans to require more thorough methodological and statistical reporting (http://go.nature.com/oloeip). Thus, allowing more space for thorough descriptions of research methods and results may also improve reporting quality. Alternatively, encouraging or requiring authors to submit comprehensive methods descriptions in supplemental materials sections or in independent repositories such as the Open Science Framework (http://openscienceframework.org/), OpenWetWare (http://openwetware.org/wiki/Main_Page), or Nature Protocols (http://www.nature.com/nprot/index.html) may lead to better reporting without expanding article formats (see Table 1). I wonder, however, whether editors and reviewers will scrutinize such supplements as carefully as the primary article itself.

Table 1 An incomplete list of tools that can be used to improve methodological transparency and reproducibility in neuroimaging research

Share source code

As was discussed above, expanding natural language descriptions of research methods and results constitutes an important first step in improving reproducibility. However, while this strategy is useful, it is also incomplete. Natural language is inherently ambiguous, making the precise translation of computational procedures into human language challenging, if not impossible (Ince, Hatton, & Graham-Cumming, 2012). Even formal mathematical and computational descriptions, such as equations and pseudocode, are ambiguous: Different implementations of the same formulae will likely yield appreciably different results (Ince et al., 2012). Thus, these authors argue that “anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation” (Ince et al., 2012, p. 485). On this account, sharing source code is a critical supplement to natural language descriptions of research procedures and results.
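
A small, hypothetical example illustrates how faithfully implemented versions of the "same" verbal description can disagree: the phrase "the data were standardized" is consistent with at least two defensible implementations, depending on whether the standard deviation is computed with or without Bessel's correction.

    # Two defensible readings of "the data were standardized" yield
    # different numbers (hypothetical data).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    z_population = (x - x.mean()) / x.std(ddof=0)  # divide by N
    z_sample = (x - x.mean()) / x.std(ddof=1)      # divide by N - 1
    print(z_population)  # approx. [-1.34 -0.45  0.45  1.34]
    print(z_sample)      # approx. [-1.16 -0.39  0.39  1.16]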

Practically, sharing source code may be less of a burden on investigators than providing more complete natural language descriptions of their work. Sharing source code requires little additional effort from investigators: Many Web-based tools, such as GitHub (https://github.com), FigShare (http://figshare.com), and RunMyCode (http://www.runmycode.org), allow researchers to distribute and even execute source code free of charge (see Table 1). This approach also offers substantial benefits to investigators, providing free storage and version control for their projects. Most important, sharing code allows independent scientists to conduct direct replications of reported research findings with a minimum of computational uncertainty. If source code for data acquisition is available (e.g., task presentation code in E-Prime or PsychoPy), independent investigators can (relatively) easily collect data that are comparable to the original data. And if source code for data analysis is available, independent investigators can analyze their data just as in the original study.
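
To give a sense of how little is required to make acquisition code shareable, the following is a minimal, hypothetical PsychoPy script for a single trial; posting even a file this small to a public repository removes any ambiguity about stimulus timing and response collection. It is a sketch, not the code of any published study.

    # Minimal hypothetical PsychoPy trial: show a word for 500 ms and
    # collect a keypress. Sharing the actual script used in a study
    # removes ambiguity about timing and response handling.
    from psychopy import visual, core, event

    win = visual.Window(fullscr=False, color='black')
    stim = visual.TextStim(win, text='RED', color='red')

    stim.draw()
    win.flip()        # stimulus onset
    core.wait(0.5)    # stimulus duration: 500 ms
    win.flip()        # clear the screen
    keys = event.waitKeys(maxWait=1.5, keyList=['f', 'j'])  # response window

    win.close()
    core.quit()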

Share schematic representations of methods and results

However, while sharing source code offers important benefits relative to current practice, code sharing nevertheless suffers from significant limitations. Source code may not be easy to read, even for experts. Different research groups use different tools and programming languages; shared IDL code will be of little use to researchers who run their analyses in MATLAB. Source code may also be poorly organized, variables and functions may be given opaque names, and comments and documentation may be sparse or even absent (Morin et al., 2012). In other words, the fact that source code can be interpreted by a machine does not guarantee that it can be interpreted by a human. To provide an extreme example, I suspect that most readers of this article will, like me, find the following JavaScript code difficult to understand:

    eval(function(p,a,c,k,e,r){e=String;if(!''.replace(/^/,String)){while(c--)r[c]=k[c]||c;k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('3 4(0){1=0+2;5 1}',6,6,'a|result||function|plustwo|return'.split('|'),0,{}))

For interested readers, this code is equivalent to the following, which was obfuscated using a JavaScript compressor (http://dean.edwards.name/packer):

    function plustwo(a) {
        result = a + 2;
        return result;
    }

For additional examples of programs that are difficult for humans to read but nevertheless perfectly functional, see also Turing tarpit programming languages (http://en.wikipedia.org/wiki/Turing_tarpit) and obfuscated code contests (e.g., www.ioccc.org). And while these examples are admittedly extreme, unintentionally obfuscated code is not uncommon, even in popular software. For example, the SPM data analysis library includes lines like SPM = spm_spm(SPM) and variables like SPM.xX.xKXs.ds. Finally, to avoid hypocrisy, I must point out that my own code is often challenging to read, even for me (for many examples, see my GitHub page at https://github.com/jmcarp). Many resources are available to help scientists write clearer, better organized, and better documented code, such as Software Carpentry (http://software-carpentry.org/). But perhaps it is unreasonable to expect scientists to become software developers as well, and I suspect that few full-time researchers have the time to do so. Altogether, these examples illustrate that the provision of raw source code alone is no guarantee that human scientists will be able to understand this code in a useful way. In other words, while sharing source code is necessary for successful research replication, it may not be sufficient.

To ensure that would-be replicators can both execute and understand the methods used in previous studies, investigators should also create and share formal, schematic descriptions of both experimental methods and research results. Such representations would convey essentially the same information as the natural language descriptions detailed above, but with reduced ambiguity and, most critically, in machine-readable formats. For example, a schematic representation of a Stroop task might include entities describing different task conditions, stimulus properties, run durations, and so on. Similarly, a schematic representation of a data analysis pipeline would include nodes for all processing steps (e.g., realignment, spatial smoothing) and parameters (e.g., reference volume, smoothing kernel). A schematic description of a set of results would include the model(s) used (e.g., t-test, multiple regression, etc.), degrees of freedom, contrast vectors, and links to statistical maps (e.g., spmT, COPE, etc.), as appropriate. These representations can also precisely capture analysis workflows, enumerating the inputs and outputs of each processing node.
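
To make this concrete, the sketch below shows what such a description might look like, written here as a plain Python dictionary rather than in any established schema; the field names are hypothetical and are not drawn from the XCEDE Data Model or any other standard.

    # Hypothetical, schema-like description of a preprocessing pipeline and
    # one contrast (illustrative field names only).
    analysis_description = {
        'preprocessing': [
            {'step': 'realignment', 'reference_volume': 'first'},
            {'step': 'spatial_smoothing', 'kernel_fwhm_mm': 8.0},
        ],
        'model': {
            'type': 'multiple_regression',
            'degrees_of_freedom': 18,
            'contrasts': [
                {'name': 'incongruent_gt_congruent',
                 'vector': [1, -1],
                 'statistic_map': 'spmT_0001.nii'},
            ],
        },
    }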

Describing research methods and results using machine-readable schemas (in addition to natural language descriptions and code sharing) confers a variety of important benefits. For example, schematic descriptions of research methods avoid much of the ambiguity of even the most thorough natural language descriptions. Natural language is marked by polysemy: the use of a single word or phrase to represent different ideas. For example, “normalization” may refer to spatial registration of individual brains to a common template or to scaling all voxels to a common average (or to a variety of other procedures). Natural language descriptions of research methods are also made ambiguous by synonymy: the use of different words or phrases to represent a single idea. For example, different studies use different terminology to describe correction for differences in the acquisition timing of two-dimensional slices. Some reported that data were “corrected for differences in slice-acquisition timing” or that “slice-timing correction” was applied. Others used longer descriptions, reporting that “timing offsets between slices were corrected using cubic-spline interpolation.” Still others used abbreviated descriptions that could easily be confused with other procedures, noting that data were “time-corrected.” In contrast, an appropriate formal description would contain a single, unique term for each procedure, reducing ambiguity due to polysemy and synonymy.
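
A toy controlled vocabulary illustrates how a formal description collapses synonymous phrasings onto a single term; the canonical term name used here is hypothetical.

    # Toy illustration: several free-text phrasings map onto one canonical
    # (hypothetical) procedure name, removing synonymy.
    SYNONYMS = {
        'slice-timing correction': 'SliceTimingCorrection',
        'corrected for differences in slice-acquisition timing': 'SliceTimingCorrection',
        'timing offsets between slices were corrected': 'SliceTimingCorrection',
        'time-corrected': 'SliceTimingCorrection',  # ambiguous in free text
    }

    def canonical_term(phrase):
        """Map a reported phrase to its canonical procedure name, if known."""
        return SYNONYMS.get(phrase.lower().strip(), 'UNKNOWN')

    print(canonical_term('Slice-timing correction'))  # SliceTimingCorrection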

Using machine-readable schemas would confer additional benefits for both authors and readers. Schematic representations of methods and results could be automatically translated into natural language documents, facilitating the often dull tasks of writing method sections and generating tables of results. Schema-to-language translators could also generate concise summaries of methods for readers looking to get the gist of the paper and comprehensive summaries for more methodologically focused readers. In addition, the wide availability of machine-readable meta-data describing methods and results would both elevate the quality of meta-analysis studies and reduce the time and effort required to conduct them. Traditional quantitative meta-analysis of fMRI experiments requires investigators to manually extract activation coordinates from each contrast of interest and each relevant publication. The NeuroSynth project (Yarkoni, Poldrack, Nichols, Van Essen, & Wager, 2011) expedites this process by automatically extracting activation results from published journal articles; with machine-readable descriptions of results available for every study, meta-analysis would require even less manual effort and would likely be more accurate.
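
A minimal sketch of such a schema-to-language translator, assuming the hypothetical dictionary format used in the earlier example, might render a smoothing step as a methods-section sentence.

    # Minimal sketch of a schema-to-language translator (hypothetical
    # field names, matching the dictionary example above).
    def describe_smoothing(step):
        """Render a smoothing entry as a methods-section sentence."""
        return ('Images were smoothed with a Gaussian kernel of '
                '{kernel_fwhm_mm:g} mm FWHM.').format(**step)

    step = {'step': 'spatial_smoothing', 'kernel_fwhm_mm': 8.0}
    print(describe_smoothing(step))
    # Images were smoothed with a Gaussian kernel of 8 mm FWHM.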

Guidelines for natural language descriptions of research methods and technical solutions for code sharing are already in place (see Table 1); barriers to using these approaches are largely cultural. In contrast, neither the formal representations of neuroimaging experiments nor the software to create and make use of these representations are mature enough for everyday use. Fortunately, though, recent technical progress in these areas has been impressive. In particular, Keator et al. (2013) have proposed an extension of the XCEDE schema for describing the acquisition and analysis of neuroimaging data, the XCEDE Data Model (XCEDE-DM). The International Neuroinformatics Coordinating Facility (INCF) plans to augment the freely available Nipype library (Gorgolewski et al., 2011) with the facility to automatically export workflows to this schema (http://datasharing.incf.org/ni/Projects#XNAT_-_NIPyPE_Workflow). If other popular analysis libraries, such as SPM and FSL, release similar utilities, many neuroimaging researchers may soon be able to export schematic representations of their work with minimal time and effort.
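
For readers unfamiliar with Nipype, the sketch below shows the kind of explicit, node-based workflow that such export tools operate on. It is a minimal example using two FSL interfaces with an assumed input file name; the XCEDE-DM export facility described above is separate and is not shown here.

    # Minimal Nipype workflow: two preprocessing nodes with explicit
    # parameters and connections. Workflows built this way are machine
    # readable and can, in principle, be exported to formal schemas.
    from nipype.pipeline import engine as pe
    from nipype.interfaces import fsl

    skullstrip = pe.Node(fsl.BET(frac=0.5), name='skullstrip')
    smooth = pe.Node(fsl.IsotropicSmooth(fwhm=6.0), name='smooth')

    wf = pe.Workflow(name='minimal_preproc', base_dir='/tmp/nipype')
    wf.connect(skullstrip, 'out_file', smooth, 'in_file')

    skullstrip.inputs.in_file = 'sub01_T1w.nii.gz'  # hypothetical input file
    # wf.run()  # executing the workflow requires a local FSL installation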

Altogether, ambiguous and incomplete reporting of research methods constitutes a significant challenge to the reproducibility of neuroimaging results. The present commentary makes three recommendations to address this challenge. First, natural language descriptions of methods and results should be augmented according to standardized reporting guidelines; existing guidelines should be updated and potentially expanded as research techniques evolve. However, natural language descriptions of computational procedures are inherently ambiguous; complete source code must also be provided if independent scientists are to be able to precisely replicate published findings. Finally, even the provision of source code alone does not guarantee reproducibility. Incomplete documentation and the use of different programming languages and libraries across labs may make it difficult for one researcher to understand another researcher’s code. Thus, researchers should also share formal, machine-readable descriptions of experimental procedures and results—for example, using the XCEDE Data Model. The technology required to implement the first two recommendations is already in place; the technology required to implement the last is making rapid progress and may soon be ready for widespread use.

Objections

Skeptical readers may argue that the recommendations advanced here are infeasible, unhelpful, or otherwise not worth pursuing. For example, some might suggest that these steps require too much time or effort to implement. Indeed, writing comprehensive method and results sections increases the time needed for authors to write neuroimaging papers and for readers and reviewers to evaluate them. I would offer several responses to this point. First, I would point out that all of the recommendations offered here are amenable to automation. Method sections need not be exquisite prose; before long, software may be able to generate basic methodological summaries with minimal human intervention. Similarly, as software improves, technical barriers to sharing source code and machine-readable representations of analysis workflows will continue to diminish. Second, many of these recommendations offer secondary benefits to investigators that are themselves sufficient to justify taking these steps. Code sharing through GitHub provides a free backup of source code, protecting researchers from hard disk failures and accidental changes or deletions to analysis software. Creating schematic representations of analysis workflows and results will expedite writing method and results sections of papers and may even obviate large parts of these sections entirely. Third, and perhaps most important, even if these recommendations impose costs of time and effort on researchers, these costs will be repaid many times over by making neuroimaging research more reproducible. The costs of writing a longer method section pale in comparison with those of running a large-scale follow-up study—only to discover that the original results cannot be repeated.

As has been pointed out by an anonymous reviewer, code sharing may be challenging when there is no code to share—specifically, when researchers use graphical user interfaces (GUIs) to collect and analyze data. Fortunately, many popular packages allow users to export their workflows to code with minimal effort. For example, experiments designed in the GUI environments of E-Prime and PsychoPy are automatically converted to machine-readable code when run. Similarly, analysis workflows constructed in the GUI provided by modern versions of the popular SPM package can be saved to machine-readable .m or .mat files (see the SPM8 manual (http://www.fil.ion.ucl.ac.uk/spm/doc/manual.pdf), chap. 42). In fact, code generated automatically by these packages may even be better organized and more amenable to automatic analysis than code written by researchers—most of whom, after all, are not professional software developers. Altogether, sharing source code often remains both feasible and valuable when analysis is conducted using graphical interfaces.

Skeptics might also point out that most readers and reviewers will never examine source code linked with most published papers. Reviewers have little incentive (other than public-spiritedness) to inspect or execute code. Furthermore, even experts may not be able to detect subtle (but potentially important) programming errors (http://www.russpoldrack.org/2013/02/anatomy-of-coding-error.html). However, even if these resources are not used by a majority of readers, the minority who do use them will make invaluable contributions to the literature. For example, if reviewers have access to the original source code, they can more easily detect suboptimal or incorrect methods; these errors can then be corrected, or the associated papers revised. Similarly, researchers conducting meta-analytic studies can use open source code and schematic descriptions to weight experiments by methodological quality. Only a few community members need to access this methodological information for the entire field to reap the benefits, and these benefits can continue to accrue years after the original studies are published. Finally, creating stronger incentives for reviewers to carefully inspect source code would further increase the utility of code sharing. For example, displaying reviewers’ names along with published papers might encourage more thorough evaluation of source code. Editors might also recruit methodological consultants specifically tasked with verifying that executing the provided code yields the expected results.

Researchers may also be concerned that their rivals might exploit these steps by scooping their ideas or detecting flaws in their studies. But, if anything, code sharing will prevent scooping. Sharing time-stamped code establishes academic precedence: The first research group to post code can prove they were the first to develop the corresponding analysis. And while code sharing may indeed enable independent scientists to expose flaws in published work, this is a good thing: better to be embarrassed by an error than to preserve flawed claims in the literature.

Finally, readers might object that the recommendations outlined here address only a subset of the challenges presently facing reproducible research. Such an objection would be entirely valid. For example, encouraging researchers to describe their methods and results, as recommended here, would not preclude fraud. The few researchers who are willing to fabricate or alter their data to produce favorable results could simply fabricate their method sections, source code, and so forth, as well. As was pointed out by an anonymous reviewer, these recommendations also do not directly address the problem of so-called researcher degrees of freedom: Researchers typically have access to a broad range of methods and may be tempted to choose the methods that yield the most favorable results over equally valid methods that happen to yield null results (Ioannidis, 2005b; Simmons, Nelson, & Simonsohn, 2011). Indeed, methodological flexibility poses a serious challenge to reproducibility in neuroimaging research (Carp, 2012a). Encouraging researchers to comprehensively describe their final methods does not prevent them from concealing other analysis pipelines they used that led to less interesting results. As a final point, I would suggest that comprehensive reporting is a valuable tool for preventing and fixing honest mistakes. Different strategies will be needed to address dishonest mistakes; for promising suggestions on this front, see the work of Simmons and colleagues (Simmons et al., 2011, 2012).

Conclusion

Incomplete reporting of methods and results poses a serious challenge to reproducibility in the functional neuroimaging literature: Without full disclosure, independent scientists cannot review or replicate published findings. I recommend three simple steps to increase transparency in the reporting of functional neuroimaging experiments: (1) providing comprehensive descriptions of research methods and results, according to standardized reporting guidelines; (2) distributing complete source code for all data collection and analysis procedures; and (3) sharing schematic representations of data collection and analysis pipelines, as well as results. These steps offer substantial benefits to the reproducibility and transparency of neuroimaging research, as well as a variety of secondary benefits to researchers. At the moment, cultural barriers discourage authors from pursuing the first recommendation, and technical barriers impede them from pursuing the third. To remove cultural barriers against comprehensive natural language descriptions, journals and professional societies should adopt standardized reporting guidelines and eliminate space limits on method sections, as a handful of these groups have already done. To remove technological barriers against the creation and distribution of schematic descriptions of research methods and results, methodological researchers should continue to develop software to automate these tasks. Altogether, increased adoption of these recommendations would foster transparency and improve reproducibility in neuroimaging research.