Primary Research Data and Scholarly Communication

David Martinsen

doi:10.1515/ci-2017-0309

Publicly available. Published by De Gruyter, May 24, 2017.

David Martinsen <dmartinsen@consultdpm.com>, formerly a Senior Scientist at ACS Publications, consults in scholarly publishing at David Martinsen Consulting in Rockville, MD, USA. He is co-chair of the CPCDS Subcommittee on Cheminformatics Data Standards, and is also a co-chair of the Research Data Alliance Chemistry Research Data Interest Group. ORCID.org/0000-0002-8667-5855

From the journal Chemistry International

Abstract

One of the questions that scholarly publishers have faced as a result of the increasing interest in research data is the relationship of journal articles to the research data behind those articles. From the 1970s forward, journals in the physical sciences published research data in the form of supplemental materials. Due to economic considerations and the sheer volume of data, this supplemental material was generally distributed in microform. In the late 1990s, some publishers began experimenting with digital distribution of research data. By the early 2000s, the volume of research data being submitted was causing problems for editors and reviewers. This tension was captured well in an editorial by Emilie Marcus in Cell in 2009. [1]

The importance of research data has been considered for some time. An interesting article by Müller [2] envisioned a somewhat primitive connected world in the late 1950s:

“It is manifestly impossible to embark upon a program to print prodigious detailed tables of all the data known to physical science. It would exceed our supply of paper, few could afford it, and storage would be a major problem. It would seem eminently feasible, however, to punch program cards for tens of thousands of cards for as many empirical or fundamental equations and the data to which they apply. In a suitable computer center or agency these could be interrogated when necessary and the detailed data sent to a subscriber by teletype or more leisurely by mail. Individual dialing of long-distance ‘phone calls will be with us all very soon. It is not too great an extrapolation to imagine the time when one will be able to dial a call to a computer agency and, upon requesting a certain code designation, have an answer returned in a matter of minutes. There are no existing technical impediments to this scheme. The problem is one requiring vision, organization, and dedicated policy.”

Two comments are interesting to note. First, even today, if one were to store all the raw data being collected through scientific research, from laboratory instruments to space telescopes, the global storage capacity would be strained. [3] Second, in a later part of the paper, Müller bemoans the likelihood that telecommunications might be overwhelmed by entertainment and advertising, to the detriment of science. It seems he was prescient in both the benefits and pitfalls of a large, distributed network.

It is also interesting to note that in the chemistry domain, a long tradition of aggregation of published data has grown without the additional efforts of individual researchers. These include various spectroscopic databases, such as the NIST, Wiley, and BioRad collections of spectral data of various types, as well as structural information in the CCDC, ICSD, and PDB collections. [4] The latter system, with the support of the US government, has become a repository where researchers are expected to deposit structural data concurrent with article submission.

As the cost of storing and disseminating digital content dropped, the economic burden for storing and distributing research data decreased. Because of its importance in describing and understanding research, the publication of research data has been advanced as a mechanism to bring greater transparency to science, to improve reproducibility, to reduce duplication of research, and to better detect fraud. [5] In fact, some advocates have proposed the publication of research data as an end in itself. These proposals, that data be considered a first-class object, [6] or that scientific dissemination in the form of journal articles be replaced by data that speak for themselves, are making many in the scholarly publishing world rethink the relationship of the data to the textual description of the research. By extension, this also affects the ways in which scientific reputations and careers are assessed (i.e., metrics).

Given all the positive rationales for researchers to publish and share their data, one might question why more scientists are not actively doing so. A number of reasons can be given. It should be said that, pre-Internet, supplementary material generally took the form of printed pages, even when the data themselves were collected digitally. Even though the transmission of digital data can be accomplished more easily, many researchers continue to provide their research data as word processing documents or PDF files. While these are adequate for human consumption, they are not necessarily useful for importing data into a software package for visualization or for reanalysis. Spreadsheets are also a common way for research data to be shared. These are easier to visualize and re-use than PDF files. However, at issue in all cases is the extra effort involved in providing enough descriptive metadata for readers or software processors to understand the experiments completely. It is simply easier to provide research data in the traditional formats. Preparing any data for publication takes time and effort, and preparing the raw data, together with the metadata about the experiment that makes the digital data usable, takes more still.
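To make the contrast with PDF files concrete, the following is a minimal sketch, not a recommended standard, of a small data table shared as CSV together with a JSON metadata record. The measurement values, the metadata fields, and the function names are all hypothetical and were invented for this illustration; a real deposit would follow whatever schema the chosen repository or community requires.

```python
import csv
import io
import json

# Hypothetical example data: every value and field name here is invented.
metadata = {
    "title": "Example UV-Vis absorbance series",
    "instrument": "hypothetical spectrophotometer",
    "columns": [
        {"name": "wavelength", "unit": "nm"},
        {"name": "absorbance", "unit": "dimensionless"},
    ],
}
rows = [(400.0, 0.12), (450.0, 0.35), (500.0, 0.27)]


def write_dataset(metadata, rows):
    """Return (csv_text, json_text): the data table and its metadata record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # The header row is generated from the metadata so the two cannot drift apart.
    writer.writerow([col["name"] for col in metadata["columns"]])
    writer.writerows(rows)
    return buffer.getvalue(), json.dumps(metadata, indent=2)


def read_dataset(csv_text, json_text):
    """Rebuild numeric records and a column-to-unit map without human help."""
    meta = json.loads(json_text)
    units = {col["name"]: col["unit"] for col in meta["columns"]}
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [{name: float(value) for name, value in row.items()} for row in reader]
    return records, units


csv_text, json_text = write_dataset(metadata, rows)
records, units = read_dataset(csv_text, json_text)
for record in records:
    print(", ".join(f"{name} = {value} {units[name]}" for name, value in record.items()))
```

The point of the sketch is only that a second researcher, or a program, can recover both the numbers and their units directly, which is not possible when the same table is embedded in a PDF page.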

Figure 1. The STM-ODE Data Publication Pyramid, showing the significant percentage of unpublished data. [9]

In addition, researchers feel that they have already shared their data when an article is published, with the data presented as tables or images in the article itself or in the supplemental material. This is one conclusion that can be inferred from a Wiley survey of 2250 authors, where 50 to 75% of researchers claimed to be sharing their research data as supplemental material. [7] A more recent survey, from Leiden University and Elsevier, [8] confirms this conclusion. Researchers who do publish their data do so almost exclusively within the current journal publishing environment, and over a third of researchers do not publish any research data. This study makes a distinction between disciplines with intensive data sharing, where sharing of datasets is necessary for analysis, and those disciplines where there is restricted data sharing. In chemistry, computational and cheminformatics studies may fall under the intensive data sharing scenario, while the experimental disciplines, perhaps the bulk of chemistry research, fall under the restricted sharing scenario. The study presents some of the challenges that need to be overcome to enable better data publication and sharing in those restricted sharing environments.

The scholarly publishing industry has, arguably, been proactive in the research data discussion. The International Association of Scientific, Technical, & Medical Publishers (STM, http://www.stm-assoc.org/) has been involved in a growing number of research data initiatives. STM provided input to the Opportunities for Data Exchange Project, which resulted in a report on integrating data and publications. [9] A data publication pyramid, as shown in Figure 1, has been used to describe some of the issues associated with research data. At the top of the pyramid, the smallest two segments represent the bulk of the data that has been prepared for publication, either within the article or in the supplemental materials. The bulk of the data, though, at the bottom of the pyramid, is never published but rather left on disks or in file drawers. While the pyramid is instructive, it seems to me that an iceberg might be a better metaphor. In the iceberg, as shown in Figure 2, [10] it is much more apparent that the hidden data below the surface is not nice and neat. However, neither is the portion above the surface, the published part, very nice and neat.

Figure 2. Image of an iceberg, [10] to indicate that the published data above the surface as well as the majority below the surface, are in need of refinement.

Each year, technology leaders in the STM publishing industry gather to review technology issues of the past year, and identify trends that will be important over the next five years. At the most recent meeting, in December 2016, the group produced a graphic that envisioned the scholarly communication ecosystem as a pinball machine. Through a combination of skill and chance, researchers, funding agencies, academic institutions, and publishers navigate a series of obstacles and opportunities, trying to gain points to enhance their reputations. In this representation, shown in Figure 3, [11] the center of the image is that of trust and integrity. With the rise of fake news and the “debunking” of science, the scientific enterprise must pursue the truth with vigor and with transparency. The publication of research data is a significant part of that effort.

One reason given for publishing data is to improve the reproducibility of science. Started in 1955, perhaps as a joke, the Journal of Irreproducible Results [12] now finds itself in the vanguard of studies in reproducibility, including reproducibility guidelines, academic departments devoted to reproducibility studies, [13] and most recently a manifesto. [14] Recent studies have cast doubt on the reproducibility of science, leading to the perception that perhaps scientific research itself, or at the very least the peer review process, is flawed. [15] Reports in the popular press of recommendations that contradict earlier recommendations add to this perception. For example, the US National Institute of Allergy and Infectious Diseases recently suggested that the rise in peanut allergies may be due to a lack of exposure to peanuts in young children. [16] Thus, rather than avoiding peanuts in the early years of life, exposing children to peanuts may actually be a better choice. There is certainly much that can be done to improve the quality of experiments and the quality of data collected. However, there are a couple of issues related to reproducibility that are relevant as we consider the realm of Big Data. First, there is a difference between correlation and causation. It is tempting for journalists hunting for the next attention-grabbing headline to jump from correlation to causation, and it is tempting for researchers to do the same. It is important to ask, though, whether in any given study there were enough samples or a large enough range of samples studied to even propose a general correlation, and only then ask whether the correlation actually means causation. In the life sciences in particular, where it is very difficult to control all of the possible variables in an organism, this challenge is great.
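The question of whether a study has enough samples to propose a correlation at all can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis drawn from any of the cited studies: it draws two unrelated random variables many times and records the strongest Pearson correlation that appears by chance alone. The sample sizes, number of trials, random seed, and function names are arbitrary choices made for this example.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def largest_spurious_r(sample_size, trials, rng):
    """Largest |r| seen across repeated draws of two independent variables."""
    best = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]  # unrelated to xs
        best = max(best, abs(pearson_r(xs, ys)))
    return best


rng = random.Random(2017)  # fixed seed so the run is repeatable
small = largest_spurious_r(sample_size=5, trials=200, rng=rng)
large = largest_spurious_r(sample_size=500, trials=200, rng=rng)
print(f"strongest chance correlation, n=5:   {small:.2f}")
print(f"strongest chance correlation, n=500: {large:.2f}")
```

With only five samples per study, some of the 200 simulated studies show a strong correlation even though none exists by construction; with 500 samples the strongest chance correlation is far weaker. Neither case says anything about causation, which is the second and separate question.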

Figure 3. The graphic summarizing the Tech Trends 2021 developed by the STM Future Lab Committee at its December 2016 meeting. [11]

However, even in the physical sciences, where it is perhaps easier to control the experimental conditions, the distinction between correlation and causation remains difficult. It is often the outliers that really lead to new understandings of science. For example, the announcement of the 2011 Nobel Prize in Chemistry, awarded to Dan Shechtman, notes that Shechtman “had to fight a fierce battle against established science”. [17] Could a machine have made such a remarkable discovery? Big data analytics are designed to look for correlations, with humans providing the judgement that the correlation is related to causation. It will be interesting to see if the tools and techniques related to cognitive computing and artificial intelligence can eventually lead to confidence in a conclusion that a specific correlation is indicative of causation.

A second aspect of reproducibility is related to a fundamental randomness in nature at a granular level and, as we develop more sensitive instrumentation, the limits of detection of a particular method. In fact, one indication of manipulated data is whether a match is too good. There is an inherent irreproducibility in nature; that is one of the reasons that, when conducting experiments, one should always conduct more than a single instance and make several measurements to compare.

Questions of reproducibility should also be considered within the context of the idea that negative data should be published in addition to positive data. [18] This idea perhaps has different significance in the life sciences than in the physical sciences. In the life sciences, it is useful to know whether a particular drug has an adverse effect on a significant percentage of the population, or in fact has any medicinal effect at all (keeping in mind the difference between correlation and causation). In the physical sciences, though, one might imagine that there are no negative results in that same sense. There are only results. It is up to the researcher to interpret the results. If the results don’t match the preconceived notion of the researcher, or the research community, then it is up to the researcher to explain the results. In this way, unexpected results become novel positive results. It is said that publishing negative results will prevent much duplication and waste of research dollars. However, if the negative results of the past were not questioned, would we have had publications on arsenic-based life [19] (subsequently proved to be an invalid conclusion), or on compounds of helium? [20]

On the other hand, despite these cautions, the benefits of access to both primary research data and big data analytics will be great. We are moving beyond the use of technology to simply collect, store, process, and preserve data to using it to analyze and interpret data. These cognitive technologies are still in their early stages, and we can expect big failures along the way. The expectation that technology companies need only turn their attention to these problems to quickly solve what researchers have been struggling with for years and years is not a proven assumption, at least while we are in this transition period. Looking back, we may think that the change was remarkably rapid. But for now, the perspectives of Francis Collins and Craig Venter on the 10th anniversary of the human genome publication are still relevant: [21]

“But for all the intellectual ferment of the past decade, has human health truly benefited from the sequencing of the human genome? A startlingly honest response can be found on pages 674 and 676, where the leaders of the public and private efforts, Francis Collins and Craig Venter, both say ‘not much’.”

References

1. Emilie Marcus, Taming Supplemental Material, Cell 139(1):11, 2009. https://doi.org/10.1016/j.cell.2009.09.021

2. Ralph H. Müller, Computer center for basic physical science data proposed, Anal. Chem. 30(8):55A, 1958. https://doi.org/10.1021/ac60140a754

3. Zachary D. Stephens, et al, Big Data: Astronomical or Genomical?, PLOS Biology, 2015, https://doi.org/10.1371/journal.pbio.1002195

4. http://webbook.nist.gov/chemistry/, http://olabout.wiley.com/WileyCDA/Section/id-406117.html, www.bio-rad.com/en-us/spectroscopy, www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/, https://icsd.fiz-karlsruhe.de, www.rcsb.org/pdb/home/home.do, Accessed 17 April 2017.

5. Sean Bechhofer, et al, Research Objects: Towards Exchange and Reuse of Digital Knowledge, Nature Precedings, 2010. https://doi.org/10.1038/npre.2010.4626.1

6. Phil E. Bourne, Tim Clark, Robert Dale, Anita de Waard, Ivan Herman, Eduard Hovy, and David Shotton, eds, FORCE11 Manifesto: Improving Future Research Communication and e-Scholarship, 2011, www.force11.org/about/manifesto, Accessed 17 April 2017.

7. Wiley Data Sharing Survey: https://figshare.com/articles/Data_Sharing_Infographic/3555993

8. Paul Wouters and Wouter Haak, Open Data: The Researcher Perspective, 2017. www.elsevier.com/__data/assets/pdf_file/0004/281920/Open-data-report.pdf, Accessed 14 April 2017.

9. Susan Reilly, et al, Report on Integration of Data and Publications, 17 October 2011, www.libereurope.eu/wp-content/uploads/ODE-ReportOnIntegrationOfDataAndPublication.pdf

10. https://pixabay.com/en/iceberg-ice-arctic-i-iceberg-snow-1321692/, Accessed 17 April 2017.

11. http://www.stm-assoc.org/standards-technology/tech-trends-2021/, Accessed 7 May 2017.

12. The Journal of Irreproducible Results, www.jir.com/

13. John Ioannidis, Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford 94304, California, USA.

14. Marcus R. Munafò, et al, A manifesto for reproducible science, Nature Human Behaviour 1, Article number: 0021, 2017, https://doi.org/10.1038/s41562-016-0021

15. John P. A. Ioannidis, Why Most Published Research Findings Are False, PLOS Medicine, Published 30 August 2005, https://doi.org/10.1371/journal.pmed.0020124

16. Addendum guidelines for the prevention of peanut allergy in the United States: Report of the National Institute of Allergy and Infectious Diseases–sponsored expert panel, World Allergy Organization Journal 2017, 10:1, https://doi.org/10.1186/s40413-016-0137-9

17. A Remarkable Mosaic of Atoms, www.nobelprize.org/nobel_prizes/chemistry/laureates/2011/press.html, Accessed 14 April 2017.

18. Natalie Matosin, et al, Negativity towards negative results: a discussion of the disconnect between scientific worth and scientific culture, Disease Models & Mechanisms 7:171-173, 2014, https://doi.org/10.1242/dmm.015123

19. Felisa Wolfe-Simon, et al, A Bacterium That Can Grow by Using Arsenic Instead of Phosphorus, Science 332(6034):1163-1166, https://doi.org/10.1126/science.1197258

20. Xiao Dong, et al, A stable compound of helium and sodium at high pressure, Nature Chemistry, 2017, https://doi.org/10.1038/nchem.2716

21. Editorial: The Human Genome at Ten, Nature 464:649-650, https://doi.org/10.1038/464649a, Published online 31 March 2010.

Published online: 2017-5-24

Published in print: 2017-7-26
