On the role of research data centres in the management of publication-related research data

Terms of use: This document may be saved and copied for your personal and scholarly purposes. You are not to copy it for public or commercial purposes, to exhibit the document in public, to perform, distribute or otherwise use the document in public. If the document is made available under a Creative Commons Licence you may exercise further usage rights as specified in the licence.


Summary
This paper summarizes the findings of a study of scientific infrastructure service providers, which were evaluated with regard to the services they might offer for managing publication-related research data. Through desk research and an online survey, we found that almost three quarters of all responding research data centres, archives and libraries generally store externally generated research data, and that this also applies to publication-related data. Almost 75% of all respondents also store and host the code of computation (the syntax of statistical analyses). Where self-written software components have been used to generate research outputs, only 40% of all respondents accept these components for storing and hosting. Eight in ten institutions stated that they take specific measures for the digital long-term preservation of their data. With regard to the documentation of stored and hosted research data, almost 70% of all respondents reported using the metadata schema of the Data Documentation Initiative (DDI); Dublin Core was used by 30% (multiple answers were permitted). Almost two thirds also used persistent identifiers to facilitate the citation of these datasets. Three in four respondents stated that they support researchers in creating metadata for their data. None of the respondents has yet implemented application programming interfaces (APIs) for uploading or searching datasets. Semantic technologies such as RDF are not in widespread use either.

Background
In economics, a growing share of the publications in scientific journals are empirical research papers in which the authors analyse either self-produced or externally available datasets in line with their own research interests.
Compared to other branches of empirical research, economists rarely compile their own datasets. A major exception is the field of experimental economics, where researchers often generate their own datasets in the course of investigations motivated by game theory. But these datasets are typically not documented appropriately, let alone archived for re-examination. Instead, empirical economists frequently use data obtained from official statistics or from surveys by specialised research bodies (e.g. from the ALLBUS of GESIS or from the SOEP at DIW Berlin). In addition, relevant data can often be bought from companies like Thomson Reuters or Bloomberg.
Although a rising number of publications in economics (as in most other scientific disciplines) are based on the analysis of datasets, there are currently few effective means to replicate or re-examine the results of an empirical article, to verify them, or to make them available for re-use and for the support of scholarly debates.
Even research data that are, in principle, publicly available will typically not be archived (e.g. in a final working file) together with the specific selection and adjustment procedures applied to them. This does not necessarily prevent replication, but it makes replication extremely difficult for ambitious analyses based on specific data selections and calculations.
This situation confronts both the scientific community and scientific infrastructure service providers like libraries and research data centres with multiple challenges.

Why is economic research often not replicable?
According to the literature, the reasons why economic research is often not replicable fall into several areas:
• First and most important, researchers lack incentives to share their data with the community. The academic reward system does not honour the often time-consuming effort of data sharing, in sharp contrast to publications, although "[a]n applied economics article is only the advertising for the data and code that produced the published results" (Anderson, Greene, McCullough and Vinod (2008), 101).
• Furthermore, economists may worry that data sharing could lead to personal disadvantages. Researchers who prepare and share data with the community do not receive appropriate compensation (e.g. reputation) for their efforts, and may even suffer career disadvantages because data sharing takes time that cannot be spent on their own research. In addition, many researchers suspect that others will "misuse" their data, for example through faulty interpretations or by using a dataset without due reference to its creator. Finally, the legal status of research data with regard to data sharing is not sufficiently clear, which also leads to reservations about data sharing (Siegert, Toepfer and Vlaeminck, 2012).
• Only a few economics journals have so far implemented guidelines obliging their authors to provide the data and the code of computation behind their statistical analyses. Such "data availability policies" may in some instances oblige the authors of empirical research papers to supply the underlying data and the code/syntax of their analysis along with the manuscript of the article. These policies are often in line with the "replication standard" formulated by Gary King (1995).
• Useful infrastructure components for managing publication-related research data are rarely used, which in turn prevents any uniform way of citing the underlying data. Available technical solutions such as Dataverse, a powerful tool for managing and documenting publication-related research data, have been adopted by only a few journals. A critical point in this context is how professional research data centres handle research-related data and what kind of services, if any, they offer.
Since autumn 2011 these issues have been systematically addressed by the project EDaWaX (European Data Watch Extended, www.edawax.de), funded by the German Research Foundation (DFG) (cf. Vlaeminck, Wagner, Wagner, Harhoff and Siegert, 2013). Some of the first findings are summarized in other publications from the project: one article describes the data sharing behaviour of applied economists (Andreoli-Versbach and Mueller-Langer, 2013), while other publications analyse data management in economics journals (cf. Vlaeminck, 2013).
The supplementary working paper at hand describes the results of an evaluation of scientific infrastructure service providers with regard to potential services for the management of publication-related research data in the social sciences and economics.

Do research data centres offer services for archiving publication-related research data?
Research data centres in particular could be ideal institutions for managing publication-related research data published as attachments to articles in scholarly journals. Their suitability stems from decades of expertise in handling social and economic research data, from core competencies in creating and maintaining metadata collected and tagged from surveys and, last but not least, from experience in managing access to these data (cf. Research Information Network, 2011).
The project EDaWaX therefore conducted a study to evaluate whether such services for publication-related research data are currently available from scientific infrastructure service providers like research data centres, libraries and archives. For this purpose a list of 46 scientific infrastructure organisations was prepared. It includes all German research data centres and data service centres accredited by the German Data Forum (RatSWD), research data centres organised within the Council of European Social Science Data Archives (CESSDA), the library networks in Germany, as well as single libraries and public archives.
In a first step, the websites of these organisations were examined with regard to potential services for storing and hosting publication-related research data.
The inquiry showed that a publication-related archive 7 exists at the ICPSR (Inter-university Consortium for Political and Social Research, University of Michigan), which is already used by numerous authors to deposit their publication-related data. DANS EASY, a service located in the Netherlands, does not offer a specific service for publication-related data, but can in principle also be used to deposit such data. However, the desk research did not yield further indications, so additional information had to be collected through an online questionnaire in order to evaluate the potential services of these organisations in more detail.

The online survey
In October and November 2012 an online questionnaire was sent to 46 organisations, among them 36 national and international research data and data service centres, one archive, seven library networks and single libraries, and three other organisations (non-European research data centres). Of these, 22 organisations responded to our survey (48%). This return rate may be considered quite satisfactory, especially when compared to average return rates of written surveys.
Due to the structure of the questionnaire, not all participating organisations responded to all questions, which explains deviations in the number of responses.
More important than the return rate is the structure of respondents and non-respondents. The large majority of all respondents came from research data centres in Germany and Europe (86.4%). Respondents from German library networks and archives were significantly under-represented, and the three research data centres from non-European areas also did not respond. We can only presume that the library networks and the archive do not offer any relevant service for research data management, and therefore did not respond to our survey.
7 ICPSR's publication-related archive has since been renamed "replication datasets".

Empirical findings
Initially, the survey asked whether institutions would, in principle, host and store publication-related research data. In addition, the survey asked whether organisations would also host and store (self-compiled) software components and the code of computation/syntax of statistical analyses. These three types of data are often part of empirical submissions to economic journals.

Datasets
More than three-fourths of all organisations examined generally accept external datasets for storage. At the same time, the lion's share of respondents reported that research data would only be accepted if certain criteria were met. Such criteria depend on the subject-specific competencies of many research data centres, but also on their regional, supra-regional or national remit. Moreover, technical and organisational aspects (e.g. proper documentation, machine-readability) and legal problems were cited as criteria. Approximately 74% of the respondents indicated that their organisations would also host these types of data. Where criteria for hosting were mentioned, the subject-specific orientation of an institution was stated as the main criterion.

Software
With regard to storing and hosting (self-compiled) software components, which are often used for economic simulations, our survey indicates that only a minority of just under a fourth of the organisations accept software components without restrictions. Another 17% pointed out that they have established criteria for assessing whether software can be stored and hosted (e.g. whether it is essential for the analysis of the data).
Hosting and storing software components can therefore be considered a gap: only a limited number of organisations offer this service.

Code of computation
Almost 70% of the organisations examined offer options to store and host the code of computation. However, a quarter of all organisations are not considering doing so now or in the near future. One respondent also stated a criterion: storing and hosting this code would only be useful in the case of derived variables.

APIs
As part of our analysis we also examined the availability of application programming interfaces (APIs), which enable automated exchanges of data. Our results show that fewer than half of all organisations have such interfaces at their disposal.
APIs were most frequently mentioned as a means of searching data (47%), followed by APIs used for uploading research data. Slightly more than a third (35%) of all respondents declared that they have an API for analysing research data.
Further analysis by EDaWaX showed, however, that the reported interfaces consist only of search and upload interfaces on the respondents' websites. We were not able to find an actual API. Presumably, APIs in the sense of external read and write access are by and large unknown among our respondents and not available so far.

Metadata schema and the creation of metadata

Employed metadata schemata
We were also interested in the metadata schemata currently used by the organisations in their daily work. Our survey shows that more than 70% of the respondents use DDI. Others, such as XML or Dublin Core, are used considerably less often (35% and 29% respectively). All other metadata schemata were used only sporadically.

Persistent Identifiers (PI)
In addition, we asked whether organisations assign persistent identifiers (e.g. Handle, DOI, URN) to datasets and other materials. The persistent identification of research data is an important issue, for instance because it enables researchers to cite datasets. More than 56% of the organisations in our sample assign such identifiers by default, but almost a third do not.
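To illustrate why a persistent identifier makes a dataset citable, the following minimal Python sketch builds a resolver link and a simple citation string from a DOI. The citation layout, the function names and all example values are assumptions made for illustration; the DOI shown is hypothetical and not registered, and the sketch does not reflect the practice of any surveyed organisation.

```python
from urllib.parse import quote

# Public DOI resolver; a DOI appended to this prefix forms a stable link.
DOI_RESOLVER = "https://doi.org/"


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI, percent-encoding unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


def cite_dataset(creator: str, year: int, title: str, publisher: str, doi: str) -> str:
    """Format a simple dataset citation that ends with the persistent link.

    The layout is an assumption for illustration, not a prescribed standard.
    """
    return f"{creator} ({year}): {title}. {publisher}. {doi_url(doi)}"


# Hypothetical example values; "10.1234/example.5678" is not a real DOI.
example = cite_dataset(
    creator="Doe, J.",
    year=2012,
    title="Replication data for an empirical article",
    publisher="Example Data Centre",
    doi="10.1234/example.5678",
)
print(example)
```

Because the link is derived from the identifier and not from the current storage location, the citation stays valid even if the hosting organisation moves the files.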

Support of Semantic Web Technologies
In our survey we also examined the implementation of RDF (Resource Description Framework). RDF is a general method for the conceptual description or modelling of information held in web resources. Among the organisations answering this question, only a minority of 6% stated that they use and disseminate RDF files. Almost a quarter of all respondents were not able to specify whether their organisation uses RDF, which presumably indicates that RDF is largely unknown.
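As a rough illustration of what an RDF description of a dataset can look like, the sketch below serialises a few statements as N-Triples, one of the standard RDF syntaxes, using the Dublin Core element namespace for the properties. The dataset URI, the field values and the helper names are hypothetical; the code is hand-written for illustration and does not reproduce any respondent's system.

```python
# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"


def literal(value: str) -> str:
    """Serialise a plain string as an N-Triples literal with basic escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def triples(subject: str, fields: dict) -> list:
    """Return one N-Triples line per (Dublin Core property, value) pair."""
    return [
        f"<{subject}> <{DC}{prop}> {literal(val)} ."
        for prop, val in fields.items()
    ]


# Hypothetical dataset URI and metadata values.
dataset = "https://example.org/dataset/42"
lines = triples(
    dataset,
    {
        "title": "Replication data for an empirical article",
        "creator": "Doe, J.",
    },
)
print("\n".join(lines))
```

Each line is a self-contained subject, predicate, object statement, which is what allows such descriptions to be merged with metadata from other sources.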

Support for creating metadata
A recurring critical issue in the reuse of research data is the quality of data documentation. We were therefore particularly interested in whether, and if so how, respondents support researchers in generating metadata.
Our survey shows that the majority (almost 65%) of all organisations do so.
Furthermore, we wanted to know whether this support is software-based, e.g. whether there is a web frontend where researchers can type in the required information, which is then converted into a standardised metadata schema. More than 35% of the respondents have such software-based support for researchers in operation. A striking number of answers fell into the category "other"; such support includes, for instance, written data deposit forms. Our question about the names of the software revealed that at least two institutions use Nesstar. Many organisations also use in-house developments.
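The conversion step behind such a web frontend can be sketched in a few lines of Python. The example below takes the values a researcher might type into a form and emits a simple Dublin Core record as XML. The form field names, the mapping and the wrapper element "metadata" are assumptions for illustration only; they do not describe Nesstar or any in-house tool mentioned by respondents.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from form field names to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "year": "date",
}


def form_to_dublin_core(form: dict) -> str:
    """Convert submitted form values into a Dublin Core XML record.

    Empty or missing fields are skipped, so the record only contains
    elements the researcher actually filled in.
    """
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for field, element in FIELD_TO_DC.items():
        value = form.get(field, "").strip()
        if value:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical form submission; the "year" field was left empty.
record = form_to_dublin_core(
    {
        "title": "Replication data for an empirical article",
        "author": "Doe, J.",
        "abstract": "Final working file and analysis syntax.",
        "year": "",
    }
)
print(record)
```

The same approach would apply to a richer schema such as DDI, with a longer mapping table and nested elements in place of the flat list used here.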

Digital long-term preservation
Finally, we wanted to identify to what extent the respondents' institutions have implemented specific measures for the long-term preservation of research data. Our survey indicates that more than 80% of all organisations have adopted such procedures.

Conclusion
Our results show that research data centres might be suitable places for hosting and storing publication-related research data, because they already fulfil many of the prerequisites. Nevertheless, none of the responding organisations currently seems to comply with all requirements for storing and hosting publication-related research data.
In detail, the outcome of our survey is:
• Almost three-fourths of all organisations in our sample generally accept external datasets, including publication-related research data. However, some limitations exist, for instance because of regional or subject-specific competence or because of the dataset's quality.
• Almost the same percentage (75%) of our sample in principle accepts the code of computation for storing and hosting. If (self-compiled) software is used to obtain the empirical results of a research paper, only a minority of 40% will accept it for storing and hosting.
• DDI is the most common metadata schema currently in use among our respondents (70%). XML and Dublin Core follow with shares of 35% and 30% respectively (multiple answers were permitted). Almost two thirds use persistent identifiers for their datasets and thereby facilitate citation of the data. Approximately three-fourths of all organisations support researchers in generating metadata for datasets.
• Interfaces (APIs) for searching, analysing or uploading datasets and other materials do not yet seem to be available. The use of RDF is also uncommon among the responding organisations.
• Digital long-term preservation is widespread among our respondents. More than 80% reported that their institution takes measures to ensure the long-term availability of its digital holdings.