If Data is Used in the Forest and No-one is Around to Hear it, Did it Happen? A Citation Count Investigation

In this article I describe the process and results of tracking a citation from a data repository through the article publication process, and of trying to add a citation event to one of our DOIs. I also discuss other confusing aspects of citation counts as reported in various systems, including reference managers, publisher systems, and aggregators.


Introduction
In my role as the Data Workflows Specialist at the University of Michigan Library, I reviewed large datasets and code deposits and developed data management workflows for researchers. I also supported various aspects of our research data repository, Deep Blue Data, 1 which is based on the Samvera Hyrax platform. 2 In this capacity, I worked with our developers to improve connections between our system and others to gather various metrics including Altmetrics 3 and citation counts for our datasets.
Deep Blue Data is a general institutional data repository for open-access data that does not require users to log in before downloading a dataset. While this feature reduces barriers to accessing our data, it can make it challenging to understand how, where, or if Deep Blue Data datasets are reused. Because of this issue, when I began looking into citation counts, I started with the "original use" of datasets, i.e., researchers citing their own datasets, to verify if the citation event process was working. Directly tracking these citations offered a more approachable option than using COUNTER. 4 Nonetheless, I learned that the process of capturing data citation events was not as straightforward as one would expect.
This paper follows my quest to understand how citation events are captured. This includes the important aspects of the citation format itself and the role that reference managers play in maintaining that format, the publisher process, interaction between the publisher and other systems such as Crossref, 5 the latter's relationship with DataCite (who mints our dataset DOIs), 6 and DataCite's accumulation of these events.
One method of capturing and displaying citation counts, particularly at "point-of-need," is to use the DataCite Data Metrics Badge shown in Figure 1. 7 In early January 2021, we added the Data Metrics Badge to our repository. A few weeks later, I checked some of our more popular datasets to see if they were showing any citations and discovered they were not. Indeed, none of our more than 400 datasets showed citations. I investigated further by spot-checking datasets and found that although there were articles related to these datasets, as indicated in our "Citations to Related Materials" field, researchers did not appear to be citing their datasets in the "References" section of their papers. In some cases, they mentioned their dataset, complete with DOI, in the abstract of the paper. In other cases, they only indicated the existence of the dataset in the "Data Availability" statement rather than citing the dataset in both the "Data Availability" section and the "References" section (Ball and Duke, 2015). I assumed these errant practices were the root cause of missing citations. After researching the issue, I determined they are only partially to blame. Figure 2 shows my rough understanding at that point of the Data Citation Pipeline and the major players in my investigation.

Figure 2
The Data Citation Pipeline.

Problem Investigation
My investigation into the "0" citation count began in earnest when a researcher whom I had been encouraging to cite his data in the "References" section informed me that he had just submitted a paper with the data cited in this manner. This paper provided me with a "known" example to track. The next step entailed waiting several months until the publication process was complete to see whether the citation was recognized.
In case the "normal" citation count process was not working, I tried using "Contribute Citations" 8 to our DataCite DOIs as mentioned by Cousijn et al. in the section "Contributing data citations: data repositories" (2019) as well as by the "Make Data Count" movement. 9 I reasoned that it would at least indicate whether articles and datasets were linked somehow. I followed the instructions using the DataCite API with a JSON payload indicating that the DataCite dataset DOI "IsCitedBy" (Starr & Gastl, 2011) an article DOI. See Figure 3 for an example of a JSON payload. 10  The first problem appeared at this point. After submitting the JSON payload via the Data-Cite API, there were no citations listed in the "Citations" field in the API results, as shown in Figure 4. 11 In addition, there were no new citations displayed on the DataCite Search results page for DOI 10.7302/dbfp-s644 as shown in Figure 5. 12
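To give a concrete sense of what such a payload looks like, the sketch below reconstructs an "IsCitedBy" relation for the DataCite REST API's relatedIdentifiers route. This is an illustrative assumption, not the exact payload from Figure 3: the citing-article DOI shown is a placeholder, and the real submission used our repository credentials against the DataCite API.

```python
import json

# Illustrative sketch of an "IsCitedBy" payload for the DataCite REST API
# (e.g. sent via PUT https://api.datacite.org/dois/10.7302/dbfp-s644).
# The article DOI below is a hypothetical placeholder, not the real article.
dataset_doi = "10.7302/dbfp-s644"
article_doi = "10.xxxx/example-article"  # hypothetical

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "relatedIdentifiers": [
                {
                    "relatedIdentifier": article_doi,
                    "relatedIdentifierType": "DOI",
                    "relationType": "IsCitedBy",  # dataset IsCitedBy article
                }
            ]
        }
    }
}
print(json.dumps(payload, indent=2))
```

The direction of the relation matters here: "IsCitedBy" asserts, from the dataset's side, that the article cites it, which is exactly the assertion that later turned out to be mishandled.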

Figure 4
Result of the "IsCitedBy" payload appearing in "references" rather than "citations."

Through further investigation, testing, and help from the DataCite Support staff, I discovered a bug in DataCite's system: using "IsCitedBy" causes the new citation to end up in the "References" section of the DOI record, not in the "Citations" section.
In an attempt to get a citation to show up, I tried different payloads and checked other repositories that use DataCite and display citation counts. When I noticed that Dryad's DOIs in the DataCite API view were using "Cites" instead of "IsCitedBy", I promptly modified my payload accordingly, and it worked: a single citation count was indicated in the "Citations" section. DataCite Support eventually created a DataCite GitHub issue for this bug. 13

The final version of the article I had been tracking was published in early November 2021 with its reference section hyperlinked. 14 This allowed me to see how the dataset citation was formatted and whether it would be recognized as an actual citation of the dataset by the DataCite Data Metrics Badge in the Deep Blue repository. Much to my amazement and chagrin, however, the dataset citation went from displaying the DOI in the references list, as shown in Figure 6, to omitting the DOI altogether in the "Markup" view of the Crossref API results, as shown in Figure 7 as number "7". It should be noted that this was how it looked in early November 2021.

As another example of publishers not handling datasets as well as articles, the Google Scholar link (circled in Figure 5) resolves, in a circular manner, to the AGU/Wiley article itself, not to the dataset (see Figure 8). 15 Rather than pointing to Google Scholar, it seems that AGU/Wiley could point to DataCite Commons 16 (they have links to Crossref for other citations) or, better yet, Google Dataset Search (Figure 9). 17 It should be noted that best practices were not followed in the naming of this dataset, where "Dataset for <article title>" should have been used. In discussion with an AGU representative, I learned that Google Dataset Search was not currently one of the options for linking in the references.
Armed with information about how dataset citations appeared to be processed, at least as compared to article citations at Wiley, I reached out to Shelley Stall, Senior Director of Data Leadership at AGU, to inquire about this second issue, since its markup was very different from how article citations were handled. It turned out that AGU and several other publishers were working on a fix for this very problem and planned to implement it later that year. Figure 10 shows the citation after the initial fix (Stall et al., 2022) in Crossref. 18 Unfortunately, the initial AGU/Wiley fix does not help DataCite recognize this mention in the references as a citation to the dataset in the DataCite Commons frontend (Figure 11). 19

Figure 10
The dataset citation markup after the initial fix. 18

"When the journal production process validates these citations, the machine-actionable version of the citation is, in present practice, not always handled optimally, and the integrity of the machine-actionable citation may be damaged. In such cases, the citation will not be able to give automated attribution and credit to the software or dataset authors, nor will it support links to the other persistent identifiers associated with the article." (Stall et al., 2022)

Current state of Issues
Since my initial investigations in 2021, I am pleased to report that fixes and improvements have been made.

20 Wiley's Data Citation Policy: https://authorservices.wiley.com/author-resources/Journal-Authors/open-access/data-sharing-citation/data-citation-policy.html
21 Crossref Event Data: https://www.crossref.org/services/event-data/

IsCitedBy
Now that DataCite has fixed the "IsCitedBy" bug, we can see the results of pushing the "IsCitedBy" payload described in Figure 3. In the DataCite API view, 22 "IsCitedBy" shows as the "relationType" (see Figure 12) in addition to appearing correctly in the "citations" section (see Figure 13). The Data Metrics Badge in Deep Blue Data also shows "1" (Figure 15).
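Programmatically, the check amounts to retrieving the DOI record and inspecting its citation attributes. The fragment below is a sketch against an abbreviated, partly hypothetical DataCite API response (the article DOI is a placeholder); the real response comes from a GET on the DOI, e.g. https://api.datacite.org/dois/10.7302/dbfp-s644.

```python
import json

# Abbreviated, partly hypothetical fragment of a DataCite REST API response
# for a dataset DOI; the citing-article DOI is a placeholder.
response_text = """
{"data": {"id": "10.7302/dbfp-s644",
          "attributes": {"citationCount": 1,
                         "relatedIdentifiers": [
                             {"relatedIdentifier": "10.xxxx/example-article",
                              "relatedIdentifierType": "DOI",
                              "relationType": "IsCitedBy"}]}}}
"""
attrs = json.loads(response_text)["data"]["attributes"]
print("citation count:", attrs["citationCount"])
for rel in attrs["relatedIdentifiers"]:
    print(rel["relationType"], "->", rel["relatedIdentifier"])
```

A "citationCount" of 1 here corresponds to the single citation the Data Metrics Badge displays.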

Publisher Fix
The publisher/metadata fix was released in late May of 2022 and in July 2022 the fix was processed for the DOIs I have been tracking, as shown in the Crossref JSON of Figure 16. 24
With the DOI split out and the citation type indicated, Crossref can update Event Data. Event Data would then be read by DataCite and the reference should show in the "referenceType" section.
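To illustrate why splitting out the DOI matters, the sketch below contrasts two hypothetical Crossref-style reference entries, stand-ins for the markup discussed above: one where the dataset DOI is buried in unstructured text and must be mined, and one where it is exposed in its own "DOI" field and is directly machine-actionable.

```python
import re

# Two hypothetical Crossref-style reference entries: before the fix the
# dataset DOI appeared only in free text; after the fix it is its own field.
references = [
    {"key": "ref-before",
     "unstructured": "Dataset available at https://doi.org/10.7302/dbfp-s644"},
    {"key": "ref-after",
     "DOI": "10.7302/dbfp-s644", "doi-asserted-by": "publisher"},
]

doi_pattern = re.compile(r"10\.\d{4,9}/\S+")

results = {}
for ref in references:
    if "DOI" in ref:
        # Machine-actionable: downstream systems read the field directly.
        results[ref["key"]] = ref["DOI"]
    else:
        # Free text only: the DOI has to be mined, and may be missed.
        m = doi_pattern.search(ref.get("unstructured", ""))
        results[ref["key"]] = m.group(0) if m else None

print(results)
```

Systems like Event Data can consume the structured field reliably; mining free text is exactly the fragile step where citations were being lost.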

Other issues of note
In addition to the issue with publishers that I have described, there are issues with data citations in other parts of the pipeline. One problem is in how repositories sharing the data are making the suggested citation available. Another is that reference managers do not make it easy to create or manage a data citation through to the other end of the pipeline with data aggregators, nor do they make it clear whether they are sharing data citation metadata correctly.
The Deep Blue Data repository provides a suggested citation via plain text in the metadata section of the deposit (see Figure 16). The suggested citation must be highlighted and selected with a pointing device and copied to the clipboard to be used. Deep Blue Data does not currently have integrations with reference managers, nor does it make the citation available in other formats such as Biblatex. 25 From Wiley's standpoint, having [Data set] or [Dataset] in the citation is critical for their process, as it is, or should be, an immediate indicator that this is indeed a dataset and should be handled as such (Federer, 2020).

Reference managers: how are they handling data citations?
As indicated previously, the citation component of [Data set] or [Dataset] is key to downstream processing. Unfortunately, not all reference managers handle this equally. With Zotero, for instance, the results can differ depending on how you use the tool. Per my exchange with Zotero support, the "Browser plug-in" option pulls metadata information from COinS (ContextObjects in Spans) first and then from "meta tags." 26 Alternatively, using the Zotero desktop application and providing a DOI via the "Add Item(s) by Identifier" option will cause the metadata to be pulled from DataCite. Figure 17 is an example that shows the results of entering 10.7302/ZCK4-0058. In this case, the dataset is imported as "Item Type: Document" because Zotero does not currently have a "Data" Item Type. To address this issue, Zotero has implemented a temporary measure in the form of a field called "Extra" that indicates "Type: dataset" and "DOI: 10.7302/ZCK4-0058."

Figure 18
Screen shot from the Zotero desktop "Add Item(s) by Identifier" option.

Using the Zotero plug-in for Microsoft Word to "Add/Edit Citation" results in a citation that includes the [Dataset] designation. For an example, see (Arbic & Schindelegger, 2021) in the "References" section of this paper. Because the metadata in DataCite differs slightly from the suggested citation in Deep Blue Data, the entry in the References section is not the same as the suggested citation, which is also less than ideal (see Figure 16).
Mendeley does not handle datasets well at all. It has no option to add a dataset by entering a DOI, nor does the system search DataCite. Mendeley does not have a "Dataset" Document Type either, a fact confirmed by Mendeley Support in this response via email: "You can add the dataset into your library first then cite it using the citation plugin. Since there's no Dataset document type, you can try the Generic document type." Biblatex has had an option for datasets using @dataset since at least 2019. 27 It should be noted that the researcher who submitted this dataset, Dr. Brian Arbic, uses Overleaf, 28 a LaTeX editor supporting Biblatex, when preparing his papers. In this way he avoided Zotero, Mendeley, and their issues with "type."
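For comparison, a minimal Biblatex @dataset entry might look like the following. This is a sketch only: the entry key and title are placeholders, while the authors and DOI are taken from the dataset discussed above.

```bibtex
@dataset{arbic2021data,
  author    = {Arbic, Brian K. and Schindelegger, Michael},
  title     = {<Dataset title>},
  year      = {2021},
  publisher = {Deep Blue Data},
  doi       = {10.7302/ZCK4-0058}
}
```

Because the entry type itself is @dataset, the citation style can emit the [Dataset] designation automatically, with no "Extra" field workaround required.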

In my cursory examination of the citation styles available for Zotero export, only APA supports the use of [Dataset], which is pulled automatically from the "Extra" field. Using another style requires manually including [Dataset] to ensure it is not missed downstream.
The concern with these manual copy/paste data-entry issues, and with data not being handled natively as a "type," is that the [Data set] or [Dataset] designation might be lost, along with the DOI.

Data Availability Statements
Although using "Data Availability" statements might suggest a possible solution, it does not solve the problem. Crossref does not currently check the "Data Availability" statement for dataset DOIs and my investigation indicates that information is not currently passed from the publisher to Crossref (at least not in the case of AGU/Wiley).
For evidence of this issue, I tracked a dataset DOI, https://doi.org/10.7302/fn7r-hq31, that was mentioned in a "Data Availability" statement but not in the "References" section of an article in Nature (Figure 18). 29
Unfortunately, there is no indication that data availability statements in the metadata are shared with Crossref. 30

Dataset Aggregators
Dataset aggregators sit at the other end of the dataset publishing pipeline and have their own problems because they are highly dependent on systems earlier in the pipeline. Interestingly, but not surprisingly, Google Dataset Search appears to use Google's massive computing power to mine Google Scholar for any mentions of dataset DOIs. 31 For example, one of our more popular datasets in terms of downloads and hits is the RIGA dataset. 32 According to Google Dataset Search, it has 50+ citations. The article listed as a "related resource" in our repository mentions the dataset DOI only in the "Abstract" of the paper.
Clarivate's Data Citation Index 33 seems somewhat unreliable for citation counts. For instance, using "Deep Blue Data" as a search term in "Data Source" returns 4,000+ entries (see Figure 19), far more than the 400+ we have in our system. Investigating this discrepancy, I discovered that most of the so-called "datasets" were actually articles in another repository, "Deep Blue Documents." In addition, some actual datasets show "1" citation and, for at least some of these, that citation was in fact listed in the "References" section of the related article. Unfortunately, in this case the citation is not showing in DataCite and therefore is not showing in our system as a citation count. It is not clear how the Data Citation Index is getting these citation counts; there are only vague mentions in their white paper of OAI-PMH being used to fill in metadata and references for citations. 34

Conclusion
My exploration was extremely small in scope. It should be noted that the purity of my process and results was somewhat compromised in the middle of this investigation because the Deep Blue Data repository was a participant in the RADS project. 35 As a result, most datasets published before March 31, 2022 had their DOI "relationType" metadata fields, including "IsReferencedBy", updated via script. Also, as of Fall 2022, Deep Blue Data has yet to implement DataCite's Data Metrics Badge, so it will not be visible when visiting the datasets discussed in this paper.
Nevertheless, the number of issues I encountered, even in the small sample used in this study, suggests that these problems may discourage researchers from citing their data or, at least, from citing it properly. My research also shows that those of us in the data repository field cannot assume these citation and reference processes are working effectively.

We need to test, check, and provide feedback to publishers, DOI organizations, and other systems in the citation pipeline if we want these systems to perform effectively. I have included an updated and slightly more complete Data Citation Pipeline diagram that incorporates additional data sources and metadata forms, including reference managers, JATS XML, 36 and aggregators (see Figure 21).

Figure 21
The updated Data Citation Pipeline.
In addition, pushing aggregation databases such as Data Citation Index to be more transparent about their sources for dataset citation counts would help provide a more accurate picture of dataset reuse.
It would be great to see updated recommendations of steps intended "to increase data citation and develop metrics," e.g., as detailed in Federer's 2020 paper, "Measuring and Mapping Data Reuse: Findings from an Interactive Workshop on Data Citation and Metrics for Data Reuse." These could go a long way toward resolving these issues and improving the data citation process.