Identifying and removing duplicate records from systematic review searches

OBJECTIVE
The purpose of this study was to compare the effectiveness of different options for de-duplicating records retrieved from systematic review searches.


METHODS
Using the records from a published systematic review, five de-duplication options were compared. The time taken to de-duplicate in each option and the number of false positives (records that were deleted but should not have been) and false negatives (records that should have been deleted but were not) were recorded.


RESULTS
The time for each option varied. The number of false positives and false negatives returned from each option also varied greatly.


CONCLUSION
The authors recommend different de-duplication options based on the skill level of the searcher and the purpose of de-duplication efforts.


INTRODUCTION
Systematic reviews continue to gain prevalence in health care primarily because they summarize and appraise vast amounts of evidence for busy health care providers [1,2]. Because they are used as the foundation for clinical and policy-related decision-making processes, it is critical to ensure that the methods used in systematic reviews are explicit and valid. The Cochrane Collaboration, for example, places a heavy emphasis on minimizing bias with a thorough, objective, and reproducible multi-database search [2], which has become the standard in systematic review processes [3]. Searching multiple databases, however, results in the retrieval of numerous duplicate citations. Also, due to the nature of the publishing cycle in the field of medicine, conference abstracts and full-text articles reporting the same information are often retrieved concurrently. In addition, although many have spoken out against the practice, some authors "slice, reformat, or reproduce material from a study" [4], which creates repetitive, duplicate, and redundant publications. As Kassirer and Angell argued, "multiple reports of the same observations can over emphasize the importance of the findings, overburden busy reviewers, fill the medical literature with inconsequential material, and distort the academic reward system" [5]. Removing these duplicate citations, also known as de-duplication, can be a time-consuming process but is necessary to ensure a valid and reliable pool of studies for inclusion in a systematic review.
The aim of this study was to explore and compare the effectiveness of various de-duplication features. Specifically, the authors examined and compared two categories of de-duplication strategies: de-duplicating in the Ovid and EBSCO database platforms and de-duplicating in three selected reference management software packages: RefWorks, EndNote, and Mendeley.

METHODS
Five de-duplication options were examined in this study: the Ovid multifile search (option 1), the EBSCO CINAHL non-MEDLINE limit (option 2), and three reference management software packages, RefWorks, EndNote, and Mendeley (options 3-5).
For the Ovid multifile option (option 1), which allows de-duplication across various Ovid products, we opened up MEDLINE and Embase in the Ovid platform and ran a search using the strategies that were designed for the aforementioned systematic review. We ran the "use" command and database codes for MEDLINE and Embase, which are "pmoz" and "oemezd," respectively, to ensure that the retrieved results were filtered appropriately (Appendix, online only). Then, we used the "remove duplicates" command for de-duplication.
For the EBSCO CINAHL option (option 2), we ran a search in CINAHL and limited the search results to non-MEDLINE citations. The results from the searches in Ovid and EBSCO were collated and recorded in two spreadsheets: the first one contained Ovid results only, and the second one contained both Ovid and EBSCO results.
For the other three options (RefWorks, EndNote, and Mendeley), we retrieved all citations from the systematic review and exported them to each de-duplication option. In RefWorks, we clicked on the "Exact Duplicates" and "Close Duplicates" buttons in the "View" tab and deleted all identified citations. In EndNote, we clicked on the "Find Duplicates" button under the "References" menu. We deleted everything in the EndNote library duplicate references group. We loaded references as a Research Information Systems (RIS) file into Mendeley, where they were automatically de-duplicated. "Check duplicates" from the tools menu was then run to check for close duplicates, all of which were merged. All sets of citations were downloaded and recorded on separate spreadsheets.
To investigate these five de-duplication options, we needed a sample set of citations and a "gold standard" file of de-duplicated references to compare against each option. To create the sample set of citations for this study, we reran search strategies that were developed for a systematic review on ward closure as an infection control practice in Ovid MEDLINE, Ovid Embase, and CINAHL from the database inception to September 11, 2014 [6]. All of these search strategies are provided in the online appendix.
To develop the gold standard sets, we screened and de-duplicated the citations by hand and recorded them in a Microsoft Excel spreadsheet. The detailed steps that we took to identify the duplicates in Excel are listed in the online appendix. To be considered duplicates, two or more citations had to share the same author, title, publication date, volume, issue, and start page information. The full-text versions of the citations were consulted when we were in doubt. In such cases, we also checked the population sizes, methodology, and outcomes to determine whether the citations were duplicates. Conference abstracts were deemed to be duplicates if full-text articles that shared the same study design, sample size, and conclusion were retrieved, even if their publication dates varied. Older versions of systematic reviews were deleted when there was a link between them and newer versions. All citations that were classified as duplicates were deleted from the spreadsheet. Ultimately, 2 gold standard sets were developed: one for just Ovid MEDLINE and Ovid Embase (1,087 citations) and the other for Ovid MEDLINE, Ovid Embase, and CINAHL (1,262 citations). The first gold standard set was developed for comparison against the results from the Ovid multifile search alone (option 1). The second gold standard set was developed for comparison against the other 4 options (options 2-5).
All sets of results from the de-duplication strategies outlined above were compared against the gold standard sets to identify false negatives (duplicate citations that should have been deleted but were not) and false positives (citations that were deleted but should not have been). We also recorded the time it took to de-duplicate results in each option (Table 1, online only). We took into consideration the results of this comparison and the time it took to de-duplicate with each option when determining the most effective strategy for de-duplication when searching the selected databases and using the selected reference management software.
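The matching rule described above (same author, title, publication date, volume, issue, and start page) can be illustrated with a minimal Python sketch. This is not the procedure used in the study, which was carried out by hand in Excel; the field names, the function names, and the choice to ignore case and surrounding whitespace are assumptions made for illustration only.

```python
from collections import defaultdict

# Fields that must all match for citations to count as duplicates.
# These field names are hypothetical; real exports will label them differently.
KEY_FIELDS = ("author", "title", "date", "volume", "issue", "start_page")


def duplicate_key(citation):
    """Build a comparison key from the matching fields (trimmed, case-insensitive)."""
    return tuple(str(citation.get(field, "")).strip().lower() for field in KEY_FIELDS)


def group_duplicates(citations):
    """Return the groups of two or more citations that share the same key."""
    groups = defaultdict(list)
    for citation in citations:
        groups[duplicate_key(citation)].append(citation)
    return [group for group in groups.values() if len(group) > 1]


def deduplicate(citations):
    """Keep the first citation seen for each key, preserving input order."""
    seen = set()
    unique = []
    for citation in citations:
        key = duplicate_key(citation)
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique
```

An exact-key rule like this cannot capture the judgment calls described above, such as pairing a conference abstract with its full-text article, so its output would still need manual review.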
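The comparison against the gold standard can be expressed as two set differences. The sketch below is illustrative only: it assumes each citation carries a stable identifier (a hypothetical convention, not something the study describes) and that the tool and the gold standard keep the same copy of each duplicate pair. If they keep different copies, an identifier-based comparison would overcount both error types.

```python
def compare_to_gold(gold_ids, retained_ids):
    """Classify the errors of a de-duplication option against a gold standard.

    gold_ids: identifiers of citations kept in the hand-screened gold standard.
    retained_ids: identifiers of citations the option kept after de-duplication.
    """
    gold = set(gold_ids)
    retained = set(retained_ids)
    return {
        # Kept by the option, but removed as duplicates in the gold standard.
        "false_negatives": retained - gold,
        # Present in the gold standard, but deleted by the option.
        "false_positives": gold - retained,
    }


# Toy example with made-up identifiers.
gold = {"c1", "c2", "c3", "c4"}
tool_output = {"c1", "c2", "c4", "c5"}
result = compare_to_gold(gold, tool_output)
print(sorted(result["false_negatives"]))  # ['c5']
print(sorted(result["false_positives"]))  # ['c3']
```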

RESULTS
The time spent on each de-duplication option varied (Table 1, online only). Including the time spent on reaching consensus, developing the gold standard samples of non-duplicate results took four hours and forty-five minutes. The Ovid multifile and CINAHL non-MEDLINE searches each took less than three minutes to retrieve the results. RefWorks took approximately ten minutes to delete exact and close duplicates. EndNote took three minutes to load and delete duplicates. Mendeley took five minutes, most of which was spent merging the close duplicates.
The number of false positives and false negatives returned from each de-duplication option varied greatly (Table 2). The Ovid multifile search alone resulted in 1,178 citations. The comparison to the gold standard for Ovid MEDLINE and Ovid Embase revealed that simply de-duplicating in Ovid resulted in 91 false negatives but no false positives.
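As a quick consistency check on these figures, the number of citations an option retains should equal the gold standard count plus the false negatives (duplicates left in) minus the false positives (records wrongly deleted). The short Python sketch below applies that relationship to the Ovid numbers reported above. The function name is hypothetical, and the check assumes each undeleted duplicate copy is counted as one false negative.

```python
def expected_retained(gold_count, false_negatives, false_positives):
    """Citations an option should retain, given its errors against the gold standard."""
    return gold_count + false_negatives - false_positives


# Ovid multifile search (option 1), using the figures reported in the text:
# 1,087 gold standard citations, 91 false negatives, 0 false positives.
ovid_retained = expected_retained(gold_count=1087, false_negatives=91, false_positives=0)
print(ovid_retained)  # 1178, matching the 1,178 citations reported for this option
```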

DISCUSSION
Our primary aim was to compare the effectiveness of various de-duplication options. We were particularly interested in verifying whether using the various de-duplication options resulted in false positives (citations that were deleted but should not have been). Similar to Jiang et al., we believe false positives are more detrimental than false negatives because systematic reviewers want to maintain the highest possible recall in retrieval [7]. As running the Ovid multifile search command alone did not result in any false positives, we recommend using this option to further refine the search results before exporting to a citation manager. The limitation of this approach is that it only works if users subscribe to both MEDLINE and Embase through Ovid. PubMed users are not able to use this method.
Running the non-MEDLINE command in CINAHL, on the other hand, was the least effective method of de-duplication as it resulted in forty false positives, which was the highest number amongst all the options. We found that using the non-MEDLINE option in CINAHL reduced the benefit of searching multiple databases. Multi-database searching is necessary because different articles are indexed differently in different databases, so there may be articles retrieved from CINAHL that are indexed in MEDLINE but are not retrieved by the MEDLINE search. The danger of the non-MEDLINE command is that it deletes these records, reducing some of the benefit of the multi-database search.

Table 2
Number of de-duplicated citations and breakdown of

Beyond the desire to minimize false positives, there is as yet no definitive consensus regarding how best to find and delete duplicates, although the prevalence and potential impact of duplicates remain a critical issue for those undertaking systematic reviews [8]. In 2014, Bramer et al. published a study testing the efficacy of de-duplicating with various reference managers, including RefWorks, EndNote, and Mendeley [9]. According to the authors, de-duplicating exact citations in RefWorks performed the worst, and de-duplicating with their proposed algorithm, named the Bramer method, yielded the best results in terms of accuracy and speed [9]. Because Bramer et al. did not distinguish between false negatives and false positives, we were unable to directly compare their results to the results of our study.
A 2013 study by Qi et al. revealed that relying solely on the auto-searching feature of reference management software, such as EndNote, is inadequate when identifying duplicates for a systematic review [8]. Our results confirm that the automatic de-duplicating option in EndNote is inadequate and must be supplemented by hand-searching, but the results also reveal that using this option may lead to losses of articles that should not be deleted.
These data suggest that researchers will have to individually determine their own thresholds of acceptability for false positives. If none are acceptable, none of the citation management de-duplication options can be used. If the researcher is confident that all key articles would be found by hand-searching and deems a relatively low percentage of false negatives and positives to be acceptable (Table 3, online only), we recommend Mendeley as the most effective tool. Effort should be made to individually investigate all of the close duplicates in RefWorks and Mendeley to check for false positives. In addition, the results from any de-duplication technique should always be manually reviewed to check for remaining duplicates. Using formulas in Excel, such as highlighting duplicates, can be a useful tool to speed up this process.
Even with these preliminary recommendations, we must emphasize that de-duplication of results is complex. Examples of some technical issues causing difficulties in identifying duplicates automatically while creating the gold standard datasets are:
- differences in journal names (e.g., "and" instead of "&")
- punctuation (e.g., some titles are exported with a period at the end, others are not)
- translation differences of non-English article titles
- author information or order of author names
These issues are often the result of unintentional human error that occurs during the processing of individual records, and eliminating them proves challenging. Nevertheless, as commercial service providers, database administrators need to be more vigilant. Elmagarmid et al.'s article provides an extensive list of duplicate detection algorithms and metrics that can be used to clean up databases [11].
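Some of the formatting differences described above can be reduced by normalizing fields before comparing them. The Python sketch below is a rough illustration under stated assumptions, not a method tested in this study: the function names are hypothetical, and the rules cover only the "and" versus "&" difference, trailing periods in titles, and author order. Translation differences in non-English titles are not addressed by this kind of normalization.

```python
import re
import unicodedata


def normalize_journal(name):
    """Lowercase a journal name, treat '&' as 'and', and drop punctuation."""
    text = unicodedata.normalize("NFKC", name).lower()
    text = text.replace("&", " and ")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def normalize_title(title):
    """Lowercase a title, remove trailing periods, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", title).lower().strip()
    text = text.rstrip(".")
    return " ".join(text.split())


def normalize_authors(authors):
    """Return an order-insensitive set of author names without punctuation."""
    cleaned = (re.sub(r"[^\w\s]", " ", author).lower().split() for author in authors)
    return frozenset(" ".join(parts) for parts in cleaned)


# Records that differ only in formatting now compare as equal.
print(normalize_journal("Infection Control & Hospital Epidemiology")
      == normalize_journal("Infection control and hospital epidemiology"))  # True
print(normalize_title("Ward closure to control outbreaks.")
      == normalize_title("Ward closure to control outbreaks"))  # True
```

Looser matching of this kind reduces false negatives but raises the risk of false positives, so any records it flags would still warrant a manual check.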

Limitations
This study does have limitations. Only two "gold standards" were used, and results may vary with other search topics. We were not able to explore the de-duplication options of other reference management software such as Zotero and Reference Manager. Future research may involve expanding the selection of reference management software.
There are many other directions that future research on this topic could take as well. For example, researchers could investigate the effectiveness of combining de-duplication codes (e.g., ..dedup) and options that can be used in bibliographic databases and refining those results with de-duplication features that various reference management software packages offer. Researchers could also test non-MEDLINE and non-MEDLINE journals commands in Embase to determine if these codes are more effective than the non-MEDLINE command in CINAHL. To foster further advancement in this field, more participation in research by librarians and information specialists is encouraged.