The persistence of error: a study of retracted articles on the Internet and in personal libraries.

Objective: 
To determine the accessibility of retracted articles residing on non-publisher websites and in personal libraries.


INTRODUCTION
A retraction notice is issued to alert readers when a published study is no longer scientifically valid or trustworthy. Unfortunately, reaching readers long after the original article was published remains a chronic problem. There is little evidence that retraction notices make much difference to the citation behavior of authors.
Retracted articles continue to be cited as valid studies for years after retraction notices have been issued [1][2][3]. While there is evidence that articles receive fewer citations after retraction compared to a control group [4,5] or in high-profile cases exposing extensive research fraud [6], highly cited articles continue to be frequently cited after retraction [2]. Communicating significant errors to readers through an article correction may be even more challenging. In cases where articles were corrected and republished, citation rates to the corrected version only begin to surpass those to the flawed original version eight to twelve years after republication [7]. If science is a self-correcting process [8,9], the persistence of error in the literature reflects a chronically slow and inefficient process.
There are many plausible explanations for why retraction messages fail to reach their intended audience:
• Journal publishers are inconsistent in how they alert readers. In spite of established best practice guidelines [10], journals are not consistent in how they report retractions [11,12]. While some publishers stamp retraction watermarks on their electronic articles, others append retraction statements as notes, list retraction statements solely on the journal website, or fail to alert readers in any form [13].
• Readers depend on many access conduits to the published literature. In the absence of publisher access, readers rely on many informal access routes to the literature, including direct author requests, reliance on peers, and informal sharing networks, among others [14,15]. An early study of the prevalence of published articles on public non-publisher websites found a high degree of access, especially to recent articles published in high-profile journals [16]. Lacking control over access points may make it more difficult for published updates to reach their intended audience.
• [...] self-archiving [17], and while few publishers permit self-archiving of the final published version of record, many authors either do not understand these limitations or disregard them [18,19]. Public deposit of author manuscripts is required by some granting agencies as a condition for funding (the National Institutes of Health, for example), and an increasing number of universities have passed institutional self-archiving mandates for their faculty [20]. The culmination of these overlapping policies has resulted in a plurality of coexisting article versions.
• Scientists depend on personal libraries. Researchers routinely save copies of relevant articles, electronically or in print, and depend upon bibliographic software (such as EndNote, Mendeley, RefWorks, and Zotero, among others) to aid in the authoring and citation process [21,22]. At present, none of these tools alerts the user when a correction or retraction has been made. As readers pay little attention to published errata [23], some have argued that scientists should routinely search for updates to the references they cite in their papers [24]. Others propose that submitting authors attest to journal editors that they have checked their references against a master list of retracted articles [25]. While noble in intent, neither of these proposals makes practical sense. Since only a tiny fraction of 1% of the scientific literature is corrected or retracted each year [11], there is little incentive for the reader to search for article updates.
Automated methods to notify readers when a paper is no longer correct, valid, or trustworthy may help to reduce the persistence of error in the scientific record and improve public trust in the veracity of scientific documents.
Despite impediments to the successful communication of corrections and retractions to readers, there have been several attempts to improve the system:
• Bibliographic indexing. Since 1984, the National Library of Medicine has indexed retraction notices in the MEDLINE literature index and created a bidirectional link between the retraction notice and the original article record [26].
• Best practices for medical publishers. The Committee on Publication Ethics has established a set of guidelines for publishers issuing and communicating retractions [10], which has been endorsed by the International Committee of Medical Journal Editors [27].
• Article status lookup. CrossRef has initiated a new service called CrossMark, which allows readers to look up status updates to an article simply by clicking an icon located conspicuously on the first page of the portable document format (PDF) version of the article and on the full-text article on publishers' websites [28,29]. The service is limited to article versions that the publisher has committed to maintaining with updates and excludes author manuscripts and other versions of articles that are outside the control of the publisher.
The purpose of this study is to investigate the extent of publicly accessible copies of retracted articles on the public Internet and in the personal libraries of scholars. The study does not include copies located on publishers' websites as this has been reported recently by Steen [13]. By understanding where these articles reside, and in what form, we can be in a better position to understand how errors continue to be promulgated through the scientific literature, predict the efficacy of new interventions such as CrossMark, and propose new services that enable publishers to more efficiently contact readers when an article has been updated or is no longer valid.

METHODS
This study comprises two parts: (1) a search for copies of retracted articles on the public Internet and (2) a search for copies of retracted articles in personal libraries.

Retracted articles on the public Internet
Retracted articles were identified using MEDLINE (via PubMed), a bibliographic database of biomedical literature produced by the National Library of Medicine. Searching for the phrase "retracted publication" in the Publication Type field for articles published between 1973 and 2010 identified 1,790 records. Eleven records were removed from the data set because the original MEDLINE record identified a collection of meeting abstracts and not a specific paper, leaving 1,779 records for analysis.
Using the Google search engine, an experienced librarian performed title searches to locate a public copy of each retracted article on a non-publisher website. Title searches were restricted to PDF files, and variant searches were performed when the printed article title might differ from the MEDLINE record. For example, titles containing Greek letters (e.g., alpha, beta, gamma) were searched with the letters in symbol form, in text form, and omitted from the title, to maximize positive matches. Long titles, especially multipart titles, were approached by searching each part of the title or by finding unique strings of words that would lead to a more accurate match. The PDF file of the article, when available, was downloaded to ensure a positive identification and to confirm public accessibility. All searches were conducted between July and September 2011.
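The Greek-letter strategy described above can be illustrated with a minimal sketch. The searches in this study were performed manually by a librarian, so the code below is not the method used; the function name `title_variants` and the three-letter mapping are hypothetical and chosen only for illustration.

```python
import re

# Illustrative mapping only; a fuller list of Greek letters would be needed in practice.
GREEK = {"alpha": "α", "beta": "β", "gamma": "γ"}
SYMBOL_TO_NAME = {symbol: name for name, symbol in GREEK.items()}
NAME_RE = re.compile(r"\b(" + "|".join(GREEK) + r")\b", re.IGNORECASE)
SYMBOL_RE = re.compile("[" + "".join(SYMBOL_TO_NAME) + "]")


def _tidy(text):
    """Collapse runs of whitespace left behind by substitutions."""
    return re.sub(r"\s+", " ", text).strip()


def title_variants(title):
    """Return the title as given, with Greek letters as symbols,
    with Greek letters spelled out, and with Greek letters omitted."""
    as_symbols = NAME_RE.sub(lambda m: GREEK[m.group(1).lower()], title)
    as_text = SYMBOL_RE.sub(lambda m: SYMBOL_TO_NAME[m.group(0)], title)
    omitted = SYMBOL_RE.sub(" ", NAME_RE.sub(" ", title))
    variants = []
    for candidate in (title, as_symbols, as_text, omitted):
        candidate = _tidy(candidate)
        if candidate and candidate not in variants:  # drop duplicates, keep order
            variants.append(candidate)
    return variants
```

Each returned string would be submitted as a separate title search; a title without Greek letters yields a single variant.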
Publicly accessible copies of retracted articles were classified by article version (published version, final manuscript, author proof, reviewer manuscript, other) and type of website hosting the article. For non-English language websites, Google Translate was used to help in the classification.

Records of retracted articles in personal libraries
Software using the Mendeley application programming interface (API) was written to search for the presence of records of retracted articles in the personal libraries of Mendeley users. Mendeley [30] is a popular bibliographic tool that helps scholars manage references, create bibliographies, and share papers with other Mendeley users. While there are other bibliographic management tools used by scholars (EndNote, RefWorks, Zotero, among others), at the time of this study, Mendeley was the only product that permitted the development of such an interface to its data set. While we do not consider the behaviors of Mendeley users to be necessarily generalizable to the behaviors of all bibliographic database users, Mendeley does provide us with a novel window into the private libraries held by scholars. The source code, documentation, and implementation information can be found at https://www.github.com/fireisbo/retr/.
The Mendeley interface was designed to operate in batch mode using a search strategy that relied upon title keyword matches, PubMed identification numbers (PMIDs), and digital object identifiers (DOIs). After eliminating case and common punctuation, the algorithm first performed a title keyword search, saving a list of possible matches, and then attempted to verify the results with a PMID or DOI, if available. A positive match was returned with a Mendeley uniform resource locator (URL) and the number of users with the same record; Mendeley defines these users as "readers." If multiple versions of the same record were located, the search algorithm combined them and reported the total number of Mendeley readers. The relationship between public accessibility of retracted articles and their frequency in Mendeley libraries was tested by logistic regression.
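The matching steps described above (normalize the title, collect possible matches, verify by identifier, sum readers across versions) can be sketched as follows. This is not the study's software and makes no calls to the Mendeley API: the record layout (dictionaries with `title`, `pmid`, `doi`, and `readers` keys) and the function names are hypothetical. The sketch also assumes that a candidate lacking identifiers is accepted on its title alone, which the text does not specify.

```python
import string

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_title(title):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(title.lower().translate(_PUNCTUATION).split())


def identifiers_agree(target, record):
    """Reject a candidate only when a shared identifier conflicts.
    Assumption: candidates without a PMID or DOI pass on title alone."""
    for key in ("pmid", "doi"):
        wanted, found = target.get(key), record.get(key)
        if wanted and found and str(wanted).lower() != str(found).lower():
            return False
    return True


def total_readers(target, candidates):
    """Sum reader counts over all candidate records that match the target."""
    wanted_title = normalize_title(target["title"])
    total = 0
    for record in candidates:
        if normalize_title(record.get("title", "")) != wanted_title:
            continue  # step 1: title match after normalization
        if not identifiers_agree(target, record):
            continue  # step 2: verification by PMID or DOI, if available
        total += record.get("readers", 0)  # step 3: combine versions
    return total
```

Deleting punctuation outright (rather than replacing it with a space) keeps the comparison symmetric, because the same rule is applied to both the MEDLINE title and the library record.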

RESULTS

Retracted articles on the public Internet
Two hundred eighty-nine (16.2%) of the 1,779 retracted articles could be located on non-publisher websites. Twenty-seven (9.3%) of the 289 articles could be found in multiple locations (maximum 5), providing a total of 321 publicly accessible copies of retracted articles.
Three hundred four (95%) of these copies were the publishers' version; 13 (4%) were final manuscripts (the accepted, peer-reviewed, and author-revised manuscript); and 4 (1%) were other article versions such as the author proof or reviewer manuscript (Table 1).
The PubMed Central (PMC) repository was the most frequent location of retracted articles, containing 138 (43%) of found copies. Ninety-four (29%) of the copies were located on educational websites, such as the web pages of professors, research labs, or institutes, or on course pages or pages belonging to journal clubs. Twenty-four (7%) were found on commercial websites, such as those selling medical devices, dietary supplements, or direct-to-patient services. Sixteen (5%) were found on advocacy websites, such as a support group for patients with multiple sclerosis, and just 10 (3%) were found in institutional repositories.
As article retractions have increased in recent years, so has accessibility to public versions of those articles (Figure 1).
Just over one-quarter (26% or 82) of the copies of retracted articles located in this study contained some retraction statement. Sixty-six of these 82 copies (80%) were accessible from the page-view in PMC, a format that provides access to an article 1 page at a time within a larger web page that contains bibliographic data. Removing these and focusing on full PDF files of retracted articles, just 16 (5%) of the copies contained some form of retraction statement.
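As a reading aid, the percentages reported above can be recomputed from the stated counts. The sketch below assumes that the denominator for the copy-level percentages is the 321 copies found, which is an interpretation of the text rather than something it states outright; the helper name `pct` is hypothetical.

```python
def pct(part, whole, digits=0):
    """Percentage of `whole` represented by `part`, rounded to `digits` places."""
    value = round(100 * part / whole, digits)
    return int(value) if digits == 0 else value


ARTICLES_SEARCHED = 1779
ARTICLES_FOUND = 289
COPIES_FOUND = 321  # 289 articles, 27 of which appeared in more than one location

# Copies by article version (counts taken from the text).
copies_by_version = {"publisher version": 304, "final manuscript": 13, "other": 4}

# Copies carrying a retraction statement.
with_statement = 82
with_statement_pmc_page_view = 66
with_statement_full_pdf = with_statement - with_statement_pmc_page_view  # 16

share_of_articles_found = pct(ARTICLES_FOUND, ARTICLES_SEARCHED, 1)
share_with_statement = pct(with_statement, COPIES_FOUND)
share_full_pdf_with_statement = pct(with_statement_full_pdf, COPIES_FOUND)
```

Under that assumption the recomputed values agree with the figures in the text, including the 5% of copies that carried a retraction statement in the full PDF.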

Retracted articles in personal libraries
We could locate records for 1,340 (75.3%) of the retracted articles in the personal libraries of Mendeley users. On average, each record was found in 3.4 libraries (SE = 0.19, maximum = 133).
Generally, retracted articles published by the most prestigious scientific journals (Science, Nature, New England Journal of Medicine, Cell) were found more frequently in personal libraries than articles published in specialist journals.
Excluding both institutional and government repositories, as these sources often reflect formal depository agreements with authors or publishers, highly shared articles (defined as those articles shared by 4 or more Mendeley users) were more than 3 times as likely to be found on the public Internet (odds ratio 3.28; 95% CI 2.33-4.61, P < 0.0001).
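For readers less familiar with logistic regression output, the reported odds ratio and confidence interval can be related to the underlying coefficient. The sketch below assumes a Wald-type interval (the exponentiated coefficient plus or minus 1.96 standard errors), which is a common convention but is not stated in the text; the function names are hypothetical, and the underlying counts are not available to reproduce the regression itself.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def coefficient_from_report(odds_ratio, ci_low, ci_high, z=Z_95):
    """Back out the log-odds coefficient and its standard error
    from a reported odds ratio and confidence interval."""
    beta = math.log(odds_ratio)
    standard_error = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, standard_error


def wald_interval(beta, standard_error, z=Z_95):
    """Rebuild the odds-ratio interval from the coefficient and its standard error."""
    return (math.exp(beta - z * standard_error),
            math.exp(beta + z * standard_error))


beta, se = coefficient_from_report(3.28, 2.33, 4.61)
low, high = wald_interval(beta, se)
```

The rebuilt interval matches the reported bounds to within rounding, which indicates that the reported interval is symmetric on the log-odds scale, as a Wald interval would be.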

DISCUSSION
The openness of science is greatly promoted by the informal sharing of scholarly documents. This benefit of access, however, comes with a cost when retracted articles persist, without notification of invalidation, on the public Internet and in personal libraries. Through a general web search engine, using ordinary means, 1 in 8 retracted articles could be located on non-publisher websites; the vast majority of these papers contained no information about their retracted status. In addition, records for three-quarters (75%) of the retracted articles studied could be found in the personal libraries of scholars. Without a systematic means to update readers on the status of articles, error is likely to persist in the scientific literature.
Institutional and funder mandates, requiring the deposit of final manuscripts in a public repository, may increase the dissemination of incorrect and invalidated papers, especially to those without formal access to the published literature. With few exceptions, digital repositories are not designed with the functionality to remove, update, or append what is deposited in them and, at present, lack editorial oversight that would allow such changes to take place. The benefits of providing free access to unmanaged versions of scientific papers may come with the risks of undermining public trust in these services.
As the PMC repository contained 138 (43%) of all copies of retracted articles located in this study, changing the access channels to PMC-hosted articles may provide a partial solution. For instance, PMC could prevent access to the full-image view (PDF) of retracted articles and direct all requests through its page-view option, a view that presents the bibliographic record of the article along with its retraction notice. As PMC is but one of many public sources of retracted articles, however, this is not a complete solution.
In spite of various copyright restrictions, the vast majority (95%) of publicly accessible retracted papers were the publishers' version, suggesting that the CrossMark symbol, if implemented broadly, would be viewed by most readers irrespective of the papers' location. There are caveats in the way CrossMark is implemented that may limit its efficacy, however.
First, placing a CrossMark update logo on the publisher's website and PDF makes the reader responsible for initiating the verification check. While retractions are serious, they are rare events, comprising just 0.02% of the current medical literature indexed in MEDLINE [11]. Although the CrossMark data set will include other types of notices, including corrections and editorial expressions of concern, the vast majority of reader-initiated checks would result in no status updates to the current document. Such repetitive exposure to nonevents may desensitize readers to the importance of checking for article updates and lead to systemic nonuse after a period of novelty has worn off.
Second, scientific authors routinely use bibliographic software to assist them in the writing and citation process [21]. As retractions can take place years after the initial article was published, authors may cite an earlier (unretracted) version of the article, depend upon an earlier citation, or rely upon the incorrect citations of other authors [31,32]. As CrossMark is unable to push alerts to readers or replace older PDF files on user machines with current copies, it is likely that retracted papers, especially older papers published without the CrossMark symbol, will be cited for some time.
[Figure 1: Frequency of article retractions indexed in MEDLINE and prevalence of article copies discovered on public, non-publisher websites]
There are many factors that contribute to the persistence of error in the scientific literature. Fixing just one broken link in the chain of communication is unlikely to solve the problem entirely: a tripartite approach is needed.

Before reading
Alerting readers through bidirectional indexing of correction and retraction statements has been a function of MEDLINE since 1984 [26]. Other literature indexes, such as Web of Science and Scopus, connect the original article to the retraction statement through citation linking, although this form of alert is far less explicit to the searcher. General search engines (e.g., Google and Google Scholar), while heavily employed by academics and clinicians as tools for discovery [22,33,34], presently lack any formal mechanism to alert readers to significant corrections or retractions. Upon implementation, CrossMark metadata will be openly available to literature databases and search engines, allowing such update notices or links to be integrated into search results.

Before writing
Bibliographic software such as EndNote, RefWorks, Mendeley, and Zotero could also take advantage of the open CrossMark metadata to develop alerting systems that would notify readers when an article in their libraries has been updated and provide a DOI link to the most current version.

Before publishing
The references included in journal manuscripts could be checked against the CrossMark database and tagged when retractions or corrections have been issued. A report would be generated for the journal editor, which would then be conveyed to the corresponding author. While there are valid reasons for citing a retracted paper-to highlight its error, for instance-many citations are perfunctory in nature and are not required to understand the nature or significance of the research [35,36]. The careful avoidance of citing invalid work (or contextualizing it within a clear negative citation) would help reduce the miscommunication of erroneous research to other readers.
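A pre-publication check of this kind can be sketched in a few lines. The example below assumes that the status updates have already been obtained as a local mapping from DOI to status; it does not call CrossMark or any other service, and the names `normalize_doi`, `flag_references`, and `status_by_doi` are hypothetical.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' label."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


def flag_references(references, status_by_doi):
    """Return (label, doi, status) for every reference with a recorded update.

    references: iterable of (label, doi) pairs from the manuscript.
    status_by_doi: hypothetical mapping of normalized DOI -> status string,
    for example "retracted" or "corrected".
    """
    report = []
    for label, doi in references:
        key = normalize_doi(doi)
        status = status_by_doi.get(key)
        if status:
            report.append((label, key, status))
    return report
```

The resulting report is what an editor would pass to the corresponding author; references without a DOI would need a fallback such as the title matching described in the Methods. The same lookup could serve a reference manager scanning a personal library.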
None of these steps alone is enough to prevent the persistence of error in the literature; taken together, however, they may greatly reduce its effects.

LIMITATIONS

Generalizability
This study is limited to articles indexed in the MEDLINE database, so its findings may not generalize beyond the biomedical literature. Other literature indexes (Web of Science and Scopus) were not included in this study because neither provided adequate indexing of article retractions. The prevalence of public access to articles on non-publisher websites may vary greatly across disciplines. In some disciplines, such as high-energy physics and economics, manuscripts and working papers are widely circulated in their prepublication form.

Identification of public copies
This study is limited to what could be found using the Google search engine. As many PDF copies of articles lack metadata or full-text indexing (a problem that is more acute for the older literature), this study may underestimate the true prevalence of public copies.

Public copies on publisher websites
While we limited the study to publicly accessible copies of articles on non-publisher websites, as these copies are out of the control of the publisher, we located an additional nineteen final manuscripts (or "papers in press") that were still publicly accessible from the publishers' websites, none of which included a statement of retraction.

Identification of copies in personal libraries
As the Mendeley database is built from personal library records, title variation may have resulted in under-identifying records. In addition, confirmation of a record does not necessarily mean the reader has access to the article, although Mendeley does allow sharing of papers among readers within registered groups.

ACKNOWLEDGMENTS
Enrico Silterra, software engineer at Cornell University Library, developed the Mendeley search software and public interface.
Thanks to Gabriel Peterson, assistant professor, School of Library and Information Sciences, North Carolina Central University, who provided substantive feedback on the manuscript prior to submission.
This study was funded by the Publishers International Linking Association, which oversees the operation of CrossMark.
The author was solely responsible for the design and conduct of the study, collection, and management of the data; data analysis and interpretation; and preparation and submission of the manuscript, and can take responsibility for the integrity of the data and the accuracy of the data analysis.