Keywords
bookmarklet, grey literature, open science, peer review, pre-print, pre-publish, search engine, server
We sincerely thank reviewers for their constructive comments and thoughtful suggestions. We incorporated several of their suggestions, as reflected in the Conclusions and Limitations sections in the revised version. Specifically:
We added a few sentences on using Google Scholar to search for preprints under Conclusions: “Google Scholar (GS), a popular scholarly literature search engine that provides cross-discipline search functionality, does not include preprint articles as a filter option. Hence, many avid GS users try a workaround by including the word “preprint” in the query (e.g., “asthma preprint” or “CRISPR preprint”), with the assumption of retrieving only preprint articles fetched from major preprint servers. In contrast, the GS search results in a mixed population of articles comprising both actual preprints and peer-reviewed published articles in which the term “preprint” appears somewhere in the full text of the article.”
We mentioned the need for reordering the search results by date by adding a new paragraph at the end of the Limitations section: “Currently, the search.bioPreprint default search results are ordered by relevance without any option to re-sort by date. The authors are aware of the pressing need for this added feature and if possible will incorporate it into the next version of the search tool.”
We also added a sentence under Conclusions: “Referees during the grant or journal article review process might also find this bookmarklet useful as it quickly retrieves pre-published articles via the cross-platform preprint search.”
See the authors' detailed response to the review by Prachee Avasthi
See the authors' detailed response to the review by Cynthia Wolberger
Preprint servers are online repositories that manage access to manuscripts that have not yet been peer-reviewed or formally published in a traditional manner. Preprint manuscripts are not copyedited, but they do undergo a basic screening process to check against plagiarism, offensiveness, and non-scientific content. Authors may make revisions at any point, but all versions remain available online. It should be noted that the term “preprint” in this context refers to manuscripts posted by the authors themselves onto specific online servers, not articles made available online by publishers a few weeks ahead of traditional publication.
Preprint articles can be more difficult to discover than those published traditionally, as they are not currently indexed in Medline and therefore do not appear in PubMed search results. This suggests that many timely and relevant research reports potentially fall through the cracks, as traditional publication of a biomedical manuscript can take anywhere from a few months to a few years. This lengthy process is seen by researchers to be a hindrance to scientific advancement. In response, there is a developing movement of preprint advocates who propose that preprints play a role in “catalyzing scientific discovery, facilitating career advancement, and improving the culture of communication within the biology community”1. Preprint servers “enable authors to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals”2.
The history, rationale, and controversy surrounding preprint servers and the pace of the current publication process have been well addressed in other manuscripts3–14, news items15–22, and blogs or white papers23–32. We do not intend to duplicate this information here, but suggest exploration of our reference list for an overview of the current state of the topic.
There are currently only a small number of preprint servers catering to biological and biomedical research manuscripts.
arXiv is a venerable preprint server covering physics, mathematics, computer science, nonlinear sciences, statistics, and quantitative biology since 1991. arXiv is funded by Cornell University Library, the Simons Foundation, and many member institutions.
bioRxiv, operated by Cold Spring Harbor Laboratory, covers new, confirmatory, and contradictory results in research ranging from animal behavior and cognition to clinical trials, neuroscience to zoology.
F1000Research, a member of the Science Navigation Group, provides an open science platform for the immediate publication of scientific communication. Posters and slides receive a digital object identifier and are instantly citable. Articles with associated source data are published within a week and made available for open peer review and user commenting. Articles that pass peer review are then indexed in PubMed, Scopus, and Google Scholar. It should be noted that F1000Research is not technically a preprint server, but is included here because it does provide access to articles prior to and during the peer review process. See the Limitations section for details.
PeerJ Preprints covers biological, medical, life, and computer sciences. Their aim is to reduce publishing costs while still efficiently publishing innovative research, with an emphasis on not yet peer-reviewed articles, abstracts, or posters. Submissions are free, can be a draft, incomplete, or final version, and are typically online within a day after editorial approval.
Our intention is to present a resource that facilitates the quick and easy identification and access of scientific content located on preprint servers. The Health Sciences Library System at the University of Pittsburgh (HSLS) developed a tool to help researchers to quickly search preprint databases and discover cutting edge, yet-to-be published or reviewed biomedical research articles, search.bioPreprint (Figure 1). This search engine encompasses a federated search of arXiv, bioRxiv, F1000Research, and PeerJ Preprints. For ease of reading we will continue to refer to all sources of preprint articles as “preprint servers,” including the open science publishing platform F1000Research. We chose to publish this article in F1000Research and bioRxiv in order to support the preprint movement and to elicit feedback on usage of the tool, which will be updated as needed.
search.bioPreprint was created using the proprietary software IBM Watson Explorer, formerly Vivisimo Velocity, version 8.0-2 (IBM Corp, Armonk, New York, USA) to generate a meta search engine that compiles search term results from a pre-selected list of multiple sources into a single list ordered by the relevance of matching query terms. The results can then be further filtered by Source (e.g., the preprint servers of origin) or by Topic (e.g., microcephaly for a Zika virus search). The Topic search is accomplished via clustering, meaning the search results are organized on the fly by similarity in subject matter. Additionally, a “remix” link displayed next to the clustered topics reveals new secondary topics. This is done by clustering the same search results again, but explicitly ignoring the topics that were used in the initial clustering process.
The Health Sciences Library System at the University of Pittsburgh has repeatedly utilized IBM Watson Explorer software to develop, implement, and maintain several federated search engines focused on a variety of topics. These include: search.HSLS.OBRC, a portal for discovering bioinformatics databases and software via the Online Bioinformatics Resource Collection33; Clinical Focus, a portal providing quick access to high-quality clinical information34; and Clinical eCompanion, a portal with information for primary care35. Similarly, the U.S. National Library of Medicine (NLM) utilized the same software to create search engines for MedlinePlus, MedlinePlus en Español, and the NLM library website.
The search engine was created following the software manufacturer’s protocol. Briefly, the search URL and parameters are entered for each site, then the results are selected based on the XPath of the results within the HTML page. Finally, the individual sources are bundled into a single source to provide one search for multiple sites. A maximum of 200 total results are returned based on the licensing agreement with IBM; this also contributes to a short wait for return of results. The selected sources for retrieving preprint articles using search.bioPreprint are: (1) the quantitative biology section of arXiv.org, (2) bioRxiv, (3) F1000Research, and (4) PeerJ Preprints.
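The bundling step can be illustrated with a minimal sketch. It assumes each source has already returned a list of results carrying a numeric `score`; that field, the function names, and the sample data are hypothetical and do not reflect how IBM Watson Explorer ranks results internally. Only the cap of 200 results comes from the text above.

```javascript
// Hypothetical sketch: merge per-source result lists into one capped list.
// The `score` field is an illustrative assumption, not the ranking used by
// IBM Watson Explorer.
const MAX_RESULTS = 200; // cap stated in the text

function mergeResults(resultsBySource, maxResults = MAX_RESULTS) {
  const merged = [];
  for (const [source, results] of Object.entries(resultsBySource)) {
    for (const r of results) {
      merged.push({ ...r, source }); // remember the server of origin
    }
  }
  // Highest score first; ties keep insertion order (sort is stable).
  merged.sort((a, b) => b.score - a.score);
  return merged.slice(0, maxResults);
}

// Count results per server of origin, as a "Source" filter would display them.
function countBySource(results) {
  const counts = {};
  for (const r of results) {
    counts[r.source] = (counts[r.source] || 0) + 1;
  }
  return counts;
}

const demo = mergeResults({
  bioRxiv: [{ title: "A", score: 0.9 }, { title: "B", score: 0.4 }],
  arXiv: [{ title: "C", score: 0.7 }],
});
console.log(demo.map((r) => `${r.title} (${r.source})`));
console.log(countBySource(demo));
```

Tagging each result with its source at merge time is what makes a later filter by Source possible without querying the servers again.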
As an example, typing a single-word query term, such as CRISPR, into the search box results in ninety-one preprint articles culled from the aforementioned preprint servers (Figure 2, searched on 2 May 2016). Clicking on an article title redirects to that article at its original source. Search results may be narrowed by Topic or Source using the filters on the left side of the page. Using the CRISPR example, the ninety-one search results are grouped into shared Topics: fourteen articles on “Bacterial,” twelve articles on “Protein,” six articles on “Genome engineering,” etc. Expanding individual topics reveals a list of subtopics: clicking on the topic “Protein” redistributes the twelve articles into subtopics, including “CRISPR-Cas9,” “Image, Palindromic Repeat,” “Mutants, Generated,” etc. Clicking on a topic or subtopic reconfigures the search results to limit to these filtered articles.
Clicking on the “remix” button appearing next to “Top 91 results” regroups the original search results into additional topics such as “Cells,” “Advances,” “Drosophila,” etc that are not present in the first results iteration (Figure 3). This provides another opportunity to discover pertinent preprint articles, especially if a large number of results is returned.
The search results may also be filtered by Source. Selecting this will change the default display of topic-focused clusters to articles organized by Source, which in the current iteration is one of the four preprint servers searched by this tool: nineteen from F1000Research, two from PeerJ Preprints, six from arXiv, and sixty-five from bioRxiv (Figure 4).
Quotation marks are recommended for exact-phrase searches, e.g., “Zika virus.” The necessity of this was discovered after examining the search parameters of the various preprint servers. Because one of the preprint servers by default joins the words of a multi-word query with the Boolean operator “OR,” a search for a phrase such as zika virus returns multiple articles in which the only matching term is virus. Enclosing a multi-word query in quotation marks mitigates this problem and considerably improves the quality of results. A search for “zika virus” thus produces seventy-nine articles that are topically filtered into “Zika virus infection,” “Microcephaly,” “Discovery,” “Dengue Virus,” etc. (searched on 2 May 2016).
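The difference between the two behaviours can be shown with a toy matcher. This is only an illustration of OR semantics versus exact-phrase semantics; the function names and sample titles are invented for the example, and the preprint servers' actual search logic is not known to be implemented this way.

```javascript
// Toy matcher contrasting OR semantics with exact-phrase semantics.
// Illustrative only; real preprint servers use their own search logic.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// OR semantics: a document matches if it contains ANY query word.
function matchesAny(query, doc) {
  const words = new Set(tokenize(doc));
  return tokenize(query).some((w) => words.has(w));
}

// Phrase semantics: the query words must appear consecutively and in order.
function matchesPhrase(query, doc) {
  const q = tokenize(query).join(" ");
  const d = tokenize(doc).join(" ");
  return q.length > 0 && ` ${d} `.includes(` ${q} `);
}

// Hypothetical titles, invented for this example.
const titles = [
  "Zika virus infection and microcephaly",
  "Structure of a dengue virus protein",
];

console.log(titles.filter((t) => matchesAny("zika virus", t)).length);    // 2
console.log(titles.filter((t) => matchesPhrase("zika virus", t)).length); // 1
```

Under OR semantics the dengue title matches on the word virus alone, which is the kind of off-topic hit that quoting the phrase removes.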
The “Search within clusters” box allows for searching within the search results, and can be used to identify specific articles within the cohort of Zika virus preprints that are not immediately apparent from topical clustering. Entering vaccine in the search box highlights the topics and subtopics containing articles bearing the word vaccine: under “Zika virus infection” is “Preventing Zika Virus Infection;” under Dengue Virus is “Antibodies, Vaccine” and “Community, Vector.” Selection of highlighted topics or subtopics reconfigures the results to limit to vaccine-related Zika virus preprints (Figure 5).
A bookmarklet is a special type of web browser widget containing an embedded software command that extends the application of the browser by adding a one-click function as a bookmark. We created a bioPreprint-bookmarklet using JavaScript in order to seamlessly integrate a search for any word or phrase from any web page with the information stored in preprint servers. After dragging/dropping the bioPreprint-bookmarklet into any web browser, the next step is to highlight a word or phrase of interest then click the bookmarklet. This will result in a pop-up window displaying preprint articles containing the text of interest (Figure 6).
All web browsers that support JavaScript (Google Chrome, Mozilla FireFox, Internet Explorer, Apple Safari, Opera) are compatible with the bookmarklet. In case the favorites/bookmark bar is not visible we provide instructions for displaying it on commonly used browsers. A video describing how to install the bookmarklet in a web browser is also available.
Imagine a researcher is searching PubMed for articles on “RNA-seq quantification” and comes across a paper recently published in Nature Biotechnology, “Near-optimal probabilistic RNA-seq quantification”36. This paper introduces a new software program, Kallisto, that analyzes RNA-seq data two orders of magnitude faster than previously used software. This is notable as it removes the computational bottleneck for RNA-seq data analysis. After reading about this new software, the researcher decides to check whether it has been widely adopted by perusing the published literature.
A search in PubMed with the search term “Kallisto” results in only the original article (searched on 2 May, 2016). This is well within expectations, considering the recent publication date of the article, 4 April 2016. There has not been enough time for researchers to know about the software, let alone write papers citing it.
To continue to try and gauge the usage of Kallisto in RNA-seq data analysis, the researcher might take an alternative approach: instead of searching PubMed, try searching for preprint articles. This can be achieved with a single click of the bioPreprint-bookmarklet once it is installed in the researcher’s web browser. Upon viewing the article abstract on the PubMed search results page, highlighting the word “Kallisto,” and clicking the bioPreprint-bookmarklet, a pop-up appears with the search.bioPreprint search results: sixteen preprint articles, two from arXiv, thirteen from bioRxiv, and one from F1000Research (searched on 2 May, 2016). Interestingly, the second article on the results page is the preprint version of the Nature Biotechnology paper on Kallisto software, submitted to the arXiv preprint server (Figure 6). The authors submitted their preprint on 11 May 2015, almost one year before its publication in Nature Biotechnology, with concomitant indexing by PubMed37.
It is worth noting that since the availability of the Kallisto paper as a preprint, fifteen preprint articles have cited the use of Kallisto software38–51 (searched on 2 May, 2016). These articles cover numerous topics, including development of new software, single cell RNA-seq analysis, and quantification of the relative abundance of transcripts in various experimental settings.
A student gathering information from the internet about the regulation of gene expression happens upon the GTEx Project Community Scientific Meeting website. GTEx stands for the Genotype-Tissue Expression project, which aims to develop an atlas of human gene expression and its regulation across various tissue types. Intrigued by the scope of this project, the student is curious to know how GTEx project data have been utilized in research.
The bioPreprint search engine and bookmarklet can quickly satisfy the student’s curiosity by providing easy access to GTEx-related articles hosted by various preprint servers that may or may not be published “in print” yet. The process is simple, and the student doesn’t even need to leave the current web page to go on a literature hunt. Rather, all GTEx-related articles will appear in a new window after two steps: highlighting the word GTEx, then clicking the previously installed bioPreprint-bookmarklet. The result is sixty-seven articles showcasing the use of GTEx data in a variety of research topics including “Genome Wide Association Studies,” “Allele, Specific expression,” “Expression Quantitative Trait Loci,” etc. (searched on 2 May 2016).
These use cases emphasize the power of the bioPreprint search engine and associated bookmarklet in delivering scientific research articles that are not only hard-to-find and yet-to-be traditionally published, but also on demand at the point of reading. And the “point of reading” can be anything on the web: journal articles, news items, blogs, PubMed/Google Scholar search results, etc.
Until the creation of search.bioPreprint there has been no simple and efficient way to identify biomedical research published in a preprint format, as preprints are not typically indexed and are primarily discoverable by directly searching the preprint server websites (articles that pass peer review in F1000Research are the exception). Google Scholar (GS), a popular scholarly literature search engine that provides cross-discipline search functionality, does not include preprint articles as a filter option. Hence, many avid GS users try a workaround by including the word “preprint” in the query (e.g., “asthma preprint” or “CRISPR preprint”), with the assumption of retrieving only preprint articles fetched from major preprint servers. In contrast, the GS search results in a mixed population of articles comprising both actual preprints and peer-reviewed published articles in which the term “preprint” appears somewhere in the full text of the article.
During the final stages of manuscript preparation an online database aiming to index preprint articles was launched, PrePubMed, which despite appearances is not an official resource from the National Library of Medicine (NLM), the National Center for Biotechnology Information (NCBI), or PubMed. We want to acknowledge this new resource, but emphasize that search.bioPreprint offers full text searching where available (currently, bioRxiv and F1000Research, and arXiv in the future) as well as topical and source-based clustering of results. In addition, our tool has been available since mid-February 2016, around the same time as the ASAPbio meeting, where it was mentioned during discussions.
The underlying technology upon which search.bioPreprint was built is flexible enough to integrate additional resources into the search engine. As new preprint servers are introduced, search.bioPreprint will incorporate them and continue to provide a simple solution for finding preprint articles. We welcome feedback that introduces new preprint resources and addresses usability concerns.
The bioPreprint-bookmarklet enables each and every word or phrase appearing on any website to be integrated with information in articles stored in preprint servers. The on-demand delivery of preprint articles at the point of reading enables researchers to discover brand new pre-published articles quickly and be updated with cutting edge, yet-to-be-reviewed information that is challenging to discover by traditional literature searching methods. Referees during the grant or journal article review process might also find this bookmarklet useful as it quickly retrieves pre-published articles via the cross-platform preprint search.
Our intention is that the combined use of the aforementioned tools helps to fulfill the unmet need of the scientific community for immediate dissemination of research outcomes, ultimately resulting in improved scientific communication and far-ranging insights and innovations.
While arXiv, bioRxiv, and PeerJ Preprints are considered to be preprint servers, F1000Research belongs to a separate class. It offers a unique publishing platform in which a transparent peer review process is integrated into the article publication practice and thus holds three categories of articles based on peer review status: (1) recently submitted and awaiting peer review, (2) passed peer review, and (3) not passed by peer reviewers. Only articles that pass the peer review process are indexed in literature databases such as PubMed. F1000Research permanently hosts all articles irrespective of peer review status. Therefore, it represents a blended system of preprint server and traditional online journal. Search.bioPreprint does not separate these three types of F1000Research articles and therefore returns both non-peer reviewed and reviewed articles together in the search results. Nevertheless, the peer review status is easily visible when searchers are directed to the F1000Research site from the search.bioPreprint search results. As F1000Research hosts many articles whose peer review status (before passing peer review) could be considered the equivalent of preprints, we decided to include this as a source of preprint articles. Users should note a key difference, however, as all articles in F1000Research are committed to formal peer review and should therefore not be submitted to any additional journals.
The quality of the search results generated by the bioPreprint search engine is confined by the search parameters of the individual preprint servers. If the preprint servers alter their search algorithms, a concomitant adjustment of the underlying code used by the bioPreprint search engine is often required. Unfortunately, the servers can make such changes without any public notification, so they are only discoverable upon a thorough analysis of bioPreprint search results. The University of Pittsburgh Health Sciences Library System has a quality check team involving two librarians to ensure the accuracy of search.bioPreprint results. The team routinely compares the search results produced by several preset query terms with the previous results and reports any discrepancies to the development team.
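A comparison of this kind could, in principle, be partly automated. The sketch below is a hypothetical illustration, not the procedure used by the quality check team: it assumes each run of a preset query is stored as a list of result identifiers, and the identifiers and function names are invented for the example.

```javascript
// Hypothetical regression check: compare the result identifiers returned by
// two runs of the same preset query. Not the team's actual procedure.
function diffResults(previousIds, currentIds) {
  const prev = new Set(previousIds);
  const curr = new Set(currentIds);
  return {
    missing: [...prev].filter((id) => !curr.has(id)), // returned before, absent now
    added: [...curr].filter((id) => !prev.has(id)),   // new since the last run
  };
}

// New preprints appear all the time, so only disappearing results are
// treated as a signal that a source may have changed its search behaviour.
function needsReview(diff, maxMissing = 0) {
  return diff.missing.length > maxMissing;
}

// Invented identifiers for illustration.
const previous = ["doc-1", "doc-2", "doc-3"];
const current = ["doc-2", "doc-3", "doc-4"];
const diff = diffResults(previous, current);
console.log(diff, needsReview(diff));
```

The `maxMissing` threshold is an assumption that would need tuning, since results can also drop out of a capped, relevance-ordered list for benign reasons.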
The average time taken to display search results is not always optimal. The speed of the search.bioPreprint results return stems from multiple factors: individual preprint servers’ searching speed, efficiency of the IBM Watson Explorer software, and computational power of the server hosting the bioPreprint search engine. While some contributing factors are outside of our control, efforts will be undertaken to speed up the search process by continually upgrading the power of the host server.
Currently, the search.bioPreprint default search results are ordered by relevance without any option to re-sort by date. The authors are aware of the pressing need for this added feature and if possible will incorporate it into the next version of the search tool.
search.bioPreprint is freely accessible at http://www.hsls.pitt.edu/resources/preprint. The preprint search engine was created using IBM Watson Explorer, formerly known as Vivisimo Velocity. IBM Watson Explorer is proprietary software; hence, its source code is not available.
The bioPreprint-bookmarklet is freely available at http://hsls.pitt.edu/biopreprint-infobooster.
The JavaScript code embedded in the bookmarklet is:
“javascript:(function(){(function(t,u,w){t=''+(window.getSelection%3Fwindow.getSelection():document.getSelection%3Fdocument.getSelection():document.selection%3Fdocument.selection.createRange().text:'');u=t%3F'http://search.hsls.pitt.edu/vivisimo/cgi-bin/query-meta%3Fv%253Aproject=preprint%26query=%2522'+encodeURIComponent(t)+'%2522':'';w=window.open(u,'_blank','height=750,width=700,scrollbars=1');w.focus %26%26 w.focus();if(!t){w.document.write('<html><head><title></title></head><body style="padding:1em;font-family:Helvetica,Arial"><br/><p>First%2C highlight a word or a group of words from any website that you are browsing (journal article%2C PubMed search result%2C news article%2C blog%2C etc.)%2C and then click on this bookmarklet to retrieve cutting edge%2C yet-to-be published or reviewed biomedical research articles related to your selected word(s).</a><p>Check the <a href=\"http://media.hsls.pitt.edu/media/BioPreprint_ac0316.mp4\">How to Video</a>for instruction.</p><br/><p><img src="http://www.hsls.pitt.edu/sites/all/themes/liberry_front/logo.png" alt="HSLS Logo"></p><script>var q=document.getElementById("q"),v=q.value;q.focus();q.value="";q.value=v;</script></body></html>'); w.document.close();}})()})();”
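For readability, the URL-building step of the bookmarklet can be restated as a standalone function. This is a sketch derived by decoding the percent-escapes in the code above once, as a browser would when running a javascript: URL; the function and constant names are ours and do not appear in the original.

```javascript
// Readable restatement of the bookmarklet's URL-building step.
// SEARCH_BASE and buildPreprintSearchUrl are illustrative names, not part
// of the published bookmarklet.
const SEARCH_BASE = "http://search.hsls.pitt.edu/vivisimo/cgi-bin/query-meta";

function buildPreprintSearchUrl(selectedText) {
  const text = "" + (selectedText || "");
  if (!text) {
    return ""; // nothing highlighted: the bookmarklet shows instructions instead
  }
  // %22 is a double quotation mark, so the selection is sent as an exact phrase.
  return (
    SEARCH_BASE +
    "?v%3Aproject=preprint&query=%22" +
    encodeURIComponent(text) +
    "%22"
  );
}

// In a browser the argument would come from window.getSelection().toString().
console.log(buildPreprintSearchUrl("zika virus"));
```

Wrapping the selection in %22 means the highlighted text is always submitted as a quoted phrase, consistent with the recommendation to use quotation marks for multi-word searches.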
AC conceived the concept and wrote the Implementation section. CI created the search.bioPreprint logo, assisted with concept refinement and design, and wrote all documentation, including preparation of the initial drafts of the manuscript. AC and CI created the figures. JL developed the search engine. AZ created the bookmarklet and webpage. All authors were involved in revision of the draft manuscript and have agreed to the final content.
The authors wish to gratefully acknowledge the following individuals for their help with various aspects of the creation of search.bioPreprint and manuscript preparation: Peter Coles for writing a blog on the insightful use of bookmarklets, Julia Dahm for the creation of the video describing how to install the bookmarklet, Melissa Ratajeski for providing helpful comments on the manuscript, Nancy Tannery for providing helpful comments on the manuscript and offering general support for this project, and Fran Yarger for offering general support for this project.
Competing Interests: No competing interests were disclosed.
Version 2 (revision): 20 Jul 16
Version 1: 16 Jun 16