Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that covers a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium, consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performance. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for downloading annotation data and for the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new, powerful techniques for title- and title/abstract-based search engines for relevant articles in biomedical research.

When drafting a manuscript such as this article, a survey of relevant literature and related articles from the same subfield is required. Traditionally, such a survey is conducted by combining searches based on specific keywords and references found in bibliographies (i.e. references citing, or cited in, the selected articles). This process requires multiple attempts with different keyword combinations and manual examination of numerous citations. A long list of both relevant and irrelevant articles is often encountered, requiring manual curation after further reading, yet not all of the important relevant articles are necessarily included after this long and tedious process. The task of finding truly relevant articles becomes increasingly time-consuming and challenging as the number of scientific articles published every year expands at a rapid pace. For example, the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) database, operated by the US National Center for Biotechnology Information at the US National Library of Medicine (NLM), has seen a sustained annual growth rate of nearly 5% for the past 4 years: from 25.2 million records in 2015 (57) to 26.4 million in 2016 (58) and 27.5 million in 2017 (59), now approaching 29 million records in 2018 as of the latest report (66). The overall success of this classic approach depends on finding optimal keyword combinations, which are key to ensuring both the precision and the completeness (i.e. sensitivity) of the search results.
The aforementioned scenario is an illustration of an 'informational' search goal (8,64), essentially an exploratory request for discovering the most relevant resources to 'find out' about a particular topic. For purely 'navigational' queries that seek out a single 'correct' result (8,64), the classical keyword-based approach is generally sufficient. However, for a more complete survey, the keyword input given by the user is susceptible to a number of limitations, including misspellings, ambiguity and mismatched scope. To address these challenges, traditional search engine operators implement various workarounds, including query spelling correction, query suggestions, query reformulation and query expansion. A recent NLM study describing 'under-the-hood' improvements to PubMed demonstrates the integration of such mitigation techniques (25). As this problem remains unsolved, alternative approaches may be employed.
One way to avoid keywords in searches is to use a seed article. A seed article can provide the title, the abstract and even its full text, containing significantly more information than a combination of a few pre-selected keywords. Aside from these content-based features, seed articles may also provide additional metadata including the list of authors, publication venue and date, citation information and potentially manually curated categorical information such as assigned Medical Subject Headings (MeSH) descriptors (52). Most importantly, all of these features are extracted and processed automatically without requiring any user input or intervention. Thus, for an article-based search, the search engine will accept as input either the selection of an existing document from its database or a user uploaded document in its entirety, in order to generate the list of results.
Article-based relevance search, often referred to as research paper recommendation, is an active area of research with more than 100 different approaches published since 1998 (7). Representative examples of methods commonly used for calculating the similarity between text documents are Okapi Best Matching 25 (BM25) (39,40), Term Frequency-Inverse Document Frequency (TF-IDF) (65) and PubMed Related Articles (PMRA) (49). These and other decade-old methods remain the backbone of real-world recommender systems such as PubMed (http://www.ncbi.nlm.nih.gov/pubmed), ResearchGate (http://www.researchgate.net) and CiteULike (http://www.citeulike.org). Numerous newly developed methods have not yet been translated into practice because it is unclear whether they improve over existing methods (7). Improvements in paper recommender systems can be measured by 'user studies', 'online evaluations' or 'offline evaluations'. User studies aim to quantify user experience by interacting directly with explicit testers of the system; online evaluations are a more passive approach wherein user activity is silently monitored and recorded for assessment; and offline evaluations are akin to traditional benchmarking processes, where algorithmic performance measures are evaluated on appropriate datasets. The problem with user studies and online evaluations is that considerable active participation is required to infer any meaningful conclusions (7). Accordingly, a large-scale user study would require costly recruitment and management of a large number of participants, whereas for online evaluations it is challenging to attract large numbers of active users to a deployed system, except for a few systems with large established user bases. Moreover, the data used for evaluating such systems are often not made publicly available.
In addition, online evaluations collect implicit measures of satisfaction, such as click-through rates; however, non-clicked items are not necessarily indicators of negative samples.
Offline evaluations, on the other hand, require a gold-standard benchmark for comparing and contrasting the performances of different systems. Previous work has focused on building gold standards of query-to-document relevance in biomedical literature. Queries are generally defined as a topic expressed as a natural language statement, a question or a set of keywords conveying a specific information need. Representative examples of existing gold-standard query-to-document resources include the OHSUMED (32) collection, the first TREC Genomics track (33,73) (the only applicable biomedical-focused track within TREC), the BioASQ (72) competition and the bioCADDIE (13) dataset. Although these datasets are suitable for benchmarking traditional information retrieval, they do not consider the similarity between document pairs as a whole. Re-purposing query-to-document datasets for the document-to-document similarity problem has been attempted; for example, adaptation of the 2004/2005 TREC Genomics data (34,35) has been evaluated by previous studies (11,49,74). However, a pair of documents may both be related to a certain topic but only loosely related to each other as complete units.
Constructing a gold-standard dataset of sufficient size for offline evaluations is not a trivial endeavor. In the context of query-to-document relevance, a recent study concluded that 'there is unfortunately no existing dataset that meets the need for a machine-learning-based retrieval system for PubMed, and it is not possible to manually curate a large-scale relevance dataset' (24). We propose here that relevance labels between document pairs in the document-to-document context can be 'crowd-sourced'. This approach has a number of successful applications; for example, computational resources are crowd-sourced by the BOINC (1) distributed computing system for the Folding@Home (47) and Rosetta@Home (28) projects, and human problem-solving resources are crowd-sourced for 'distributed thinking systems' as in the Foldit (14,28) project.
Here, we have established the RElevant LIterature SearcH (RELISH) consortium of 1500+ scientists from 84 countries around the world. The consortium annotated over 180,000 document-document pairs indexed by PubMed, nearly 400 times more than the only other human-annotated data collection we could find, which comprised just 460 annotations by 90 authors (50). Analysis of the collected data indicates the diversity and consistency of annotations across different levels of research experience, as well as its usefulness in benchmarking different document-document comparison methods. Furthermore, these data could be utilized in future work to train deep machine learning models that may substantially increase the relevancy of recommendations compared to existing methods.

Methods
The overall procedure for establishing the RELISH database was as follows. First, we established the article-based PubMed search engine (APSE; https://pubmed.ict.griffith.edu.au) for recommending articles ('candidate articles') potentially relevant to an input article ('seed article') given by, or assigned to, a user. The APSE allowed users to assess and annotate recommended candidate articles regarding their degree of relevance to a respective seed article and facilitated submission of these annotations. Then, we established the RELISH consortium of scientific authors ('participants') and invited them to the APSE to evaluate 60 candidate articles for one or more seed articles that they were interested in or, preferably, had authored. Finally, annotations submitted by participants were compiled and organized into the RELISH database. A participant's 'contribution' was defined as the submission of annotations for all 60 candidate articles with respect to a seed article.
The remainder of this section explains database construction, including the APSE's corpus data and candidate article recommendation methods, participant recruitment and annotation procedures and performance evaluation metrics.

APSE: document corpus
The seed article corpus used by the APSE server was extracted from the biomedical literature database hosted by PubMed. The corpus database was constructed using the 2017 baseline downloaded from PubMed's public FTP server (ftp://ftp.ncbi.nlm.nih.gov/pubmed), followed by incremental synchronization with daily update releases. Only articles with available title and abstract metadata were included in the corpus; no other selection criteria were imposed. As of 3 July 2018, this collection contained 18,345,070 articles, from which participants selected seed articles.
The candidate article corpus, from which the APSE server generated the recommendations presented to participants, was a subset of the seed article corpus. This subset included only recent articles published within the past decade. The primary justification for this was to increase participant motivation through the potential discovery of recent work within their field, given the previously speculated decline in willingness to participate as publication age increases (77). As of 3 July 2018, this candidate article subset contained 8,730,584 articles.
All raw article texts were pre-processed into usable indexing elements (tokens) by the following pipeline. We first applied Lucene's 'UAX29URLEmailTokenizer', which follows Unicode Text Segmentation rules (16) to divide the text (while also tagging detected URLs or email addresses). Next, possessives (trailing apostrophe + s) were removed from each token, and any single character, numeric, URL or email tokens were removed. Stop-word tokens were filtered according to a list of 132 official MEDLINE stop words retrieved from a previous source (6). Finally, tokens were stemmed to their root form using Lucene's 'PorterStemFilter'.
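The pipeline above can be sketched in Python. This is an illustrative approximation rather than the Lucene implementation: the regular-expression tokenizer, the abbreviated stop-word set and the `stem` function are simplified stand-ins for 'UAX29URLEmailTokenizer', the 132 MEDLINE stop words and 'PorterStemFilter', and URL/email tagging is omitted.

```python
import re

# Abbreviated stand-in for the 132 official MEDLINE stop words
STOP_WORDS = {"the", "of", "and", "in", "to", "for", "with", "on", "is"}

def stem(token):
    # Crude stand-in for Lucene's 'PorterStemFilter': strips a few common suffixes.
    for suffix in ("ing", "ly", "ies", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 1. Tokenize on word characters (stand-in for 'UAX29URLEmailTokenizer';
    #    URL/email tagging is omitted in this sketch).
    tokens = re.findall(r"[^\W_]+(?:'s)?", text.lower())
    out = []
    for tok in tokens:
        if tok.endswith("'s"):             # 2. remove possessives (apostrophe + s)
            tok = tok[:-2]
        if len(tok) < 2 or tok.isdigit():  # 3. drop single-character and numeric tokens
            continue
        if tok in STOP_WORDS:              # 4. stop-word filtering
            continue
        out.append(stem(tok))              # 5. stemming to a root form
    return out
```

For example, `preprocess("The authors' findings in 2018 were striking")` drops the stop words and the numeric token and stems the remainder.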

APSE: recommendation system
The APSE recommendation system is built on three baseline methods: PMRA (49), BM25 (39,40) and TF-IDF (65). PMRA is the technique used by the 'Similar Articles' function on the official PubMed site; here we used the Entrez E-utilities (43) ELink to retrieve related articles for given seed articles. BM25 is a representative probabilistic relevance technique, implemented here by Lucene's 'BM25Similarity' class using the default parameters of k1 = 1.2 and b = 0.75. TF-IDF is a representative vector space model technique, implemented here by Lucene's 'ClassicSimilarity' class.
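To give a flavour of such scoring functions, the following is a minimal, self-contained BM25 sketch with the same default parameters (k1 = 1.2, b = 0.75). It is our own illustration, not the Lucene implementation, which differs in engineering details such as norm encoding; the function name and data layout are ours.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_tokens` (Okapi BM25)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs:
        tf = Counter(d)                            # term frequencies in this document
        s = 0.0
        for t in query_tokens:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term saturation (k1) and length normalization (b)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

In a seed-article setting, `query_tokens` would be the preprocessed tokens of the seed article's title and abstract, and `docs` the tokenized candidate corpus.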
For the individual method evaluations presented in this study, the original method-specific lists of results were used (in this case, the same candidate article may be shared by multiple methods). Each method's list of candidate articles was sorted by descending score. To generate the list of recommendations presented to the user for annotation, we combined the individual method-specific result lists into a unified list. We set a maximum of 60 articles per seed article so as to include at least the top 20 highest-scoring non-redundant candidate articles per method.
The unified list assembly procedure is as follows: in a round-robin fashion between all methods, consider the current top-scoring article from a method. If the article is not already in the unified list, add it. Otherwise, disregard this article and continue with the next best-scoring article until an addition to the unified list has been made. Essentially, this procedure is performing a three-way merge while ensuring an equal selection of non-redundant recommended articles from each method. This unified list of recommendations was not presented in the order of predicted similarity. Instead, before presentation to users, it was shuffled randomly to reduce the possibility of a systematic bias in annotation.
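The merge described above can be sketched as follows (a minimal illustration of our own; the function name and structure are not taken from the APSE code):

```python
def merge_round_robin(ranked_lists, per_method=20):
    """Round-robin merge of per-method ranked result lists into one
    de-duplicated unified list, taking `per_method` unique picks per method."""
    unified = []
    seen = set()
    iters = [iter(lst) for lst in ranked_lists]   # each list is sorted best-first
    taken = [0] * len(ranked_lists)
    done = False
    while not done:
        done = True
        for i, it in enumerate(iters):
            if taken[i] >= per_method:
                continue
            # Advance this method past duplicates until it contributes one article.
            for article in it:
                if article not in seen:
                    seen.add(article)
                    unified.append(article)
                    taken[i] += 1
                    done = False
                    break
    return unified
```

With three methods and `per_method=20` this yields at most 60 unique articles; as described above, the result would then be shuffled (e.g. with `random.shuffle`) before presentation to reduce annotation bias.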

RELISH: participant recruitment
To achieve the goal of a large dataset facilitating future deep machine learning-based method development, we established the RELISH consortium of biomedical researchers who have annotated studies potentially related to one or more seed articles that they had authored or were interested in. To maximize the quality and efficiency of annotation, we encouraged participants to use articles that they had authored as the seed articles. All participants contributed voluntarily. Only submissions with annotations completed for all 60 recommended candidate articles per seed article were accepted.
Our recruitment strategy involved two major phases: the first comprised internal referrals, personal invitations, social media posts and a correspondence letter (9); the second involved directly contacting authors of papers published in 2018. For this second phase, we built a contact list of the corresponding authors with 2018 papers indexed in PubMed Central (63) and created a personalized email including the top three candidate articles (the highest-scoring non-redundant document from each method) relating to their article, along with an invitation to join our consortium.
A total of 129,190 unique authors were contacted regarding 72,764 unique PubMed articles. Of these articles, 54,071 (74.3%) had contact information for a single author only (assumed to be the corresponding author), 7452 (10.2%) had contact information for two authors (assumed to be the first and corresponding authors), while the remainder had contact information for more than two authors. In any case, all available contacts per article were invited, so as to encourage multiple annotators of the same article for consistency analysis.
Finally, of the 1570 total unique participants, 342 (21.8%) joined during the initial recruitment phase. During the second recruitment phase, 1228 (78.2%) authors accepted our invitation, yielding an overall response rate of 0.95%; thus, the majority of contributions resulted from this phase.

RELISH: annotation procedure
An overview of the annotation procedure is shown in Figure 1. Each participant was asked to annotate the degree of relevance ('relevant', 'somewhat-relevant' or 'irrelevant') between a seed article and recommended candidate articles. We gave the following definitions for the degrees of relevance as a guide for label assignment:
1. Relevant: an article topically relevant to the seed article, within the same specific sub-field of research, i.e. an article that would be interesting to read further or could have been cited within the original work.
2. Somewhat-relevant: an article missing some key topical details of the seed article, within the broader area of research but not specifically fitting into the sub-field, i.e. unlikely to be considered for a citation within the original work.
3. Irrelevant: an article unrelated to the seed article that obviously does not fit inside the specific sub-field of research.
Although definitions of 'relevance' are subjective, we aimed to make the choice as simple as possible by using this three-point system of document pair 'closeness', as opposed to a higher number of relevance degrees, such as 5-point or 10-point scales (44). Additionally, this three-point scale readily collapses to a binary classification (a two-point 'positive' or 'negative' scale): the 'relevant' and 'irrelevant' classes intuitively map to the 'positive' and 'negative' classes, respectively, leaving only the 'somewhat-relevant' class to be mapped to either the 'positive' or 'negative' class depending on the required evaluation strictness. Before submitting annotations, contributors were asked to provide their level of experience from the following options: PhD student; years of experience after PhD studies (less than 5, between 5 and 10, or more than 10); and other (unspecified, potentially comparable to PhD experience with degrees such as MD or PharmD).
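The label collapse can be expressed as a small helper (a sketch of our own; the function name and the `strict` flag are illustrative and not taken from the RELISH code base):

```python
def to_binary(label, strict=True):
    """Collapse the three-point RELISH scale to a positive (1) / negative (0) class.

    strict=True maps 'somewhat-relevant' to negative, as in the paper's main
    evaluation; strict=False maps it to positive for a more lenient evaluation.
    """
    if label == "relevant":
        return 1
    if label == "irrelevant":
        return 0
    if label == "somewhat-relevant":
        return 0 if strict else 1
    raise ValueError(f"unknown label: {label!r}")
```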

Evaluation metrics
The expert-annotated document pairs provided the opportunity to evaluate the performance of the three baseline methods for the first time. To simplify the evaluation, we collapsed our three-state label classes to fit a two-state classification in which, unless stated otherwise, 'relevant' seed-candidate article pairs were assigned to the 'positive' class, while 'somewhat-relevant' or 'irrelevant' article pairs were assigned to the 'negative' class. Using this stratification, we assessed the methods' performance using both binary classification and information-retrieval metrics.
For binary classification metrics, we used the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic (ROC) curve (AUC). The metrics used for information retrieval are the precision of the top N ranked results (P@N) and the mean reciprocal rank (MRR). MCC and AUC emphasize good discrimination between 'positive' and 'negative' classes, whereas the P@N and MRR metrics emphasize highly ranked 'positive' class results.
MCC (54) is the metric of choice in the machine learning community for summarizing classification performance; although it is not restricted to binary classification, in the binary case it coincides with the square root of the chi-squared statistic (42). Using a variable similarity score threshold parameter, contingency tables measuring the separation between positive and negative classes are generated, from which the MCC is determined. Here, we have reported the maximum value. A balanced measure is given even when the distribution of 'positive' and 'negative' class instances is unbalanced (5), where perfect classifications are given an MCC value of 1 and random classifications an MCC value of 0.
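The maximum-MCC computation over a threshold sweep can be sketched as follows (a minimal implementation of our own, not the code used in the study; thresholds are taken from the observed scores):

```python
import math

def mcc(tp, fp, tn, fn):
    # Matthews correlation coefficient from a 2x2 contingency table.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def max_mcc(scores, labels):
    """Sweep every observed score as a decision threshold and report the best
    MCC, mirroring the paper's use of the maximum over thresholds."""
    best = 0.0
    for thr in scores:
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        best = max(best, mcc(tp, fp, tn, fn))
    return best
```

A method whose scores perfectly separate the positive from the negative pairs attains a maximum MCC of 1.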
AUC (29) is determined from the ROC, a plot of sensitivity (true positive rate) as a function of 1 − specificity (false positive rate) at varying threshold parameter values. The area under this ROC curve estimates the probability that a classifier would rank a randomly chosen positive instance higher than a randomly chosen negative instance (23), where perfect classifications are given an AUC value of 1 and random classifications an AUC value of 0.5.
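The probabilistic interpretation above lends itself to a direct rank-based computation (a sketch of our own, counting ties as half; equivalent to the area under the empirical ROC curve):

```python
def auc(scores, labels):
    """Probability that a randomly chosen positive instance outscores a
    randomly chosen negative one (ties count half) = empirical ROC AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```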
P@N (15) is the proportion of results assigned to the 'positive' class within a fixed number N of top ranked results. Here, we have reported N values of 5, 10 and 20; 5 and 10 were chosen as measures of the 'first-page' results initially seen by users, and 20 was used as an indication of the total 'positive' class proportion per method (as there were only 20 results considered per seed article for the individual method evaluation). However, P@N is limited in that it weights each result within the top N equally.
MRR (62) measures the average closeness of the first 'positive' class result to the top of its result list over a sample of queries Q. Here, not all positions are weighted equally; highly ranked 'positive' class results are rewarded more than lower-ranked ones. It is determined by averaging the reciprocal rank (RR), i.e. 1/rank_i, where rank_i is the position of the first encountered 'positive' class result in a list of putative matches, for each query in Q. An MRR value of 1 indicates that for each query, the highest ranked result belonged to the 'positive' class.
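Both retrieval metrics can be sketched in a few lines over ranked binary labels (our own minimal implementations; a query with no positive result contributes a reciprocal rank of 0 here, one common convention):

```python
def precision_at_n(ranked_labels, n):
    """Fraction of 'positive' (1) labels among the top-n ranked results."""
    top = ranked_labels[:n]
    return sum(top) / len(top)

def mean_reciprocal_rank(ranked_label_lists):
    """Average 1/rank of the first positive result per query; queries with
    no positive result contribute 0 under this convention."""
    total = 0.0
    for labels in ranked_label_lists:
        for rank, y in enumerate(labels, start=1):
            if y == 1:
                total += 1.0 / rank
                break
    return total / len(ranked_label_lists)
```

For example, two queries whose first positive results sit at ranks 2 and 1 give an MRR of (1/2 + 1)/2 = 0.75.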

Results
Here we introduce RELISH-DB, the database that resulted from the annotation work of the RELISH consortium. All annotations submitted by participants before 27 July 2018 were consolidated into the initial revision of RELISH-DB.
The remainder of this section presents statistics of database content and consortium participation, and the results for annotation consistency analyses and method performance evaluations.
Annotations were contributed by a diverse group of highly experienced original authors

Table 1 shows simple statistics of RELISH-DB content according to participants' research experience. Annotations were received from a total of 1643 unique scientists around the world, of whom 1570 are registered participants (affiliated consortium members) and 73 contributed anonymously. Figure 2 shows that the contributors to this project originated from diverse geographic locations, including 84 unique countries, with the largest clusters located in Europe (586), North America (356), China (265) and Oceania (161). Altogether, the participants annotated 3017 seed articles, or 181,020 (3017 × 60) labelled document pairs. The average number of submissions per participant was 1.9 (all anonymous submissions were counted as individuals). The majority of participants (77%) submitted annotations for just one seed article, 11% for two, 3% for three and 9% for four or more. A few dedicated participants annotated more than 50 seed articles. The majority of contributions (63%) were made by experienced researchers with 5 or more years' experience post-PhD. More significantly, the majority of seed articles (91%) were evaluated by one of the original authors, according to name matches. These statistics suggest high-quality annotations within RELISH-DB. Figure 3 shows a word-cloud distribution of MeSH descriptor frequencies for all annotated documents, normalized by baseline frequencies in all PubMed documents. There seems to be a slight over-representation of publications on high-throughput 'Omics' technologies (e.g. genomics, genome-wide association studies, nucleotide sequencing, proteomics). It is possible that researchers in the field of bioinformatics/computational biology were particularly eager to participate in the document annotation study, as they are probably the ones who see the most benefit in having a large annotated dataset of high quality.
It may also be because the first participant recruitment phase, conducted via internal referrals and personal invitations, was somewhat enriched with scientists working in the bioinformatics/computational biology areas. On the other hand, it could simply reflect the general trend in biomedical research towards greater use of 'Omics' technologies. Nevertheless, the diversity of annotated content is illustrated by the 76% coverage of all unique MeSH descriptors in the official PubMed library collection by all documents within RELISH-DB (every seed article plus every candidate article per seed article). Such diversity in research fields is important for benchmarking literature-searching tools applicable to all biomedical research fields.

Partial relevance was the most popular annotation
The distribution of the three possible relevance labels (relevant, somewhat-relevant and irrelevant) across all contributions per seed article is shown in Figure 4. The chart on the left indicates the frequency of seed articles with n candidate articles assigned each relevance label. It should be noted here that the frequency refers to the proportion of all 3017 seed articles, so the sum of frequencies in each column is not 1. For example, considering the column at zero number of articles, it should be interpreted that, for relevant, 2.7% or 84 seed articles have no candidate articles marked as relevant. Summing 2.7%, 0.5% and 9.7% for the respective relevance labels means that 12.9%, or 392 seed articles in total, have no candidate articles in one of the relevance labels. The peaks for relevant, somewhat-relevant and irrelevant annotations were at 8 (106 articles, 3.5%), 17 (113 articles, 3.7%) and 0 (291 articles, 9.7%), respectively. These distribution peaks suggest that many articles fell into the grey area of partial relevance. Additionally, the preference for partial relevance can be observed in the box plot on the right side of Figure 4, where the median number of somewhat-relevant candidate articles per seed article is 20, compared to 17 for relevant and 16 for irrelevant. Furthermore, when considering the average and standard deviation of candidate articles per seed article in each relevance label, somewhat-relevant had the highest average and the lowest deviation of 21 ± 11, compared to 19 ± 14 for relevant and 19 ± 15 for irrelevant. The overall dominance of relevant and somewhat-relevant annotations suggests the reasonable performance of the three baseline methods in filtering out obviously irrelevant articles.

Consistency according to method performance
Our definitions of relevance, partial relevance and irrelevance are subject to different interpretations by different individuals. The perceived efficacy of document recommendation systems depends on the expected agreement between the subjective opinions of different individuals. To examine if such agreement exists, we compared the performance of the three baseline methods on annotations made by scientists of different experience levels, annotations of the same articles by different individuals, annotations across different research fields and durations of time spent annotating. To carry out these comparisons, seed articles for the benchmark datasets were selected from RELISH-DB according to the following conditions. First, we removed 44 seed articles for which PMRA provided fewer than 20 article recommendations; this was to ensure a fair comparison with the other baseline methods, which always have 20 recommendations per seed article. Next, 186 seed articles without any positive or negative samples were removed, as assessment of discrimination is impossible when all candidate articles belong to a single class. Finally, we set aside 154 duplicate annotations of the same seed article by different participants ('D154') and 400 randomly selected seed articles as an independent test set ('T400'), to be analyzed separately.
This led to a total of 2233 seed articles ('ALL2233') available for evaluation. From this set we derived an 'NR' (non-redundant) set, comprising a single seed article from each unique participant to avoid potential bias towards specific annotators. In this dataset, for participants with multiple seed article contributions, a seed article was selected at random. This 'NR' set had 1220 seed articles ('NR1220') in total, fewer than the total number of unique participants due to the aforementioned exclusions.

Table 2. Overall evaluation set performance results for the PMRA, BM25 and TF-IDF methods. 'ALL' includes all seed articles from all participants, and 'NR' includes only one seed article from each participant, corresponding to the 'ALL2233' and 'NR1220' sets, respectively. Performance was measured using MCC, AUC (ROC), MRR and P@N, given as (mean ± stdev). The bold values indicate the maximal metric value among the three baseline methods.
Here, we have collectively defined candidate articles marked as 'somewhat-relevant' or 'irrelevant' as negative samples, with those marked as 'relevant' defined as positive samples. This was to challenge the three baseline methods and allow the measurement of their performance as binary classification. For all statistical testing, we used the Wilcoxon signed-rank test (75) implemented in the SciPy (38) library. Table 2 compares the performance of the three baseline methods within the 'ALL2233' and 'NR1220' sets. The table shows that the three methods all had quite similar performance, with TF-IDF having a slight edge in all metrics except P@20. Moreover, the method performance trend was shared between the 'ALL2233' and 'NR1220' sets, suggesting that participants with multiple seed article annotations have not detectably biased the overall result. This is likely because the majority of all unique participants submitted annotations for a single seed article only. Another interesting observation is that PMRA produced the most candidate articles marked as relevant (indicated by average P@20) but seemed to have difficulty ranking these extra positive samples towards the top of the result list.
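The paired comparison can be reproduced with SciPy's `scipy.stats.wilcoxon`; the per-seed-article AUC values below are invented for illustration only and are not data from the study.

```python
from scipy.stats import wilcoxon

# Hypothetical per-seed-article AUC values for two methods on the same
# eight seed articles (illustrative numbers only).
auc_tfidf = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79, 0.85, 0.72]
auc_pmra  = [0.78, 0.74, 0.86, 0.70, 0.83, 0.77, 0.80, 0.69]

# Paired, non-parametric test of whether the per-article differences
# between the two methods are centred on zero.
stat, p_value = wilcoxon(auc_tfidf, auc_pmra)
```

The test is appropriate here because the same seed articles are scored by every method, making the per-article metric values naturally paired.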

Consistency between all and non-redundant sets
Statistical P-values from Wilcoxon signed-rank tests are given in Table 3. In both the 'ALL2233' and 'NR1220' sets, TF-IDF scored significantly higher than PMRA, with P-values less than 0.05 for MCC, AUC, MRR and P@5. For the 'ALL2233' set, BM25 was also significantly higher than PMRA in terms of MCC and AUC; however, in the 'NR1220' set only the MCC difference was significant. In all cases there was no significant difference in either MCC or AUC between BM25 and TF-IDF. However, TF-IDF did significantly outperform BM25 in terms of P@5. These results suggest that, of the three baseline methods, TF-IDF was the most effective at ranking relevant candidate articles highly.

Consistency between author and non-author annotators
As previously mentioned, according to author name matches, the majority (91%) of our annotated seed articles were provided by one of the respective authors. Here we investigate whether method performance is affected when considering annotations given by non-authors. These non-author annotators may have more general knowledge and may not be biased by the close-up view of a given field that an author could have, potentially providing a more objective assessment.
A total of 181 seed articles were identified as having been annotated by a non-author (anonymous annotations were not included), submitted by 73 unique annotators. These articles were split into respective 'ALL' and 'NR' subsets accordingly. Table 4 shows method performances, which we compare to the results presented in Table 2. A different trend in method preference can be observed within the 'ALL' set; however, measured performance and the deviation of performance between methods are comparable. Within the 'NR' set, both method performance and preference are almost identical, although according to the precision measures the methods seem to have fared slightly better. Overall, these results suggest that there is no significant difference between annotations contributed by annotators who authored the seed article and those contributed by external annotators who did not.
Consistency among different experience levels
Table 5 shows method performance on the 'NR1220' set broken down by annotator experience level (bold values indicate the maximal metric value among the three baseline methods). Similarly to the overall result discussed in the previous section, no significant difference was observed between the AUC values of the three methods among the different levels of experience. This is echoed by the P-values of the distribution comparisons using Wilcoxon signed-rank tests presented in Table 6; unlike the overall result, there are few statistically significant differences in performance between methods within the experience groups.
Finally, although the overall average was nearly identical among annotators with different experience levels, fluctuations around the average (standard deviations of around 0.2) were quite substantial. This large fluctuation, however, may not have been caused by individuals' subjective opinions; it is explored further in the next section through the assessment of single participants who annotated many seed articles.

Consistency within individual annotators
As some dedicated contributors annotated a large number of seed articles, we were able to assess method performance within these groups. Table 7 presents results for the three largest groups, 'A1', 'A2' and 'A3', which had 92, 55 and 49 seed articles, respectively. These groups exhibited a similar level of fluctuation to that observed in both the overall results and the results divided by experience. Thus, we hypothesize that the observed fluctuation could result either from the small number of candidate article samples (20 per seed article per method) used for calculating the various measures, or from true differences in the ability to rank candidate articles, i.e. the three methods rank candidate articles better for some seed articles than for others.
Additionally, as overall method performance was again quite close, some noise appeared to exist regarding the preferred method within these individual annotator groups: it could be argued that 'A1' prefers BM25, 'A2' prefers TF-IDF and 'A3' prefers PMRA. Furthermore, the precision-based metrics show that 'A1' was a more 'difficult' group than 'A2' and 'A3' with regard to the number of 'positive' annotations per seed article. For example, a P@20 of around 0.1 means that, on average, each method was trying to rank 2 relevant articles above 18 irrelevant articles that must also have had substantial content-based similarity of some form, allowing their initial retrieval by the baseline methods. In contrast, 'A2' and 'A3' were relatively easier, especially considering MRR values close to 1, meaning that on average the top result was actually relevant.
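To make the MRR interpretation above concrete, here is a minimal sketch of mean reciprocal rank over binary relevance lists (the lists are invented for illustration):

```python
def mean_reciprocal_rank(result_lists):
    # reciprocal rank = 1 / (1-based rank of the first relevant item),
    # or 0 if the list contains no relevant item; MRR averages over queries
    total = 0.0
    for labels in result_lists:
        rr = 0.0
        for rank, relevant in enumerate(labels, start=1):
            if relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(result_lists)

# First query: top hit at rank 1; second query: first hit at rank 3
print(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1]]))  # (1 + 1/3) / 2 ≈ 0.667
```

An MRR close to 1, as seen for 'A2' and 'A3', means the first relevant article was almost always ranked at the very top.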

Consistency among different research fields
Here, we assessed performance consistency across different research topics, defined by the MeSH descriptors associated with each seed article (if any). Articles sharing a MeSH descriptor were clustered accordingly, and the eight largest clusters were evaluated. Figure 6 shows method AUC performance within these clusters (lines and error bars of standard deviation corresponding to the right y-axis) and the number of seed articles within each cluster.

Consistency of annotation for articles labelled by multiple annotators
A more direct assessment of consistency is to examine the same document pairs annotated by different contributors, using the 'D154' set. These contributions covered a total of 74 unique articles. However, 32 articles were removed due to duplicate annotations submitted by a single participant (i.e. the same participant contributed annotations more than once for the same seed article). Two further articles, annotated by three participants each, were also excluded. In total, 40 unique articles remained in this consistency set (80 of 154 contributions; each article was annotated by exactly two unique participants). We measured agreement by first collapsing the ternary class annotations into their binary class equivalents. Next, the Jaccard index (48) was calculated, defined as the intersection size of the same positive annotations plus the intersection size of the same negative annotations, divided by the union size of all annotations. The Jaccard index distribution can be seen in Figure 7; the overall trend indicates strong consistency, with only 3 of the 40 cases below 50% agreement. The average annotation consistency was 75%. In one of the low-agreement cases, we contacted the original contributors and found that one annotator defined relevance by 'AND' (satisfaction of all topic keywords), whereas the other used 'OR' (as long as one essential topic keyword was contained). Nevertheless, high consistency was observed among the majority of dual-annotated document pairs.
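Since both annotators of a seed article label the same candidate set, the agreement measure defined above reduces to the fraction of matching binary labels over the union of annotated candidates. A minimal sketch, with invented PMIDs and labels:

```python
def collapse(label):
    # ternary -> binary: 'relevant' is positive; 'partial' and 'irrelevant' negative
    return 1 if label == "relevant" else 0

def annotation_agreement(ann_a, ann_b):
    # ann_a, ann_b: dicts mapping candidate PMID -> ternary label string.
    # Agreements (same binary label) are counted over shared candidates and
    # divided by the union of all annotated candidates, per the definition above.
    all_pmids = set(ann_a) | set(ann_b)
    shared = set(ann_a) & set(ann_b)
    same = sum(1 for p in shared if collapse(ann_a[p]) == collapse(ann_b[p]))
    return same / len(all_pmids)

a = {"111": "relevant", "222": "partial", "333": "irrelevant", "444": "relevant"}
b = {"111": "relevant", "222": "irrelevant", "333": "relevant", "444": "partial"}
print(annotation_agreement(a, b))  # 0.5: agreement on '111' and '222' only
```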

Consistency regarding duration of time spent for annotations
The duration of time a participant spent annotating seed articles was an additional aspect we tested for consistency. This duration was estimated from the time of loading the first page of results to the time of submission. As this time was estimated and likely contains a small degree of error due to noise in the logs, we restricted the duration of time spent to between 2 and 120 min. Figure 8 shows method performance as a function of time spent on annotations for the 'NR1220' set. Bin sizes (bars corresponding to the left y-axis) and method performances (lines and error bars of standard deviation, with square, circle and diamond markers for PMRA, BM25 and TF-IDF, respectively, corresponding to the right y-axis) are indicated. As the majority of annotations (75%) were submitted within 17 min, bin intervals had to be varied across the x-axis to equalize bin sizes as much as possible (they are not uniformly sized). No obvious association was observed between time spent on annotations and the resultant method performance, indicating no detectable systematic bias underlying time spent on annotations.

Consistency among thresholds for binary classification
Distinguishing positives from negatives requires a threshold for the similarity score calculated by PMRA, BM25 or TF-IDF. An ideal method should have the same threshold for different seed articles regardless of research field and abstract/title length. This would be beneficial to real-world recommender systems by enabling a dynamic number of articles deemed as relevant to be presented to the user rather than the traditional approach of delivering a fixed number of recommendations.
We analyzed the distribution of method score thresholds in Figure 9. Raw method scores were first normalized using the min-max method and then divided into 50 equally sized bins. All of the methods' threshold distributions were relatively centralized around their respective peaks; however, PMRA clearly compared well to the other methods and had the most stable cut-off values, with a standard deviation of 0.06, compared to 0.08 and 0.09 for BM25 and TF-IDF, respectively. This finding suggests the possibility of setting a unified relevance cut-off.

Each method also contributed its own unique share of relevant recommendations (avg. 8%). Meanwhile, the overlap among all methods was low, with the highest overlap, between BM25 and TF-IDF, at 32% on average. Thus, clearly none of these baseline methods could provide complete coverage of all relevant articles. Relatively few recommendations were shared by the three methods, indicating that it is likely possible to establish a well-performing hybrid method that aggregates the outputs of multiple methods.
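The normalization and binning step described above can be sketched as follows (a stdlib-only illustration; the scores are invented):

```python
def min_max(scores):
    # Scale raw method scores into [0, 1] via min-max normalization
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def histogram(values, n_bins=50):
    # Count normalized values falling into n_bins equally sized bins over [0, 1];
    # a value of exactly 1.0 is placed in the last bin
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts

norm = min_max([3.0, 7.5, 12.0])
print(norm)  # [0.0, 0.5, 1.0]
print(sum(histogram(norm)))  # 3
```

The standard deviation of the per-seed-article optimal thresholds, computed over such normalized scores, is what quantifies cut-off stability here.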

Data Records
A complete dump of the collected annotation data, as of 27 July 2018, was deposited in a figshare repository. The dataset is released without copyright under the CC0 license and is available at https://figshare.com/projects/RELISH-DB/60095. All data records were stripped of personally identifiable information and then converted to JSON format (41). Record fields include a unique identifier ('uid'), the PubMedID of the seed article ('pmi'), the annotator experience level ('experience'), whether the annotator was an anonymous or registered user ('is_anonymous') and the annotator response ('response'), containing lists of candidate article PubMedIDs corresponding to the assigned degree of relevance (i.e. one of 'relevant', 'partial' (somewhat-relevant) or 'irrelevant').

Usage Notes
To make RELISH collections useful for future method developers, we have established a data server at https://relishdb.ict.griffith.edu.au. This data server has three modules: data annotation, data retrieval and method testing.
Data annotation: data annotation functions exactly as in the APSE server. As shown in Figure 1, users can search for publications related to any article of interest, with the option to voluntarily annotate article relevance. New submissions are automatically saved for inclusion in the next database version.

Data retrieval: we have made several pre-built datasets (the evaluation sets within this work) available on our data server. These include complete versions of the 'ALL2233' and 'NR1220' sets, as well as copies of these sets broken down by annotator experience level. The three single-annotator sets ('A1', 'A2' and 'A3') are also available. Furthermore, we provide an option allowing users to generate a dataset according to various pre-defined parameters (dataset size, experience level, single or multiple seed-article annotators, and cut-offs for the number of candidate articles labelled as relevant or irrelevant per seed article). As shown in Figure 11, pre-compiled datasets can be inspected by starting at (1); alternatively, user-defined datasets may be generated by starting at (2). After generating a new dataset, or clicking on an existing one, the dataset view page is displayed, as shown in stage (3). Here, dataset details are given, including size, the number of positive and negative pairs, and any custom parameters that were set. For each dataset, we make available the respective subsets of article metadata, annotation data and an example result file in the format the automatic evaluation accepts, as shown in stage (4). The metadata include the PubMedID, title and abstract of each article in the dataset. Because we are using publicly licensed PubMed metadata, we allow direct download of these metadata in JSON format (41). Relevance data are provided in TREC relevance format (56).
Method developers can divide downloaded datasets into training and independent test sets according to their own specific needs.

Method testing: to facilitate comparison and avoid overtraining, we set aside a blind-test set for critical assessment of method performance. This selection of 400 randomly selected NR seed articles has its annotation data withheld from the rest of the database. The dataset has two variants: in 'BT1', partially relevant articles are considered 'positive' samples, whereas in 'BT2' they are considered 'negative' samples. Baseline method performances for this set are shown in Table 8. Overall, the results are consistent with those observed for the complete dataset. We envision that method developers will download pre-built or user-defined datasets for training and independent tests. Subsequently, they can download 'BT1' or 'BT2' for additional tests, submitting their result file in TREC result format (61) as shown in stage (5) of Figure 11. Evaluations are executed automatically, and a list of recorded evaluations for the dataset is shown in stage (6). After uploading a result file for a new evaluation, or clicking on an existing evaluation in the list, the evaluation results are displayed: first, a summary of metrics over the dataset as a whole is given in stage (7); second, the metrics are broken down on a per seed-article query basis for inspection in stage (8).
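A result file in the standard TREC run format (query id, the literal 'Q0', document id, rank, score, run tag, whitespace-separated) could be produced with a sketch like the following. The function name, run tag and PMIDs are our own illustrative choices, not part of the RELISH server API:

```python
def write_trec_run(results, run_tag, path):
    # results: dict mapping seed-article PMID -> list of (candidate PMID, score),
    # ordered best-first; one TREC-format line is written per candidate
    with open(path, "w") as fh:
        for qid, ranked in results.items():
            for rank, (docid, score) in enumerate(ranked, start=1):
                fh.write(f"{qid} Q0 {docid} {rank} {score:.4f} {run_tag}\n")

write_trec_run({"12345678": [("23456789", 0.91), ("34567890", 0.42)]},
               "my_method", "run.txt")
print(open("run.txt").read())
# 12345678 Q0 23456789 1 0.9100 my_method
# 12345678 Q0 34567890 2 0.4200 my_method
```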

Motivation and Limitations
Many users, both expert and non-expert, require literature searching for various purposes. Generally, the information need revolves around expanding knowledge of a topic, whether the goal is to find pertinent supporting material to solidify insight or to broaden one's vision by locating different aspects of a particular claim, idea or field of research.
Users most commonly rely on manual keyword searching to serve their needs (19,21,31,37,70). While it is recognized that this type of search tool is indispensable (19), a major concern, one that the use case of our work attempts to address, is how easily users can be dissatisfied by the results of this procedure. For example, the results produced depend largely on, and may vary substantially with, how well the user is able to 'fine-tune' the queries delivered to the search system (31). Moreover, converting an information need into an effective query is often a source of difficulty (19,37). While search engines often offer 'advanced' operators for more elaborate queries, the user unfriendliness of these syntaxes diminishes their use to a negligible rate (69). Ultimately, users' needs are met on a 'hit-or-miss' basis (2,36,69) within this approach.

Figure 11. An overview of the database retrieval and evaluation process: (1) use existing datasets, both pre-built and user-generated; (2) create custom datasets, allowing user-defined dataset construction by tuning dataset generation parameters; (3) dataset details, showing the size, number of positive and negative pairs and any custom parameters used for generation; (4) dataset article data, allowing user download of (a) raw article metadata (id, title and abstract), (b) annotation data and (c) a sample result file for the evaluation function; (5) result file upload, allowing input of result files for automatic performance assessment; (6) uploaded evaluations, where results of uploaded result file evaluations for the respective dataset are presented; (7) overall evaluation view, providing a performance summary of the dataset as a whole; (8) detailed evaluation view, with performance details broken down on a per-query basis.
Here, the use case of our work falls into the 'recommendation' sub-field of literature search. Specifically, we focus on pairwise similarity detection between documents.
Considering the rate of literature growth and the associated 'information overload' for users, remedial approaches such as recommender systems have been proposed to tackle the difficulty and time burden of keeping track of the most promising and relevant studies (12,30). While recommender systems may presently be considered an uncommon use case for biomedical literature search systems such as PubMed, there has been interest in the development of additional biomedical search tools (53), including systems that allow identification of similar publications (18,20,26,60,76). Moreover, the frequency of their use depends strongly on the accuracy of their recommendations. It is the developers of these types of recommendation systems whom we anticipate will benefit the most from our database.
This work is not the first to propose using a seed article (manuscript) as a query instead of keywords (10,27,37,46,55,71). As stated in (31), the following scenarios illustrate potential applications of a system built upon this ideology: editors or reviewers wishing to explore a subject matter that is not their specialty (e.g. to find other potential peer reviewers); researchers simply wishing to find studies related to a paper of interest (e.g. to locate citations or venues for their work); researchers pursuing a new area of which they have little existing knowledge (e.g. a student following up on a paper given by their supervisor); and researchers wishing to keep up with the latest developments in their field based on their previous publications.
Finally, we must stress that our use case is a complementary alternative to keyword search; it is not intended as a replacement. An accurate keyword-less recommendation system is sure to benefit both expert and non-expert users, especially in this era of information overload. In addition, this is more than a database exclusively for seed article-based search. It can be used to train keyword- or sentence-based search by examining whether keywords or the title from one article would automatically lead to the other articles. That is, a method can be trained for title/sentence-based and title/abstract-based search on this benchmark.

Discussion
Our work represents a community-wide effort to establish a database of document relevance that is suitable for machine learning. It should be noted that our APSE system facilitating the annotation process was not intended as a standalone search engine. Rather, it was developed and employed exclusively for data collection. Nevertheless, using three different methods appeared to provide a more complete search compared to systems powered by a single technique such as TF-IDF in JANE (67) or PMRA in PubMed (49). This is consistent with comments from many users who found interesting articles previously unknown to them.
More than 1500 scientists around the world participated in the RELISH consortium and annotated the relevance of over 180 000 document pairs representing diverse research fields, covering 76% of all PubMed MeSH descriptors without clear bias (Figure 3). These annotations are of high quality, as more than 90% of the article pairs were annotated by original authors and 63% by scientists with 5 or more years of research experience post-PhD. While the majority annotated a single seed article, a few dedicated researchers annotated more than 50 seed articles, and some independently annotated the same seed articles. Together, the resulting dataset provides the largest manually annotated benchmark for detecting document similarity in biomedical literature search.
A number of potential systematic biases were considered during the construction of this database. Potentially affected aspects of our methodology primarily include how the documents presented to annotators were selected and how annotators themselves were selected. We discuss here the biases we were able to identify and explain the rationale behind how they were dealt with.
Inherent selection biases could be argued regarding the baseline methods used to generate the sets of documents for annotation. Firstly, since only the top 20 non-redundant documents from each method were presented to annotators, we have necessarily excluded potentially related articles with lower similarity scores from this pool. We could have taken more results from a single method, but then bias towards that single method would have been introduced. Additionally, we chose to take this number of results from each method to reduce the burden on annotator's participation time, as we appreciate that volunteered time to contribute is valuable. Secondly, as the baseline methods rely exclusively on exact term matches to generate respective similarity scores, it is possible that semantically related documents are missing from this pool. Although there was no scalable solution present to circumvent this, we feel it should be mentioned here.
Potential subjective bias is a major concern for document relevance manually annotated by such a diverse group of scientists. This could take the form of a biased consideration of recommended documents according to the annotator's close-up view of their field, as opposed to the more objective consideration that annotators with general knowledge could provide. We opted to use authors as relevance judges of their own papers both to increase the participation rate (the effort required from participants was again considered here) and because our goal for this benchmark was to capture sub-field similarity, such that articles marked as relevant could have been suitable for citation by the originating seed article. This task requires highly specialized experts in the respective sub-fields to make that determination: the more sub-field experience an annotator has, the more accurate their annotations will be. Although non-PhD researchers are underrepresented among the participants in database construction, this should not affect the overall accuracy of annotations, as we have demonstrated that there is no bias according to annotator experience level or across different research fields. In fact, using expert annotations will help ensure the production of relevant articles for non-PhD researchers, clinicians and the general public in their searches.
If a subjective bias exists, one would expect the performance of the three baseline methods (PMRA, BM25 and TF-IDF) to show certain systematic trends. We showed that method performances were nearly the same across different annotator experience levels (Figure 5), different research topics (Figure 6), different durations of time spent annotating (Figure 8) and whether or not annotators evaluated articles they authored (Tables 2 and 4). Although the magnitude of the performance fluctuation was large between different annotators, the same magnitude of fluctuation was observed within individual annotators (Figure 5), suggesting that the source of fluctuation was likely the small sample size (20 candidates per seed article for each method) and/or true differences in difficulty (i.e. it is easier to find relevant articles for some seed articles than for others). More importantly, examining the same document-document pairs annotated by different researchers indicated a high average consistency of 75% (Figure 7). Furthermore, high consistency in the dataset was also suggested by the relatively narrow distribution of the thresholds used to maximize the MCC for each seed article (Figure 9), indicating that relevance can be defined consistently across different experience levels and research fields. Collectively, these results indicate that, whatever subjective bias may be present, a consensus exists among the majority of researchers regarding what is actually relevant literature, across differences in research areas and experience levels.
High-quality data collected by the RELISH consortium offered an unprecedented opportunity to compare the three baseline methods (PMRA, BM25 and TF-IDF). Somewhat surprisingly, all three methods performed similarly, with TF-IDF presenting a slight edge. This finding differs somewhat from previous studies (22,51,68), in which the classic TF-IDF was often considered inferior to methods developed afterwards, and highlights the importance of having a large benchmark dataset. Moreover, all three methods evaluated here performed only moderately well, with an average AUC of around 0.7 and an average MCC of around 0.5. In addition, the different methods captured different sets of relevant articles (Figure 10). This further underlines the necessity of large benchmark evaluation datasets: significant room exists for improving literature search through future method development.
Encouraging method development was the principal motivation behind the assembly of this database. A recent survey of research paper recommendation systems revealed the urgent need for a common evaluation framework (4), as current method evaluation results are rarely reproducible (3). This work was inspired by large human-annotated databases such as ImageNet (17), where labelled images have promoted method development and helped to revolutionize object detection and image classification (45). Our database should allow novel methods, based on modern deep machine learning techniques, to be developed to address the major deficiencies of current methods. Additionally, the benchmark datasets resulting from our database will enable objective comparison between old and new methods, increasing the uptake of newer methods and their translation into practice. To assist unbiased evaluation, we have made the RELISH database freely accessible to the academic community (released under the CC0 license). Furthermore, we set aside 400 × 60 document-document pairs for blind prediction using the automatic evaluation on our database server at https://relishdb.ict.griffith.edu.au.
We expect that dissemination of this manuscript will attract more scientists to contribute further annotations. A major struggle for us was participant recruitment: the first phase of our recruitment strategy (internal referrals, personal invitations, social media posts and a correspondence letter) ultimately failed to attract enough contributions to establish a meaningful dataset. Because of this, we reluctantly decided to proceed with a second phase of direct emails. Although the response rate was low (around 1%), it was sufficient to achieve the sizeable database we have now. Our server will continue to collect annotations, and we hope to publish the next version of the database with at least double the number of present annotations.

Conflict of interest
None declared.

Author contributions
P.B. participated in the design, carried out the study, implemented the websites and search systems and wrote the manuscript. The RELISH consortium annotated the articles. Y.Z. conceived the study, participated in the initial design, assisted in analyzing data and wrote the manuscript. All authors read, contributed to the discussion of and approved the manuscript.

RELISH consortium members
Stephen J. Bush 68