Searching for rare diseases in PubMed: a blind comparison of Orphanet expert query and query based on terminological knowledge

Background Despite international initiatives like Orphanet, it remains difficult to find up-to-date information about rare diseases. The aim of this study is to propose an exhaustive set of queries for PubMed based on terminological knowledge and to evaluate it against the expert queries provided by the most frequently used resource in Europe: Orphanet.
Methods Four rare disease terminologies (MeSH, OMIM, HPO and HRDO) were manually mapped to each other, permitting the automatic creation of expanded terminological queries for rare diseases. For 30 rare diseases, 30 citations retrieved by the Orphanet expert query and/or the query based on terminological knowledge were assessed for relevance by two independent reviewers unaware of the query's origin. An adjudication procedure was used to resolve any discrepancy. Precision, relative recall and F-measure were computed.
Results For each Orphanet rare disease (n = 8982), there was a corresponding terminological query, in contrast with only 2284 queries provided by Orphanet. Only 553 citations were evaluated because some queries returned 0 or only a few hits. There was no significant difference between the Orpha query and the terminological query in terms of precision (0.61 vs. 0.52, respectively; p = 0.13). Nevertheless, terminological queries retrieved more citations more often than Orpha queries (0.57 vs. 0.33; p = 0.01). Interestingly, Orpha queries seemed to retrieve older citations than terminological queries (p < 0.0001).
Conclusion The terminological queries proposed in this study are now available for all rare diseases. They may be a useful tool for both precision-oriented and recall-oriented literature searches.
Electronic supplementary material The online version of this article (doi:10.1186/s12911-016-0333-0) contains supplementary material, which is available to authorized users.


Background
There is currently no consensual definition of a rare disease: in Europe, a disease is considered rare if it affects fewer than 1 in 2000 citizens, while in the United States of America (USA) the threshold was set at 200,000 people in the entire population [1] (approximately 1 in 1600 according to the US Census Bureau [2]).
These rough definitions lead to major heterogeneity among rare diseases. Most genetic diseases are rare, but some infectious diseases, cancers and auto-immune diseases are also rare. They may occur at any point in life. There are geographical variations: a disease may be rare in one country (like periodic disease in France) but quite frequent in another (periodic disease in Armenia). Some are well known and have been described for a number of years, whereas others have been discovered recently and information about them is scarce.
Furthermore, under these definitions 5000 to 8000 rare diseases are known, which leads to the "paradox of rarity": each disease is rare, but patients with rare diseases are numerous. Obtaining a clear picture of the prevalence of rare diseases is not an easy task; nevertheless, it is commonly accepted that approximately 5 to 10 % of the population suffer from rare diseases (8-9 % in the USA [1], 6-8 % in the European Union [3]). In both regions, this corresponds to approximately 30,000,000 patients suffering from a rare disease, making it a real public health concern [4].
This heterogeneity and frequency of rare diseases translate into numerous different situations in which information is needed. Finding a physician with adequate experience may be easy when a reference center exists, but can be a real difficulty if care pathways are not identified [5]. Providing medical care for patients with a rare disease is a difficult task for physicians, even if the care episode does not concern the rare disease. Writing a systematic review about a rare disease, or doing a short review in order to write a research article, requires querying one or more bibliographic databases with as many relevant keywords as possible [6].
It seems of public health importance to provide all these participants with the appropriate tools to easily retrieve relevant information about rare diseases.
PubMed is one of the most popular search engines for accessing the medical literature. It searches the MEDLINE bibliographic database, which gathers a large part of biomedical scientific articles, and some other minor resources [7]. MEDLINE is indexed using the MeSH® thesaurus. Although PubMed theoretically gives access to the literature about rare diseases, including the most recent scientific discoveries, the combination of the following elements may hinder users: (i) the relative novelty of MeSH terms for rare diseases [8]: until 2010, MeSH contained only a few rare diseases, so citations pertaining to rare diseases published before 2010 are not indexed precisely for rare diseases. Since then, 10,354 rare diseases, as defined by the Office of Rare Diseases Research (ORDR) [9], have been introduced in MeSH (source: MeSH 2014); (ii) the delay in MeSH indexing of articles in PubMed [10], which can range from several weeks to several months, depending on the importance of the journal; and (iii) the lack of knowledge about MeSH among health professionals and lay persons [11].
It is therefore difficult for physicians, and even more so for patients, to query PubMed effectively, and especially to find an article about rare diseases published before 2010 or in recent months.
Several institutions (Genetic and Rare Diseases Information Center [12] and Orphanet [13]) already gather information about rare diseases on their websites, including a brief summary, clinical information and many links to other resources. Sometimes a link to an expert-based PubMed query is provided, limiting the user's task to assessing citation relevance. Nevertheless, in the case of Orphanet, these queries do not always take advantage of all the MeSH/PubMed functionalities and they are far from providing comprehensive coverage of all rare diseases. Moreover, the methodology used to establish these queries is not disclosed on the Orphanet website. The aim of this study was to propose a set of queries linked to each rare disease term in Orphanet and to evaluate these queries against those developed by Orphanet.

PubMed overview
PubMed is the bibliographic database most frequently used by biomedical scientists throughout the world. It therefore constitutes a standard in terms of information retrieval. MEDLINE is the major component of PubMed, gathering almost 90 % of the 26 million PubMed citations. MEDLINE curators assign to each citation a list of MeSH terms to describe it with a controlled level of granularity. The atomic unit of MeSH is the MeSH concept, a class of synonymous terms, i.e. all terms gathered in a MeSH concept are true synonyms. MeSH concepts closely related to each other in meaning may be gathered in a MeSH descriptor (MeSH D) or a MeSH supplementary concept (MeSH SC), one of them being the preferred concept and the others being narrower, broader or related to the preferred one. Both MeSH D and MeSH SC aim at indexing citations, but they exhibit some differences. First, MeSH SC are quite specific terms: they are used to index chemicals, drugs, and other concepts such as rare diseases. Second, MeSH SC, unlike MeSH D, are not classified hierarchically; they are only linked to one or more MeSH D, usually broader, by a specific relationship. Lastly, there are many more MeSH SC (≈200,000) than MeSH D (≈27,000).
PubMed users may specify which search field they want to use in their query using bracketed field tags. Table 1 presents some of these tags and their meaning.

PubMed queries

Orpha queries
Orphanet PubMed queries were manually created by Orphanet experts. These queries are available on the Orphanet web site (URL: www.orpha.net), on the page of each disease that has an Orphanet PubMed query. For example, for the Orphanet concept "retroperitoneal fibrosis", the PubMed query is: retroperitoneal fibrosis[majr] OR Retroperitoneal fibrosis [ti]. For the Orphanet concept "Blount disease", the query is: Blount disease[tw] OR tibia vara [tw].
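As an illustration of how such a query string can be submitted programmatically, the following minimal sketch builds a request URL for the public NCBI E-utilities ESearch service. It is not part of the study; the helper name esearch_url and the choice of ten results in JSON format are assumptions made for this example, and no network call is performed.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query: str, retmax: int = 10) -> str:
    """Build an ESearch URL asking PubMed for up to `retmax` PMIDs.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Orpha query quoted in the text for "retroperitoneal fibrosis".
ORPHA_QUERY = "retroperitoneal fibrosis[majr] OR Retroperitoneal fibrosis [ti]"
url = esearch_url(ORPHA_QUERY)
print(url)
```

The field tags ([majr], [ti]) are percent-encoded in the URL and interpreted by PubMed on the server side.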

Terminological queries
In addition to the MeSH thesaurus, several other terminologies and ontologies are available on rare diseases: (a) a formal ontology named HRDO (Human Rare Disease Ontology) [14] was developed based on the Orphanet classification. This ontology is available in five European languages: English, French, German, Spanish and Portuguese; (b) the Online Mendelian Inheritance in Man (OMIM) database, developed at Johns Hopkins University [15]; (c) the Human Phenotype Ontology (HPO), a formal ontology, which allows the description in an unambiguous fashion of phenotypic information in medical publications and databases [16]. The HPO is freely available at http://www.human-phenotype-ontology.org.
One of the authors (SJD) created exact match mappings between MeSH, OMIM, HPO and HRDO based on suggestions from a natural language processing/concept-based algorithm [17,18]. Exact match mapping means that the two concepts are real synonyms (e.g. the "Absent corpus callosum cataract immunodeficiency" MeSH concept and the "Vici syndrome" HRDO disease). Using these alignments, PubMed queries are created automatically, according to a published algorithm [19]. The algorithm output depends on the type of MeSH term mapped to: MeSH concept, MeSH SC or MeSH D (see Table 2).
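The general idea of expanding a disease into an OR-combination of mapped synonyms can be sketched as follows. This is a deliberately simplified illustration, not the published algorithm [19]: it ignores the type of MeSH term mapped to and searches every label as a text word. The function name build_or_query and the default [tw] tag are assumptions made for this example.

```python
def build_or_query(labels, tag="tw"):
    """Join quoted labels with OR, each restricted to one PubMed field tag.

    Simplified sketch: duplicates are removed case-insensitively and the
    original order of the labels is preserved.
    """
    seen = set()
    clauses = []
    for label in labels:
        cleaned = label.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            clauses.append(f'"{cleaned}"[{tag}]')
    return " OR ".join(clauses)


# Synonyms taken from the exact-match example given in the text.
vici_labels = [
    "Vici syndrome",
    "Absent corpus callosum cataract immunodeficiency",
    "vici syndrome",  # duplicate coming from another terminology
]
vici_query = build_or_query(vici_labels)
print(vici_query)
```

Because every clause is linked by OR, adding a synonym from another terminology can only widen the result set.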

Relevance evaluation
Thirty rare diseases were randomly selected from the subset with both an Orphanet query and a terminological query. The selected rare diseases are listed in Table 4 (at the end of the document). Diseases with a prevalence higher than 1/2000 were not considered rare. One author (GK) gathered the first ten citations retrieved (PubMed "recently added" ranking) for each rare disease, using the following queries:
$$Q_1 = Q_{\mathrm{Orpha}} \ \mathrm{AND} \ Q_{\mathrm{Term}} \qquad (1)$$
$$Q_2 = Q_{\mathrm{Orpha}} \ \mathrm{NOT} \ Q_{\mathrm{Term}} \qquad (2)$$
$$Q_3 = Q_{\mathrm{Term}} \ \mathrm{NOT} \ Q_{\mathrm{Orpha}} \qquad (3)$$
where Q Orpha is the Orpha query and Q Term the terminological query. Therefore, Q 1 retrieved citations common to both the Orpha and terminological queries, Q 2 retrieved citations specific to the Orpha query and Q 3 retrieved citations specific to the terminological query. GK then hid the retrieving query, so that the evaluators were blinded to the type of query. The anonymised citations were split between four physicians (FD, LR, MS and NG) in such a way that: (i) each citation was evaluated twice and (ii) each evaluator shared each third of their evaluations with one different evaluator.
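The three evaluation queries can be assembled from the two source queries with PubMed's Boolean operators, as in the minimal sketch below. The function name combine_queries is an assumption, and the terminological query shown for Blount disease is a hypothetical stand-in; only the Orpha query is quoted from the text.

```python
def combine_queries(q_orpha: str, q_term: str) -> dict:
    """Return the intersection and the two difference queries.

    Parentheses keep each source query intact when it contains OR clauses.
    """
    return {
        "Q1": f"({q_orpha}) AND ({q_term})",  # common to both queries
        "Q2": f"({q_orpha}) NOT ({q_term})",  # specific to the Orpha query
        "Q3": f"({q_term}) NOT ({q_orpha})",  # specific to the terminological query
    }


q_orpha = "Blount disease[tw] OR tibia vara [tw]"    # quoted from the text
q_term = '"Blount disease"[tw] OR "tibia vara"[tw]'  # hypothetical example
queries = combine_queries(q_orpha, q_term)
for name, query in queries.items():
    print(name, query)
```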
Evaluators had to answer the following question for each citation: "Does the article directly concern the disease?" In case of any disagreement, a third evaluator evaluated the citation and the discrepancy was resolved by consensus.
More information regarding relevance evaluation is available in Additional file 1.
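One way to obtain the double-evaluation scheme described above is to cycle through the six possible pairs of the four evaluators. The sketch below is an assumption about how such a split could be implemented, not a description of the procedure actually used by the authors; the function name assign_pairs is hypothetical.

```python
from collections import Counter
from itertools import combinations, cycle

EVALUATORS = ["FD", "LR", "MS", "NG"]


def assign_pairs(citation_ids, evaluators=EVALUATORS):
    """Assign each citation to one pair of evaluators, round-robin.

    With four evaluators there are six pairs and each evaluator belongs to
    three of them, so their workload is shared in thirds with each colleague.
    """
    pairs = cycle(combinations(evaluators, 2))
    return {cid: next(pairs) for cid in citation_ids}


assignment = assign_pairs(range(12))
load = Counter(name for pair in assignment.values() for name in pair)
per_pair = Counter(assignment.values())
print(load)
print(per_pair)
```

In practice the citation order would presumably be shuffled first so that the pairs are not confounded with the disease or the query type.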

Statistical analysis
Agreement between evaluators was measured by kappa. HRDO rare diseases may be split into two: terms with an Orpha query and terms without Orpha query. These two sub-populations were compared according to available determinants to ensure generalizability.
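For reference, the kappa statistic for two evaluators can be computed as in the sketch below, which assumes the usual Cohen's kappa for two raters. The ratings are invented for illustration and are not taken from the study.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who judged the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must judge the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical relevance judgements (1 = relevant, 0 = not relevant).
rater_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 items, while 0.52 agreement is expected by chance, which gives a kappa of about 0.58.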
For each rare disease, it is possible to estimate the precision (p i ) of each query (Q 1 , Q 2 , Q 3 ; see Eq. 4):
$$p_i = \frac{n(\mathrm{rel}_{Q_i})}{n(\mathrm{eval}_{Q_i})} \qquad (4)$$
where n(rel Qi ) and n(eval Qi ) are the number of relevant citations and the number of evaluated citations for query i, respectively. Orpha queries and terminological queries were compared according to micro-average precision, number and publication date of retrieved citations, and use of MeSH terms. Non-parametric tests were used: Fisher's test for qualitative variables (micro-average precision and MeSH use), and the Wilcoxon and Kruskal-Wallis tests for quantitative variables (number of citations and date). The Dunn test allows pairwise comparisons after Kruskal-Wallis.
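The precision estimate and the Fisher comparison described above can be sketched in plain Python as follows. The per-disease counts are invented for illustration, and the two-sided p-value uses the common convention of summing the probabilities of all tables that are at most as likely as the observed one; both function names are assumptions.

```python
from math import comb


def micro_average_precision(counts):
    """counts: list of (n_relevant, n_evaluated) pairs, one per disease."""
    relevant = sum(r for r, _ in counts)
    evaluated = sum(e for _, e in counts)
    return relevant / evaluated


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so that ties with the observed table are included.
    return sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts for three diseases: (relevant, evaluated).
counts = [(8, 10), (3, 10), (2, 4)]
micro = micro_average_precision(counts)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(micro, 4), round(p_value, 4))
```

Micro-averaging pools all citations before dividing, so diseases with few evaluated citations weigh less than they would in a per-disease (macro) average.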

Results
HRDO, in its 09/11/2013 version, lists 9060 diseases and groups of diseases. Seventy-eight were not considered rare diseases because their prevalence, as specified by Orphanet, was above the European threshold, so the study considered only the remaining 8982 rare diseases. Table 3 lists the number of alignments created or validated by SJD. For example, SJD created 1247 synonymy mappings between an HRDO concept and a MeSH descriptor.
Only 2284 HRDO rare diseases have a manually validated Orphanet query (25.4 %). A terminological query is generated for each disease in Orphanet (whether rare or not). Orpha queries and terminological queries retrieved 0 citations in 5 (<1 %) and 4370 (48.7 %) cases, respectively (see Fig. 1). Considering both the "no query" and "0 citations" situations, a useful terminological query or Orpha query exists for 51.3 % and 25.4 % of HRDO rare diseases, respectively.
The 30 selected rare diseases and the number of citations retrieved by each query are listed in Table 4. Terminological queries retrieved more citations more often than Orpha queries (17 terminological queries retrieved more results than Orpha queries, while only 10 Orpha queries retrieved more results than terminological queries; p = 0.01; Wilcoxon test). As some queries retrieved fewer than 10 citations (see Table 4), only 553 PubMed citations were assessed for relevance (instead of 30 × 3 × 10 = 900). Kappa indexes before the adjudication procedure ranged from moderate (0.41) to almost perfect (0.86) agreement. Overall kappa was 0.68 (substantial agreement). The precision of each query, computed after the adjudication process, is listed in Table 4. The intersection query (Q 1 ) is significantly more precise than Q 2 (p = 0.01; Fisher's test) and Q 3 (p < 0.001; Fisher's test). However, there was no significant difference between Q 2 and Q 3 precision (0.61 vs. 0.52, respectively; p = 0.13; Fisher's test). For the 30 selected diseases, significantly more terminological queries than Orpha queries fully used the MeSH thesaurus (28 vs. 8; p < 0.001; Fisher's test; data not shown).
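The coverage figures reported in this section can be rechecked from the raw counts given in the text. The short sketch below only repeats that arithmetic; the variable names are chosen for this example.

```python
# Counts quoted in the Results section.
HRDO_ENTRIES = 9060
NOT_RARE = 78
ORPHA_QUERIES = 2284
ORPHA_ZERO_HITS = 5
TERM_ZERO_HITS = 4370


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rare_diseases = HRDO_ENTRIES - NOT_RARE
orpha_coverage = pct(ORPHA_QUERIES, rare_diseases)
orpha_useful = pct(ORPHA_QUERIES - ORPHA_ZERO_HITS, rare_diseases)
term_zero = pct(TERM_ZERO_HITS, rare_diseases)
term_useful = pct(rare_diseases - TERM_ZERO_HITS, rare_diseases)
max_citations = 30 * 3 * 10  # diseases x queries x citations per query

print(rare_diseases, orpha_coverage, orpha_useful, term_zero, term_useful, max_citations)
```

Removing the five Orpha queries without hits does not change the rounded Orpha coverage, which stays at 25.4 %.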
When considering relevant citations alone, it is noteworthy that citations retrieved by Q 2 (i.e. only by Orpha queries) are significantly older than those retrieved by Q 1 and Q 3 (p < 0.0001 in both cases, Dunn's test with Bonferroni correction). Median publication dates are 2013, 2005 and 2014 for Q 1 , Q 2 and Q 3 , respectively. The results are very similar when considering all the evaluated citations, whether relevant or not.

Discussion
There is no difference between Orpha and terminological queries in terms of precision. However, Orpha queries retrieved significantly fewer results. Moreover, citations retrieved only by Orpha queries are significantly older than citations retrieved by terminological queries, and Orphanet provides queries for only 25.4 % of rare diseases, while terminological queries retrieved at least one PubMed citation for 51.3 % of rare diseases. This suggests a differentiated approach according to the user's objectives: a precision-oriented user should use the intersection query, which will retrieve the most relevant citations, whereas a recall-oriented user might prefer the union query.
Nevertheless, for almost 75 % of HRDO rare diseases, there is no other solution but the terminological query.
Physicians are probably more interested in precision than in recall. A researcher, in contrast, may be more interested in recall for their literature review. However, in many cases, only the terminological query is available, leaving the user no choice. A potentially interesting use of the set of terminological queries is finding medical experts on rare diseases [5], where noise is a less important problem. The set of terminological queries is available from the Health Terminology Ontology Portal [20] (URL: http://www.hetop.eu).
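The query-selection strategy discussed above can be summarized in a few lines. This is only a sketch of the reasoning in this section; the function name choose_query and the objective labels "precision" and "recall" are assumptions made for this example.

```python
from typing import Optional


def choose_query(q_term: str, q_orpha: Optional[str], objective: str) -> str:
    """Pick a PubMed query according to the user's objective.

    Falls back to the terminological query when no Orpha query exists.
    """
    if q_orpha is None:
        return q_term
    if objective == "precision":
        return f"({q_orpha}) AND ({q_term})"  # intersection query
    if objective == "recall":
        return f"({q_orpha}) OR ({q_term})"   # union query
    raise ValueError("objective must be 'precision' or 'recall'")


# Orpha query quoted in the text; the terminological query is hypothetical.
q_orpha = "Blount disease[tw] OR tibia vara [tw]"
q_term = '"Blount disease"[tw] OR "tibia vara"[tw]'
print(choose_query(q_term, q_orpha, "precision"))
print(choose_query(q_term, None, "recall"))
```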
Two mechanisms may explain the more up-to-date set of results retrieved by terminological queries: (i) the major part of the difference is a consequence of the evaluation method. As terminological queries retrieve more results than Orphanet queries, we can hypothesize that they retrieve both more recent and more old citations. However, PubMed ranks recent results first and we only evaluated the first ten results, i.e. the most recent ones. (ii) Some keywords added by the terminology expansion are quite recent, and not yet taken into account by Orphanet experts in their queries.
While these hypotheses limit the value of the up-to-date effect of terminological queries, they raise a maintenance issue. Creating and maintaining a query is very time-consuming and this is probably one of the main limitations of Orphanet queries. For terminological queries, maintenance may only be necessary when terminologies evolve. For example, Vasilevsky et al. [21] recently enhanced HPO with terms that patients, doctors, and machines can all understand. This evolution will require only limited validation and maintenance for terminological queries. However, the convergence of these terminologies (with the Orphanet Rare Diseases Ontology [22]) may ultimately reduce the maintenance tasks substantially.

Strengths and limitations
Only two sets of queries were compared in this study: one from Orphanet [13] and one based on terminologies [20]. The queries from the Genetic and Rare Diseases Information Center [12] were not tested for this study because of their limited design (they often only rely on OMIM record references [23], which are not updated on a regular basis). In fact, these queries cannot retrieve any citation that has not been considered by the OMIM authors. Their added value over the OMIM record references is therefore very limited. Also, only two sets of queries may be considered as a gold standard for comparison with terminological queries: Orpha queries and free text queries, which most users are likely to submit. Both have pros and cons.
Using free text queries sounds like a very pragmatic approach, close to reality. Results would be easy to interpret. Nevertheless, such a gold standard is difficult to establish because it cannot be formalized.
The label choice has a major influence on the query result: if a label from MeSH is used, PubMed will automatically recognize the MeSH term and perform a semantic expansion; otherwise, the query may be tokenized and each term searched for separately, which would introduce a lot of noise. Using Orpha queries may seem questionable: only 25.4 % of the rare diseases are provided with a query and the query production process is unclear. However, these queries are somewhat validated by Orphanet experts and they are available online.
For these reasons, the use of Orpha queries as a gold standard seemed preferable. The question the evaluators had to answer for each citation is quite generic and might not be adapted to the real users' context. One difficulty is reaching an acceptable inter-evaluator agreement, which is the only way to assess the quality of the relevance assessment. A more specific question was tested: "Is the citation useful for medical care?" but agreement was very low.
The main limitation of this study is probably the quality assurance of the terminology mapping: relying on a single expert is not sufficient for sensitive data, and while the help of an automatic algorithm may limit the false positive rate, it also increases the false negative rate. Proper quality assurance might have slightly enhanced terminological query performance. Nevertheless, the results presented in our study, with no difference in precision, indicate that a sufficiently high mapping quality was achieved.
This study has some strengths. First, the evaluation of citations by two independent physicians unaware of the query, together with the adjudication procedure, renders the judgement as reliable and unbiased as possible. Second, the results are theoretically generalizable because of the random selection of the diseases, which led to a distribution of disease prevalence in the studied corpus similar to that of the entire HRDO.
The main strength of the terminological approach presented here is the availability of a query for each rare disease in each terminology. The cost of this approach (maintenance of mappings) seems very limited. Queries take advantage of the rich synonymy of classifications (HPO, HRDO, OMIM, MeSH) and, when there is an alignment to MeSH, of MeSH indexing. The semantic expansion used here could be enhanced using UMLS; nevertheless, this resource has already been shown to be too noisy [24].

Query structure - MeSH
Orpha queries and terminological queries are structurally different. Terminological queries are based on the automatic exploitation of terminological knowledge; therefore the queries are structurally simple, i.e. all the keywords are linked by an "OR" in the query. Orpha queries, being manually designed, may be more complex, involving all the Boolean operators (AND, OR and NOT). Even if an exact match MeSH term does not exist, it is possible to use a combination of MeSH terms relevant to the disease. Overall, as mentioned above, the creation and maintenance of Orpha queries is a much more time-consuming task.
MeSH use is also problematic because of the novelty of rare disease MeSH terms [8]. Therefore, decades of citations about rare diseases can only be retrieved using free text, and the recall of MeSH terms is necessarily low. Nevertheless, the indexing of citations with MeSH will gradually increase, enhancing the recall of queries based on MeSH terms; the mapping between Orphanet diseases and MeSH terms is therefore important to maintain.

Conclusions
There is a terminological query for each rare disease. The precision of these queries was not statistically different from the precision found for Orpha queries. The terminological queries proposed in this study are a useful tool for both precision-oriented and recall-oriented literature searches, in combination with the Orpha query if available.

Additional file
Additional file 1: Contains all the evaluated citations, their metadata and their final evaluation. As the authors are French, this file is in French; however, the entire work was performed in English. Column 1 contains the unique disease ID, column 2 the disease name, column 3 the query, column 4 the MeSH level, column 5 the citation PubMed ID, column 6 the final answer for relevance ("Does the article directly concern the disease?"), column 7 the year of publication, column 8 the journal title, column 9 a link toward the citation, column 10