A large-scale dataset of patient summaries for retrieval-based clinical decision support systems

Retrieval-based Clinical Decision Support (ReCDS) can aid clinical workflow by providing relevant literature and similar patients for a given patient. However, the development of ReCDS systems has been severely obstructed by the lack of diverse patient collections and publicly available large-scale patient-level annotation datasets. In this paper, we collect a novel dataset of patient summaries and relations called PMC-Patients to benchmark two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). Specifically, we extract patient summaries from PubMed Central articles using simple heuristics and utilize the PubMed citation graph to define patient-article relevance and patient-patient similarity. PMC-Patients contains 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations, which is the largest-scale resource for ReCDS and also one of the largest patient collections. Human evaluation and analysis show that PMC-Patients is a diverse dataset with high-quality annotations. We also implement and evaluate several ReCDS systems on the PMC-Patients benchmarks to show its challenges and conduct several case studies to show the clinical utility of PMC-Patients.


Introduction
Clinicians often rely on Evidence-Based Medicine (EBM), which combines clinical experience with high-quality scientific research, to make decisions for patients [1]. However, finding relevant research can be challenging since the number of scientific publications is growing exponentially, leaving many clinical questions unanswered [2]. To address this issue, there has been increasing research interest in utilizing Natural Language Processing (NLP) and Information Retrieval (IR) techniques to retrieve relevant articles or similar patients to assist patient management [3,4,5,6,7]. In this article, we introduce the term "Retrieval-based Clinical Decision Support" (ReCDS) to describe these tasks. ReCDS can provide clinical assistance for a given patient by retrieving and analyzing relevant articles or similar patients to determine the most likely diagnosis and the most effective treatment plan.
ReCDS with relevant articles is grounded in EBM, where the target articles to retrieve are up-to-date clinical guidelines or high-quality evidence such as systematic reviews. Therefore, the majority of ReCDS studies have focused on retrieving relevant research articles [8,9,10], primarily facilitated by the Clinical Decision Support (CDS) Track [11,12,3] held annually from 2014 to 2016 at the Text REtrieval Conference (TREC). Each year, the TREC CDS Track releases 30 "medical case narratives", which serve as idealized representations of actual medical records, including patient information such as past medical histories and current symptoms. Participants are asked to return relevant PubMed Central (PMC) articles for each patient with regard to a given aspect (diagnosis, test, or treatment). Although sufficient article relevance can be annotated for each patient under the TREC pooling evaluation setting [13], the size and diversity of the test patient set in TREC CDS are very limited. Consequently, the generalizability of system performance to uncovered medical conditions may be constrained.
ReCDS with similar patients, on the other hand, is much less explored. In brief, "similar patients with similar features have similar outcomes" [14]. Retrieving the medical records of similar patients can provide valuable guidance, especially for patients with uncommon conditions such as rare diseases that lack clinical consensus. Nevertheless, there are various challenges in conducting this type of research. Unlike scientific articles, there is currently no publicly available collection of "reference patients" to retrieve from. Moreover, defining "patient similarity" is non-trivial [14] and large-scale annotation is prohibitively expensive. As a result, there are only a few studies on similar patient retrieval [15,16], all of which use private datasets and annotations.
The aforementioned issues make it clear that a standardized benchmark for evaluating ReCDS systems is greatly needed. Ideally, such a benchmark should contain: (1) a diverse set of patient summaries, which serve as both the query patient set and the reference patient collection; (2) abundant annotations of the patient summaries with relevant articles and similar patients. Due to privacy concerns, only a few clinical note datasets from Electronic Health Records (EHRs) are publicly available. One notable large-scale public EHR dataset is MIMIC [17,18]. However, it only contains ICU patients without any relational annotations, making it unsuitable for evaluating ReCDS systems.
In this article, we aim to benchmark the ReCDS task with PMC-Patients, a novel dataset collected from the case reports in PMC and the citation graph of PubMed. Case reports denote a class of medical publication that typically consists of: (1) a case summary that describes the patient's admission, treatment, progress, discharge, and follow-up situations; (2) a literature review that discusses similar cases and relevant articles, which are cited and recorded in the PubMed citation graph. To build PMC-Patients, we first extract 167k patient summaries from case reports published in PMC using simple heuristics. For these patient summaries, we then annotate 3.1M relevant articles and 293k similar patients using the PubMed citation graph. PMC-Patients is one of the largest patient summary collections, with the largest scale of relation annotations for benchmarking ReCDS. Besides, the patients in our dataset show a much higher level of diversity in terms of demographics and medical conditions than existing patient collections. Our manual evaluation shows that both patient summaries and relation annotations in PMC-Patients are of high quality.
Based on PMC-Patients, we formally define two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). We systematically evaluate the performance of various feature-based and learning-based ReCDS systems, and the experimental results show that both ReCDS-PAR and ReCDS-PPR are challenging tasks, calling for further improvements. We also present highly relevant case studies to demonstrate the potential application and significance of our retrieval tasks in three typical clinical scenarios.
Figure 1 and Figure 2 show an overview of our dataset and ReCDS benchmark, respectively. In summary, the key contributions of this study are three-fold:
1. We introduce PMC-Patients, a first-of-its-kind dataset consisting of 167k patient summaries extracted from case reports. We systematically characterize PMC-Patients and show that it is a diverse dataset with high-quality annotations.
2. Based on PMC-Patients, we formally define two tasks and provide the largest-scale resources to benchmark Retrieval-based Clinical Decision Support (ReCDS) systems: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). ReCDS-PAR contains 3.1M relevant patient-article pairs, and ReCDS-PPR contains 293k similar patient-patient pairs.
3. We systematically evaluate various ReCDS systems on the PMC-Patients benchmark. We also conduct several case studies to demonstrate the clinical utility of PMC-Patients.

Material and methods
To collect the PMC-Patients dataset, we utilize the full-text literature resources in PubMed Central (PMC) and the citation relationships in PubMed, which will be described in Section 2.1. Based on PMC-Patients, we define two ReCDS tasks: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). We will present the tasks in Section 2.2, and introduce the baseline methods in Section 2.3.

PMC-Patients Dataset
We only use PMC articles with at least a CC BY-NC-SA license (about 3.2M) to build the redistributable PMC-Patients dataset. The collection pipeline can be summarized in three steps (the implementation details and graphical illustration of the pipeline can be found in Appendix A): (a) We identify potential patient summaries in each article section using extraction triggers, which are a set of regular expressions searching for specific patterns of patient summaries, such as "Case report" and "Patient representation" in the section title. (b) For the sections identified in (a), we extract patient summary candidates using several extractors. Extractors operate at the paragraph level, so a candidate patient summary always consists of one or several complete paragraphs. We also extract the candidates' demographics (ages and genders) using regular expressions. (c) We apply various filters to each candidate patient summary extracted in (b) to exclude candidates that are too short, non-English, or without patient demographics.
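As a rough sketch, steps (a)-(c) can be mimicked with a few regular expressions. The trigger patterns below follow the examples given in the text, but the demographic patterns and the 50-word length threshold are illustrative assumptions, not the exact rules of the actual pipeline (Appendix A):

```python
import re

# Hypothetical patterns (illustrative only; the real rules are in Appendix A).
TRIGGERS = re.compile(r"case report|patient representation", re.I)
AGE = re.compile(r"\d{1,3}[- ]?(?:year|month|week|day)s?[- ]?old", re.I)
GENDER = re.compile(r"\b(man|woman|male|female|boy|girl)\b", re.I)

def extract_candidates(sections, min_words=50):
    """sections: list of (section_title, paragraphs) tuples from one article.
    Returns candidate patient summaries with their demographics."""
    candidates = []
    for title, paragraphs in sections:
        if not TRIGGERS.search(title):       # (a) trigger on the section title
            continue
        summary = "\n".join(paragraphs)      # (b) paragraph-level extraction
        age, gender = AGE.search(summary), GENDER.search(summary)
        # (c) filter out candidates that are too short or lack demographics
        if len(summary.split()) >= min_words and age and gender:
            candidates.append({"text": summary,
                               "age": age.group(0),
                               "gender": gender.group(1).lower()})
    return candidates
```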
For each extracted patient summary in PMC-Patients, we use the citation graph of PubMed5 to automatically annotate (1) relevant articles in PubMed and (2) similar patients in PMC-Patients.
Annotating relevant articles: We assume that if a PubMed article cites or is cited by a patient-containing article, the article is relevant to the patient. Formally, we denote a patient as p, and the article that contains p as a(p). We define any PubMed article a as relevant to the patient p, denoted as Rel(p, a) = 1, if a cites a(p) or a(p) cites a.
Annotating similar patients: We annotate similar patients based on relevant articles. For each patient in PMC-Patients, if its relevant articles contain other patients in the dataset, we label them as similar patients. Formally, we define any two patients p_x and p_y in PMC-Patients as similar, denoted as Sim(p_x, p_y) = 1, if Rel(p_x, a(p_y)) = 1 or a(p_x) = a(p_y).
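The two annotation rules can be sketched as follows, treating the PubMed citation graph as a set of citation pairs. This is a minimal reimplementation of the definitions above (including same-article similarity, as described in the Figure 1 caption), not the production pipeline:

```python
from collections import defaultdict

def annotate(patients, cites):
    """patients: dict patient_id -> source article a(p);
    cites: set of (citing_article, cited_article) pairs.
    Returns (relevant, similar) per-patient annotation sets."""
    # Rel(p, a) = 1 iff a cites a(p) or a(p) cites a
    neighbors = defaultdict(set)
    for citing, cited in cites:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)
    relevant = {p: set(neighbors[src]) for p, src in patients.items()}

    # Sim(px, py) = 1 iff a(py) is relevant to px, or both share a source article
    by_article = defaultdict(set)
    for p, src in patients.items():
        by_article[src].add(p)
    similar = {}
    for p, src in patients.items():
        sim = set()
        for a in relevant[p] | {src}:
            sim |= by_article[a]
        sim.discard(p)          # a patient is not its own similar patient
        similar[p] = sim
    return relevant, similar
```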

PMC-Patients ReCDS Benchmarks
The PMC-Patients dataset contains 167k patient summaries, annotated with 3.1M relevant articles and 293k similar patients. Based on the dataset, we define two benchmarking tasks for ReCDS: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). Both are modeled as information retrieval tasks where the input is a patient summary p ∈ P, where P denotes the PMC-Patients dataset. For ReCDS-PAR, the objective is to retrieve PubMed articles relevant to the input patient from the corpus A. Instead of using all 33.4M articles in PubMed, we restrict the retrieval corpus to articles relevant to at least one patient. Formally, A = {a | ∃p ∈ P, Rel(p, a) = 1}, which contains 1.4M articles and constitutes a more feasible setting. For ReCDS-PPR, the objective is to retrieve patients similar to the input patient from PMC-Patients. The benchmark statistics are shown in Table 1.
We split the train/dev/test sets at the article level. Specifically, we randomly select two subsets of articles (5k in each) from which PMC-Patients is extracted, and include the corresponding patients in the dev and test sets as query patients. Patient summaries extracted from the other articles are included as the training query patients and are also used as the retrieval corpus P.
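A minimal sketch of this article-level split; the seed and data structures are illustrative assumptions, but the key property (all patients from one article land in the same split) matches the procedure above:

```python
import random

def article_level_split(patients, dev_size=5000, test_size=5000, seed=42):
    """patients: dict patient_id -> source article id.
    Splits on the article level so that all patients extracted from
    the same article always fall into the same split."""
    articles = sorted(set(patients.values()))
    rng = random.Random(seed)
    rng.shuffle(articles)
    dev_articles = set(articles[:dev_size])
    test_articles = set(articles[dev_size:dev_size + test_size])
    split = {"train": [], "dev": [], "test": []}
    for p, a in patients.items():
        if a in dev_articles:
            split["dev"].append(p)
        elif a in test_articles:
            split["test"].append(p)
        else:
            split["train"].append(p)
    return split
```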

Baseline models
We implement three types of baseline retrieval models for both ReCDS-PAR and ReCDS-PPR: sparse retriever, dense retriever, and nearest neighbor retriever.
Sparse retriever: We implement a BM25 retriever [19] with Elasticsearch. The parameters of the BM25 algorithm are set to the Elasticsearch defaults (b = 0.75, k1 = 1.2). For ReCDS-PAR, we index the title and abstract of each PubMed article as separate fields, and the weights given to the two fields during retrieval are empirically set to 3:1.
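In Elasticsearch's query DSL, this 3:1 field weighting can be expressed with the `^` boost syntax on a `multi_match` query; the index and field names below are assumptions for illustration:

```python
def par_query(patient_summary: str) -> dict:
    """Build an Elasticsearch query body for ReCDS-PAR: BM25 over title and
    abstract with the 3:1 field weighting described above (a sketch; the
    index schema and field names are assumptions)."""
    return {
        "query": {
            "multi_match": {
                "query": patient_summary,
                # "^3" boosts the title field 3x relative to the abstract
                "fields": ["title^3", "abstract"],
            }
        }
    }

# Usage (against a running Elasticsearch instance with a hypothetical
# "pubmed" index):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# hits = es.search(index="pubmed", body=par_query("A 32-year-old woman ..."))
```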
Dense retriever: Dense retrievers represent patients and articles in a low-dimensional space using BERT-based encoders and perform retrieval based on maximum inner-product search. Concretely, we denote the encoder as f, and e_d = f(d) refers to the low-dimensional embedding generated by the encoder for a given passage d. Then for a query patient q and an article a in our retrieval corpus A, the relevance score between them is defined as the inner product of their embeddings: s_dense(q, a) = e_q · e_a. The similarity score s_dense(q, p) between q and a patient p ∈ P is defined similarly.
We first try directly transferring Sentence-BERT [20] and Contriever [21], two widely-used dense retrievers pre-trained on MS MARCO [22], a large-scale general-domain retrieval dataset. Then we train our own dense retrievers by fine-tuning pre-trained encoders on the PMC-Patients dataset. To be specific, for a given query patient q_i, a similar patient / relevant article p_i^+, and a set of dissimilar patients / irrelevant articles p_{i,1}^-, p_{i,2}^-, ..., p_{i,n}^- from the training data, we use the negative log-likelihood of the positive passage as the loss function:
L(q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^-) = -log [ exp(s(q_i, p_i^+)) / ( exp(s(q_i, p_i^+)) + Σ_{j=1}^{n} exp(s(q_i, p_{i,j}^-)) ) ].
We train the dense retrievers with in-batch negatives [23], where the positive passages of the other queries in the same mini-batch serve as negatives for each query. We train several different encoders, all of which are Transformer encoders [24] initialized from domain-specific BERT variants [25], including PubMedBERT [26], Clinical BERT [27], BioLinkBERT [28], and SPECTER [29]. For the ReCDS-PPR task, only one encoder is used, while for the ReCDS-PAR task, we train two independent encoders to encode patients and articles separately, due to their structural differences.
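The in-batch-negative objective can be sketched in a few lines of NumPy: each query's negatives are the positive passages of the other queries in the batch, so the loss is a softmax cross-entropy over the batch-by-batch score matrix. The batching and optimization details are omitted:

```python
import numpy as np

def in_batch_nll(query_emb, pos_emb):
    """Negative log-likelihood with in-batch negatives.
    query_emb, pos_emb: (B, d) arrays of encoder outputs, where row i of
    pos_emb is the positive passage for query i and every other row of
    pos_emb acts as a negative for query i."""
    scores = query_emb @ pos_emb.T                  # (B, B) inner-product scores
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()             # -log p(positive | query)
```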
Nearest Neighbor (NN) retriever: We assume that if two patients are similar, then their respective relevant-article and similar-patient sets should highly overlap, based on which we implement the following NN retriever, similar to [30]. For each patient p ∈ P in the training queries, we define its relevant article set as R(p) = {a | a ∈ A, Rel(p, a) = 1}. For each query patient q, we first retrieve the top-K similar training patients p_1, p_2, ..., p_K ∈ P as its nearest neighbors using BM25. We take the union of their relevant articles as the candidate set: C(q) = ∪_{k=1}^{K} R(p_k). The candidate articles c_i ∈ C(q) are then ranked by relevance scores s_NN(q, c_i), defined as the sum of the BM25 similarities between q and the neighbors whose relevant article sets contain c_i: s_NN(q, c_i) = Σ_{k: c_i ∈ R(p_k)} s_BM25(q, p_k). The NN retriever for ReCDS-PPR is implemented similarly.
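One natural instantiation of this NN retriever scores each candidate article by summing the BM25 similarities of the neighbors whose relevant sets contain it; this is a sketch under that assumption, since the exact scoring follows [30]:

```python
def nn_retrieve(bm25_scores, relevant_sets, top_n=10):
    """bm25_scores: dict mapping each of the K nearest training patients to
    its BM25 similarity with the query; relevant_sets: dict mapping each
    training patient to its relevant article set R(p).
    Candidate articles are scored by summing the BM25 scores of the
    neighbors that are annotated with them, then ranked."""
    scores = {}
    for p, s in bm25_scores.items():
        for a in relevant_sets.get(p, ()):
            scores[a] = scores.get(a, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```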

Results
In this section, we first analyze the characteristics of the PMC-Patients dataset in Section 3.1, including basic statistics and patient diversity. We then show that the dataset is of high quality in terms of summary extraction and relation annotation in Section 3.2. Finally, we present the performance of baseline methods on the ReCDS-PAR and ReCDS-PPR benchmarks in Section 3.3.

PMC-Patients Dataset
Scale: Table 2 shows the basic statistics of patient summaries in PMC-Patients, in comparison to MIMIC, the largest publicly available clinical notes dataset, and TREC CDS, a widely-used dataset for ReCDS. For MIMIC, we report the statistics of discharge summaries from both MIMIC-III and MIMIC-IV. For TREC CDS, we combine the data released in the three years' CDS tracks (2014-2016) and use the "description" fields. PMC-Patients contains 167k patient summaries extracted from 141k PMC articles, making it the largest patient summary dataset in terms of the number of patients, and the second largest in terms of the number of notes. Besides, PMC-Patients has 3.1M patient-article relevance annotations, over 27× the size of TREC CDS (113k in total). PMC-Patients also provides the first large-scale patient-similarity annotations, consisting of 293k similar patient pairs.

Length: On average, PMC-Patients summaries are much longer than TREC descriptions (410 vs. 92 words), but shorter than MIMIC discharge summaries (410 vs. over 1k words). Figure 3 (a) presents the length distributions of PMC-Patients, TREC CDS descriptions, and MIMIC-IV discharge summaries.

Demographics: The age distributions of PMC-Patients and MIMIC-IV are presented in Figure 3 (b). There are too few patients in TREC CDS to observe an age distribution, so we do not include it in the figure. On average, patients in PMC-Patients are younger than those in MIMIC-IV (43.4 vs. 58.7 years old), and their ages are more evenly distributed (6.39 vs. 6.09 Shannon bits). PMC-Patients covers pediatric patients while MIMIC-IV does not. The gender distribution in both datasets is balanced: PMC-Patients consists of 52.5% male and 47.5% female patients, while MIMIC-IV consists of 48.7% male and 51.3% female.
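The "Shannon bits" used above to compare how evenly patient ages are distributed is the Shannon entropy of the (binned) age distribution; a higher value means a more even spread. A minimal sketch:

```python
import math

def shannon_bits(counts):
    """Shannon entropy (in bits) of a discrete distribution given raw counts,
    e.g. the number of patients in each age bin."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)
```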
Medical conditions: We also analyze the medical conditions associated with the patients. For PMC-Patients, we use the MeSH Diseases terms of the articles, and for MIMIC, we use the ICD codes. The most frequent medical conditions are shown in Figure 3 (c). In PMC-Patients, the majority of frequent conditions are related to cancer, with the exception of COVID-19 as the second most frequent condition. In MIMIC-IV, severe non-cancer diseases (e.g. hypertension) have the highest relative frequencies, and their absolute values are much higher than those of the most frequent conditions in PMC-Patients. For example, hypertension and lung neoplasms are the most frequent conditions in MIMIC and PMC-Patients, respectively: over 60% of MIMIC patients have hypertension, while less than 4% of patients in PMC-Patients have lung neoplasms. In addition, PMC-Patients covers 4,031/4,933 (81.7%) MeSH Diseases terms, a notably higher coverage than the 8,955/14,666 (61.1%) ICD-9 codes and 16,464/95,109 (17.3%) ICD-10 codes covered by MIMIC-IV.

Dataset Quality Evaluation

Patient summary extraction
In this section, we evaluate the quality of the automatically extracted patient summaries and demographics in PMC-Patients.The evaluation is performed on a random sample of 500 articles from the benchmark test set.Two senior M.D. candidates are employed to label the patient note spans at the paragraph level and the patient demographics.Agreed annotations are directly considered as ground truth, while disagreed annotations are discussed until a final agreement is reached.
Table 3 shows the extraction quality of PMC-Patients and the two human experts against the ground truth. A total of 604 patients are extracted by the human experts. The patient note spans extracted in PMC-Patients are of high quality, with a strict F1 score above 90%. The extracted demographics are close to 100% correct. Besides, the two annotators exhibit a high level of agreement, with most disagreements being minor differences regarding the boundary of a note span.

Patient-level relation annotation
To evaluate the quality of patient-level relation annotations in PMC-Patients, we retrieve the top 5 relevant articles and the top 5 similar patients using BM25 for each patient extracted by the human experts in the previous section (604 patients from 500 articles), resulting in over 3k patient-article and 3k patient-patient pairs for human annotation. To annotate patient-article relevance, we follow the guidelines of the TREC CDS tracks [11,12,3], annotating the type of clinical question about a patient that an article can answer, including diagnosis, test, and treatment. To annotate patient-patient similarity, we follow the recommendations of [14], annotating whether two patients are similar along multiple dimensions: features, outcomes, exposure, and others. To assess the binary relational annotations in PMC-Patients against the multi-dimensional human annotations, we simply convert the latter into an integer score by counting the number of relevant or similar aspects. For example, if two patients are annotated as similar in terms of "features" and "outcomes", the pair receives a score of 2.
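The conversion from a multi-dimensional annotation to an integer score is a simple aspect count; a sketch using the similarity dimensions listed above (the dictionary encoding is an assumption):

```python
def similarity_score(annotation):
    """Convert a multi-dimensional similarity annotation (following [14])
    into the integer score used above by counting the similar aspects."""
    dimensions = ("features", "outcomes", "exposure", "others")
    return sum(1 for d in dimensions if annotation.get(d))
```

For instance, a pair annotated as similar in "features" and "outcomes" receives a score of 2, matching the example in the text.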

ReCDS Benchmark Results
The performance of various baseline methods on the test sets of the two ReCDS tasks is shown in Table 4. Surprisingly, BM25 remains a strong baseline that achieves the best MRR on both tasks and also performs best on nDCG@10 for ReCDS-PPR. This indicates the importance of matching the exact words in the case reports for retrieving similar patients or relevant articles. Sentence-BERT and Contriever, two dense retrievers trained on the general-domain MS MARCO dataset, do not generalize well to our ReCDS tasks. Their performance on all metrics is much worse than the BM25 baseline, consistent with previous findings [31,32] that dense retrievers may fail at zero-shot retrieval in specific domains such as biomedicine.
On the other hand, dense retrievers fine-tuned on PMC-Patients show significant performance improvements over the general-domain retrievers, indicating the importance of domain-specific fine-tuning. Although BM25 achieves better MRR, fine-tuned retrievers achieve the highest P@10 on both tasks and the highest nDCG@10 on ReCDS-PPR. They also have much higher recall than BM25, which suffers from vocabulary mismatch, showing that semantic matching is indispensable for retrieving more relevant articles or similar patients. Fine-tuned Clinical BERT performs the worst among the domain-specific BERTs. This is probably due to the pre-training corpora and tasks of these encoders: PubMedBERT and BioLinkBERT are pre-trained on PubMed, and SPECTER and BioLinkBERT incorporate the citation graph in pre-training, while Clinical BERT is trained on MIMIC, whose language distribution is quite different from PubMed, and never learns citation relationships. However, the metrics of the best baseline method are still quite low, highlighting the challenge of the PMC-Patients ReCDS benchmark.
The NN retriever generally performs worse than BM25 and the dense retrievers, indicating that measuring patient-article relevance via citation-graph distance alone may not be suitable for the task.
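For reference, the MRR and Precision@k metrics reported in Table 4 can be computed as follows (a standard sketch, not the exact evaluation script; nDCG and Recall@1,000 are analogous rank-based aggregates):

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: one ranked id list per query; relevant: one relevant-id set
    per query. MRR rewards the rank of the first relevant hit."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

def precision_at_k(rankings, relevant, k=10):
    """Fraction of the top-k results that are relevant, averaged over queries."""
    return sum(len(set(r[:k]) & rel) / k
               for r, rel in zip(rankings, relevant)) / len(rankings)
```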

Patient summary dataset and case reports
Traditionally, patient summary datasets are collected from clinical notes in EHRs, such as MIMIC, MTSamples (www.mtsamples.com), the THYME project [33], the n2c2 project (https://n2c2.dbmi.hms.harvard.edu/; originally named i2b2, https://www.i2b2.org/NLP/DataSets/Main.php), and the OHNLP Challenges [34,35]. However, except for MIMIC, these datasets are limited in size and diversity, typically containing only several hundred to a few thousand clinical note pieces and focusing on specific diseases.
More recently, clinical case reports have been utilized to construct datasets, but most existing works focus on specific tasks such as named entity recognition [36,37], abbreviation resolution [38], and semantic similarity [39]. They only use case reports as a source of clinical texts, and the resulting datasets are task-oriented rather than patient summary datasets. Only MACCR [40], CAS [41,42], and the E3C project [43] present patient summary datasets extracted from case reports. Among them, MACCR focuses on curating structured metadata of clinical case reports instead of free-text patient summaries. CAS and the E3C project mainly target European languages such as French and Spanish rather than English, with dataset scales still limited to several thousand. In contrast, PMC-Patients is much larger, more diverse, and contains patient-level relation annotations.

Retrieval-based clinical decision support
Due to the lack of an adequate patient summary dataset and the prohibitive costs of manual annotation, there is currently no large-scale ReCDS benchmark dataset available. Most existing methodological research on ReCDS-PAR uses TREC CDS and TREC Precision Medicine (PM) [44,45,46,47]. TREC CDS focuses on retrieving relevant PMC articles for given patient summaries curated by human experts or excerpted from MIMIC with specific intents (e.g. finding treatment/diagnosis). TREC PM focuses on retrieving relevant literature from PubMed or MEDLINE and eligible clinical trials that can provide precision medicine-related evidence for a cancer patient, given the patient's cancer type, genetic variants, basic demographics, and other potential factors. However, each year only 30-50 patient summaries are released and annotated with patient-article relevance, which also severely limits the patient diversity in these datasets. Furthermore, TREC PM only contains cancer patients. In contrast, PMC-Patients has a much larger collection of patient summaries (167k) that cover a wider range of medical conditions, and the largest scale of patient-article relevance annotations (3.1M).
To the best of our knowledge, there is no publicly available similar patient retrieval dataset. PMC-Patients bypasses the difficulty and expense of patient-level annotation using the PubMed citation graph and constructs the first large-scale ReCDS-PPR dataset, with 293k patient-patient similarity annotations.

Clinical significance
ReCDS provides valuable insights for healthcare providers in the diagnosis, testing, and treatment of a queried patient, particularly in medical grey zones where high-level evidence is scarce, in personalized management of multiple active comorbidities, and in off-label use of novel therapeutics. We present three case studies below to demonstrate how PMC-Patients can benefit clinicians in different ways. Specifically, we focus on retrieval of similar patients, since this is much less explored than relevant article retrieval. Table 5 summarizes the three cases under different scenarios: for each query patient, we present an example of the retrieved similar patients from PMC-Patients, with a corresponding description and the significance of the assistance in query-patient management.
The first case involves a diagnostic dilemma of early-onset idiopathic thrombocytopenia, with co-occurring, seemingly unrelated conditions of renal disease, hearing loss, and a suspicious family history. The top retrieved patient shows an MYH9 mutation [48], which is the exact etiology of this case. MYH9-related thrombocytopenia is extremely rare (1:20,000-25,000) [49] and is thus challenging for non-experts to diagnose. Other retrieval results also suggest other possible diagnoses, including Alport syndrome [50] and anti-basement membrane disease [51]. The system's capability to recognize associated features from multiple manifestations and propose insightful diagnoses is therefore greatly useful, especially for rare diseases.
The second case presents a female patient with a history of atrial fibrillation and deep venous thrombosis who shows acute hepatobiliary symptoms. ReCDS retrieves highly relevant cases covering the most common conditions, including cholecystitis [52], bile leak [53], and Mirizzi syndrome [54]. Impressively, ReCDS is able to bring up potentially dangerous bleeding complications (hemobilia) by suspecting anticoagulation use from her cardiac and thrombotic comorbidities [55]. This requires further monitoring and testing, thus standing as an important reminder in busy clinics where non-major medical problems can be easily overlooked.
The third case asks an open question about the treatment of metastatic melanoma failing standard care, pursuing answers in precision medicine similarly to the TREC PM 2020 track [56]. The retrieved cases include attempts at ipilimumab/nivolumab rechallenge, BRAFi and MEKi rechallenge [57], and single-agent PD-1 inhibitor [58], each of which provides sound evidence with a detailed clinical course for an oncologist's reference. Additionally, the approach itself favors effective treatment combinations (paradoxically thanks to positive report bias), and thus dynamically encourages evidence accumulation towards more promising directions, facilitating future clinical trial designs.
In conclusion, ReCDS can benefit clinicians in various ways, by recognizing rare diseases, overcoming testing blind spots, and advancing treatment evidence.With its potential to improve quality of medical care, ReCDS is especially valuable for clinicians in this era of precision medicine and personalized health.

Limitations and future work
Our experiments demonstrate that there is still much room for improvement on the PMC-Patients ReCDS benchmark. We outline some potential directions for further research:
1. Many patient summaries in PMC-Patients far exceed BERT's 512-token limit, and the truncation applied in our baselines suffers from inevitable information loss. Retrieval performance may therefore be further enhanced by efficient Transformers [59] such as BigBird [60] and Longformer [61].
2. Reranking with pre-trained encoders based on cross attention, including cross-encoders [62], poly-encoders [63], and ColBERT [64], may significantly improve retrieval performance.
3. Our experiments indicate that both lexical and semantic features are crucial for ReCDS. Previous research has explored combining sparse and dense retrieval in the general domain [65,66], which may also be useful for the PMC-Patients benchmark.
ReCDS has also been shown to be helpful in various clinical tasks, including question answering [67] and patient outcome prediction [68], where retrieved relevant articles serve as additional evidence for the model to refer to. More recently, with the huge success of Large Language Models (LLMs), many studies have explored further augmenting LLMs with retrieved evidence [69,70,71,72]. PMC-Patients can serve both as a benchmark for training and evaluating retrieval systems and as an evidence collection for improving clinical tasks and augmenting clinical LLMs such as ChatDoctor [73].

Conclusion
In this paper, we present PMC-Patients, a large-scale, diverse, and publicly available patient summary dataset with patient-article relevance and patient-patient similarity annotations. Based on PMC-Patients, we formally define two tasks and provide the largest-scale dataset to benchmark ReCDS: Patient-to-Article Retrieval (ReCDS-PAR) and Patient-to-Patient Retrieval (ReCDS-PPR). We evaluate various ReCDS systems on the PMC-Patients ReCDS benchmarks and show that both tasks are quite challenging, calling for further research. We also conduct several case studies on our proposed ReCDS benchmark to show the clinical utility of our dataset.

Figure 1: Overview of the PMC-Patients dataset architecture. Patient summaries are extracted by identifying certain sections in PMC articles. The cited articles and patients are considered relevant and similar, respectively. Patients from the same report are also considered similar.

Figure 2: Overview of the PMC-Patients ReCDS benchmark. Given a query patient, there are two tasks: 1. Patient-to-article retrieval requires returning relevant articles from PubMed; 2. Patient-to-patient retrieval requires returning similar patients from PMC-Patients.

Figure 3: (a) Length distributions of PMC-Patients compared to MIMIC-IV discharge summaries and TREC CDS descriptions (x-axis truncated). (b) Patient age distributions of PMC-Patients compared to MIMIC-IV. *Exact ages of patients older than 89 years are obscured in MIMIC and thus taken as 90 in the figure. (c) Relative frequency of the top 30 ICD codes in MIMIC-IV (left) and MeSH Diseases terms in PMC-Patients (right). Colors encode relative frequency, as illustrated by the attached color bar.

Figure B.5 shows the distributions of the human scores (x-axis) grouped by the relation annotations in PMC-Patients (Irrelevant vs. Relevant and Dissimilar vs. Similar). A t-test shows that patient-article and patient-patient pairs with PMC-Patients annotations have significantly higher human scores than those without (p < 0.01 for both cases). Besides, almost all positive pairs are considered relevant/similar by a human expert, indicating that the automatic relational annotations in PMC-Patients achieve quite high precision.

Figure B.5: Distributions of the human-annotated relevance (left) and similarity (right) scores grouped by PMC-Patients automatic annotations.

Table 1: Statistics of the ReCDS-PAR and ReCDS-PPR benchmarks. A/P: average number of relevant articles per query patient. P/P: average number of similar patients per query patient.

Table 3: Extraction quality of the PMC-Patients dataset and the two experts against the ground truth. Note span recognition is evaluated by F1 score. Age recognition is evaluated by min(annotated age, true age)/max(annotated age, true age). Gender recognition is evaluated by accuracy. All numbers are percentages.

Table 4: PAR and PPR performance of baseline retrievers (in percentage). Numbers in bold indicate the best results in each column. Precision (Prec) and nDCG are calculated at 10, and recall is calculated at 1,000.
Table 5 shows the three cases under different scenarios with query patient summaries, examples of similar patients retrieved from PMC-Patients, and demonstrations of the clinical significance. The detailed inputs and outputs for the case studies are shown in Appendix C.