Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking

Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm. Our approach involves two key components in a pipeline-based model: (i) a data enrichment technique for enhancing both queries and documents during the first retrieval stage, and (ii) a novel re-ranking schema that uses a Transformer network in a setup adapted to this task by leveraging the structure of the CT documents. We use named entity recognition and negation detection in both the patient description and the eligibility section of CTs. We further classify patient descriptions and CT eligibility criteria into current, past, and family medical conditions. This extracted information is used to boost the importance of disease and drug mentions in both query and index for lexical retrieval. Furthermore, we propose a two-step training schema for the Transformer network used to re-rank the results from the lexical retrieval. The first step focuses on matching patient information with the descriptive sections of trials, while the second step aims to determine eligibility by matching patient information with the criteria section. Our findings indicate that the inclusion criteria section of the CT has a great influence on the relevance score in lexical models, and that the enrichment techniques for queries and documents improve the retrieval of relevant trials. The re-ranking strategy, based on our training schema, consistently enhances CT retrieval and improves precision at retrieving eligible trials by 15%. The results of our experiments suggest the benefit of making use of extracted entities. Moreover, our proposed re-ranking schema shows promising effectiveness compared to larger neural models, even with limited training data.


Introduction
Clinical trials (CTs) are crucial to the progress of medical science, specifically in developing new treatments, drugs, or medical devices [22]. Awareness and access to these studies are still challenging both for patients and physicians, making the recruitment of patients a significant obstacle to the success of trials [20,22].
Even if a sufficient number of patients is found, the recruitment process requires screening the patients for eligibility, which is a labour-intensive task [5]. Automated identification of eligible participants not only promises great benefits for translational science [20] but also aids patients by allowing them to be included in specific trials [14].
In recent years, several initiatives have been proposed to build automatic systems for matching patients to CTs [32,14,24,26]. The task has been defined as an information retrieval problem under the patient-to-trials evaluation paradigm [25]. Here, the query is constituted by patient-related information, either in the form of electronic health records (EHRs) or ad-hoc queries, and the documents are the CTs [14].
This retrieval task involves the semantic complexity of matching the patients' information with heterogeneous, multi-fielded CT documents [28]. In addition, the eligibility criteria often use complex language structures (e.g. concepts can be negated twice) and medical jargon given in either semi-structured or unstructured ways [3].
To date, the existing approaches have revealed a significant lack of balance between efficiency and effectiveness. While pipeline-based models showcase promising performance, the substantial model sizes required to achieve competitive results raise concerns regarding costly deployment and limitations on reproducibility. This work presents a system for CT matching that uses data enrichment techniques to support CT search with a probabilistic lexical model as the first retriever, and a re-ranking setup with a Transformer network of moderate size.
On the one hand, we develop a data enrichment process for both queries and documents. It consists of entity recognition and negation detection modules, applied to both the patient's description and the eligibility section of CTs. The data enrichment process also provides the classification of both patient's descriptions and CT eligibility criteria into current, past and family-medical conditions. The extracted information boosts the importance of affirmative and negative mentions of diseases and drugs for both the documents and queries in the traditional retrieval scenario.
On the other hand, we define a training strategy for re-ranking trials using a pre-trained language model in a two-step schema that leverages the structure of CT by considering not only the traditional topical relevance objective but also the eligibility criteria. Taking the result from our first stage retrieval process, we then match patient's information with descriptive sections of the trials for re-ranking based on topical relevance. Later, we further train this model by matching patient data with trial eligibility criteria in an attempt to discriminate documents as eligible or excluded.
We evaluate our work on the TREC Clinical Trials track 2021 and 2022 collections, showing that our methods improve finding relevant trials. More specifically, our contributions are as follows:
1. We evaluate the utility of individual sections of CT text on the performance of the lexical retrieval system, showing that the inclusion criteria section alone contributes the most to the effectiveness of the search system.
2. We introduce a new query and document enrichment model that uses information extraction modules to handle challenges posed by unstructured EHR descriptions and eligibility criteria sections of CTs. The additional data explicitly highlight sections of patients' medical history and establish a novel way of handling negation in the eligibility criteria. Rather than relying on dictionaries to find these entities, we use neural network-based information extraction models.
3. We propose and develop an effective re-ranking setup adapted to CT retrieval considering different learning objectives for training. We evaluate its quality both on the general, pre-trained BERT model, as well as biomedical domain-focused versions.

Background
This section describes previous work on CT matching with various paradigms, approaches to extract information from clinical data and from patient-related information, and neural re-ranking for CT retrieval.

Clinical trials matching
The TREC Clinical Trials track concerns the task of matching single patients to clinical trials. Other tasks concerning CT matching mentioned in the literature are cohort-based retrieval [15] and trial-to-trial retrieval [37].
In the context of the TREC CT track, patient-related information is written as free-text, whereas the document collection consists of a snapshot of the ClinicalTrials.gov database. Each clinical trial contains multiple fields, including two titles (brief and official one), condition, summary, detailed description, and eligibility criteria. The content of these fields can range from structured (e.g. gender and age of eligible patients) through semi-structured (e.g. eligibility criteria section) to unstructured (e.g. description and summary). The eligibility criteria field contains inclusion and exclusion criteria, a core aspect of the CT matching task. Trials were judged using a graded relevance scale of three points: 0 if the patient is not relevant to the CT, 1 if the patient is topically relevant but excluded based on the eligibility criteria, and 2 when the patient fulfils the eligibility criteria.
The TREC CTs track differs from previous medical TREC tracks in several aspects. TREC Precision Medicine 2017-2020 [24] is concerned with matching CTs to a patient summary consisting of only the patient's disease, relevant genetic variants, and basic demographic information. On the other hand, TREC CT topics consist of an unstructured patient summary. TREC Clinical Decision Support 2014-2016 [23] used topics written similarly (free-text patient descriptions), but the task was to match patients to PubMed publications, instead of CT documents. Moreover, none of the previous tracks used a graded relevance scale focused on eligibility. Figure 1 provides an example of a patient's EHR description and of the sections from a relevant CT. Using a bag-of-words approach to tackle the patient-to-trial matching problem may pose difficulties as both the patient's description and the CTs contain many irrelevant terms, thereby introducing noise. Moreover, both can contain negated key terms (for instance, the exclusion criteria), the handling of which is essential for deciding eligibility but may not be trivial even when using n-grams or neural network-based models [9,34]. Additionally, the sections of queries and documents may have different importance because of their time dependency (i.e., past or present conditions) and because they can refer to either patients or patients' family medical history. One can try to overcome these issues by structuring both the query and documents, and extracting relevant items first.
Previous work attempted to solve a CT matching task using various lexical and neural models. Leveling [18] annotated a corpus with terms from medical dictionaries and with negations for retrieving trials for the TREC Precision Medicine track. A large number of systems reported in the TREC CT used variants of the Okapi BM25 model [12] or the Divergence from Randomness (DFR) model [1] that has demonstrated potential in the biomedical information retrieval field.

Information extraction from clinical data
Information extraction from clinical data has been an active area of research in recent years. Previous work has focused on automatic extraction of eligibility criteria from clinical trial protocols. For instance, Dasgupta et al. [2] presented a method for identifying and segmenting eligibility criteria into five semantic categories, including demographic information, health status, treatment history, laboratory test reports, and lifestyle. The EliIE system [13] was proposed for converting free-text eligibility criteria for clinical research into a structured and formalised format using a 4-step process including entity and attribute recognition, negation detection, relation extraction, normalization of concepts and output structuring.
Other studies aimed to extract information from patients' health records. The development and uptake of NLP methods for processing free-text clinical notes has been growing in recent years. A systematic review of the literature by Sheikhalishahi et al. [30] showed that there is a significant increase in the use of machine learning methods for NLP in clinical notes related to chronic diseases, and that deep learning is an emerging methodology. The ConText algorithm aims to determine whether conditions mentioned in clinical reports are negated, hypothetical, historical, or experienced by someone other than the patient [10]. The n2c2/OHNLP 2019 shared task [31] focused on extracting family history information from clinical notes. Garcelon et al. [8] utilised heuristics to detect medical history and negated terms in patients' records.

A prospective, open-label, non-interventional, multicentre study in adult patients with asthma or COPD who are treated with Salmeterol/fluticasone Easyhaler. During the study the Salmeterol/fluticasone Easyhaler will be used according to the Summary of Product Characteristics. Clinical effectiveness of the treatment will be evaluated with change in asthma or COPD symptoms during 12 weeks treatment.

Patient Description - #23
A 39-year-old man came to the clinic with cough and shortness of breath that was not relieved by his inhaler. He had these symptoms for 5 days during the past 2 weeks. He doubled his oral corticosteroids in the past week. He is a chef with a history of asthma for 3 years, suffering from frequent cough, wheezing, and shortness of breath and chest tightness. The symptoms become more bothersome within 1-2 hours of starting work every day and worsen throughout the work week. His symptoms improve within 1-2 hours outside the workplace. Spirometry was performed revealing a forced expiratory volume in the first second (FEV1) of 63% of the predicted. His past medical history is significant for seasonal allergic rhinitis in the summer. He doesn't smoke or use illicit drugs. His family history is significant for asthma in his father and sister. He currently uses inhaled corticosteroid (ICS) and fluticasone 500 mcg/salmeterol 50 mcg, one puff twice daily.

Figure 1: An example of a clinical trial and a description of a patient eligible for this trial. Highlighted items are described in detail in Section 3. Example adapted from Pradeep et al. [21].
Despite these efforts, there has been a lack of approaches that integrate information extraction techniques to enhance both query and document representation. Specifically, there is a lack of methods that effectively combine the extracted terms to determine eligibility ranking. This presents an opportunity for further exploration in the field.

Neural approaches for CT
Several approaches using Transformer-based architectures and pre-trained models, such as BERT [4], have achieved state-of-the-art effectiveness in some of the biomedical information processing applications. In CT retrieval, there have been multiple attempts to use BERT embeddings in both dual-encoder and cross-encoder retrieval setups with different pre-trained models such as BioBERT or ClinicalBERT [11,29,28]. These results correspond to implementations of methods applied to traditional ad-hoc retrieval tasks and have not outperformed multiple experiments under traditional retrieval models [26,27]. On the other hand, Pradeep et al. [21] proposed a multi-stage neural ranking system for the CTs matching problem, relying on T5-based models, currently with state-of-the-art results in multiple retrieval tasks, including CT.
According to the findings presented in TREC CT 2021 [26], T5-based models currently outperform smaller Transformer models in CT retrieval. In this paper, we propose an effective training strategy that takes into account various aspects of clinical trial retrieval, including topical relevance and eligibility criteria, as separate learning objectives. We evaluate its quality both on the general, pre-trained BERT model, as well as biomedical domain-focused versions. Our approach results in a strong competitor to T5-based models with a much simpler architecture, as demonstrated by the official results reported in TREC CT 2022 [27]. Specifically, our model performs second-best overall, outperformed only by the model proposed by Pradeep et al. [21] in the best-performing category.
These findings suggest that BERT-based models with our proposed training strategy can provide a viable alternative to T5-based models in clinical trial retrieval.

Methodology
This section describes the steps for processing CTs' and patients' descriptions used as input to probabilistic lexical models. We then define our two-stage neural re-ranking pipeline.

Clinical trial processing and ranking
We parse the content of a clinical trial document to split it into specific sections. The eligibility criteria section contains a crucial component of the trial used to determine whether a patient is eligible for a given trial. Our CT processing is focused on making the eligibility criteria as fine-grained as possible so we can easily discriminate aspects referring to medical history and drugs. We start by using a heuristic parser to split the eligibility criteria section of clinical trials into two lists containing inclusion and exclusion criteria, respectively.
We further classify each sentence from the inclusion and exclusion lists as concerning one of three sections: 'current medical condition', 'past medical condition' and 'family medical history'. Our motivation is that admission notes (which the topics simulate) consist of several sections that do not have equal impact on the patients' relevance to the trial and may even be irrelevant to their eligibility. Similarly, clinical trials can have different types of information stored in their eligibility section.
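The heuristic split into inclusion and exclusion lists can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser used in the experiments: the header patterns, the bullet-stripping rule, and the function names `split_criteria` and `_to_items` are all hypothetical. When no header is recognised, both lists are returned empty.

```python
import re

# Hypothetical header patterns; real trial records vary widely in wording.
INCLUSION_RE = re.compile(r"inclusion criteria\s*:?", re.IGNORECASE)
EXCLUSION_RE = re.compile(r"exclusion criteria\s*:?", re.IGNORECASE)


def _to_items(section):
    """Turn a section into a list of criteria, one per non-empty line."""
    items = []
    for line in section.splitlines():
        # Strip a leading bullet ("-", "*", "•") or enumerator ("1.", "2)").
        line = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if line:
            items.append(line)
    return items


def split_criteria(text):
    """Split an eligibility block into (inclusion, exclusion) lists."""
    inc = INCLUSION_RE.search(text)
    exc = EXCLUSION_RE.search(text)
    inc_text, exc_text = "", ""
    if inc is not None and exc is not None:
        if inc.start() < exc.start():
            inc_text = text[inc.end():exc.start()]
            exc_text = text[exc.end():]
        else:
            exc_text = text[exc.end():inc.start()]
            inc_text = text[inc.end():]
    elif inc is not None:
        inc_text = text[inc.end():]
    elif exc is not None:
        exc_text = text[exc.end():]
    # With no recognisable header, both sections stay empty.
    return _to_items(inc_text), _to_items(exc_text)
```

A rule-based splitter of this kind will miss trials whose criteria are written as running prose, which is one reason to fall back to empty sections instead of guessing.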
We then use a pre-trained entity extraction model together with an algorithm for determining negation to detect affirmative and negative drug and disease entities in both inclusion and exclusion sections.
In the next step, we remove double negations coming from negated exclusion criteria. For every entity in the exclusion criteria, we swap their modifier (from affirmative to negative and vice versa). For instance, the exclusion criterion 'Patients who are not smoking' becomes the inclusion criterion 'Patients who are smoking'. This step may not always be correct; nevertheless, we found it to be a good approximation, allowing us to collapse these two sections into one.
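The modifier swap can be illustrated with a short sketch. It assumes each extracted entity is represented as a `(text, is_negated)` pair; that representation and the function name `collapse_criteria` are hypothetical choices made for illustration.

```python
def collapse_criteria(inclusion, exclusion):
    """Merge inclusion and exclusion entities into a single list.

    Each entity is a (text, is_negated) pair. Entities coming from the
    exclusion section have their modifier flipped, so that an affirmative
    exclusion entity becomes a negated inclusion entity and vice versa.
    """
    merged = list(inclusion)
    for text, is_negated in exclusion:
        merged.append((text, not is_negated))
    return merged


# "Patients who are not smoking" as an exclusion criterion yields a
# negated "smoking" entity, which becomes affirmative after the swap.
merged = collapse_criteria(
    inclusion=[("asthma", False)],
    exclusion=[("smoking", True), ("copd", False)],
)
```

After the call, `merged` holds `("asthma", False)`, `("smoking", False)` and `("copd", True)`.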
The final result is a single list of extracted entities, classified by their section and modifier. All extracted keywords from a document $D_i$ can be described by the set $\{A_i^{cmc}, A_i^{pmc}, A_i^{fmh}, N_i^{cmc}, N_i^{pmc}, N_i^{fmh}\}$, where $A$ stands for a list of affirmative entities, $N$ for a list of negative entities, and $cmc$, $pmc$ and $fmh$ for current medical conditions, past medical conditions, and family medical history, respectively.
We can then enrich the CT document representation by expanding it with the extracted keywords. However, in order to preserve the semantic information about each extracted entity (section and modifier information), we use prefixing with special tokens. Furthermore, as many of these entities are multi-word expressions, we concatenate the tokens using the underscore character '_' to create a single token. Specifically, we create new tokens by adding the prefixes 'cmc', 'pmc' and 'fmh' for each respective section and additionally 'no' when an entity is negated (e.g. $N_i^{pmc}$ = ['myasthenia gravis', 'shortness of breath'] becomes ['pmc_no_myasthenia_gravis', 'pmc_no_shortness_of_breath']). We append the new tokens to the pre-processed document and use the enriched document to create an index for the lexical retrieval models.
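The prefixing scheme can be sketched in a few lines. The input layout (a dictionary keyed by section and negation flag) and the function name `enrich_tokens` are assumptions made for this illustration; only the resulting token format follows the description above.

```python
def enrich_tokens(entities):
    """Build prefixed single-token keywords from extracted entities.

    `entities` maps (section, is_negated) to a list of entity strings,
    where section is one of 'cmc', 'pmc' or 'fmh'.
    """
    tokens = []
    for (section, is_negated), mentions in entities.items():
        for mention in mentions:
            parts = [section]
            if is_negated:
                parts.append("no")
            # Multi-word entities are joined with underscores.
            parts.extend(mention.lower().split())
            tokens.append("_".join(parts))
    return tokens


extra = enrich_tokens({
    ("pmc", True): ["myasthenia gravis", "shortness of breath"],
    ("cmc", False): ["asthma"],
})
```

The returned tokens would then be appended to the pre-processed document (or query) token list before indexing.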

Patient description processing
As we are essentially aiming to match the patient to the CT criteria, we follow a similar approach to enrich the input query. A patient's description is split into the sections of current medical conditions, past medical conditions, and family medical history. As for the trials, we run an entity and negation detection algorithm for each section. Extracted keywords for query $Q_j$ can be represented as $\{A_j^{cmc}, A_j^{pmc}, A_j^{fmh}, N_j^{cmc}, N_j^{pmc}, N_j^{fmh}\}$, where each element contains a list of extracted entities. Finally, after tokenisation, the query for lexical models containing the original patient description is enriched by appending the extracted entities.

Filtering
Following approaches from previous work [18,16,28], we employ filtering on the age and gender to eliminate trials for which patients would be excluded based on the demographic criteria. We parse the age and gender information from patient descriptions for all patients. In clinical trials, this corresponds to 'minimum_age', 'maximum_age' and 'gender' fields. In this step, we remove the trials for which the patient is ineligible based on these two criteria.
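A demographic filter of this kind can be sketched as below. The value formats are assumptions: age bounds are taken to be strings such as '18 Years' (anything else, including a missing value, is treated as "no bound"), and gender is taken to be 'All', 'Male' or 'Female'. The helper names are hypothetical.

```python
import re


def parse_age_years(value):
    """Parse strings such as '18 Years'; return None if no bound is usable."""
    match = re.fullmatch(r"\s*(\d+)\s+years?\s*", value or "", re.IGNORECASE)
    return int(match.group(1)) if match else None


def is_demographically_eligible(patient, trial):
    """Check the patient's age and gender against the trial's fields."""
    lower = parse_age_years(trial.get("minimum_age"))
    upper = parse_age_years(trial.get("maximum_age"))
    if lower is not None and patient["age"] < lower:
        return False
    if upper is not None and patient["age"] > upper:
        return False
    gender = (trial.get("gender") or "all").lower()
    return gender == "all" or gender == patient["gender"].lower()


def filter_trials(patient, trials):
    """Keep only trials whose demographic criteria the patient satisfies."""
    return [t for t in trials if is_demographically_eligible(patient, t)]


patient = {"age": 39, "gender": "male"}
trials = [
    {"id": "A", "minimum_age": "18 Years", "maximum_age": "N/A", "gender": "All"},
    {"id": "B", "minimum_age": "65 Years", "gender": "All"},
    {"id": "C", "gender": "Female"},
    {"id": "D", "maximum_age": "30 Years", "gender": "Male"},
    {"id": "E"},
]
kept_ids = [t["id"] for t in filter_trials(patient, trials)]
```

Treating unparseable bounds as "no bound" keeps the filter conservative: a trial is removed only when the available fields clearly exclude the patient.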
Furthermore, we try rule-based parsing to extract information about smoking and alcohol consumption from both patients and clinical trials. Similarly to the demographic filters, we use this information to filter out trials for which the patient is ineligible.

Re-ranking
Taking advantage of the structure of the documents and the topic processing discussed in Sections 3.1 and 3.2, respectively, we define a training schema with two objectives. We are inspired by curriculum learning, an approach that decomposes complex knowledge and designs a curriculum for learning concepts from simple to hard [38], and follow the heuristic that the CT retrieval task can be decomposed into two sub-tasks. First, we set the retrieval objective, which simply relies on discriminating topical relevance (both eligible and excluded documents are relevant). Second, we set the objective of eligibility classification (only eligible documents are relevant).
We use the pre-trained language model BERT [4] in the standard cross-encoder neural ranking setup. For fine-tuning, a linear combination layer is stacked atop the Transformer network, whose parameters are tuned with a ranking loss function. We use a pairwise loss function and train the model for re-ranking outputs from the process described in previous sections. The model is trained for these two objectives consecutively, such that there are two instances of the same model, each optimised with a pairwise loss in which $d^{(+)}$ and $d^{(-)}$ denote the embedded relevant and non-relevant documents for the topic representation $q$, $\phi$ represents the model's parameters including the final linear layer, and $s$ is the predicted score.
As shown in Figure 2, we match patient information with descriptive sections of the trials for re-ranking based on topical relevance ($d^{(+)}$ corresponds to sections of relevant trials). We consider the eligibility classification a harder task. Moreover, we hypothesise that for this task, the model could benefit from the knowledge that it already has from the previous training. We further train this model by matching patients' information with criteria in an attempt to discriminate documents as eligible or excluded ($d^{(+)}$ corresponds to trials categorised as eligible, and $d^{(-)}$ corresponds to trials categorised as relevant but discarded).
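To make the pairwise objective concrete, the sketch below computes one common pairwise ranking loss, the margin (hinge) loss $\max(0, m - s(q, d^{(+)}; \phi) + s(q, d^{(-)}; \phi))$, averaged over pairs. The hinge form and the margin value are assumptions made for illustration; the text above only states that the loss is pairwise. Scores are plain floats here, standing in for cross-encoder outputs.

```python
def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over (positive, negative) score pairs.

    pos_scores[i] is the score s(q, d+) and neg_scores[i] the score
    s(q, d-) for the i-th training pair. The hinge form and the default
    margin are illustrative assumptions.
    """
    if len(pos_scores) != len(neg_scores):
        raise ValueError("each positive score needs a matching negative score")
    if not pos_scores:
        return 0.0
    losses = [
        max(0.0, margin - s_pos + s_neg)
        for s_pos, s_neg in zip(pos_scores, neg_scores)
    ]
    return sum(losses) / len(losses)


# First pair is separated by more than the margin (loss 0.0);
# second pair is ranked the wrong way round (loss 1.5).
loss = pairwise_margin_loss([2.0, 0.5], [0.0, 1.0])
```

Under this form, a pair contributes nothing once the relevant document outscores the non-relevant one by at least the margin, so training effort concentrates on pairs that are still ordered badly.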
This process results in two different models. During inference time, we follow a similar schema: we take the BM25 rank and re-rank twice the top-k ranked trials using the two resulting models, respectively. When referring to this re-ranking procedure we call it TCRR: Topical and Criteria Re-Ranking.
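The inference procedure can be sketched as two successive sorts over the top-k candidates. How the two passes are combined is an assumption here: the sketch applies the topical model first and the criteria model second, relying on Python's stable sort so that ties in the second pass keep the topical order. The function name `tcrr_rerank` and the scoring callables are hypothetical stand-ins for the two trained models.

```python
def tcrr_rerank(ranking, topical_score, criteria_score, k=50):
    """Re-rank the top-k documents twice; leave the tail untouched.

    `ranking` is a best-first list of document ids from the first stage.
    `topical_score` and `criteria_score` map a document id to a score.
    """
    head, tail = ranking[:k], ranking[k:]
    # First pass: topical relevance.
    head = sorted(head, key=topical_score, reverse=True)
    # Second pass: eligibility criteria (stable, so ties keep pass-one order).
    head = sorted(head, key=criteria_score, reverse=True)
    return head + tail


topical = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 1.0}
criteria = {"a": 0.7, "b": 0.7, "c": 0.2, "d": 0.0}
reranked = tcrr_rerank(["a", "b", "c", "d"], topical.get, criteria.get, k=3)
```

With `k=3`, document `d` lies outside the re-ranked window and stays in last place regardless of its scores.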

Experiment setup
This section details the datasets and baselines we have employed as well as the evaluation procedure. Additionally, we discuss the implementation of the methods described in the paper.

Dataset
The corpus released by TREC contains 375,580 clinical trials. In 2021, 75 topics (patient notes) were used, and 50 more were created in 2022. There are 35,832 relevance judgements in 2021 and 35,394 in 2022. More details of the datasets can be found in Table 6 of Appendix A. Clinical trial documents released by TREC are in XML format and consist of several sections. In our experiments we consider the following sections: brief title, official title, description, summary, conditions and criteria. For our experiments, we use the sets of topics as they were provided. For neural re-ranking, we present experiments using the topics from 2021 for training and from 2022 for testing and vice-versa. Additional splitting for training and development for neural models is described in Section 4.4.

Evaluation
We follow the evaluation procedure from the TREC Clinical Trials track, which is the standard evaluation procedure for ad-hoc retrieval tasks. As the relevance assessment is given in a graded relevance scale (eligible, excluded, or not relevant), the performance of the models is measured using normalised discounted cumulative gain (nDCG). We present results as reported by TREC, using nDCG@5 and nDCG@10, Precision (P@10), and Reciprocal Rank (RR).
We treat unjudged documents as non-relevant, ensuring that our results are not biased towards models that retrieve a large number of unjudged documents. Furthermore, we focus on Precision as the primary metric for optimising retrieval models. Our goal is to identify eligible trials, and Precision provides strict feedback to achieve this aim.
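The metrics can be sketched as follows for a single topic. Several details are assumptions: nDCG uses the relevance grade directly as gain with a log2 rank discount, and P@k and RR count only grade 2 (eligible) as relevant. Unjudged documents receive grade 0, as described above. The function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, qrels, k):
    """nDCG@k; documents missing from qrels count as grade 0."""
    gains = [qrels.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(ranked_ids, qrels, k, min_grade=2):
    """Fraction of the top-k documents judged at least `min_grade`."""
    hits = sum(1 for doc in ranked_ids[:k] if qrels.get(doc, 0) >= min_grade)
    return hits / k


def reciprocal_rank(ranked_ids, qrels, min_grade=2):
    """Inverse rank of the first document judged at least `min_grade`."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if qrels.get(doc, 0) >= min_grade:
            return 1.0 / rank
    return 0.0


qrels = {"t1": 2, "t2": 1, "t3": 0}   # 2 = eligible, 1 = excluded, 0 = not relevant
ranked = ["t2", "tX", "t1"]           # "tX" is unjudged
p3 = precision_at_k(ranked, qrels, 3)
rr = reciprocal_rank(ranked, qrels)
ndcg3 = ndcg_at_k(ranked, qrels, 3)
```

In this example the excluded trial `t2` contributes to nDCG but not to P@3 or RR, which is why the latter two react more strongly to eligibility.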

Baselines
As discussed in Section 3.4, for our custom re-ranking we train two different models: TopicalRR and CriteriaRR. When used independently, we consider them as baselines:
TopicalRR: The model trained for re-ranking based on the topical objective is initialised with the weights of bert-base-uncased (as well as two other domain-specific models: BioBERT and Clinical-BERT).
CriteriaRR: The model trained for re-ranking based on the eligibility criteria classification objective is initialised with the weights of the TopicalRR. We further train this model.
Additionally, we consider the following two neural models as baselines:
TraditionalRR: A cross-encoder trained in the traditional way from the same bert-base-uncased checkpoint, used to compare our proposed training procedure with traditional training.
MonoBERT: One of the comparable models implemented from the TREC CT track. It is based on the cross-encoder architecture and trained on data drawn from the corpus in a weakly supervised fashion [27].

Implementation details
We use ScispaCy [19] and medspaCy [6] to implement our entity extraction experiments. We apply the spaCy NER model trained on the BC5CDR dataset to detect disease and drug mentions.
We use the ConText algorithm [10] to determine whether extracted entities are negative or affirmative. While more recent algorithms are available for identifying assertions in clinical text [36], we opted for the ConText algorithm due to its established track record and its availability inside the medspaCy library. Moreover, as our approach is focused on enriching not only 125 queries but also 375,000 clinical trial documents, an additional criterion for selecting the ConText model was its scalability.
Text is lowercased and tokenised with spaCy's en_core_sci_lg model; punctuation and stopwords are removed. As the main lexical retrieval model, we use BM25+ [35] "out-of-the-box", i.e. without parameter optimisation, implemented in the Rank-BM25 5 Python package. Furthermore, for the first two experiments, we also test two other lexical models, namely TF-IDF [33] and the DFR model based on inverse document frequency with Bernoulli after-effect and H2 normalisation (In_expB2) [1], both implemented in the Terrier search engine 6 .
On the other hand, we use PyTorch Lightning [7] and Transformers 7 to implement the neural re-ranking pipeline. As discussed in Section 3.4, we train different models for re-ranking with different configurations (see Section 4.3). The TopicalRR, after splitting the datasets into train (60%), development (10%), and test (30%) is trained on 8192 samples from the training set per epoch divided into batches of 16 samples. We further train this model on 1024 samples with batches of 16 to get to the CriteriaRR. Samples for these two models were selected as described in Section 3.4 and as shown in Figure 2. We pick positive samples only present in BM25 rankings as well as hard negatives from ranked-irrelevant or unlabeled documents. We re-rank top-50 trials from the BM25 run 8 . Finally, to compare our proposed training procedure with the traditional training of a cross-encoder, we train the TraditionalRR from the same checkpoint bert-base-uncased, on 2048 samples, where relevant documents are only those categorised as eligible.
All neural models are trained for ten epochs, with early stopping based on Precision. Our training was performed on an Nvidia Quadro RTX 8000 GPU.

Results
We begin by assessing the effectiveness of using clinical trial sections. Subsequently, we examine the influence of extracted entities and filtering techniques. Then, we conduct neural re-ranking experiments.

Clinical trials sections
We first evaluate the utility of different sections of CTs. We extracted inclusion and exclusion sections for 91% of clinical trials. For the remaining 9% of trials, we assume that both criteria sections are empty. We create several indexes and retrieval models with different combinations of sections as input features. The results for the BM25+ model are presented in Table 1. The first eight rows represent results when only one section was used to create an index, whereas the remaining rows present runs conducted on the concatenations of selected sections. Results for In_expB2 and TF-IDF retrieval models are presented in Table 7 of Appendix B. Among single-section runs, the usage of the inclusion field alone yields the highest results for Precision@10 and nDCG@5, both for 2021 and 2022 data. Moreover, for 2021 topics, the inclusion section also achieves the highest nDCG@10 and RR among all single-section runs, and it is on par with the run that uses all sections except the criteria combined (run 6 versus run 13).

5 https://github.com/dorianbrown/rank_bm25
6 http://terrier.org
7 https://github.com/huggingface/transformers
8 We ran experiments changing the cutoff from 20 to 100 with a step of 10 to find 50 as the optimal cutoff.
Notably, for 2022, the summary field achieves the highest RR among all single-field runs. This is true for all three retrieval models. This can be caused by having the first relevant trial more generic (i.e. covering broader or more common diseases) and relevant but not necessarily specific to the patient's conditions. Figure 4 of Appendix C shows a topic-by-topic comparison for RR and P@10 for the BM25+ model. We can observe that there are still some topics for which the model using the inclusion section achieves a higher RR score than the summary field.
Concatenating more sections to create an index improves the on-average nDCG scores. However, this does not always hold for the metrics that consider the distinction between eligible and ineligible (P@10 and RR).
The exclusion section achieves the worst results from all single section runs (run 7), even when compared to runs using only the title of a clinical trial. Moreover, simply adding the text from the exclusion section for the bag-of-words approaches decreases the retrieval performance when compared to using the inclusion section only (run 16 versus 14). These outcomes motivate our subsequent experiments and document enrichment techniques described in Section 3.1, where we try to structure the knowledge contained in the eligibility section to take advantage of the available data.
The results for In_expB2 and TF-IDF (Table 7 of Appendix B) models follow a similar trend, with the differences for 2022 data even higher than for the BM25+ model. This outcome shows that our findings can be generalised to other lexical models.

Impact of extracted entities
To determine the impact of the extracted entities, we selected the optimal configuration of input sections from the previous step, which used the summary, description, titles, conditions, and inclusion criteria (run 14). We use these sections as a base document representation and enrich it with different combinations of extracted entities: c (only current medical conditions), cf (current and family medical history), cp (current and past medical conditions), and cfp (current, family and past medical conditions). The results for the BM25+ model are presented in Table 2. Using extracted items from patients positively impacts the final score. The highest Precision scores are achieved with extracted affirmative and negated entities for the current and family medical history. The low impact of past medical conditions can be explained by the infrequent occurrence of this data in patient descriptions in the TREC dataset and by the quality of the ConText algorithm. Extracted entities contribute more positively to the measures where judgements distinguish between eligible and ineligible patients. The best-performing model (14d) comprises all available extracted data (affirmative and negative entities for current, past and family medical history) to enrich the index. This tells us that our proposed method can potentially improve the retrieval with complex negated sentences. However, the relative performance gain is low, and a detailed analysis is needed to understand how it can be further improved.
An example of extracted entities is presented in Table 3. As can be seen, both the entity extraction and the section classification models generate false positives and false negatives, which influences the final retrieval result. Further fine-tuning on domain data could improve the quality. Results for In_expB2 and TF-IDF retrieval models are presented in Table 8 of Appendix B. The In_expB2 model on TREC CT 2021 data is the only one for which our query and document enrichment techniques do not improve results. We hypothesise that this is the case as the starting model (run 14) was already a very strong model compared to other baselines. For the TF-IDF model, we can observe that the enrichment with current and past medical entities yields the best results both for 2021 and 2022 data. Figure 5 of Appendix C presents a topic-by-topic analysis of the results in terms of the number of relevant trials ranked in the top 20 using lexical models. We can observe an incremental gain both from extracted entities and filtering.

Effectiveness of filtering
Next, we test several filtering methods as described in Section 3.3. As a base run, we take our best configuration from the previous experiment: the BM25+ run enriching data with current medical conditions and medical history of the patient and family (run 14d). Results for TREC CT 2021 are presented in Table 4. Our filtering results align with other researchers' results, confirming that utilising age and gender fields can improve the quality of the final matching. The usage of both filters (run e) removes, on average, 26.3% of the trials from the top 1000 retrieved documents for all topics of the 2021 collection, improving the P@10 score by 4.9 percentage points over the unfiltered run. Of these two fields, the age filter has the greater impact, and the run using it is significantly better than the base run.
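The age and gender filtering can be sketched as a simple post-processing step over a ranked list. This is a minimal illustration under stated assumptions: the `Trial` fields, the use of `None` for a missing bound or unknown patient attribute, and the policy of keeping a trial whenever information is missing are all choices made for this sketch, not details taken from Section 3.3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    trial_id: str
    min_age: Optional[float] = None  # years; None means no lower bound
    max_age: Optional[float] = None  # years; None means no upper bound
    gender: str = "all"              # "all", "female", or "male"


def passes_filters(trial: Trial,
                   patient_age: Optional[float],
                   patient_gender: Optional[str]) -> bool:
    """Return False only if the patient clearly violates a trial constraint."""
    if patient_age is not None:
        if trial.min_age is not None and patient_age < trial.min_age:
            return False
        if trial.max_age is not None and patient_age > trial.max_age:
            return False
    if patient_gender is not None and trial.gender not in ("all", patient_gender):
        return False
    return True


def filter_ranking(ranking: List[Trial],
                   patient_age: Optional[float],
                   patient_gender: Optional[str]) -> List[Trial]:
    """Drop trials the patient cannot join, preserving the original order."""
    return [t for t in ranking if passes_filters(t, patient_age, patient_gender)]
```

Keeping trials when the age or gender is unknown is a conservative design choice: the filter can only remove trials, so a false removal costs more than a missed one.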
On the other hand, smoking- and alcohol-related filtering does not improve the results further (runs f and g). We grouped these filters together because our algorithm identified no smokers and only nine drinking patients in the TREC CT 2021 topics. Despite these few mentions, we observe a deterioration of the results.

Neural re-ranking
Table 5 shows the results of the re-ranking procedure discussed in Section 3.4. We used different models to re-rank the results of a BM25 ranking. Models were trained on the 2021 data, and we report the evaluations on the 2022 data. The result of the TCRR model corresponds to the official TREC CT 2022 evaluation [27].
As we hypothesised, in the context of CTs the model benefits from the decomposition of the retrieval problem into two objectives: TCRR (see Section 3.4), the model exposed to both learning objectives, performs best. We also provide results for TopicalRR and CriteriaRR, which are the models exposed only to the first (topical relevance) and only to the second (eligibility classification) learning objective, respectively. Additionally, we present results for the regular re-ranking setup, TraditionalRR.
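To make the two learning objectives concrete, the following minimal sketch builds separate training examples for each of them from graded judgements. It assumes grades of 0 (not relevant), 1 (excluded), and 2 (eligible), and that each trial is split into a descriptive part and a criteria part. The dictionary keys, the function name `build_examples`, and the exact label mapping are hypothetical and serve only as an illustration of the decomposition, not as the training code behind Table 5.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (patient text, trial text, binary label)


def build_examples(patient: str,
                   trials: Dict[str, Dict[str, str]],
                   grades: Dict[str, int]) -> Tuple[List[Example], List[Example]]:
    """Build training examples for the two objectives.

    Topical objective: patient vs. descriptive sections,
        label 1 if the trial is judged relevant at all (grade >= 1).
    Eligibility objective: patient vs. criteria section,
        label 1 if eligible (grade 2), 0 if excluded (grade 1);
        non-relevant trials are skipped for this objective.
    """
    topical: List[Example] = []
    eligibility: List[Example] = []
    for trial_id, sections in trials.items():
        grade = grades.get(trial_id, 0)  # unjudged trials count as non-relevant
        topical.append((patient, sections["descriptive"], int(grade >= 1)))
        if grade >= 1:
            eligibility.append((patient, sections["criteria"], int(grade == 2)))
    return topical, eligibility
```

Under this sketch, a model would first be trained on the topical examples and then on the eligibility examples, which mirrors the order of the two steps described in the text.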
For this set of experiments, we are mainly interested in the evaluation in terms of Precision since, in a real scenario, only eligible trials are considered. Given that, on average, other proposed systems perform poorly, as shown by the TREC CT median results [26,27], a precision (P@10) anywhere near 50% is regarded as a good result for this task. We analyse the results of the proposed approach and find a significant improvement of the TCRR models (TCRR and TCRR Bio) over BM25 at a 95% confidence level. On average, this approach allows BERT-based models to gather more relevant documents in the top 10 than the selected baselines. We report results on different domain-specific pre-trained models that we trained following our proposed approach. Again, we evaluated the best-performing model, TCRR Bio, in terms of Precision and found the improvement statistically significant.
Both techniques applied to lexical models, namely extracting drug and disease entities and filtering by age and gender, have a positive impact on finding more eligible trials. However, only the run with filtering is able to retrieve consistently fewer ineligible trials than the baseline run. We can also see that, on average, our best non-neural run (14d-AG) retrieves twice as many trials for which a patient is eligible as trials from which the patient is excluded.
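The counts of eligible and excluded trials at a cutoff, and the corresponding Precision, can be computed as in the minimal sketch below. It assumes graded judgements in which 2 marks an eligible trial and 1 an excluded one, with unjudged trials treated as non-relevant. The function names and constants are illustrative and do not correspond to the evaluation tool used for the reported scores.

```python
from typing import Dict, List, Tuple

ELIGIBLE, EXCLUDED = 2, 1  # assumed judgement grades


def eligible_excluded_at_k(ranked_ids: List[str],
                           qrels: Dict[str, int],
                           k: int = 10) -> Tuple[int, int]:
    """Count eligible and excluded trials among the top-k ranked trials."""
    top = ranked_ids[:k]
    eligible = sum(1 for t in top if qrels.get(t, 0) == ELIGIBLE)
    excluded = sum(1 for t in top if qrels.get(t, 0) == EXCLUDED)
    return eligible, excluded


def precision_at_k(ranked_ids: List[str],
                   qrels: Dict[str, int],
                   k: int = 10) -> float:
    """P@k where only eligible trials count as hits."""
    eligible, _ = eligible_excluded_at_k(ranked_ids, qrels, k)
    return eligible / k
```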
Similarly, the TCRR neural re-ranking further increases the number of relevant trials, but it helps to remove ineligible trials only within the first 15 positions. One possible explanation is that we re-ranked only the top 50 trials retrieved by the first-stage ranker.
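Re-ranking only the head of the first-stage list can be sketched as follows. The function `rerank_top_k` and the `score_fn` callable, which stands in for a neural re-ranker's score for a given trial, are hypothetical. The sketch only illustrates that trials below the cutoff keep their first-stage order and can never move into the re-ranked head.

```python
from typing import Callable, List


def rerank_top_k(first_stage: List[str],
                 score_fn: Callable[[str], float],
                 k: int = 50) -> List[str]:
    """Re-order the top-k trials by score_fn and leave the tail untouched."""
    head, tail = first_stage[:k], first_stage[k:]
    # sorted() is stable, so ties keep their first-stage order.
    reranked = sorted(head, key=score_fn, reverse=True)
    return reranked + tail
```

With such a cutoff, an eligible trial ranked at position 51 or lower by the first stage cannot be promoted, which is consistent with the explanation offered above.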

Discussion
In this work, we revisit the pipeline-based model for patient-to-CT matching. First, we report an extensive set of experiments for the first stage retrieval and propose an effective enrichment procedure to get the best out of the initial ranks. Second, we propose an adaptation of training a cross-encoder to the CT problem, taking advantage of the structured nature of the considered documents and the task.
We find that the inclusion criteria section has the largest impact on the retrieval score for all three tested lexical models, meaning that these models cannot use all the available information. These outcomes motivate our further work on structuring queries and documents using entity extraction and negation classification methods. The results show improvements in finding relevant trials when applying data enrichment methods.
We show results for experiments on different configurations of our pipeline and compare our approach with different models previously used for the task. We focus on BERT-based models, which so far have not necessarily outperformed probabilistic lexical ranking models for the clinical trial matching task. Although the results in Table 5 show that changing the initial weights of the model (e.g., by choosing a domain-specific model such as BioBERT) can affect the overall performance, the improvements of our proposed approach are not due to the selection of a domain-specific pre-trained model, as the case of TCRR shows.
These results also provide an idea of which pre-trained model fits the task best. Overall, the TCRR initialised with BioBERT weights shows promising results, while ClinicalBERT weights were not the best choice in this scenario.
To our knowledge, this study is the first to focus on enriching documents and queries showing gains in the models' ability to find more eligible trials. Furthermore, our novel re-ranking concerning eligibility shows additional improvement for this task, comparable to the more expensive approach using the T5 architecture [21].
Our proposed re-ranking formula is different as it explicitly models the eligibility decisions instead of using only the topical relevance. This distinguishes our study from the previous works concerning clinical trial re-ranking [28].
Although this work focused on CT retrieval, we believe the approach can also be applied to other IR tasks that first involve ranking documents by topic and then tailor the retrieval results according to more specific criteria or constraints. One example of such a task is the selection of primary studies (citation screening) for systematic literature reviews [17].
There are several limitations of this study, both related to the dataset and the models. Usage of the TREC CT collection implies that the patient descriptions are relatively short, i.e., EHR admission note-style documents. We acknowledge that our approaches could have problems handling longer sequences.
Additional limitations are related to the amount of data available for training and evaluating systems on the CT retrieval task. This issue, in our study, explicitly affects the curriculum learning scenario in the eligibility determination objective. It may limit the model in learning relevant patterns needed to scale to different clinical settings or patient populations.
Furthermore, the topics are written only in English. This does not concern clinical trials, for which the ClinicalTrials.gov database is the leading international source. Nevertheless, multilingual medical retrieval may present challenges for both lexical and neural models, as the nuances and complexities of medical terminology can vary significantly across languages. Addressing these limitations and developing strategies for multilingual medical retrieval is an essential area for future research.

Conclusion
This paper presents an approach for clinical trial retrieval under the patient-to-trial paradigm. We investigate the impact of individual clinical trial sections showing that the 'inclusion' section alone contributes the most to the final retrieval score. Moreover, we evaluate the handling of complex eligibility criteria for matching patients to clinical trials by combining input from information extraction modules into a lexical retrieval model. The extracted drug and disease entities and their negations positively impact the retrieval of eligible trials. On the other hand, filtering based on gender and age proved to be successful in eliminating ineligible trials.
Additionally, we propose an effective training strategy for neural re-ranking of clinical trials based on two distinct learning objectives. The first objective is the traditional relevance objective, while the second objective focuses on giving importance to the eligibility criteria and involves a classification objective that distinguishes between eligible and discarded samples. Our results indicate that even with limited data, this model is capable of further improving the Precision of our approach. Even though our proposed system involves many single components, it showcases an alternative approach to the clinical trial matching problem, emphasising the importance of eligibility criteria. In future work, we plan to measure the impact of extracted entities on neural re-ranking models.

A Datasets summary
A summary of datasets is presented in Table 6.

B Other lexical models
Table 7 presents results for the impact of the clinical trial document sections on the ranking with the In_expB2 and TF-IDF models. Table 8 shows results for the query and document enrichment experiment with the In_expB2 and TF-IDF models.

C Topic-by-topic analysis
Figure 4 shows a topic-by-topic comparison of RR and P@10 for BM25+ using the inclusion section (run 6), the summary (run 4), and the summary, description, titles and condition sections concatenated (run 13). Figure 5 presents the number of relevant trials in the top 20 retrieved trials for the three best BM25+ runs from each experiment.