Journal of Biomedical Informatics: X



Introduction
The biomedical sciences are pioneers for open-access publication, with the PubMed database alone indexing over 27 million journal articles. Given the rich knowledge contained in these articles, insights drawn from the publications can be used to address a variety of biomedical problems. The sheer volume of unannotated text dwarfs that of the annotated documents, and hence it is imperative to utilize unsupervised machine learning models to capture the semantic meaning of words and phrases from such a large corpus, which in turn can be used for various downstream biomedical tasks.
For many Natural Language Processing (NLP) tasks based on vector space models, the text is transformed into meaningful vector representations to help improve performance. Recent efforts have introduced new neural network models that can induce semantically meaningful word representations (or embeddings) from large corpora [1,27,36,3]. Dense, low-dimensional vector representations of words are learned such that similar words are close in space. The ability to preserve semantic and syntactic similarities between words has been shown to be very useful in a variety of NLP tasks including information retrieval [12], part-of-speech (POS) tagging [9], text summarization [39,46], sentiment analysis [13,24], named entity recognition (NER) [23,42], synonym extraction [18] and relation extraction [19]. Moreover, several biomedical domain word representations have been created from biomedical literature [21,38], and the impact of training word vectors on corpora from various domains for downstream biomedical tasks is explored by [43,33].
Although word embeddings have achieved great success in word-oriented tasks such as NER and POS tagging, they perform poorly on phrase-oriented tasks such as Semantic Role Labeling [8]. The common approach to train state-of-the-art embeddings such as Word2Vec [25], GloVe [36], and FastText [3] is to learn the vector representation for each individual word. Phrase representations are then constructed using compositional approaches over the unigram vectors [45,47,22]. However, the compositional approaches (e.g., sums and products of the word vectors) are often order-insensitive and fail to capture the semantic meaning of the phrase [28]. Unfortunately, in the biomedical domain, many key concepts are expressed as multi-word phrases [20], which are therefore critical for capturing lexical semantics. Furthermore, biomedical phrases may be only weakly compositional, that is, their meaning is unlikely to be expressed based only on the meaning of their parts. As motivating examples, the phrases 'Glasgow Coma scale', 'open reading frame', and 'nuclear magnetic resonance' may not be well-expressed as a composition of the individual words. Therefore, it is important to build a distributed representation that captures not only single words but multi-word phrases as well.
Learning distributed phrase and word embeddings has been shown to be effective on a general, non-domain-specific corpus [26]. Yet, one of the key challenges is to identify useful phrases. While this task is well-studied, many of the existing works require annotation or extensive computation to achieve good performance [4,10,35,37,44]. A new unsupervised method has been proposed to collect over 700,000 common phrases that may be useful for biomedical NLP from PubMed articles [20]. Unfortunately, including all possible phrases in the embedding model significantly increases the computational complexity and negatively impacts the learned representations.
https://doi.org/10.1016/j.yjbinx.2019.100047
We propose PMCVec, an unsupervised method that generates useful phrases from the corpus and builds a distributed representation that contains both single words and multi-word phrases by treating each as a single term (or unit). For example, our method obtains similar representations for the pairs 'hypertension' and 'high blood pressure' as well as 'myocardial infarction' and 'heart attack'. In this paper, we consider a phrase to be a continuous sequence of two or more words with no stopwords or punctuation marks except for a hyphen. We used the standard NLTK stopword list (https://www.nltk.org/). We introduce a new criterion to rank the generated phrases that balances phrase frequency, phrase length, and the frequency of the individual words within the phrase. This step allows us to select only the k most useful phrases, where k is a hyperparameter that can be learned as well.
We compared our method against several existing embeddings: two general word embedding models and two biomedical domain word representations. Using five benchmark datasets for biomedical semantic similarity, we show that PMCVec achieves significant improvement over other models. We show that our distributed representation not only captures the semantic meaning of the phrases better than compositional methods, but also does not significantly degrade the single-word representations.
This paper is organized as follows. First, we describe the various steps in the PMCVec process including preparing the text data; generating, ranking and filtering phrases; and learning the term embeddings. We then describe experimental results on several biomedical term-similarity evaluation datasets. We conclude with a discussion of how our method compares to other similar techniques and what can be done to improve further.

Methods
In this section, we present our framework for computing the distributed phrase representations. PMCVec consists of multiple steps: (1) preprocessing the articles, (2) generating phrases from the articles based on chunking, (3) ranking and filtering the phrases, and (4) tagging the phrases and building the distributed phrase representation. Fig. 1 depicts the entire workflow.

Preprocessing
We used titles and abstracts from all the 27 million documents in PubMed. The National Library of Medicine produces the citation records (in XML format) for PubMed [29]. The XML files are parsed to collect titles and abstracts. These are merged into a single large document. We then cleaned the document by removing terms that consisted only of numbers or special characters. For example, in the sentence "in 29 (69%) patients, the cancer cells showed a strong immunoreactivity for PCNA" the number 29 and (69%) would be removed.
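The cleaning rule described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and treats any token without an alphabetic character as "only numbers or special characters"; the function name `clean_text` is hypothetical and the released PMCVec code may differ in detail.

```python
def clean_text(text):
    """Drop whitespace-separated tokens that contain no alphabetic character.

    Assumption: a token "consists only of numbers or special characters"
    exactly when none of its characters is a letter.
    """
    kept = [tok for tok in text.split() if any(ch.isalpha() for ch in tok)]
    return " ".join(kept)


sentence = ("in 29 (69%) patients, the cancer cells showed "
            "a strong immunoreactivity for PCNA")
cleaned = clean_text(sentence)
print(cleaned)
```

On the example sentence, the tokens `29` and `(69%)` are dropped while `patients,` is kept because it contains letters.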

Phrase generation
The next step in the process is to identify phrases from the corpus. Traditional techniques focus on identifying noun phrases since most meaningful phrases are of this form. These methods use predefined part-of-speech (POS) rules or learn those rules from annotated documents to chunk the text [4,44,37]. However, such rule-based methods usually suffer in domain adaptation and will miss out on meaningful non-noun phrases including 'multilocus sequence typing', 'calcitonin gene related peptide', 'electrophoretic mobility shift assay', 'zollinger ellison syndrome', and 'diffusion tensor imaging'. Other generic phrase generation techniques leverage frequency statistics in document collections by extracting all possible n-grams from the text and retaining the most popular concepts [35,10]. However, this approach enumerates all the possible n-grams and does not scale well for a large corpus. Instead, we use a conceptually simpler and more generic approach. Potential phrase boundaries are identified using stop words and punctuation [41]. Although this eliminates the possibility of stop words occurring in a phrase, it provides a more systematic methodology for generating variable-length n-gram phrases without having to specify the maximum number of terms ahead of time or enumerate all the possibilities. Thus, with the last example sentence in Fig. 1, the potential phrases from our chunking process are 'patients', 'cancer cells showed', 'strong immunoreactivity', and 'PCNA'. Since our interest is to generate meaningful phrases, we remove any single-word occurrences.
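The chunking step can be sketched as follows. This is an illustrative approximation only: the stop-word set below is a small hypothetical stand-in for the NLTK list, and the function name `chunk_phrases` is not taken from the PMCVec code.

```python
import re

# Hypothetical, abbreviated stop-word set (the paper uses the NLTK list).
STOPWORDS = {"in", "the", "a", "an", "for", "of", "and", "to", "is", "was", "were"}


def chunk_phrases(text, stopwords=STOPWORDS):
    """Return multi-word chunks delimited by stop words and punctuation."""
    phrases = []
    # Any punctuation other than a hyphen ends the current chunk.
    for segment in re.split(r"[^\w\s-]+", text):
        current = []
        for token in segment.split():
            if token.lower() in stopwords:
                if len(current) > 1:  # single words are discarded
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if len(current) > 1:
            phrases.append(" ".join(current))
    return phrases


candidates = chunk_phrases(
    "in patients, the cancer cells showed a strong immunoreactivity for PCNA"
)
print(candidates)
```

For the example sentence this yields the two multi-word candidates 'cancer cells showed' and 'strong immunoreactivity'; the single-word chunks 'patients' and 'PCNA' are dropped.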

Rank and filter
The third step in our workflow is to rank and filter the potential phrases. This is a necessary step as there is no guarantee that all the phrases generated in the previous step are meaningful. Moreover, incorporating all the phrases increases the computational and memory cost of the learning process, and may degrade the distributed word representations. Thus, it is important to rank the phrases using a metric and filter out those that do not meet certain criteria. Prior to ranking, we perform an initial filtering step that removes any phrases that do not appear sufficiently often in the corpus. While we set the minimum corpus frequency to 100, this number can be increased to further improve the speed of the ranking process. Thus, in our example in Fig. 1, 'paraffin-embedded bladder cancer section' did not occur frequently enough and was filtered out in this initial stage.
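A minimal sketch of the initial frequency filter is given below. The threshold of 100 comes from the text; the toy counts and the helper name `filter_by_frequency` are hypothetical.

```python
from collections import Counter


def filter_by_frequency(candidate_phrases, min_count=100):
    """Keep only phrases that occur at least `min_count` times."""
    counts = Counter(candidate_phrases)
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


# Toy candidate list with made-up occurrence counts.
toy_candidates = (["blood pressure"] * 120
                  + ["paraffin-embedded bladder cancer section"] * 3)
kept = filter_by_frequency(toy_candidates)
print(kept)
```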
After the initial filtering step, we rank the multi-word phrases to identify meaningful phrases based on their likelihood to occur in PubMed literature as coherent units. Although there are several common phrase ranking criteria [6,17], we found they offered a poor trade-off between phrase frequency, constituent word frequency, and phrase length. Thus, we propose our own ranking criterion, "Information Frequency (Info_Freq)", that provides a good balance. As an example, we filter out the phrase "cancer cells showed" in the filtering step of Fig. 1 since it has a low rank according to our criterion. Below, we describe Info_Freq and four of the commonly used phrase ranking metrics, and discuss the benefits and limitations of each.
1. Raw Frequency: A measure of the number of times the phrase appears in the entire corpus. With the removal of stop words, most of the phrases that occur very frequently are likely to be good phrases. However, the simple nature of this metric punishes meaningful phrases that do not appear often and predominantly favors 2-word phrases. Phrases like 'results suggest' and 'present study', which occur in most documents, are ranked high, but other important phrases like 'epithelial tissue' and 'acute respiratory failure' do not occur as frequently and consequently have a low rank.

2. Point-wise Mutual Information (PMI) [7]: A measure of how much information is gained about a particular word if you also know the value of a neighboring word. It is defined as:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)},$$

where $p(x)$ is the probability of the word $x$ occurring in a document, and $p(x, y)$ is the probability of both words $x$ and $y$ co-occurring in the same document. For a three-word phrase, we adapt the above formula as:

$$\mathrm{PMI}(x, y, z) = \log \frac{p(x, y, z)}{p(x)\, p(y)\, p(z)}.$$

PMI is often used to find good collocation pairs, as high PMI occurs when the probability of the co-occurrence is either higher or slightly lower than the probabilities of the occurrence of each word. Conversely, phrases that contain frequently occurring words will have small PMI scores even if the phrase is good. As an example, 'blood cells' should be an important and meaningful phrase. Unfortunately, the constituent words 'blood' and 'cells' occur frequently in the corpus. As a result, the phrase is ranked very low.

3. Jaccard's Coefficient (JC) [40]: A measure of the similarity and diversity of the entire phrase set. It is defined as the frequency of a phrase divided by the total number of phrases that contain at least one term in the phrase:

$$\mathrm{JC}(x, y) = \frac{\mathrm{freq}(x, y)}{\mathrm{freq}(x, y) + \mathrm{freq}(x, *) + \mathrm{freq}(y, *)},$$

where $\mathrm{freq}(x, *)$ denotes the frequency of any phrase that contains the term $x$ but not $y$. For a three-word phrase, we adapt JC as:

$$\mathrm{JC}(x, y, z) = \frac{\mathrm{freq}(x, y, z)}{\mathrm{freq}(x, y, z) + \mathrm{freq}(x, *) + \mathrm{freq}(y, *) + \mathrm{freq}(z, *)}.$$

Although the Jaccard index accounts for the diversity of the phrase, longer phrases are punished as there is a higher likelihood of at least one word appearing in another phrase. Thus, longer phrases like 'reverse transcription polymerase chain reaction' and 'cervical squamous cell carcinoma' will be ranked low even though they are meaningful phrases.

4. Word2Phrase: This is a method proposed by [26]. It is a data-driven approach where phrases are formed based on unigram and bigram counts:

$$\mathrm{score}(x, y) = \frac{\mathrm{freq}(x, y) - \sigma}{\mathrm{freq}(x) \times \mathrm{freq}(y)},$$

where $\sigma$ is used as a discounting coefficient to prevent too many phrases with infrequent words from being formed. This technique is applied in multiple passes to find longer phrases. For example, the phrase "blood cells" occurs 7000 times while "tagging snps" occurs only 350 times, but the latter will have a higher score since the constituent words "tagging" and "snps" are infrequent compared to the more frequently occurring words "blood" and "cells" in the first phrase. The discounting coefficient subtracts a constant so that phrases with much lower frequency but higher scores due to infrequent constituent words are penalized more. We provide an empirical example in the supplementary file.

5. Info_Freq: Our proposed measure of the association between words in the phrase that accounts for the phrase frequency, the constituent word frequencies, and the length of the phrase. For a two-word phrase "x,y", we calculate the Info_Freq as:

$$\mathrm{Info\_Freq}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)} \cdot \log(\mathrm{freq}(x, y)).$$

For a three-word phrase, we adapt the above formula as:

$$\mathrm{Info\_Freq}(x, y, z) = \log \frac{p(x, y, z)}{\mathrm{Info\_Freq}(x, y)\, p(z)} \cdot \log(\mathrm{freq}(x, y, z)).$$
In the above equation, we assume the two-word phrase (x,y) occurs more frequently than (y,z). Scores are calculated in increasing order of phrase length: all two-word phrase scores are calculated before any three-word phrases, and so on. For instance, to calculate the Info_Freq of the phrase "high blood pressure", we first calculate the score for the shorter phrase "blood pressure" and use this to get the score for the longer phrase. This is applied to phrases with more than three words as well. For the four-word phrase "chronic obstructive pulmonary disease", we calculate the score for "pulmonary disease", then for "obstructive pulmonary disease" and finally for "chronic obstructive pulmonary disease". In the attached supplementary file, we provide detailed examples of how the scores are calculated for longer phrases.

Table 1 shows the top 10 phrases from all 27 million PubMed abstracts based on each of the five above criteria. Both the frequency and JC rankings contain only 2-word phrases. Moreover, the top-ranked phrases by frequency are not medically meaningful. PMI and Word2Phrase are also biased towards short phrases, mostly consisting of two words. On the other hand, the top 10 phrases using Info_Freq contain a good mix of long and short phrases that are biomedically relevant terms. We get 2-word, 3-word, 4-word and 5-word phrases using Info_Freq. Since our goal is to minimize the number of phrases to embed while keeping the most important ones, Info_Freq allows us to extract quality phrases with different numbers of words.

Fig. 1. The text data is preprocessed and chunked to obtain candidate phrases. The phrases are ranked using our proposed Information Frequency criterion, and then filtered. The resulting phrases are tagged to form a single unit and the tagged text is passed into a standard word embedding model. Each term is then represented using a dense vector that maintains semantic similarity and relatedness.
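To make the two-word scores concrete, the sketch below computes PMI, the Word2Phrase score, and Info_Freq for the "blood cells" and "tagging snps" example. Several things here are assumptions rather than the paper's implementation: the unigram counts are invented for illustration, probabilities are estimated as count divided by the 27 million documents, the discount value σ = 100 is arbitrary, and the two-word Info_Freq is taken to be PMI multiplied by the log of the phrase frequency, by analogy with the three-word formula. Jaccard's coefficient and the recursive scoring of longer phrases are omitted.

```python
import math

N_DOCS = 27_000_000  # corpus size reported in the paper

# Phrase counts are the ones quoted in the text; unigram counts are hypothetical.
phrase_count = {("blood", "cells"): 7_000, ("tagging", "snps"): 350}
word_count = {"blood": 100_000, "cells": 200_000, "tagging": 2_000, "snps": 5_000}


def prob(count):
    """Assumed estimate: relative frequency over the whole corpus."""
    return count / N_DOCS


def pmi(x, y):
    joint = prob(phrase_count[(x, y)])
    return math.log(joint / (prob(word_count[x]) * prob(word_count[y])))


def word2phrase(x, y, sigma=100):
    """Discounted bigram score; sigma is an illustrative value."""
    return (phrase_count[(x, y)] - sigma) / (word_count[x] * word_count[y])


def info_freq(x, y):
    """Assumed two-word form: PMI weighted by log phrase frequency."""
    return pmi(x, y) * math.log(phrase_count[(x, y)])


for x, y in phrase_count:
    print(f"{x} {y}: PMI={pmi(x, y):.3f} "
          f"W2P={word2phrase(x, y):.2e} Info_Freq={info_freq(x, y):.3f}")
```

With these toy counts, both PMI and the Word2Phrase score rank "tagging snps" above "blood cells", which is the behaviour the text attributes to metrics that reward infrequent constituent words.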

Tag and build embeddings
The final step in our workflow is to tag the selected phrases as single terms and then build the distributed word embeddings. The tagging process reformats the original phrase by joining the constituent words using the '_' symbol. This ensures that the phrase is considered a single term (or unit) in the embedding process. For example, 'proliferating cell nuclear antigen' is tagged as 'proliferating_cell_nuclear_antigen' in the original corpus.
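The tagging step can be sketched as a simple string rewrite. The sketch below assumes lower-cased text and replaces longer phrases first so that a shorter phrase does not split a longer one; the paper describes longest-first tagging for the PubMed Phrases comparison, and applying the same order here is an assumption. The function name `tag_phrases` is hypothetical.

```python
import re


def tag_phrases(text, phrases):
    """Join the words of each selected phrase with '_' (longest phrases first)."""
    for phrase in sorted(phrases, key=lambda p: len(p.split()), reverse=True):
        tagged = phrase.replace(" ", "_")
        # Match the phrase only when it is not glued to a neighbouring word or hyphen.
        pattern = r"(?<![\w-])" + re.escape(phrase) + r"(?![\w-])"
        text = re.sub(pattern, lambda m, t=tagged: t, text)
    return text


selected = ["nuclear antigen", "proliferating cell nuclear antigen"]
tagged_text = tag_phrases(
    "staining for proliferating cell nuclear antigen and nuclear antigen levels",
    selected,
)
print(tagged_text)
```

In this example the longer phrase is tagged as one unit, and the shorter phrase is tagged only where it appears on its own.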
Once the tagging process is complete, we train a word embedding model on the entire tagged corpus. Under the word embedding model, terms are represented as dense vectors that capture the meaning of the words and retain the semantic and syntactic relationships between words. We use Word2Vec, the most widely used embedding method [27], which trains a shallow neural network to learn the word vectors. Word2Vec consists of two different architectures, the continuous bag of words (CBOW) and Skipgram. In CBOW, each word is trained using its surrounding context words: given this set of context words, what is the word that is most likely to appear? For example, in Fig. 2a, using the context of six words, what is the word that is most likely to appear between them? On the other hand, Skipgram (Fig. 2b) trains the context based on the target word: given the word, what are the other words that are likely to appear? We assessed the impact of the two different architectures (Fig. 2) on the quality of the resulting embeddings. We used an existing work to guide the hyperparameter searches for CBOW and Skipgram to achieve optimal performance on both architectures [5]. While our framework can leverage other word embedding models such as GloVe [36] and FastText [3], we achieved the best performance with the Word2Vec model.

We assessed our model on five different evaluation datasets and performed several experiments to study the impact of the number of phrases, embedding architecture, and phrase generation. We also compared PMCVec with several other publicly available word embeddings.
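The difference between the CBOW and Skipgram training schemes described above can be illustrated by the training examples each one draws from a tagged sentence. This is a simplified sketch of the example-generation step only (no neural network, no sub-sampling or negative sampling); the helper names and the toy sentence are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """CBOW: predict the target term from its surrounding context terms."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Skipgram: predict each context term from the target term."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# A tagged phrase is treated exactly like any other single token.
tokens = "patients with high_blood_pressure received treatment".split()
print(cbow_examples(tokens)[2])
print([p for p in skipgram_examples(tokens) if p[0] == "high_blood_pressure"])
```

Because the tagged phrase 'high_blood_pressure' is one token, it receives its own context window and therefore its own vector under either scheme.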

Evaluation datasets
We evaluated the performance of the final models on five popular medical term similarity and relatedness datasets.
• miniMayo: This is a subset of the 'Mayo' dataset and consists of 30 term pairs on which a higher inter-annotator agreement was achieved. Out of a total of 60 terms, 31 are unigrams, 22 are 2-grams and 7 are 3-grams or longer.
• AH [16]: This is a set of 36 medical concept pairs extracted from the MeSH repository by Hliaoutakis. The similarity between word pairs was assessed by 8 medical experts. This dataset contains 41 unigram terms, 20 2-gram terms, and 11 terms that are 3-grams or longer.
• UMNSRS [32]: This is a dataset of 566 UMLS concept pairs that have been ranked by eight medical residents for similarity on a continuous scale. All of the 1132 terms are unigrams.
• UMNSRS_R [32]: This is a dataset of 587 UMLS concept pairs that have been ranked by eight medical residents for relatedness. All of the 1174 terms are unigrams.
Two of the datasets (UMNSRS and UMNSRS_R) consist only of single-word term pairs. The other three (Mayo, miniMayo, and AH) contain both single and multi-word term pairs.

Evaluation metric
The comparison on the semantic similarity and relatedness datasets is based on the Spearman rank order correlation coefficient (ρ). The coefficient is computed by comparing the ranking from the model ($r_i$) to the expert-judged ranking ($\hat{r}_i$):

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} (r_i - \hat{r}_i)^2}{n (n^2 - 1)},$$

where $n$ is the number of term pairs. Since the benchmark models only support single words, we use a compositional approach of vector averaging wherever there are multi-word similarity comparisons. For instance, when comparing the semantic similarity of the two phrases "Kidney Failure" and "Renal Failure", our model represents both terms as single entities and learns a vector representation for each phrase. The baseline models, however, learn embeddings for each word in the phrase and average those vectors to represent the phrase.
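The evaluation protocol can be sketched in a few lines of plain Python. The sketch assumes cosine similarity between term vectors (the text does not name the similarity function), uses the tie-free form of the Spearman coefficient, and works on made-up toy vectors and scores; in practice a library routine such as SciPy's `spearmanr` would handle ties.

```python
import math


def average_vector(vectors):
    """Compositional baseline: element-wise mean of the word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Ranks starting at 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(model_scores, expert_scores):
    n = len(model_scores)
    d2 = sum((a - b) ** 2
             for a, b in zip(ranks(model_scores), ranks(expert_scores)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy 2-d word vectors (hypothetical) for the baseline treatment of "kidney failure".
baseline_phrase = average_vector([[1.0, 0.0], [0.0, 1.0]])
# Toy model similarities and expert judgements for four term pairs.
rho = spearman([0.9, 0.2, 0.5, 0.7], [4.0, 1.0, 3.5, 2.0])
print(baseline_phrase, rho)
```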

Impact of phrase generation
Our first experiment assesses the impact of our phrase generation step. A qualitative comparison can be seen in Table 1, which contains the top phrases generated by different phrase ranking criteria. In this section, we quantify the performance of the metrics on the evaluation datasets. Table 2 shows the comparative similarity and relatedness scores for each metric. For each metric, we report the Word2Vec hyperparameter setting that achieved the highest score, using 18,000 phrases since this number gave the best performance across the board. The full table with exhaustive parameters is attached in the supplementary file for further comparison. We see that Info_Freq gets the best scores on the three mixed datasets (both single and multi-word phrases) and performs similarly on the single-word datasets too. Moreover, Info_Freq is robust across a wide range of hyperparameter settings for the embedding models.
We also compared the quality of our phrases to PubMed Phrases, a collection of common phrases that were generated for biomedical NLP [20]. Each phrase comes with a precalculated score based on the p-value of the hypergeometric test the authors performed on segments of consecutive terms that are likely to appear together in PubMed. To compare the phrase generation methods, we tagged the PubMed Phrases in the PubMed abstracts and trained a new CBOW model. The longest phrases are tagged first to avoid conflicts with substring phrases. A phrase that is a substring of a longer phrase is tagged only where it appears on its own, not as part of the longer phrase. Fig. 3 shows the average similarity scores over all five test datasets using the PubMed Phrases [20] and PMCVec. We include two models for PubMed Phrases: the first uses the top n phrases as scored by the authors, and the second ("exist in chunk") also uses the authors' scores but only tags a phrase if it exists in our preprocessed chunks. The PMCVec-based models consistently outperform the PubMed Phrases across all numbers of phrases. This showcases the effectiveness of our phrase generation technique.

Impact of number of phrases and embedding techniques
Our second experiment assesses the quality of the PMCVec embeddings based on the number of tagged phrases and the two Word2Vec architectures. Fig. 4a depicts how the number of phrases affects the quality of the learned model with respect to the five test datasets (the CBOW model is used). For the two datasets (UMNSRS and UMNSRS_R) with only single-word pairs, the quality of the embedding monotonically decreases as we include more phrases. As more phrases are tagged, fewer unigrams are available to learn the word embeddings. For the combined test sets (miniMayo, Mayo and AH), the quality of the embeddings increases and then decreases or stalls thereafter. Thus, for optimal performance we need to cap the number of phrases so that our model learns quality vectors for both single and multi-word terms.
We also assessed the quality of the word vectors using the two different Word2Vec architectures. Fig. 4b shows the average similarity scores on all five datasets for both the CBOW and Skipgram architecture. CBOW is better when there are fewer phrases. As the number of phrases increases, the Skipgram model slightly outperforms CBOW. Based on the figure, the best performance is achieved by CBOW using 18 K tagged phrases. The hyperparameters associated with this model are a negative sample size of 10, sub-sampling of 1e−5, a minimum count of 1, vector dimension of 200, context window size of 10, and a learning rate of 0.025.
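For readers who want to reproduce a comparable setup, the reported settings can be written as a keyword dictionary. The mapping below onto gensim 4.x `Word2Vec` argument names is an assumption; the text does not state which Word2Vec implementation was used, and the variable `tagged_sentences` in the commented usage line is hypothetical.

```python
# Reported best configuration, expressed with gensim 4.x keyword names (assumed mapping).
CBOW_PARAMS = {
    "sg": 0,             # 0 selects CBOW, 1 would select Skipgram
    "negative": 10,      # negative sample size
    "sample": 1e-5,      # sub-sampling threshold
    "min_count": 1,      # minimum term count
    "vector_size": 200,  # vector dimension
    "window": 10,        # context window size
    "alpha": 0.025,      # initial learning rate
}

# Hypothetical usage, given an iterable of token lists from the tagged corpus:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=tagged_sentences, **CBOW_PARAMS)
print(CBOW_PARAMS)
```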

Baseline methods comparison
We benchmarked PMCVec with four other word-embedding models, all pre-trained on different corpora. For our model, we used hyperparameters associated with the best performance as described above.
• Google news [15]: A Word2Vec model that is trained on a general non-biomedical corpus. It is widely used as a state-of-the-art embedding model, as it is trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million terms.
• GloVe [14]: A GloVe model that is trained on a general non-biomedical corpus. Training is performed on aggregated global word-word co-occurrence statistics from a corpus of Wikipedia and Gigaword 5 (6 billion tokens). It provides 300-dimensional vector representations for 400k words.
The best scores for each evaluation dataset are shown in bold.
The performance of PMCVec and the baseline models on the five datasets is shown in Fig. 5. The two models trained on general corpora (Google news and GloVe) have the lowest scores on all the datasets. In contrast, the other two baseline models, trained on biomedical corpora, perform significantly better. This is consistent with prior results outlining the importance of the training corpus [31]. PMCVec outperforms the baseline models on all the datasets. The improvement is noticeable on the Mayo dataset, where the task is harder due to the lower inter-annotator agreement. We also note that our model performs better on both of the single-word pair datasets (UMNSRS and UMNSRS_R), which shows that incorporating phrases into the embedding process does not significantly compromise the quality of the single-word vectors.
To quantify the performance of PMCVec on single words and multi-word terms separately, we extract unigrams from the "AH" and "Mayo" datasets. Since "miniMayo" is a subset of the "Mayo" dataset, all of its terms are already included in the extracted set. The remaining two datasets (UMNSRS and UMNSRS_R) contain only single words, and the performance of the models on these datasets is shown in Fig. 5. Fig. 6 depicts how all the aforementioned models compare when using only unigrams and only multi-word phrases. We observe that the performance gain from PMCVec is noticeable for both single words and multi-word terms compared to the baseline methods.
The inclusion of multi-word phrases not only improves the semantic similarity performance but also yields qualitatively better results. Fig. 7 shows the cluster of terms that are semantically similar to the word 'hypertension'. In the two scenarios where no phrases are tagged (Fig. 7a) and the PubMed Phrases are tagged (Fig. 7b), the two closest terms to hypertension are the same, 'hypertensive' and 'hypertensions', and the third closest are 'hypertensives' and 'prehypertensive', respectively. Moreover, only two multi-word phrases ('arterial hypertension associated' and 'uncomplicated essential renovascular') appear when using the PubMed Phrases. Using PMCVec, 'high blood pressure', 'elevated blood pressure', and 'essential hypertension' are the closest, and all three are semantically similar to hypertension.
Additional examples of similar terms are shown in Table 3 for different disorders, symptoms, and medications. In all 6 cases, PMCVec is able to return relevant multi-word synonyms among the top 5 closest terms. With PMCVec, 'diabetes mellitus' is returned as semantically similar to 'diabetes', whereas the top term for the other two methods is 'mellitus'. Similarly for symptoms, 'joint pains' is returned for 'aches', whereas the other two embeddings do not have this term. The same holds true for drugs: for 'aspirin', the single-word model returns 'clopidogrel' and the PubMed Phrases model returns 'dipyridamoleasprin' as the most semantically similar term. These are drugs commonly administered with aspirin. With PMCVec, the top term is 'acetylsalicylic acid', which is another name for aspirin. In general, the PMCVec-based embeddings produce more accurate vector representations for phrases. Biomedical text is rich with multi-word concepts and terminologies, and as such, representing these terms appropriately as single units to learn their vector representations is an important step in biomedical text processing.

Limitations
Our model focused on obtaining a good distributed term representation by combining multi-word phrases and single words. Unfortunately, GloVe and FastText models took considerably more time to train in large dimensions. Due to computational time and memory limitations, we were not able to train these models with large dimensions and window sizes. The GloVe and FastText models we trained performed much worse than the two Word2Vec models in smaller dimensions (100-dimension and 200-dimension results are in the supplementary table), which is consistent with the work of Fan et al. [11] on clinical notes. The method we used for phrase generation did not consider terms and phrases containing only digits or stop words. Even though it is common to remove stop words in the form of subsampling for word embedding generation, since they occur much more frequently and inflate the vocabulary size and training time [26], this may not be desirable for biomedical phrase generation. We believe that it may result in the exclusion of meaningful phrases. However, incorporating these aspects into the phrase generation process would significantly lengthen the computation time. We plan to experiment in the future to determine the viability of including phrases with digits and stop words.

Fig. 7. Word cloud for semantically similar terms to 'hypertension'. The size of the term is proportional to how semantically close it is to the word 'hypertension', with the largest denoting the most similar.

Conclusion
Learning quality vector embeddings that incorporate both single words and multi-word phrases can be quite challenging. Although compositional approaches that combine unigram vectors to obtain a phrase representation have worked well in some domains, they do not capture the meaning of key biomedical concepts. Moreover, incorporating all the existing identified biomedical phrases can negatively impact the quality of the embeddings. To address these issues, we introduced PMCVec, an unsupervised method that bridges the gap in learning quality vector embeddings for multi-word phrases, which are a staple in biomedical literature. Our method not only generates useful phrases from the corpus, but also introduces a new criterion to rank the generated phrases, which avoids incorporating all the phrases and achieves a better embedding for both single words and multi-word phrases. Using several benchmark datasets, we showed that the learned phrase embeddings result in better performance than compositional approaches. As an example, a search result for the term 'colitis' should include multi-word expressions like 'inflammatory bowel disease'. Learning vectors for both of these terms allows easy association of the concepts, which are very similar but will not be learned as such with just single-word embeddings. We believe that PMCVec-learned representations will be widely useful for a variety of biomedical NLP tasks.

Data availability
The PubMed dataset used in this study is publicly available for download at https://www.nlm.nih.gov/databases/download/pubmed_medline.html. The resources we used and the final model are available for download at https://github.com/ZelalemGero/PMCvec.