12.1 Introduction

There is a widely adopted perspective that the twenty-first century is the age of biology [30]. Indeed, biomedicine has always occupied an important position and maintained relatively rapid development. Researchers devote themselves to exploring how life systems (e.g., cells, organisms, individuals, and populations) work, what the mechanisms of genetics (e.g., DNA and RNA) are, how the external environment (e.g., chemicals and drugs) affects these systems, and many other important topics [42]. The field has flourished recently with the development of emerging interdisciplinary domains [92], among which biomedical NLP draws much attention as a representative topic in AI for science, which aims to apply modern AI tools to various areas of science to achieve efficient scientific knowledge acquisition and application.

12.1.1 Perspectives for Biomedical NLP

The prospect of biomedical NLP is to improve human experts' efficiency by automatically mining useful information and finding potential implicit laws, and this is closely related to two branches of biology: computational biology and bioinformatics. Computational biology emphasizes solving biological problems with the aid of computer science; researchers use computer languages and mathematical logic to describe and simulate the biological world. Bioinformatics studies the collection, processing, storage, dissemination, analysis, and interpretation of biological information, focusing mainly on genomics and proteomics. The two terms, computational biology and bioinformatics, are now generally used interchangeably.

According to the format of the processed data, we can analyze biomedical NLP from two perspectives. The first perspective is NLP tasks on biomedical-domain text, in which we regard biomedicine as a specific domain of natural language documents; the basic tasks are thus shared with general-domain NLP, while the corpus has its own features. Typical tasks [17] include named entity recognition, term linking, relation extraction, information retrieval, document classification, question answering, etc.

The other perspective is NLP methods for biomedical materials, in which NLP techniques are adopted and transferred to model non-natural-language data and solve biomedical problems, such as the mining of genetic and protein sequences [38]. As shown in Fig. 12.1, biomedical materials include natural language documents and other materials. The latter can be expressed as sequences, graphs, and other forms, and therefore the representation learning techniques introduced in the previous chapters can be employed to model biomedical materials. To ensure the effectiveness of general NLP techniques in the new scenario, adjustments are required to better fit the data characteristics (e.g., the much smaller vocabulary of genetic sequences compared with natural language).

Fig. 12.1 Introduction to biomedical knowledge and biomedical NLP. Icons are bought or freely downloaded from IconFinder (https://www.iconfinder.com/)

Overall, biomedical natural language documents contain linguistic and commonsense knowledge and also provide explicit and flexible descriptions of biomedical knowledge. Meanwhile, the special materials in the biomedical domain contain even more subject knowledge in implicit expressions. We believe that the two perspectives are gradually fusing to achieve more universal biomedical material processing, and we will go into more detail about this trend later.

12.1.2 Role of Knowledge in Biomedical NLP

A characteristic of biomedical NLP is that expert knowledge is of key importance for a deep comprehension of the materials being processed. This even restricts the scale of gold-standard datasets due to the high cost and difficulty of manual annotation. Therefore, we emphasize knowledge representation, knowledge acquisition, and knowledge-guided NLP methods for the biomedical domain.

First, biomedical materials have to be expressed properly to fit automatic computing, which benefits from the development of knowledge representation methods such as distributed representations. Next, echoing the basic goals of AI for science, we expect biomedical NLP systems to assist us in extracting and summarizing useful information or rules from a mass of unstructured materials, which is an important part of the knowledge acquisition process. Nevertheless, as mentioned above, biomedical NLP datasets are hard to build at a large scale, which is one reason that the performance of data-driven deep learning systems in the biomedical domain is not always satisfying. To improve the performance of these intelligent systems under limited conditions, knowledge-guided NLP methods become especially important. With the help of biomedical knowledge, NLP models trained on the general domain can be transferred to biomedical tasks with minimal supervision. For instance, the definitions and synonyms of terms in biomedical ontologies can guide models to a deeper comprehension of the biomedical terms occurring in the texts being processed.

In Sect. 12.2, we first introduce the representation and acquisition of biomedical knowledge, which comes from two types of materials: natural language text and other biomedical data. Then, in Sect. 12.3, we focus on knowledge-guided biomedical NLP methods, which we divide into four groups following the discussion in Chap. 9. After learning about the basic situation of biomedical knowledge representation learning, we explore several typical application scenarios in Sect. 12.4 and discuss some advanced topics worth researching in Sect. 12.5.

12.2 Biomedical Knowledge Representation and Acquisition

Going back decades, AI systems for biomedical decision support had already shown the importance of knowledge representation and acquisition. The former is the basis of practical usage, and the latter ensures the sustainability of expert systems with growing knowledge. In that period, biomedical knowledge was represented in a structured manner. For instance, DENDRAL [10] is an expert system providing advice for chemical synthesis; its production rules first recognize the situation and then generate corresponding actions. This two-stage process is similar to human reasoning and has strong explanation capability. Other systems represent knowledge in the form of frames, relations, and so on [37]. Correspondingly, the acquisition of biomedical knowledge mainly relied on manual collection, and the assistant information extraction systems were built mainly on manual feature engineering.

With the development of machine learning, knowledge representation and acquisition have been raised to new heights. The following discussion is divided according to two different sources of knowledge: natural language text materials and other materials, corresponding to the two perspectives mentioned in the previous section.

12.2.1 Biomedical Knowledge from Natural Language Text

Biomedical textual knowledge is scattered in various natural language documents, patents, clinical records, etc. Various knowledge representation learning methods from the general domain are applied to these natural language text materials. What is special about biomedical texts is that we have to achieve a deep comprehension of the key biomedical terms. Therefore, we first discuss various term-oriented biomedical tasks that researchers explore. We then turn to pre-trained models (PTMs) to achieve an overall understanding of language descriptions (including the sentences, paragraphs, and even documents around these terms).

Term-Oriented Biomedical Knowledge

Biomedical terms, including the professional concepts and entities in the biomedical domain, are important carriers of domain knowledge. Common biomedical terms include chemical/gene/protein entities, disease/drug/examination/treatment items, cell/tissue/organ parts, and others. To better process biomedical natural language materials, a deeper comprehension of these biomedical terms is necessary. Dictionary-based and rule-based methods are labor-intensive, and it is difficult for them to stay up to date and handle complicated scenarios [46]. To grasp and analyze data features automatically, machine learning and statistical learning have been adopted to obtain more generalized term representations and better acquisition performance [87], though the results are still far from satisfactory. Deep learning has since developed rapidly and proved its effectiveness in the biomedical domain; therefore, we mainly introduce deep learning methods for biomedical term processing, which are currently the mainstream solution for biomedical knowledge representation and acquisition.

Biomedical Term Representations

Mainstream term representation methods work in a self-supervised manner: the model predicts missing parts given the context, aiming to obtain general feature representations for various downstream tasks.

Many techniques from the general domain, such as word embeddings, are directly used in biomedical scenarios without adaptation. The skip-gram version of word2vec [16], for example, has proven to achieve satisfying performance on the biomedical term semantic relatedness task [65]. Besides, researchers also train distributed representations especially for biomedical terms [22, 70, 102], introducing extra information such as the UMLS [9] ontology and the medical subject headings (MeSH) [54]. Based on the shallow term embeddings, we can go a step further and adopt deep neural networks such as CNNs and BiLSTMs to obtain deep distributed representations for biomedical terms [48, 74].
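As a concrete illustration of the first approach, below is a minimal sketch of training skip-gram embeddings on a tokenized biomedical corpus with gensim; the corpus file name, hyperparameters, and query terms are illustrative placeholders rather than a recommended configuration.

```python
# A minimal sketch: skip-gram word2vec embeddings for biomedical text (gensim).
# The corpus path, hyperparameters, and query terms are illustrative placeholders.
from gensim.models import Word2Vec

# Each line of the corpus is assumed to be one pre-tokenized sentence,
# e.g., "the il-2 promoter regulates t-cell activation".
with open("pubmed_abstracts.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=200,  # embedding dimension
    window=5,         # context window size
    min_count=5,      # ignore rare tokens
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
)

# Semantic relatedness between two terms via cosine similarity,
# assuming both terms appear often enough to enter the vocabulary.
print(model.wv.similarity("aspirin", "ibuprofen"))
```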

In recent years, PTMs have become the most popular choice for generating distributed representations as the basis of various downstream tasks. Inspired by PTMs such as BERT that have achieved increasingly impressive performance in the general domain, researchers quickly transferred the self-supervised approach to the biomedical domain. SciBERT [7] is one of the earliest PTMs specially adapted for the scientific corpus, followed by BioBERT [47], which further narrows the target corpus to biomedicine. The specific operation is very simple: replacing the general-domain pre-training corpus of BERT (e.g., Wikipedia, books, and news) with biomedical texts (e.g., literature and medical records). Some other biomedical PTMs also follow this strategy, such as SciFive [73], which is adapted from T5.
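The following sketch shows how such a domain PTM can be used to produce contextual token representations through the Hugging Face transformers library; the SciBERT checkpoint name refers to the publicly released model and could be replaced by any other biomedical PTM on the hub.

```python
# A sketch of obtaining contextual token representations from a biomedical PTM
# with the Hugging Face transformers library; swap in any other checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

text = "The IL-2 promoter is activated in stimulated T cells."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch, sequence_length, hidden_size).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```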

To sum up, the knowledge representation methods from the general domain adapt to biomedical terms quite well. Special hints, including information from subject ontologies, can provide extra help in generating better representations.

Biomedical Term Knowledge Acquisition

The identification of terms involves many subtasks: recognition (NER), classification (typing), mapping (linking), and so on [46]. We introduce several mainstream solutions in chronological order.

Early on, simple supervised learning methods were applied for term recognition, such as the hidden Markov model (HMM) for term recognition and classification [19, 24] and the support vector machine for biomedical NER [67]. These systems mainly rely on pre-defined domain-specific word features and do not perform well enough on classes that lack data. Neural networks including LSTMs are also widely adopted for biomedical term knowledge acquisition [31, 33]. Unsupervised approaches have also been explored and proven effective [101].

With the help of biomedical PTMs, we can better acquire and organize knowledge from the mass of unstructured text. At the term level, PTMs encode the long text into dense representations, which can then be fed into classifiers or softmax layers to perform NER, entity typing, and linking precisely. The tuning methods are sometimes specially designed, such as conducting self-alignment training on pair tuples of entity names and categorical labels from several ontologies [40, 55]. Though the methodology of PTMs has been successfully adapted to the biomedical domain, there still exist domain-specific problems waiting to be solved. For example, compared with the general domain, biomedical texts contain more nested entities because of terminology naming conventions: the DNA entity IL-2 promoter also contains a shorter protein entity IL-2, and G1123S/D refers to two separate entities G1123S and G1123D, as shown in Fig. 12.2. We can solve the nested entity problem by separating different types of entities into different output layers [26] or by detecting boundaries and assembling all the possible combinations [15, 89]; a minimal sketch of the former strategy is given after Fig. 12.2.

Fig. 12.2 Instance of biomedical textual knowledge acquisition. (Text is taken from [35])
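Below is a minimal sketch, assuming PyTorch and an encoder that returns token states in the Hugging Face style (with a last_hidden_state field), of the first strategy: one BIO tagging head per entity type on top of a shared encoder, so that nested mentions of different types can be predicted over the same tokens.

```python
# A minimal sketch, assuming PyTorch: one BIO tagging head per entity type on
# top of a shared encoder, so nested mentions of different types (e.g., a
# protein inside a DNA entity) can be predicted over the same token positions.
import torch.nn as nn

class MultiHeadBioTagger(nn.Module):
    def __init__(self, encoder, hidden_size, entity_types):
        super().__init__()
        self.encoder = encoder  # e.g., a biomedical PTM returning token states
        # One 3-way (B/I/O) classifier per entity type.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden_size, 3) for t in entity_types}
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, hidden_size)
        # Each entity type gets its own BIO logits over the same token sequence.
        return {t: head(hidden) for t, head in self.heads.items()}

# Usage sketch: MultiHeadBioTagger(ptm, 768, ["DNA", "protein", "cell_line"])
```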

Language-Described Biomedical Knowledge

As we can see, the research discussed so far concerns mainly the special terms carrying professional biomedical knowledge. Nevertheless, other words and phrases in the language materials also contain rich information, such as commonsense knowledge, and can express much more flexible biomedical facts and attributes than isolated terms. It is therefore necessary to represent whole language descriptions instead of only biomedical terms, and this can be achieved quite well by domain PTMs. Based on the representations of language materials, the biomedical knowledge scattered in unstructured text can be acquired and organized into a structured form.

We now introduce the overall development of language-described biomedical knowledge extraction. The popular datasets are mostly small-scale and focus on specific types of relations, like the BC5CDR chemical-disease relation detection dataset and the ChemProt chemical-protein interaction dataset [49]. These simple tasks can sometimes be finished quite well with the help of distributed representations, even when they are generated by simple neural networks without pre-training [85]. However, practical scenarios require more sophisticated information extraction (e.g., N-ary relations, overlapping relations, and events). Since scientific facts usually involve stricter conditions, few of them can be expressed clearly with only a triplet. For example, the effect of a drug on a disease is related to the characteristics of the sample, the course of the disease, etc. As shown in Fig. 12.2, text mentioning N-ary relations is usually quite long and may cross several paragraphs. PTMs show their effectiveness thanks to their capability of capturing long-distance dependencies for sophisticated relations in long documents: they encode the mentions and then obtain distributed entity representations for the final prediction [41], as sketched below.
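A minimal sketch of this mention-pooling idea is given below, assuming PyTorch and a Hugging Face-style encoder; the mention positions and relation inventory are taken as given, and real document-level extractors add many refinements.

```python
# A sketch of document-level relation prediction, assuming PyTorch: encode the
# document, pool the token states of each entity's mentions, and classify the
# entity pair. Mention positions and the relation set are taken as given.
import torch
import torch.nn as nn

class DocRelationClassifier(nn.Module):
    def __init__(self, encoder, hidden_size, num_relations):
        super().__init__()
        self.encoder = encoder              # long-document PTM
        self.classifier = nn.Linear(2 * hidden_size, num_relations)

    def forward(self, input_ids, attention_mask, head_positions, tail_positions):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[0]              # (seq_len, hidden_size), batch of one
        # Average the token states over all mentions of each entity.
        head_repr = hidden[head_positions].mean(dim=0)
        tail_repr = hidden[tail_positions].mean(dim=0)
        return self.classifier(torch.cat([head_repr, tail_repr], dim=-1))
```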

Summary

Overall, researchers handle simple biomedical text processing scenarios quite well by transferring knowledge representation and acquisition methods from the general domain of NLP, while challenges still exist from practical perspectives. Knowledge storage structures with stronger expressive ability, plenty of annotated data, and specially designed architectures are urgently needed for sophisticated biomedical knowledge representation and acquisition.

12.2.2 Biomedical Knowledge from Biomedical Language Materials

Biomedical materials contain not only textual materials scattered in natural language but also materials unique to the biomedical field. These materials have their own special structures in which rich knowledge exists, and we collectively refer to them here as biomedical language (e.g., genetic language) materials. Compared with natural language materials, biomedical language materials like genetic sequences are not easy to comprehend and require extensive experience and background information for analysis. Fortunately, modern neural networks can process not only natural language documents but also most sequential data, including some representations of chemical and genetic substances. Besides, deep learning methods can also be applied to represent and acquire biomedical knowledge in other forms such as graphs. In this section, we discuss genetic language, protein language, and chemical language; the substances expressed by these languages are linked by the genetic central dogma [21]. As shown in Fig. 12.3, genetic sequences are expressed to produce proteins, which react with various chemicals to execute their functions.

Fig. 12.3 Genetic central dogma. Icons are bought or freely downloaded from IconFinder

Genetic Language

There are altogether only five types of nucleobases: A, G, C, and T appear in DNA sequences, and A, G, C, and U appear in RNA sequences. Since the coding region of the unwound DNA is transcribed to generate an mRNA sequence with a fixed correspondence, i.e., A-T(U) and G-C, the processing methods for DNA and RNA sequences are often similar, and we mainly discuss DNA sequences in this section. We first introduce basic tasks for DNA sequence processing and then discuss the similarities and differences between genetic language and natural language. Given these features of genetic language, we then present related work on tokenization and encoding methods.

Basic Tasks for Genetic Sequence Processing

First, let’s take a look at various downstream property prediction tasks for genetic sequences. Some of them emphasize high-level semantic understanding of DNA sequences [5, 81] (long-distance dependency capturing and gene expression prediction), such as predicting transcriptional activity, histone modifications, TF binding, and DNA accessibility in various cell types and tissues for held-out chromatin regions. Other tasks evaluate low-level semantic understanding of DNA sequences [68] (precise recognition of basic regulatory elements), such as the prediction of promoters, transcription factor binding sites (TFBSs), and splice sites.

Features of Genetic Language

Although both DNA/RNA language and natural language are textual sequences, there exist differences between them. First, genetic sequences are quite long and monotonous, and thus not as reader-friendly for human beings as natural language; however, NLP models are actually good at reading a mass of data and finding patterns. Second, compared with natural language, genetic language has a much smaller vocabulary (only the few nucleobases mentioned above); therefore, low-level semantic modeling is important for overall sequence comprehension, about which researchers have launched many explorations, as introduced below.

Genetic Language Tokenization

Early works express the sequences via one-hot coding [82]: nucleobase-level features can be captured by converting the sequences into 2D binary matrices. Based on the tokenized results, convolutional layers and sequence learning modules such as LSTMs are applied to obtain the final distributed representations [34]. More researchers use the k-mer tokenizer (substrings of k monomers contained within a biological sequence) to take co-occurrence information into account. In other words, the encoding of each position in the gene sequence is considered together with the preceding and following positions (a sliding window of total length k) [63]. Other methods such as byte pair encoding [80] have also proven useful.
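A toy k-mer tokenizer can be written in a few lines; the sequence and the choice of k below are only for illustration.

```python
# A toy k-mer tokenizer: slide a window of length k over the sequence so that
# each token carries local co-occurrence information of neighboring bases.
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATGCGTAC", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```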

Genetic Sequence Representation

Shallow models can hardly process long sequences that may contain thousands of base pairs, while Transformers [88] can capture long-distance dependencies quite well thanks to their attention modules. Further, self-supervised pre-training of Transformers has also proven effective on genetic language [39]. Besides, improved versions of Transformers have been implemented and achieve good performance on DNA tasks. For instance, Enformer [5] is designed to enlarge the receptive field. To be more specific, ideas from computer vision are borrowed: deep convolutional layers expand the region that each neuron can process. Enformer replaces the lower layers of a standard Transformer with seven convolutional blocks to capture the low-level semantic information. The captured features are fed into 11 Transformer layers and processed by separately trained organism-specific heads. Experimental results show that Enformer improves gene expression prediction, variant effect prediction, and mutation effect prediction.

Protein Language

Protein sequence processing has much in common with genetic sequence processing. There are altogether 20 standard amino acids in human proteins, so protein language is also a special language with low readability and a small vocabulary. We again discuss some basic tasks and methods first and then introduce a representative work in protein sequence processing.

Basic Tasks for Protein Sequence Processing

Predicting the sequence specificities of DNA- and RNA-binding proteins [2] is a basic task of concern, because mRNA sequences are translated to obtain amino acid sequences and the two types of sequences are highly related. Moreover, spatial structure analysis is another unique and important task for protein sequences, since the spatial structure of a protein determines its properties and functions.

We have introduced the similarity of genetic and protein languages, which allows most genetic sequence processing methods to be adapted to proteins. However, there are also special methods for protein sequence processing. A significant fact is that structural and functional similarities exist between homologous protein sequences, which can help supervise protein representation learning. Through contact prediction and pairwise comparison, we can conduct multi-task training of distributed protein sequence representations [8] and in turn assist spatial structure prediction.

Landmark Work for Protein Spatial Structure Analysis

AlphaFold [43], proposed by DeepMind, has achieved a breakthrough in highly accurate protein structure prediction and won the Critical Assessment of protein Structure Prediction (CASP) challenge. The system incorporates multiple sequence alignment (MSA) [11] templates and pairwise information for the protein sequence representation. It is built on a Transformer variant named Evoformer: the column and row attention of MSA sequences and the pair representations are fed into Evoformer blocks, and peptide bond angles and distances are then predicted by subsequent modules. The interfaces and tools for interacting with AlphaFold are well developed, making it easy to master for users without an AI background. This reflects the essence of interdisciplinary research: division of labor and cooperation to improve efficiency.

Besides, it is worth mentioning that the initial results generated by AlphaFold can be further improved with the help of molecular dynamics knowledge. Incorporating domain knowledge also shows its effectiveness in other scenarios, such as using chemical reaction templates for retrosynthesis learning [28]. Overall, the combination of professional knowledge and data-driven deep learning achieves better results, which is an important development trend for biomedical NLP.

Chemical Language

Apart from biological sequences, chemical substances (especially small molecules) can also be encoded into molecule representations, which help with property prediction and filtering. These representations play a similar role to the molecular fingerprint, a commonly used abstract molecular representation that converts the molecular structure into a binary vector by checking whether specific substructures exist.
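As a concrete reference point, the sketch below computes a Morgan (circular) fingerprint with RDKit; aspirin's SMILES string serves as the example molecule, and the radius and bit size are common but arbitrary choices.

```python
# A sketch of computing a Morgan (circular) fingerprint with RDKit; aspirin's
# SMILES string is used as an example molecule.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# The fingerprint is a 2048-dimensional binary vector; each set bit indicates
# the presence of a particular local substructure.
print(fp.GetNumOnBits(), "substructure bits set out of", fp.GetNumBits())
```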

Early Fashions for Chemical Substance Representation

In the early days of applying machine learning to molecular property prediction, molecular descriptors such as nuclear charges and atomic positions were provided for nonlinear statistical regression [77]. Essentially, people still needed to manually select features for the molecular descriptors. To alleviate the labor of manual feature engineering, data-driven deep learning systems have gradually become the main approach for the analysis of molecules.

For current deep learning systems operating on chemical substance representations, we classify the methods according to the different expressions of chemical substances, of which there are several common forms, as shown in Fig. 12.4.

Fig. 12.4 Different chemical expression methods

Graph Representations

One of the clearest expressions is the 2D or 3D topology diagram [23, 45] describing the inner chemical structure of a molecule, which naturally corresponds to the essential elements of graphs. In molecular graphs, the nodes represent atoms, and the edges represent connections (chemical bonds, hydrogen bonds, van der Waals forces, etc.). Graph representation learning bridges chemical expression and machine learning [95]; we have introduced graph representation learning in detail in Chap. 6. Graph Transformer [98], for example, is currently one of the most popular approaches in molecular graph representation learning [76]. With graph representation learning methods, we can address two main tasks for molecular processing: molecular graph understanding, which captures the topology of molecular structures and predicts properties [45], and molecular graph generation, which provides assistance for drug discovery and refinement [59]. Overall, graph representation learning has proven to be an effective approach to chemical analysis. A sketch of turning a molecule into graph inputs follows.
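The sketch below, using RDKit, converts a SMILES string into simple graph inputs (atomic numbers as node features, bonds as bidirectional edges); real molecular GNNs use much richer atom and bond features, so this is only a minimal illustration.

```python
# A minimal sketch of converting a SMILES string into graph inputs
# (atom features as nodes, bonds as edges) with RDKit; the feature choice
# here is deliberately simplified.
from rdkit import Chem

def smiles_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    # Node features: atomic number of each atom.
    node_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    # Edges: one pair of directed edges per chemical bond.
    edge_index = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index += [(i, j), (j, i)]
    return node_features, edge_index

nodes, edges = smiles_to_graph("CCO")  # ethanol: 3 heavy atoms, 2 bonds
print(nodes, edges)
```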

Linear Text and Other Representations

There are also other solutions for expressing chemical substances. For example, linear text such as the structural formula, structural abbreviation, and simplified molecular input line entry specification (SMILES) [79] can be adopted for chemical expression. The straightforward advantage of linear text expressions is that they can naturally be fed into any NLP model. Although different from natural language text, SMILES text expressing molecules and chemical reactions can also be processed by Transformer-based models with the assistance of specially designed tokenizers [50] (see the sketch below) and pre-training tasks [90]. Nevertheless, linear text loses some structural information, and 2D topological and 3D spatial hints are still proven to be important. Atom coordinates computed from SMILES help improve the performance of SMILES processing models [93], which suggests that domain knowledge (e.g., the molecular 3D structure) can enhance NLP models when processing biomedical materials.
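For illustration, the sketch below shows a regex-based SMILES tokenizer in the style commonly used by Transformer models for molecules and reactions; the exact pattern varies across implementations, so treat this one as an example rather than a standard.

```python
# A sketch of a regex-based SMILES tokenizer: the pattern splits bracketed
# atoms, two-letter elements, ring-closure digits, and bond symbols.
import re

SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```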

Summary

Apart from substances related to the central dogma, there exist other types of special materials in the biomedical domain, such as image data and numeric data. The former, including molecule images and medical magnetic resonance images [58], can be automatically processed by AI systems to some extent. The latter, such as continuously monitored health data, is also processed with NLP methods adapted to the biomedical domain [94]. In summary, the materials to be processed take versatile forms, and deep learning methods have already achieved satisfying performance on many of them. Further, to achieve deep comprehension and precise capture of biomedical knowledge, we believe that adaptive and universal processing of various materials will gradually become the trend in biomedical NLP research.

12.3 Knowledge-Guided Biomedical NLP

We have already discussed the development and basic characteristics of biomedical knowledge representation and acquisition. Conversely, domain knowledge can guide and enhance biomedical NLP systems to better finish knowledge-intensive tasks. Though commonsense and facts in the general domain can be learned in a self-supervised manner, the biomedical knowledge used to guide the systems is more professional and has to be introduced additionally. Guidance from domain knowledge bases can assist even human experts and improve their performance, let alone biomedical NLP systems. In this section, we introduce the basic ideas and representative works for knowledge-guided biomedical NLP, according to the four types of methods mentioned in Chap. 9: input augmentation, architecture reformulation, objective regularization, and parameter transfer.

12.3.1 Input Augmentation

To guide neural networks with biomedical knowledge, one simple solution is to provide the knowledge directly as augmented input to the systems. There are different sources of knowledge that can augment the input, as we introduce below. One mainstream source is the biomedical knowledge graph (KG), which contains human knowledge and facts organized in a structured form. Besides, knowledge may also come from linguistic rules, experimental results, and other unstructured records. The problem for input augmentation is to select helpful information, encode it, and fuse it with the original input.

Encoding Knowledge Graph

Information from professional KGs is of high quality and suitable for guiding models in downstream tasks. Usually, we rely on basic entity recognition and linking tools to select the subgraphs or triplets from KGs that are related to the current context and then tackle more sophisticated tasks such as reading comprehension and information extraction. We now give three instances: (1) Improving word embeddings with the help of KGs. Graph representation learning approaches such as GCN-based methods can obtain better-initialized embeddings for the link prediction task on biomedical KGs [3]. (2) Augmenting the inputs with knowledge. Models such as hybrid Transformers can encode token sequences and triplet sequences at the same time and incorporate the knowledge into the raw text [6]. (3) Mounting the knowledge with extra modules. Extra modules are designed to encode the knowledge, such as a graph-based network encoding KG subgraphs to assist biomedical event extraction [36]. As shown in Fig. 12.5, the related terms in the UMLS ontology are parsed to form a subgraph, which is encoded and concatenated into the hidden layer of the SciBERT text encoder to assist event trigger and type classification. Other examples include a separate KG encoder providing entity embeddings for the lexical layers of the original Transformer [27] and KG representations trained by TransE being attached to the attention layers [14]. A sketch of the third strategy is given after Fig. 12.5.

Fig. 12.5 Encoding UMLS information to assist event extraction. (The figure is re-drawn according to Figs. 1 and 2 from the GEANet paper [36])
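A minimal sketch of the third strategy, mounting knowledge with an extra module, is given below; it assumes PyTorch, a Hugging Face-style text encoder, and a hypothetical graph encoder that maps a retrieved subgraph to a fixed-size vector.

```python
# A minimal sketch, assuming PyTorch: a retrieved KG subgraph is encoded by a
# small graph encoder, and its embedding is concatenated with the text
# encoder's hidden states before a task-specific classifier.
import torch
import torch.nn as nn

class KnowledgeFusedClassifier(nn.Module):
    def __init__(self, text_encoder, graph_encoder, hidden_size, graph_size, num_labels):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., SciBERT
        self.graph_encoder = graph_encoder  # e.g., a GNN over the UMLS subgraph
        self.classifier = nn.Linear(hidden_size + graph_size, num_labels)

    def forward(self, input_ids, attention_mask, subgraph):
        hidden = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                        # (batch, seq_len, hidden)
        graph_repr = self.graph_encoder(subgraph)  # (batch, graph_size)
        # Broadcast the subgraph embedding to every token position and fuse.
        graph_repr = graph_repr.unsqueeze(1).expand(-1, hidden.size(1), -1)
        fused = torch.cat([hidden, graph_repr], dim=-1)
        return self.classifier(fused)              # per-token logits (e.g., triggers)
```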

Encoding Other Information

Apart from KG information, other types of knowledge have proven helpful. Syntactic information, for example, is a significant part of linguistic knowledge. Though not a part of biomedical expert knowledge, syntactic information can also be provided as augmented input to better analyze sentences, recognize entities, and so on [86]. For non-textual material processing tasks, such as discovering the relationship between basal gene expression and drug response, researchers believe that experimentally verified prior knowledge, including protein and genetic interactions, is important; this information can be concatenated with the original input substances to obtain representations, demonstrating the effectiveness of input augmentation [25]. Overall, introducing extra knowledge usually does no harm to performance, but we need to decide whether the knowledge is relevant and helpful to the specific task, either through human experience or automatic filtering.

12.3.2 Architecture Reformulation

Human prior knowledge is sometimes reflected in the design of model architectures, as we have mentioned in the representation learning of biomedical data. This is especially significant when we process domain-specific materials, such as the substances introduced in the last section. After all, the backbone models are designed for general materials (e.g., natural language documents and natural images), which may differ remarkably from biomedical substances. Here we analyze two examples in detail: Enformer [5] and MSA Transformer [75].

Enformer is an adapted version of the Transformer framework for DNA sequences, and we show the model architecture in Fig. 12.6. The general idea of this model has already been introduced in the discussion of genetic sequences. Here we take a look at two designs in Enformer that help the model better capture the low-level semantic information in super-long genetic sequences, which is of key importance for high-level sequence analysis. First, Enformer emphasizes relative position information, selects the relative positional encoding basis functions carefully, and uses a concatenation of exponential, gamma, and central mask encodings. Second, convolutional layers are applied to capture low-level features, enlarging the receptive field and greatly expanding the number of relevant enhancers seen by the model. A simplified sketch of this layout is given after Fig. 12.6.

Fig. 12.6 Model architecture for Enformer. (The figure is re-drawn according to Fig. 1a from DeepMind's Enformer paper [5])
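The sketch below, assuming PyTorch, illustrates the overall layout of convolutional downsampling followed by Transformer layers; the layer counts, kernel sizes, and positional treatment are simplified placeholders and do not reproduce the published Enformer configuration.

```python
# A rough sketch of the conv-then-Transformer layout: convolutional blocks
# downsample the one-hot DNA input and enlarge the receptive field, then
# Transformer layers model long-range interactions. Sizes are placeholders.
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    def __init__(self, channels=256, n_conv=3, n_transformer=4, n_heads=8):
        super().__init__()
        convs = [nn.Conv1d(4, channels, kernel_size=15, padding=7)]
        for _ in range(n_conv - 1):
            convs += [nn.GELU(), nn.MaxPool1d(2),
                      nn.Conv1d(channels, channels, kernel_size=5, padding=2)]
        self.conv_stack = nn.Sequential(*convs)
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_transformer)

    def forward(self, one_hot_dna):           # (batch, 4, sequence_length)
        x = self.conv_stack(one_hot_dna)      # (batch, channels, reduced_length)
        x = x.transpose(1, 2)                 # (batch, reduced_length, channels)
        return self.transformer(x)            # contextual position embeddings

x = torch.zeros(1, 4, 4096)                   # toy one-hot encoded DNA input
print(ConvThenTransformer()(x).shape)
```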

When discussing AlphaFold, we mentioned the significance of MSA information. Inspired by this idea, MSA Transformer is proposed to process multiple protein sequences; the model architecture is shown in Fig. 12.7. Normal Transformers conduct attention calculations separately for each sequence, yet different sequences in the same protein family share information, including the co-evolution signal. MSA Transformer therefore introduces column attention alongside the row attention of each sequence and is trained with a variant of masked language modeling across different protein families. Experimental results show that MSA Transformer performs clearly better than models processing only single sequences, and this has become a basic paradigm for processing protein sequences. A simplified sketch of the row and column attention is given after Fig. 12.7.

Fig. 12.7 Model architecture for MSA Transformer. (The figure is re-drawn according to Fig. 1 from the MSA Transformer paper [75])
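A simplified sketch of the row and column attention idea is shown below, assuming PyTorch; the real MSA Transformer uses tied row attention and further components, so this only conveys the axial-attention pattern.

```python
# A simplified sketch of row/column attention over an MSA tensor of shape
# (rows, columns, dim): attention is applied along each aligned sequence and
# then along each alignment column, so co-evolution signals can be shared.
import torch
import torch.nn as nn

class RowColumnAttention(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, msa):                    # (rows, columns, dim)
        # Row attention: each sequence attends over its own positions.
        row_out, _ = self.row_attn(msa, msa, msa)
        msa = msa + row_out
        # Column attention: each alignment column attends across sequences.
        cols = msa.transpose(0, 1)             # (columns, rows, dim)
        col_out, _ = self.col_attn(cols, cols, cols)
        return (cols + col_out).transpose(0, 1)

msa = torch.randn(8, 128, 64)                  # 8 aligned sequences, 128 columns
print(RowColumnAttention()(msa).shape)         # torch.Size([8, 128, 64])
```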

12.3.3 Objective Regularization

Formalizing new tasks from extra knowledge can change the optimization target and guide the model to finish the target task better. In the biomedical domain, there are plenty of ready-made tasks that can be adopted for objective regularization once chosen carefully, so we do not need to specially formalize new tasks. Usually, we conduct multi-task training in the downstream adaptation period. Some researchers also explore objective regularization in the pre-training period, where PTMs learn the knowledge contained in multiple pre-training tasks. We give examples of these two modes and conduct a comparative analysis.

Multi-task Adaptation

The introduced tasks can be the same as or slightly different from the target task. For the former, we usually collect several datasets (possibly differently distributed or in various language styles) for the same task. For instance, a biomedical NER model can share parameters while using separate output layers for various datasets to deal with the style gap [12, 20] (see the sketch below). When ready-made datasets are unavailable, KGs can help generate more silver data for training, such as utilizing the KG shortest dependency path for relation extraction augmentation [84]. Further, different tasks can also benefit each other, such as the several language understanding tasks (biomedical NER, sentence similarity, and relation extraction) in the BLUE benchmark [71]. Similarly, when dealing with non-textual biomedical materials, we can conduct multi-task adaptation that requires the models to understand different properties of the same substances. For example, a molecular encoder reads SMILES strings and learns comprehensive capability on five different molecule property classification tasks [53], and the knowledge in these tasks helps improve the performance on each of them.
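A minimal sketch of a shared encoder with dataset-specific heads is shown below, assuming PyTorch and a Hugging Face-style encoder; the dataset names and label sizes are illustrative.

```python
# A sketch of multi-task adaptation: a shared encoder with one output layer
# per dataset; the dataset names and label inventory sizes are illustrative.
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, encoder, hidden_size, task_label_sizes):
        super().__init__()
        self.encoder = encoder                 # shared biomedical PTM
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, n) for name, n in task_label_sizes.items()}
        )

    def forward(self, input_ids, attention_mask, task_name):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Route the shared representation to the head of the current dataset.
        return self.heads[task_name](hidden)

# e.g., separate tag vocabularies for two NER corpora with different styles:
# model = SharedEncoderMultiTask(ptm, 768, {"bc5cdr": 5, "ncbi_disease": 3})
```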

Multi-task Pre-training

Pre-training itself is a knowledge-guided method, which we introduce in the next subsection. When it comes to multi-task pre-training, with knowledge from KGs or expert ontologies, we can create extra data and conduct knowledgeable pre-training tasks. The domain-specific PTMs mentioned above, such as SciBERT and BioBERT, simply keep the masked language modeling training strategy. To introduce more knowledge, biomedical PTMs with specially designed pre-training tasks have been proposed. One instance is masked entity prediction: e.g., MC-BERT [100] is trained with Chinese medical entities and phrases masked, instead of randomly picked characters, with the assistance of biomedical KGs (see the sketch below). Another instance is entity detection and linking: e.g., KeBioLM [97] annotates a large-scale corpus with the SciSpacy [66] tool and introduces entity-oriented tasks during pre-training, essentially integrating the entity understanding capabilities of the annotation tool. PTMs enhanced by extra pre-training tasks usually show much better performance on the corresponding downstream tasks. In short, multi-task pre-training implicitly injects knowledge from KGs/ontologies or ready-made annotation tools, and this can improve the capability of PTMs in related aspects.
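The following toy sketch illustrates masked entity prediction: whole entity spans are masked instead of random tokens, so the model must recover complete terms from context; in practice, the spans would come from a KG/ontology matcher or an annotation tool.

```python
# A toy sketch of masked entity prediction: mask every token inside a
# recognized biomedical entity span, so the model must recover whole terms.
def mask_entities(tokens, entity_spans, mask_token="[MASK]"):
    masked = list(tokens)
    labels = [None] * len(tokens)          # only masked positions get labels
    for start, end in entity_spans:        # end is exclusive
        for i in range(start, end):
            labels[i] = masked[i]
            masked[i] = mask_token
    return masked, labels

tokens = ["aspirin", "inhibits", "cyclooxygenase", "activity", "."]
print(mask_entities(tokens, [(0, 1), (2, 3)]))
# (['[MASK]', 'inhibits', '[MASK]', 'activity', '.'],
#  ['aspirin', None, 'cyclooxygenase', None, None])
```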

Comparing the two approaches above, we can see that multi-task adaptation is a more direct way to change the optimization target, and therefore the introduced datasets have to be of high quality and highly related to the target data. In contrast, the requirement for multi-task pre-training is less stringent, since pre-training is conducted on a sufficiently large corpus that is insensitive to small disturbances; however, the assistance of the pre-training tasks is also less explicit and remarkable compared with multi-task adaptation.

12.3.4 Parameter Transfer

One of the most common paradigms of transfer learning is the pre-training-fine-tuning paradigm, through which data-driven deep learning systems can be applied to specific domains that may lack annotated data. The knowledge learned from the source domain corpus/tasks can help improve performance on the target domain tasks. Taking PTMs as an example, they transfer commonsense, linguistic knowledge, and other useful information from the large-scale pre-training corpus to the downstream tasks. We now discuss two types of parameter transfer: between different data domains and between tasks.

Cross-Domain Transfer

Models pre-trained in the general domain are frequently transferred to the biomedical domain, and two of the most common scenarios are the processing of natural language documents and images. For example, a model pre-trained on ImageNet can better understand medical images and perform melanoma screening [61]. Compared with randomly initialized models, PTMs such as BERT can also achieve satisfying performance when fine-tuned on biomedical text processing datasets.

Nevertheless, as more biomedical corpora are collected, we no longer have to rely on general-domain pre-training. Experimental results have shown that domain-specific pre-training brings a more obvious improvement than general-domain pre-training [64]. In fact, each domain may have its own characteristics; for example, some empirical results in the biomedical domain show that pre-training from scratch gains more than continual pre-training of general-domain PTMs [32], which is contrary to popular belief and awaits further exploration.

Cross-Task Transfer

Models can be tuned on other tasks or styles of data before being transferred to the target task, with the knowledge learned from other tasks contained in the initialized parameters. In the biomedical domain, the high cost of data annotation limits the scale of gold-standard samples labeled by human experts. Some methods can generate large-scale silver datasets automatically, such as distant supervision, which assumes that a piece of text/image expresses an already-known relation as long as the head and tail entities appear in it (see the sketch below). Sometimes it is too aggressive to directly change the optimization target; instead, we can use cross-task transfer to utilize the knowledge of the introduced task more softly. Pre-training on silver-standard corpora and then tuning on gold-standard datasets has proven effective [29]. Another example is cross-species transfer learning: the underlying biological laws of different species have similarities, so biological data from other species can be used for pre-training before fine-tuning with data from the target species, achieving higher accuracy for DNA sequence site prediction [52, 56].
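A toy sketch of distant supervision is given below: a sentence is labeled with a KG relation whenever both of its entities appear, producing noisy silver data that later filtering or gold-data fine-tuning is expected to correct; the triple and sentence are made up for illustration.

```python
# A toy sketch of distant supervision: if both entities of a known KG triple
# appear in a sentence, label the sentence with that relation as silver data.
def distant_supervision(sentences, kg_triples):
    silver = []
    for sentence in sentences:
        text = sentence.lower()
        for head, relation, tail in kg_triples:
            if head.lower() in text and tail.lower() in text:
                silver.append((sentence, head, tail, relation))
    return silver

kg = [("aspirin", "treats", "headache")]
corpus = ["Aspirin is commonly taken to relieve a mild headache."]
print(distant_supervision(corpus, kg))
```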

Summary

To sum up, knowledge-guided NLP methods are widely used in biomedical tasks. Parameter transfer, for example, can be easily conducted, has proven useful in various scenarios, and has become an essential paradigm. For textual material processing, the structured biomedical expert knowledge in KGs is suitable for providing augmented input and designing better objective functions. For non-textual material processing, architecture reformulation is usually necessary due to the differences in data characteristics between various forms of raw materials. Some special materials naturally provide clues for objective regularization, such as the multiple properties of a given molecule. The satisfying performance achieved by the above methods inspires us to emphasize the significance of knowledge-guided biomedical NLP.

12.4 Typical Applications

In this section, we explain the practical significance of biomedical knowledge representation learning through three specific application scenarios. Literature processing is a typical scenario for biomedical natural language material processing, and retrosynthetic prediction focuses more on biomedical language (chemical language) material processing. Both applications belong to AI for science, attempting to search a large space and collect useful information to improve the efficiency of human researchers. We then discuss diagnosis assistance, which is of high practical value in our daily life.

12.4.1 Literature Processing

The biomedical literature is expanding rapidly, and it is hardly possible for researchers to keep pace with every aspect of biomedical knowledge development. We provide an example of a literature processing pipeline in Fig. 12.8 to show how biomedical NLP helps improve our efficiency. We divide the pipeline into four stages: literature screening, information extraction, question answering, and result analysis.

Fig. 12.8 A possible pipeline for biomedical literature processing. (Text in the example is taken from [44])

Literature Screening

In the usual academic search process, we first screen the mass of literature returned by a search engine. We require the information retrieval model to return a relevance ranking according to the query conditions, which may describe the type and age limit of the document, the entities or relation pairs of concern, and other details. Echoing the importance of the biomedical terms mentioned above, the document representations in biomedical information retrieval models sometimes emphasize the key biomedical terms in the documents and queries for better matching [1]. A toy example of relevance ranking is given below.
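As a toy example of relevance ranking, the sketch below scores abstracts against a query with TF-IDF cosine similarity using scikit-learn; production biomedical retrieval systems would use term-aware neural retrievers instead, but the interface is similar.

```python
# A minimal sketch of relevance ranking for literature screening: score each
# abstract against the query with TF-IDF cosine similarity. The abstracts and
# query are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "EGFR mutations predict response to tyrosine kinase inhibitors.",
    "A survey of convolutional networks for image classification.",
]
query = "EGFR mutation drug response"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(abstracts)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_matrix)[0]
ranking = scores.argsort()[::-1]   # indices of abstracts, most relevant first
print(ranking, scores)
```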

Information Extraction

We have already introduced some significant tasks for biomedical information extraction, such as term recognition, linking, and relation extraction. After we get the targeted literature by screening, we have to mine the text, extract the useful information, and convert it into a structured form just as we do in those extraction tasks. This stage usually relies on the knowledge-transferred PTMs reading and understanding the long documents.

Result Analysis and Question Answering

We may also care about advanced meta-relations between the extracted structured knowledge items or facts. An example is performing a meta-analysis of clinical randomized controlled trials [4], which provides some of the most convincing evidence in evidence-based medicine. The process of inductively analyzing the results of different trials does not necessarily need to be fully automated; we expect the AI system to help with quality assessment and conclusion highlighting, and thereby largely improve our efficiency. Based on the analysis results, we may even get assistance from conversation systems generating reasonable responses to medical questions and providing effective suggestions for further research.

12.4.2 Retrosynthetic Prediction

Organic synthesis is an essential application of modern organic chemistry and plays an important role in drug discovery, material science, and other fields. To design synthetic routes for target molecules more efficiently, AI systems are applied to chemical reaction reading, such as the reaction classification task. Further, we expect the systems to achieve deep comprehension of the reactions and therefore generate single-step reaction predictions. Eventually, the multi-step retrosynthesis task, reasoning out the synthetic route for a given target product, can also be finished automatically with the help of extra information from knowledge bases or ontologies.

Chemical Reaction Classification

Machine learning methods can help researchers analyze large-scale reaction records and summarize useful reaction templates, which are a significant form of chemical knowledge [18]. These templates can further guide human researchers or AI systems in designing synthetic routes.

Single-Step Reaction Prediction

In recent years, models such as Transformers have been pre-trained on large-scale reaction corpora and proven effective at predicting single-step reactions without the guidance of templates [91].

Multi-step Reaction Prediction

For predicting multi-step reactions, most current methods search for reasonable routes based on the already-known reaction knowledge in knowledge bases [13] (a toy sketch is given after Fig. 12.9). With the development of biomedical deep learning models, we may also explore end-to-end generation for multi-step retrosynthesis in the future, as shown in Fig. 12.9. Specifically, the heuristic algorithm for searching routes, the querying of knowledge bases, and other operations may all be finished by unified models guided by chemical knowledge.

Fig. 12.9 A possible solution for automatic multi-step retrosynthesis. Icons are bought or freely downloaded from IconFinder
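The following toy sketch illustrates knowledge-base-driven multi-step retrosynthesis in its simplest form: a target is recursively decomposed using a table of known single-step reactions until only purchasable building blocks remain; the reaction table and molecule names are purely illustrative, and real planners rely on learned single-step models and heuristic search.

```python
# A toy sketch of knowledge-base-driven multi-step retrosynthesis: recursively
# break a target into precursors using a table of known single-step reactions
# until only purchasable building blocks remain. All names are illustrative.
KNOWN_REACTIONS = {            # product -> one set of precursors
    "drug_X": ["intermediate_A", "intermediate_B"],
    "intermediate_A": ["building_block_1", "building_block_2"],
}
PURCHASABLE = {"intermediate_B", "building_block_1", "building_block_2"}

def plan_route(target, depth=0, max_depth=5):
    if target in PURCHASABLE or depth >= max_depth:
        return [target]
    route = [target]
    for precursor in KNOWN_REACTIONS.get(target, []):
        route.append(plan_route(precursor, depth + 1, max_depth))
    return route

print(plan_route("drug_X"))
# ['drug_X', ['intermediate_A', ['building_block_1'], ['building_block_2']],
#  ['intermediate_B']]
```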

12.4.3 Diagnosis Assistance

There exists a huge demand for diagnosis assistance. The scarcity of medical resources in some areas calls for AI systems that provide patients with auxiliary knowledge for simple daily situations. This can reduce the pressure on medical resources and improve the work efficiency of hospital systems.

We first take a look at several basic tasks in diagnosis assistance. The most practical application is automatic triage: the system is fed with symptom descriptions from patients and predicts the suitable clinic, which is essentially a disease classification problem (see the toy sketch below). A similar task is medicine prescription, which requires processing more complex diagnostic information (including the text of complaints, quantified findings, and even images) and providing advice with the aid of medical knowledge. Further, doctor-patient conversation is a challenging task due to the gap between the colloquial style of patients and the standard terms and structured items in KGs: the system must first recognize the key information and perform linking, and then provide correct and helpful knowledge with good interpretability and readability.
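A toy sketch of automatic triage as text classification is given below, using a scikit-learn pipeline; the symptom descriptions and clinic labels are made up for illustration, and a deployed system would rely on a biomedical PTM and far more data.

```python
# A toy sketch of automatic triage as text classification: map a patient's
# symptom description to a clinic label. The data is made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

complaints = [
    "chest pain and shortness of breath",
    "itchy rash on both arms",
    "blurred vision and eye pain",
]
clinics = ["cardiology", "dermatology", "ophthalmology"]

triage_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
triage_model.fit(complaints, clinics)

print(triage_model.predict(["sudden chest pain when climbing stairs"]))
```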

Since safety is significant for issues related to medical care, the assistance systems have to be supported by plenty of knowledge and provide explainable suggestions. Incorporating knowledge representations with text representations achieves significantly better performance on the diagnosis assistance task [51].

12.5 Advanced Topics

We have introduced the current developments in biomedical knowledge representation learning. There are several points of consensus in biomedical NLP from which we can further discuss and draw inspiration about future trends. We have discussed the significance of high-quality training data for current deep learning biomedical systems, and data scarcity motivates research in two directions: guiding the models with knowledge so that they adapt with few data, or incorporating different data forms from multiple sources. Besides, the black-box property of deep learning systems brings challenges for domain research, since biomedical applications are closely related to human life and emphasize safety and ethical justification. Next, we elaborate on these two solution paths and one main concern.

Knowledgeable Warm Start

There is a term in the field of recommendation algorithms called the cold-start problem [78], which describes impaired performance when user history is lacking. Extended to more deep learning applications such as biomedical NLP, we also face the cold-start challenge under data-scarce scenarios and often alleviate the problem with the help of transfer learning or other methods. For biomedical NLP tasks, data annotation is difficult, and we always have few supervision signals for model training. Therefore, it becomes more important to achieve a warm start for training biomedical NLP systems.

As mentioned above, knowledge can guide deep learning systems in several ways even when data are comparatively plentiful, such as biomedical PTMs transferring linguistic and commonsense knowledge to help achieve a warm start. When it comes to low-resource scenarios, there have been a few explorations. Knowledge-aware modules, such as self-attention layers introducing external KGs, are designed for biomedical few-shot learning [96]. Special tuning strategies such as entity-aware masking have also been applied and proven effective in low-resource settings [72]. Still, the knowledgeable warm start problem is rarely discussed in a targeted manner, or even clearly raised, although it is prevalent in biomedical NLP tasks. We believe it deserves more attention and research.

Cross-Modal Knowledge Processing

Though the annotated datasets are small-scale, we have various forms of biomedical data that are linked to each other by biomedical knowledge. Apart from regular cross-modal tasks (about which we can learn more in Chap. 7) such as medical image captioning, other types of materials can also be processed versatilely. For example, natural language and chemical language can describe the same chemical entities and may provide complementary information from different perspectives. KV-PLM [99] has shown that the connections between natural language descriptions and molecular structures can be modeled in an unsupervised manner through pre-training (Fig. 12.10). It can even surpass human professionals on the molecular property comprehension task and reveals its potential in drug discovery. Follow-up works further incorporate other materials, such as molecular graphs, with the text [83].

Fig. 12.10 KV-PLM model bridging molecular structure expressions and natural language descriptions. This figure is taken from the original paper [99] with CC BY 4.0 license (https://www.nature.com/articles/s41467-022-28494-3)

Different expressions for biomedical terms have diverse emphases. Bridging them together and capturing the mapping relations between various data forms through a large number of observations, just as humans do, is a form of meta-knowledge learning that enables a deeper understanding of terms while alleviating data scarcity issues. As long as we can design tokenizers to handle different structures uniformly, the advantages of data-driven deep learning systems can be carried forward.

Interpretability, Privacy, and Ease of Use

There exist some other concerns about biomedical NLP. The first is the interpretability problem, which we have discussed in Chap. 8. Most deep learning systems are black boxes with poor interpretability, and this leads to distrust of automated decision-making, especially in medical scenarios closely related to human lives. Directly predicting a prescription without providing symptom analysis and disease diagnosis makes it hard for users to assess the credibility of the recommendations. This concerns not only safety but also ethical problems, including accident liability determination. Some researchers already focus on the interpretability of biomedical NLP due to its importance [60].

The second is the privacy problem. The ethical controversy over privacy always exists when we talk about AI development. For example, the genetic sequence training data of deep learning models may be leaked by privacy attacks, and the genetic traits and disease information of system users may be illegally sold. Some methods, such as private aggregation of teacher ensembles, can alleviate the privacy leakage problem [69], while more effort is still needed to solve it.

Third, as assistance tools for domain research, biomedical NLP systems should be designed to be as easy to use as possible. Some toolkits and online demos have been developed [103], while most of them still impose quite high requirements on users' devices and programming skills. There is a huge market for user-friendly platforms, and we hope the AI community will implement useful aids as soon as possible.

12.6 Summary and Further Readings

In this chapter, we discuss representation learning for biomedical NLP. As an emerging interdisciplinary field, biomedical NLP has undergone rapid development in recent years, especially after deep learning methods such as PTMs appeared. We first introduce knowledge representation and acquisition for biomedical materials, including natural language text materials and other materials, where the latter adapts advanced NLP algorithms and models to biomedical scenarios. Further, we explain the knowledge-guided methods in the biomedical domain from four aspects: input augmentation, architecture reformulation, objective regularization, and parameter transfer. Future directions in this field have also been discussed.

For a further understanding of biomedical knowledge representation learning, we recommend reading surveys of the early works [62] and the comprehensive analysis of PTMs [32], which represents recent results.