The rise of deep learning in drug discovery

Over the past decade, deep learning has achieved remarkable success in various artificial intelligence research areas. Evolved from the previous research on artificial neural networks, this technology has shown superior performance to other machine learning algorithms in areas such as image and voice recognition and natural language processing. The first wave of applications of deep learning in pharmaceutical research has emerged in recent years; its utility has gone beyond bioactivity prediction and it has shown promise in addressing diverse problems in drug discovery. Examples are discussed covering bioactivity prediction, de novo molecular design, synthesis prediction and biological image analysis.


Introduction
Digital data, in all shapes and sizes, is growing exponentially. According to the National Security Agency of the USA, the Internet is processing 1826 petabytes of data per day [1]. By 2011, digital information had grown ninefold in volume in just five years [2]; and by 2020 its amount in the world is expected to reach 35 trillion gigabytes [3]. The high demand for exploring and analyzing big data has encouraged the use of data-hungry machine learning algorithms like deep learning (DL). DL has gained huge success in a wide range of applications such as computer games, speech recognition, computer vision, natural language processing and self-driving cars [4]. It is fair to say that DL is changing our everyday life. In the Gartner-selected top ten technology trends of 2018, AI technologies, represented by DL, were ranked at the top position [5].
Over the past decade, there has been a remarkable increase in the amount of available compound activity and biomedical data [6,7] owing to the emergence of new experimental techniques such as high-throughput screening (HTS) and parallel synthesis [7,8]. How to mine large-scale chemistry data efficiently has become a crucial problem for drug discovery. Larger data volumes in combination with increased automation technology have promoted further use of machine learning. Besides established methods like support vector machines (SVM) [9], neural networks (NN) [10] and random forest (RF) [11], which have been utilized to develop QSAR models for a long time, methods like matrix factorization [12] and DL have started to be used. DL has taken advantage of the increased amounts of data and the continuous increase of available computer power. A difference between most other machine learning methods and DL is the flexibility of the NN architecture in DL. Architectures that will be discussed in this review are convolutional neural networks (CNNs), recurrent neural networks (RNNs) and fully connected feed-forward networks. Single-layer NNs have been used in QSAR modeling for a long time [10], and increasing data size and computational power have made it natural to apply multilayer feed-forward networks for bioactivity predictions. A somewhat surprising development, which could not have been foreseen a few years ago, has been the use of RNNs in de novo design. With the adoption of high-throughput imaging equipment, CNNs, which have gained remarkable success in computer vision, have become a natural choice for biological image processing. The field of applying DL in drug discovery is rapidly progressing with new articles published almost every week. Recently, several reviews on DL applications in computational chemistry and life sciences have been published [13–18].
Here, we focus more on DL applications in drug discovery particularly in the chemoinformatics and biological image analysis domains and highlight DL architectures used so far within drug discovery.

Principles of deep learning
DL is a class of machine learning algorithms that uses artificial neural networks (ANNs) with many layers of nonlinear processing units for learning data representations. The earliest ANN can be traced back to 1943 [19], when Warren McCulloch and Walter Pitts developed a computational model for NNs based on mathematics and algorithms called threshold logic. The basic structure of a modern ANN is represented in Fig. 1 and is inspired by the structure of the human brain. There are three basic layers in an ANN: the input layer, hidden layer and output layer. Depending on the type of ANN, the nodes, also called neurons, in neighboring layers are either fully connected or partially connected. Input variables are taken by input nodes and the variables are transformed through hidden nodes, and in the end output values are calculated at output nodes. The relationship between the input and output values of a hidden unit is illustrated in Fig. 1b. The output value $Y_i$ of node $i$ is calculated as shown in Eq. (1):
$$Y_i = g\left(\sum_j W_{ij}\, a_j\right) \qquad (1)$$
where $a_j$ refers to the input variables, $W_{ij}$ is the weight of input node $j$ on node $i$ and $g$ is the activation function, which is normally a nonlinear function (e.g., sigmoid or Gaussian function) that transforms the linear combination of input signals from the input nodes into an output value. An ANN is trained by iteratively modifying the weight values in the network to minimize the error between predicted and true values, typically through back-propagation methods [20]. The modern ANN algorithm was developed during the 1960s to the 1980s and applications have appeared since then. However, the traditional ANN method suffered from problems such as overfitting and vanishing gradients, and was largely replaced by other machine learning algorithms like SVM [9] and RF [11]. The recent development of DL has given ANN a renaissance. The major difference between DL and traditional ANN is the scale and complexity of the NNs. DL uses larger numbers of hidden layers whereas traditional ANNs normally could only afford one or two hidden layers owing to the limitation of computer hardware in the early days. DL can also afford to use many more nodes in each layer owing to the appearance of more-powerful CPU and GPU hardware. There are also many algorithmic improvements in DL, for example using the dropout [21] and DropConnect [22] methods to address the overfitting problem, applying the rectified linear unit (ReLU) [23] to avoid vanishing gradients and introducing convolutional and pooling layers as novel network architectures to enable the usage of large numbers of input variables. Most of the DL software packages are open source. TensorFlow [24], Caffe [25], PyTorch [26], Keras [27] and Theano [28] are among the most popular DL packages used in the data science community. Here, we briefly introduce several popular NN architectures used in DL (Fig. 2).
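As a minimal illustration of Eq. (1), the sketch below computes the output of each node in one fully connected layer as an activation function applied to a weighted sum of its inputs. The weight and input values are arbitrary numbers chosen for the example, and the helper names (`node_output`, `layer_output`) are hypothetical; the sigmoid and ReLU functions are the two activations named in the text.

```python
import math


def sigmoid(x):
    """Classical sigmoid activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def node_output(weights, inputs, activation):
    """Output of one node: Y_i = g(sum_j W_ij * a_j)."""
    z = sum(w * a for w, a in zip(weights, inputs))
    return activation(z)


def layer_output(weight_matrix, inputs, activation):
    """Apply node_output to every node (one row of weights per node)."""
    return [node_output(row, inputs, activation) for row in weight_matrix]


# Arbitrary example values (assumptions, not taken from the article).
a = [1.0, 2.0, -1.0]            # input variables a_j
W = [[0.5, -0.25, 0.0],         # weights W_0j into hidden node 0
     [1.0, 1.0, 4.0]]           # weights W_1j into hidden node 1

hidden_sigmoid = layer_output(W, a, sigmoid)
hidden_relu = layer_output(W, a, relu)
```

Stacking several such layers, each feeding its outputs to the next, gives the multilayer networks discussed in the rest of this section.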
First is the fully connected deep neural network (DNN) which contains multiple hidden layers and each layer comprises hundreds of nonlinear process units (Fig. 2a). DNNs can take large numbers of input features and the neurons in different layers of a DNN can automatically extract features at different hierarchical levels [29]. Another very popular NN is CNN, which is widely used for image recognition (Fig. 2b). It usually contains several convolution layers and subsampling layers. The convolution layer consists of a set of filters (or kernels) that have a small receptive field and learnable parameters. During the forward pass, each filter is convoluted across the width and height of the input volume, computing the dot product between the entries of the filter and its receptive field in input volume and producing a 2D feature map of that filter. The subsampling layer is used to reduce the size of feature maps. In the end, the feature maps are concatenated into fully connected layers where neurons in neighboring layers are all connected just like in a traditional ANN to give a final output value. Owing to sharing the same parameters for each filter, a CNN largely reduces the number of free parameters learned, thus lowering the consumed memory and increasing the learning speed. It has outperformed other types of machine learning algorithms in image recognition [30].
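The convolution and subsampling steps described above can be sketched in a few lines. The example below is a simplified, single-channel illustration in plain Python: one filter is slid over a small 2D input, the dot product with each receptive field gives a feature map, and a 2x2 max-pooling step stands in for the subsampling layer. The input values, the filter and the choice of max pooling are assumptions made for the example; in a real CNN the filter entries are learned.

```python
def conv2d(image, kernel):
    """Slide one filter over a 2D input (no padding, stride 1).

    Each output entry is the dot product between the filter and the
    receptive field it currently covers.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


def max_pool(feature_map, size=2):
    """Subsample by keeping the maximum of each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        pooled.append([max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size))
                       for c in range(0, len(feature_map[0]) - size + 1, size)])
    return pooled


# Toy 5x5 input and 2x2 filter (arbitrary values for illustration).
image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [0, 3, 1, 0, 1]]
kernel = [[1, 0],
          [0, -1]]

fmap = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool(fmap)        # 2x2 map after subsampling
```

The same four filter weights are reused at every position of the input, which is the parameter sharing that keeps the number of free parameters small.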
One additional variant of an ANN (Fig. 2c) is the RNN. Unlike feed-forward NNs, it allows the connections among neurons in the same hidden layer to form a directed cycle. RNNs can take sequential data as input features, which makes them very suitable for time-dependent tasks like language modeling [31]. Using a technique called long short-term memory (LSTM) [32], RNNs can reduce the vanishing gradient problem. The fourth ANN architecture, shown in Fig. 2d, is called an autoencoder (AE) [33]. An AE is a NN used for unsupervised learning. It contains an encoder part, which is a NN that transforms the information received from the input layer into a limited number of hidden units, coupled with a decoder NN whose output layer has the same number of nodes as the input layer. Instead of predicting labels of input instances, the purpose of the decoder NN is to reconstruct the inputs from a smaller number of hidden units. Typically, the purpose of an AE is nonlinear dimensionality reduction. Recently, the AE concept has become more widely used for learning generative models from data [34]. Below, we illustrate how these DL technologies are applied in drug discovery research.
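To make the recurrent connection concrete, the sketch below implements a plain (Elman-style) recurrent step, in which the new hidden state depends on both the current input and the previous hidden state. This is a simplified assumption for illustration and not the LSTM cell mentioned above, which adds gating on top of the same idea; the weight values and the function names are made up for the example.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    h_new = []
    for i in range(len(h_prev)):
        z = sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(z))
    return h_new


def run_rnn(sequence, W_x, W_h, hidden_size):
    """Feed a sequence through the recurrence, starting from a zero state."""
    h = [0.0] * hidden_size
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h)
        states.append(h)
    return states


# Arbitrary illustrative weights and a three-step input sequence.
W_x = [[0.5, -0.5], [0.25, 0.75]]   # input-to-hidden weights
W_h = [[0.0, 1.0], [1.0, 0.0]]      # hidden-to-hidden (recurrent) weights
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(sequence, W_x, W_h, hidden_size=2)
reversed_states = run_rnn(sequence[::-1], W_x, W_h, hidden_size=2)
```

Because the hidden state is carried from step to step, feeding the same inputs in a different order gives a different final state, which is what makes the architecture suitable for sequential data.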

Application of deep learning in compound property and activity prediction
Machine learning methods including ANN have been applied in compound activity prediction for a long time. Naturally, activity prediction was among the first problems to which DL methods were applied. When compounds are represented by the same number of molecular descriptors, the straightforward method is to use fully connected DNNs to build models. Dahl et al. [35] applied a DNN on the Merck Kaggle challenge dataset using a large number of 2D topological descriptors; and the DNN showed slightly better performance in 13 of the total 15 targets than the standard RF method. Some of the key lessons from the study are: (i) DNNs can handle thousands of descriptors without the need for feature selection; (ii) dropout can avoid the notorious overfitting problem faced by a traditional ANN; (iii) hyper-parameter (number of layers, number of nodes per layer, type of activation functions, etc.) optimization can maximize the DNN performance; (iv) multitask DNN models perform better than single-task models. Mayr et al. [36] reported their multitask DNN models that won the Tox21 challenge on a dataset comprising 12 000 compounds for 12 high-throughput toxicity assays. Similar to Dahl's architecture [35,37], dropout and the ReLU activation function were used in the DNN, and model training was run in parallel on GPU machines. They used a large feature set with static descriptors (3D, 2D descriptors, predefined toxicophores) as well as dynamically generated extended connectivity fingerprint descriptors (ECFP) to enable the DNN to derive features by itself during training. More interestingly, statistical association analysis was done for DNN models exclusively using ECFP, and substructures significantly associated with known toxicophores could be identified in each hidden layer. These benchmark results demonstrate the advantages of a multitask DNN compared with a single-task DNN and conventional machine learning methods.
Recently, some other benchmark studies were published to further support the conclusion. Ramsundar et al. carried out a systematic study [38] to build multitask DNNs and compare their performance with single-task DNN models. Their results show that multitask models consistently perform better than single-task and RF models. Koutsoukas et al. [39] compared a DNN model with some commonly used machine learning methods, such as SVM and RF, on seven datasets selected from ChEMBL [40]. DNNs were found to statistically outperform (with P value <0.01 based on the Wilcoxon statistical test) other machine learning methods. Lenselink et al. [41] reported another benchmark study comparing DNN with the conventional machine learning methods RF, SVM, naive Bayes and logistic regression, taking protein descriptors into account [i.e., a proteochemometric (PCM) study]. They investigated the performance of various classification models on a dataset comprising 314 767 target-compound interactions. The DNN model turned out to be the best model in terms of BEDROC (Boltzmann-enhanced discrimination of receiver operating characteristic), and multitask and PCM implementations were shown to improve performance over single-task DNNs.
Besides the benchmark studies of DNN, Subramanian et al. [42] reported a study using DNN with 2D topological descriptors to build a predictive BACE activity model and achieved a classification accuracy of 0.82 and a standard error of pIC50 of ~0.53 on the validation set. Aliper et al. [43] built DNN models for predicting pharmacological properties of drugs and for drug repurposing, leveraging transcriptomic data from the LINCS project [44] as well as pathway information. It was shown that, using pathway and gene-level information, DNN models achieved high accuracy in predicting drug indications; hence they could be useful for drug repurposing.
Efforts have also been made in using representation learning (i.e., enabling NNs to learn directly from the molecular structure instead of using predefined molecular descriptors). This idea was first explored by Merkwirth et al. in 2005 [45]. Several years later, two different methods were developed to address the problem. Lusci et al. [46] reported a method that employed a variant of RNN, called UGRNN, which first transforms molecular structures into vectors of the same length as the molecular representation and then passes them to a fully connected NN layer to build models. Bit values in the vectors are learned from the dataset. The UGRNN method was shown to be able to build predictive solubility models that were comparable in accuracy to models built with molecular descriptors. Xu et al. [47] applied the same method to model drug-induced liver injury (DILI). The DL models were built based on 475 drugs and validated on an external dataset of 198 drugs. The best model achieved an AUC of 0.955, exceeding the accuracy of previously reported DILI models.
Another type of method is the graph convolution model. The basic idea is similar to that of the UGRNN method: NNs are employed to automatically generate a molecular description vector, and the vector values are learned by training the NNs. Inspired by the Morgan circular fingerprint method [48], Duvenaud et al. [49] proposed the neural fingerprint method as one of the first efforts in creating a graph convolution model. The workflow of this method can be seen in Fig. 3. First, the 2D molecular structure is read to form a state matrix, containing atom and bond information (based on the bonds attached to the atom) for each atom. The state matrix then goes through a convolution operation via a single-layer NN to generate a fixed-length vector as the molecular representation. The convolution operation can be run at different levels by considering the contribution of neighboring atoms, which is equivalent to the circular fingerprints at different neighboring levels. The vectors generated from different convolution operations first go through a softmax transformation and then are summed up to form the final vector for the compound, which is a neural fingerprint encoding molecular-level information. The neural fingerprints are passed through another fully connected NN layer to generate the final output. The bit values in the neural fingerprint are learned through training and are differentiable. In Duvenaud's three test cases, better results were obtained using neural fingerprints than with Morgan fingerprints and, more importantly, the influential substructures in the graph convolution model can be visualized to interpret the model. The advantage of the graph convolution model is that descriptors are generated automatically during the training, so no predefined molecular descriptor is needed. Such a descriptor is not a general descriptor, but is task-specific and fully differentiable and hence can potentially provide better prediction.
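The neural fingerprint workflow described above can be sketched roughly as follows. This is a simplified illustration under several assumptions and not the implementation of ref. [49]: bond features are omitted, a single weight matrix is used per level regardless of atom degree, tanh is used as the nonlinearity, and the weights are random rather than trained. All function and variable names are hypothetical.

```python
import math
import random


def matvec(vec, matrix):
    """Row vector times matrix; matrix has len(vec) rows."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]


def softmax(values):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def neural_fingerprint(atom_features, neighbors, hidden_weights, output_weights):
    """Sum softmax-transformed atom states over all atoms and all levels."""
    fingerprint = [0.0] * len(output_weights[0][0])
    states = [list(f) for f in atom_features]
    for H, W in zip(hidden_weights, output_weights):
        new_states = []
        for atom, state in enumerate(states):
            # Convolution step: pool the atom's state with its neighbors' states,
            # then pass the result through a single-layer NN.
            pooled = list(state)
            for n in neighbors[atom]:
                pooled = [p + s for p, s in zip(pooled, states[n])]
            new_states.append([math.tanh(v) for v in matvec(pooled, H)])
        states = new_states
        # Each atom writes a softmax-normalized contribution into the fingerprint.
        for state in states:
            contribution = softmax(matvec(state, W))
            fingerprint = [f + c for f, c in zip(fingerprint, contribution)]
    return fingerprint


# Toy three-atom chain (C-C-O) with two-dimensional one-hot atom features.
atom_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors = [[1], [0, 2], [1]]

rng = random.Random(0)  # untrained, random weights for illustration only
n_levels, n_feat, fp_len = 2, 2, 4
hidden_weights = [[[rng.uniform(-1, 1) for _ in range(n_feat)] for _ in range(n_feat)]
                  for _ in range(n_levels)]
output_weights = [[[rng.uniform(-1, 1) for _ in range(fp_len)] for _ in range(n_feat)]
                  for _ in range(n_levels)]

fp = neural_fingerprint(atom_features, neighbors, hidden_weights, output_weights)
```

Every operation in the sketch is differentiable with respect to the weights, which is what allows the fingerprint values to be learned by back-propagation for a given task. Because the contributions are summed over atoms, the result does not depend on the order in which the atoms are numbered.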
Other molecular graph convolution methods were reported by Kearnes [49,50,53,55,56] into a common framework known as a message passing neural network (MPNN) and used the MPNNs to predict quantum chemical properties.
Besides the graph-based representation learning methods, DL methods based on other types of molecular representation were also explored. Bjerrum [57] used a SMILES string as the input to LSTM RNNs to build predictive models without the need to generate molecular descriptors. More interestingly, it was observed that augmenting the dataset by using multiple SMILES strings to represent the same compound achieved better results than using canonical SMILES. Goh et al. [58] applied a CNN on images of 2D drawings of molecules and achieved surprisingly comparable results to DNN models trained on ECFP. Moreover [59], when the images were augmented with some basic chemical information, the model performance was further improved. The capability of learning representations from structures directly without using any predefined structure descriptor is an important feature distinguishing DL from other machine learning methods and it basically makes the traditional feature selection and reduction procedures unnecessary.

De novo design through deep learning
Another interesting application of DL in chemoinformatics is the generation of new chemical structures through NNs. Gómez-Bombarelli et al. proposed a novel method [60] using a variational autoencoder (VAE) to generate chemical structures (Fig. 4). The first step is to use the VAE to do unsupervised learning to map [...]
Fig. 3. Illustration of graph convolutional neural networks (CNNs) [49]. A molecular graph first goes through a convolution operation via a single-layer NN to form a vector of fixed length. The convolution operation can be run at different neighbor levels. The vectors generated from different convolution operations then go through a softmax transformation and are summed up to form the neural fingerprints of the compound. The neural fingerprint is passed through another fully connected NN layer to generate the final output.
RNNs have been very successful in the natural language processing area [31]. Segler et al. [64] and Yuan et al. [65] reported their studies using RNNs to generate novel chemical structures. After training the RNN on a large number of SMILES strings, the RNN method worked surprisingly well for generating new valid SMILES strings that were not included in the training set (Fig. 5). The RNN writes structurally valid SMILES by learning the underlying probability distribution of characters in a SMILES string and, in this case, the RNN can be regarded as a generative model for molecular structures. Segler et al. [64] also explored the possibility of using RNNs to generate target-specific libraries by first training a general prior model and then a fine-tuned focused model through transfer learning on a small set of target-specific active compounds. In a retrospective analysis for testing on two antibioactive targets, their focused models were able to generate 18% unseen true actives for Staphylococcus aureus and 28% for Plasmodium falciparum.
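The character-by-character generation loop used by such models can be sketched as below. A trained RNN would compute the probability of the next character from the entire preceding string through its hidden state; here a small hand-written lookup table that conditions only on the previous character stands in for the network. The table `TOY_MODEL`, its probabilities and the function names are assumptions made purely for illustration.

```python
import random

START, END = "^", "$"

# Hypothetical stand-in for a trained RNN: P(next character | previous character).
# A real model conditions on the whole history via its hidden state.
TOY_MODEL = {
    "^": {"C": 0.7, "N": 0.2, "O": 0.1},
    "C": {"C": 0.5, "O": 0.2, "N": 0.1, "$": 0.2},
    "N": {"C": 0.7, "$": 0.3},
    "O": {"C": 0.5, "$": 0.5},
}


def next_char_distribution(history):
    """Conditional distribution over the next character given the history."""
    return TOY_MODEL[history[-1]]


def sample_smiles(rng, max_len=20):
    """Sample one string: draw a character, append it, feed it back, repeat."""
    history = [START]
    while len(history) <= max_len:
        dist = next_char_distribution(history)
        chars = list(dist)
        char = rng.choices(chars, weights=[dist[c] for c in chars])[0]
        if char == END:
            break
        history.append(char)   # the sampled character becomes the next input
    return "".join(history[1:])


rng = random.Random(42)
samples = [sample_smiles(rng) for _ in range(5)]
```

With this toy table every sampled string is a short chain of C, N and O atoms. In the reported studies the network additionally has to learn ring closures, branches and bond symbols from the training SMILES.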
Jaques et al. [66] applied a reinforcement learning technique, deep Q-learning, together with an RNN to generate SMILES with desirable molecular properties such as cLogP [67] and QED drug-likeness [68]. However, their method needed a reward function that incorporated handwritten rules to penalize undesirable types of structures; without these rules, the reward was exploited and unrealistically simple molecules were generated. To overcome this drawback, Olivecrona et al. [69] proposed a policy-based reinforcement learning approach to tune pre-trained RNNs for generating molecules with user-defined properties. In one test example, in which the model was tuned toward generating compounds predicted to be active against the dopamine receptor type 2, >95% of the generated structures were predicted to be active, including experimentally confirmed actives that had been included in neither the generative model nor the activity prediction model.
The methods described above have demonstrated potential as alternatives to the traditional rule-based approaches for de novo design. However, generative adversarial networks (GANs) and reinforcement learning methods are known to be susceptible to mode collapse (i.e., the models generate only a single solution or a small family of similar solutions). This has been highlighted in a recent survey [70] on de novo structure generation using DL tools. Considerable effort [71,72] has been spent on addressing this issue.

Application of deep learning in predicting reactions and retrosynthetic analysis
Synthesis predictions have a long history dating back to rule-based methods in the 1960s [73]. Very recently some promising results were reported in reaction prediction using DL methods. Although there has been no explicit comparison with other machine learning methods, the results indicated that DL can achieve performance on par with, or superior to, the rule-based methods. Schematically, two types of problems can be addressed with machine learning including DL in reaction informatics. One type is forward reaction prediction, where the products are predicted given a set of reactants, and the other type is retrosynthetic prediction, where given a final product the reaction steps that produce the product are predicted. Coley et al. [74] utilized a NN to rank the candidate products for a set of reactions based on a training set of 15 000 reactions from US patents. The reactions were classified into templates and the trained model ranked the major product first in 71.8%, within the top three in 86.7% and within the top five in 90.8% of cases. To overcome the coverage and efficiency issues faced with the template-based reaction prediction methods, a template-free approach was proposed [75] in a follow-up study by the same research group. They employed the Weisfeiler-Lehman difference network to score the generated candidate reactions and superior performance was achieved compared with reaction template-based methods. Segler et al. [76] used 3.5 million reactions as the training set for a DNN. A top-ten accuracy of 97% for reaction prediction and 95% for retrosynthetic analysis was achieved. In another study [77], they combined policy networks and Monte Carlo tree search for retrosynthetic prediction utilizing a training set consisting of 12 million reactions from scientific literature. Their system could find retrosynthesis plans for twice as many molecules as the rule-based method. Liu et al. [78] used neural sequence-to-sequence models for retrosynthetic prediction. They used 50 000 reactions obtained from US patents to train the network and obtained similar accuracy to rule-based methods.
Fig. 4. Illustration of a variational autoencoder (VAE) method. The encoder neural network (NN) deterministically converts a discrete molecule into a Gaussian distribution. After the latent variables are reparameterized against the Gaussian distribution with given mean and variance, a new point is sampled and fed into the decoder NN. In the generation mode, only the decoder is used to generate a new molecule from the sampled latent point.
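The rank-based accuracies quoted for the reaction prediction studies above (for example, the fraction of reactions whose recorded product appears among the first one, three, five or ten ranked candidates) can be computed as in the following sketch. The candidate lists and product labels are made-up placeholders, and the function name is hypothetical.

```python
def top_k_accuracy(ranked_candidates, true_products, k):
    """Fraction of reactions whose true product is among the k best-ranked candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_candidates, true_products)
               if truth in ranked[:k])
    return hits / len(true_products)


# Placeholder data: for each of four reactions, candidates sorted by model score.
ranked_candidates = [
    ["A", "B", "C"],
    ["X", "Y", "Z"],
    ["P", "Q", "R"],
    ["M", "N", "O"],
]
true_products = ["A", "Y", "R", "K"]   # "K" was never proposed by the model

top1 = top_k_accuracy(ranked_candidates, true_products, k=1)
top2 = top_k_accuracy(ranked_candidates, true_products, k=2)
top3 = top_k_accuracy(ranked_candidates, true_products, k=3)
```

By construction the value can only stay the same or increase with k, which is why top-five and top-ten figures are always at least as high as top-one figures for the same model.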

Application of convolutional neural networks to predict ligand-protein interactions
Assessing the interaction between a protein and a ligand is a crucial part of any molecular docking program, and many scoring functions have been developed based either on force fields or on knowledge from existing protein-ligand complex structures [79]. Inspired by the success of CNNs in image analysis, several studies applying a CNN to score protein-ligand interactions have recently been published. A typical example is the investigation done by Ragoza et al. [80]. The protein-ligand structures were discretized into a grid with a resolution of 0.5 Å. The grid was 24 Å on each side and centered on the binding site. Each atom was described with a function, and atom densities over the grid were generated to form the input matrix. Multilayer CNN models were defined and trained using the Caffe DL framework. The CNN scoring outperformed AutoDock Vina [81] on the CSAR inter-target pose-prediction dataset [82], but performed worse for intra-target ranking of poses. Other studies utilizing CNNs or DNNs have also been published [83–85]. Although some encouraging results have been obtained with convolutional networks, it is not clear whether they will consistently improve results compared with currently used scoring functions.
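The discretization step can be illustrated with the grid dimensions given above: a 24 Å box at 0.5 Å resolution has 48 voxels along each axis. The sketch below maps atom coordinates to voxel indices and spreads each atom over nearby voxels. The Gaussian density function, its width and cutoff, the use of a single channel for all atom types and the example coordinates are assumptions for illustration only and are not taken from ref. [80].

```python
import math

SIDE = 24.0        # box edge length in Å (from the text)
RESOLUTION = 0.5   # voxel edge length in Å (from the text)
N_VOXELS = int(round(SIDE / RESOLUTION))   # voxels per axis


def voxel_index(coord, center):
    """Map a 3D coordinate to voxel indices, or None if it lies outside the box."""
    index = []
    for x, c in zip(coord, center):
        i = math.floor((x - (c - SIDE / 2.0)) / RESOLUTION)
        if not 0 <= i < N_VOXELS:
            return None
        index.append(i)
    return tuple(index)


def voxel_center(index, center):
    """Coordinates (in Å) of the middle of a voxel."""
    return tuple(c - SIDE / 2.0 + (i + 0.5) * RESOLUTION
                 for i, c in zip(index, center))


def density_grid(atoms, center, sigma=1.0, cutoff=2.0):
    """Sparse grid {voxel: density}; sigma and cutoff (Å) are assumed values."""
    grid = {}
    reach = math.ceil(cutoff / RESOLUTION)
    offsets = range(-reach, reach + 1)
    for coord in atoms:
        home = voxel_index(coord, center)
        if home is None:
            continue   # atom outside the box is ignored
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    voxel = (home[0] + dx, home[1] + dy, home[2] + dz)
                    if not all(0 <= v < N_VOXELS for v in voxel):
                        continue
                    d2 = sum((a - b) ** 2
                             for a, b in zip(voxel_center(voxel, center), coord))
                    if d2 <= cutoff ** 2:
                        grid[voxel] = grid.get(voxel, 0.0) + math.exp(-d2 / (2.0 * sigma ** 2))
    return grid


binding_site_center = (0.0, 0.0, 0.0)
# Made-up coordinates; the last atom lies outside the 24 Å box.
atoms = [(0.25, 0.25, 0.25), (1.5, 0.0, -0.75), (30.0, 0.0, 0.0)]
grid = density_grid(atoms, binding_site_center)
```

In a real application one such grid would typically be built per atom type, and the stacked grids would play the role that color channels play in an image.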

Benchmark datasets within chemoinformatics
The rapid advances made in the field of image recognition can be attributed to not only the emergence of novel algorithms but also to the existence of canonical and large datasets. The standardized dataset would allow the community to conveniently benchmark or evaluate developed machine learning methods. The yearly ImageNet Large Scale Visual Recognition Competition (ILSVRC) [86] has seen the birth of many influential CNN architectures.
Although several open-source chemoinformatics datasets [87,88] are available, their impact on machine learning method development is still limited owing to the limited size of those datasets, lack of diverse ways of splitting training and test sets and, more importantly, lack of a standard evaluation platform for proposed new algorithms. Inspired by WordNet [89] and ImageNet [...]
Fig. 5. Structure generation from recurrent neural networks (RNNs). The upper plot shows how the RNN model thinks when generating the structure on the bottom right. The y axis lists all possible tokens that can be chosen at each step, the color represents the conditional probability for the character to be chosen at the current step given the previously chosen characters, and the x axis shows the character that, in this instance, was sampled. The bottom left figure demonstrates how the RNN actually works in the structure-generation mode. At each step a character is sampled based on the conditional probability distribution calculated from the RNN model, and the generated character is then used as the input character for generation of the next character.

Application of deep learning in biological imaging analysis
In the drug discovery process, biological imaging and image analysis are widely used at various stages from preclinical R&D to clinical trials. Imaging enables scientists to see the phenotypes and behaviors of hosts (human or animals), organs, tissues, cells and subcellular components. Through digital image analysis, the hidden biology and pathology, as well as the drug mechanism of action, are revealed. Examples of imaging modalities are fluorescently labeled or unlabeled microscopic images, computed tomography (CT), MRI, positron emission tomography (PET), tissue pathology imaging and mass-spectrometry imaging (MSI). DL has also been applied successfully to biological image analysis, and many studies have reported superior performance compared with classical classifiers. For microscopic images, CNNs have been used [93,94] for segmenting and subtyping individual fluorescently labeled cells, as well as unlabeled images from phase-contrast microscopy [95,96]. Other traditionally laborious tasks from preclinical settings, such as cell tracking [96] and colony counting [97], could also be automated using DL. Images from tissue pathology are typically more complex than fluorescently labeled images owing to the rich tissue morphology. Nevertheless, at the cellular level, the segmentation and classification of individual cells were achieved in breast and colon tissues stained with hematoxylin and eosin (H&E) [98,99]. At the tissue region level, the tumor regions from H&E-stained breast tissue were identified through DL [100], and the extra categories of leukocytes and fat tissue can also be recognized [101]. Beyond basic image segmentation, DL has already been used for histopathological diagnosis with H&E and immunohistochemistry stained tissue [102,103].
DL has also been applied to the analysis of CT [104–106], MRI [107,108] and PET [108] imaging. Besides the popular applications of image segmentation [106,107] and classification [104,105], its utility has also been shown in content-based image retrieval [109], where DL methods were reported to outperform the popular ISOMAP and Elastic Net methods.
For the emerging MSI, similar to the application of DL in tissue pathology, tumor subtyping can be performed by high-resolution matrix-assisted laser desorption/ionization (MALDI) MSI [110]. Given that MSI can visualize the metabolic information of a tissue, sub-regions of a tumor with metabolic heterogeneity can already be detected through DL from desorption electrospray ionization (DESI) MSI [111]. Finally, in a less conventional imaging area, flow cytometry, DL enabled real-time cell classification for high-throughput applications [112]. The training of DNNs for imaging is time-consuming and requires dedicated GPU processing. Furthermore, in the context of high-throughput imaging screening, good-quality training sets are rare. Therefore, image features trained from natural scenes and other datasets were 'borrowed' to perform biological image segmentations and classifications, and robust performances were reported [101,113].

Future development of deep learning in drug discovery
Machine learning methods, and DL in particular, generally need large datasets for training; however, the human brain has the capability of learning from only a few examples. How to learn with only a small amount of available data is therefore one of the hottest topics in machine learning. A DL example of exploiting auxiliary data to improve a model with only a few data points is matching networks [114], which was proposed as a variant of one-shot learning. Improved results were obtained when the auxiliary data were included. Methods like one-shot learning are relevant to drug discovery, where medicinal chemists often work on novel targets with limited data available. Altae-Tran et al. [115] utilized the LSTM method on chemoinformatics datasets to build models with a very small training set and promising results were reported. Very recently, a new type of architecture has been used in DL: memory-augmented neural networks. The first version was the neural Turing machine. This architecture was significantly improved with the differentiable neural computer (DNC) [116]. DNCs have been applied to several problems like question-answering systems and finding the shortest path in graphs. However, these more-advanced architectures have not been applied so far in drug discovery.
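The core idea behind matching-network-style one-shot prediction can be sketched as follows: a query example is compared with a handful of labeled support examples, the similarities are turned into attention weights with a softmax, and the label scores are the attention-weighted votes. In the published method the vector representations are produced by trained embedding networks; in this sketch they are simply given as made-up two-dimensional vectors, and the function names and labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_predict(query, support_vectors, support_labels):
    """Attention-weighted vote of the support labels for one query."""
    sims = [cosine(query, s) for s in support_vectors]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    attention = [e / total for e in exps]      # softmax over similarities
    scores = {}
    for weight, label in zip(attention, support_labels):
        scores[label] = scores.get(label, 0.0) + weight
    return scores


# Assumed, precomputed embeddings of three labeled compounds and one query.
support_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
support_labels = ["active", "active", "inactive"]
query = [0.9, 0.1]

scores = matching_predict(query, support_vectors, support_labels)
predicted = max(scores, key=scores.get)
```

No parameters are fitted to the support set itself; the few labeled examples are used only at prediction time, which is what makes the approach attractive when a new target has very little data.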

Concluding remarks
Machine learning has been used in drug discovery since the late 1990s and has established itself as a useful tool. A recent extension of the machine learning toolbox is DL. In comparison with other methods, DL has a much more flexible architecture, so it is possible to create a NN architecture tailor-made for a specific problem. A disadvantage is that DL in general needs very large training sets. A relevant question is: is DL superior to other machine learning methods? We believe it is still too early to draw any firm conclusion, but the results so far indicate that DL is superior for certain tasks like image analysis and very useful for de novo molecular design and reaction prediction. For tasks with structured input descriptors, DL seems to perform at least on par with other methods. The most relevant example is bioactivity prediction, where DL seems to achieve better performance overall through multitask learning. However, other machine learning methods are also improving. One example is the XGBoost [117] method, which has dominated Kaggle competitions for structured input data [118] since its introduction. Thus, in practice the choice of method used in bioactivity prediction might depend on which method the modeler is most familiar with. If different machine learning methods achieve roughly the same accuracy, the limit of what can be achieved with a machine learning model could depend on the experimental uncertainty of the data and the dataset size rather than on the specific algorithm used.