Perspectives on Supercomputing and Artificial Intelligence Applications in Drug Discovery

This review first outlines how science and technology evolved over the last century into today's high-throughput science and technology, driven by the Nobel-Prize-level inventions of combinatorial chemistry, the polymerase chain reaction, and high-throughput screening. This evolution has produced big data in the life sciences and in drug discovery. Big data demands supercomputing in biology and medicine, although computational complexity remains a grand challenge for the sophisticated biosystems encountered in drug design, even in this supercomputing era. To resolve these real-world issues, artificial intelligence algorithms (specifically machine learning approaches) were introduced and have demonstrated their power in discovering structure-activity relations hidden in big biochemical data. In particular, this review summarizes how conventional machine learning algorithms have been modernized by combining non-numeric pattern recognition with deep learning, and how the resulting methods have resolved drug design and high-throughput screening problems. The review ends with perspectives on computational opportunities and challenges in drug discovery, introducing new drug design principles and the modeling of how DNA is packed with histones into micrometer-scale space, an example of how a macrocosmic object enters the microcosmic world.


Big Data and Supercomputing Challenges in Drug Discovery
In the last century, three cutting-edge inventions significantly changed biomedical science and technology: combinatorial chemistry (CC), the polymerase chain reaction (PCR), and high-throughput screening (HTS). CC was invented by Robert Bruce Merrifield, who won the 1984 Nobel Prize for solid-phase synthesis [26]. It made high-throughput synthesis possible [19]: a method of scientific experimentation in which robotics, data processing and control software, liquid handling devices, and sensitive detectors allow a researcher to quickly make millions of chemicals for biological tests. PCR was invented by Kary Banks Mullis, who won the 1993 Nobel Prize [31], and it expedited the Human Genome Project. HTS was invented by Donald J. Cram, Jean-Marie Lehn, and Charles J. Pedersen, who jointly won the 1987 Nobel Prize in Chemistry for their development and use of molecules with structure-specific interactions of high selectivity. HTS significantly accelerated the screening of huge numbers of compounds against biological targets. These inventions triggered high-throughput science and technology and revolutionized pharmaceutical discovery and development, because chemical compounds and biopolymers can now be made, and their biological properties validated, in a high-throughput manner. Consequently, we now face big data and supercomputing challenges.
Drug discovery and development involve the following major processes: molecular design, biological or chemical synthesis, molecular structure elucidation and pharmaceutical analysis, pharmaceutical target identification and validation, drug screening, preclinical experiments and clinical trials, pharmacokinetic (PK) and pharmacodynamic (PD) analyses, disease diagnosis, and clinical drug applications. Each process involves instrumental measurements that produce big data. These data are not only "big" (volume, from GB to PB), but are also stored in many different formats (variety) and require prompt analysis (velocity).
The main sources contributing to big data in drug discovery are:
1. High-throughput experiments. High-throughput syntheses generate large amounts of data describing molecular structures and properties, and high-throughput screening campaigns generate large amounts of data on the relations between compounds and their biological targets.
2. Health information and office automation. These resources contain patient information covering demographics, administration, health status and risks, medical history, current management of health conditions, and outcomes.
3. Scientific publications, patents, and databases. Publications in the life sciences grow rapidly. PubMed collects more than 30 million biomedical articles from more than 7,000 journals; by August 2020, the Chemical Abstracts Service (CAS) of the American Chemical Society had collected more than 100 million compounds and 64 million gene sequences.
These big data bring the following challenges:
1. Data storage. Petabytes (10^15 bytes) of digital information rely on cloud storage; annotation, curation, and quality assurance of chemical and biological data are challenging.
2. Visualization. Small molecules and biopolymers are inherently described as graphs. These objects are usually converted into numbers (descriptors), so a molecule is defined as a point in a multi-dimensional space. Visualizing it requires dimension reduction approaches (such as principal component analysis (PCA) and nonlinear dimensionality reduction techniques) and metadata generation techniques.
3. Data mining. With high-dimensional data, scientists face classification problems: molecules are classified into two or more clusters corresponding to their phenotypes. Moreover, one needs to understand the relations between the key factors, features, or chemotypes and a specific phenotype. The real challenges are that (a) the relations between features and phenotypes are not classical analytic functions; (b) in most situations, the features of an entire molecule are not related to its phenotypic property; and (c) a local feature or substructure of a molecule can be the key to a phenotypic property, but there are countless ways to partition a molecular structure into substructures. That is why so many data mining tools have been developed, such as clustering algorithms, decision trees, support vector machines (SVM), and artificial neural networks (ANN).
4. Computational complexity. The most precise theory for studying a molecular system is quantum chemistry. However, the computational complexity of quantum chemistry algorithms is so high that some problems may remain intractable even for a quantum computer [38]. The situation is worse when a huge number of molecules interacting with a protein must be handled. To identify drug targets for a drug lead, multiple sequence alignment techniques are required. The computational complexity of sequence alignment algorithms ranges from O(m*n) to O(n^2) [5]. To identify privileged substructures responsible for a biological activity, substructure matching algorithms are applied. The computational complexity of these algorithms is usually polynomial [39].
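The O(m*n) cost quoted above for sequence alignment can be made concrete with a small sketch. The code below scores a global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme); it is an illustration only, and the function name and the match, mismatch, and gap values are assumptions chosen for this example, not taken from the cited work.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (Needleman-Wunsch).

    Every cell of an (m+1) x (n+1) table is filled once, which is where
    the O(m*n) running time comes from. Only two rows are kept in memory.
    The scoring values are illustrative assumptions.
    """
    m, n = len(a), len(b)
    # prev[j]: best score for aligning the first i-1 letters of a with b[:j]
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # best of: align the two letters, gap in b, gap in a
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]


print(alignment_score("ACGT", "AGT"))  # three matches and one gap
```

Multiple sequence alignment builds on this pairwise step, so its cost grows further with the number of sequences.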
Traditional drug discovery did not generate big data; modern instrumentation and automation have changed the situation. With micro-chip technology, in vivo data can be collected from model animals 24 hours a day to monitor drug action in situ. New drug discovery technologies such as surface plasmon resonance (SPR, measuring protein-ligand affinity with optics) [18], isothermal titration calorimetry (ITC, measuring protein-ligand affinity through entropy and enthalpy) [29], and saturation transfer difference NMR spectroscopy (STD-NMR, measuring protein-ligand affinity with nuclear magnetic resonance) [27] allow us to acquire unprecedented protein-ligand interaction data. Omics technologies (such as pharmacoproteomics and pharmacometabonomics), high-performance computing, and cloud technologies are indispensable components of a modern drug discovery service platform (Fig. 1). Continuously monitoring the pharmaceutical efficacy of a compound in vivo over 2-3 months of administration results in 3.5 PB of physiological and pharmacological data, from which the efficacy, dosage, and toxicity can be determined.
Continuously monitoring cell changes (such as the effect of a drug on cell activity, the drug distribution, altered cell behavior, proliferation, or apoptosis) while cells are incubated with a compound results in 1 PB of data for tracking 10 traits in 1000 cells over 24 hours.
Drug discovery processes involve data that come from sources ranging from patients to various devices and that are stored in many different formats. These data require different search engines and approaches for retrieval and interpretation, and they eventually lead to personalized diagnosis and treatment scenarios (Fig. 2).
Intrinsically, modern drug discovery seeks macrocosmic solutions by simulating microcosmic phenomena with large amounts of experimental data. It is therefore a multi-scale simulation process that spans time scales (from femtoseconds to hours or days) and space scales (from nanometers to meters), at changing resolutions (from electron orbitals to molecular machines), using various theories and methods (from density functional theory to biopolymer physics) [11].
Figure 2. Data, search engines, and mining tools involved in the drug discovery process
The computational complexity of drug discovery therefore stems from the complexity of the molecular systems. In many cases, the problem can be eased by parallel computing technology (also known as high-performance computing, HPC), provided the problem is parallelizable. For example, with molecular dynamics-based virtual screening (MDBV), a state-of-the-art HPC system can be 600 times faster than an eight-core PC server in screening a typical drug target (which contains about 40,000 atoms). Careful design of the GPU/CPU architecture can also reduce HPC costs [15].
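Virtual screening parallelizes well because each compound can be scored independently of the others. The sketch below shows only that pattern; `toy_score` is a hypothetical placeholder for an expensive docking or molecular dynamics score, and `screen`, `workers`, and `top_k` are names invented for this illustration. A thread pool keeps the example portable; for CPU-bound scoring in Python, a process pool or a cluster scheduler would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(smiles):
    """Hypothetical stand-in for an expensive docking or MD-based score."""
    return -0.1 * len(smiles)  # placeholder only: more negative is "better"


def screen(library, score_fn=toy_score, workers=4, top_k=3):
    """Score every compound independently, then keep the top_k best scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so scores line up with the library
        scores = list(pool.map(score_fn, library))
    ranked = sorted(zip(library, scores), key=lambda pair: pair[1])
    return ranked[:top_k]


hits = screen(["CCO", "c1ccccc1", "CC(=O)O", "C"], top_k=2)
print(hits)
```

Because no compound depends on another, the work can be split across as many workers as the hardware offers; the final ranking is the only sequential step.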
A successful virtual drug screening campaign relies on a properly selected compound library. Brute-force random virtual screening can fail even with the highest-performance computing facility. Therefore, artificial intelligence (AI) applications are urgently needed in pharmaceutical studies.

Artificial Intelligence and Drug Discovery
The essence of drug discovery is to identify a molecule that interacts with its designated biological target from a compound library that contains millions of molecules. To do this, we have to understand the structure-activity relation (SAR). Here, the structure in SAR actually means substructure. A drug molecule can be considered a molecular machine consisting of various functional parts (also termed substructures, fragments, or chemotypes). How to define the functional parts has been a puzzle for many years. Many methods have been proposed, such as empirical methods [13] and computational rule-based methods [7,12,28,40]. There is no perfect way to partition a compound library into substructures. Therefore, people have also explored other representations, such as molecular descriptors [30], atom pairs [8], and fingerprints [6].
Conventionally, to predict whether a species (for example, a natural substance) has a biological activity, scientists have to extract moieties from the substance and determine the chemical structures of the active ingredients (represented as topologies, 3D shapes, or static surfaces). The chemical structures are then converted into a numeric array (called molecular descriptors or fingerprints), and various mathematical models are applied to the data to generate predictive models. Finally, the models yield a prognosis of whether the substance is a candidate to become a drug (Fig. 3).
Figure 3. Flow-chart for conventional structure-activity predictions
Molecular structure representations can be converted into various molecular descriptors such as substructural fragments, scaffolds, atom pairs (paths), topological indexes, physical, biological, or chemical properties, and fingerprints (bit-maps). The combination of descriptors is chosen according to two principles: (1) each descriptor in the combination has to be significantly associated with the property to be predicted; (2) the descriptors within the combination should be orthogonal to each other. From the descriptor combination data, one can build predictive models with the learning methods shown in Fig. 4. The core of AI is pattern recognition, which is divided into numeric and non-numeric pattern recognition. Markush structure or substructure recognition is non-numeric; the self-organizing map (SOM, also known as the Kohonen network), support vector machine, hierarchical cluster tree, and random forests (also known as random decision forests) are numeric. The common defect of conventional machine learning algorithms is that model performance depends heavily on how the modeler selects and combines the molecular descriptors. Unfortunately, there are no rational rules for choosing and combining molecular descriptors.
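The two principles for combining descriptors (association with the property, orthogonality among descriptors) can be illustrated with a simple greedy filter. This is a minimal sketch under stated assumptions rather than a method from the cited literature: it uses the absolute Pearson correlation as a proxy for both association and non-orthogonality, and the thresholds `min_assoc` and `max_overlap` are arbitrary values chosen for illustration.

```python
import numpy as np


def select_descriptors(X, y, min_assoc=0.3, max_overlap=0.8):
    """Greedy descriptor selection based on Pearson correlation.

    X is a (molecules x descriptors) array and y the property to predict.
    A descriptor is kept if it is associated with y (|r| >= min_assoc) and
    not nearly collinear with an already kept one (|r| <= max_overlap).
    Both thresholds are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    assoc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                      for j in range(X.shape[1])])
    assoc = np.nan_to_num(assoc)  # constant columns give NaN; treat as 0
    selected = []
    for j in np.argsort(-assoc):  # strongest association first
        if assoc[j] < min_assoc:
            break
        overlaps = [abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in selected]
        if all(o <= max_overlap for o in overlaps):
            selected.append(int(j))
    return selected
```

The choice of thresholds changes which descriptors survive, which is one concrete form of the modeler dependence described above.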
To make up for this defect, people have tried many approaches, such as rule-embedded naive Bayesian learning [24], multiple machine learning models [23], and combining recursive partitioning with naive Bayesian learning [35]. It is now recognized that deriving activity-related substructures from a molecule or molecular library depends on the related drug target. In the early days of chemoinformatics, a number of linear notations for molecular structures were developed because computer graphics terminals were lacking at that time. Weininger developed the linear notation system called SMILES (simplified molecular-input line-entry system), which is well accepted internationally [37]. SMILES is an accurate language for molecules: a SMILES notation, or sentence, precisely describes the atomic connectivity in a molecule. Thus, a compound library can be "written" as an article composed of SMILES sentences. A focused compound library for a specific biological target can be viewed as an article written in SMILES sentences under the same title.
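Treating a SMILES string as a sentence starts with splitting it into words (tokens). The sketch below is one possible tokenizer, written for illustration: the regular expression covers bracket atoms, the common organic-subset atoms, bonds, branches, and ring-closure labels, but it is an assumption of this example and not a complete SMILES grammar.

```python
import re

# Alternatives are tried left to right, so "Cl" and "Br" win over "C" and "B".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|%\d{2}"           # two-digit ring-closure labels
    r"|[BCNOPSFI]"       # one-letter aliphatic atoms
    r"|[bcnops]"         # aromatic atoms
    r"|[=#$:/\\\-+.()]"  # bonds, branches, and the disconnection dot
    r"|\d"               # single-digit ring-closure labels
)


def tokenize(smiles):
    """Split a SMILES sentence into tokens; reject characters we cannot parse."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

A compound library then becomes a list of token sequences, which is the form that the language-style models discussed in this section consume.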
This concept is important because it allows substructure-activity relations (SAR) to be derived without predefining substructures. With deep learning approaches, we can work out the SAR or predict drug targets using syntax pattern recognition techniques [25].
As shown in Fig. 5, a chemical structure is converted into a SMILES sentence, which is then mapped onto a reduced vocabulary; a word embedding matrix is calculated and finally fed to a recurrent neural network (RNN) to train a learning model. With a self-attention mechanism, structure-activity and structure-property relations (SAR/SPR) can be discovered through syntax analysis of a chemical linear notation (for example, SMILES) using an interpretable deep learning architecture. The syntax pattern recognition approach has been applied to predicting chemical properties, toxicology, and bioactivity from experimental data sets [2,3,9,10,17,36,44].
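The first stages of this pipeline (vocabulary, embedding lookup, attention weights) can be sketched in a few lines of NumPy. This is not the architecture of the cited work: the recurrent layer is omitted, the embeddings are random and untrained, and scaled dot-product self-attention is used as a simple stand-in for the attention mechanism. All function names are hypothetical.

```python
import numpy as np


def build_vocab(token_lists):
    """Map every distinct token to an integer index (0 is kept for padding)."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def embed(tokens, vocab, table):
    """Look up one embedding row per token: shape (len(tokens), dim)."""
    return table[[vocab[t] for t in tokens]]


def self_attention(E):
    """Scaled dot-product self-attention with queries = keys = values = E."""
    scores = E @ E.T / np.sqrt(E.shape[1])
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ E, weights


# Character-level tokens keep the sketch short; a real tokenizer would
# treat multi-character atoms such as Cl or [NH3+] as single tokens.
sentences = [list("CCO"), list("CC(=O)O")]
vocab = build_vocab(sentences)
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8))  # untrained, random embeddings
E = embed(sentences[1], vocab, table)
context, weights = self_attention(E)
print(weights.shape)  # one row of attention weights per token
```

In a trained model, large entries in a row of `weights` point to the tokens (and hence the substructures) that the model relies on, which is what makes attention-based models interpretable for SAR.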
With the syntax pattern recognition protocol, drug-like, lead-like, or quasi-biogenic molecules can be proposed by a deep learning program. A quasi-biogenic molecule generator (QBMG) has been reported that composes virtual quasi-biogenic compound libraries by means of gated recurrent unit recurrent neural networks. The library includes stereochemical properties, which are crucial features of natural products. QBMG can reproduce the property distribution of the underlying training set while generating realistic, novel molecules outside the training set. The proposed compounds were associated with known bioactivities. Therefore, given a focused compound library for a biological target, a computer can generate novel compounds that are promising candidates for activity against the target [43].
A property of a molecule can be associated with one or more substructures in its structure. For chemical stability prediction, finding one substructure responsible for instability is enough to conclude that the molecule is unstable. A model (DeepChemStable) [22] that employs an attention-based graph convolution network trained on the COMDECOM data (an experimental chemical compound instability data set [45]) was implemented to predict compound instability. The main advantage of this method is that it is an end-to-end model: it does not predefine structural fingerprint features, but instead dynamically learns structural features and associates them through the learning process of an attention-based graph convolution network. The previous ChemStable program (with a conventional machine learning approach) [24] relied on a rule-based method to reduce false negatives. DeepChemStable, on the other hand, reduces the rate of false negatives, which is the greater concern in instability prediction, without using a rule-based method (Fig. 6).
Fragment-based drug design (FBDD) [16] has achieved great success in recent years. Linking fragments to generate a focused compound library for a specific drug target is still a puzzle. A program named SyntaLinker, based on syntactic pattern recognition with deep conditional transformer neural networks, was reported recently. The state-of-the-art transformer links molecular fragments automatically by learning from known structures in medicinal chemistry databases (such as ChEMBL). In the past, linking fragments was viewed as connecting substructures that were predefined by empirical rules. In SyntaLinker, however, the rules for linking fragments are learned implicitly from known chemical structures by recognizing syntactic patterns embedded in SMILES notations. With deep conditional transformer neural networks, SyntaLinker can generate molecular structures from a given pair of fragments and additional restrictions [41].
Syntactic pattern recognition has also been applied to predicting chemical reaction feasibility. The copper(I)-catalyzed alkyne-azide cycloaddition (CuAAC) reaction is a major click chemistry reaction [20] and is widely employed in drug discovery. However, the success rate of the CuAAC reaction is not as high as expected. A recurrent neural network (RNN) model was reported to predict its feasibility. The authors designed and synthesized a structurally diverse library of 700 compounds with the CuAAC reaction to obtain experimental data. A bidirectional long short-term memory model with a self-attention mechanism (BiLSTM-SA) was then built. The model achieved a total accuracy of 80%. Density functional theory investigations were conducted to provide evidence for the correlation between bromo-α-C hybrid types and the success rate of the reaction [32].

Perspectives on Computational Opportunities and Challenges in Drug Discovery
The Nobel Prize in Chemistry 2016 was awarded jointly to Jean-Pierre Sauvage, Sir J. Fraser Stoddart, and Bernard L. Feringa for the design and synthesis of molecular machines [33]. This can be viewed as an overture to the era of artificial molecular machines. So far, chemists have focused on the mechanical aspects of artificial molecular machines [1,42]. A drug molecule can also be viewed as an artificial molecular machine that consists of a number of parts (fragments) for regulating biological targets. Thus, the essential questions for drug design methodology become: (1) what are the fragments of a drug molecule for its target? (2) how can the fragments be assembled to make (synthesize) a drug molecule? (3) how can the assembled molecules be validated biologically? FBDD, click chemistry (combinatorial chemistry), and HTS are the current answers to these questions, respectively. The drug discovery process is similar to a machine invention process (Fig. 7). Drug discovery is much more sophisticated than designing and making a machine in the macrocosm, because a drug molecule has to regulate even more complicated biological machines in the microcosm [14,21]. The main challenges for a drug designer are: (1) the designed plan for assembling fragments is not necessarily chemically feasible; and (2) the molecules designed against a target do not necessarily function as expected, because most mechanisms of action in living systems are not well understood. Therefore, new drug design approaches and in silico experiments are needed to deal with the big data and computational complexity problems.
Artificial intelligence (AI) techniques will continue to demonstrate their power in drug discovery. In particular, deep learning (DL) techniques have proven useful in deriving SAR from big biochemical data. However, DL assumes that positives and negatives are evenly distributed in the training set and that the number of samples is large enough, whereas typical medicinal chemistry data mainly contain positives, with few or no negatives.
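One common mitigation for uneven label distributions, which the text itself does not prescribe, is to weight each class inversely to its frequency during training. The sketch below computes such weights with the widely used "balanced" heuristic, n_samples / (n_classes * class_count). The function name and the 90/10 label split are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Hypothetical data set: 90 active (1) and 10 inactive (0) compounds.
labels = [1] * 90 + [0] * 10
weights = balanced_class_weights(labels)
print(weights)  # the rare class receives the larger weight
```

With these weights, each class contributes equally to a weighted loss. Reweighting cannot help when negatives are absent altogether; that case calls for additional data or a different problem formulation.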
Drug discovery involves multi-scale computation issues. For example, a typical human DNA molecule, about 1.8 meters long (visible in the macrocosm), has to be tightly packed to fit into the micrometer-scale space of the cell nucleus (in the microcosm). We do not have a convincing theory to explain how DNA enters the microcosmic world from the macrocosmic world with the help of histone proteins. Generating a model and simulating this process is a grand computational challenge. Interestingly, a recent report claimed that histones are not just used for packing DNA: they are enzymes that may have helped power eukaryote evolution [4].