Introduction

The discovery of molecular structures with desired properties for applications in drug discovery, crop protection, or chemical biology is among the most impactful scientific challenges. However, given the complexity of biological systems and the associated cost of experiments and trials, molecular design is also scientifically very challenging, prone to failure, and inherently expensive and time consuming [1, 2]. To improve our odds and timelines in this process, and to identify good starting points, unbiased incorporation of knowledge through continuous analysis of literature and patents from different scientific fields is required [3]. The number of yearly publications keeps increasing, and good collaboration between scientific experts across disciplines is required to fully evaluate the potential of a hypothesis. The theoretical space of chemistry, even when limited by molecular size, is huge [4] and dramatically exceeds what we can assess experimentally or even computationally. How can we navigate it efficiently and select molecules that satisfy the multiple parameters that need to be optimized while remaining synthetically accessible [5]? The number of existing data points at the beginning of a project is low. How can we enrich projects in short time frames with informative molecules and data that can subsequently be used to drive the design?

With these questions in mind, it comes as no surprise that data mining and statistics have been integrated into molecular discovery and design pipelines to provide computational support in the prioritization of molecular hypotheses [6, 7]. Machine learning algorithms have been part of the routine toolbox of computational and medicinal chemists for decades. The recent increase in applications and coverage of these methodologies has been attributed to advances in computational power, the growing amount of digitized research data, and an increasing theoretical understanding of the algorithms and their shortcomings. However, given the gradual character of these developments, it might seem counterintuitive to expect a dramatic revolution in molecular design. Nevertheless, extravagant claims have been made for the ability of Artificial Intelligence (AI) to accelerate the design process [8, 9]; how well founded are these claims? While there is unquestionably a lot of potential in novel computational tools, it is important to scrutinize them and compare their performance to existing methods, in order to objectively distinguish real progress from promotion. Only such careful evaluations will enable us to shed light on whether novel artificial intelligence methods contribute to an evolution or a revolution of the established scientific discipline of computer-assisted molecular design [10].

The historical context of machine learning in molecular design

Machine learning and AI are not new to researchers in computer-assisted molecular design. The pioneering work of Hansch and Fujita [6], as well as Free and Wilson [7], established the field of quantitative structure–activity relationship (QSAR) modelling. In their groundbreaking work, they used focused datasets, sometimes as small as a series of a dozen chemical derivatives, to fit equations that would anticipate fairly complex phenotypic effects such as toxicity [11]. Spurred by this success, a large research area has emerged that focuses specifically on (a) identifying approaches to describe chemical structures in more detail, capturing the characteristics that govern their properties, such as pharmacophores and three-dimensional structure, but also autonomously learned representations [12, 13], and (b) deriving increasingly complex mathematical relationships that aim to describe the causal relationship between these chemical characteristics and the biological properties of interest for predictive purposes [14, 15]. Through an increasing amount of structural information [16], as well as data generation through combinatorial libraries and high-throughput screening, the first applications of more complex machine learning models became feasible. However, the excitement and promise were soon followed by disenchantment. The growing field of QSAR learnt hard lessons in the 1990s about model validation, control experiments and other pitfalls [17]. In particular, the overly broad application of computational models as hard filters to data sets that had not been covered by the training data led to increasing disappointment in this technology.

With increasing understanding of the algorithmic principles and their statistical interpretation, the concept of domains of applicability was introduced [18,19,20]. Such predictive confidence estimates enabled computational drug hunters to increase the transparency of the capabilities of their tools as well as to adjust expectations. This led to an increasing number of successful applications of machine learning to drug discovery and design across academia and industry in the 2000s, which slowly rebuilt the trust of the community and led to sustained growth in their use. By 2015, computational advances such as the broad inclusion of GPUs in modern computing frameworks and the increasing amount of available RAM had made the training of larger and deeper neural nets feasible. At the now famous Kaggle challenge, a team from Toronto used a deep neural net [21] to win a SAR competition set by Merck. This competition is commonly perceived as a turning point at which a complex deep learning AI method outperformed other machine learning approaches and therefore arrived as a useful tool for computational molecular design. Deep learning can trace its roots back to the 1960s, at least in its theoretical form, with the work of Ivakhnenko and Lapa [22]; AI can trace its roots even further back, to a workshop run at Dartmouth College in 1956. Even given AI’s long history, typically longer than many imagine, the field has had a number of ‘winters’ in which expectations did not match reality. These setbacks have taken the field time to recover from. While multiple promising applications of AI now exist to derive molecular descriptors and understand their relationship to biological properties, these methods are inherently linked to big data. The algorithms are typically very data hungry before they can provide useful solutions; in return, they provide unprecedented opportunities to navigate large datasets.

Big data and navigation in chemical space

Analysis of very big chemical datasets is a major research area that can profit from the application of modern machine learning and AI-based methods. For many years the only large public chemical data set available was the “NCI Open Database” [23], released in 1999 and containing about 250,000 molecules. This database was used as a test case for the validation of numerous “classical” cheminformatics methods and virtual screening techniques. The advent of the PubChem [24] and later the ChEMBL [25] databases considerably increased the amount of publicly available chemical data for model training and validation. PubChem currently contains more than 100 million unique compounds. ChEMBL, in its current 26th release, holds information on nearly 2 million compounds, 13 thousand targets, and 16 million relationships between these compounds and targets. Another useful source of public chemical data is the ZINC database [26], providing information about more than 230 million commercially available compounds. All three data sources offer user-friendly web interfaces, and since the data may also be downloaded and processed locally, they have been used for the development of several novel analysis and visualization tools [27, 28]. Recently, two new experimental developments have increased the amount of available data by several orders of magnitude. One of these technologies is DNA-encoded library synthesis [29], where a single library can contain tens or even hundreds of millions of molecules. The introduction of so-called “readily available” virtual libraries, currently offered by several compound vendors, is another important factor increasing the resolution of possible molecular solutions: the virtual molecules in these libraries are enumerated using exclusively validated synthetic protocols and available building blocks, thereby enabling the vendor to guarantee delivery of picked molecules in a relatively short time. The number of molecules in these libraries is reaching billions [30]. With these developments in mind, the community is expecting further increases in available chemical matter, so that in the next decades we are likely to witness datasets with several billion compound structures. This exponential growth, comparable to Moore’s law describing the increase in computer processing power, will push the number of synthetically accessible molecules towards the size of the virtual chemistry database GDB-17 with its 166 billion structures [4] and thereby enable the fine-tuned selection of molecular prototypes, provided the amount of data can be handled appropriately.
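
As a small illustration of what “appropriately handled” can mean in practice, the sketch below streams a large SMILES export (for example from ChEMBL or ZINC) and computes fingerprints molecule by molecule instead of loading the whole collection into memory. This is a generic example using Python and RDKit; the file name compounds.smi and the fingerprint settings are illustrative assumptions rather than part of any of the cited resources.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def iter_fingerprints(smiles_file, radius=2, n_bits=2048):
    """Yield (SMILES, Morgan fingerprint) pairs one molecule at a time."""
    with open(smiles_file) as handle:
        for line in handle:
            smiles = line.split()[0]          # assumes one SMILES per line, first column
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:                   # skip unparsable entries
                continue
            yield smiles, AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

# Stream through the file and keep only a running count, never the full dataset.
n_valid = sum(1 for _ in iter_fingerprints("compounds.smi"))
print(f"{n_valid} parsable molecules")
```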

Classical cheminformatics methods often struggle with such very big data sets, although some recent developments are promising [30,31,32]. Novel machine learning and AI-based approaches can help by adaptively navigating vast chemical spaces and autonomously focusing on the most promising regions. In this special issue, several such approaches are described: in the study by Varnek and colleagues [33], Generative Topographic Mapping, a sophisticated dimensionality reduction method, was used to compare the molecules in the archive of a large pharmaceutical company with over 8 million commercially available samples. The method was enhanced by an AutoZoom function that focuses on the heavily populated areas of chemical space and automatically extracts substructures that represent these dense regions well. The methodology was used to identify sets of commercial molecules that maximally expand the chemical space already covered by the investigated company archive. Such approaches enable the adaptive enrichment of compound sets.
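
The underlying idea of projecting high-dimensional molecular descriptors onto a low-dimensional map can be sketched with off-the-shelf tools. The example below uses PCA from scikit-learn as a simple stand-in for Generative Topographic Mapping (the method actually used in the study) to place an “archive” set and a “vendor” set of Morgan fingerprints into the same two-dimensional space, so that regions covered by only one of the collections become visible; the SMILES lists are placeholders.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

def fingerprint_matrix(smiles_list, radius=2, n_bits=1024):
    """Morgan fingerprints as a dense NumPy matrix, skipping unparsable SMILES."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.array(rows)

# Placeholder compound sets; in practice these would be the company archive
# and a multi-million-compound vendor catalogue.
archive = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
vendor = ["CCN", "c1ccccc1N", "CC(=O)Oc1ccccc1C(=O)O"]

A, V = fingerprint_matrix(archive), fingerprint_matrix(vendor)
coords = PCA(n_components=2).fit_transform(np.vstack([A, V]))  # joint 2D map

print("archive centroid:", coords[: len(A)].mean(axis=0))
print("vendor centroid: ", coords[len(A):].mean(axis=0))
```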

Following an orthogonal approach, Tetko and colleagues [34] describe a focused library generator that is able to generate molecules with a higher chance of exhibiting desired properties. The generator is based on a long short-term memory (LSTM) recurrent neural network, with the output steered towards a specific target by reinforcement learning. As a proof of concept, Mdmx inhibitors were chosen as the objective of the presented study. The generated molecules were further refined by pharmacophore screening and molecular dynamics simulations. Additionally, and something that has fortunately become more commonplace in computational molecular design research, the source code of the generator is available on GitHub, which will allow other researchers to adapt it and use it in their own projects. Taken together, such adaptive approaches will improve the ability of research teams to navigate billions of possible structures and find molecular solutions that are sufficiently optimized for practical applications, provided the predictive algorithms are powerful enough and sufficiently validated.
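
To give a flavour of how such a generator is typically built, the sketch below implements a character-level SMILES language model with an LSTM in PyTorch. It is a generic, untrained illustration rather than the authors' code: the vocabulary, layer sizes and sampling routine are assumptions, and the reinforcement-learning step that biases generation towards the chosen target is omitted.

```python
import torch
import torch.nn as nn

# Toy SMILES vocabulary with start (^) and end ($) tokens; real models use the
# full character set observed in the training corpus (e.g. ChEMBL).
VOCAB = ["^", "$", "C", "c", "N", "n", "O", "o", "F", "(", ")", "1", "2", "=", "#"]
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}

class SmilesLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embedding(tokens)          # (batch, seq_len, emb_dim)
        out, state = self.lstm(x, state)    # (batch, seq_len, hidden_dim)
        return self.fc(out), state          # logits over the vocabulary

@torch.no_grad()
def sample(model, max_len=80):
    """Autoregressively sample one token sequence, starting from '^'."""
    model.eval()
    token = torch.tensor([[CHAR2IDX["^"]]])
    state, chars = None, []
    for _ in range(max_len):
        logits, state = model(token, state)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        token = torch.multinomial(probs, num_samples=1)
        ch = VOCAB[token.item()]
        if ch == "$":                       # end-of-sequence token
            break
        chars.append(ch)
    return "".join(chars)

model = SmilesLSTM(len(VOCAB))
# After maximum-likelihood training on known SMILES, sampling yields candidate
# structures; a reinforcement-learning loop would then reward samples that score
# well against the design objective. Untrained output here is just random tokens.
print(sample(model))
```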

Practical considerations for AI-based molecular design

The field of machine learning and AI has moved from theoretical studies to real-world applications. Cheminformatics, and QSAR in particular, have always been early adopters of statistical methods and machine learning, but in the past few years the development of novel algorithms in this area has increased drastically. Besides more conventional models like Random Forests, Gradient Boosted Trees, or Gaussian Processes, which have been applied very successfully in the past [35], novel techniques like deep neural nets (DNNs), convolutional neural nets (CNNs) or recurrent neural nets (RNNs) have been increasingly recognized as valuable additions to the toolbox of chemoinformaticians [14, 15, 21, 36,37,38]. CNNs are especially attractive in this regard as they offer a different, data-driven way to extract molecular features [39, 40]. The promise of these novel techniques originates not only from slightly higher performance metrics in retrospective evaluations but, more importantly, from an inherent ability to process unstructured data and to navigate and manipulate the “latent” space. This has led to a series of specialized AI tools that can perform tasks that are not possible with “traditional” machine learning algorithms (see for example References [9, 41, 42]).

Another series of publications has shown the ability of deep neural nets to use matrices of experimental observations (multitask learning) rather than vectors to improve predictive accuracy [43, 44]; this is especially useful for noisy and smaller data sets, for which data collection experiments are time-consuming and expensive, for example in ADMET prediction [45,46,47,48,49]. Directly tackling this challenge is also possible with one-shot learning [50], which enables learning from small amounts of data that are potentially better curated than high-throughput data. Conversely, to further overcome low-data limits and to enable autonomous data generation, a new direction is the automation of experiments and “closing the loop” in the design-make-test-analysis (DMTA) cycle typically used in drug discovery programs [51]. Active learning [52] is being applied with increasing popularity to the analysis part of the DMTA cycle. This technique assists in selecting the most “interesting” compounds to test in the next cycle, most commonly those that will help to improve the model. The new results are then fed back into the system to improve the prediction quality and to rapidly expand the applicability domain of the model [53].

The design part of the DMTA cycle has received even more attention, with generative chemistry methods well to the fore. Multiple new de novo design models based on RNNs [54,55,56], variational autoencoder (VAE) architectures [57,58,59] or generative adversarial networks (GANs) [60, 61] have been developed recently (see also Ref. [62]). Most of these models are trained on molecular structures from large public compound collections like ChEMBL [25] or PubChem [24] (to ensure “druglikeness”) and are able to generate completely novel molecules according to an objective function, for example similarity to a given input structure or compliance with constraints on certain properties like logP or activity against a protein target. For the “make” part of the DMTA cycle, retrosynthesis, reaction condition and reactivity prediction have been the focus of new DNN-based models [41, 63,64,65,66].
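
To make the notion of an objective function concrete, the sketch below scores candidate SMILES by Tanimoto similarity to a reference structure combined with a simple logP window, the kind of multi-term reward a generative model can be asked to maximize. It is a generic illustration, not taken from any of the cited models; the reference molecule, weights and thresholds are arbitrary assumptions.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

REFERENCE = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")      # arbitrary reference structure
REF_FP = AllChem.GetMorganFingerprintAsBitVect(REFERENCE, 2, nBits=2048)

def objective(smiles, logp_low=1.0, logp_high=3.0, w_sim=0.7, w_logp=0.3):
    """Score a candidate: weighted similarity to the reference plus a logP window term."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                       # invalid SMILES receive the lowest score
        return 0.0
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(REF_FP, fp)
    in_window = 1.0 if logp_low <= Descriptors.MolLogP(mol) <= logp_high else 0.0
    return w_sim * similarity + w_logp * in_window

# A generative model would be rewarded with this score for every sampled SMILES.
for candidate in ["CC(=O)Nc1ccc(OC)cc1", "c1ccccc1", "not_a_smiles"]:
    print(candidate, round(objective(candidate), 3))
```
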
Substantial progress has been made across retrosynthesis, reaction condition and reactivity prediction, owing both to access to more experimental data [67, 68] and to sophisticated techniques such as Monte Carlo Tree Search (MCTS), which helps to identify the most likely synthetic routes in retrosynthesis planning using deep neural networks and symbolic AI [41]. In this special issue, Ghiandoni and colleagues present a novel reaction-based de novo design algorithm [69], adapting previously published work on reaction vectors [70, 71] to optimise molecular structures that are likely to be more synthetically tractable. Using a recommender system, the authors demonstrate that their new methodology successfully prioritises the most relevant reaction vectors; this reduces the risk of a combinatorial explosion in the number of solutions while simultaneously ensuring that the probability of successful synthesis is high.

QSAR modelling has also concentrated on interpretability to assist the design part of DMTA; this assumes that the design is being carried out or supervised by skilled human experts. AI models are rather complex in terms of their representations of molecules. For that reason they are often treated as black boxes, and interpreting or understanding what exactly has been learned remains difficult [72]. The paper in this special issue from Webel et al. demonstrates the impact of deep learning on the identification of cytotoxic substructures in a large corpus of data [73]. Here, the authors use Deep Taylor Decomposition to identify these toxicophores in the training set, so that one can more easily diagnose the structural drivers of toxicity. Such interpretability will increase confidence in novel methodological developments and facilitate the integration of such methods into established molecular design pipelines.
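
The general mechanics of such attribution methods can be illustrated with a much simpler stand-in. The sketch below applies plain gradient-times-input saliency to an untrained PyTorch network over fingerprint bits to rank which input features push a prediction up; Webel et al. use Deep Taylor Decomposition and map the signal back to chemical substructures, which this toy example does not attempt. Model, data and feature meaning are all placeholder assumptions.

```python
import torch
import torch.nn as nn

N_BITS = 512

# Placeholder model standing in for a trained toxicity classifier over fingerprint bits.
model = nn.Sequential(
    nn.Linear(N_BITS, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

def saliency(fingerprint):
    """Gradient-times-input attribution for a single fingerprint vector."""
    x = fingerprint.clone().detach().requires_grad_(True)
    score = model(x.unsqueeze(0)).squeeze()   # scalar "toxicity" score
    score.backward()
    return (x.grad * x).detach()              # contribution of each set bit

# Random binary fingerprint as stand-in input; in practice this would be a real compound.
fp = (torch.rand(N_BITS) > 0.9).float()
contrib = saliency(fp)
top_bits = torch.topk(contrib, k=5).indices.tolist()
print("bits most associated with the predicted score:", top_bits)
```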

In an industrial setting, an important aspect is making all these novel machine-learning models and technologies operational: this includes deployment, access, reproducibility, monitoring and maintenance. In addition, these new machine-learning systems bring technical challenges that are often not immediately obvious [74]. Green and colleagues [75] discuss how these novel methods can be made accessible to a broad range of scientists at GSK and how smart system design can help with maintenance and deployment. Their system, BRADSHAW, integrates methods for chemical structure generation, experimental design, active learning and cheminformatics tools to allow automated molecular design in the DMTA cycle. Thanks to its very modular design, the system can incorporate many of these novel methods and models. In a retrospective case study the authors show how the system can be used successfully in lead optimization for the design of MMP12 inhibitors.
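
The active-learning step that such automated DMTA systems orchestrate can be reduced to a compact selection rule: train a model on the compounds measured so far and prioritize the candidates whose predictions are most uncertain. The sketch below is a generic illustration of that rule using the spread of predictions across the trees of a scikit-learn random forest as the uncertainty estimate; it is not part of BRADSHAW, and the descriptors and activities are random placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder descriptors and activities for the compounds measured so far.
X_measured = rng.random((200, 64))
y_measured = rng.random(200)

# Placeholder descriptors for candidates that could be made and tested next.
X_candidates = rng.random((1000, 64))

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_measured, y_measured)

# Per-candidate uncertainty: standard deviation of the individual tree predictions.
tree_preds = np.stack([tree.predict(X_candidates) for tree in model.estimators_])
uncertainty = tree_preds.std(axis=0)

# Select the most informative candidates for the next DMTA iteration.
batch_size = 10
next_batch = np.argsort(uncertainty)[::-1][:batch_size]
print("candidate indices to test next:", next_batch.tolist())
```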

Control experiments—is AI really doing better?

In recent years there has been a resurgence of interest in, and demonstrated impact of, Artificial Intelligence in a number of domains [9, 76, 77]. Much of this impact stems from the advent of publicly available deep learning algorithms for image processing and pattern recognition, popularized through the ImageNet [78] competition, in which deep learning achieved a landmark victory in 2012. The recent advances, especially in deep learning, have led to a huge quantity of research conducted in this area and published online in preprints and peer-reviewed articles. Of particular interest here is the great quantity of research directed at challenges in chemistry and, specifically, drug discovery and materials chemistry. Given the increasing importance of these new machine-learning methods in a plethora of fields, researchers are trying to better understand how these models work [79, 80]. As might be expected, there is a high risk that these models learn something different from what was intended [81, 82]. Much work still has to be done to make these methods resilient to noise (brittleness) and to overfitting [83]. The latter, i.e. memorization of the training data by these models, leads to reduced performance on prospective data in the best case and to security issues in the worst case [84, 85]. For these reasons, the establishment of a strong tool kit for the validation of these models is crucial (see for example [86,87,88]). In this special issue, Lee and coworkers [89] have revisited a recent large-scale comparison of deep learning models with more traditional methods on bioactivity prediction tasks [43]. They show how critical it is to choose the right benchmarking metrics with respect to data distribution and data biases to enable a fair comparison of the methods. Furthermore, they suggest using precision and recall statistics in conjunction with the common area under the receiver operating characteristic curve (AUC–ROC). Finally, they report challenges in interpreting scaffold-splitting cross-validation results. They conclude that more research needs to be done on proper validation procedures for such models in the field of chemoinformatics.
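
The metric and splitting issues raised by Lee and coworkers are easy to make concrete. The following sketch, a generic illustration with placeholder data rather than anything from the cited study, holds out whole Bemis–Murcko scaffolds instead of random molecules and then reports AUC–ROC together with precision and recall on the held-out scaffolds; on imbalanced bioactivity data these views of performance can diverge noticeably, which is exactly why the choice of metric and split matters.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# Toy labelled SMILES spanning three Bemis-Murcko scaffolds; a real benchmark
# would contain thousands of compounds.
smiles = ["CC(=O)Nc1ccc(O)cc1", "CC(=O)Nc1ccc(OC)cc1", "c1ccccc1O",
          "Cc1ccncc1", "CCc1ccncc1",
          "Cc1ccc2ccccc2c1", "CCc1ccc2ccccc2c1", "OCc1ccc2ccccc2c1"]
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

def featurize(mol, n_bits=1024):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

mols = [Chem.MolFromSmiles(s) for s in smiles]
X = np.array([featurize(m) for m in mols])
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(mol=m) for m in mols]

# Hold out whole scaffolds rather than randomly chosen molecules.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, labels, groups=scaffolds))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[train_idx], labels[train_idx])
proba = clf.predict_proba(X[test_idx])[:, 1]
pred = (proba >= 0.5).astype(int)

print("AUC-ROC:  ", roc_auc_score(labels[test_idx], proba))
print("precision:", precision_score(labels[test_idx], pred, zero_division=0))
print("recall:   ", recall_score(labels[test_idx], pred, zero_division=0))
```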

Conclusions

As is evident from the work covered in this perspective and from the plethora of reports in scientific and media outlets, many opportunities now exist for the development of novel computational methods, data-driven workflows and algorithmic tools that lead to a higher degree of automation and improve the efficacy of certain components of the drug design process [37]. A particular focus lies on assisting the selection of which experiment to carry out next [52]. The tight integration of artificial intelligence into pharmaceutical, chemical, and crop protection research is inevitable and has the potential to significantly improve the efficiency and efficacy of molecular discovery.

Although slight increases in retrospective accuracy are unlikely to qualitatively change the ability of machine learning to support the drug discovery and development pipeline [10], we anticipate that enthusiasm for this technology, coupled with technological and algorithmic advances, will significantly further the field and increase the contribution of computational tools in the chemical sciences. A possible inflection point for the field will be the concurrent progress initiated by the convergence of multiple AI branches, such as natural language processing, computer vision, and robotics. This might very well amplify the increase in available information, change our ability to automate and improve the reproducibility of experiments, and accelerate our understanding of the inner workings of AI. We are still a very long way from a completely in silico discovery process; the need to perform experiments remains vital.

With these advances in mind, novel challenges will arise. First and foremost, similar to the emergence of applicability domains, the community needs to reach a consensus on the appropriate controls for validating and assessing novel AI tools [90]. Specifically relevant will be the proper implementation of adversarial controls to reduce the risk of overfitting, brittleness, and other classical machine learning pitfalls [84, 91], which are easily overlooked with increasing model complexity. Another important challenge that arises with increasingly complex models is the potential for attacks or simply non-robust predictive behaviour [85, 92]. This is a recurrent hot topic in deep learning research, and its implications for novel computational tools in molecular design will need to be carefully considered.

In this special issue, we have carefully selected a set of classical challenges in computer-assisted molecular design and have invited some of the leading scientists in their respective disciplines to contribute studies that propose avant-garde computational approaches to these challenges and that evaluate and contextualize their potential to accelerate drug discovery. We expect that this special issue will provide an overview of the possibilities that these novel tools hold, but also important examples of proper quality control, validation, and domain-of-applicability assessment. We hope that it will serve as a compendium that stirs further discussion and guides the future development of novel AI tools for molecular design.