The machine learning life cycle and the cloud: implications for drug discovery

ABSTRACT Introduction: Artificial intelligence (AI) and machine learning (ML) are increasingly used in many aspects of drug discovery. Larger data sizes and methods such as Deep Neural Networks contribute to challenges in data management, the required software stack, and computational infrastructure. There is an increasing need in drug discovery to continuously re-train models and make them available in production environments. Areas covered: This article describes how cloud computing can aid the ML life cycle in drug discovery. The authors discuss opportunities afforded by containerization and scientific workflows, introduce the concept of MLOps, and describe how it can facilitate reproducible and robust ML modeling in drug discovery organizations. They also discuss ML on private, sensitive and regulated data. Expert opinion: Cloud computing offers a compelling suite of building blocks to sustain the ML life cycle integrated in iterative drug discovery. Containerization and platforms such as Kubernetes, together with scientific workflows, can enable reproducible and resilient analysis pipelines, and the elasticity and flexibility of cloud infrastructures enable scalable and efficient access to compute resources. Drug discovery commonly involves working with sensitive or private data, and cloud computing and federated learning can contribute toward enabling collaborative drug discovery within and between organizations. Abbreviations: AI = Artificial Intelligence; DL = Deep Learning; GPU = Graphics Processing Unit; IaaS = Infrastructure as a Service; K8S = Kubernetes; ML = Machine Learning; MLOps = Machine Learning and Operations; PaaS = Platform as a Service; QC = Quality Control; SaaS = Software as a Service


Introduction
Artificial Intelligence (AI) and Machine Learning (ML) have high potential to revolutionize drug discovery and are increasingly being used in various aspects of the discovery process [1][2][3][4]. Examples of applications using AI/ML include drug screening and repurposing [5,6], structure-based modeling [7], image-based analysis of cells using high-content microscopy [8], drug safety assessment [9,10], and de-novo molecular generation [11]. It is envisioned that AI will increase productivity and reduce attrition in drug discovery [12], a hypothesis that has been strengthened during the COVID-19 pandemic [13][14][15]. Examples of success stories include the identification of Halicin as an antibacterial molecule from a drug repurposing screen [16], and a phase I clinical study of DSP-1181, developed by Sumitomo Dainippon Pharma and Exscientia for the treatment of obsessive-compulsive disorder (https://www.bbc.com/news/technology-51315462). Several new companies are emerging where AI is at the core of their business, such as Recursion Pharmaceuticals with several ongoing phase 1 and phase 2 trials (https://www.recursionpharma.com), Benevolent AI (https://www.benevolent.com/), and Insilico Medicine (https://insilico.com/).
Training useful ML models primarily requires access to a sufficiently sized dataset of labeled observations; an example would be a large set of compounds run in an assay, where each individual experiment constitutes an observation and the label is the result of the assay. It is also important that the data is of high quality. No ML methodology can rescue a dataset with poor labeling [17], such as assays with high bias or high variance in the results. Collecting, harmonizing, storing and preprocessing data can be a very challenging task, even if the data is generated within the same organization. In some cases data might come from experiments with different setups, and sometimes data is distributed, further contributing to the challenges [18]. A well-planned informatics infrastructure with databases and data management pipelines greatly improves the throughput and reduces the errors when assembling datasets for ML modeling.
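As a small illustration of the kind of label quality control implied above, the following sketch flags compounds whose replicate assay readouts vary too much to serve as reliable labels. The function name and the coefficient-of-variation cutoff are hypothetical choices for illustration, not a prescribed QC protocol:

```python
from statistics import mean, stdev

def flag_unreliable_labels(replicates_by_compound, max_cv=0.3):
    """Split compounds into reliable and flagged based on replicate spread.

    replicates_by_compound: dict mapping compound id -> list of assay readouts.
    max_cv: maximum allowed coefficient of variation (illustrative cutoff).
    Compounds with a single replicate or near-zero mean are flagged.
    """
    reliable, flagged = [], []
    for cid, values in replicates_by_compound.items():
        m = mean(values)
        cv = stdev(values) / abs(m) if len(values) > 1 and m != 0 else float("inf")
        (reliable if cv <= max_cv else flagged).append(cid)
    return reliable, flagged

# Compound "A" has consistent replicates; "B" varies too much to trust.
reliable, flagged = flag_unreliable_labels(
    {"A": [1.0, 1.1, 0.9], "B": [0.2, 1.5, 3.0]}
)
```

Checks of this kind are cheap to run before model training and can be versioned together with the dataset itself.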
Many life science instruments continuously increase their throughput, and new types of experiments emerge, capable of producing detailed depictions of biological processes [19]. This leads to ever-increasing data sizes, both in the number of objects studied and in the resolution and dimensionality of each observation. Growing datasets enable more accurate AI modeling, but also contribute to challenges in data storage and in the computational infrastructure required for training.
Recently, Deep Learning with Deep Neural Networks has emerged as a method that in many cases has been demonstrated to produce more accurate predictions in applications related to life science [20][21][22]. Convolutional Neural Networks have been especially useful for training AI models on images, such as those captured by microscopy in cell-based experiments [8,23]. More recently, DeepMind's AlphaFold was able to predict a protein's three-dimensional shape with very high accuracy (https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology). Deep Learning requires, apart from the necessary data management infrastructure, substantial computing power. For this reason, accelerators such as GPUs are today more or less a requirement to train even medium-sized AI models using DL methods.
The ML life cycle describes the entire workflow from data to inference and contains many more challenges than the actual model training, including evaluating, deploying and serving models, managing a resilient computational infrastructure, and supporting reproducible data pipelines to enable continuous re-training (see Figure 1). These steps come with different types of risks, which can be summarized under the software engineering framework of 'technical debt' [24]. In drug discovery this issue has traditionally been overlooked, and it is important to address all parts of the machine learning life cycle.
In this manuscript, we discuss how cloud computing can assist with and alleviate many of the challenges related to ML modeling when training and serving models aimed at drug discovery, covering the steps that make up the ML life cycle. We conclude by discussing working with private and sensitive data in cloud environments, and its potential impact on drug discovery projects.

Cloud computing and containerization
Cloud computing refers to delivering and consuming configurable computing resources as networked services on demand [25]. At the heart of cloud computing is virtualization technology, where a hypervisor software is used to abstract away, partition and share underlying hardware resources as virtual machines (VMs). A cloud-computing environment pools virtual resources, including networking and storage, and allows the user to rapidly scale resources up and down with minimal interaction with the service provider. Due to its widespread adoption, cloud computing is today used in virtually all aspects of drug discovery; for an overview, see [26]. Drug discovery is a constantly changing process, and cloud-based solutions enable an organization or lab to continuously adapt and re-purpose underlying hardware resources to meet evolving needs. Three main service levels can be distinguished: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), with an increasing level of abstraction of the underlying hardware (Figure 2). The most fundamental service level is IaaS, which relieves organizations of the burden of maintaining an on-premises computational infrastructure; see also D'Agostino et al. for a review of practical and economical aspects of computational drug discovery [27].
Clouds come in four principal deployment models: public clouds, private clouds, hybrid clouds and community clouds.

Article highlights
• Integrating machine learning (ML) in drug discovery necessitates managing the entire life cycle of data and models
• The ML life cycle comprises steps for data collection, preprocessing, model training and validation, and model serving, as well as predictions and inference
• MLOps is a set of practices and a category of software that aims to support all stages in the ML modeling life cycle, and can facilitate reproducible and robust ML modeling in iterative drug discovery
• Cloud computing provides scalable and resilient access to compute infrastructure that assists in MLOps and the machine learning life cycle
• Containerization and platforms such as Kubernetes, together with scientific workflows, can facilitate reproducible and resilient analysis pipelines for continuous ML modeling in drug discovery projects
• Emerging federated approaches hold great promise to enable multi-organization collaboration on ML; this will require a combination of private and cloud infrastructure

Public clouds are provisioned for use by the general public, often in large data centers spread over multiple geographic regions. Private clouds are often deployed on premises for use by a single organization, while community clouds are used by multiple organizations with e.g. similar regulatory concerns. Hybrid cloud refers to the combination of at least two of the other deployment models, often a public and a private cloud, and assumes a software layer that enables seamless application portability between the involved infrastructures [25]. OpenStack [28] is the major open-source software stack for operating cloud environments. Being modular in nature, it has seen wide adoption both in academia [29] and as a basis for smaller, national and local cloud providers.
Major public cloud providers such as Google Cloud Platform, Microsoft Azure, Amazon EC2 and Alibaba Cloud all host extensive and extendable catalogs of virtual appliances: virtual machine images (VMIs) with preinstalled software environments, ready to be deployed and used. Such VMIs can greatly simplify distribution of tools in the drug discovery pipeline and help improve reproducibility [30]. While VMIs provide an effective abstraction layer for application portability, there is a significant overhead in emulating a complete operating system. Containerization provides a lightweight alternative that packages everything needed to run an application (the code, its runtime dependencies and settings) as a standard, portable unit of software. Multiple containers run on top of the host system's operating system kernel and consume far fewer resources than VMs. Containerization helps applications run in a standardized way across platforms ranging from public clouds to private clouds to laptops. To automate deployment and management of large numbers of such containerized applications in production, container orchestration platforms are used. The most widely used solution today is Kubernetes (K8s), originally developed and open-sourced by Google. A Kubernetes cluster on top of local hardware is an increasingly common alternative for a private cloud platform, providing scientists with easy-to-use and scalable access to CPUs and GPUs. It is also common to run K8s on top of VMs in a cloud, and all major cloud providers offer on-demand container orchestration services based on K8s. Containerization is integral to the microservices architecture pattern, where complex software is broken down into small, loosely coupled services with clearly defined roles. Examples of PaaS based on Kubernetes in life science are the PhenoMeNal Virtual Research Environment [31,32] and OpenRiskNet (www.openrisknet.org), both realized as microservices platforms on Kubernetes.
A great number of tools are available as SaaS for drug discovery; a few examples include SwissTargetPrediction for prediction of protein targets for small molecules [33], SwissADME for evaluating pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules [34], and various APIs and GUIs for prediction of molecular properties such as logD [35]. Larger suites of integrated tools available as SaaS are commonly proprietary and commercial offerings.
Finally, we would like to comment on the practical IT knowledge needed to work with cloud services. SaaS offerings typically require very little practical skill, since services tend to be consumed directly via a browser-based UI. To work efficiently with IaaS and PaaS, however, we recommend basic knowledge of the Linux operating system. It is also very useful to know the basics of Python programming.

Machine learning life cycle in drug discovery
The ML life cycle consists of multiple steps with varying infrastructure and software requirements (see Figure 3). In all these steps, cloud IaaS can reduce the need to buy and maintain a large amount of on-premises compute infrastructure to get started, and for many of the steps higher-level services are available as public cloud services. Private clouds and on-premises K8s clusters are good complements in that they let a scientist quickly repurpose the same hardware to meet the varying application needs of the different stages.

Data collection
Drug discovery comprises data from many different types of experiments, and with an increasing use of AI modeling it becomes crucial to have a data management solution in place that enables rapid access to data so that it can be used efficiently. In drug discovery, the data collection step of the ML life cycle can comprise different tasks, such as performing experiments to produce data, extracting existing data from databases, or making a selection of data to be used for modeling. Cloud computing here offers scalable capacity and on-demand infrastructure for storage, databases and middleware with no up-front costs. Further, hosting data in the cloud enables easy and rapid access when modeling is done in cloud environments [36]. This also facilitates collaboration, for example by giving others access to data volumes. There are many proprietary cloud-based data platforms used in drug discovery, but most are integrated with analytics solutions and are often not flexible enough to be integrated into larger multivendor process flows.

Data preprocessing
When data has been selected and assembled, it is very common to perform a set of data preprocessing steps. This includes dealing with artifacts such as duplicate data and missing values, as well as normalization, augmentation, and quality control. Here, cloud computing offers many opportunities to facilitate preprocessing, simplifying the construction and execution of pipelines or workflows for transparency, reproducibility, and robustness. Many workflow systems enable declarative specifications of analysis pipelines with a built-in capability to execute workflows on public as well as private clouds. Nextflow [37] is a widely used workflow engine in life science research, and can execute workflows on IaaS resources directly but also on Kubernetes clusters. Argo (https://argoproj.github.io/argo/) and Pachyderm [38] are two workflow systems that are Kubernetes-native. Specifying preprocessing steps in a Notebook deployed on cloud resources is another common approach, and has the added advantage of supporting visual interpretation. A complete workflow of all preprocessing steps that can run in the cloud makes the preprocessing reproducible, portable and scalable. In addition, many PaaS services targeted at big data preprocessing and analysis are readily available as public cloud services, including Databricks, based on Apache Spark.
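The preprocessing steps above can be sketched as a pipeline of small pure functions, which is the property that makes them easy to version, test and re-run inside a workflow engine task or Notebook. This is a minimal illustration in plain Python; the function names, the single "activity" column, and the mean-imputation strategy are illustrative assumptions, not a recommended protocol:

```python
from statistics import mean, stdev

def drop_duplicates(rows):
    """Keep the first occurrence of each compound id."""
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(dict(row))
    return out

def impute_missing(rows, key="activity"):
    """Replace None values with the column mean (a simple strategy)."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: r[key] if r[key] is not None else fill} for r in rows]

def zscore(rows, key="activity"):
    """Standardize a column to zero mean and unit variance."""
    values = [r[key] for r in rows]
    m, s = mean(values), stdev(values)
    return [{**r, key: (r[key] - m) / s} for r in rows]

def preprocess(rows):
    # Each step is a pure function over the dataset, so the whole
    # pipeline can run unchanged on a laptop, a VM or a cluster pod.
    for step in (drop_duplicates, impute_missing, zscore):
        rows = step(rows)
    return rows
```

In a real setting each step would typically become one task in a Nextflow or Argo workflow, with inputs and outputs persisted so that any stage can be rerun and audited.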

Model training and validation
When the dataset has been constructed, the next step is model development, comprising steps such as model training and validation. In most cases this includes an optimization process and hyperparameter tuning, and can be both a time-consuming and resource-demanding process. Cloud computing holds many opportunities for this stage of the ML life cycle. A common solution for practitioners, such as data scientists involved in drug discovery, is to invest in powerful on-site workstations equipped with a GPU accelerator, but this solution is neither scalable nor flexible within a team of scientists with a bursty need for compute power, and it also requires considerable up-front investment and maintaining on-site infrastructure over time. Procuring IaaS resources from a cloud provider is a compelling alternative: with access to large resources when needed and a pay-per-minute policy, the up-front and maintenance costs for hardware are completely eliminated. However, the running costs for advanced infrastructure at cloud providers, such as GPUs or large VMs with a lot of RAM, should not be underestimated, and teams with sustained large needs for GPU compute increasingly often invest in setting up private K8s infrastructure to support a GPU-powered containerized workflow. One example in drug discovery is Moghadam et al., who scaled AI modeling using VMs procured at public cloud providers [39].
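The reason hyperparameter tuning maps so well onto elastic cloud resources is that each parameter combination can be evaluated independently. The sketch below shows the pattern with a pure-Python grid search over cross-validation folds; the helper names, the contiguous fold scheme and the toy scoring interface are assumptions for illustration, and in practice each inner call would be fanned out to a separate VM or Kubernetes pod:

```python
from itertools import product

def cross_val_score(train_and_eval, data, n_folds=3):
    """Average score over simple contiguous folds."""
    fold = max(1, len(data) // n_folds)
    scores = []
    for i in range(n_folds):
        val = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        scores.append(train_and_eval(train, val))
    return sum(scores) / len(scores)

def grid_search(train_and_eval, data, grid):
    """Exhaustively evaluate every hyperparameter combination.

    Each combination is independent of the others, which is what
    makes this step embarrassingly parallel on cloud infrastructure.
    """
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cross_val_score(
            lambda tr, va: train_and_eval(tr, va, **params), data)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Frameworks discussed elsewhere in this article (e.g. Kubeflow) provide production-grade versions of this loop, with scheduling, logging and artifact tracking.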
The software stacks used for AI modeling in drug discovery are large and diverse, but some prominent tools include TensorFlow, PyTorch and scikit-learn, where the first two are focused mainly on Deep Learning. These and many other stacks are readily available in cloud environments, and there are both VMIs and container images with all dependencies, enabling rapid setup of single-node infrastructure. For scaling over many nodes, Kubernetes is widely used, together with specialized frameworks for AI on Kubernetes. In drug discovery, Recursion Pharma uses Kubernetes to scale AI modeling over CPUs and GPUs on both Google Kubernetes Engine and on-premises clusters (https://www.recursionpharma.com/careers?gh_jid=2271704), and BenevolentAI has a Kubernetes-based infrastructure.

Figure 3. A high-level overview of the machine learning life cycle. The initial step is Data collection, which aims at assembling a dataset from one or more sources for subsequent Data preprocessing, which results in a dataset that can be used for Model training. The next steps, Model evaluation and Model serving, ensure that end users can take advantage of a validated model that can be trusted for Predictions and Inference.

Model serving and inference
An often overlooked issue is making developed AI models available to end users in a production-grade environment with adequate governance. Such model management and serving requires not only validation and versioning of models, but also features such as privacy, access control, auditability, logging and monitoring, and a resilient infrastructure that can recover from failures. This is arguably the most technically challenging step of the ML modeling life cycle, but fortunately there is much to learn from modern software engineering. Cloud computing, with its solutions for infrastructure-as-code, has enabled new sets of practices for combining development (Dev) and operations (Ops), so-called DevOps. Clouds and Kubernetes have become popular in DevOps teams as they offer a flexible and scalable environment, and also provide many of the features needed for production-grade solutions. Serving ML models has the same prerequisites in terms of e.g. resilience and scalability as many other services, but also more specific challenges related to e.g. monitoring and governance from an AI perspective. Most public clouds offer services for serving and publishing ML models. In addition, several cloud-native open-source frameworks that leverage Kubernetes have been developed, including TensorFlow Serving (https://github.com/tensorflow/serving), Seldon (https://www.seldon.io/), Kubeflow (https://www.kubeflow.org/), and STACKn (https://github.com/scaleoutsystems/stackn).
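Stripped of the framework machinery, the contract these serving platforms implement is simple: a versioned model exposed behind a network endpoint. The following standard-library sketch illustrates that contract only; the stand-in model, the version string and the port are hypothetical, and a real deployment would load a versioned artifact from a model registry and add authentication, validation and logging:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_VERSION = "1.0.0"  # every served model should carry a version for auditability

def predict(features):
    """Stand-in for a trained model: here simply the mean of the inputs.

    A production service would load a validated, versioned model
    artifact instead of computing a toy score.
    """
    score = sum(features) / len(features)
    return {"score": score, "model_version": MODEL_VERSION}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON payload of the form {"features": [..]} and
        # return the prediction together with the model version.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve (blocking call, so commented out here):
# HTTPServer(("", 8080), PredictHandler).serve_forever()
```

Resilience, scaling and recovery, the hard parts enumerated above, are exactly what Kubernetes and frameworks such as Seldon or TensorFlow Serving contribute on top of an endpoint like this.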
Making models available over an API is an important step so that models can be consumed over a network by other software components, and it is also a prerequisite for cloud-based inference with predictions as SaaS. Examples in drug discovery include the OpenRiskNet platform for chemical risk assessment (www.openrisknet.org), built on OpenShift (a Kubernetes distribution by Red Hat), which allowed for publishing services, such as AI models, and also added a layer of discoverability and interoperability by annotating APIs using JSON-LD (https://doi.org/10.5281/zenodo.2597061). Another example is DeepCell Kiosk for scaling deployment and inference for microscopy images using TensorFlow, also built on top of Kubernetes [41].

Iterative drug discovery and the need for MLOps
Small-molecule drug discovery is an iterative process, and after initial high-throughput screening or other activities for lead identification, the subsequent lead optimization phase commonly follows the Design-Make-Test-Analyze (DMTA) cycle (see Figure 4) [42]. The cycle starts in the Design phase with deciding on the next round of compounds to be synthesized in the Make phase. After being produced, the new compounds are evaluated in the Test phase using different types of assays, and the results are analyzed to guide the next round of experiments. ML models can aid decision-making in the Analyze phase both through models trained on data from the Test phase (commonly for on-target activity) and through models trained on global data collected over multiple drug discovery projects (e.g. for safety endpoints and off-target effects). The project-specific models are usually smaller, and could, for example, be one or more SAR models developed around one or more scaffolds [43]. These are commonly developed by the scientists themselves within the drug discovery project, but might still need to be consumed by several members of the team, so models should preferably be deployed and made available over a network. The global models are commonly larger, since they are based on data from multiple projects [44], and since they are applied in each cycle and in several projects, there is a much higher demand for production-grade model serving and accessible APIs. The global data available within a drug discovery organization is, however, continuously updated, and global models need to be continuously retrained to include the most recent data. Traditional ML modeling and model serving has to a large extent been a manual process in drug discovery organizations; this can create delays in the DMTA cycle, and if global models are not updated, predictions may be made with models that are not trained on all available data. The emerging area of MLOps (Machine Learning and Operations) can here provide both a systematic way of working and cloud-based software to support the process.

Figure 4. The use of ML models in the Design-Make-Test-Analyze cycle in drug discovery. Within a specific drug project, models can be trained on data from e.g. assays run in the Test phase and used to make predictions aiding the Analyze phase, for example to optimize interaction toward a certain target. Global models can be used for predicting e.g. toxicity and off-targets, and are commonly trained on the organization's proprietary data collected in many projects, often merged with publicly available data. Project models are developed toward a particular objective in a project, such as activity toward a specific target. They are commonly manually trained, might need to be updated several times during the project, and are generally not needed when a project is completed or terminated.
MLOps describes the joint undertaking of data engineers, data scientists and operations professionals with the intention to cover the entire life cycle of ML modeling in production environments. In drug discovery, MLOps tries to bridge the disconnect that exists in many organizations between pharmaceutical professionals, AI modelers, and service providers hosting production-grade ML models, and to enable collaboration. Figure 5 illustrates the continuous deployment nature often associated with MLOps: the aim is to automate the processes from data preparation and experimentation, to ML model training and validation, to delivery of production-grade models with standard API endpoints, with the purpose of ensuring that up-to-date models become available to scientists in a timely manner as data is added.
All major cloud providers offer services that cater to the stages of MLOps, and there are many commercial and proprietary systems. Examples of open-source frameworks that can be used both in public clouds and on private Kubernetes infrastructure include Kubeflow (https://www.kubeflow.org/), STACKn (https://github.com/scaleoutsystems/stackn), and H2O.ai (www.h2o.ai).
The vast majority of AI models in drug discovery are today trained on data in a batch-wise fashion, where datasets are generated and preprocessed to form a standalone training set that is then progressed to AI modeling. Datasets are either generated within the same organization, or assembled elsewhere and downloaded. A critical component is adequate quality control (QC), and many people currently involved in AI modeling operate on datasets where such QC has already been performed, in many cases without the details being disclosed. When AI models are used in decision-making, there is a greater requirement to train on updated data, and it becomes critical to close the loop from data generation to a predictive model in production [42,45]. A few activities have been reported on using Active Learning in drug discovery, where algorithms decide where the next data point should be generated [46,47]. There are also studies showing the advantages of integrating ML in iterative drug screening [48][49][50]. In order to harness the full power of such continuous ML modeling, all components including data extraction, preprocessing, QC, training, validation and deployment must be integrated into a reproducible pipeline with adequate versioning.
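One way to close the loop described above is to key retraining on a content fingerprint of the training set, so that a new model version is produced exactly when the underlying data changes. The sketch below illustrates the idea; the in-memory dict stands in for a real model registry, and the names and hashing scheme are illustrative assumptions rather than a specific MLOps product's API:

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Content hash used to version a training set reproducibly."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def maybe_retrain(rows, registry, train):
    """Retrain and register a model only when the data has changed.

    registry: dict mapping dataset fingerprint -> trained model,
    standing in for a real model registry in an MLOps platform.
    train: callable that fits a model on the given rows.
    """
    fp = dataset_fingerprint(rows)
    if fp not in registry:
        # New or changed data: trigger training and register the
        # resulting model under the dataset's fingerprint.
        registry[fp] = train(rows)
    return fp, registry[fp]
```

Running this check on a schedule, or on a data-ingestion event, gives the continuous-deployment behavior associated with MLOps: models are retrained and versioned automatically as data arrives, with full traceability back to the dataset that produced them.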

Machine learning using private, sensitive and regulated data
In the machine learning life cycle, data is the key asset, and in drug discovery this data is often sensitive, private or regulated. In such situations, transferring data to a cloud outside the administrative control of the data owner is not an option. A hybrid cloud setup, where some computational resources are located inside the organization's domain and some outside, can help a single organization overcome these challenges internally, by replicating standardized infrastructure on many sites in different countries. Seamlessly managing applications over such hybrid clouds is, however, not a trivial task [51]. If there is a need or opportunity to collaborate across organizations, such as between two or more pharmaceutical companies, the situation becomes even more complicated. Here, a recent technology, federated machine learning [52,53], has shown great promise to enable the joint construction of models across organizations and data silos, without the need to ever move or disclose data. Figure 6 illustrates the main idea: multiple organizations can collectively train a ML model by each performing completely local model updates and employing an aggregation scheme in an iterative fashion.

Figure 5. MLOps. MLOps is both a set of practices and a category of software that aims to support all stages in the ML modeling life cycle, from data ingestion and preprocessing, to development and validation of a ML model, to making these models available over a network in a production-grade fashion. MLOps can be seen as the application and specialization of the software development paradigm DevOps to machine learning projects. Establishing good MLOps practices helps an organization ensure that models can be continuously updated and deployed as data is added, and it can enable model sharing between teams by keeping artifacts continuously updated and deployed.
There are many possible variants of the scheme employed for aggregation, ranging from centralized versions supported by a trusted third party to completely decentralized, trustless variants, each with their own pros and cons [52]. In either case, a practical challenge in all such systems is to standardize and manage the code that needs to be executed on each organization's own infrastructure. Again, containerization and orchestration frameworks can greatly facilitate that process. In drug discovery, the appealing idea of precompetitive alliances in areas of mutual interest, such as safety assessment of hits, leads, and candidates, would be propelled if federated learning could be realized. The MELLODDY consortium (https://www.melloddy.eu/) is targeting such analysis over proprietary databases of pharmaceutical companies. There have also been efforts toward aggregation of predictions on non-disclosed datasets applied to drug discovery problems [54].
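The centralized aggregation variant can be sketched in a few lines as federated averaging (often called FedAvg), where only model weights, never raw data, leave each organization. This is a simplified single-round sketch under that assumption; real systems add secure aggregation, scheduling, client selection and fault tolerance:

```python
def local_update(weights, gradient, lr=0.1):
    """One local gradient step; training data never leaves the organization."""
    return [w - lr * g for w, g in zip(weights, gradient)]

def federated_average(client_weights, client_sizes):
    """FedAvg: average locally trained weights, weighted by dataset size.

    client_weights: one weight vector per participating organization.
    client_sizes: number of local training examples per organization.
    Only these weight vectors are shared with the aggregator.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

In an iterative scheme, the aggregator broadcasts the averaged weights back to all parties, each performs further local updates on its private data, and the cycle repeats until convergence.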

Expert opinion
AI is establishing itself as a key component in modern drug discovery, and Deep Learning approaches in particular have become popular in many different areas. This trend, however, presents many challenges for drug discovery organizations, both in providing sufficient computational resources and software to their scientists and in managing and serving production-grade models to end users within the organization. Cloud computing offers many components that assist in the machine learning life cycle, such as easy and scalable access to computational resources and an ecosystem of tools developed to help data scientists work effectively with ML modeling. Many of these tools rely on containerization technology, and together with platforms such as Kubernetes they can greatly aid and help streamline the ML life cycle. But apart from technical resources and platforms, there is also a need for scientists and engineers to adopt MLOps best practices. For scenarios where continuous modeling is required, such as when data is updated at regular intervals or when engaged in human-in-the-loop or active learning, reproducibility and traceability of data and models become even more important across all steps of the ML life cycle.
Relying solely on public cloud providers has many advantages in terms of not having to maintain any infrastructure locally, but has implications for costs and data privacy. In most real-world scenarios, drug discovery organizations will need to combine local, private cloud infrastructure and public cloud. Distributed ML on such hybrid clouds can be an interesting solution; however, this is currently a technically challenging setup. Federated ML is another upcoming technology that can be used for privacy-preserving and collaborative ML, such as in precompetitive alliances, and this has high potential for future drug discovery to unlock hidden data among organizations.

Figure 6. Federated Learning. In Federated Learning, a distributed system is employed to train ML models jointly across multiple organizations. No training data is disclosed to any other party at any stage of the training process, promising strong input privacy while offering the possibility to construct ML models with superior performance and generalizability in pre-competitive alliances.

Funding
This project was financially supported by FORMAS (grant 2018-00924), the Swedish Research Council (grant 2020-03731), the Swedish strategic research programme eSSENCE, and the Swedish Foundation for Strategic Research (grants BD15-0008 and SB16-0046).

Reviewer disclosures
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.