Best practices for machine learning in antibody discovery and development

Over the past 40 years, the discovery and development of therapeutic antibodies to treat disease has become common practice. However, as therapeutic antibody constructs become more sophisticated (e.g., multi-specifics), conventional approaches to optimisation are increasingly inefficient. Machine learning (ML) promises to open up an in silico route to antibody discovery and to help accelerate the development of drug products with fewer experiments and hence at lower cost. Over the past few years, we have observed rapid developments in the field of ML-guided antibody discovery and development (D&D). However, many of the results are difficult to compare or hard to assess for utility by other experts in the field, owing to the high diversity of datasets, evaluation techniques, and metrics used across industry and academia. This limitation of the literature curtails the broad adoption of ML across the industry and slows down overall progress in the field, highlighting the need to develop standards and guidelines that may help improve the reproducibility of ML models across different research groups. To address these challenges, we set out in this perspective to critically review current practices, explain common pitfalls, and clearly define a set of method development and evaluation guidelines that can be applied to different types of ML-based techniques for therapeutic antibody D&D. Specifically, we address, in an end-to-end analysis, challenges associated with all aspects of the ML process and recommend a set of best practices for each stage.


Introduction
The development of antibody-based drugs has revolutionised the field of medicine, providing effective treatments for a wide range of diseases [1,2]. However, although the drug discovery process has proven effective, it is complex, expensive, time-consuming, and has room for improvement [3][4][5][6][7]. Recent advances in ML have the potential to accelerate and optimise this process by enabling the identification of better biotherapeutic drug candidates more rapidly, thereby reducing the cost and timelines for antibody drug discovery [8][9][10].
The rapidly evolving landscape of ML-guided therapeutic antibody research and development (R&D) holds immense potential for the biopharmaceutical industry. Although initial success stories have emerged, its 'real world' impact has thus far been relatively minimal [11]. To unlock the full potential and demonstrate significant impact in commercial drug discovery and development settings, it is essential to establish standardised guidelines and best practices for applications of ML at every step of biologic drug discovery and development projects. These include the in silico design of antibody candidate drugs; computational identification or design of high-affinity, function-relevant epitopes; accurate prediction of biophysical attributes; screening for formulations; and development of digital twins of manufacturing processes [12,13].
For next-generation multispecific antibodies, these computational approaches are essential for assessing the large design spaces and identifying candidates with optimal properties that can then progress towards clinical studies. Improving the probability of success in clinical trials generally has the highest positive impact on the costs and timelines of an individual program [14]. Nevertheless, a reduction of time and costs in the preclinical stages can still enable significant overall savings due to the high volume of early-stage programs executed in parallel throughout the industry, assuming the resulting biotherapeutics are of similar or better quality than those conventionally discovered [15]. Recent years have, therefore, witnessed a surge in the application of ML-based models at each stage of the drug discovery and development cycle. However, little attention is being paid to the general applicability and benchmarking of these models, leading to difficulties in reproducing models and results.
This review article aims to address the need for establishing benchmarks and best practices for the use of ML in the biopharmaceutical industry. We will critically examine current methodologies, identifying common pitfalls and providing recommendations for ML-based approaches to therapeutic antibody R&D, akin to other reviews for chemistry or broader ML research [16][17][18][19]. Unlike other reviews (see e.g. [20][21][22][23][24]), we are less concerned with which type of model to use in which context or with the presentation of different ML methods. Instead, we focus heavily on data aspects and model validation. This focus stems from our practical drug discovery expertise, where these are crucial factors that have commonly been neglected or undervalued in other reviews of this emerging field.
By offering clear evaluation guidelines based on the literature and on the experience of practitioners from both small biotech and large pharma, we aim to bridge the gap between theoretical advancements and real-world applications. This review focuses on therapeutic antibody engineering and aims to foster consistency across academic and industry research groups. We believe that this, in turn, will enable ML to contribute more meaningfully to the R&D of novel antibody therapeutics and ultimately benefit patients.
While the main focus of this review is on the creation of high-quality data sets and ML models, we need to touch on the most fundamental problem that must be addressed first when working on any drug discovery program: the underlying objective. This topic can be summarised under the broad term of 'predictive validity', i.e., are we increasing confidence in the project's prospects for product development and clinical success? In other words, are we increasing confidence that we have a developable antibody which is safe, binds only the intended target, and produces the desired therapeutic effect via the desired mechanism of action (MoA) in the correct patient group [25]? Only once this is clearly answered should a team move on to building data sets and ML models to support a given program.

Before you start: Choose the appropriate strategy and the experimental set-up
Before discussing the intricacies of an ML process, any antibody discovery program needs a clear experimental strategy aligned to the candidate drug target profile (CDTP) [26]. The CDTP should include targets for both the desired pharmacology and the appropriate developability package (Fig. 2A). This specification sheet will inform the assay cascade for lead generation, optimisation, and final candidate selection (Fig. 2B).

Figure 2:
(A) An illustrative candidate drug target profile (CDTP) for any biologic modality covering key attributes of both pharmacology and developability. The targets should be set appropriately for the relevant milestones during R&D towards candidate selection. (B) An outline to illustrate an assay cascade for lead selection during R&D. Successful completion may be achieved following one or a number of iterations of the cascade, dependent upon aspects of the platform and the biology of the target.
In drug discovery, building confidence in both the target and the candidate molecule's pharmacology and developability is key to progressing from target selection to candidate drug selection, prior to clinical and CMC investment. CMC science includes cell line development, formulation and drug delivery, device development, analytical capabilities, and clinical manufacture and supply for First Time in Human (FTiH) clinical studies through to launch. To maximise the probability of success, establishing assays that relate to species cross-reactivity, the desired MoA, and aspects of developability is crucial. Focusing on function rather than affinity or other simplistic properties enables the identification of the rarer functional biologics, especially in the context of complex targets, agonists, or multispecific or multivalent biologics [27][28][29]. To further stress this point, many screening or optimisation campaigns still primarily focus on affinity maturation, even though more predictive assays (e.g. functional, reporter assays), which could deliver a similar throughput, are available today. In most instances these binding optimisation campaigns have limited correlation with the desired function, even for simple biologies such as agonists (see Fig. 3).

Figure 3.
The figure illustrates correlations between binding and activity behaviours assessed using different technologies for VHH domains in both mono- and bivalent (biparatopic) formats against a specific therapeutic target that requires agonistic activity. VHH-target binding was evaluated through biolayer interferometry (BLI), cell binding through FACS, and activity within a T cell activation assay. In this context, agonistic activity is achievable only by combining two binding domains against the same target within a biparatopic format. In Figure 3a, binding against the immobilised target measured via BLI (response) correlates with cell binding assessed via FACS (EC50) for compounds in the biparatopic format. However, Figures 3b and 3c demonstrate that binding to the immobilised target (BLI response) and binding to cells (FACS EC50) correlate poorly with activity in the T cell activation assay. Figure 3d illustrates that the binding (BLI off-rates) of monovalent building blocks does not correlate at all with the activation behaviour of the same variants in the biparatopic format in the T cell activation assay.
Appropriate functional assays can be established in high-throughput (HT) screening mode or in lower-throughput profiling mode with increasing translational relevance. Examples of HT functional screening assays are biochemical ligand/receptor HTRF® assays or cell-based reporter systems. Progressing from engineered biochemical and cell-based in vitro assays to more disease-relevant cell-based assays with endogenous target biology or patient-derived systems increases translational relevance but can reduce throughput and precision, and increase variation.
Designing a project-specific assay cascade entails balancing assay feasibility, throughput, robustness, data quality, and translational relevance. For each project, decisions are required about which data can best triage libraries and molecules through the lead generation and optimisation phases. The integration of wet-lab automation with the goal of generating structured, consistent, machine-readable data will enhance data generation efficiency and accuracy [30,31]. Data accuracy, data amount, and the time to complete a full loop from prediction to experimental validation all need careful consideration for optimal predictive performance and hence project outcome [32]. An example assay cascade with approximate numbers and endpoints is outlined in Table 1. The triaging of assays, throughput, and phasing depend upon individual project requirements and biological feasibility. Key considerations in successfully implementing this revolve around technical and cultural ways of working. These include: standardisation of labware and vendors; having skilled automation engineers to code and optimise the wet-lab scientists' methods; use of barcoding and tracking of samples from sequence through expression and assay systems; the use of appropriate controls across users, protocols, and projects; and standardised parsing analysis to route data back to the end user. Staff having real-time access to integrated and curated data will also facilitate better and more timely decision-making. Other aspects of data generation, capture, and analysis to enable ML are summarised in Box 1.
One aspect that is important for ML purposes is the number of dilution points in concentration-response curves. In lead generation or early lead optimisation, a single point or a small number (e.g. 3) of concentrations is often chosen to enable the screening of a larger number of candidates in a given run. For later stages, a higher number of points is often chosen in order to obtain more accurate EC50 or IC50 values, which typically comes at the cost of a lower number of tested designs. From an ML perspective, however, a consistent data set is required to make optimal predictions across a program. While comparisons in terms of program efficiency, i.e. the time or resources to an optimised lead or candidate nomination, are generally not available, the same concentration ranges are required to create consistent, high-quality, ML-grade data sets. The choice of the number of repeats will also influence the accuracy of the measurement and hence influence predictions, which further needs to be taken into account. Overall, the number of concentration points and the number of repeats will strongly influence the data quality and should be a key consideration of the program if ML is to be employed and good predictions are sought.

Table 1. An illustration of the data generation from a typical assay cascade. The details of throughput, repeats, data quality assessed by Z-prime, and endpoints will be dependent upon aspects of the platform and the biology of the target.
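To make the trade-off concrete, the sketch below fits a four-parameter logistic (4PL) model to a synthetic eight-point dilution series. All values (concentrations, noise level, true EC50) are invented for illustration, and a simple grid search stands in for a proper curve-fitting routine:

```python
import random

def four_pl(conc, bottom, top, ec50, hill):
    """Four-parameter logistic concentration-response model."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)

random.seed(0)
concs = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0, 300.0]   # 8-point dilution (nM)
response = [four_pl(c, 5.0, 95.0, 8.0, 1.2) + random.gauss(0.0, 2.0)
            for c in concs]                               # add assay noise

# Crude fit: fix the plateaus from the data, grid-search EC50 and Hill slope
bottom, top = min(response), max(response)
best = None
for ec50 in (10.0 ** (e / 20.0) for e in range(-40, 61)):  # ~0.01 .. ~1000 nM
    for hill in (h / 10.0 for h in range(5, 31)):          # 0.5 .. 3.0
        sse = sum((r - four_pl(c, bottom, top, ec50, hill)) ** 2
                  for c, r in zip(concs, response))
        if best is None or sse < best[0]:
            best = (sse, ec50, hill)
print(f"fitted EC50 ~ {best[1]:.1f} nM (true value 8.0 nM)")
```

With only three concentration points the same fit becomes far less stable, which is one reason consistent, sufficiently dense dilution series are valuable for ML-grade data sets.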

Box 1.
In the context of enabling ML, important aspects of data generation include:
• Agreed-upon standard protocols (including plate layouts) for all assays if data is generated internally.
• Capture data with minimal manual handling and store it in a FAIR way [33]. For early research, a lower level than FAIR data may be sufficient but is not generally advised. This encompasses agreed data access, data field descriptions, definitions, and referenced controls. However, we generally recommend going beyond the FAIR data standards.
• If pre-existing data, such as public data, is used, check the origin (lab), assay type, assay variability, readout, and units of measurement, and confirm that these match. Apply stringent filter criteria to remove data that is not suitable for ML, and check whether there is still enough data to train a model [34][35][36]. A general guideline is that there should be more data points than model parameters in order to obtain reasonable performance and avoid overfitting and poor generalisation.
• Ideally, establish data and processing lineage and versioning, so that changes in source data or processing can easily be tracked by all teams.
• Employ at least technical repeats of the plate controls for assay QC (robust Z-prime and variance of repeats) and for tracking assay behaviour (e.g. drift). Robustness and reproducibility of the assay protocol need to be ensured/validated beforehand.
• Deploy clear acceptance criteria for data quality before use in modelling.
• For data with multiple classes (e.g. the two classes fail/pass in the case of a QC method), ensure that data points are available for each class [37].
• For regression, check that there are sufficient real-valued data points. If there are too many data points with a cut-off (e.g. IC50 > 100 µM), there might not be sufficient data to train a regression model and only a classifier might be possible.
• If real-valued data (not multiple classes) is used, the data should cover a wide range (e.g. pEC50 values of 7 to 11).
• Define clear business rules for the (non-ML) pre-processing or filtering of the data, for example for curve fitting or cut-offs for filters (e.g. purity).
• Deploy a consistent set of controls throughout a program to enable appropriate normalisation. Unlike for small molecules, for biologics controls are needed to monitor both the "performance" of the assay and the process parameters along the whole value chain. This allows monitoring of parameters such as changes in the quality of the DNA used to produce the clones/mAbs, or expression artefacts (host cell performance, cell batches, etc.), among other aspects.
• Track relevant confounding variables through metadata.

Assay considerations are equally important for developability attributes. Developability encompasses the ability to manufacture, formulate, and package a stable, homogeneous, high-concentration, specific drug as cost- and time-efficiently as possible [15,38]. Being able to predict the developability profile of a research molecule from sequence and HT assay data would ensure that molecules with poor developability are either removed from the R&D workflow or re-engineered to remove the detrimental property. When considering both pharmacology and developability attributes for molecules that are not simple IgGs, such as bispecific or multivalent formats, the individual building blocks may behave differently in the final format. Therefore, screening in the CDTP-envisioned product format as early as possible in the assay cascade is recommended [39]. A range of commonly optimised or screened properties are captured in Fig. 4.
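The robust Z-prime QC mentioned in Box 1 can be computed from plate controls using medians and MAD in place of means and standard deviations; a minimal sketch, where the control readouts are hypothetical:

```python
import statistics

def robust_z_prime(pos, neg):
    """Robust Z' from positive/negative plate controls, using the
    median and MAD (scaled to approximate the standard deviation)."""
    def mad_sd(xs):
        med = statistics.median(xs)
        return 1.4826 * statistics.median(abs(x - med) for x in xs)
    mu_p, mu_n = statistics.median(pos), statistics.median(neg)
    return 1.0 - 3.0 * (mad_sd(pos) + mad_sd(neg)) / abs(mu_p - mu_n)

# Hypothetical control wells from one plate (arbitrary signal units)
pos = [98, 102, 100, 97, 101, 99, 103, 100]
neg = [10, 12, 9, 11, 10, 13, 8, 11]
zp = robust_z_prime(pos, neg)
print(f"robust Z' = {zp:.2f}")
```

A Z' above roughly 0.5 is commonly taken to indicate a well-separated, usable assay window; the robust variant is less sensitive to single outlier wells.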

Components of a (good) ML process
All applied ML requires process validation. Process validation, unlike model validation, is crucial because we need to validate that the entire process - from data collection and processing to the actual model predictions - is applicable to the given problem. We define the process of deploying ML models in a biotherapeutic environment to create actionable insights as everything required from data creation to model building and making predictions. The main steps are laid out below.
Process validation plays a crucial role in any ML application for two primary reasons. Firstly, it ensures a realistic assessment of model performance, which is vital for determining the effectiveness and applicability of a model within a real-world drug discovery setting. Secondly, process validation enables the reliable comparison of different ML models and data processing steps, allowing researchers to identify the most suitable techniques for specific tasks.
A good ML process consists of the following steps:
1. Data collection: Obtain diverse, high-quality, and relevant data from relevant sources, including literature, public and commercial databases, and proprietary wet-lab experiments. Particular care needs to be taken when mixing data sources, due to high variability in execution and data collection standards, in particular when dealing with human-labelled training data [40].
2. Data curation, pre-processing and standardisation: Clean, organise, and transform the data to ensure consistency and reduce noise. It is helpful to adopt practices similar to the ones established for QSAR models [41,42], that is, ensure the data is collected using standardised experimental protocols and/or molecules with common formats.
3. Exploratory data analysis: Examine the data to understand its characteristics, imbalances, and distributions before building complex models. Collaboration with experimental scientists is essential here for a deeper understanding of the data [43]. In particular, data scientists need to acquire domain knowledge and understand the science behind the data. For example, when connecting 'microscopic' molecular properties of therapeutic antibody candidates with 'macroscopic' biophysical measurements, it is crucial to develop an understanding of computational biophysics, protein structure and dynamics, as well as the experimental biophysics protocols.
4. Choosing a model performance metric: Select appropriate metrics that align with the nature of the data and the application [18,[44][45][46]]. Note that metrics in specific applications such as protein structure prediction [47,48] might differ from, say, those for the prediction of biophysical measures associated with expression, purification, conformational stability, and colloidal interactions of proteins.
5. Model components and model choice: Determine the most suitable model based on the complexity of the problem, the available data, and other constraints [18]. In certain cases, such as mimicking biological processes like the generation of unwanted immune responses against therapeutic antibodies, it may be essential to develop multiple models, each representing a specific step in the process, and tie them together to obtain an improved understanding of the process itself.
6. Evaluation: Assess model performance using techniques like cross-validation, different data splits, and comparison to baseline methods to understand realistic model performance. This involves training different ML methods and benchmarking their performance using the same evaluation criteria. In some cases, consensus predictions or other ways of ensembling the models can perform better than individual models developed using specific ML methods.
7. Putting the model into production: Ensure scalability, computational efficiency, compatibility, and interpretability, and monitor the data and model performance over time in a production environment. This might require the repeated validation and deployment of models as new data arrives, and hence the building of strong MLOps pipelines.
In order to ensure realistic performance assessment and the biggest impact on drug discovery projects, ML researchers and engineers need to adhere to best practices for all of the above-mentioned steps. Doing so will result in more transparency and more reliable performance in real-world ML applications. The full process is summarised in Fig. 5.
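As a minimal illustration of the evaluation step, the sketch below runs k-fold cross-validation on an invented toy dataset and compares a 1-nearest-neighbour predictor against a mean-value baseline; no ML library is assumed:

```python
import random
import statistics

random.seed(1)
# Toy dataset: one feature x and a noisy target y = 2x + noise,
# standing in for a hypothetical assay readout
data = [(x, 2.0 * x + random.gauss(0.0, 0.5))
        for x in [random.uniform(0.0, 10.0) for _ in range(60)]]

def k_fold(data, k=5):
    """Yield (train, test) splits for k-fold cross-validation."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def mae(preds, test):
    return statistics.mean(abs(p - y) for p, (_, y) in zip(preds, test))

baseline_err, knn_err = [], []
for train, test in k_fold(data):
    # Baseline: always predict the training-set mean
    mean_y = statistics.mean(y for _, y in train)
    baseline_err.append(mae([mean_y] * len(test), test))
    # 1-nearest-neighbour prediction on the single feature
    preds = [min(train, key=lambda t: abs(t[0] - x))[1] for x, _ in test]
    knn_err.append(mae(preds, test))

print(f"baseline MAE = {statistics.mean(baseline_err):.2f}, "
      f"1-NN MAE = {statistics.mean(knn_err):.2f}")
```

Reporting model error next to a trivial baseline, as here, makes it immediately visible whether the model has learned anything beyond the average of the training data.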
In the following we will discuss each step of this process outline, highlight potential pitfalls, and make recommendations for best practices.

Data collection
In this section we delve into the essential aspects and considerations of gathering high-quality data for ML in a drug discovery setting. The goal of this step is to ensure that the collected data is accurate, relevant, and suitable for training ML models to make reliable predictions. This section covers important topics such as the predictive validity of assays, data correctness, the choice of measurement metrics, handling biological variability, data normalisation, addressing challenges in drug discovery data, and detecting and dealing with data drift. We will address each of these topics separately.
a. Predictive Validity of Assays: The predictive validity of an assay refers to its ability to predict a desired outcome accurately [25]. In this case, we refer to it as the likelihood of an assay outcome translating into a more complex experiment, which could be a more complicated assay or even a clinical trial. Whenever proxy assays are used, it is crucial to validate that the endpoint used correlates with the desired outcome (e.g. the endpoint/prediction needs to be correlated with go/no-go criteria). For example, a biochemical antagonism assay should first be validated to translate into a more complex cell-based readout. This is essential to ensure that the optimisation process yields useful candidates. While this is mainly the responsibility of the experimental domain expert, it is crucial that the entire team is aware of the limitations that the data imposes on ML models. If the data is not predictive, then the model will not have any impact on the program. For example, if biochemical assay data is used to predict binding, one cannot expect the ML model to reliably predict activity in a cell-based assay if the correlation of one with the other was not validated upfront.
b. Determine the Correct Set-up of the Assay: As previously discussed, the number of repeats and the type of assay used will have a large impact on data quality and quantity. Depending on the stage of the program, the correct setup should be chosen and maintained throughout the whole program. We further recommend establishing, maintaining, and adhering to versioned business rules for all data (pre)processing. These should encompass the normalisation, transformation, and scaling processes for the data. While the majority of this work will be the responsibility of senior experimental colleagues, it is crucial to also get input from data science colleagues to understand, for example, which consistent set of controls enables the best pre-processing of the data for ML purposes.
c. Confirming Data Accuracy: Ensuring data accuracy involves proper assay calibration and reproducibility to confirm that measurements reflect the desired behaviour. Maintaining consistent conditions is critical in ML, as later changes in the assay can introduce inconsistencies and confuse the model, resulting in poor performance. The use of control molecules and regular quality assessment of the assays is advisable. Furthermore, the collection of metadata can also help improve the quality and understanding of the collected datasets [49,50]. In particular, regular visual inspection of the data through e.g. box- or scatterplots makes it possible to spot errors, which can then subsequently be incorporated as automated quality controls. For example, if a small percentage of values is offset from the main distribution by a consistent amount (e.g. 3 log units), this suggests a systematic error such as an incorrect unit conversion; a secondary peak may point to wrongly used threshold placeholders (such as "<10 μM" being misinterpreted as an exact value).
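A simple automated check of this kind might flag values far from the main distribution, as well as suspiciously repeated values; the pIC50 values below are invented, with a small cluster offset by roughly 3 log units and a repeated placeholder-like value:

```python
import statistics

# Hypothetical pIC50 values: most around 8, a few offset low
# (e.g. a unit mix-up), plus identical repeats suggesting placeholders
pic50 = [7.8, 8.0, 8.1, 7.9, 8.2, 7.7, 8.0, 7.9, 8.1, 8.3,  # main distribution
         4.8, 5.1,                                           # ~3 log units low
         4.0, 4.0, 4.0]                                      # repeated value

med = statistics.median(pic50)
mad = statistics.median(abs(x - med) for x in pic50)
robust_sd = 1.4826 * mad  # MAD scaled to approximate the standard deviation

# Flag values more than 3 robust SDs from the median, and values
# repeated so often that they look like threshold placeholders
outliers = [x for x in pic50 if abs(x - med) > 3 * robust_sd]
placeholders = {x for x in pic50 if pic50.count(x) >= 3}

print("median:", med, "flagged offsets:", outliers,
      "possible placeholders:", placeholders)
```

Checks like these only raise candidates for review; whether a flagged value is a unit error, a placeholder, or a genuine measurement still needs to be confirmed against the assay metadata.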
d. Choosing the Correct Measurement Metric and Process: Selecting a measurement metric that is appropriate for the specific ML problem (e.g. AUC, EC50, maximum activation/inhibition values) and process (incl. data) directly impacts ML performance and the suitability of combining data from different sources [40]. Consistency across data processing steps and protocol standardisation is key for optimal model training. Input from data science colleagues must be sought to inform these decisions, but ultimately they should be driven by experimental experts. For example, when AUC values from different sources are used, it is essential to confirm that baseline subtraction, hook-effect removal, and curve fitting processes have been applied consistently to the data in order to ensure comparability.
e. Minimise Biological Variability: Biological data often exhibits inherent variability and noise, making it important to rely on biological and technical repeats. Understanding the variance between repeated measurements helps set a baseline for the best-case model performance, as a model cannot realistically outperform assay accuracy [51,52]. This can be done by using the experimentally measured error distributions to simulate repeated measurements for a larger amount of data and then evaluating different correlation metrics based on the resulting simulations, as done by Brown et al. [53]. The impact of experimental errors is usually more significant for datasets with a limited dynamic range and less problematic for larger ranges. For example, when the error is nearly half of the dataset's dynamic range, achieving a meaningful correlation becomes nearly impossible. This implies that experimental error usually poses a small problem for early projects (e.g. hit finding), but a much larger one in, for example, late-stage lead optimisation, or when we want to increase selectivity step by step by a small amount in each campaign. Notably, the variability will vary between different assays. For example, variabilities in cell-based assays tend to be around 20-30%, while biophysical measurements such as melting temperature can be accurate to 0.1 degree when measured by calorimetry and hence usually exhibit much lower variability. Additionally, controls can be used to discard readouts that exhibit excessive variability.
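A minimal version of such a simulation, in the spirit of (but not reproducing) the approach of Brown et al. [53], estimates the best-case correlation by correlating two simulated noisy replicates of the same underlying values; the dynamic ranges and error level below are hypothetical:

```python
import random
import statistics

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_case_r(dynamic_range, assay_sd, n=2000):
    """Correlation between two noisy replicates of the same 'true' values:
    an upper bound on what any model could achieve against one replicate."""
    true = [random.uniform(0.0, dynamic_range) for _ in range(n)]
    rep1 = [t + random.gauss(0.0, assay_sd) for t in true]
    rep2 = [t + random.gauss(0.0, assay_sd) for t in true]
    return pearson(rep1, rep2)

# Hypothetical: 1 log-unit assay error vs 6- and 2-log-unit dynamic ranges
r_wide = best_case_r(6.0, 1.0)
r_narrow = best_case_r(2.0, 1.0)
print(f"wide range: r ~ {r_wide:.2f}, narrow range: r ~ {r_narrow:.2f}")
```

The same assay error that still permits a respectable correlation over a wide dynamic range caps the achievable correlation at a much lower value over a narrow one, which is why reported model correlations should always be read against the assay's own repeatability.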
f. Use Data Normalisation Based on Controls: Data normalisation using controls is essential when data comes from the same laboratory, as it helps to calibrate data, particularly for cellular assays. For example, normalising AUC measurements using a combination of plate controls and the average control AUC ensures consistency (Fig. 6), as does the normalisation of expression data across different plates, batches, and days (Fig. 7). However, this can be challenging when using public data sources or data from different service providers, which may lack the appropriate controls. Harmonisation of experimental conditions and protocol standardisation are again key to building larger consistent data sets, and using controls to normalise data sets to a common reference is essential to improve the consistency of the data [54].

g. Other Challenges with Drug Discovery Data: Drug discovery data often has set cut-offs, such as maximum concentration limits in concentration-response curves. These are derived from the specific targets of the program (e.g. the required target potency) and will cover a reasonable range around these targets. For example, concentration-response curves have a number of dilution steps and will hence cover a few orders of magnitude of potencies. While these values are usually set to the most relevant range for the program, the properties of the molecules might not always fall within these ranges, leading to values beyond the limits. For example, for an assay limit of 100 µM on the upper side, a compound might receive a cut-off value of >100 µM. Such cut-offs can restrict the ability to train regression models, requiring the use of categorical models (classification) with lower-resolution predictions (for example >5 µM or <5 µM potency). Therefore, it is crucial to understand what type of data to collect to maximise the model's ability to make relevant predictions, especially when relying on public data sources [55][56][57].
Another challenge with drug discovery data is the harmonisation of experimental data across different functional units of an organisation. For example, data on protein expression, purification, and biophysical characterisation generated at pre-formulation stages may not be easily translatable to similar experiments performed at the CMC development and formulation stages. This often happens for several reasons, including the cell lines used (transient expression versus a CHO cell line optimised for the candidate), the scale of protein made, differences in the experimental protocols followed in the discovery and development stages, and even differences in the instruments used to make the experimental measurements. This ties in with the translatability of early-stage proxies to late-stage measurements that we touched on earlier.
A third challenge with experimental data collected in a discovery setting is time lapse. Typically, it is harder to use data from older projects than from current ones, even if the same target is being worked on again, due to changes in data collection and storage methods. This also means that experimental data collected and ML models built a couple of years earlier may no longer be current because of the adoption of newer instruments and experimental methods by the experimental laboratories. Therefore, collaboration between data scientists and experimentalists needs to happen on a continual basis to fully realise the potential of digital transformation in the pharmaceutical industry.

h. Detecting and Dealing with Data Drift: Being aware of changes in experimental setups, such as reagent changes, equipment accuracy variations, or data processing alterations, is vital, as these can have significant impacts on the ML models trained. Consistently monitoring and addressing data drift, e.g. via the deployment of an appropriate set of controls, helps maintain model performance and ensures that any changes in experimental conditions do not go unnoticed. In addition, periodic updates of the model using the growing amounts of available experimental data are also recommended to minimise the impact of data drift. We generally recommend the visualisation of process data alongside molecular profiles throughout the program.
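A basic drift check on a tracked control molecule can be sketched as follows; the control readouts and the 3-standard-deviation rule are illustrative choices, not a validated protocol:

```python
import statistics

# Hypothetical daily readouts of the same control molecule over a campaign
control = [100, 101, 99, 102, 98, 100, 101, 99,   # stable baseline period
           104, 107, 110, 113, 116]               # gradual upward drift

# Characterise the baseline from the first eight runs
baseline = control[:8]
mu, sd = statistics.mean(baseline), statistics.stdev(baseline)

# Flag any run whose control value deviates more than 3 SDs from baseline
drifting = [(day, value) for day, value in enumerate(control)
            if abs(value - mu) > 3 * sd]
print(f"baseline {mu:.1f} +/- {sd:.1f}; "
      f"drift flagged on days {[d for d, _ in drifting]}")
```

In practice such a flag would trigger investigation of reagents, instruments, or processing changes, and possibly re-normalisation or retraining before further model predictions are trusted.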
In case of a failed quality control, it is necessary to diagnose the source of the error; here, metadata can also be helpful.
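The plate-control normalisation described in (f) can be sketched as follows; the plate readouts and control values are invented, with the second plate simulating a two-fold gain shift:

```python
import statistics

# Hypothetical raw AUC readouts per plate, each with its own plate controls
plates = {
    "plate_1": {"controls": [200.0, 210.0, 190.0],
                "samples": [150.0, 100.0, 50.0]},
    "plate_2": {"controls": [400.0, 420.0, 380.0],   # 2x gain shift vs plate 1
                "samples": [300.0, 200.0, 100.0]},
}

def normalise(plate):
    """Express each sample as percent of the median plate control."""
    ref = statistics.median(plate["controls"])
    return [100.0 * s / ref for s in plate["samples"]]

for name, plate in plates.items():
    print(name, [round(v, 1) for v in normalise(plate)])
```

After normalisation, both plates report identical percent-of-control values despite the simulated gain shift, which is exactly the kind of consistency needed before pooling plates, batches, or days into one training set.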

Data curation and pre-processing
Antibody engineering by means of ML mandates stringent data handling and preprocessing, which is key to attaining reliable, meaningful outputs. Outliers, noisy data, or data that might confuse a model (for example, a pIC50 of 5 in a classification task, removed to distinguish more clearly between active and inactive molecules) should be carefully addressed. Although it reduces data quantity, the removal of such data can be beneficial for the predictive performance of an ML model.
Data curation and preprocessing have a major impact on the behaviour and performance of an ML model. The cheminformatics community has over the years established clear guidelines on how to process and filter data to make it suitable for ML purposes [41,42,[58][59][60][61]].
While there are some attempts to provide similar guidelines for bioinformatics and ML for biology [62], most publications are limited to best practices for computational modelling and do not offer advice on data preparation and quality [17,22,[63][64][65][66][67][68]]. For preprocessing of antibody and T-cell receptor data, the AIRR (Adaptive Immune Receptor Repertoire) community has established guidelines that enable standardised input and output of AIRR data, which enables interoperability between AIRR-compliant bioinformatics tools [69][70][71][72].
In the following, we highlight key steps that should be taken in order to obtain good predictive performance and to ensure that this performance is representative of the actual application in a program. Data integration, cleaning, and binning are typically performed without the need to consider the ML model, while data transformations and feature extraction, selection, and transformation are performed repeatedly during the model training and evaluation process to evaluate the model performance in various settings.
a. Data Integration: Many large company databases and public data sources are not ideal for training ML models due to the lack of appropriate controls required for normalisation and varying or changing assay protocols over time. Generally, better data quality results in better model performance. Nonetheless, training models on combinations of different data sources, for example public and in-house data or different public data sources, can be beneficial under certain circumstances [73][74][75]. One large-scale analysis of small molecule (ChEMBL) data concluded that augmenting mixed public IC50 data with public Ki data does not deteriorate the quality of the mixed IC50 data if the Ki values are corrected by an offset. However, in our experience this is not the case for biologics, where consistent internal data typically leads to the best outcomes, since the production process has a much bigger impact on the final molecule and variations in it hence disproportionately affect the results. In order to decide how much data you need, a common rule of thumb is at least ten times as many data instances (data points) as there are data features. However, this depends strongly on the selected features, the quality of the data, and the complexity of the problem [76]. There are hence many exceptions to this rule.
b. Data Cleaning: Enhancing model accuracy can be achieved by removing noisy data points or outliers. For example, when training a classification model, excluding data points with pIC50 = 5 (commonly considered the threshold for inactive compounds) can improve model performance. This step may involve techniques such as outlier detection, data imputation, and standardisation of units and formats. It is in general advisable to get input from experts who understand the experimental setup and data very well when performing this step. It is, for example, common in many databases that active compounds are over-represented due to reporting biases. For instance, [77] reported that the average pIC50 value of the whole distribution of data in ChEMBL25 is 6.57 for small molecules. This is likely different from prospective unbiased library screens, where it is common for 95% of the compounds to have pIC50 < 5, a large train-test distribution shift.
For biologics these ratios will vary, but being aware of such changes in the distribution is equally important in order to allow adequate processing of the data (e.g. resampling) to create a more representative data distribution for the actual application.
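As a minimal illustration of such preprocessing, the following sketch (in Python, using pandas) removes censored measurements sitting exactly at the reporting threshold and undersamples the majority class. The column name `pIC50`, the threshold value, and the undersampling strategy are illustrative assumptions, not a prescribed pipeline.

```python
import pandas as pd

def clean_and_balance(df, threshold=5.0, seed=0):
    """Drop censored points at the assay threshold, label actives, and
    undersample the majority class (illustrative strategy only)."""
    # Remove ambiguous measurements sitting exactly at the reporting threshold.
    df = df[df["pIC50"] != threshold].copy()
    df["active"] = (df["pIC50"] > threshold).astype(int)
    # Randomly undersample every class down to the minority class size.
    n_min = df["active"].value_counts().min()
    return (
        df.groupby("active", group_keys=False)
        .apply(lambda g: g.sample(n=n_min, random_state=seed))
        .reset_index(drop=True)
    )

# Toy data: imbalanced, with some censored values at the threshold.
data = pd.DataFrame({"pIC50": [5.0, 5.0, 4.1, 4.3, 4.2, 4.4, 4.0, 6.2, 6.8]})
balanced = clean_and_balance(data)
```

In practice, the decision of which points to drop or resample should be made jointly with the experimentalists, as discussed above.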
c. Binning of Data: For classification, it may be necessary to bin data into active and inactive groups. The choice of group ranges can significantly impact model performance. Initial surveys of the data as well as feedback from the experimentalists can help set appropriate data bins. For example, IgG antibody purification data is often obtained using size exclusion chromatography (SEC). The experimental result consists of a chromatogram showing relative abundance (peaks) of high molecular mass species, monomer content, and low molecular mass species in terms of the percentages of the areas under the peaks. Percent monomer is often used as an indicator of antibody quality when these measurements are performed over a large set of samples. The percent monomer in these samples may range from 0 to 100%. A typical quartile-based binning of the data without any input from experts in the field may bin the samples as of poor quality (percent monomer below 25%), good quality (25%-75%), and high quality (>75%). However, domain experts would often consider IgG antibodies with percent monomer below 90% as of poor quality, 90-95% as of good quality, and those with >95% monomer content as of high quality [78,79].
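The contrast between naive quantile-based binning and expert-informed binning can be sketched as follows; the percent-monomer values are toy numbers, and the expert thresholds follow the ranges quoted above.

```python
import pandas as pd

# Percent-monomer values from SEC, as described above (toy numbers).
monomer = pd.Series([99.1, 96.4, 93.2, 88.7, 97.8, 91.5, 85.0])

# Naive quartile-based binning, agnostic of domain knowledge.
naive_bins = pd.qcut(monomer, q=[0, 0.25, 0.75, 1.0],
                     labels=["poor", "good", "high"])

# Expert-informed binning using the thresholds from [78,79]:
# <=90% poor, 90-95% good, >95% high.
expert_bins = pd.cut(monomer, bins=[0, 90, 95, 100],
                     labels=["poor", "good", "high"])
```

Note that the naive quartile bins label several samples "good" or "high" that a domain expert would reject outright.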
d. Averaging Data Over Technical and/or Biological Repeats: Data is typically averaged over technical and biological repeats, helping reduce noise and error. If the data distribution is narrow, averaging is advisable. However, if data points are disparate, outliers are better discarded, unless a clear rationale for their preservation is presented. The trade-off between diversity, replicates, and the final number of compounds tested requires careful consideration, as previously discussed. Based on our experience, averaging repeated measurements and using the mean to train ML models is a best practice. However, individual repeats might be more suitable for certain models, especially when data errors or noise levels are used for calibrating model uncertainties, as is commonly the case for Bayesian methods [32,80].
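A simple sketch of replicate averaging with a crude outlier check might look as follows. The spread threshold and the toy affinity values are illustrative assumptions; real outlier handling should be agreed with the experimentalists.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "variant": ["A", "A", "A", "B", "B", "B"],
    "affinity": [7.1, 7.2, 7.0, 6.0, 6.1, 9.5],  # B has one disparate repeat
})

def robust_mean(values, max_spread=1.0):
    """Average technical repeats; for groups whose spread suggests an
    outlier, drop the point furthest from the median before averaging."""
    v = np.asarray(values, dtype=float)
    if v.max() - v.min() > max_spread:
        v = np.delete(v, np.argmax(np.abs(v - np.median(v))))
    return v.mean()

means = df.groupby("variant")["affinity"].apply(robust_mean)
```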
e. Feature Extraction and Selection: Converting raw data into a set of features or descriptors is crucial for effectively training ML models in drug discovery.
In the antibody or protein engineering space, features may include protein sequence or sequence-derived structural features, designs (e.g. a combination of combinatorial antibody parts), 3D structure, learned representations, or more conventional molecular descriptors (amino acid composition, dipeptide composition, tripeptide composition, or pseudo amino acid composition). Physicochemical properties of amino acids, such as hydrophobicity, charge, size, and polarity, can also be used to compute various descriptors, including autocorrelation, Moran, and Geary coefficients. Finally, the incorporation of evolutionary information, for example through multiple sequence alignment, has been particularly helpful for computational structure prediction.
For protein sequence-based features, values can be directly extracted from the sequences, such as amino acid type, evolutionary information (e.g. profile representations in the form of PSSMs from PSI-BLAST), or features predicted by other tools, such as secondary structure and solvent accessibility. Representation models like transformers can be used to learn features from large volumes of unlabelled data, aiming to represent the innate structure of the data. Several pretrained representation models are available for proteins (e.g. CPCProt, deepGOCNN, ESM-1b, ProtTrans/ProtBert, rawMSA, SeqVec, GearNet, UniRep, AntiBerty, and AntiBerta), and we recommend carefully evaluating these depending on the specific task at hand [34,[81][82][83][84][85][86]]. In addition to these, physicochemical features for individual amino acids represented in AAIndex may also be useful [87][88][89].
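As a minimal example of a sequence-derived feature, the amino acid composition mentioned above can be computed directly from a sequence; the fragment shown is a toy example, not a real antibody region.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence,
    returned as a fixed-length feature vector."""
    counts = Counter(sequence)
    total = len(sequence)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]

# Toy sequence fragment; real pipelines would run this over full chains.
features = aa_composition("GYTFTSYW")
```

Dipeptide or tripeptide compositions extend the same idea to pairs or triples of adjacent residues.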
Feature selection is important when dealing with protein sequence-based features, as many features can be redundant. Feature selection offers several advantages, including a decrease in the overall number of tunable parameters in the algorithm, reducing the likelihood of overfitting. A reduced number of input features can also increase the algorithm's speed, which is crucial for large-scale applications. Most importantly, a concise list of relevant features aids in understanding the essential characteristics of the problem at hand. Feature selection can be categorised into three types: (i) 'wrappers' use the ML algorithm as a black box to select features based on their performance, (ii) 'filters' select feature subsets without considering the ML algorithm, and (iii) 'embedded' techniques are part of the ML algorithm training procedure. Choosing the correct representation for the task at hand is crucial and typically more important than the choice of the model used. It is recommended to evaluate a range of representations in combination with simpler models to identify the most suitable approach for the specific problem. In drug discovery, simple models have repeatedly been shown to outperform more complex ones [90][91][92], and should hence be used at least as a baseline before moving on to more complex ones such as deep learning. Non-redundancy can be a criterion for feature selection; this can be achieved by clustering the features based on the pairwise correlation among them for a given set of proteins/antibodies. In recent analyses of marketed and clinical-stage biotherapeutics, it was shown that only a few features were needed to build profiles of antibodies likely to reach or succeed in the clinic [93,94]. Depending on the prediction target, there are also many choices for the output representation. For example, one might want to predict the structure of a protein via all atom positions or a reduced representation such as the ɑ-carbon only. Choosing the right output representation is equally important for a successful application.
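A simple 'filter'-type selection based on pairwise correlation, as mentioned above, could be sketched as follows; the 0.95 threshold and the synthetic features are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.95):
    """Filter-type selection: drop one feature from each highly
    correlated pair, keeping the first occurrence."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=100)})
X["f2"] = X["f1"] * 2.0 + 1e-3 * rng.normal(size=100)  # near-duplicate of f1
X["f3"] = rng.normal(size=100)                          # independent feature
X_reduced = drop_correlated(X)
```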
f. Feature Scaling: Real-world datasets often contain features that vary in magnitude, range, and/or units. In order for ML models to interpret these features on the same scale, we need to perform a step called feature scaling, which involves standardising the range of feature values to ensure that no single feature dominates the model. Common techniques include normalisation (scaling features to a range of 0 to 1 or -1 to 1) and standardisation (scaling features to have zero mean and unit variance). The appropriate feature scaling will likely depend both on the data and on the algorithm that is used. Some algorithms require, or preferably operate on, scaled features [95,96].
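Both techniques are readily available, for example via scikit-learn; the feature values below are toy numbers chosen only to show a scale difference (e.g. molecular weight in Da versus net charge).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (toy values).
X = np.array([[150000.0, 2.0],
              [148000.0, -1.0],
              [152000.0, 0.5]])

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
```

Whichever scaler is used, it should be fitted on the training split only and then applied unchanged to the test split, to avoid information leakage.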
g. Data Transformation: Apply transformations to the data to reduce dimensionality, enhance interpretability, or improve model performance.
Examples include principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and log transformation. Dimensionality reduction methods can be used to reduce the number of features and hence reduce training times. They can also be used to assess the importance of features, perform sense checks, confirm biological hypotheses, or act as regularisation [97,98].
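A minimal PCA sketch using scikit-learn; the synthetic data is constructed so that most of its variance lies in three latent directions, which PCA then recovers.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 50 samples with 20 features, generated from 3 latent factors plus noise.
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(50, 20))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)          # 20 features -> 3 components
explained = pca.explained_variance_ratio_.sum()
```

Inspecting `explained_variance_ratio_` is a useful sense check: if a handful of components capture almost all variance, many raw features are redundant.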
h. Simulations: There is a lack of large-scale ground-truth experimental data. This hinders the development and benchmarking of robust and interpretable ML approaches [99,100]. To address this problem, there is a need to complement analyses on experimental data with simulated ground-truth data. The challenge is to generate simulated data such that it incorporates key features observed in experimental repertoires that render ML problems challenging. Simulation frameworks for antibodies range from VDJ-recombination-like antibody generation [101,102] to synthetic antibody-antigen structures [103]. Together, these tools allow for large-scale, high-throughput, and real-world relevant synthetic data generation. The simulation tools cited here have been tested for nativeness vis-a-vis experimental data. Extension of experimental observations via simulations can also help explore deeper correlations among different attributes, such as aggregation and immunogenicity of antibody-based therapeutics. For example, molecular dynamics trajectories could be used as input to machine learning models to enable better prediction of molecular properties such as binding, similar to small molecules [104,105]. Availability of curated publicly available datasets is required for comparison of methods based on a single agreed-upon dataset [13]. The OAS (Observed Antibody Space) [106] and iReceptor [107] databases represent starting points into which novel datasets associated with functional metadata may be integrated. These datasets could be set up for different ML tasks, ranging from antibody structure prediction and antibody-antigen docking to antibody developability prediction. They would not only represent a data standard but also necessary building blocks for public competitions [48,[108][109][110][111]]. Competitions are integral to mapping both those areas where predictability is good and where knowledge blank spots exist.

Exploratory data analysis
Exploratory Data Analysis (EDA) is an essential step in the ML process. It allows for a better understanding of the data and helps inform subsequent modelling decisions. In this section, we discuss several aspects of EDA in the context of drug discovery: a. Assessing property distributions: Analyse the distribution of biophysical properties, activity (e.g. IC50 or Ki value range), potency, selectivity, positional amino acid frequency, antibody topology, or other relevant features in the dataset. This analysis can help identify potential biases, outliers, or trends that may impact model performance. It will also impact the model's prediction ability and enable determination of whether additional preprocessing or normalisation steps are required.
b. Coverage of the target space and dynamic range of the data: Evaluate the data's coverage of the target space (the final outputs for given inputs) to ensure that the model can make accurate predictions across the entire range. Assess the dynamic range of the data, as small ranges may lead to poor model performance. For example, if selectivity ranges are small, it is unlikely that the model can make reliable predictions far outside these ranges (c.f. model applicability domain). When dealing with very small dynamic ranges, the question of sufficient resolution of the corresponding assay might arise. If the underlying assay is not able to distinguish variants with diverse properties in a significant manner, computational methods built on top of such data will most likely fail as well. The dynamic range of a dataset can have a large impact on the apparent correlation between experimental and predicted activity, and the literature is full of examples of what appear to be impressive correlations on datasets that span an unrealistically high range. So when testing a model, it is also important to ensure that the evaluation range is representative of the application it is intended for. When only data within this typical range is considered, these apparent correlations can decrease dramatically [112].
c. Evaluating data imbalance: Assess the balance between different classes or ranges of values in the dataset, as imbalanced data may negatively impact the performance of ML models. Techniques such as re-/oversampling [113][114][115], undersampling [113][114][115], changing the decision threshold (for classification) [116], or using weighted loss functions [115] can help address this issue and should be considered where appropriate [117][118][119][120]. Furthermore, different metrics should be chosen for the performance evaluation depending on the data imbalance. We will discuss this further in the later sections on metrics.
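One of the simplest remedies, a weighted loss, is available in many scikit-learn estimators via the `class_weight` argument. The following sketch, on synthetic and purely illustrative data, contrasts it with an unweighted fit on a heavily imbalanced problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
# Imbalanced toy set: ~5% actives, with one weakly informative feature.
n = 1000
y = (rng.random(n) < 0.05).astype(int)
X = y[:, None] * 1.5 + rng.normal(size=(n, 1))

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

bal_plain = balanced_accuracy_score(y, plain.predict(X))
bal_weighted = balanced_accuracy_score(y, weighted.predict(X))
```

The unweighted model tends to predict the majority class almost everywhere, which the balanced-accuracy metric exposes; the weighted loss shifts the decision boundary towards the rare positive class.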
d. Model applicability domain: It might be possible to evaluate the applicability domain of the model based on similarity of training to test/production data or uncertainty [121,122].For similarity, consider factors such as the similarity of the training set property range to the target property range, input sequence similarity, or clustering representations.Keep in mind that sequence similarity does not always imply phenotypic similarity, as sequence-similar antibodies may bind different antigens.The applicability domain should in particular be considered when deploying generative models [123], since these can easily exploit weaknesses of the scoring functions [124,125].
e. Simple correlation and cluster analysis: Perform correlation and cluster analysis to gain insights into the relationships between variables, identify potential outliers or patterns, and inform the choice of features and models. This information can be valuable for improving model performance and interpretability. Additional methods for outlier detection should also be considered, e.g. [126].

Choosing the correct model loss function and performance metrics
Selecting a set of appropriate metrics is critical for assessing the performance of ML models in drug discovery [127]. This section will discuss various metrics for regression and classification tasks and when to apply them. Rather than being comprehensive, we only try to give the reader a flavour of the variety of methods and refer to the literature for further reading. It is important to note that metrics are different from loss functions. Loss functions measure the model performance during training and are used to optimise machine learning models by minimising the loss in order to derive the optimal performing model. The loss function hence usually needs to be differentiable (i.e. a gradient can be calculated) with respect to the model's parameters. Metrics, on the other hand, are used to monitor and measure the performance of a model both during training and testing. Not all metrics are differentiable, in particular if they are not used for training but only for the final evaluation of the model. While there is a substantial overlap between the two, here we focus on the model performance metrics [128]. For the final model evaluation we typically look at multiple different metrics, while for the loss function typically a single one is chosen, where additional terms may be added for regularisation [129]. Both need to be chosen carefully alongside the optimiser in order to be successful. The advice around the metrics that are discussed below will generally hold for the loss function as well as for the final model evaluation.
1. Regression Metrics [18,20,130]: • Mean Absolute Error (MAE/L1): This is the absolute value of the difference between the prediction and the observed value. It should be used when you want to minimise the effect of outliers, as it is less sensitive to these than the Mean Square Error (MSE). • Mean Square Error (MSE): This is the square of the difference between the prediction and the observed value. The MSE is, on the other hand, susceptible to outliers (extreme values), and these might impact the model performance disproportionately. MSE may also punish errors too heavily if your targets have a large value spread. Mean squared logarithmic error may then be more appropriate, specifically if you are dealing with large, unscaled quantities.
• R-Squared: Use when you want a robust and interpretable metric that considers the proportion of variance explained by the model. • Pearson Correlation Coefficient (r): Use when assessing linear correlations between two sets of data. • Spearman Correlation Coefficient (rho): Use when assessing non-linear, monotonic relationships between two sets of data, specifically to measure rank correlation.
It is essential to remember that correlation coefficients have themselves an associated error, dependent on both the correlation coefficient itself and the number of data points used to obtain it. It is hence good practice to evaluate confidence intervals for the correlation coefficients, in particular when comparing these across different models. Only if the confidence intervals of two correlation coefficients do not overlap is there a clear advantage. However, the existence of an overlap does not necessarily imply a lack of difference between two models [131].
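Confidence intervals for correlation coefficients can, for example, be obtained via a percentile bootstrap; the following is a minimal sketch in which the number of resamples and the synthetic data are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for Spearman's rho."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rhos = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (x, y) pairs with replacement
        rhos.append(spearmanr(x[idx], y[idx])[0])
    return tuple(np.percentile(rhos, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = x + rng.normal(scale=0.5, size=40)  # strong but imperfect correlation
lo, hi = bootstrap_spearman_ci(x, y)
```

For small test sets the resulting interval is typically wide, which is exactly the point: a single correlation value without an interval can be badly misleading when comparing models.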
2. Classification Metrics: For classification tasks, it is important to select the appropriate threshold for optimal performance. Several metrics can be used to evaluate classification models: • Balanced Accuracy: Use when dealing with imbalanced data, as it considers both sensitivity and specificity. • Accuracy: Use when the problem is balanced and classes are equally important. • Precision: Use when you want to focus on the accuracy of true positive predictions. • Recall: Use when you want to focus on the proportion of true positive predictions out of all possible positive predictions. • F1 Score: Use when dealing with imbalanced data and focusing on the positive class. It balances precision and recall. • ROC-AUC (Receiver Operating Characteristic/Area Under the Curve), also called AUROC: Use for balanced data, as it measures the trade-off between true positive rate and false positive rate [130]. Avoid using ROC-AUC for imbalanced data [132]. ROC-AUC can summarise the performance, with perfect classifiers having an AUC of 1 and a random one having 0.5. It is another measure of model calibration, as it assesses model performance across all possible decision boundaries and is directly related to the Mann-Whitney statistic [133]. • Partial AUC: Instead of the area under the entire curve, this is a restriction to certain sensitivity and specificity ranges. This might be useful when certain decision boundaries are irrelevant. • PR-AUC (Precision-Recall Area Under the Curve): Use for imbalanced data sets or when you care more about the positive class, as it focuses on precision and recall.
• MCC (Matthews Correlation Coefficient): Use when you want a metric that considers all four values of the confusion matrix, providing a comprehensive evaluation of binary classification. • Cohen's kappa: This is the chance-corrected accuracy. It accounts for class imbalance and is useful for determining if a model is better than a naive baseline.
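Most of these metrics are pre-implemented in common packages. The small worked example below, on toy imbalanced predictions, illustrates how plain accuracy can flatter a majority-class guess while balanced accuracy and MCC do not.

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Toy imbalanced data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]

# Always guessing the majority class already reaches 80% plain accuracy.
acc_naive = sum(t == p for t, p in zip(y_true, [0] * 10)) / 10

bal_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
auroc = roc_auc_score(y_true, y_score)  # uses scores, not hard labels
```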
For multiclass predictions, metrics such as the Matthews correlation coefficient or extensions of the above scores can be used but require additional care [134,135]. Choosing the metric with care is crucial, since these can otherwise be misleading; for example, Cohen's kappa should be avoided [136]. In addition, when predicting probabilities (rather than classes), other metrics such as the Kullback-Leibler divergence need to be chosen [137].
When selecting a metric, consider the specific requirements and goals of the drug discovery project, as well as the distribution of the data. Different metrics may be more suitable under different circumstances, such as data imbalance [138] or the importance of certain classes. For example, high specificity or low false positive rates are often desired, especially in scenarios where the rarer positive class is the main interest. In general it is advised to always assess models with multiple metrics in order to find the most suitable one for a given task. In most common packages, such as sklearn [138,139], all metrics are already pre-implemented and hence readily available. For classification, it is also helpful to change the classification threshold based on the task at hand, as the standard binary classification threshold of 0.5 is not always ideal.
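Adjusting the decision threshold can be done with a simple scan over candidate thresholds on a held-out validation set; in the sketch below the probabilities are toy values and F1 is an illustrative choice of selection metric.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy validation labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.55, 0.6, 0.8])

# Use each observed probability as a candidate threshold and keep the
# one that maximises F1 on this validation set.
thresholds = np.unique(y_prob)
scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best_t = float(thresholds[int(np.argmax(scores))])
```

Here the default 0.5 cut-off misses one of the three positives, while a slightly lower threshold recovers it; the chosen threshold should of course be validated on data not used for the scan.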
It is good practice to monitor the learning curve over the training process for both the loss function and the performance metric.This allows for spotting undesired behaviour such as overfitting.

Model components and model choice
Choosing the right ML model for drug discovery projects is crucial for achieving desired results.This section discusses various aspects of ML models, from data considerations to model types and best practices.
1. General Data and Program-Specific Data: Depending on the dataset size and problem scope, you can use pre-trained or multi-class models or train new models from scratch.When you have a large, diverse dataset that covers various aspects of the problem, pre-trained models can be fine-tuned for your specific application.For smaller, domain-specific datasets or unique problems, training a new model might be necessary.
2. Model Requirements: Consider the specific goals of the drug discovery project and the characteristics of the data when selecting a model.This includes interpretability, computational resources for training, and the model's ability to generalise to new data.
3. Data Preprocessing: Data preprocessing techniques like StandardScaler, MinMaxScaler, and PowerTransformer can be crucial for achieving good results [140][141][142]. StandardScaler removes the mean and scales features to unit variance, while MinMaxScaler scales features to a specified range. PowerTransformer applies a power transformation to make data more Gaussian-like.
Standardising data is important for many ML estimators because they may perform poorly if the individual features do not resemble standard normally distributed data. This can be particularly relevant for models that rely on the assumption of normally distributed data or are sensitive to feature scales, like linear regression or support vector machines. Automating the process so that you can test a variety of combinations is hence generally useful and a best practice.
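Such automation can, for example, be achieved with a scikit-learn Pipeline in which the scaler itself is treated as a hyperparameter; the synthetic regression data and the Ridge model below are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# The "scale" step is swapped out by the grid search, so all three
# preprocessing options are compared under identical cross-validation.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
search = GridSearchCV(
    pipe,
    {"scale": [StandardScaler(), MinMaxScaler(), PowerTransformer()]},
    cv=5,
)
search.fit(X, y)
best_scaler = type(search.best_params_["scale"]).__name__
```

Fitting the scaler inside the pipeline also guarantees that it is refitted on each training fold only, avoiding leakage from the validation folds.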
In terms of ML models, there is a range of models that could, or indeed should, be tested, and in many circumstances it is advisable to start simple and move to more complex. Below we suggest a general flow of models that can be used to first test the data splits and evaluate baselines for the model performance: • Dummy Models: Start with simple baselines to evaluate the performance of more complex models. The simplest models are dummy models, which for example can just make random predictions or majority class predictions. These are helpful to establish an absolute minimum baseline to beat. • Adversarial Validation [143,144]: Train a classifier to distinguish training from test data; if it can do so easily, the split is likely biased [145]. It can be used to invalidate wrong hypotheses. A related method is y-scrambling [146]. In y-scrambling, the model is first trained on the original data and the performance metric is observed. The y-labels are then shuffled so that the correct feature-target pairs are replaced with incorrect feature-target pairs, in other words with incorrect labels. The model is then retrained on this data and the performance metric observed. The last step is repeated a few times to obtain a sample of the performance. It is expected that the model performs well on the original data and poorly on the shuffled data. If that is not the case and the metric does not vary much, then the predictions are not robust and the model predictions are likely not reliable. As an example, in [147] the authors showed that the original model had a high r² and low RMSE scores which were, however, closely replicated by y-scrambled models. This immediately casts doubts on the original model's validity. • Conventional ML Models (Random Forest, SVM, etc.): Use these models as a starting point before exploring more complex (deep learning) models. These will usually be easy and quick to implement and give you a sense of whether there is a signal in the data. They will also allow you to meaningfully assess any improvements of more complex models and are typically more robust. • Deep Learning Models: Consider using deep learning models when you have a large, complex dataset and the problem requires advanced feature extraction. Certain types of architectures, such as equivariant neural networks, can be used even if little data is available. • Physics-based Models and Simulations: These models can be used when you have structural information and require a more detailed understanding of the interaction mechanisms. A wide range of tools like Rosetta [148,149], ZDOCK [150], and HDOCK [151] are available and should be chosen depending on the problem [152,153].
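The y-scrambling procedure described above can be sketched as follows; the synthetic data with a known signal, the model choice, and the number of shuffles are all illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # real signal in feature 0

model = RandomForestRegressor(n_estimators=50, random_state=0)
true_score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# y-scrambling: shuffle the labels, retrain, and record the score a few times.
scrambled_scores = []
for _ in range(5):
    y_shuffled = rng.permutation(y)
    scrambled_scores.append(
        cross_val_score(model, X, y_shuffled, cv=5, scoring="r2").mean()
    )
```

A trustworthy model scores well on the original labels and collapses on every scrambled run; comparable scores in both settings indicate that the apparent performance is an artefact.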

Best Practices:
• Combining Models into Ensembles: Ensemble methods can improve model performance by combining the strengths of multiple models. This is particularly the case if you combine different types of models (not the same type with different initialisations), such as physics-based models with conventional ones.
• Common Pitfalls and Best Practices: Always validate your models using appropriate metrics, avoid overfitting, and tune hyperparameters carefully. It is also essential to use a correct data split and avoid data leakage [154]. This should be carefully tested beforehand. Data leakage in particular has become a large problem in computational protein structure prediction [155] and needs to be carefully considered, since sequence similarity alone is not sufficient to define good data splits; protein homology also needs to be taken into account (c.f. section 9). In summary, choosing the right ML model for drug discovery projects involves considering various factors such as data size, model requirements, preprocessing, and best practices. Always start with simple models and move towards more complex models as necessary, keeping in mind the specific goals and characteristics of your drug discovery project.

Evaluation
Model evaluation is a critical aspect of drug discovery projects that ensures the chosen models are effective and accurate [127,145]. This section delves into various components of model evaluation, from validation to data splits and feature analysis.
1. Process Validation vs. Model Validation: In drug discovery, every model validation is a process validation rather than just a model validation [31]. This is because the entire process, including data preprocessing, feature selection, and model training, contributes to the overall performance of the model. Therefore, it is essential to validate and optimise the entire process end-to-end to ensure the best possible results. 2. Benchmarks: Benchmarking against dummy, simple, and state-of-the-art approaches is crucial for evaluating the effectiveness of complex models. By comparing the performance of the proposed model with existing methods, it is possible to demonstrate the superiority of the new model and justify its use in the drug discovery project. 3. Metrics: As discussed earlier, choosing the correct metrics for model evaluation is critical. Ensure that the selected metrics are appropriate for the specific drug discovery problem, the model, and the task, and that they also adequately capture the desired aspects of model performance.
4. Significance testing: An important part of the model evaluation is the comparison between different tested models and whether there is a significant difference in their performance. Since model performance is based on a statistical sample (the data split), evaluation of the performance differences should be performed using statistical methods. Generally, a statistical hypothesis test quantifies how likely it is to observe the outcome given the assumption that the model outputs have the same distribution. If the result of the test suggests that there is insufficient evidence to reject the null hypothesis of no difference between the models, then any observed difference in model skill is likely due to statistical chance. For the comparison of two models, it is recommended to use McNemar's test in cases where there is a limited amount of data and each algorithm can only be evaluated once [156], or a resampling method with 10x10-fold cross-validation with a corrected paired Student's t-test [157,158].
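An exact version of McNemar's test can be implemented directly from the discordant pairs of two classifiers; the predictions below are toy values, and the exact binomial p-value is computed with scipy's `binomtest`.

```python
from scipy.stats import binomtest

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar test on the discordant pairs of two classifiers."""
    # b: model A correct and model B wrong; c: the reverse.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    # Under H0 (equal skill), b follows Binomial(n, 0.5).
    return binomtest(b, n, 0.5).pvalue

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
pred_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # one error
pred_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]  # several errors
p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

Note that with so few discordant pairs the p-value stays well above conventional significance levels, illustrating why small test sets rarely support strong claims of model superiority.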
Comparing multiple models at once requires different tests, such as the Holm-Bonferroni method, a Wilcoxon signed-rank test with adjustment for multiple testing, or the Friedman two-way analysis of variance by ranks (in short, the Friedman test) [159][160][161]. While it is generally recommended to use adjustments of the p-values for comparison of multiple models, there is no consensus as to which method should be used. We refer the reader to [162] for a practical tutorial, and note that looking at the overlap of error bars is insufficient [163,164]. 5. Model Performance vs. Program Impact: While academic and research groups are mainly interested in improving the model quality, in real-world drug discovery scenarios the real impact comes from the improved quality of decision making leading to the generation of better drug candidates.
The performance of a model should hence be evaluated in the context of its impact on the drug discovery program. In many cases, a few percent uplift in model performance is irrelevant, given that the noise in the data is usually high and models simply overfit. Acquiring a small amount of extra, high-quality data will usually result in much better outcomes and should hence be considered as an alternative to working for weeks on new or better models.

6. Drug Discovery-Specific Metrics: There are additional metrics for ML-driven discovery processes. While these are likely less familiar to an ML expert, they may be used to evaluate the model's ability to discover new drugs. These include, for example:
• Enrichment: The fraction of active compounds (e.g. binders) in a selected subset of compounds compared to the fraction of active compounds in a randomly chosen subset [165][166][167]. A related metric is the normalised discounted cumulative gain (NDCG), which indicates whether the top predicted results are enriched for truly high performers [168]. In contrast to the Spearman correlation, NDCG weights the top of the ranked list more heavily.
• Scaling factor: The number of compounds that can be reliably predicted per experimentally tested compound.
• Search efficiency: How quickly a method can discover the top-performing compound in a dataset, usually used in simulations (and retrospective studies).
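A minimal sketch of the enrichment and NDCG calculations described above, in pure Python (the `top_frac` value and the toy scores and labels are illustrative choices, not recommendations):

```python
import math

def enrichment_factor(scores, labels, top_frac=0.2):
    """Active fraction in the top-scored subset divided by the overall
    active fraction; values > 1 mean the model enriches for actives."""
    n = len(scores)
    k = max(1, round(n * top_frac))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_hits = sum(labels[i] for i in order[:k])
    return (top_hits / k) / (sum(labels) / n)

def ndcg(scores, relevances, k=None):
    """Normalised discounted cumulative gain: DCG of the model ranking
    divided by the DCG of the ideal ranking (1.0 = perfect ordering)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order][:k]
    return dcg(ranked) / dcg(sorted(relevances, reverse=True)[:k])

# 10 compounds, 2 binders; the model ranks one binder at the top
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1, 0.05]
ef = enrichment_factor(scores, labels)
```

Here the top 20% contains one of the two binders, giving an enrichment factor of 2.5 over random selection, while the NDCG is penalised because the second binder is ranked in the middle of the list.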
7. Data Splits: Defining data splits that better assess real model performance is crucial for accurate evaluation. Predictive models are not universally applicable and generally perform better when predicting the activity of molecules similar to those in the training set, and worse if the molecules are too dissimilar [169]. In our experience, simple data splits rarely reflect the true prospective model performance in a drug discovery program [170]. For this reason, a set of different data splits with varying degrees of difficulty typically allows one to get a better sense of the true prospective model performance. For small molecules this is done, for example, using random splits, splits by Tanimoto similarity, Bemis-Murcko scaffold splits, and Butina clustering-based splits on extended connectivity or Morgan fingerprints [169]. For proteins, we recommend testing several of the dataset splitting approaches below:
• Types of Data Splits: Various data splits based on sequence identity can be used for protein models [19], such as variation in sequence, mutation location, number of mutations, amino acids, physicochemical properties of the sequence, and property ranges. It is also useful to use the concept of adversarial validation [143] to establish whether training and test data are easy to distinguish and the splits are done well. When mixed data sources are used (e.g. training on public data but predicting internal data), this method is very useful to assess how likely a successful outcome is.
• Approaches: Consider using 80-20 or other fixed splits, cross-validation, time splits (e.g. campaign-based) [171] or simulated time splits [172], or cluster-based splits. In general, a variety of different data splits with increasing difficulty should be chosen. The most difficult splits are, in our experience, often the most representative of real-world performance.
• Of particular importance when splitting the data is to check for any data leakage. This can happen if proteins with high sequence identity or homology are present in the dataset, and it is hence essential to check the similarity between training, test, and validation sets. A typical approach chosen by many researchers is to use a threshold of 25-30% sequence identity between training set proteins and any protein in the test set. This is enough to exclude many homologous pairs of proteins, but it is well known [47,173,174] that some homologous proteins can have virtually no sequence similarity. Such challenges should be addressed as well, by invoking additional tools such as CD-Hit [175], BlastClust [176], or TESE [177]. Such reduction algorithms often rely on local sequence alignments, and multidomain proteins might lose unique domains during the reduction; preprocessing the sequences into domains using tools like PFAM [178] can be an option. However, for antibodies or related scaffolds, sequence identity thresholds of 25-30% need further consideration, as the constant regions of antibodies are highly similar and diversity is mainly limited to the Fv and CDR regions. Custom data splits (as discussed in the next section), taking the program- or project-specific sequence diversity and design into account, are often the methods of choice. Other types of data leakage are of course also possible, for example, the availability of a feature during training that is not available during testing or in the application. We refer the reader to the literature for further examples (e.g. [179]).
• In antibody D&D, one typically engages one of the following data forms: (a) mutational variants, (b) different "wild types" (no mutational relationship; directed against the same or different targets), or (c) a mix of the above. For 'general' models, such as for expression or stability, (c) is the most common case, while (a) and (b) are very common for program-specific predictions, such as function during lead identification (b) or optimisation of lead candidates (a). For data representative of cases (a) and (c), generic splits should be avoided and other approaches should be sought, depending on the application and objective. For example, training and test splits should consider meaningful distributions of mutational positions, amino acid types, and biophysical properties (polar vs. hydrophobic, etc.). For data of forms (a) and (c), data leakage can become a problem and random splits should be avoided, since they will likely result in overestimation of performance and a lack of generalisation ability. Here, one may for example split by sequence families, or cluster the sequences by similarity and avoid overlap between training and test. In general, we always advise understanding the dataset at hand (e.g. the origin and purpose of the sequence designs) and clearly defining similarity and the prediction goals in order to create data splits that are representative of real-world performance.
• A last point is related to the hyper-parameters. No parameter should be selected based on the test data, and this includes hyper-parameters. Examples of hyper-parameters include the number of clusters in K-means, the number of trees in a random forest, the regularisation parameters for SVMs, or the number of layers or neurons in a neural network. A simple approach to guard against overestimation due to the choice of hyper-parameters based on the test set is to introduce a third validation set and select hyper-parameters based on it before evaluating on the test set (ideally using cross-validation).
• Impact of training data composition: There is emerging evidence that the choice of negative data can impact prediction accuracy and generalisation in both antibody [103,180,181] and TCR specificity prediction [182][183][184][185].
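The identity-threshold leakage check described under point 7 can be sketched as follows; this is a deliberate simplification assuming pre-aligned, equal-length sequences (real pipelines would use alignment-based identity via tools like CD-Hit), and the threshold and toy sequences are illustrative:

```python
def pairwise_identity(a, b):
    """Fraction of identical positions between two equal-length,
    pre-aligned sequences (a simplification of alignment-based identity)."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def leaked_test_sequences(train_seqs, test_seqs, threshold=0.3):
    """Return test sequences whose identity to any training sequence
    exceeds the threshold, i.e. potential data leakage."""
    return [t for t in test_seqs
            if any(pairwise_identity(t, s) > threshold for s in train_seqs)]

train = ["ACDEFGHIKL", "MNPQRSTVWY"]
test = ["ACDEFGHIKV",   # 90% identical to the first training sequence
        "GGGGGGGGGG"]   # unrelated
leaks = leaked_test_sequences(train, test)
```

Here the first test sequence would be flagged and should be moved into the training set or removed; for antibodies, as noted above, the comparison would sensibly be restricted to the Fv or CDR regions rather than whole sequences.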
Therefore, care must be taken as to how negative and positive datasets are defined.

8. False-Positive vs. False-Negative Tradeoffs: For each model and metric there is a tradeoff between false positives and false negatives. For each project and problem there will be specific choices, and these need to be informed by the overall objectives. Clearly understanding the tradeoffs enables optimal performance.

9. Feature Analysis and Interpretation of Results: Assessing the importance of features helps in understanding their contribution to the model's performance and in refining the model. For feature significance, one can for example use p-values, or other methods such as Gini impurity for random forest models or SHAP analysis [186].
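A minimal sketch of Gini-based feature importance with a random forest in scikit-learn; the synthetic data, in which only the first feature carries signal, is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Only feature 0 is informative; features 1 and 2 are pure noise.
y = rng.integers(0, 2, n)
X = np.column_stack([
    y + rng.normal(0, 0.3, n),   # informative
    rng.normal(0, 1, n),         # noise
    rng.normal(0, 1, n),         # noise
])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_  # Gini importances, normalised to sum to 1
```

The informative feature should receive by far the highest importance. For real models, SHAP values [186] additionally provide per-sample, signed attributions rather than a single global ranking.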
In conclusion, model evaluation is a multifaceted process that involves validation, benchmarking, metric selection, performance assessment, and feature analysis. By carefully considering each aspect, researchers can ensure that the chosen models are accurate and effective, ultimately leading to more successful drug discovery projects. Without proper testing, there is a large risk that the measured performance of a model is not representative of real-world performance on new or unseen data. In the best case this undermines trust in scientific results; in the worst case it results in substantial costs or harm to patients.

Conclusions
The use of ML in antibody drug discovery holds significant promise, but its real-world impact remains limited. We believe that the next few years will see an unprecedented impact of such methods on current and future drug discovery programs, driven by the availability of more and better data, better methods, and more robust processes. This review establishes a clear set of best practices, primarily focused on robust data generation, capture, and model building, including rigorous model validation to allow the evaluation of real-world prospective performance in therapeutic antibody R&D. By addressing pitfalls and offering clear guidelines, this review bridges the gap between theoretical advances and practical applications in therapeutic antibody engineering. The focus on practical considerations ensures that ML applications not only accelerate the R&D process but also contribute to the development of safer and more effective biotherapeutics. Overall, by adhering to best practices and robust validation approaches, the field can progress to produce higher-quality antibodies, thus offering better therapeutic options and meeting unmet medical needs. Future work should continue to emphasise the importance of robust end-to-end processes, from data generation and storage to model validation and deployment. We hope that the widespread adoption of standardised guidelines will drive the field forward and maximise the benefit to patients. In summary, while ML offers vast opportunities to revolutionise antibody drug discovery, its full potential can only be realised through the adoption of the rigorous guidelines and best practices we propose in this review.

Figure 1:
Figure 1: Overview of the entire machine learning (ML) process for antibody R&D from data collection to model evaluation.

Figure 4:
Figure 4: Key properties that are typically co-optimised to achieve the specified developability attributes of the CDTP.

Figure 5:
Figure 5: A systematic process overview of the different steps required to evaluate ML models in antibody discovery. These steps are generic and apply to most ML processes.

Figure 6.
Figure 6. Data from different controls evaluated in cell-based assays from different campaigns, pre- (left) and post- (right) normalisation, can dramatically change the picture. Here we clearly see two clusters (control-1 and control-2) emerging after normalisation.

Figure 7.
Figure 7. Illustration of the transfection process of 12k multispecific antibodies in 96-well format (149 plates). Variants were transfected over two days (TF1 & TF2) using three cell batches per transfection day (P1-P6). Each plate holds 8 replicates of a transfection control. Upper panels show the arithmetic mean of the expression titers of the transfection control calculated from 8 individual replicates per 96-well plate (green in plate layout, middle panel). As the expression level of the transfection controls varies over the complete transfection process (e.g. due to changes in cell batches), it is required to normalise the transfection titers of the sample molecules (blue in plate layout, middle panel) against the mean transfection levels of the transfection controls on each plate when aiming for an overall comparison of the transfection levels of sample molecules distributed over all 149 sample plates. Normalised expression ratios are shown in the lower panel. Source: [29] (supporting material).
4. Model Types: Depending on the problem, you might want to use a different set of descriptors (e.g. structure, sequence, or physicochemical descriptors). Structure-based models use 3D molecular structures, sequence-based models focus on the amino acid sequences, and descriptor-based models use calculated features to represent molecules. Different descriptors can be used with different models, and in many cases different descriptor-model combinations will result in different performance.
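As a sketch of simple sequence-derived descriptors in pure Python (the chosen properties, the hydrophobic residue set, and the example sequence are illustrative, not a recommended feature set):

```python
def sequence_descriptors(seq):
    """Toy sequence-based descriptors for an antibody region:
    length, fraction of hydrophobic residues, and approximate net
    charge at neutral pH (counting K/R as +1 and D/E as -1)."""
    hydrophobic = set("AVILMFWC")  # illustrative hydrophobic residue set
    positive, negative = set("KR"), set("DE")
    return {
        "length": len(seq),
        "hydrophobic_fraction": sum(a in hydrophobic for a in seq) / len(seq),
        "net_charge": sum(a in positive for a in seq) - sum(a in negative for a in seq),
    }

desc = sequence_descriptors("CARDYGDYFDYW")  # a hypothetical CDR-H3 sequence
```

Such descriptor vectors can then be fed to any descriptor-based model, whereas sequence- or structure-based models would consume the raw sequence or a 3D structure directly.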
Adversarial validation is a technique used to assess the degree of similarity between training and testing datasets in terms of feature distribution. One way to perform adversarial validation is to train a binary classification model to predict whether a given sample belongs to the training or the test dataset. The training and test datasets are combined, and labels are assigned to each sample (0 for training, 1 for test). The performance of this model, typically evaluated using an ROC curve or AUC score, gives an indication of how similar the two datasets are. If the adversarial model performs poorly, this suggests the training and test data are similar, and conventional validation techniques should work well. Conversely, if the adversarial model performs well, this indicates that the training and testing data distributions differ significantly, potentially signalling a risk of overfitting and misleading model validation results. The goal of adversarial validation is to ensure that the model built using the training dataset will generalise well to the unseen test dataset, reducing the risk of overfitting and improving the robustness of model predictions.
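The procedure above can be sketched with scikit-learn; the classifier choice and the synthetic feature matrices are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_auc(X_train, X_test):
    """Train a classifier to distinguish training from test samples;
    an AUC near 0.5 means the two sets are hard to tell apart, while
    an AUC near 1.0 signals a distribution shift."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

rng = np.random.default_rng(0)
similar = adversarial_auc(rng.normal(0, 1, (200, 5)), rng.normal(0, 1, (200, 5)))
shifted = adversarial_auc(rng.normal(0, 1, (200, 5)), rng.normal(2, 1, (200, 5)))
```

Here `similar` should stay close to 0.5 while `shifted` approaches 1.0, flagging that a model trained on the first set cannot be expected to generalise to the second.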