Prediction of HIV-1 protease resistance using genotypic, phenotypic, and molecular information with artificial neural networks

Drug resistance is a primary barrier to effective treatments of HIV/AIDS. Calculating quantitative relations between genotype and phenotype observations for each inhibitor with cell-based assays requires time and money-consuming experiments. Machine learning models are good options for tackling these problems by generalizing the available data with suitable linear or nonlinear mappings. The main aim of this study is to construct drug isolate fold (DIF) change-based artificial neural network (ANN) models for estimating the resistance potential of molecules inhibiting the HIV-1 protease (PR) enzyme. Throughout the study, seven of eight protease inhibitors (PIs) have been included in the training set and the remaining ones in the test set. We have obtained 11,803 genotype-phenotype data points for eight PIs from Stanford HIV drug resistance database. Using the leave-one-out (LVO) procedure, eight ANN models have been produced to measure the learning capacity of models from the descriptors of the inhibitors. Mean R2 value of eight ANN models for unseen inhibitors is 0.716, and the 95% confidence interval (CI) is [0.592–0.840]. Predicting the fold change resistance for hundreds of isolates allowed a robust comparison of drug pairs. These eight models have predicted the drug resistance tendencies of each inhibitor pair with the mean 2D correlation coefficient of 0.933 and 95% CI [0.930–0.938]. A classification problem has been created to predict the ordered relationship of the PIs, and the mean accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) values are calculated as 0.954, 0.791, 0.791, and 0.688, respectively. Furthermore, we have created an external test dataset consisting of 51 unique known HIV-1 PR inhibitors and 87 genotype-phenotype relations. Our developed ANN model has accuracy and area under the curve (AUC) values of 0.749 and 0.818 to predict the ordered relationships of molecules on the same strain for the external dataset. The currently derived ANN models can accurately predict the drug resistance tendencies of PI pairs. This observation could help test new inhibitors with various isolates.


INTRODUCTION
Acquired immunodeficiency syndrome (AIDS) disease caused by the human immunodeficiency viruses, HIV-1 and HIV-2, began to spread in the 1970s and came into focus in the early 1980s as one of the most severe public health threats in history (Sharp & Hahn, 2011). Detection of reverse transcription activity in cultures of lymph node cells from AIDS patients in the early 1980s revealed that AIDS was caused by a retrovirus later called human immunodeficiency virus (HIV) (Das & Arnold, 2013). Zidovudine (AZT), the first nucleotide reverse transcriptase inhibitor (NRTI) that inhibits the reverse transcription enzyme of HIV, was approved in 1987, and today there are nearly thirty approved drugs (Lu et al., 2018). HIV-1 has affected approximately 38 million people today, and just about 26 million people are receiving ''highly active antiretroviral treatment'' (HAART) (Jespersen et al., 2021). The HAART therapy proposed in the mid-1990s was defined as the procedure of using three or four different drugs that act on various targets in the virus's life cycle (Palmisano & Vella, 2011). With HAART therapy, the death rate fell to 47% in 1997, just ten years after the first AIDS case was detected (World Heath Organization).
HIV protease enzyme plays a vital role in forming infectious viruses by regulating immature viruses' synthesized gag and gag-pol polyproteins (Zhang, Kaplan & Tropsha, 2008). Protease inhibitors are generally included in the scope of HAART therapy, and eight approved drug molecules are used effectively today (World Heath Organization). Doseresponse curves of protease inhibitors were shown that they have higher Hill coefficient values than the fusion (FI), integrase (II), nucleoside reverse transcriptase (NRTI), and non-nucleotide reverse transcriptase (NNRTI) inhibitors (Rosenbloom, Hill & Rabi, 2012). Even if a person is infected with the wild-type virion, resistant variants may emerge with dosing disruptions or the use of inappropriate combinations in the HAART therapy (Jilek et al., 2011). The success rate of HAART therapy can be increased by measuring the efficacy of existing and novel inhibitors over resistant genotypes (Xing et al., 2013;Lima et al., 2008). Observing drug-efficacy relations with cell-based assays is expensive and time-consuming in the presence of genotype information. Mathematical models are essential to tackle this important problem (Wei et al., 2015;Yu, Weber & Harrison, 2014;Hosseini, Alibés & Noguera-Julian, 2016).
Here, a machine learning model was constructed that simultaneously takes molecular fingerprints and mutational information jointly as inputs to estimate the fold change values. For training and testing sets, we used data from eight approved protease inhibitors atazanavir (AZT), darunavir (DRV), fosamprenavir (FPV), indinavir (IDV), lopinavir (LPV), nelfinavir (NFV), saquinavir (SAV) and tipranavir (TPV) in the Stanford HIV drug resistance database. By imposing leave one out (LVO) test procedure, our drug-isolate-fold (DIF) change-based artificial neural network (ANN) models are seen to have the ability to learn both from inhibitor descriptors and mutational genotype information to predict fold-change values. The model can predict the fold change of hundreds of isolates. To that end, the learned hundreds of predictors(fold-change of isolates) can be successfully used to assess the resistance potential of inhibitors. We used pairs of drugs to predict the more resistance-prone molecule. We called these pairwise comparisons the resistance tendencies. Our DIF-based ANN models predicted each protease inhibitor (PI) pair's drug resistance tendencies accurately, and these quantitative results support our central arguments.

Representation of isolates
Four hundred ninety-eight unique mutations were observed in the eight protease inhibitors dataset. The binary barcoding technique was applied here to represent the isolates that occurred in the dataset, as also used in several studies of modeling genotype-phenotype data for various HIV-1 inhibitors (Lu et al., 2018). Thus, a 498-dimensional vector of binary entries with 0s and 1s uniquely representing any existing isolates is considered. Assume that the 498 unique mutations produce the vector X = [x 1 ,x 2 ,...,x 498 ] where x i is a mutation pattern that occurred in the dataset. For instance, x 1 denotes the occurrence of the mutation A22S or x 478 denotes the occurrence of the mutation V82A. For example, the isolate I j = [A22S,V82A] can be barcoded as X = [1,0,...,0,1,0,...,0] in which only the first and four hundred seventy-eighth position take value one, and the remaining entries have value zero. Any isolate can be obtained from any combination of these mutations, and the isolate j can be defined as I j = {a 1 ,a 2 ,...,a n } with a k = 1, if x k ∈ I j 0, otherwise. In this way, each isolate can be transformed into a unique 498-dimensional input vector used for the machine learning. The binary barcoding approach has the advantage of representing two or more mutational changes in the amino acids since each mutation has a unique position in the 498-dimensional input vector. For example, assuming that x 480 and x 481 denotes mutations V82F and V82I, the isolate I = [V82F,V82I] can be represented as X = [0,0,...,0,1,1,0,...,0]. It is important to note that, the genotype-fold change measurements are made on population of viruses, so that it is possible to find two separate mutations for a single residue.

Representation of inhibitors
To construct a drug-isolate-fold change model for the HIV-1 protease inhibitors, the molecular representations of the inhibitors have been built with binary Morgan fingerprints. The Morgan fingerprints provide an effective way of the vector representations of molecules and are widely used in machine learning models (Jespersen et al., 2021). The RDKit environment of the Python program has been used to convert the smile representations of ATV, DRV, FPV, IDV, LPV, NFV, SQV, and TPV inhibitors to a binary 512-bit vector representation. 234 out of 512 bits have been seen to provide unique characteristics for 8 PI. Thus, the molecular representation of each PI needs 234-dimensional vectors.

Artificial neural network (ANN) model for regression
An ANN model has been constructed with isolate-inhibitor inputs and fold change outputs with the machine learning and deep learning toolbox of the MATLAB program. Since isolates and inhibitors are uniquely represented by 498-and 234-dimensional vectors, the ANN model has 732-dimensional input. The ANN architecture includes 732-dimensional input, five hidden layer neurons, and one output neuron with hyperbolic tangent-sigmoid and linear activation functions. Logarithms of fold-change values in the dataset are taken as output variables of the neural network models. In the training process, the scaled conjugate gradient algorithm with MATLAB built-in function ''trainscg'' is utilized over GPU (Palmisano & Vella, 2011).

Ensemble processing
Since we have only eight inhibitors, measuring the molecular learning capacity of our ANN model is crucial. In this way, an ensemble learning procedure is used to improve the molecular learning performance of the model. For each PI, the 100 × 50 model has been trained with the data of the remaining seven inhibitors. From every 50 models, a model is chosen that yields the minimum mean square error for the interior test set of the corresponding PI data. Thus, 100 optimal models are obtained, and the final model is calculated as the average of these models.

Regression performance of molecular learning models
Eight feed-forward neural network models have been constructed with drug-isolate-foldchange (DIF) data by excluding one of the drugs from training in each case. The ANN model was trained with the remaining seven DIF data predicted the excluded results. The sizes of the training and test sets were changed according to the excluded PIs (mean values are 10,328 and 1,475 for training and test sets, respectively). The regression performances of each model are illustrated in Fig. 1 with corresponding R 2 values (square of the linear correlation coefficient). The best and worst results are obtained by predicting the outcomes of the drugs LPV and TPV with R 2 = 0.822 and R 2 = 0.359, respectively. Similarly, predicting the fold-change results of the inhibitor TPV was observed to be the worst in the literature (Yu, Weber & Harrison, 2014). The mean R 2 value of all predictions is 0.716 and the 95% confidence interval is [0.592-0.840]. The DIF-based ANN model provides accurate estimations even if the test data consists of unseen drugs. This observation implies that our ANN models accurately learn molecular information from the Morgan fingerprints. The detailed performance results of our DIF-based ANN models are presented in Table 1.
An inevitable question is how molecular information changes the regression and classification performance of our ANN models. To clarify this, we trained isolate-foldchange (IF) based ANN models for each inhibitor and compared them with the current DIF-based models. In Figs. S1-S2, the model performances have been compared by measuring R 2 and area under the curve (AUC) values for each PI. Table S1 also shows the accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) scores of both DIF-based and IF-based models. These findings suggest that the DIF-based ANN model performs slightly better in terms of regression and classification for six of the eight PIs. The two models are compatible with the remaining two inhibitors (DRV and TPV). Therefore, the molecular information used by the DIF-based model provides a better predictive capability. It is very important to note that IF-based models can never predict fold-change values for novel molecules. The real distinction between DIF and IF-based models is that the DIF-based model can predict fold-change values for a novel drug. Hence, the DIF-based ANN model has both better learning capability from mutant information (by comparing to the IF-based model) and the ability to test novel molecules for a given isolate.

Prediction of drug resistance tendencies for each PI pair
The inhibition potential of each PI in the presence of various genotypes is known to be variable. Tendencies of the logarithmic fold change values for each PI pair provide valuable information about the resistance profiles of the inhibitors, as seen in Fig. 2. Prediction of these tendencies by the DIF-based ANN models and the corresponding 2D correlation coefficients are presented in Fig. 2   , respectively. In this context, the current DIF-based ANN models have ability to capture the binary relations between any PI pair with high approximation performance.
Performance metrics of the current ANN models for capturing binary relations of PI pairs are presented in  Testing the molecular learning model with the external data set For the external dataset, we conducted a search for compounds that were comparable to the eight HIV PIs that were already available in the ChEMBL database (Sevenich et al., 2017). An initial set of 1,305 compounds and their biological activity values were extracted. First, compounds having 70% or less similarity to each drug were filtered out. Then, molecules with determined IC 50 values in mutant viruses were collected. The compounds with determined IC 50 values in mutant viruses were then gathered. Furthermore, the maximum Tanimoto similarities of these molecules to the current eight PIs were computed (using 512-bit Morgan fingerprints), and molecules with 40% or less structural similarity were filtered out. Finally, molecules with 1s in the discarded 278-dimensional fingerprint vectors, that is, molecules with information loss, were filtered out. A final dataset of 87 genotype-phenotype relationships involving 51 different molecules was obtained (see Data S1). Only six of these molecules are in the existing PI set (DRV, IDV, NFV, ATV, LPV) and only 20 of 87 genotype-phenotype relations belong to these six PIs. In the presence of eight distinct strains, there is a fold change difference in IC 50 values for these molecules. Fortunately, the data includes different molecules tested on the same protein, so we can evaluate the efficiency of our ANN model on ranking of these molecules. We have used the eight existing PIs (see Fig. 4 for chemical structures) to construct an ANN model as described in the Methods and Material section. The main difference is the consideration of all molecules rather than the LVO procedure. The Morgan fingerprints of the new 51 molecules were determined, and then the similar mapping was used to reduce the 512-dimensional vectors into 234-dimensional inputs. It should be noted that the molecules were chosen in such a way that there are no 1s in the discarded 278 bits. This   condition was specifically designed into the external dataset. In this approach, we ensure that no meaningful information for novel molecules is lost during the machine learning process. Two different classification performances of our ANN model were observed to test its molecular learning potential. The first is the classification of resistant and susceptible strains for a given PI, with a threshold fold chance value of 3 (Shen et al., 2016). The goal of this classification task is to evaluate the resistance labeling performance of the current model for diverse molecules. As demonstrated in Table 3, the accuracy, sensitivity, specificity, MCC, and AUC metrics of our model for labeling the resistance in external data are 0.678, 0.536, Table 3 Classification performance of the ANN model on the external test data. In resistance classification procedure, the ANN model is used to classify the genotypes as drug resistant (Fold Change ≥3) and susceptible (Fold Change <3) for each external test data. On the other hand, for ranking classification procedure, the ANN ranks the drugs for a given genotype in terms of their logarithmic fold change values. Our major point is that the built ANN model is capable of ranking the efficiency of PIs for a given strain. The external dataset contains 853 pair of resistance scores for 51 different molecules. Our ANN model predicted these 853 pair of resistance scores as well, and we measured the ranking performance of the model. This testing approach is identical to the previously described tendency and ranking assessments of the eight existing PIs. For this classification task, the accuracy, sensitivity, specificity, MCC, and AUC scores have been 0.749, 0.767, 0.730, 0.497, and 0.819, respectively (see Table 3). As a result, our ANN model appropriately ranks the compounds based on their resistance profiles (see Fig. 5 for a representative example). The ROCs for both resistance classification and ranking classification performances of the ANN model can be seen in Fig. 6. The ANN model learned considerable information from the molecular structure of the eight PIs, according to the ROCs. In summary, if the external PIs have no information loss (no 1s in the discarded 278 bits within 512 bits) in comparison to the existing eight PIs, our ANN model has a high ability to rank these PIs.

Model / Metric
One method for reducing and visualizing high-dimensional data is principal component analysis (PCA). It is widely used to describe the chemical space occupied by a set of Figure 6 The ROCs corresponding to resistance and ranking classifications for the test data. The DIFbased ANN model has been used to (A) classify the resistance and susceptible strains (B) classify the rankings of 853 pair of resistance scores, for various molecules existing our external data. The AUC ratings associated with the ROC curves measure how threshold probabilities affect FPR-TPR pairs in classification tasks.
Full-size DOI: 10.7717/peerj.14987/ fig-6 molecules. When the descriptors of the compounds are used for analysis, it allows for the clustering of similar molecules as well as the distinguishing of diverse molecules. To verify the applicability of our model in terms of chemical similarity and diversity, we have performed the PCA on the external set molecules using 234-dimensional vectors employed for fold-change prediction. We only evaluated the unique molecules in the external set, and for molecules that were tested in several strains and had multiple fold-change prediction values, we have taken the values with the largest absolute prediction error (AE). As the first two components, PC1 and PC2, were able to represent more than half of the variance in the data and accurately depict the correlations between the similarities of the molecules, 234-dimensional vectors for each unique molecule were projected to 2D-space. Figure 7 displays the PCA plot for both training and external set compounds and colored according to the AE for fold-change prediction or training compound name. According to the figure, other training compounds other than the IDV were not involved in cluster formation with external set molecules. Though the same external set molecules were not forming clusters, still there were three clusters that contain three or more molecules. For each of these cluster, we have selected one representative structure and additionally we have found the maximum common substructure (MCS) using the FMCS algorithm implemented in RDKit. We discovered that for all compounds in the cluster with the representative structure A in Fig. 7, the fold-changes were predicted with low errors (AE <1.0). The compounds in this cluster have certain substructures with HIV inhibitors, such as a sulfonamide group, a bis-THF alcohol moiety, and a benzodioxole group (Sevenich et al., 2017). The only structural difference in this cluster of compounds was observed on the side chain connecting to the phenoxy group. Our model, on the other hand, did not perform well for another cluster of compounds represented by structure B in Fig.  7. Despite the fact that the MCS for three compounds in this cluster were similar to the preceding cluster, represented by structure A, there were subtle differences that caused compounds to be in distinct clusters. For these, the cyclohexyl hydroxy or cyclopentyl hydroxy group was attached to the N of the sulfonamide group. Furthermore, the varying side group of the phenoxy group in the first cluster was shorter in the second cluster since there was only phenyl group. Compounds having structure C are represented in another cluster, as seen in Fig. 7. These compounds had a high resemblance to the IDV (see Fig.  4 for a 2D representation of the IDV), and the fold-change was predicted with minimal errors (AE <1.0). These compounds, like the IDV, have substructure groups such as piperazinecarboxamide, indenol derivative, and phenyl group. However, the compounds in this cluster have different attachment groups connected to piperazine ring instead of pyridine group in the IDV. We have also discovered that for the compound where the fold change estimation error above the threshold (AE ≥ 1.0), the exact value was 1.01, indicating that it is negligible. It should be noted, however, that this compound contains a Fluorine substituent, which is known to cause a significant increase in the inhibitory activity of compounds (Amano et al., 2022) and it may not be easily learnt using our model. This chemical similarity analysis revealed that our model can predict the fold change for similar compounds with low error, despite the fact that some of the external set molecules based on our descriptors were not similar (or in close proximity in the PCA plot in Fig. 7) to internal set molecules.

DISCUSSION
This article presents a machine learning approach for predicting fold-change values using HIV-1 protease inhibitor and isolate characteristics. The filtered PhenoSense assay results made accessible in the Stanford HIV drug resistance database have been used training and testing machine learning models. Seven of the eight inhibitors have been used to train drug-isolate-fold change-based feed-forward artificial neural networks, with the remaining one serving as test data (LVO). In this context, the LVO procedure produces an objective testing approach for determining the learning capacity of models from the descriptors of the inhibitors. Both inhibitors and isolates have been encoded using binary mappings, which have been shown to be computationally effective. Because of their acknowledged advantages in molecular machine learning models, the Morgan fingerprints have been exploited as binary mappings of protease inhibitors (Cai et al., 2021;Beerenwinkel et al., 2003;Beerenwinkel et al., 2002). An efficient ensemble process has been proposed and verified through various quantitative experiments to handle the overfitting trouble.
The most significant contribution of this research is the construction of drug-isolate-fold change (DIF)-based ANN models, as opposed to the widely studied isolate-fold change (IF)-based models (Amamuddy, Bishop & Bishop, 2017;Amamuddy, Bishop & Bishop, 2018;Wang & Larder, 2003;Drăghici & Potter, 2002;Kjaer et al., 2008;Steiner, Gibson & Crandall, 2020;Wang et al., 2009;Shen et al., 2016;Shah et al., 2020;Tarasova et al., 2020;Tarasova et al., 2018;Ota et al., 2021). Because the IF models do not take the molecular fingerprints as input, they are insensitive to molecular structure. This study shows the possibility of achieving such a generalized model by feeding models with adequate data from various PIs in the presence of isolates. With the use of the LVO procedure throughout our investigation, the current DIF-based models have been shown to be capable of predicting the drug resistance profiles of unknown inhibitors. Even though the Stanford HIV database only has eight available inhibitors, having many isolates for each inhibitor has contributed to the learning process, and reasonable predictions have been found in the regression performance of remaining inhibitors.
The prediction of drug resistance tendencies for each PI pair is an unavoidable expectation from our DIF-based ANN models. Our generalized models can predict resistance trends with high 2D correlation scores, as demonstrated here. The DIF-based models have provided satisfactory accuracy, sensitivity, specificity, and MCC values by creating classification problems from the tendency relations of each PI pair. The DIF-based ANN model utilizes valuable information from the Morgan fingerprints to predict the fold change values of hidden inhibitors, according to our all-quantitative observations.
We have shown that our ANN model can categorize resistant and susceptible strains as well as rank inhibitors based on resistance profiles for an external dataset. The external dataset is designed in such a way that the unique molecules are comparable enough to any of the primary eight PI and there is no bit loss in the reduction procedure of the Morgan fingerprints from 512 to 234 dimensions. Thus, whenever a molecule satisfies these conditions, our DIF-based model has ability to compare this molecule with existing PIs in terms of their resistance scores.
Instead of building independent separate models for each inhibitor, this study offers a fresh viewpoint on the field by incorporating inhibitor characteristics on the input side of machine learning models. The most obvious drawback of our model is the dearth of protease inhibitors with sufficient genotype-phenotype information. Nevertheless, our encouraging findings have demonstrated that including genotype-phenotype information of novel protease inhibitors will help build more generic drug-isolate-fold change-based machine learning models. Additionally, feeding the DIF-based models with data from many conventional and nonconventional inhibitors may result in a unified model for forecasting drug resistance tendencies for any PI pair in the presence of known genotypes. The drug development process for evolvable diseases, such as HIV, bacterial infections, and cancer should be fundamentally different from diseases such as blood-pressure regulators. A drug needs to be effective and stay effective through the test of evolution. Predicting resistance potentials for drugs is becoming a necessity.

CONCLUSIONS
This study has revealed the advantages of developing DIF-based models to predict drug resistance profiles. Instead of IF-based models, the current approach has allowed us to investigate a new model that can predict the drug resistance tendencies of PI pairs. Even with only eight PIs available, internal and external test results show that the DIF-based model takes significant information from inhibitor descriptors and leads to satisfactory regression performance. As a result, after finishing this study, it is highlighted on the research agenda to train ANN models with more inhibitors by expanding the existing dataset. In this context, it will be feasible to track the drug resistance profiles of any novel protease inhibitor, and it is strongly believed that these insightful forecasts will be a right direction for moving forward.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by TUBITAK, 2232-International Fellowship for Outstanding Researchers, Project number 118C244. All the results are in sole responsibility of the authors. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: TUBITAK, 2232-International Fellowship for Outstanding Researchers: 118C244.