Utility of a machine-guided tool for assessing risk behaviour associated with contracting HIV in three sites in South Africa

Introduction: Digital data collection and the associated mobile health technologies have allowed for the recent exploration of artificial intelligence as a tool for combatting the HIV epidemic. Machine learning has been found to be useful both in HIV risk prediction and as a decision support tool for guiding pre-exposure prophylaxis (PrEP) treatment. This paper reports data from two sequential studies evaluating the viability of using machine learning to predict the susceptibility of adults to HIV infection using responses from a digital survey deployed in a high burden, low-resource setting.
Methods: A total of 1036 and 593 participants were recruited across the two trials, respectively. The first trial was a cross-sectional study in one location and the second trial was a cohort study across three trial sites. The data from the studies were merged, partitioned using standard techniques, and then used to train and evaluate multiple machine learning models and to select and evaluate a final model. Variable importance estimates were calculated using the PIMP and SHAP methodologies.
Results: Characteristics associated with HIV were consistent across both studies. Overall, HIV positive patients had a higher median age (34 [IQR 29–39] vs 26 [IQR 22–33], p < 0.001), and were more likely to be female (155/703 [22%] vs 107/927 [12%], p < 0.001). HIV positive participants were also more likely to have gone a year or more since their last HIV test (183/262 [70%] vs 540/1368 [39%], p < 0.001) and were less likely to report consistent condom usage (113/262 [43%] vs 758/1368 [55%], p < 0.001). Patients who reported TB symptoms were more likely to be HIV positive. The trained models had AUROC values ranging from 78.5% to 82.8%. A boosted tree model performed best with a sensitivity of 84% (95% CI 72–92), specificity of 71% (95% CI 67–76), and a negative predictive value of 95% (95% CI 93–96) in a hold-out dataset. Age, duration since last HIV test, and number of male sexual partners were consistently three of the four most important variables across both variable importance estimates.
Conclusions: This study has highlighted the synergies present between mobile health and machine learning in HIV. It has been demonstrated that a viable ML model can be built using digital survey data from a low-middle income setting with potential utility in directing health resources.


Introduction
Ending the HIV epidemic has been the focus of global health efforts for the better part of the last two decades, and the UNAIDS "Fast-track" targets of 95%-95%-95% for HIV testing, treatment, and viral suppression are generally accepted as the foundation for ending the HIV epidemic by 2030 [1,2]. Prioritising geographical locations and populations which are lagging is fundamental to achieving these targets [3]. In addition, identifying groups at high risk of HIV acquisition not only allows for improved testing strategies but also facilitates the implementation of effective preventative strategies such as Pre-Exposure Prophylaxis (PrEP), which has been shown to be up to 100% effective in preventing HIV transmission [4][5][6][7][8].
There has been significant effort to leverage digital technology, including strategies such as rapid diagnostic self-tests, and mobile health (mHealth) and electronic health (eHealth) technologies, to help combat the HIV epidemic [9,10]. More recent strategies have coupled such approaches with advanced data science techniques, including machine learning (ML) and artificial intelligence (AI), to augment the digital tools [11]. Such strategies have found particular relevance in quantifying or predicting the risk of acquiring an illness [12,13]. In HIV, ML modelling and AI algorithms have been shown to be very effective in classifying and quantifying HIV risk across both high-income and low-middle income (LMIC) settings [14][15][16][17][18]. Such techniques have also found potential utility in decision making around PrEP [19,20], which may be of particular relevance in more resource limited environments [21]. These studies are typically based on secondary data analysis, often using large electronic health record datasets. However, this limits their ability to assess the utility of such techniques when using data collected via mHealth methods that are more appropriate to LMICs. This paper reports data from two sequential studies undertaken by the authors to evaluate the viability of using ML to predict the susceptibility of adults to HIV infection using responses from a digital survey. Four broad categories of input variables were evaluated: demographics, lifestyle, sexual behaviour, and symptoms, each of which comprised its own set of questions.

Objectives
To evaluate the accuracy of an ML-based risk assessment tool, trained using data collected from a digital survey, in assessing HIV risk in those believed to be negative or unaware of their HIV status.

Study design
Two separate studies were conducted. Trial one was a cross-sectional study conducted in an urban setting in Johannesburg to evaluate the correlation between self-reported socio-demographic and behavioural risk factors and HIV status among adults believed to be HIV negative. Some of the initial exploratory data analysis of trial one is reported in the protocol for trial two published in 2021 [22]. Trial two was a longitudinal study in which data were collected in two phases, with phase one being the first visit and phase two a follow-up three months later.
On recruitment for both trials, participants were educated on the reason for the study and what would be required of them. Thereafter, consenting participants responded to a chain of behavioural questions on a digital application platform independent of assistance from study staff. Exceptions were only made on solicitation by the participant. Some of the questions and/or fields present in the first trial were consolidated or excluded in the second trial to make a more concise and accurate screening tool.
Fig. 1. Flowchart of the study process [22]. Abbreviations: RDT, rapid diagnostic test; ART, antiretroviral therapy.
Ground truth was established using two RDT HIV tests (First Response HIV-1-2-0 [Premier Medical Corporation Ltd., Kachigam, India] and Alere Determine HIV1/2 [Alere Medical Co. Ltd, Matsudo, Japan]) performed by a trained nurse/counsellor. Patients who tested HIV positive were advised to access medical intervention at a facility of their choosing. In the event of discordant results, the RDT procedure was repeated. In the second trial, patients who tested HIV negative were invited to present themselves for a second visit three months after the first.
At the end, participants filled out a user experience form after receiving their HIV test results. In the second trial, negative patients were furnished with an appointment card, and then sent monthly text message reminders to present themselves at the second visit. On attending the second visit, the process was repeated. On culmination of this visit, participants who tested positive for HIV were referred for clinical intervention and those who tested negative were advised to maintain regular screening (see Fig. 1).

Study population
Male and female adults from rural, semi-urban and urban populations who were either self-declared HIV negative or of uncertain HIV status were recruited into the study.

Study sites
Study participants were recruited from communities across three provinces. Trial one was conducted in Johannesburg, Gauteng (urban) while trial two was conducted in Tshwane, Gauteng (urban); Gert Sibande, Mpumalanga (semi-urban); and Ugu, Kwa-Zulu Natal (rural).

Sampling and sample size
Non-probability sampling was employed for both trials. Convenience sampling was used for trial one and a combination of convenience and snowball sampling was used for trial two. Phase one was a cross-sectional study and enrolled all consenting patients presenting to the study site. The sample size calculation for phase two was informed by findings from Figueroa et al. [23]. Considering this, a sample size of 600 was estimated with an attempt made to acquire a sample that comprised equal numbers of males and females.

Inclusion
The inclusion criteria included adult candidates who could speak and read English as well as demonstrate comprehension of the written informed consent form. In addition, participants had to have a negative or unknown HIV status, be willing to detail their accurate medical history, and supply specimens for two finger-prick blood tests.

Exclusion
The exclusion criteria included known HIV positive status, currently receiving PrEP or antiretroviral treatment, or having received any experimental HIV vaccine. Candidates who, in the estimation of the facilitator, displayed inability to perform the study processes (e.g., acutely ill, under the influence of substances) were also excluded.

Data management
HSTAR staff were responsible for all data management and quality control of both electronic-based and paper-based data. Paper-based data were entered into an electronic database by a data-capturer within two days of having received it from research personnel.

Data cleaning
Data were initially extracted from the relational database used to store responses to the mobile client, and then reformulated to provide human-readable, single row views of each respondent with the associated ground truth test being linked via the unique study identifier. Data were partitioned according to the phase of collection with a single dataset for the first trial and two datasets (initial enrolment and three-month follow-up) for the second trial.

Merging of trial data
Collated data were produced by merging the initial presentation datasets from both trials utilising fields common to both. This was possible as the fields collected during the first trial formed a superset of the fields present in the second trial. Fields were merged using human-readable response values. Careful attention was paid to ensuring consistency between the sets so as not to introduce errors secondary to different variable encoding schemas.

Descriptive statistics
Data distributions were assessed using density plots and descriptive statistics presented as per data normality. Hypothesis testing was performed using Chi Squared tests for categorical variables, ANOVA for normal continuous variables, Kruskal-Wallis for non-normal continuous variables, and Fisher's exact tests for categorical variables with small cells (n ≤ 5). Statistical significance was considered below an α value of 0.05 and descriptive statistics were performed in RStudio [24].
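As an illustration of the categorical hypothesis testing described above, the following minimal Python sketch applies a Pearson chi-squared test to the consistent condom use counts reported in this paper (113/262 HIV positive vs 758/1368 HIV negative participants). It is not the authors' R code: the function name `chi_squared_2x2` is hypothetical, and the sketch assumes no continuity correction, which may differ from the settings used in the original analysis.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Rows: HIV positive, HIV negative. Columns: consistent condom use, other.
statistic, p_value = chi_squared_2x2(113, 262 - 113, 758, 1368 - 758)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.5f}")
```

Under these assumptions the p-value falls below 0.001, in line with the value reported for this comparison.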

Data partitioning and model building
Baseline exploratory data analysis was conducted utilising conventional univariate and multivariable modelling as well as visualisation of the distribution of variables. Subsequent predictive modelling utilised several data partitions. Initial ML model selection employed only data from the first trial with the set randomly partitioned via a 70:30 split into training and test data. Training data were then split repeatedly via 10-fold cross validation for model parameter selection via the R caret library [25]. Area under the receiver operator curve (AUROC) was used as the performance metric for parameter selection with 95% confidence intervals calculated for each via bootstrapping. The final chosen parameters were thereafter fitted on the full training set and finally evaluated on the holdout test set. A variety of models were evaluated including standard logistic Generalised Linear Models (GLMs), Bayesian GLMs, Lasso Regression (LR), Support Vector Machines (SVMs), Decision Trees, and Gradient Boosted Decision Trees [26]. All implementations were derived from the caret package with a fixed data split used across all models. All models utilised only features available in both trial datasets. These overlapping features were encoded in the same manner across all analyses to ensure interoperability between models.
The generalisability of models trained on data from trial one was evaluated using the second trial's data as an external test set. This was then repeated with the second phase of trial two to evaluate the prospective predictive ability of the model in stratifying individuals at risk of seroconversion.
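The partitioning and evaluation steps above (a 70:30 split, 10-fold cross validation on the training portion, AUROC as the metric, and bootstrapped 95% confidence intervals) can be sketched in plain Python as follows. This is a simplified illustration rather than the caret workflow used in the study: the data are synthetic, all function names are hypothetical, and a fixed scoring rule (`score`) stands in for a fitted model, so no hyperparameters are actually tuned.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def split_70_30(n_rows, rng):
    """Random 70:30 partition of row indices into training and test sets."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = round(0.7 * n_rows)
    return indices[:cut], indices[cut:]

def k_folds(indices, k=10):
    """Yield (fit_indices, validation_indices) pairs for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, held_out

def bootstrap_ci(labels, scores, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap confidence interval for the AUROC."""
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < n:  # a resample needs both classes
            stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Synthetic data (assumption): older respondents are more often positive.
rng = random.Random(0)
labels = [1 if rng.random() < 0.17 else 0 for _ in range(400)]
ages = [rng.gauss(34, 7) if y else rng.gauss(26, 7) for y in labels]

def score(age):
    """Stand-in for a fitted model's risk score (assumption)."""
    return age

train_idx, test_idx = split_70_30(len(labels), rng)

# Cross validation on the training split; a real workflow would fit on fit_idx.
fold_aucs = []
for fit_idx, val_idx in k_folds(train_idx, k=10):
    ys = [labels[i] for i in val_idx]
    if 0 < sum(ys) < len(ys):
        fold_aucs.append(auroc(ys, [score(ages[i]) for i in val_idx]))
cv_auc = sum(fold_aucs) / len(fold_aucs)

# Final evaluation on the untouched 30% hold-out set.
test_labels = [labels[i] for i in test_idx]
test_scores = [score(ages[i]) for i in test_idx]
test_auc = auroc(test_labels, test_scores)
ci_low, ci_high = bootstrap_ci(test_labels, test_scores, rng)
print(f"CV AUROC {cv_auc:.3f}; test AUROC {test_auc:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

Keeping the test indices out of the cross validation loop is what allows the hold-out AUROC to serve as an estimate of performance on unseen respondents.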

Variable importance
Variable importance was assessed in two ways. The first method used an implementation of the Permutation Importance (PIMP) algorithm [27]. This method employs the final trained model and shuffles individual predictor columns (or groups thereof) and determines the reduction in performance, in this case AUROC, across several replications. This change in performance is compared against a null distribution wherein both a predictor and the outcome columns are shuffled. This produces two distributions with the distance between them indicating the scale of performance difference obtained with inclusion of the given variable. The shuffling operation was repeated a thousand times to produce each distribution. This distribution distance is quantified and compared by means of a t-statistic which is subsequently utilised to produce a set of variables ordered by importance to model predictive performance. In addition, the use of an outcome-linked variable importance methodology was chosen to provide insight into a variable's utility in discriminating between risk groups as opposed to simply attributing predictions to features. To this end, SHapley Additive exPlanations (SHAP) [28] were used to decompose model predictions on aggregate- and individual-level outputs for the purpose of aiding in model explainability at both levels.
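The permutation procedure described above can be sketched as follows. This is one possible reading of the method, not the implementation used in the study: the data are synthetic, the scoring rule `model_score` is a hypothetical stand-in for the trained model, the distributions are compared with a Welch t-statistic (an assumption), and 50 replications are used instead of a thousand to keep the example fast.

```python
import math
import random
import statistics

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def model_score(row):
    """Stand-in for the trained model (assumption): age dominates, noise barely matters."""
    return 0.1 * row["age"] + 0.01 * row["noise"]

def permuted_drop(rows, labels, feature, rng, shuffle_outcome):
    """AUROC lost when one predictor column is shuffled; optionally shuffle the outcome too."""
    ys = labels[:]
    if shuffle_outcome:
        rng.shuffle(ys)  # null setting: the outcome carries no signal
    baseline = auroc(ys, [model_score(r) for r in rows])
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
    return baseline - auroc(ys, [model_score(r) for r in permuted])

def welch_t(a, b):
    """Welch t-statistic quantifying the distance between two distributions."""
    spread = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return 0.0 if spread == 0 else (statistics.mean(a) - statistics.mean(b)) / spread

rng = random.Random(1)
labels = [1 if rng.random() < 0.17 else 0 for _ in range(200)]
rows = [{"age": rng.gauss(34, 7) if y else rng.gauss(26, 7), "noise": rng.gauss(0, 1)}
        for y in labels]

importance = {}
for feature in ("age", "noise"):
    observed = [permuted_drop(rows, labels, feature, rng, False) for _ in range(50)]
    null = [permuted_drop(rows, labels, feature, rng, True) for _ in range(50)]
    importance[feature] = welch_t(observed, null)

for feature, t in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: t = {t:.1f}")
```

In this toy setting the informative variable (age) separates clearly from its null distribution while the noise variable does not, which is the behaviour the ranking by t-statistic relies on.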

Ethical considerations
This study received ethical approval from the University of Witwatersrand HREC (ethics reference no. 200312) in August 2020 and is on the South African National Clinical Trial Registry (www.sanctr.gov.za; DOH-27-042021-679). A reimbursement of ZAR 155 was issued to participants for their time. The labelling of records and all documents were designed to maintain patients' confidentiality. All documents were kept in a facility with limited access and stored for at least two years post-investigation. Clinical particulars were only shareable on receiving written consent from the participant, apart from information requisite for auditing by regulatory and oversight bodies.

Participant characteristics
These results are presented in Table 1.

Demographics
1036 participants were enrolled in trial one and 593 in trial two, 18% (186) and 13% (76) of whom were HIV positive respectively, yielding a prevalence of 16% across the total of 1630 patients enrolled. Overall, HIV positive patients had a higher median age (

Sexual behaviour
Condom use was significantly associated with HIV status with 55% (758/1368) of HIV negative participants and 43% (113/262) of HIV positive participants reporting consistent condom use (p < 0.001). Reported rates of the different sexual intercourse behaviours were generally higher in the second trial than the first trial. Overall, vaginal receptive, vaginal insertive, and oral receptive intercourse were all significantly associated with HIV status. Specifically, vaginal receptive intercourse was reported more commonly in HIV positive participants as compared to HIV

Model results
Initial AUROC results produced from 10-fold cross validation of the training split of the first trial are shown below in Table 2. Performance values are the average across all runs with 95% confidence intervals calculated from the result standard distribution. The training split was derived from a 70:30 random subset of the trial's responses. This was utilised to provide baseline estimates of each model group's performance, as well as to optimise any hyperparameters in relation to AUROC.
Following this process, the optimised hyperparameters were used to fit each model on the full training split and subsequently evaluated on the holdout testing set. Performance metrics and associated receiver operator curves (ROCs) are shown in Fig. 2. AUROC values for the test set were substantially less dispersed than those from model cross-validation, largely as a result of the logistic regression models (glm and bayesglm) demonstrating a substantial improvement in performance. Irrespective of this alteration, the best performing architecture overall remained the gradient boosted tree model (xgbTree) with an AUROC of 82.84% (95% CI 76.48-89.21) (see Table 3).
Beyond simple holdout sample validation, the next evaluation sought to estimate model performance in an out-of-distribution, external validation scenario. This was achieved by evaluating the models trained from trial one's data on the entirety of phase one of trial two (see Table 4). While the question and response options matched across both trials, the geographical and demographic distributions differed substantially. Performance results and matched ROCs can be seen in Fig. 3. Despite these predictive challenges, model performance deteriorated only slightly with an average decline of only 1.16% (95% CI 0.13-2.20). Across all evaluations, gradient boosted decision trees averaged the greatest AUROC and as such were the best model for predicting HIV status.

Final model performance and interpretation
Final heterogeneous model performance was evaluated using a mixed dataset consisting of trial one and the first collection window of trial two. Data were split into training, validation, and test subsets in a 70:20:10 ratio. The model was trained using the CatBoost Python package. Both trials were shuffled and joined prior to splitting to enable an estimation of overall model performance on a heterogeneous population. Performance evaluation employed a cut-off value of 0.12 on the holdout test set, as determined by a pre-determined sensitivity threshold of 90% on the validation set. This model resulted in a sensitivity of 84% (95% CI 72-92), specificity of 71% (95% CI 67-76), and a negative predictive value (NPV) of 95% (95% CI 93-96) when evaluated on the hold-out test data (Table 5).
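The cut-off selection described above can be illustrated with a short Python sketch: the cut-off is chosen on the validation set as the highest score at which sensitivity still reaches the 90% target, and is then applied unchanged to the test set. The scores below are invented for illustration, the function names are hypothetical, and the exact selection rule used in the study is not stated, so this is one plausible reading.

```python
def pick_cutoff(labels, scores, target_sensitivity=0.90):
    """Highest candidate cut-off whose sensitivity on this set meets the target."""
    n_pos = sum(labels)
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        if true_pos / n_pos >= target_sensitivity:
            return cutoff
    return min(scores)

def evaluate(labels, scores, cutoff):
    """Sensitivity, specificity and negative predictive value at a fixed cut-off."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "npv": tn / (tn + fn)}

# Invented validation scores: 10 HIV positive then 10 HIV negative respondents.
val_labels = [1] * 10 + [0] * 10
val_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.15, 0.05,
              0.02, 0.03, 0.04, 0.06, 0.08, 0.10, 0.11, 0.13, 0.25, 0.35]

# Invented test scores: 5 HIV positive then 10 HIV negative respondents.
test_labels = [1] * 5 + [0] * 10
test_scores = [0.85, 0.45, 0.22, 0.16, 0.09,
               0.01, 0.05, 0.07, 0.12, 0.14, 0.18, 0.30, 0.03, 0.10, 0.02]

cutoff = pick_cutoff(val_labels, val_scores)
metrics = evaluate(test_labels, test_scores, cutoff)
print(cutoff, metrics)
```

Because the cut-off is fixed on the validation data, sensitivity on the test set need not reach the 90% target, which is consistent with the 84% observed on the hold-out data in this study.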

Number needed to treat
The final model was tested on an unseen dataset in the form of the three-month follow-up data from trial two (phase two). The predictions are tabulated against the ground truth as a two-by-two table (Table 6). This is used to evaluate the predictive performance of the model for seroconversion at three months. All individuals that acquired HIV were flagged by the model at the first visit with a risk value greater than the threshold. If these 42 individuals marked as being at risk were placed on pre-exposure prophylaxis this would mean the treatment of 10 (95% CI 5-155) individuals would have been required to prevent one seroconversion within three months.
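The point estimate above can be approximated with a short calculation. The sketch below assumes that the four incident cases reported elsewhere in this paper all fall within the 42 flagged individuals and that PrEP would have prevented every one of them (an efficacy of 1.0); the function name is hypothetical and the reported confidence interval is not reproduced because its method is not described.

```python
def number_needed_to_treat(n_treated, n_events, efficacy=1.0):
    """NNT as the reciprocal of the absolute risk reduction among those treated."""
    events_prevented = n_events * efficacy
    absolute_risk_reduction = events_prevented / n_treated
    return 1.0 / absolute_risk_reduction

# 42 individuals flagged as at risk, 4 of whom seroconverted within three months.
nnt = number_needed_to_treat(n_treated=42, n_events=4)
print(f"NNT = {nnt:.1f}")
```

Under these assumptions the calculation gives roughly 10.5, close to the reported value of 10.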

Model interpretation and variable importance
The most important variables in determining the risk for an individual, as ascertained using the PIMP methodology, are shown in Table 7. Using this approach, the most significant variable by a wide margin is the age of the individual (t = 673.0), with the second most substantial being the length of time since the last HIV test (t = 472.1). Following on from this, the number of male sexual partners (t = 261.6), as well as the biological sex of the respondent (t = 186.7) have the next most substantial impact on risk. Sexual behaviour characteristics such as having previously had a urinary tract or sexually transmitted infection (UTI/STI) (t = 148.5), pattern of condom usage (t = 140.1), and type(s) of sexual intercourse also play a role. Other included factors are socioeconomic predictors, such as the level of education (t = 111.4) or occupation (t = 132.5) of an individual, and symptomatology such as the presence of night sweats (t = 59.2) or recent loss of weight (t = 158.8). Fig. 4B presents a summary SHAP plot as an additional estimate for aggregate explanation of variable importance. Age, duration since last HIV test, number of male sexual partners, a history of assault, and reporting weight loss were identified as the top five most impactful variables on model output. Fig. 4C and D present further plots of variable importance ascertained using the SHAP methodology. Fig. 4C presents the individual contributions of each observation in the hold-out test dataset to the aggregate SHAP estimation (Fig. 4B) while Fig. 4D presents an example of a single prediction decomposed. In this example, the final model estimation of a high-risk score close to 1 is substantially driven by the individual's age, work travel pattern, long HIV testing interval, STI history, and occasional condom usage.
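To make the single-prediction decomposition concrete, the sketch below computes exact Shapley values for a small, invented risk score by averaging each variable's marginal contribution over all orderings. It is a toy illustration of the idea behind SHAP, not the study's model or the SHAP library: the variables, weights and single reference respondent are all assumptions, whereas SHAP typically averages over a background dataset.

```python
from itertools import permutations

def risk_model(x):
    """Invented toy risk score (assumption), including one interaction term."""
    score = 0.05
    score += 0.30 if x["age_over_30"] else 0.0
    score += 0.20 if x["last_test_over_1y"] else 0.0
    score += 0.10 if (x["age_over_30"] and not x["consistent_condom_use"]) else 0.0
    return score

def shapley_values(model, instance, baseline):
    """Exact Shapley values relative to one baseline respondent."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for f in order:
            current[f] = instance[f]  # switch this variable to the respondent's value
            value = model(current)
            totals[f] += value - previous
            previous = value
    return {f: total / len(orderings) for f, total in totals.items()}

instance = {"age_over_30": True, "last_test_over_1y": True, "consistent_condom_use": False}
baseline = {"age_over_30": False, "last_test_over_1y": False, "consistent_condom_use": True}

phi = shapley_values(risk_model, instance, baseline)
for feature, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {contribution:+.2f}")
```

The contributions sum to the difference between the respondent's score and the baseline score, which is the additive property that allows an individual prediction to be read as a stack of per-variable effects.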

Discussion
Having the ability to handle great quantities of population and health-related data swiftly, machine learning has the potential to identify patients who are at high risk of contracting HIV. This study was undertaken to assess the viability of ML in the identification of such patients particularly in LMICs, using patient data collected from digital surveys administered directly to potential patients as opposed to secondary analysis of a pre-existing electronic database. Participants who were not known to be HIV positive were enlisted in two consecutive trial studies.

Study characteristics
Trial one and trial two had similar distributions of baseline demographic characteristics. Overall, there are notably high levels of unemployment, a feature possibly attributable to the timing and incentive structure of the studies. There were more differences between the studies with respect to sexual behaviour, with noticeably higher rates of all forms of intercourse in trial two, which was likely a product of a flaw in the digital survey used in trial two that allowed participants to "select all". If there were non-random use of the "select all" function this could have biased the study results. Lifestyle characteristics were similar across both studies. An important difference between symptoms reported between the studies was that in the second trial respondents had to answer all the questions and thus there was no "unknown" field across trial two. This again could represent a source of bias if there was differential non-response. The HIV prevalence across trial one and two was relatively similar (18% and 13% respectively). This is in keeping with the estimated 13.4% HIV prevalence across South Africa in 2019 [29]. One would have expected the prevalence in the studies to be lower given that known HIV infection was one of the exclusion criteria, but this may have been countered by the intentional selection of high prevalence settings with HIV prevalence statistics that vary from 13.1% (Hammanskraal, Tshwane) to 27.8% (Isipingo, Ugu) [30].

Demographic characteristics
Key baseline demographic associations with HIV included female sex, older age, lower levels of education, and longer durations of work travel. These findings are in keeping with findings from other studies in South Africa [31][32][33]. The importance of demographic characteristics in HIV risk quantification is reflected in both of the final ML model's variable importance estimators, with age being the most important variable using both the PIMP (t = 673.0) and global SHAP methodology. The importance of these characteristics is well exemplified in the single prediction decomposition in Fig. 4d in which work travel and age were the greatest contributors to the example individual's risk.

Sexual behaviour
Condom use is a well described and quantified behavioural intervention to mitigate HIV infection risk, with data suggesting around 80% effectiveness in preventing transmission [34]. This is in keeping with the significantly lower reported rates of consistent condom use among HIV positive participants across both trials. The importance of this behavioural intervention is again reflected in both the PIMP and global SHAP variable importance assessments as well as highlighted in the example single prediction decomposition SHAP in which "sometimes" condom use is the 4th most important contributor to the example individual's risk. In addition, receptive vaginal sex was noted to be the most common form of sexual intercourse among HIV positive participants with a significant difference between the groups. This is in keeping with the HIV epidemic within South Africa where heterosexual females are noted to be at highest risk of HIV infection for various social and biological reasons [35,36]. However, within the modelling, this appears to be captured by the number of male partners (t = 261.6). This is also reflected in Fig. 4d where one male sexual partner was the 6th most important contributor to the example individual's risk.
Fig. 2. Receiver operator curves for all models evaluated using the holdout test portion of the initial data split from trial one.
Table 3. AUROC values for all models evaluated using the holdout test portion of the initial data split from trial one.
Table 4. AUROC values for all models trained using data from trial one and evaluated on the first phase of trial two.

Lifestyle
A longer duration since the participant's last HIV test as well as higher numbers of STIs and experiences of assault within the last year were the most important lifestyle related variables. The significance of the duration since last HIV test highlights the importance of broad access to HIV counselling and testing services and the need to expand these services to missed groups [37]. The association with recent history of STIs is in keeping with the literature and has a known biological basis [38]. It is useful to note that HIV and STIs are likely collinear, which would be a challenge for traditional modelling strategies but allows for synergy in prediction of each in this case. The final model places high importance on both STI history and, particularly, duration since last HIV test, as highlighted in the PIMP and global SHAP assessments. This is once again reflected in the high-risk assessment of the example individual in Fig. 4d, where a longer duration since last HIV test and a positive STI history are both relatively important.

Symptoms
The higher prevalence of tuberculosis symptoms among HIV positive participants is unsurprising given the well documented higher rates of the disease among HIV positive patients regardless of HIV progression [39]. The other symptoms reported more commonly are typically associated with less advanced stages of HIV infection [40] and highlight the potential value of this tool in identifying patients with HIV at earlier clinical stages before they present to a healthcare facility with an AIDS-defining condition. Generally, symptomatology had less influence on the model's predictions although it is noted that a few tuberculosis symptoms featured more prominently in the SHAP variable importance assessment.

Model comparison and final model performance
Model comparison revealed that, despite their simplicity, a large proportion of predictive performance was captured by logistic classification models. These models enable greater explanatory ability at the expense of some performance due to the lack of automated variable interaction among other techniques. However, the final model selected was a gradient boosted decision tree as it provided a substantial increase in performance. The superior classification capacity of ML over traditional modelling previously noted by other authors [14] was similarly noted in this study. In addition, if data collection were to be expanded to include additional geospatial or temporal factors, this class of predictive models would likely scale rapidly in performance, as was found by Orel et al., where an XGBoost model performed best for their data which included longitude, latitude, and altitude [15].
The AUROC measures of the models we evaluated were similar to those of other authors [14,19,20]. Our model and chosen cut-off performed better prospectively, flagging 100% (4/4) of incident HIV cases as high risk as compared to 38.6% (32/83) [19]. However, on the hold-out data from the final model partition, our model performed almost identically, flagging 38.3% (46/120) of positive participants as high risk. These findings highlight that the use of digital survey data can produce ML models that are comparable to those built using larger datasets.

Fig. 3. Receiver operator curves for all models trained using data from trial one and evaluated on the first phase of trial two.

(Figs. 3 and 4) highlight the important components of population-level risk. Given enough data one would be able to model HIV risk in different communities and use the variable importance estimates to identify the largest local contributors to risk and adjust the public health response accordingly. For example, should limited condom use and many STIs be contributing significantly to a community's risk, specific action could be taken to address those two issues. Similarly, the forced SHAP plot (Fig. 4d) helps to quantify the specific components of an individual's risk. If this was thought of in terms of modifiable risk (e.g., condom use) and non-modifiable risk (e.g., age) this could help guide personalised counselling services towards either behaviour change strategies or PrEP.

PrEP decision support
Much of the ML work done in HIV thus far has been focused on supporting and guiding PrEP treatment decisions [19][20][21]. In particular, Zheng et al. used machine learning to maximize the number of seroconversions prevented while minimizing the number of people on PrEP and showed that one can use such an approach to offer individualized PrEP decision support in resource limited settings [21]. This is in keeping with data from another study that showed that ML can identify an "at risk" population that is small enough to offer directed preventative services [15]. Given the relative ease of use of the digital survey used in this trial, such a survey and associated risk assessment algorithm could be used to identify the at-risk individuals who would benefit most from being offered PrEP while tailoring the risk-threshold to the resources available. The estimated number needed to treat of 10 individuals to prevent one infection at three months emphasizes how successful such an approach could be.

Future directions
To our knowledge this is the first study of its kind to develop a digital survey of socio-behavioural questions and use it as a primary data collection tool to build an ML model to estimate HIV risk. It is anticipated that such a tool could potentially be integrated into the clinical workflow to enhance the public health response to HIV at both the population and individual level as well as to support decisions around PrEP initiation. Such an approach would provide a continuous source of data, which if linked to ground truth could be used to validate and further enhance the predictive capacity of the model. In addition, given the collinearity, such a platform could be used to assist with screening for tuberculosis and STI symptoms in the general population. Finally, the inclusion of geospatial, temporal, or local incidence data may further enhance the predictive ability of the model.

Limitations
The major limitation of this study concerns the combination of data from two separate studies. While this process was made simpler by the fact that the second trial used a refined version of the survey from the first trial and every effort was made to ensure consistency between the datasets, such an approach still carries the risk of miscoding a response. Similarly, the "select all" option for different intercourse options in trial two as well as the ability not to select an option to some questions in trial one could have introduced bias if there was differential selection of either of these options by HIV status. Additionally, it is to be noted that all variable importance methodologies have recognised limitations and neither of the two presented here can be certain to fully capture the portion of risk attributable to a particular variable. Finally, the non-random sampling strategies employed are also a limitation of the study as they may have introduced bias.

Conclusion
This study has highlighted the synergies present between mHealth methodologies and ML in the field of HIV risk prediction. It has been demonstrated that a viable ML model can be built using digital survey data, with potential utility in directing health resources, including PrEP, towards the areas of greatest potential benefit. It has been shown that such an approach could be viable in LMICs, such as South Africa, where it is most needed.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Samanta Lalla-Edward reports financial support was provided by Bill & Melinda Gates Foundation.