Open Access. Published by De Gruyter, October 26, 2022, under a CC BY 4.0 license.

Identifying HIV sequences that escape antibody neutralization using random forests and collaborative targeted learning

  • Yutong Jin and David Benkeser

Abstract

Recent studies have indicated that it is possible to protect individuals from HIV infection using passive infusion of monoclonal antibodies. However, in order for monoclonal antibodies to confer robust protection, the antibodies must be capable of neutralizing many possible strains of the virus. This is particularly challenging in the context of a highly diverse pathogen like HIV. It is therefore of great interest to leverage existing observational data sources to discover antibodies that are able to neutralize HIV viruses at residues where existing antibodies show modest protection. Such information feeds directly into the clinical trial pipeline for monoclonal antibody therapies by providing information on (i) whether and to what extent combinations of antibodies can generate superior protection and (ii) strategies for analyzing past clinical trials to identify in vivo evidence of antibody resistance. These observational data include genetic features of many diverse HIV genetic sequences, as well as in vitro measures of antibody resistance. The statistical learning problem we are interested in is developing statistical methodology that can be used to analyze these data to identify important genetic features that are significantly associated with antibody resistance. This is a challenging problem owing to the high-dimensional and strongly correlated nature of the genetic sequence data. To overcome these challenges, we propose an outcome-adaptive, collaborative targeted minimum loss-based estimation approach using random forests. We demonstrate via simulation that the approach enjoys important statistical benefits over existing approaches in terms of bias, mean squared error, and type I error. We apply the approach to the Compile, Analyze and Tally NAb Panels (CATNAP) database to identify amino acid (AA) positions that are potentially causally related to resistance to neutralization by several different antibodies.

MSC 2010: 62G10; 62G35

1 Introduction

In spite of reductions in incidence due to the availability of pre-exposure prophylaxis products [1,2], HIV incidence remains stubbornly high. In 2019, 36,801 new HIV diagnoses were reported in the United States, and at year-end an estimated 1.1 million individuals were living with HIV in the United States [3]. To reduce incidence further, there is great demand for the development of alternative products for HIV prevention. One example of such a product is a preventive HIV vaccine. However, development of preventive HIV vaccines has proven challenging due to the highly diverse nature of HIV viruses. In general, vaccines work by presenting a so-called antigen to the host immune system. For example, this antigen could be a piece of viral genetic material or a protein expressed on a virus. The host’s immune system reacts to the antigen and generates a biological response that is capable of protecting the host from breakthrough infection and disease if the host is later exposed to the actual virus. Myriad immune responses are generated against vaccine antigens; however, one response that is hypothesized to be crucially important for effective HIV vaccines is the antibody response. Antibodies are proteins that circulate in the blood and are designed to bind to specific foreign substances, such as viruses, thereby neutralizing them. To understand how antibodies are thought to neutralize HIV, it is first important to understand the structure of the virus itself. A simple diagram of HIV is illustrated in Figure 1. The green spikes protruding from the virus are called envelope (Env) proteins and are used by the virus to bind to and subsequently infect host CD4 T-cells. The Env protein is also the putative site for antibody binding. If enough antibodies bind to these Env proteins, they may block the virus from infecting cells, effectively neutralizing the virus.
However, HIV can and does evolve specifically to avoid such immune responses, adapting the geometric shape and/or physiochemical makeup of its Env protein to make it more difficult for antibodies to recognize. This presents a challenge for HIV vaccine development – vaccines must induce antibody responses that are capable of binding to and neutralizing a wide variety of Env proteins.

Figure 1

Illustration of the structure of HIV. The envelope protein, which includes the gp120 and gp41 sub-proteins, is the putative binding site for antibodies and other immune responses.
Due to the challenges associated with selecting an antigen capable of inducing such antibodies, an alternative paradigm has emerged for HIV prevention – rather than relying on vaccine antigens to induce the right kind of antibodies, we can identify and manufacture effective antibodies in a laboratory ourselves. This mode of prevention is currently of great interest in the field of HIV prevention. Using sera from HIV-infected humans and/or non-human primates, scientists have identified several so-called broadly neutralizing antibodies (bnAbs), single antibodies that are capable of neutralizing a wide variety of Env proteins. These antibodies can be manufactured at scale and given to individuals at risk of HIV acquisition to prevent future HIV infection. Several prevention trials along these lines are being conducted [4], including HVTN 704, a Phase 2b study of a bnAb named VRC01 [5]. This trial showed modest overall efficacy for preventing HIV, with a multiplicative reduction in risk of infection of about 20%. However, further analysis of this trial revealed a crucial result – individuals given VRC01 were indeed protected from some types of HIV, but were susceptible to others. In particular, individuals who were given VRC01 were protected from viruses that were capable of being neutralized by the antibody VRC01, but remained susceptible to infection by HIV viruses that were able to escape VRC01. Investigators demonstrated this effect by sequencing the Env protein from viruses of HIV-infected individuals in the trial. Laboratory experiments were then conducted to measure how well these Env proteins could be neutralized by VRC01. We refer to this outcome measure as the neutralization sensitivity of the virus [6].
Combining these data with the clinical data from the randomized trial, researchers estimated the prevention efficacy of VRC01 as a function of neutralization sensitivity, demonstrating that receipt of VRC01 was associated with almost no protection from infection with viruses with low neutralization sensitivity, but provided high efficacy against viruses that were highly sensitive to neutralization by VRC01. These results highlighted that bnAbs should be considered a highly promising means of HIV prevention, but, as with vaccines, the key obstacle is how to select antibodies that are capable of neutralizing genetically diverse HIV Env proteins.

Overcoming this challenge will require careful interaction between randomized studies, like HVTN 704 described above, and observational laboratory studies. In the latter, HIV Env sequences isolated from infected participants around the globe are used to measure neutralization sensitivity to various candidate bnAbs. The Compile, Analyze and Tally NAb Panels (CATNAP) database collates data from these neutralization studies into a central, publicly available repository [7]. However, the number of sequences available for some antibodies is limited, and moreover, databases such as these do not exhaustively cover all possible configurations of the Env protein that may be observed in the population. Therefore, it is of great interest to develop methods that use observational databases, like CATNAP, to learn the causal effect of mutations in the Env protein on antibody resistance.

Figure 2 illustrates the interplay between observational neutralization studies and randomized controlled trials of bnAb therapies. Randomized experiments are generally time-consuming and expensive, and thus we must take care to advance only the most promising bnAbs to randomized trials. One common practice is to use neutralization studies to identify promising bnAbs that can be advanced to randomized trials, like HVTN 704. If low overall efficacy but significant efficacy against sensitive viruses is found, we may view the trial as a partial success in that the bnAb successfully neutralized viruses that it was biologically capable of neutralizing. We may then consider adding one or several additional bnAbs to create a cocktail of multiple bnAbs. The question then arises as to how to select these additional bnAbs. The first valuable source of data are the randomized trials themselves, where we can perform a so-called sieve analysis [8].

Figure 2

Illustration of the interplay between observational neutralization studies and randomized controlled trials of bnAb therapies for prevention. Green arrows illustrate where our proposed method may contribute to the process.

Sieve analysis quantifies prevention efficacy of the bnAb regimen studied in the randomized trial as a function of characteristics of the Env protein. Because this analysis is based on randomized trial data, it can identify causal effects of Env protein mutations on prevention efficacy under minimal assumptions. However, with more than 800 amino acid (AA) residues in the Env protein, the search space for mutations that may be causally related to antibody resistance is huge. With few observed HIV infections in a randomized trial, sieve analyses that are forced to consider a large number of potential mutations will suffer from low statistical power after accounting for multiple testing. Therefore, it is crucial to utilize observational neutralization data to identify potentially interesting mutations a priori in order to reduce the impact of multiple testing adjustments.

Causal analysis of neutralization studies can also be used to identify “gaps” in one bnAb – mutations that allow the virus to escape neutralization. If such gaps are identified, we may search for additional bnAbs that are robust to these mutations. These results together with the results of the sieve analysis can be used to select a multi-bnAb cocktail to be evaluated in a future randomized trial.

If and when a successful bnAb or a combination of bnAbs is identified, it will also be important to subject such a regimen to various “stress tests.” That is, we may wish to use observational neutralization studies to suggest mutations that may be particularly difficult for the therapy to neutralize. Neutralization capability of viruses with these mutation patterns can be studied prospectively to suggest whether and to what extent the bnAb therapy will remain effective against new mutations. Such studies may aid in regulatory decision making on the approval of bnAb therapies. For example, during the COVID-19 pandemic, previously approved monoclonal antibody therapies saw their use restricted after the emergence of new variants against which their neutralization capacity was diminished [9].

The above points to a clear need to learn causal relationships between mutations in the Env protein and sensitivity to neutralization by antibodies. In this work, we aim to identify complementary bnAbs that target the residues where some existing antibodies show modest protection. Additionally, we provide a useful hypothesis test that guides downstream sieve analysis by reducing the search space. To achieve these goals, we thus discuss whether and how the causal effect of a mutation at a single AA residue in the Env protein on antibody sensitivity can be identified and estimated from observational neutralization data. We discuss the plausibility of the assumptions in the context of known HIV biology and the data that are typically available in observational neutralization studies. We also propose an approach for dealing with the challenges associated with practical violations of the positivity assumption. In particular, we develop a random forest-based, outcome-adaptive, collaborative targeted minimum loss-based estimation (CTMLE) approach. Our estimation and inference are built on two models. The first is a model of the probability of sensitivity as a function of Env AA residues and other measured viral characteristics, or the so-called outcome regression (OR). The second is a model of the distribution of AAs at a particular residue as a function of other sequence features, or the so-called generalized propensity score (GPS). For both models, in this work we focus on the random forest algorithm [10], which has been shown to predict well in this setting [11].

A typical CTMLE implementation uses a greedy forward selection algorithm that sequentially seeks the best next candidate feature to include at each iteration [12]. However, this approach imposes a heavy computational burden, which motivates us to pursue a method with reduced time complexity. Our work is similar to, yet distinct from, the partial-correlation-based pre-ordering of covariates suggested by Ju et al. [13]. Rather than pre-ordering covariates, we instead adopt an explicitly outcome-adaptive approach along the lines of ref. [14]. That is, we advance the most relevant features identified in the OR for inclusion in the model for the GPS. We show that this stabilizes inference relative to procedures that include all features in the generalized propensity model. The modification results in significant time savings relative to ref. [13], which is particularly important given the high-throughput nature of the scientific context. Although Ju et al. [13] focused on methods for estimating the effect of a single treatment, our analysis requires estimation of an effect for each of several hundred AA residues. We demonstrate that tests based on our CTMLE estimates appropriately control type I error at realistic sample sizes, even in the presence of highly correlated residues, while maintaining power to detect effects of AA mutations in Env on antibody sensitivity.

2 Methods

2.1 Counterfactual antibody resistance probability

We assume that we have access to a database like the aforementioned CATNAP. An example illustration of one such data set is given in Table 1. The data consist of n HIV Env sequences. For each sequence, we have a measure of sensitivity to an antibody of interest (e.g., 1 = can effectively be neutralized by the antibody, 0 = cannot) and basic information about the origin of the sequence. The remaining columns represent the AAs that make up the Env protein of the sequence. Each AA position, or residue, of the Env protein can assume one of 22 different values – one of the 20 AAs that build proteins, a frameshift mutation (indicating an insertion or deletion of nucleotide bases), or a gap (an artifact of the sequencing technology to maintain alignment with a referent HIV sequence). For our purposes, it suffices to think of each AA residue as a categorical variable that theoretically could assume up to 22 different values, while in practice we generally observe between two and four unique AAs at each residue. We treat the n sequences as independent, which may be reasonable because (i) almost all sequences in CATNAP are isolated from different individuals and (ii) HIV replicates extremely rapidly, such that any two isolated sequences, even if close in geographic proximity, are likely to be only distantly related.

Table 1

Example data set illustrating the structure of CATNAP data for a given antibody

Sequence ID | Sensitive ($Y$) | Origin ($W_0$) | AA 1 ($W_1$) | AA 2 ($W_2$) | … | AA 856 ($W_J$)
1 | 1 | Africa | N | R | … | L
2 | 0 | Europe | N | K | … | Q
⋮ | | | | | |
n | 1 | N. America | S | R | … | L

Sensitive is a binary readout from the assay indicating whether the sequence was effectively neutralized by the antibody. AA $j$ = amino acid at residue $j$ of the Env protein.
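As a concrete illustration of this data structure, the following sketch encodes a few hypothetical sequences in plain Python (the record layout and values are illustrative, not the actual CATNAP schema) and recovers the observed support at a residue:

```python
# Hypothetical records mimicking the structure of Table 1; values are illustrative.
sequences = [
    {"id": 1, "sensitive": 1, "origin": "Africa",     "aa": {1: "N", 2: "R", 856: "L"}},
    {"id": 2, "sensitive": 0, "origin": "Europe",     "aa": {1: "N", 2: "K", 856: "Q"}},
    {"id": 3, "sensitive": 1, "origin": "N. America", "aa": {1: "S", 2: "R", 856: "L"}},
]

# In practice each residue has a small observed support (two to four AAs),
# far below the 22 theoretically possible values.
def observed_support(seqs, j):
    return sorted({s["aa"][j] for s in seqs})
```

Here `observed_support(sequences, 1)` returns `['N', 'S']`, reflecting that only two of the 22 possible values are seen at residue 1 in this toy data.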

We denote the sensitivity outcome by $Y$, where $Y = 1$ indicates that the virus is effectively neutralized by the antibody and $Y = 0$ indicates that it is not. We denote the origin of the virus together with the vector of AA information on the Env sequence as $W$. Basic information about the origin is encoded as $W_0$. Below, we consider methods for analyzing the causal impact of mutations at each of the $J$ AA residues in turn. It is therefore convenient to introduce the notation $W_{-j}$ to denote the origin and the vector of all Env characteristics with the $j$th residue removed, for $j = 1, \ldots, J$. Thus, the data used for the analysis of the $j$th residue can be considered $n$ independent copies of the triplet $O_j = (W_j, W_{-j}, Y)$.

For each AA residue, we define a counterfactual resistance outcome $Y(W_j = w)$, the binary sensitivity indicator that would be observed under a hypothetical intervention that fixes the AA at residue $j$ to $w \in \mathcal{W}_j$, the set of possible AAs at residue $j$ that could be observed in the entire population of virus Env sequences. We also define $p_j(w) = P[Y(W_j = w) = 1]$, the proportion of virus Env sequences in this counterfactual world that would be sensitive to neutralization by the bnAb of interest.

To achieve the goals outlined in Figure 2, we may be interested in estimation and inference about $p_j(w)$. For example, to identify gaps in one bnAb, we would be interested in identifying $(j, w)$ combinations for which $p_j(w)$ is low. To complement such a bnAb in a multiple-bnAb cocktail, we would look for other antibodies for which the same counterfactual sensitivity probability is high. On the other hand, if the scientific goal of the analysis of the observational neutralization data is to reduce the search space for a sieve analysis, we may instead wish to test the null hypothesis that an AA substitution at residue $j$ has no impact on average bnAb resistance, that is, $H_0: p_j(w)$ is constant in $w$. AA residues for which this null hypothesis is rejected may be advanced for further consideration in a sieve analysis using data from a randomized trial. Regardless of the ultimate scientific goal, identification and estimation of counterfactual sensitivity probabilities is an important problem.

2.2 Causal identification

We require two main conditions for identifiability of $p_j(w)$: (i) conditional exchangeability and (ii) positivity. If these conditions hold, then $p_j(w)$ is identified by the G-functional $p_j(w) = E[E(Y \mid W_j = w, W_{-j})]$. We discuss these conditions in detail below and conclude with a discussion of their plausibility in the present context.
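For intuition, the G-functional can be computed exactly in a small discrete example by plugging in empirical stratum means. The sketch below uses toy data in which a single binary variable stands in for $W_{-j}$: the inner expectation is the mean of $Y$ within each $(W_j = w, W_{-j})$ stratum, and the outer expectation averages over the empirical distribution of $W_{-j}$.

```python
from collections import defaultdict

# Toy discrete example of the G-functional. Each tuple is (Wj, Wmj, Y),
# where Wmj is a single binary stand-in for the other residues W_{-j};
# all values are illustrative.
data = [
    ("N", 0, 1), ("N", 0, 1), ("N", 1, 0), ("N", 1, 1),
    ("S", 0, 0), ("S", 0, 1), ("S", 1, 0), ("S", 1, 0),
]

def g_functional(rows, w):
    # Inner expectation: mean of Y within each (Wj = w, Wmj) stratum.
    strata = defaultdict(list)
    for wj, wmj, y in rows:
        if wj == w:
            strata[wmj].append(y)
    cond_mean = {wmj: sum(ys) / len(ys) for wmj, ys in strata.items()}
    # Outer expectation: average over the empirical distribution of Wmj.
    return sum(cond_mean[wmj] for _, wmj, _ in rows) / len(rows)

p_N = g_functional(data, "N")   # 0.75
p_S = g_functional(data, "S")   # 0.25
```

Note that if some $(w, W_{-j})$ stratum is empty, the inner conditional mean is undefined – precisely the positivity issue discussed in Section 2.2.2.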

2.2.1 Exchangeability

Two directed acyclic graphs (DAGs) are displayed in Figure 3. On the left, we display a DAG that is useful for discussing the biology of antibody neutralization. On the right, we display a DAG that is useful for describing backdoor pathways and the types of unmeasured confounders that may prove problematic for our proposal.

Figure 3

DAG for HIV resistance. The DAG on the left shows the pipeline of gene expression, illustrating how genes regulate downstream activities and affect antibody sensitivity through the Env AAs of interest.

Beginning with the DAG on the left, the nodes labeled G represent genes on the viral genome. Certain genes on the genome correspond to AA residues on the Env protein (nodes labeled AA). These AA configurations result in certain physiochemical properties of the Env protein. The DAG lists several Env characteristics that are putatively important for antibody binding and neutralization. These characteristics act as mediators of the effect of AA configuration on antibody sensitivity. The structure (or shape) of the Env protein is determined by the AAs present; this in turn determines which portions of the protein are “exposed” to antibodies. Other physiochemical properties that are also impacted by the Env protein AA sequence could affect the affinity of antibodies for binding the protein. One particular process that may be important for HIV infectivity and antibody sensitivity is the presence of glycosylated regions of the Env protein. Such regions can play a key role in binding, both of the virus to host CD4 T-cells and of antibodies to the virus. Therefore, these regions are expected to play an especially important causal role in neutralization of viruses by antibodies.

A simplified DAG is shown on the right, with nodes collapsed by the role they play in opening/closing backdoor paths between the outcome and AA residues in the protein. As mentioned above, our approach considers each AA in turn. So for a particular residue $j$, $W_j$ acts as the “exposure” of interest, while all other AA residues act as potential “confounders” $W_k$. The genes are unmeasured and play the role of potential unmeasured confounders, while the origin of the virus can be considered another potential confounder $W_0$. The mediators in this problem, here collapsed to a single node $M$, are the aforementioned structural or physiochemical properties of the Env protein. The DAG on the left suffices to discuss the biological plausibility of causal identification in this problem. In particular, three types of pathways are relevant to identification. The first are pathways of the form $W_j \leftarrow G_2 \rightarrow W_k \rightarrow M \rightarrow Y$. In such pathways, there is a gene responsible for coding both the AA residue of interest and other residues that are associated with mediators of antibody sensitivity. In this case, controlling for $W_k$ suffices to block the pathway. The second type of pathway is one of the form $W_j \leftarrow G_2 \leftarrow W_0 \rightarrow G_1 \rightarrow M \rightarrow Y$. Here, the geographic region ($W_0$) is associated with two genes that code separate residues on the Env protein. This pathway can be blocked by conditioning on $W_0$. Finally, there are pathways of the form $W_j \leftarrow G_3 \rightarrow M \rightarrow Y$. Here, there is a gene that codes the AA residue of interest, but is also associated with some other mediating pathway of antibody sensitivity. For example, it is possible that having more Env proteins expressed reduces antibody sensitivity of viruses. If a particular gene were associated with the number of Env proteins expressed and also associated with the AA residue of interest, then there would be no way to block this pathway, since the gene information is unavailable.
Thus, in order to formally establish causal identification, we must assume that no such genes exist. Under this assumption, the DAG implies exchangeability, that is, $Y(W_j = w) \perp W_j \mid W_{-j}$.

2.2.2 Positivity

Causal identification additionally requires that $P[P(W_j = w \mid W_{-j}) > 0] = 1$ for all $w \in \mathcal{W}_j$. That is, we require that there be a positive probability of observing AA $w$ at position $j$ given the AAs present at other positions and the country of origin. This assumption could be satisfied by a careful selection of $\mathcal{W}_j$. In practice, we will establish $\mathcal{W}_j$ empirically by examining the observed support of each $W_j$. However, this alone is insufficient to ensure that the positivity condition holds. We must additionally assume that any observed AA substitution is plausible, regardless of the other AAs present in the Env sequence. This assumption may be plausible for some residues, but implausible for others, depending on the physiochemical properties of the AAs in question. Thus, in practice it may be possible to have structural violations of this assumption. We also expect practical violations of this assumption, with some combinations of AA configurations only rarely observed in the source population.
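One simple empirical diagnostic for such violations is to tabulate the distribution of $W_j$ within strata of other residues and flag strata in which an observed level of $W_j$ never appears. A minimal sketch, with illustrative data and a single correlated residue standing in for the full conditioning set:

```python
from collections import Counter, defaultdict

# Illustrative data: amino acid at residue j paired with the amino acid at
# one strongly correlated residue k (a one-variable stand-in for W_{-j}).
pairs = [("N", "R"), ("N", "R"), ("N", "K"), ("S", "R"),
         ("N", "R"), ("S", "R"), ("N", "K"), ("N", "R")]

support_j = sorted({wj for wj, _ in pairs})          # observed support at j

# Tabulate the empirical distribution of W_j within each stratum of W_k and
# flag strata where some observed level of W_j never occurs -- an empirical
# (practical) positivity violation.
by_stratum = defaultdict(Counter)
for wj, wk in pairs:
    by_stratum[wk][wj] += 1

violations = {wk: sorted(set(support_j) - set(cnt))
              for wk, cnt in by_stratum.items()
              if set(support_j) - set(cnt)}
```

In this toy data, `violations` is `{"K": ["S"]}`: the substitution S at residue $j$ is never observed together with K at residue $k$, so the corresponding GPS estimate would be zero. Marginal checks like this understate the problem, since positivity is conditional on all of $W_{-j}$.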

2.2.3 Plausibility of assumptions

The above discussion highlights the tenuous nature of causal interpretations in the context of antibody sensitivity. We generally do not yet possess a strong enough understanding of HIV biology to fully justify the exchangeability condition, and it is also possible that structural violations of the positivity assumption may be present. If the scientific context of an analysis demands that a confirmatory causal conclusion be drawn, then we may require an additional causal sensitivity analysis to determine the robustness of the estimated effects to the presence of unmeasured confounding pathways. However, we argue that a causally oriented analysis may still provide a useful framework for hypothesis generation and exploratory analysis in the context of the research process described in Figure 2. In this case, we may prefer to interpret the G-functional identification parameter more cautiously, summarizing it as a “variable importance” measure for particular AA residues, as in ref. [15], and recommending follow-up experiments to validate the findings.

2.3 Motivation for novel method

There are many methods available that could in principle be applied in a non-causal context in this setting. For example, we could first conduct an analysis using LASSO on half of the samples, with the best value of the tuning parameter $\lambda$ determined via cross-validation to arrive at a final model. The AA residues retained in the final model can then be entered into a logistic regression fit on the withheld half of the samples for formal inference. Another approach would be to use random forest “variable importance” measures, for which we may obtain $p$-values using a fast variable importance test [16]. However, we argue that methods that have a causal interpretation, even if under limited circumstances, are appealing in this setting, since causal inference is fundamentally at the core of the scientific question. Thus, we are motivated to explore causal inference-inspired methodology for estimating the G-functional $E[E(Y \mid W_j = w, W_{-j})]$ and associated significance tests for AA residues.
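A minimal sketch of this sample-splitting strategy on simulated data, assuming scikit-learn is available (the simulation design and variable names are ours, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

rng = np.random.default_rng(0)

# Simulated stand-in for the real data: n sequences, p binary residue
# indicators, with only the first three features truly predictive.
n, p = 200, 30
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[:3] = 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 1.0))))

# Select features with cross-validated LASSO on the first half ...
half = n // 2
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5,
                             max_iter=5000).fit(X[:half], y[:half])
selected = np.flatnonzero(lasso.coef_.ravel() != 0)

# ... then refit an (effectively) unpenalized logistic regression on the
# withheld half; inference from this fit is not biased by selection
# because the selection used independent data. (Large C ~ no penalty.)
refit = LogisticRegression(C=1e6, max_iter=5000).fit(X[half:, selected], y[half:])
```

The split protects the second-stage Wald tests from post-selection bias, at the cost of halving the sample available for each stage.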

There are two related challenges associated with causal inference methods in this context: (i) the dimensionality of the covariates and (ii) practical violations of the positivity assumption. The dimensionality of the covariates presents challenges to estimation. The discussion of exchangeability above highlights the need to potentially adjust for a high-dimensional $W_{-j}$ with a limited number of sequences available for a given analysis. Thus, we may be motivated to consider flexible machine learning algorithms that are built specifically for modeling high-dimensional covariates. It would be natural to couple these modeling strategies with doubly robust methods for inference, as such methods typically allow for regular, parametric-rate inference under certain statistical assumptions. However, these methods can be susceptible to erratic behavior in the presence of practical positivity violations. One approach to improving the behavior of doubly robust estimators in the presence of positivity violations involves a careful pre-selection of adjustment covariates, informed by background knowledge [17]. However, this would appear difficult in the present setting due to an imperfect understanding of the relationships among AA residues on the Env protein and their relationship with antibody neutralization. Moreover, for a method to be successfully employed to scan the entire Env protein for significant residues, it should be a high-throughput method capable of delivering stable inference on hundreds of AA residues. An analysis that involves deliberate pre-selection of covariates residue by residue is infeasible. Therefore, we are motivated to develop an automated procedure for covariate selection in this context. In the subsequent sections, we propose a solution to this problem using CTMLE.

2.4 TMLE

Our CTMLE builds on targeted minimum loss-based estimators (TMLEs) of average treatment effects (ATEs) [18]. TMLEs are constructed using estimates of two key quantities: the OR and the GPS. The OR is defined as the conditional mean of the outcome given the treatment and confounding factors, which we denote by $\bar{Q}(W_j, W_{-j}) = P(Y = 1 \mid W_j, W_{-j})$. The GPS is the conditional distribution of AAs at residue $j$ given all other AA residues and sequence information, which we denote by $g_j(w \mid W_{-j}) = P(W_j = w \mid W_{-j})$ for $w \in \mathcal{W}_j$. A TMLE of $E[Y(W_j = w)]$ for a particular $j$ and $w$ can be constructed in two steps. In step 1, an initial estimator $\bar{Q}_n$ of the OR is obtained using any learning technique for classification of a binary outcome. Similarly, an estimator $g_{n,j}$ of the GPS can be obtained using any multi-class classification technique. In step 2, a single iteration of boosting is used to update the initial OR. For a given OR estimate $\bar{Q}$, we can compute its empirical risk,

(1) $L_{n,j}(\bar{Q}) = -\frac{1}{n}\sum_{i=1}^{n}\left[Y_i \log\{\bar{Q}(W_{j,i}, W_{-j,i})\} + (1 - Y_i)\log\{1 - \bar{Q}(W_{j,i}, W_{-j,i})\}\right]$.

Next, we update $\bar{Q}_n$ by minimizing the empirical loss (1) along a particular univariate regression model indexed by $\bar{Q}_n$. For each $w$, we define the logistic regression model

$\bar{Q}_{n,\varepsilon(w)}(W_j, W_{-j}) = \mathrm{expit}\left[\mathrm{logit}\{\bar{Q}_n(W_j, W_{-j})\} + \varepsilon(w)\,\dfrac{I(W_j = w)}{g_{n,j}(w \mid W_{-j})}\right], \quad \varepsilon(w) \in \mathbb{R}$.

We can obtain the maximum likelihood estimate $\varepsilon_n(w) = \arg\min_{\varepsilon(w)} L_{n,j}(\bar{Q}_{n,\varepsilon(w)})$ and define the updated OR estimate

(2) $\bar{Q}_n^*(w, W_{-j}) = \mathrm{expit}\left[\mathrm{logit}\{\bar{Q}_n(w, W_{-j})\} + \dfrac{\varepsilon_n(w)}{g_{n,j}(w \mid W_{-j})}\right]$.

The final estimate is computed as

$\hat{p}_j(w) = \dfrac{1}{n}\sum_{i=1}^{n} \bar{Q}_n^*(w, W_{-j,i})$.

The key idea motivating the second step of the TMLE is that the bias of the revised OR estimator $\bar{Q}_n^*$ is generally smaller than that of $\bar{Q}_n$ with respect to estimating $p_j(w)$.
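To make the targeting step concrete, the following self-contained numpy sketch carries out the fluctuation, solving for the maximizing $\varepsilon$ by Newton's method on the binomial log-likelihood, and then forms the plug-in estimate. The initial OR and GPS estimates here are simulated stand-ins, not output of any real fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

# Toy setup for one residue j and one amino acid w: A_i = I(W_j,i = w),
# g = estimated GPS P(W_j = w | W_-j), Q0_obs = initial OR at the observed
# (W_j,i, W_-j,i), Q0_w = initial OR evaluated at (w, W_-j,i).
n = 500
g = rng.uniform(0.1, 0.9, n)
A = rng.binomial(1, g)
Q0_w = expit(rng.normal(0.5, 1.0, n))                    # stand-in estimates
Q0_obs = np.where(A == 1, Q0_w, expit(rng.normal(-0.2, 1.0, n)))
Y = rng.binomial(1, Q0_obs)

# Targeting step: one-dimensional MLE of epsilon along the fluctuation
# logit Q_eps = logit Q0 + eps * A / g, via Newton's method.
H = A / g
eps = 0.0
for _ in range(25):
    Q_eps = expit(logit(Q0_obs) + eps * H)
    grad = -np.mean((Y - Q_eps) * H)          # derivative of the neg. log-lik.
    hess = np.mean(Q_eps * (1 - Q_eps) * H ** 2)
    eps -= grad / hess

# Updated OR at the intervened value w (the clever covariate is 1/g there),
# and the plug-in TMLE of p_j(w).
Q_upd_w = expit(logit(Q0_w) + eps / g)
p_hat = Q_upd_w.mean()
```

At convergence the score equation $\frac{1}{n}\sum_i H_i \{Y_i - \bar{Q}_{n,\varepsilon}(W_{j,i}, W_{-j,i})\} = 0$ is solved, which is what yields the bias reduction described above.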

This procedure can be repeated for each $w \in \mathcal{W}_j$. With a minor abuse of notation, we define the implied updated estimate of $E(Y \mid W_j, W_{-j})$ as

$\bar{Q}_n^*(W_j, W_{-j}) = \mathrm{expit}\left[\mathrm{logit}\{\bar{Q}_n(W_j, W_{-j})\} + \sum_{w \in \mathcal{W}_j} \varepsilon_n(w)\,\dfrac{I(W_j = w)}{g_{n,j}(w \mid W_{-j})}\right]$.

Recalling our null hypothesis of interest, that $E[Y(W_j = w)]$ is constant in $w$, a TMLE-based test can be derived as follows. For $w \in \mathcal{W}_j$, we define

$D_{j,i}(w) = \dfrac{I(W_{j,i} = w)}{g_{n,j}(w \mid W_{-j,i})}\{Y_i - \bar{Q}_n^*(W_{j,i}, W_{-j,i})\} + \bar{Q}_n^*(w, W_{-j,i}) - \hat{p}_j(w)$,

and let $D_{j,i}$ be the $|\mathcal{W}_j|$-length row vector consisting of $D_{j,i}(w)$ for each $w \in \mathcal{W}_j$. Let $D_j$ be the $n \times |\mathcal{W}_j|$ matrix formed by stacking the row vectors $D_{j,i}$, $i = 1, \ldots, n$. The estimates $\hat{p}_j = \{\hat{p}_j(w) : w \in \mathcal{W}_j\}$ of $p_j = \{p_j(w) : w \in \mathcal{W}_j\}$ have an asymptotic covariance that is consistently estimated by $\hat{\Sigma} = n^{-1} D_j^\top D_j$.

We propose to test $H_0$ using a Wald test with $|\mathcal{W}_j| - 1$ degrees of freedom. Specifically, we define a contrast matrix $A$, where each row defines a contrast between $\hat{p}_j(w)$ and $\hat{p}_j(w')$ for $w, w' \in \mathcal{W}_j$. For example, if $|\mathcal{W}_j| = 4$, then

$A = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}$,

and our null hypothesis can be written as $H_0: A p_j = 0$.
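The Wald test can be assembled from the stacked influence-function matrix with a few lines of numpy; in this sketch, $D$ and $\hat{p}_j$ are simulated placeholders for the quantities defined above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder inputs: estimates p_hat for K = 4 amino acids at residue j,
# and a stacked influence-function matrix D (n x K) as defined in the text.
n, K = 300, 4
D = rng.normal(size=(n, K))
p_hat = np.array([0.62, 0.60, 0.35, 0.58])

Sigma = D.T @ D / n               # estimated asymptotic covariance
cov_p = Sigma / n                 # covariance of the estimate vector p_hat

# Contrast matrix of successive differences; H0: A p_j = 0.
A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)

diff = A @ p_hat
stat = diff @ np.linalg.solve(A @ cov_p @ A.T, diff)
# Under H0, stat is approximately chi-squared with K - 1 degrees of
# freedom; reject at level alpha if stat exceeds the 1 - alpha quantile.
```

Note the scaling: $\hat{\Sigma}$ estimates the asymptotic covariance of $\sqrt{n}(\hat{p}_j - p_j)$, so the covariance of $\hat{p}_j$ itself is $\hat{\Sigma}/n$.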

If instead inference on a single $p_j(w)$ is desired, the standard error $\hat{\sigma}_n$ of $\hat{p}_j(w)$ is given by the square root of the diagonal element of $\hat{\Sigma}$ corresponding to $w$, divided by $\sqrt{n}$. Assuming $n$ is large enough, a two-sided Wald-style hypothesis test that rejects $H_0: p_j(w) = \beta$ whenever $|\hat{p}_j(w) - \beta| / \hat{\sigma}_n$ is greater than the $1 - \alpha/2$ quantile of a standard normal distribution will have type I error no larger than $\alpha$.

2.5 CTMLE

A challenge in utilizing TMLE in the present setting is that the estimated GPS may be extremely small for some virus sequences due to high correlations between residues in the Env protein. This can lead to erratic estimates $\varepsilon_n(w)$ of $\varepsilon(w)$, which can considerably degrade the updated OR estimate $\bar{Q}_n^*$ in (2). The resulting ATE estimate could be highly biased, with correspondingly inflated type I error and poor confidence interval coverage. There are fixes available [17]; however, these approaches require manual intervention by the analyst (e.g., to identify overlapping covariate regions and redefine the causal parameter accordingly). Unfortunately, in our motivating example we wish to consider hundreds of AA residues. This motivates the pursuit of a more high-throughput analysis approach that maintains stability even in rather extreme scenarios.

Recently, data-driven PS model-building strategies have emerged [14,13]. A key insight of these approaches is that PS models should adjust only for variables that are related to the outcome. Thus, estimated ORs can be used to inform variable inclusion in propensity score models. By appropriately screening so-called instrumental variables (those that impact only the GPS but not the OR), we may generate less extreme estimates of the GPS and thereby attain more stable behavior of TMLE.

In particular, we propose using variable importance measures from an OR model to select variables to include in the GPS model. We focus explicitly on random forests for estimating both the OR and GPS, though the methods theoretically extend to any machine learning framework with associated variable importance measures. We choose random forests for two reasons: (i) robust and fast software implementations with variable importance measures are readily available; (ii) past work has shown that random forests perform extremely well in predicting antibody neutralization, even when compared with other state-of-the-art approaches like deep learning [11,19].

Our approach entails first using random forests to estimate the OR. During the random forest construction, variable importance for each AA residue is summarized by the mean decrease in Gini index. The Gini index quantifies how much each feature contributes to the decline of node impurities on average. We select a fixed number of the top-ranked features to include in the GPS model. The OR and GPS are then combined into estimates of E [ Y ( W j = w ) ] for each j and w and a test of the significance of each AA residue. We propose a CTMLE algorithm for selecting the optimal number of features to include in the GPS model.
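This selection step can be sketched as follows, using a toy design matrix in place of CATNAP data and scikit-learn's default impurity-based importances (mean decrease in Gini index); the signal positions and coefficients below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the design matrix: 300 sequences, 20 residues,
# each residue coded as a categorical level in {1, 2, 3, 4}
W = rng.integers(1, 5, size=(300, 20))
# Outcome depends only on residues 3 and 7 (hypothetical signals)
logit = 0.9 * (W[:, 3] == 2) + 1.1 * (W[:, 7] == 3) - 0.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Fit the OR forest; feature_importances_ holds the mean decrease in
# Gini index for each residue
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(W, y)

# Keep the r_k top-ranked residues for the GPS model
r_k = 5
top = np.argsort(rf.feature_importances_)[::-1][:r_k]
```

In the proposed procedure, `top` (with residue $j$ itself excluded, as described below) would define the adjustment set for the GPS model at the $k$th step.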

2.5.1 Details of CTMLE

A CTMLE-based estimate of $p_j(w)$ can be obtained via the following sequential algorithm.

  1. Fit the OR model using random forests to obtain an initial estimator Q ¯ n ( 1 ) . All covariates should be included in this model. The covariates are ranked by their feature importance.

  2. Propose $K$ candidate values, $r_1, \ldots, r_K$, for the number of covariates to include in the GPS model. We assume $r_K = J$, though that need not be the case. A sequence of GPS estimators $g_{n,j}^{(1)}, \ldots, g_{n,j}^{(K)}$ is then constructed. The $k$th estimator in this sequence estimates the conditional distribution of AAs at residue $j$ given the $r_k$ top-ranked features in the importance list specified in the previous step. If residue $j$ itself appears among the $r_k$ top-ranked features, it is excluded, and only $(r_k - 1)$ features are used in GPS estimation.

  3. For $k = 1$, an initial triplet is built as $(g_{n,j}^{(1)}, \bar{Q}_{n,j}^{(1)}, \bar{Q}_{n,j}^{\star,(1)})$, where $\bar{Q}_{n,j}^{\star,(1)}$ is a fluctuation of $\bar{Q}_{n,j}^{(1)} = \bar{Q}_n^{(1)}$ using the TMLE algorithm presented in Section 2.4. For $k = 2, \ldots, K$:

    1. Assign $\bar{Q}_{n,j}^{(k)} = \bar{Q}_{n,j}^{(k-1)}$.

    2. Obtain a corresponding $\bar{Q}_{n,j}^{\star,(k)}$ by performing the analogous TMLE steps using $\bar{Q}_{n,j}^{(k)}$ and $g_{n,j}^{(k)}$.

    3. Evaluate $L_{n,j}(\bar{Q}_{n,j}^{\star,(k)})$ as defined in equation (1). If $L_{n,j}(\bar{Q}_{n,j}^{\star,(k)}) \le L_{n,j}(\bar{Q}_{n,j}^{\star,(k-1)})$, then set the $k$th triplet to $(g_{n,j}^{(k)}, \bar{Q}_{n,j}^{(k)}, \bar{Q}_{n,j}^{\star,(k)})$; otherwise, set $\bar{Q}_{n,j}^{(k)} = \bar{Q}_{n,j}^{\star,(k-1)}$ and repeat steps 2 and 3. In this case, by the construction of the TMLE, it must be true that $L_{n,j}(\bar{Q}_{n,j}^{\star,(k)}) \le L_{n,j}(\bar{Q}_{n,j}^{\star,(k-1)})$, since the TMLE $\bar{Q}_{n,j}^{\star,(k)}$ uses $\bar{Q}_{n,j}^{\star,(k-1)}$ as its initial estimate.

Once $K$ triplets have been derived using the full data, the above procedure is repeated in $V$ training samples to obtain $K$ training-sample-specific triplets, denoted for the $v$th training sample by $(g_{n,j,v}^{(k)}, \bar{Q}_{n,j,v}^{(k)}, \bar{Q}_{n,j,v}^{\star,(k)})$. Let $S_v \subset \{1, \ldots, n\}$ denote the indices of observations in the $v$th validation sample. For $v = 1, \ldots, V$ and $k = 1, \ldots, K$, we compute

$$L_{n,v}(\bar{Q}_{n,j,v}^{\star,(k)}) = -\frac{1}{|S_v|} \sum_{i \in S_v} \left[ Y_i \log\left\{\bar{Q}_{n,j,v}^{\star,(k)}(W_{j,i}, W_{-j,i})\right\} + (1 - Y_i)\log\left\{1 - \bar{Q}_{n,j,v}^{\star,(k)}(W_{j,i}, W_{-j,i})\right\} \right],$$

and select the choice of $k$ that minimizes the cross-validated risk, $k_n = \arg\min_k \sum_{v=1}^{V} L_{n,v}(\bar{Q}_{n,j,v}^{\star,(k)})$. The final CTMLE estimate is

$$\hat{p}_{j,\mathrm{CTMLE}}(w) = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}_{n,j}^{\star,(k_n)}(w, W_{-j,i}).$$

Asymptotic properties of this estimator follow from general results pertaining to CTMLE [12].
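The cross-validated selection of $k_n$ can be sketched as follows. Here the fluctuated estimators $\bar{Q}_{n,j,v}^{\star,(k)}$ are stood in for by precomputed validation-fold prediction arrays; all inputs are hypothetical, and the function names are ours.

```python
import numpy as np

def neg_log_lik(y, q):
    """Mean negative Bernoulli log-likelihood, mirroring the loss above."""
    q = np.clip(q, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

def select_k(y_folds, val_preds):
    """Return the index k minimizing the summed validation-fold loss.

    y_folds[v]      : outcomes in the v-th validation sample S_v
    val_preds[v][k] : predictions of the k-th fluctuated OR estimator,
                      fit on the v-th training sample, for S_v
    """
    K = len(val_preds[0])
    risks = [sum(neg_log_lik(y_folds[v], val_preds[v][k])
                 for v in range(len(y_folds))) for k in range(K)]
    return int(np.argmin(risks))

rng = np.random.default_rng(1)
y_folds = [rng.binomial(1, 0.5, 50) for _ in range(5)]
# Three constant candidate predictors; the middle one matches the
# true rate 0.5 and should minimize the cross-validated risk
val_preds = [[np.full(50, p) for p in (0.2, 0.5, 0.9)] for _ in y_folds]
k_n = select_k(y_folds, val_preds)
```

The selected index then picks out the final fluctuated estimator $\bar{Q}_{n,j}^{\star,(k_n)}$ that is averaged over the empirical distribution of $W_{-j}$ to form $\hat{p}_{j,\mathrm{CTMLE}}(w)$.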

3 Simulation study

3.1 Design

We conducted Monte Carlo simulation studies to compare the performance of the proposed CTMLE to several implementations of TMLE. The first was a standard implementation of TMLE, in which all features were included in the GPS model; the other TMLEs used reduced numbers of features (5, 10, 50, or 100) based on their estimated importance in the OR model. For all estimators, GPS estimates were truncated at 0.01. We considered two sample sizes, $n \in \{500, 1{,}000\}$, which are similar to the sample sizes of the CATNAP data. For each sample size, 1,000 data sets were generated as follows. To represent a sequence of AA residues, we simulated 200 moderately correlated categorical variables, $W_1, \ldots, W_{200}$, each containing four levels, $W_j \in \{1, 2, 3, 4\}$. To generate a realization of $W = (W_1, \ldots, W_{200})$, we drew a random variable $X = (X_1, \ldots, X_{200})$ from a multivariate normal distribution $N(\mu_0, \Sigma_0)$ with $\mu_0 = (0, \ldots, 0)$ and an autoregressive-1 covariance structure, $\Sigma_0 = \sigma^2 \Sigma(\rho)$, where $\Sigma(\rho)$ is a matrix with $(j, k)$th entry equal to $\rho^{|j - k|}$. In our example, $\sigma$ and $\rho$ were fixed at 0.9 and 0.75, giving a moderate correlation between adjacent features. Next, we set $W_j = 1$ if $X_j < q_{0.25}$, $W_j = 2$ if $q_{0.25} \le X_j < q_{0.50}$, $W_j = 3$ if $q_{0.50} \le X_j < q_{0.75}$, and $W_j = 4$ if $X_j \ge q_{0.75}$, where $q_p$ is the $p$th quantile of a standard normal random variable. Five true signals were randomly selected among all features, $(W_j : j \in \mathcal{J})$, where $\mathcal{J} = \{37, 87, 94, 135, 151\}$. Given $W_i$, the outcome $Y_i$ was generated from a Bernoulli distribution with success probability $\mathrm{expit}\{\sum_{j \in \mathcal{J}} \sum_{\ell=1}^{4} \beta_{j\ell} I(W_{j,i} = \ell)\}$ (Table 2).

Table 2

True coefficients ($\beta$) used in the simulation study

             $\beta_{j1}$  $\beta_{j2}$  $\beta_{j3}$  $\beta_{j4}$
$W_{37}$     0.160         0.321         0.492         0.214
$W_{87}$     0.181         0.521         0.612         0.321
$W_{94}$     0.104         0.414         0.789         0.117
$W_{135}$    0.178         0.350         0.453         0.433
$W_{151}$    0.072         0.311         0.638         0.320

We focused on evaluating operating characteristics for three specific AA residues: (i) position 10, a position unrelated to the outcome and essentially uncorrelated with any true signals; (ii) position 85, a position unrelated to the outcome, but highly correlated with a true signal; and (iii) position 87, a position that is truly related to the outcome ($p_{87}(1) = 0.514$, $p_{87}(2) = 0.588$, $p_{87}(3) = 0.342$, $p_{87}(4) = 0.545$). For each of these sites, the C/TMLE point estimates were evaluated by their bias and mean squared error (MSE), and we compared the power of the Wald test for rejecting the null hypothesis of no relationship between AAs and the outcome.
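The data-generating process of the design above can be sketched directly, using the Table 2 coefficients; the mapping of the paper's 1-based residue positions to 0-based array columns is our implementation choice.

```python
import numpy as np
from scipy.stats import norm

def simulate(n=500, J=200, rho=0.75, sigma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    # AR-1 covariance: (j, k) entry equal to sigma^2 * rho^|j - k|
    idx = np.arange(J)
    Sigma = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(J), Sigma, size=n)
    # Cut at the quartiles of a standard normal to get levels 1..4
    W = np.digitize(X, norm.ppf([0.25, 0.50, 0.75])) + 1
    # Table 2 coefficients for the five signal residues
    beta = {37: [0.160, 0.321, 0.492, 0.214],
            87: [0.181, 0.521, 0.612, 0.321],
            94: [0.104, 0.414, 0.789, 0.117],
            135: [0.178, 0.350, 0.453, 0.433],
            151: [0.072, 0.311, 0.638, 0.320]}
    # Linear predictor sum_{j in J} sum_l beta_{jl} I(W_j = l)
    lin = sum(np.array(b)[W[:, j - 1] - 1] for j, b in beta.items())
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))  # expit
    return W, Y

W, Y = simulate()
```

Note that because the latent draws have $\sigma = 0.9$ while the cut points are standard-normal quartiles, the four levels are not exactly equiprobable, mirroring the description in the text.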

3.2 Results

We found that the bias and MSE of the TMLE estimators tended to increase as more variables were included in the GPS model (Figure 4, left and middle columns), with the largest bias seen for the true signal (position 87, blue plus). The CTMLE had considerably reduced bias and MSE relative to the standard TMLE that included all 200 features in the GPS model. Moreover, in the two null scenarios (positions 10 and 85), the standard TMLE had drastically inflated type I error (Figure 4, right column). For example, its tests incorrectly rejected the null hypothesis more than 60% of the time when $n = 1{,}000$. In contrast, CTMLE-based tests appropriately rejected the null hypothesis about 5% of the time for the null positions. For the position exhibiting a true signal (position 87), CTMLE rejected the null hypothesis 57% of the time at $n = 500$ and 82% of the time at $n = 1{,}000$.

Figure 4
Bias, MSE, and type I error rate for null cases and power for the true signal. Simulation results with sample sizes of 500 (first row) and 1,000 (second row) are shown for two null cases and one true signal. Compared with the standard TMLE approach using all 200 features, CTMLE greatly reduced bias and MSE, especially for the non-signals. It also maintained relatively high power while controlling type I error as well as the best of the five proposed TMLE configurations. As sample size increased, the power of CTMLE rose from 0.57 to 0.82.

3.3 Comparison with non-causal approach

We also compared our CTMLE-based test for identifying residues causally related to antibody neutralization to a test developed for random forest-based importance measures. Using the simulation design above, we compared the tests’ type I error rates (proportion of tests in which the null hypothesis was rejected for W 10 and W 85 ) and power under the alternative (proportion of tests in which the null hypothesis was rejected for W 87 ). We found that both tests controlled the type I error rate for the uncorrelated residue ( W 10 ); however, for the correlated residue ( W 85 ), the type I error for the random forest-based test was inflated relative to 0.05. The inflation remained at larger sample sizes. On the other hand, the random forest-based test exhibited higher power under the alternative ( W 87 ) (Table 3).

Table 3

Proportion of null hypotheses rejected using a random forest (RF)-based test and the proposed CTMLE-based test

Sample size   Method   $W_{10}$   $W_{85}$   $W_{87}$
500           RF       0.048      0.072      0.746
500           CTMLE    0.029      0.040      0.570
1,000         RF       0.056      0.079      0.971
1,000         CTMLE    0.041      0.058      0.820

$W_{10}$ is a residue that has low correlation with the residues causally related to antibody neutralization; $W_{85}$ is a residue that has strong correlation with $W_{87}$, which is causally related to antibody neutralization.

4 Data analyses

4.1 CATNAP datasets

The proposed methods were applied to the Compile, Analyze and Tally NAb Panels (CATNAP) database [7]. This database contains information on HIV virus sequences, including the HIV envelope AA sequence (the features of interest), other key variables describing structural and biological information (e.g., geographic origin), and a binary measure of whether the virus can be neutralized by a particular antibody (our outcome of interest). We analyzed five antibodies: VRC01, VRC26.08, PGT145, PGT121, and 10-1074. For each antibody, a pre-screening step was applied to exclude AA positions that were nearly constant across all virus sequences. In practice, some residues in Env exhibit little or no variation; for these residues, we wish to avoid hypothesis testing, since there is no hope of garnering sufficient statistical evidence for signal detection. In particular, at each position, AAs observed in fewer than 30 total virus sequences were combined into a single AA category. After this combination, only residues whose least-frequent AA category was still observed in more than 30 sequences were retained for analysis. Since the number of AA residues passing this variable-screening procedure is still large, we used a Bonferroni correction to control the chance of falsely rejecting null hypotheses; namely, we tested each individual residue at significance level $\alpha / N_{AA}(\text{post-screening})$, where $N_{AA}(\text{post-screening})$ is the number of AA residues passing the screening above.
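A minimal sketch of this screening rule, applied to a small synthetic alignment rather than CATNAP sequences; the function name and the pooled "rare" category label are ours.

```python
import numpy as np

def screen_residues(seqs, min_count=30):
    """Pool AAs seen in fewer than min_count sequences into one 'rare'
    category; retain a position only if it still varies and its
    least-frequent category is observed more than min_count times."""
    keep = []
    for j in range(seqs.shape[1]):
        col = seqs[:, j].astype(object)  # object dtype so 'rare' fits
        aas, counts = np.unique(col, return_counts=True)
        col[np.isin(col, aas[counts < min_count])] = "rare"
        _, counts = np.unique(col, return_counts=True)
        if len(counts) >= 2 and counts.min() > min_count:
            keep.append(j)
    return keep

n = 500
seqs = np.stack([
    np.array(["A"] * n),                    # constant: screened out
    np.array(["A"] * 250 + ["C"] * 250),    # balanced: retained
    # two rare AAs (20 each) pool into a 40-count category: retained
    np.array(["A"] * 230 + ["C"] * 230 + ["G"] * 20 + ["T"] * 20),
], axis=1)

kept = screen_residues(seqs)
alpha_per_test = 0.05 / len(kept)           # Bonferroni threshold
```

Each retained residue would then be tested at `alpha_per_test`, matching the per-residue significance level $\alpha / N_{AA}(\text{post-screening})$.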

4.2 Results

The numbers of AA residues that passed the screening process are summarized in Table 4, and plots of $\log_{10}(p\text{-value})$ for each residue are shown in Figure 5. For each antibody, the figure highlights functional regions of the gp120 protein that are putatively associated with antibody activity. Our analysis revealed that many significant residues fell within known regions (the V1/V2 loop in particular contained many significant residues); however, there were also many significant residues in gp41, particularly for VRC01.

Table 4

Summary of CATNAP data; $N_{\mathrm{obs}}$ = number of virus sequences; $N_{AA}$ = number of AA residues

Antibody    $N_{\mathrm{obs}}$   $N_{AA}$ (pre-screening)   $N_{AA}$ (post-screening)   $N_{\mathrm{sig}}$ (%)
VRC01       828                  784                        328                         23 (7.0)
VRC26.08    407                  726                        229                         3 (1.3)
PGT145      581                  755                        294                         3 (1.0)
PGT121      581                  755                        313                         8 (2.6)
10-1074     581                  755                        303                         8 (2.6)
Figure 5 
                  Significant AA residues for five antibodies. Orange highlights denote V1/V2/V3 loop regions that are putatively associated with these antibodies; purple regions highlight additional functional regions of the gp120 protein. Gray points denote residues that did not pass variability screening. (a) VRC01, (b) VRC26.08, (c) 10-1074, (d) PGT121 and (e) PGT145.

4.3 Comparison with other approaches

We compared our results for the VRC01 antibody to the LASSO- and random forest-based feature selection approaches described above (Figure 6). For the LASSO-based approach, more than 15 multi-level features were selected in the CV-selected model, so inference from the GLM in the held-out data was highly unstable. This illustrates a potential difficulty of that approach: it requires the analyst to make arbitrary, post hoc modeling decisions to stabilize inference. In contrast, our proposed CTMLE provides a fully automated procedure that balances type I and type II errors.

The random forest-based approach provided greater stability and identified 76 significant residues. However, fewer than half (16/76) were sites in gp120, which other experiments have identified as the likely region of importance for neutralization by VRC01. This result raises the possibility of a high false-positive rate for the random forest procedure in this context. Additionally, the associated variable importance measures are of dubious relevance in terms of scientific interpretation. In contrast, our CTMLE provides biologically meaningful interpretations of the estimated parameters.

Figure 6
Results for the VRC01 antibody based on three approaches: LASSO, random forests (RF), and CTMLE. The dashed lines indicate Bonferroni-corrected thresholds that account for multiple comparisons.

5 Discussion

In this study, we designed a test capable of detecting AA residues along the HIV Env protein that are significantly associated with antibody resistance. To that end, we introduced a CTMLE-based test that is able to cope with the high-dimensional and correlated nature of the data. Our simulation study showed only a relatively modest impact of correlation on performance. In particular, we found that the type I error rate was appropriately controlled for null AA residues, regardless of how correlated they were with a true residue. In contrast, the standard TMLE approach tended to have inflated type I error that increased with the strength of correlation. On the other hand, outcome-adaptive approaches tended to better control the type I error of the test. Among outcome-adaptive TMLE estimators, the type I error was generally better controlled when fewer features were included in the GPS model. However, including too few features led to overly conservative tests, where type I error was properly controlled, but power to detect true signals suffered as a result. We found that CTMLE can be used to appropriately determine the best number of features to include resulting in a test that balances type I and type II errors.

An important area for future research is the performance of the method in settings with less sparsity in the set of predictors. The extent to which the CTMLE can adapt to such settings is an open question. There is also room to optimize this work along other dimensions. First, the method could in theory be extended to machine learning frameworks beyond random forests that provide suitable rankings of feature importance. Second, other multiplicity corrections could be considered when performing hypothesis testing, such as Holm's method and Hochberg's method [20,21]. Finally, the random forest simulation results point to a need for a more extensive comparison of available methods suited to this problem.

Acknowledgments

I would like to extend my gratitude to Dr. Rajan Patel and Kinnery Naik Patel, who generously contributed the Patel-Naik Award in the Department of Biostatistics and Bioinformatics at Emory. This award is an excellent morale-booster that encouraged me to do my best work.

  1. Funding information: This work was supported by the National Institutes of Health under Grant 5UM1AI068635.

  2. Conflict of interest: Prof. David Benkeser is a member of the Editorial Board of the Journal of Causal Inference but was not involved in the review process of this article.

  3. Data availability statement: All code for this study can be found on GitHub (https://github.com/YutongJ/HIV_CTMLE).

References

[1] Grant RM, Lama JR, Anderson PL, McMahan V, Liu AY, Vargas L, et al. Preexposure chemoprophylaxis for HIV prevention in men who have sex with men. New England J Med. 2010;363(27):2587–99. doi:10.1056/NEJMoa1011205.

[2] Baeten JM, Donnell D, Ndase P, Mugo NR, Campbell JD, Wangisi J, et al. Antiretroviral prophylaxis for HIV prevention in heterosexual men and women. New England J Med. 2012;367(5):399–410. doi:10.1056/NEJMoa1108524.

[3] Centers for Disease Control and Prevention. Diagnoses of HIV infection in the United States and Dependent Areas, 2019; 2021. Accessed June 21, 2021. http://www.cdc.gov/hiv/library/reports/hiv-surveillance.html.

[4] Morris L, Mkhize NN. Prospects for passive immunity to prevent HIV infection. PLoS Med. 2017;14(11):e1002436. doi:10.1371/journal.pmed.1002436.

[5] Corey L, Gilbert PB, Juraska M, Montefiori DC, Morris L, Karuna ST, et al. Two randomized trials of neutralizing antibodies to prevent HIV-1 acquisition. New England J Med. 2021;384(11):1003–14. doi:10.1056/NEJMoa2031738.

[6] Sarzotti-Kelsoe M, Bailer RT, Turk E, Lin Cl, Bilska M, Greene KM, et al. Optimization and validation of the TZM-bl assay for standardized assessments of neutralizing antibodies against HIV-1. J Immunol Methods. 2014;409:131–46. doi:10.1016/j.jim.2013.11.022.

[7] Yoon H, Macke J, West Jr AP, Foley B, Bjorkman PJ, Korber B, et al. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. Nucleic Acids Res. 2015;43(W1):W213–9. doi:10.1093/nar/gkv404.

[8] Gilbert P, Self S, Rao M, Naficy A, Clemens J. Sieve analysis: methods for assessing from vaccine trial data how vaccine efficacy varies with genotypic and phenotypic pathogen variation. J Clin Epidemiol. 2001;54(1):68–85. doi:10.1016/S0895-4356(00)00258-4.

[9] US Food & Drug Administration. Coronavirus (COVID-19) Update: FDA limits use of certain monoclonal antibodies to treat COVID-19 due to the omicron variant. 2022 Jan. Available from: https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-fda-limits-use-certain-monoclonal-antibodies-treat-covid-19-due-omicron.

[10] Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. doi:10.1023/A:1010933404324.

[11] Magaret CA, Benkeser DC, Williamson BD, Borate BR, Carpp LN, Georgiev IS, et al. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. PLoS Comput Biol. 2019;15(4):e1006952. doi:10.1371/journal.pcbi.1006952.

[12] van der Laan MJ, Rose S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer; 2011. doi:10.1007/978-1-4419-9782-1.

[13] Ju C, Gruber S, Lendle SD, Chambaz A, Franklin JM, Wyss R, et al. Scalable collaborative targeted learning for high-dimensional data. Stat Methods Med Res. 2019;28(2):532–54. doi:10.1177/0962280217729845.

[14] Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics. 2017;73(4):1111–22. doi:10.1111/biom.12679.

[15] van der Laan MJ. Statistical inference for variable importance. Int J Biostat. 2006;2(1):2. doi:10.2202/1557-4679.1008.

[16] Janitza S, Celik E, Boulesteix AL. A computationally fast variable importance test for random forests for high-dimensional data. Adv Data Anal Classif. 2018;12(4):885–915. doi:10.1007/s11634-016-0276-4.

[17] Petersen ML, Porter KE, Gruber S, Wang Y, van der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res. 2012;21(1):31–54. doi:10.1177/0962280210386207.

[18] van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):11. doi:10.2202/1557-4679.1043.

[19] Chu Z. Using deep learning methods to predict the VRC01 neutralization sensitivity by HIV-1 gp160 sequence features [Master's thesis]. Emory University; 2020. Available from: https://etd.library.emory.edu/concern/etds/q237ht29n?locale=en.

[20] Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70.

[21] Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75(4):800–2. doi:10.1093/biomet/75.4.800.

Received: 2021-10-13
Revised: 2022-09-01
Accepted: 2022-09-02
Published Online: 2022-10-26

© 2022 Yutong Jin and David Benkeser, published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
