An interpretable classification method for predicting drug resistance in M. tuberculosis

Motivation: The prediction of drug resistance and the identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Modern methods based on testing against a catalogue of previously identified mutations often yield poor predictive performance. On the other hand, machine learning techniques have demonstrated high predictive accuracy, but many of them lack interpretability to aid in identifying specific mutations which lead to resistance. We propose a novel technique, inspired by the group testing problem and Boolean compressed sensing, which yields highly accurate predictions and interpretable results at the same time.

Results: We develop a modified version of the Boolean compressed sensing problem for identifying drug resistance, and implement its formulation as an integer linear program. This allows us to characterize the predictive accuracy of the technique and select an appropriate metric to optimize. A simple adaptation of the problem also allows us to quantify the sensitivity-specificity trade-off of our model under different regimes. We test the predictive accuracy of our approach on a variety of commonly used antibiotics in treating tuberculosis and find that it has accuracy comparable to that of standard machine learning models and points to several genes with previously identified association to drug resistance.

Availability: https://github.com/hoomanzabeti/TB_Resistance_RuleBasedClassifier

Contact: hooman_zabeti@sfu.ca

at least one infected member. The problem then becomes to find the subset of individuals whose infected status would explain all of the positive results without invalidating any of the negative ones. By carefully selecting the groups, the total number of required tests m can be drastically reduced, i.e. if n is the population size, it is possible to achieve m ≪ n.

Mathematically, a group testing problem with m tests can be described in terms of a Boolean

2:4 An interpretable classification method for drug resistance
where ∨ is the Boolean inclusive OR operator, so that (1) can also be written

\[ y_i = \bigvee_{j=1}^{n} \left( A_{ij} \wedge w_j \right), \quad i = 1, \dots, m. \]

If the vector w satisfying equation (1) is assumed to be sparse (i.e. there are few infected individuals), the problem of finding w is an instance of the sparse Boolean vector recovery problem:

\[ \min_{w \in \{0,1\}^n} \|w\|_0 \quad \text{subject to} \quad A \vee w = y, \]

where ‖w‖_0 is the number of non-zero entries in the vector w.

Due to the non-convexity of the ℓ0-norm and the nonlinearity of the Boolean matrix product, [...] rates. To obtain this information, we modify the ILP (4) in two ways.

For clarity, in the following section we assume that ĉ is a binary classifier trained on a sample y with corresponding Boolean feature matrix A. In addition, unless otherwise stated, we refer to the misclassification of a training sample as a false negative if it has label 1 (is in P), and as a false positive if it has label 0 (is in Z). For instance, in the case of drug resistance, a false negative would mean that we incorrectly predict a drug-resistant isolate as sensitive, while a false positive would mean that we incorrectly predict a drug-sensitive isolate as resistant.

First, note that in ILP (4), ξ_P corresponds to the training error of ĉ on the positively labeled subset of the data, while ξ_Z does not correspond to its training error on the negatively labeled subset. This follows from the fact that A is a binary matrix and w is a binary vector, so ξ_P is also a binary vector, with [...]

Note that the motivation behind this replacement is to count the number of non-zero elements of A_Z w by ξ_Z. Accordingly, eq. (8a) ensures that ξ_i = 0 if A_i w = 0, and eq. (8b) ensures that ξ_i = 1 if A_i w > 0. However, eq. (8a) can be eliminated in those settings where the ξ_Z enter the objective function to be minimized with a positive coefficient. We will see similar situations in the following section.
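The distinction between counting misclassified negative samples and summing the raw product A_Z w can be illustrated with a small sketch. It assumes that the rule predicts label 1 whenever at least one selected feature is present in a sample, and that the slack on the negative samples in ILP (4) equals the integer product A_Z w (the ILP itself is not reproduced in this excerpt). The function names and the toy data are hypothetical and are not taken from the authors' code.

```python
def predict(A, w):
    """Rule prediction: a sample gets label 1 if any selected feature is present."""
    return [int(any(a and s for a, s in zip(row, w))) for row in A]


def error_counts(A, w, y):
    """Return (false negatives, false positives) of the rule w on labels y."""
    y_hat = predict(A, w)
    fn = sum(1 for t, p in zip(y, y_hat) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y, y_hat) if t == 0 and p == 1)
    return fn, fp


def integer_slack_on_negatives(A, w, y):
    """Integer product A_i w for every negatively labeled sample i (assumed slack)."""
    return [sum(a * s for a, s in zip(row, w)) for row, t in zip(A, y) if t == 0]


# Toy data (hypothetical): 5 samples, 3 Boolean features.
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0]
w = [1, 1, 0]  # rule selects features 0 and 1

fn, fp = error_counts(A, w, y)
slack = integer_slack_on_negatives(A, w, y)
print(fn, fp, slack, sum(slack))
```

In this toy case a single negative sample is misclassified, yet the integer product sums to 2 because that sample carries two selected features. Forcing each negative slack to be 0 or 1 removes this overcounting.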

After these modifications, we obtain [...]

To provide the desired flexibility, we further split the regularization term into two terms corresponding to the positive class P and the negative class Z: [...]

The general form of the new ILP is now as follows: [...]

In this new formulation, λ_P and λ_Z control the trade-off between the false positives and the false negatives, and jointly influence the sparsity of the rule. This formulation can be further tailored to optimize specific evaluation metrics. In the following section we demonstrate this for sensitivity and specificity, as an example.
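The role of λ_P and λ_Z can be explored on a toy instance with the following minimal sketch. It is not the ILP itself: it assumes an objective of the form (rule size) + λ_P · (false negatives) + λ_Z · (false positives), and it replaces the ILP solver with exhaustive enumeration, which is only feasible for a handful of features. The names `rule_errors` and `best_rule` and the data are hypothetical.

```python
from itertools import product


def rule_errors(A, w, y):
    """Return (false negatives, false positives) of the OR-rule w."""
    fn = fp = 0
    for row, label in zip(A, y):
        fired = any(a and s for a, s in zip(row, w))
        if label == 1 and not fired:
            fn += 1
        elif label == 0 and fired:
            fp += 1
    return fn, fp


def best_rule(A, y, lam_p, lam_z):
    """Enumerate all binary w and return (cost, w, fn, fp) with minimal cost.

    Assumed cost: sum(w) + lam_p * fn + lam_z * fp.
    """
    n = len(A[0])
    best = None
    for w in product((0, 1), repeat=n):
        fn, fp = rule_errors(A, w, y)
        cost = sum(w) + lam_p * fn + lam_z * fp
        if best is None or cost < best[0]:
            best = (cost, w, fn, fp)
    return best


# Toy data (hypothetical): 3 positive and 2 negative samples, 3 features.
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0]

# Penalizing false negatives heavily favors a larger rule that covers all positives.
print(best_rule(A, y, lam_p=2.0, lam_z=0.1))
# Penalizing false positives more heavily favors a smaller, more specific rule.
print(best_rule(A, y, lam_p=1.5, lam_z=2.0))
```

On this toy data the first setting selects all three features and accepts one false positive, while the second keeps a single feature, avoids all false positives, and accepts two false negatives.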

Since the ILP formulation in (11) provides us with direct access to the two components of the training error, we may modify the classifier to optimize a specific evaluation metric. For instance, assume that we would like to train the classifier ĉ to maximize the sensitivity at a given specificity threshold t. First, recall that [...]

From equation (10), equation (12) and the definition of Z, we get the constraint [...]

Our objective is to maximize sensitivity, which is equivalent to minimizing Σ_{i∈P} ξ_i by equations (13) and (6). Hence, the ILP (11) can be modified as follows: [...]

The maximum specificity at given sensitivity can be found analogously. [...]

From equations (10), (16) and the definition of Z we get [...]

Assuming further that the limit on rule size is equal to ŝ, we have the following constraint: [...]

Therefore, the modified version of the ILP (11) suitable for computing an AUROC is: [...]
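The specificity constraint discussed in this section can be motivated by a short derivation from the standard definitions of sensitivity and specificity. The sketch below assumes that, after the modifications described earlier, Σ_{i∈Z} ξ_i equals the number of false positives and Σ_{i∈P} ξ_i equals the number of false negatives; the exact form and numbering of the paper's own equations are not reproduced here.

```latex
\begin{align*}
\text{specificity} &= \frac{TN}{TN + FP} = 1 - \frac{FP}{|Z|}
                    = 1 - \frac{1}{|Z|} \sum_{i \in Z} \xi_i , \\
\text{specificity} \ge t
  &\iff \sum_{i \in Z} \xi_i \le (1 - t)\,|Z| , \\
\text{sensitivity} &= \frac{TP}{TP + FN} = 1 - \frac{FN}{|P|}
                    = 1 - \frac{1}{|P|} \sum_{i \in P} \xi_i .
\end{align*}
```

Since |P| is fixed by the training data, maximizing sensitivity amounts to minimizing Σ_{i∈P} ξ_i, and the specificity requirement becomes a single linear constraint on the slack variables of the negative class.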

Sensitivity at a fixed specificity
As another evaluation criterion, we compute the sensitivity of our model at a desired specificity level (i.e. β% specificity). To do so, we use the ILP (15). In this formulation, the λ_P parameter can be tuned to provide the desired trade-off between the sparsity of the classifier (i.e., rule size) and the number of false negatives. However, in order to make a consistent comparison between the trained models for different drugs, we set a specific limit on rule [...]

[...] insights into interpretable models, by emphasizing the balance between these characteristics.
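The fixed-specificity evaluation described earlier in this section can be mimicked on a toy instance with the following sketch. It assumes the task is to pick at most ŝ features so that sensitivity is maximal while specificity stays at or above a threshold, and it uses exhaustive search in place of the ILP, so it only scales to a few features. The function names, the tie-breaking (first rule found, smallest size first) and the data are hypothetical.

```python
from itertools import combinations


def sens_spec(A, y, selected):
    """Sensitivity and specificity of the OR-rule over the selected feature indices.

    Assumes y contains at least one positive and one negative label.
    """
    tp = fn = tn = fp = 0
    for row, label in zip(A, y):
        fired = any(row[j] for j in selected)
        if label == 1:
            tp += int(fired)
            fn += int(not fired)
        else:
            fp += int(fired)
            tn += int(not fired)
    return tp / (tp + fn), tn / (tn + fp)


def max_sensitivity(A, y, min_spec, max_rule_size):
    """Best (sensitivity, selected features) with specificity >= min_spec."""
    n = len(A[0])
    best = (-1.0, ())
    for k in range(max_rule_size + 1):
        for selected in combinations(range(n), k):
            sens, spec = sens_spec(A, y, selected)
            if spec >= min_spec and sens > best[0]:
                best = (sens, selected)
    return best


# Toy data (hypothetical): 4 positive samples followed by 4 negative samples.
A = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

print(max_sensitivity(A, y, min_spec=1.0, max_rule_size=3))
print(max_sensitivity(A, y, min_spec=0.75, max_rule_size=3))
```

On this toy data, relaxing the required specificity from 1.0 to 0.75 allows one more feature into the rule and raises the attainable sensitivity from 0.75 to 1.0.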

In this section, we use the PDR framework to evaluate our models in the following ways.

First, in Section 4. [...]

Table 2
Comparison between AUROCs of models produced by our method with different rule size limits. We observe that even small rule sizes produce models with a high AUROC.

Figure 1
Comparison between the test AUROC of our rule-based model (with no limit imposed on the rule size), ℓ1-regularized logistic regression and Random Forest.

Table 3
Comparison between AUROCs of models produced by ℓ1-regularized logistic regression with different numbers of non-zero regression coefficients.

Figure 2
Test AUROC for models trained on each drug with various rule size limits. Beyond a certain rule size, which varies with the drug, the AUROC of the predictive model no longer improves.

Figure 2 demonstrates the change in AUROC as we increase the limit on the rule size. Our results show that as the limit on the rule size increases, we get higher AUROC on the training set. However, on the test set, we see that the AUROC increases more slowly after a rule size limit of 10, and eventually starts to decrease.

As shown in Figure 2 and Table 2, the AUROC does not increase significantly beyond a rule size limit of 10. Thus, our method is capable of producing models with rule sizes small enough to keep the model simple yet keep the AUROC within 1% of the maximum. Table 3 shows the same trend for the ℓ1-regularized logistic regression. We see that, at the low rule-size limits (such as 10 and 20), our approach produces performance comparable to that of ℓ1-regularized logistic regression, while it is slightly worse for larger rule-size limits. At the same time, as we show in Figures 4a and 4b below, our approach results in the recovery of many more genes known to be associated with drug resistance than logistic regression.

Figure 3
Comparison between the sensitivity of our rule-based method with the rule size limit set to 20, ℓ1-regularized logistic regression and Random Forest at around 90% specificity on the testing data.

Our results show that the models produced by our method contain many SNPs in genes [...] whether it has a known association to drug resistance ("known") or not ("unknown"), with an additional class for SNPs in intergenic regions. We show these numbers for ℓ1-regularized logistic regression models with as close as possible to 20 non-zero regression coefficients in Figure 4b.

A comparison between these figures suggests that when both approaches are restricted to a small number of features, our approach detects more relevant SNPs than ℓ1-regularized logistic regression.

The list of "known" genes, selected through a data-driven and consensus-driven process by a panel of experts, is the one in [31], containing 183 out of over 4,000 M. tuberculosis genes. We note that in both cases, a group of SNPs in perfect linkage disequilibrium was coded as "known" if at least one of the SNPs was contained in a known gene, "intergenic" if none of them appeared in a gene, and as "unknown" otherwise.

Running time
We run our code on a cluster node with 2 CPU sockets, each with an 8-core 2.60 GHz Intel [...]

[...] knowledge of the functional effects of individual SNPs. We also note that the genes we define as having a known association to drug resistance are not specific to the drug being tested, i.e. some of them may have been found to be associated with the resistance to a drug other than the one being predicted. This is to be expected, however, as the distinct resistance mechanisms are generally less numerous than antibiotics [44]. It will be interesting to see whether methods such as ours are able to detect specific,

Our goal in this paper was to introduce a novel method for producing interpretable models and explore its accuracy, descriptive ability, and relevance in detecting drug resistance in Mycobacterium tuberculosis isolates. In this study, the focus was mostly on the predictive accuracy, and we will explore the similarities and differences between our model and other [...]

Technical report, Review on Antimicrobial Resistance, 2014. URL: https://amr-review.org/