netDx: Interpretable patient classification using integrated patient similarity networks

Patient classification has widespread biomedical and clinical applications, including diagnosis, prognosis and treatment response prediction. A clinically useful prediction algorithm should be accurate, generalizable, be able to integrate diverse data types, and handle sparse data. Importantly, the resulting model should be easily interpretable, as clinicians are unlikely to trust black box statistical models. We describe netDx, the first supervised patient classification framework based on patient similarity networks. netDx meets the above criteria and particularly excels at data integration and model interpretability. We demonstrate the features of this framework by integrating up to six heterogeneous datatypes, including clinical variables, DNA methylation, somatic mutations, mRNA, miRNA and protein expression profiles, for survival prediction in kidney, lung, ovarian and brain cancer. We benchmarked netDx performance as a machine-learning method by predicting binary survival in four tumour types. netDx ranks at the top for two tumours and within the top 20th percentile for all four, demonstrating consistently good performance. In comparison to traditional machine learning-based patient classifiers, netDx results are more interpretable, visualizing the decision boundary in the context of patient similarity space and identifying biological pathways and other features important for prediction. By defining patient similarity using pathway-level gene expression, netDx identifies known molecular correlates of poor survival in kidney cancer, and identifies potentially novel pathways and biomarkers. Thus, netDx can serve both as a useful classifier and as a tool for discovery of biological features characteristic of disease. An open-source R/Java implementation of netDx is available along with sample files and automation workflows packaged as vignettes.

open--source R/Java implementation of netDx is available along with sample files and automation workflows packaged as vignettes.

Introduction
The goal of precision medicine is to build quantitative models that guide clinical decision-making by predicting disease risk and response to treatment using data measured for an individual. Within the next five years, several countries will have general--purpose cohort databases with 10,000 to >1 million patients, with linked genetics, electronic health records, metabolite status, and detailed clinical phenotyping; examples of projects underway include the UK BioBank 1 , the US NIH Precision Medicine Initiative (www.whitehouse.gov/precision--medicine), and the Million Veteran Program (http://www.research.va.gov/MVP/). Additionally, human disease specific research projects are profiling multiple data types across thousands of individuals, including genetic and genomic assays, brain imaging, behavioural testing and clinical history from integrated electronic medical records 2--4 (e.g. the Cancer Genome Atlas, http://cancergenome.nih.gov/). Computational methods to integrate these diverse patient data for analysis and prediction will aid understanding of disease architecture and promise to provide actionable clinical guidance.
Statistical models that predict disease risk or outcome are in routine clinical use in fields such as cardiology, metabolic disorders, and oncology 5--8 . Traditional clinical risk prediction models typically use generalized linear regression or survival analysis, in which individual measures are incorporated as terms (or features) of a single equation. Standard methods of this type have limitations analyzing large data from genomic assays. Machine learning methods can handle large data, but are often treated as black boxes that require substantial effort to interpret how specific features contribute to prediction. Black box methods are unlikely to be clinically successful, as physicians must understand the characteristic features of a disease to make a confident diagnosis 9 . Further, many existing methods do not natively handle missing data, requiring data pruning or imputation, and have difficulty integrating multiple different data types.
The patient similarity network framework can overcome these challenges and excels at integrating heterogeneous data and generating intuitive, interpretable models. In this framework, each input patient data feature (e.g. gene expression profile, age) is represented as a patient similarity network (PSN). Each PSN node is an individual patient and an edge between two patients corresponds to pairwise similarity for a given feature.
For instance, two patients could be similar in age, mutation status or transcriptome. PSNs can be constructed based on any available data using a similarity measure. Because all data is converted to a single type of input (networks), integration across diverse data types is straightforward. Patient similarity networks (PSN) have been used successfully for unsupervised class discovery in cancer and type 2 diabetes 10,11 .
We describe netDx, the first PSN--based approach for supervised patient classification. In this system, patients of unknown status can be classified based on their similarity to patients with known status. This process is clinically intuitive because it is analogous to clinical diagnosis, which often involves a physician relating a patient to a mental database of similar patients they have seen. As demonstrated below, netDx has strengths in classification performance, heterogeneous data integration, usability and interpretability.

Algorithm Description
The overall netDx workflow is shown in Figure 1. This example conceptually shows how PSNs can be used to predict if a patient is at high or low risk of developing a disease based on a variety of patient--level data types. Similarity networks are computed for each patient pair and for each data type. In this example, high--risk patients are more strongly connected based on their clinical profile, which may capture age and smoking status, and metabolomics profile. Low--risk patients are more similar in their clinical and genomic profiles. The goal of netDx is to identify the input features predictive of high and low risk, and to accurately assign new patients to the correct class.
Input data design. Each patient similarity network (PSN) is a feature, similar to a variable in a regression model (we use the terms "input networks" and "features" interchangeably).
A PSN can be generated from any kind of patient data, using a pairwise patient similarity measure ( Figure 1A). For example, gene expression profile similarity can be measured using Pearson correlation, while patient age similarity can be measured by the normalized difference. A reasonable design is to define one similarity network per data type, such as a single network based on correlating the expression of all genes in the human genome, or a network based on similarity of responses to a clinical questionnaire. If a data type is multivariate, it is more interpretable to define a network for each individual variable.
However, this approach may lead to too many features generated (e.g. millions of SNPs), which increases computational resource requirements and risk of overfitting. Thus, there is a trade--off between interpretability and overfitting/scalability, which is implicit in machine learning feature design. To help address this problem for 'omics data, we group gene--based measurements into biological pathways, which we assume capture relevant aspects of cellular and physiological processes underlying disease and normal phenotypes. This biological process--based design generates ~2,000 networks from gene expression profiles containing over 20,000 genes, with one network per pathway.

Selecting features informative of class prediction. Feature selection identifies the input
networks with the highest generalizable predictive power, and is run once per patient class. netDx is trained on samples from the class of interest, using cross--validation ( Figure   S1) and an established association network integration algorithm 12,13 . The algorithm scores each network based on its value in the classification task. The ideal network is one connecting all patients of the same class without any connections to other classes. The least useful network is one that connects patients from one class to patients from other classes, without connecting any patients in the same class. In each cross--validation fold, regularized linear regression assigns network weights, reflecting the ability to discriminate query patients from others, and removes uninformative networks. netDx increases a network's score based on the frequency with which it is assigned a positive weight in multiple cross-validation folds. The classifier's sensitivity and specificity can be tuned by thresholding this score; a network with a higher score achieves greater specificity and lower sensitivity. The output of this feature selection step is a set of networks that can be integrated to produce a predictor for the patient class of interest. Scores for each feature are returned and if pathway features are used, they are visualized using an enrichment map ( Figure  1D) 16 . The integrated patient network is visualized and used to assess the strength of class separation, and distance of one patient to others in the class, using network topology measures (Online Methods, Figure 1C).

Predictor checklist.
Each netDx classifier should be assessed using a checklist of tests to gain confidence in the classification results ( Figure 1C). Such a checklist should include: Each test results in a pass, fail or conditional pass. This simple categorization helps ease overall performance assessment ( Figure 1C).
Benchmarking performance by predicting binarized survival in cancer survival), generating for each network a score between 0 and 10. Networks that scored better than 8 out of 10 were used to classify blind test samples. This process was repeated 100 times for random splits of train and blind test ( Figure 2A). Predictor performance was measured as the average of blind test classification across the 100 splits. A network that consistently scored well (10 out of 10) across all 100 splits was selected for the final predictor.
We use AUROC to measure performance as it is used by the PanCancer survival project. All Assessing predictor reliability through a checklist We evaluated each predictor using a set of tests ( Figure 3, Table 1). First, a predictor needed to perform better than chance, as measured by average AUROC and AUPR across the 100 splits ( Figure 3A). To validate class assignments, we compared the survival profiles for netDx--predicted good and poor survivors using a log--rank test, and measured the fraction of the 100 splits with p < 0.05 ( Figure  3B). We categorized a predictor has having fully passed the test (>75% splits with p < 0.05), conditionally passed the test (50--75% splits with p < 0.05) or failed the test (<50% splits with p < 0.05). Figure S6 shows all results for this test. KIRC passes this test for any predictor that includes clinical data The comparison of checklist performance for the best scenario for KIRC and OV shows that KIRC consistently passes the checklist tests, making it a more reliable predictor ( Figure 3).
The KIRC survival predictor, particularly when used with clinical data, was the only consistently reliable predictor of all configurations and tumour types we tested. Predictors for OV survival are less reliable ( Figure S7). None of the conditions tested for LUSC or GBM were reliable in our tests ( Figure  2C, E; Figure  S7). The checklist framework helps identify reliable predictors among all tested configurations.
Evaluating the predictive performance of individual clinical variables Our first survival predictor, described above, used one network for each data type. Clinical data is composed of relatively few variables and separating these could result in a more interpretable predictor. To test the effect of coding each variable as a separate feature, we created a KIRC predictor splitting clinical data into three networks, one each for tumour stage, grade, and patient age. This predictor shows a significant improvement in AUROC score compared to our original classifier ( Figure  class, thus we relaxed the feature selection threshold to 10 out of 10 in >= 70% of splits. Feature--selected pathways for good survival from gene expression include "reactions specific to the complex N--glycan synthesis pathway" and "thyroxine biosynthesis", both related to glycoprotein hormones. Six pathways were feature--selected for poor survival, with themes including salt transport, vitamin and co--factor metabolism and cell adhesion ( Figure 4; scores of top networks in Table S2 Using pathway--based transcriptomic features in netDx results in a predictor that outperforms one where mRNA data is treated as a single feature (mean AUROC=0.73 for pathways; 0.66 for single; one--sided WMW p < 3.1e--9; Figure S10A). Additionally, the pathway--based netDx predictor results in an improved separation of survival curves for classified samples (one--sided WMW; p <1.4e--3, Figure S10B). The integrated patient network for single--feature and pathway--based features is similar, in that good survivors group closer together than opposite--class pairs, but poor survivors do not ( Figure S10C). Therefore, pathway--level features perform better than the single--network configuration on the predictor checklist.
To examine how strongly each pathway feature correlates with outcome, we performed Altogether, our results show that using pathway features provides performance improvements and substantially improves biological interpretability and insight into disease mechanisms.

Discussion
We describe netDx, the first supervised clinical sample classification system based on a patient similarity network framework, and demonstrate its features using multimodal data from four different tumour types. This framework can be used to create accurate, generalizable predictors, and has particular strengths in data integration and netDx is flexible and general. Heterogeneous datatypes are converted into a common "patient similarity" space, easing their integration. We integrated up to six major data types, including five 'omic layers ( Figure 2). netDx can accept input with missing values, because the network association algorithm it uses for feature--scoring has this capability 15 .
In this case, the patient is not represented in the particular network where its value is missing, but it will be in other networks that have data for that patient. netDx also includes support for feature grouping to improve interpretability while keeping feature number low, to mitigate the risk of overfitting and improve signal detection with sparse data. This may improve prediction performance. We demonstrate this functionality by grouping gene-level expression measures into pathway--level features (Figure 4). Users may construct groupings for any patient data type, though groupings with clear clinical or mechanistic interpretation will aid class interpretation.
We implemented a predictor checklist or "report card" to evaluate predictors based on classification performance, model interpretability and consistency (Figure  1,3; Tables  1,2).
While netDx provides output useful to construct a checklist, some checklist items are specific for a classification task (e.g. log--rank test for survival). We hope that including diverse performance measures will contribute to greater insight into possible cause and effect relationships in the model (e.g. Bradford Hill criteria for inferring causation 28 ).
In the PanCancer survival prediction example, the renal clear cell carcinoma (KIRC) survival predictor consistently outperformed predictors for the other tumor types ( Figure   2,3). The predictors for glioblastoma multiforme (GBM) and ovarian carcinoma perform worse than those for KIRC, despite having comparable or substantially more samples. Thus, increased sample size is not a guarantee of improved performance. This variation may be because survival was dichotomized to balance sample sizes in the two groups, rather than based on some clinical or biological criterion; the threshold for good survival is one year for GBM, while that for KIRC is the longest, at four years. Another possibility is that the 'omic data may not contain detectable signal for survival time in GBM. Also, in all instances, clinical data outperforms 'omic data in predicting survival. This seems surprising because the 'omics data is much larger than the clinical data. However, tumour stage and grade, which measure the spread and size of a tumour, are well known to negatively impact survival time ( Figure S9B). Genomic data is still valuable, as it enables netDx to provide mechanistic insight into disease via use of pathway features. Thus, it is useful to analyze all available data to support both prediction performance and biological discovery.
While the pathway--based predictor performs better than the one where mRNA data is presented as a single similarity network, we observed that we can create a predictor with performance similar to that of the pathway--based predictor by using randomly generated "pseudo" pathways containing non--pathway genes ( Figure S10). We interpret this to mean that grouping genes reduces noise and improves good vs. poor survival signal in the gene expression data. Consistent with this idea, destroying the correlation structure of the gene expression matrix drops performance to random ( Figure S12). Use of real pathways selects those that match known biology of each cancer type, which we interpret to mean that netDx can select appropriate, interpretable pathways. Thus, pathways may improve performance by providing both biological signal and general noise reduction.
netDx provides a complete framework for precision medicine, however the ultimate vision is to enable clinical researchers to assess classification performance for questions of interest, such as 'will a patient respond to one therapy or another?' based on patient measurements and outcomes present in large electronic medical record databases. Output   Project. The y--axis shows the mean AUROC for a given data combination and algorithm, and points are sorted by decreasing performance (rank, ties averaged). Each panel shows results for eight machine--learning methods and multiple input data combinations for a given tumour type.   where 1<=k<=5, the similarity S between two patients a and b is defined as the average of normalized differences for each of the variables:

Integrated patient network
The integrated patient network starts by compiling the edges from all feature--selected networks into a single network, such that each pair of patients now has multiple similarity edges. The similarity between two patients in the integrated network is the mean of these pairwise similarities. Visually, the goal is to view more similar patients as being more tightly grouped, and more dissimilar patients as being farther apart. Similarity is therefore converted to dissimilarity, defined as 1--similarity. Weighted shortest path distances are abs(a b) max(g) min(g)  9 . Only pathways with 10 to 500 genes were included (1,801 pathways). Pathway--level patient similarity was defined as the Pearson correlation of the expression vectors corresponding to member genes, and the network was sparsified (see next section).

Sparsification of input networks
All negative similarities are set to zero. The 50 strongest correlations are kept per patient.
In case of ties, all interactions tied with the 50th ranked interaction are retained, for a maximum of 2% of the sample size, or 600 patients. This follows the parameters established in the GeneMANIA algorithm (used here to integrate networks) for gene expression correlation network sparsification 10 .

Map of feature--selected networks
The Enrichment Map app (v2.1.1--HOTFIX_1) in Cytoscape 3.5.1 11 was used to generate the map of selected pathways in Figure 2D 9 . A Jaccard overlap threshold of 0.05 was used to prune identical gene sets. AutoAnnotate v1.1.0 was used to cluster similar pathways using MCL clustering with default parameters. The network was visualized in Cytoscape 3.5.1 30 .
The weighted shortest path between patient classes (a node set) was computed using Dijkstra's method (igraph v1.01 12 ); distance was defined as 1--similarity (or edge weight from a patient similarity network). The overall shortest path was defined as the mean pairwise shortest--path for a node set. using solely gene expression data. "Pseudo pathways" were generated by randomly-sampling exactly 10 genes, out of a universe of genes not annotated as being in pathways (11,326 genes

Pathway name
Max score in >=70% trials Abacavir transport and metabolism 10 Bile salt and organic anion SLC transporters 10 Metabolism of water--soluble vitamins and cofactors 10 Platelet Adhesion to exposed collagen 10 Regulation of IFNA signaling