netDx: interpretable patient classification using integrated patient similarity networks

Abstract
Patient classification has widespread biomedical and clinical applications, including diagnosis, prognosis, and treatment response prediction. A clinically useful prediction algorithm should be accurate and generalizable, integrate diverse data types, and handle sparse data. A clinical predictor based on genomic data needs to be interpretable to drive hypothesis‐driven research into new treatments. We describe netDx, a novel supervised patient classification framework based on patient similarity networks, which meets these criteria. In a cancer survival benchmark dataset integrating up to six data types in four cancer types, netDx significantly outperforms most other machine‐learning approaches across most cancer types. Compared to traditional machine‐learning‐based patient classifiers, netDx results are more interpretable, visualizing the decision boundary in the context of patient similarity space. When patient similarity is defined by pathway‐level gene expression, netDx identifies biological pathways important for outcome prediction, as demonstrated in breast cancer and asthma. netDx can serve both as a patient classifier and as a tool for discovery of biological features characteristic of disease. We provide a free software implementation of netDx with automation workflows.


Supplementary Figures
Appendix Figure S1. Variation in univariate filtering by lasso regression.
Appendix Figure S2. Variation in feature-level scores with increasing number of train/test splits.
Appendix Figure S3. Comparison of netDx and Gene Set Enrichment Analysis for expression-based binary LumA prediction in breast cancer.
Appendix Figure S4. Comparison of selected features from netDx and DIABLO binary breast tumour classifier using RNA and miRNA data.

Supplementary Tables
Appendix Table S1. Comparison of predictor methods for netDx and other methods (PanCancer Survival).
Appendix Table S2. Comparison of netDx performance to PanCancer Survival project broken down by machine-learning algorithm. Bold indicates best AUROC value or significant p-value.
Appendix Table S3. Mean AUROC values reproduced from the PanCancer Survival project.
Appendix Table S4. netDx scores for pathway-level features in asthma case/control prediction. Score shown is the best achieved by a given network for over 70% of the 100 trials. Only networks scoring a maximum of three or more out of 10 in over 70% of trials are shown here.
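
The reporting rule used for Appendix Table S4 can be illustrated with a minimal sketch. It assumes one reading of the caption: the reported score is the highest value that a network reached or exceeded in strictly more than 70% of trials. The function name `consistent_score` and the example scores are hypothetical and are not part of the netDx software.

```python
from typing import Sequence


def consistent_score(scores: Sequence[int], min_fraction: float = 0.7) -> int:
    """Highest score s such that strictly more than `min_fraction` of
    trials scored at least s. Returns 0 if no score qualifies.

    Assumption: "best achieved for over 70% of trials" is read as
    "reached or exceeded in more than 70% of trials".
    """
    n_trials = len(scores)
    best = 0
    # Candidate thresholds in ascending order; the fraction of trials at or
    # above a threshold can only shrink as the threshold grows, so the last
    # candidate that passes is the maximum.
    for s in sorted(set(scores)):
        n_at_least = sum(1 for x in scores if x >= s)
        if n_at_least / n_trials > min_fraction:
            best = s
    return best


# Hypothetical per-trial scores (out of 10) for one network over 100 trials.
example = [10] * 50 + [5] * 30 + [2] * 20
reported = consistent_score(example)
shown_in_table = reported >= 3  # table cutoff: three or more out of 10
```

Under this reading, the example network would be reported with a score of 5, because 80 of 100 trials reached at least 5 while only 50 reached 10.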

Supplementary Figures
Appendix Figure S1. Variation in univariate filtering by lasso regression. Each panel shows the frequency with which a given measure (e.g. transcript for RNA, or protein for RPPA) had a non-zero weight, out of 20 train/test splits. Data are shown for ovarian cancer survival prediction. The predictor was run for 20 train/test splits. Within each split, lasso regression was run on training samples only (i.e. within cross-validation), and only variables with non-zero weights were used to create patient similarity networks. The x-axis starts at 1. The percentage of variables that never passed lasso regression was: sCNA: 68.8%; DNAm: 99.3%; mRNA: 99.1%; miRNA: 94.4%; RPPA: 72.1%.
Appendix Figure S2. Variation in feature-level scores with increasing number of train/test splits. The plot shows variance (σ²) in pathway-level score (out of 10) for the Luminal A ("LumA") class, for gene-expression based binary classification of breast tumours. Each boxplot shows data for a different cumulative number of train/test splits; e.g. the boxplot at x=15 shows pathway-level variance for 15 train/test splits.
Appendix Figure S3. Comparison of netDx and Gene Set Enrichment Analysis for expression-based binary LumA prediction in breast cancer. In the enrichment map shown, nodes indicate pathways, and edges indicate shared genes. Node fill indicates whether a pathway was significant in the GSEA analysis (yellow, Q <= 0.05, N=126 pathways), was consistently high-scoring in netDx (magenta; scores>=7 out of 10 in >=70% of 100 splits, N=80 pathways), or both (split fill). Node size represents gene set size. Nodes were connected if they share 40% or more of genes in their gene sets (similarity). Singleton nodes (i.e. nodes not connected to any other nodes) were moved into related clusters if they were found to be connected to at least one node in that cluster in a map with a lower (50%) gene set similarity threshold; other singleton nodes are listed in the full set of pathways in Dataset EV4. The EnrichmentMap app in Cytoscape was used to generate the map (Merico et al., 2011), and the AutoAnnotate app was used to cluster pathways and thematically label clusters (Kucera et al., 2016).
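
The edge rule described for the enrichment map (connect two pathways when their gene sets share 40% or more of genes) can be sketched as follows. The caption does not state which similarity coefficient was used, so the Jaccard index here is an assumption; the function names, pathway names, and gene identifiers are hypothetical, and the sketch is independent of the EnrichmentMap app itself.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(
    gene_sets: Dict[str, Set[str]], threshold: float = 0.4
) -> List[Tuple[str, str, float]]:
    """Return (pathway, pathway, similarity) for every pair at or above
    the threshold. Assumption: similarity is the Jaccard index."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        sim = jaccard(gene_sets[name_a], gene_sets[name_b])
        if sim >= threshold:
            edges.append((name_a, name_b, sim))
    return edges


def singletons(
    gene_sets: Dict[str, Set[str]], edges: List[Tuple[str, str, float]]
) -> List[str]:
    """Pathways that are not connected to any other pathway."""
    connected = {name for a, b, _ in edges for name in (a, b)}
    return sorted(name for name in gene_sets if name not in connected)


# Hypothetical gene sets for illustration only.
example_sets = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g2", "g3", "g4"},
    "pathway_C": {"g3", "g5", "g6", "g7"},
}
example_edges = similarity_edges(example_sets, threshold=0.4)
example_singletons = singletons(example_sets, example_edges)
```

In this toy input, `pathway_A` and `pathway_B` share two of four distinct genes and are connected, while `pathway_C` shares only one gene with each and remains a singleton.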