Caution regarding the specificities of pan-cancer microbial structure

Results published in an article by Poore et al. (Nature. 2020;579:567–574) suggested that machine learning models can almost perfectly distinguish between tumour types based on their microbial composition using machine learning models. Whilst we believe that there is the potential for microbial composition to be used in this manner, we have concerns with the paper that make us question the certainty of the conclusions drawn. We believe there are issues in the areas of the contribution of contamination, handling of batch effects, false positive classifications and limitations in the machine learning approaches used. This makes it difficult to identify whether the authors have identified true biological signal and how robust these models would be in use as clinical biomarkers. We commend Poore et al. on their approach to open data and reproducibility that has enabled this analysis. We hope that this discourse assists the future development of machine learning models and hypothesis generation in microbiome research.


MODELS PRONOUNCE NONSENSICAL GENERA ARE INFORMATIVE OF TUMOUR TYPE
Even when the model does appear to identify samples better than the negative predictor, we have concerns that many of the key features used in the model are implausible.For example, the model predicting adrenocortical carcinoma is significantly better than a negative predictor (P=0.002) and boasts high sensitivity (0.9565), specificity (0.998) and positive predictive values (0.71).Therefore, this model should hold some features that truly distinguish it from the remaining cancer types.The top ten most important features for this model are Hepandensovirus (relative feature importance score: 9431, a virus that infects crustaceans [3]), Paeniclostridium (973), Comovirus (846), Thalassomonas (267, bacteria causing coral disease [4]), Simkania (160), Cronobacter (151), Simonsiella (148), Leucothrix (145, bacteria from marine macroalgae [5]), Phikmvlikevirus (128) and N4likevirus (88).It is unclear how Phikmvlikevirus and N4likevirus might be informative for adrenocortical carcinoma as they are bacteriophages and therefore would be dependent on the co-occurrence of their bacterial hosts in the adrenal glands (or alternatively the remaining anatomical locations [6,7]).Many of the top performing features of other models under the most stringent decontamination approach also seem nonsensical (Table 1).This point is not covered by the Whalen pitfalls because it is generally presumed that the features being modelled exist to begin with, which in the case of taxonomic classification is not always true.Some models do demonstrate plausible and promising results.For example, in hepatocellular carcinoma, Orthohepadnavirus is known to have a causal relationship with cancer formation [8] and has been found to be specific to the liver in other datasets [9].This is reflected well in Poore et al.'s model where the estimated variable importance score of Orthohepadnavirus in their model (2020.53)dwarfs the next most 'important' feature (Levivirus, 975.09).Despite this, the model is still not significantly better than a negative predictor [P-value (Acc>NIR)=1] and suffers a poor positive predictive value (0.4).

POTENTIAL FOR READ MISCLASSIFICATION
We believe that these nonsensical genera arise because the models produced by Poore et al. are built on many features that are likely to be taxonomically misclassified, from human reads or other contamination [10][11][12][13][14], and therefore do not originate from microbes in the sample.One possible reason for these misclassifications is that extra steps were not taken to remove human reads prior to model building.Poore et al. detail the extraction of reads unaligned to a human reference genome which are then the subject of taxonomic classification.The authors claim to have used 'very stringent decontamination analyses that discarded up to 92.3 % of total sequence data' .This would suggest that 7.7 % of all sequencing reads were subject to taxonomic classification.This pool of reads will still contain human reads which have not aligned [15].For example, this could be because the reads are of low quality or they are mutated in cancer genomes, or due to sequencing artefacts.In addition, the authors detail no human reference sequences in their taxonomic database, using 59 974 microbial genomes only.Therefore, it is highly likely that human sequences will have been misclassified as microbial.The subsequent application of SHOGUN alignment of Kraken-classified reads is more specific but may still involve the inappropriate classification of human reads to a database with no representation of the human genome.Additional human depletion filtering and steps to remove contamination such as those employed by the cancer microbiome atlas to distinguish tissue-resident microbiota from contaminants would have helped to remove misclassifications [16].

Impact Statement
Finding evidence for microbes within cancer whole genome sequences is an emerging technology.These data are challenging to analyse, without the added complexities of resolving batch effects.We show that the normalization used on these data introduces false biological signal into the data where it should not exist.This results in hyperinflated machine learning performance, therefore overstating the use of microbes to distinguish between tumour types in these data.Moreover, many of the taxa reported in the dataset are highly unlikely to exist in vivo due to their extremophile nature and/or have low levels of evidence in terms of the number of classified sequencing reads.There have been numerous publications that re-analysed the Poore and Kopylova et al. data and, consequently, many of these studies are flawed.We wish to urge caution to the scientific community when analysing these data by highlighting these flaws and hope that subsequent studies treat these results with a more appropriate degree of scepticism.
Table 1.Top performing features for a selection of one-vs-all cancer type models in the most stringent decontamination approach as presented in Poore et al These taxa include extremophiles that have not previously been isolated from humans.See Table S1, available in the online version of this article (bacteria), and Table S2 (viruses) for a full description as on NCBI of the sources for each representative species within these genera.

INTRODUCES VARIANCE AND PERMITS MODELLING
Another possible contributing factor to the issues with the models is in how the data were processed.Microbiome data are dynamic [17] (Whalen I: distributional differences), and are typically heteroskedastic (meaning that the variance of a variable is non-constant over values of an independent variable, i.e. the number of sequencing reads assigned to each of two genera) [18].The authors resolve heteroskedasticity by applying a tool called Voom that is designed for RNA sequencing data of a single organism where the majority of genes have some level of expression.However, as applied by Poore et al. it suggests presence even when taxa are absent (Whalen III: leaky preprocessing).For example, for Hepandensovirus (genus of crustacean virus), the top feature for adrenocortical carcinoma, Voom transitions all zeros to non-zero values and untrue variation has been introduced by the global adjustment for technical variables including sequencing centre (Fig. 1a, batch correction relating to Whalen II: confounding).Therefore, this normalization appears beneficial on the global level but raises prominent concerns at the level of individual taxa.
Another example of how the processing of data can be problematic is provided by the extremophile genus Ignicoccus in prostate cancer samples.Ignicoccus shows a statistically significant increase in prostate cancer samples compared to other cancers in the normalized dataset (Wilcox signed rank-sum test P<2.2×10 -1 , Fig. 1b, c).In the raw, unprocessed data no increase in prostate cancer samples is apparent.Indeed, most values are zero and the maximum number of reads found in the raw prostate cancer data for Ignicoccus is 12 (low evidence of detection).It is also highly likely that these are false taxonomic assignments given that Ignicoccus was identified in marine hydrothermal vents [19].This taxon should have been filtered out prior to model building -the application of a minimum read threshold (i.e. 100 classified reads) would have assisted the removal of spurious taxa.

THE MODELS ARE TRAINED ON UNBALANCED DATA
The performance of the models may in part be due to the major imbalance in class size in the datasets (Whalen IV: unbalanced classes), meaning that before model construction, data in the cancer set under investigation are multiplied up many times (upsampling) so that patient numbers in the 'cancer groups' and in the 'all other cancers group' become similar. .This is a prominent concern, especially given how closely linked sequencing centre and disease type are (Table S3).Raw (b) and Voom-SNM normalized (c) Ignicoccus values, which was deemed the most important feature for predicting prostate cancer (PCa) from all other cancer types (n=13 883 primary tumours).Median values are as follows: Kraken raw other 0, Kraken raw PCa 1, normalized other 4.49, normalized PCa 5.82.In both the raw and normalized cases, the distributions are significantly different (Wilcox signed rank-sum test P<2.2×10 -1 ).
approach may amplify the of implausible artefactual data.Adrenocortical carcinoma for example has 79 associated samples (as per Metadata-TCGA-All-18116-Samples.csv provided by Poore et al.).This means that 18 037 are not adrenocortical carcinoma.Adrenocortical carcinoma therefore represents 0.44 % of the whole dataset and therefore data from adrenocortical carcinoma are amplified up to 230 times to equal the sample size of the rest of the dataset.The modelling is therefore overexposed to inappropriate variation in taxa such as Hepandensovirus (Fig. 1).

DISCUSSION
The detection of microbial composition via machine learning is increasingly being used in disease-based research.Extreme caution must be taken to avoid coming to inaccurate conclusions.In this letter we have reviewed the paper of Poore et al. [1] and highlighted many problems.Ideally, the authors would have followed as closely as possible the RIDE criteria set out by Eisenhofer et al. (also authored by Knight) [20].Where this is not possible, the conclusions drawn should be more cautious and the limitations made clear.Poore et al. use many good practices in machine learning [21] but there is the need to avoid the pitfalls of Whalen et al. and use more stringent methods regarding contamination, taxonomic misclassification and previous microbiological evidence.
The hypothesis that microbes (including those found in tumours) are dependent on anatomical location is well founded based on previous work [22], but the models produced by Poore et al. are at best suggestive and do not substantiate this observation.Additional care should be taken to include only taxa with strong evidence of presence based on computational evidence, consideration of the likelihood of contamination and prior biological evidence that the taxa are present in the biological sample of interest.
After we raised our concerns [23], Poore et al. published a response to our points [24].Despite its considerable length, it focused on the technical details of statistical modelling and did not address the core concerns raised in this letter regarding contamination, nonsensical taxa appearing as important and the flawed batch correction.
Even in the technical areas where they did respond, they did not address these points.For example, Poore et al. defended the use of Voom prior to batch correction by claiming that it had been cited >4000 times, but Voom was not developed for metatranscriptomics.Microbial community matrices are typically much sparser than single-organism gene expression matrices.Voom transforms zero values into non-zero values, subsequently with additional false signal introduced (Fig. 1a), which makes the use case of Voom followed by SNM inappropriate.In their response, they suggest that a difference of 0.006 in the normalized values for Hepandensovirus in adrenocortical carcinoma (Fig. 1a) is not significant.This is not correct.The machine learning algorithms used in Poore et al. do not require large differences to build a rule and make classifications.This is reflected in the fact that this genus is by far the most important feature in their near perfect performing model for predicting adrenocortical carcinoma.
The overarching problem, however, is the prevalence of nonsensical taxa appearing as informative in Poore et al.'s models.This is a sure sign that something is going wrong.Poore et al. have given some attention to the issue of contamination but nonsensical taxa with limited evidence of true involvement are still prominent, suggesting this has not gone far enough.In their response, Poore et al. also state that they 'extensively remove human reads from metagenomic data' , but sequencing reads that are not aligned to the human genome are not equivalent to non-human reads and there is no evidence that a human genome was present in their taxonomic database, which is best practice [25].Poore et al. noted that the most stringent decontaminated dataset was only produced to address a reviewer's concerns but that the structure of the data soon became unrecognizable.It is therefore alarming that performance metrics are still high and that nonsensical taxa are still reported as the best performing features in the models built on these 'unrecognizable' datasets with 'stringent' decontamination.Contamination is undoubtedly a major concern in microbiome research and has critically affected the results of a significant amount of research [10,11,13,14,20].Examples include the claims of a brain or placental microbiome [11,12].
It is our contention that there are critical flaws in the study by Poore et al. resulting in misclassifications and contamination being considered as important features to predict tumour type.Unless this issue is addressed, no matter how good the subsequent analysis, the results will still be questionable.Therefore, we believe that our central point of urging 'caution' to those interpreting the data and results of Poore et al. remains valid.
Finally, we would like to highlight the controversy surrounding the use of the term 'cancer microbiome' in this context.There are many definitions of 'microbiome' [26], but the commonly accepted use of the term could imply that microbes are ubiquitous in every single cancer sample, which they are not.There are many sites in the body with highly disproven 'microbiomes' such as the uterus and brain [11][12][13][14].Given the methodological issues we raise, it is difficult to see whether any of the reported microbes are cancer type specific or whether they go beyond the known tissue-specific microbes (hepatitis etc.).Therefore, it should be considered whether these really constitute a 'microbiome' or whether they are related to infection.
We that the study of in tumours is an exciting field, and that the use of large sequencing datasets with rich metadata can reveal much more about the nature of the interplay between microbes and cancer.Poore et al. have used machine learning models to describe the 'tumour microbiome' as being specific to tumour type, but we have serious concerns.Overwhelming contamination and inappropriate handling of the data do not support the claims in the original title: 'Microbiome analyses of blood and tissues suggest cancer diagnostic approach' .A dataset with a less pronounced batch effect, more balanced class sizes and modelling all tumour types at once (not one-vs-all models) might help to better distinguish the pan-cancer microbial structure.There needs to be a better demonstration of microbial differences between tumour types and rigorous validation of models before we can be certain of these differences and illuminate any taxa underpinning these differences.We are a long way from proving the utility of cancer microbial structure in improving cancer patient care.
Representative species within top features (Table S1) were identified by browsing GTDB [32] (release version 207).Associated metadata regarding isolation sources were found by accessing links presented on the GTDB taxonomy browser.

Fig. 1 .
Fig. 1.(a) Voom-SNM normalized TCGA samples (n=17 624) that were negative for crustacean virus hepandensovirus with zero classified reads in the original Kraken dataset with the most stringent decontamination approach.One sample contained two sequencing reads for Hepandensovirus, which has been omitted from this figure to illustrate inappropriate variation introduced by SNM.The colour of each point indicates the centre where the sample was sequenced and from where the resulting data were submitted [University of North Carolina, Harvard Medical School, Canada's Michael Smith Genome Sciences Centre, Broat Institute MIT and Harvard, Baylor College of Medicine, Washington University School of Medicine, MD Anderson -Institute for Applied Cancer Science, Johns Hopkins/University of Southern California, MD Anderson RPPA Core Facility (Proteomics)].The x-axis demonstrates cancer types using TCGA abbreviations as in Poore et al. [1].This is a prominent concern, especially given how closely linked sequencing centre and disease type are (TableS3).Raw (b) and Voom-SNM normalized (c) Ignicoccus values, which was deemed the most important feature for predicting prostate cancer (PCa) from all other cancer types (n=13 883 primary tumours).Median values are as follows: Kraken raw other 0, Kraken raw PCa 1, normalized other 4.49, normalized PCa 5.82.In both the raw and normalized cases, the distributions are significantly different (Wilcox signed rank-sum test P<2.2×10-16 ).