Effects of microbiome rare taxa filtering on statistical analysis

Background:
Accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors that lead to either falsely identifying microbial taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. This approach reduces the extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured "mock" datasets, where the true taxa compositions are known. Although filtering is frequently used, a careful evaluation of its effect on data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on alpha and beta diversity estimation, as well as its impact on identifying taxa that discriminate between disease states.


Results:
The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets, in which the same cultured samples with known microbial composition are processed at different labs, and two disease study datasets. Results show that in the microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs, while preserving between-sample similarity (beta diversity). In the disease study datasets, the DESeq2 and linear discriminant analysis effect size (LEfSe) methods were used to identify taxa that are differentially expressed across groups of samples, and random forest models were used to rank features with the largest contribution towards disease classification. Results reveal that filtering retains significant taxa and preserves model classification ability as measured by the area under the receiver operating characteristic curve (AUC). The comparison between filtering and a contaminant removal method shows that they have complementary effects and are advised to be used in conjunction.

Conclusions:
Filtering reduces the complexity of microbiome data while preserving their integrity in downstream analysis. This mitigates the sensitivity of classification methods and reduces technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
Keywords: Filtering; Fast Permutation Test; Quality Control; Microbiome; Contaminants

Background
Studies of associations between microbiota and human disease states have received increasing attention over the last decade [1]. It has been shown that microbiota composition plays an important role in the development of multiple diseases, including inflammatory bowel disease [2], diabetes [3,4], preterm birth [5,6], and liver diseases [7,8]. Next generation sequencing (NGS) of the 16S rRNA marker is currently among the most widely used methods for identifying microbial organisms. In these studies, samples collected at different body sites (e.g., vaginal swab, stool, or blood) yield counts of DNA fragments, which are then grouped into similar microbial organisms, usually referred to as taxa. Hence, the resulting data are usually referred to as the "taxa table", or "derived feature data". In contrast to other -omics measurements, microbiome data are very sparse, as many taxa are rare and often have zero counts in most samples.
The extreme level of sparsity in microbiome datasets is one of the major challenges in data analysis. Indeed, it is not unusual to have over 90% zeros in these data, as they contain a large number of rare taxa observed in as few as 1 to 5% of samples. Recent microbiome quality control studies indicate that many rare taxa are caused by sequencing artifacts [9], contamination, and/or sequencing errors [10,11,12,13]. The most common approach to address this problem in the derived feature data is filtering, i.e., removing spurious taxa from the 16S dataset. Most filtering approaches are based on rules of thumb, which vary from lab to lab. Recently, a filtering loss measure and a principled filtering test, PERFect [14], were introduced for deciding which taxa to remove. These methods are implemented in the Bioconductor package PERFect [15], which includes a novel fast implementation of the permutation PERFect method. The implemented approach reduces the original algorithm's running time almost fourfold.
While some techniques have been proposed to detect and remove contaminant and/or rare taxa, the literature in this research area is scarce. Davis et al. [16] addressed this problem by introducing the decontam R package, which identifies contaminants by: (1) inversely correlating taxa frequencies with sample DNA concentration; and (2) using the prevalence of sequenced negative controls [17]. This method requires auxiliary data from DNA quantitation, which is in most cases intrinsic to sample preparation, or negative control data, which is intrinsic to the sequencing protocol. This approach is closely related to, but not identical to, filtering.
Traditional filtering methods were previously compared to the PERFect approach proposed by Smirnova et al. [14] and tested on two datasets acquired from mock community experiments carried out at Virginia Commonwealth University (VCU) [12,18] and a reagent and laboratory contamination dataset [17]. The authors used the number of contaminant taxa removed from the mock datasets as the method evaluation criterion. However, in practice, filtering is used as an intermediary step applied to the derived taxonomic feature table prior to data analysis. While filtering is a commonly used and recommended approach [19,20], its benefits for data analysis and its effects on the scientific conclusions drawn from filtered and unfiltered data have not been reported.
The objectives of the current study are to evaluate: (1) the effects of filtering on technical variability for identical mock samples processed under different conditions; and (2) the advantages and disadvantages of using filtering for detecting significant taxa discriminating two groups of medical conditions. To address the first goal, we analyze the recent MicroBiome Quality Control (MBQC) project [13], which includes 1,016 oral mock samples sequenced at 15 laboratories, with the results then randomly distributed to 9 bioinformatics facilities for taxonomic classification; and the previously studied laboratory contamination dataset (denoted the Salter data) [17]. To address the second goal, we analyze two novel gut microbiome datasets: the TREAT consortium alcoholic hepatitis study [8] and the Human Microbiome Project inflammatory bowel disease study [21]. To evaluate the effects of filtering, we concentrate on: (1) alpha (within-sample) and beta (between-sample) diversity analysis; and (2) identification of significant taxa using random forest classification, LEfSe and DESeq2 methods. Finally, we discuss the filtering and contaminant removal methodologies, and show that these approaches have complementary effects.
The rest of the paper is organized as follows. In the section "Motivating datasets" we introduce four datasets used in this study. In the section "Methods" we review the filtering and contaminant removal methodology details, as well as the statistical analysis methods used in the paper. We discuss analysis results in section "Results and discussion". Finally, in the section "Conclusions" we summarize major findings of the current study.

Mock data
The Microbiome Quality Control data
Consider the dataset from the MBQC project, a collaborative effort designed to comprehensively evaluate sample processing and computational methods for human microbiome data analysis [13]. There are four types of samples in this dataset: (1) 11 unique fresh stool samples; (2) seven unique freeze-dried stool samples; (3) two unique chemostat samples generated from a Robogut; and (4) two artificial colonies representing the gut and oral cavity. Aliquots of these samples were first randomly sequenced at 15 laboratories, and the results were then randomly distributed to 9 bioinformatics facilities for taxonomic classification. Here, we consider the oral artificial community data comprised of 22 true taxa. The MBQC project identified a total of 27,140 taxa across the four types of samples. For the purposes of this analysis, 14,861 taxa that have a 0 count across all oral artificial community samples were excluded; 1,277 taxa that matched names at the species level were combined; finally, 10,210 taxa that appeared in fewer than 5% of the samples were removed. The final dataset considered for this analysis contains 1,016 samples and 792 taxa. A limitation of this dataset is that the samples were created from species in prescribed proportions; however, after the samples were processed, many taxa were only identified up to the genus level (a higher order of the phylogenetic hierarchy). As a consequence, only two signal taxa, Veillonellaceae Veillonella Parvula and Coriobacteriaceae Eggerthella Lenta, were correctly detected, while the other 20 signal species are among the 184 taxa identified at the genus level. Figure 1 (A) displays the log-counts heat map for the 100 most abundant taxa for the first five labs, arranged in decreasing order of abundance. Here, we rank taxa abundance by the number of samples a taxon is present in, where the most abundant taxon is ranked 1, the second most abundant 2, and so on.
The white areas of the heatmap in the lower right corner indicate unobserved taxa, showing the decrease in signal strength across different processing institutes/labs. Figure 1 (B) displays the Principal Coordinate Analysis (PCoA) Bray-Curtis distance [22] multidimensional scaling (MDS) plots for the 1,016 samples from the heat map on the left. The first two principal components (PCs), which explain 32.2% of the variability, are shown on the plot. The samples cluster by bioinformatics lab, indicating differences across samples processed at different institutes even though they contain exactly the same signal species. These observations highlight the strong effects of the sequencing and bioinformatics protocol choice on taxa identification. Filtering, which removes the rare taxa displayed in the columns on the right-hand side of the heatmap in Figure 1 (A), is one approach that could mitigate these differences. Left unresolved, this problem may cause a number of practical issues, including: (1) falsely inflating within-sample diversity, called α-diversity [23]; (2) obscuring true distances between samples, called β-diversity [24]; and (3) interpreting rare taxa as disease biomarkers (especially in low sample biomass environments).

The reagent and laboratory contamination data
The reagent and laboratory contamination study was designed to determine the effects of DNA extraction kits and other laboratory reagent contamination on sequencing output [17]. These data contain mock samples of a pure Salmonella bongori culture that were processed at three different institutes: (1) Imperial College London (ICL); (2) University of Birmingham (UB); and (3) Wellcome Trust Sanger Institute (WTSI). Each mock sample underwent five rounds of serial ten-fold dilutions to generate a series of high (dilution = 0) to low (dilution = 5) biomass samples. The top panel of Figure 6 (A) displays the log-counts heatmap for the 635 observed taxa, generated using 40 Polymerase Chain Reaction (PCR) cycles. The taxa on the horizontal axis are arranged in decreasing order of abundance, and the 18 samples on the vertical axis are arranged by low to high (0 to 5) degree of dilution. Results indicate that as the dilution number increases, true taxa contain less signal and are observed in lower counts, which makes it difficult to separate the signal from the noise.

Disease study data
In many microbiome studies, it is of interest to identify specific bacterial taxa that discriminate between two or more disease groups. We investigate the effects of filtering on identifying specific taxa that contribute to these differences using two recently reported microbiome studies.

Alcoholic hepatitis data
The study to characterize changes in fecal microbiome due to alcohol consumption and alcoholic hepatitis was performed by the sites involved in the TREAT consortium from 2014-2018 [8]. A total of 78 participants (healthy control (HC), n=24; heavy drinking control (HDC), n=20; moderate alcoholic hepatitis (MAH), n=10; severe alcoholic hepatitis (SAH), n=24) were studied. Results indicated that in random forest classification models, alcoholic hepatitis (moderate and severe alcoholic hepatitis groups combined; n = 34) was associated with a distinct microbiome signature compared to heavy drinking controls (AUC=0.826), and multiple microbial genera were identified as the key contributors to these differences.

Inflammatory bowel disease data
The inflammatory bowel disease (IBD) data were generated as a part of the NIH Common Fund's Integrative Human Microbiome Project (iHMP/HMP2). The initial findings and multi-omic datasets from these studies were published in the Nature family of journals in May and June of 2019 [21], and the data are publicly available through the HMP Data Coordination Center (HMP-DACC) and the HMP2Data Bioconductor package [25] in R. The subset of 132 patients (control (nonIBD), n=46; Crohn's disease (CD), n=86) with open-source 16S data available through the HMP2Data package was selected for the analyses presented in this manuscript.

Filtering methods
Currently, there are only a few filtering methods used to alleviate the issue of contaminant and rare taxa. The majority of these methods are based on heuristic, non-statistical rules, with two methods in which a threshold is derived statistically from the data. Here, we give a brief overview of these methods.

Rule of thumb approaches
In practice, filtering is a variation of an ad-hoc, albeit simple, procedure. One of the most widely used techniques for filtering in microbiome studies selects taxa that have counts above m = 0 in at least k samples. This approach is borrowed from the RNA-seq gene expression literature and is implemented in the R package genefilter [26] and in the QIIME bioinformatics pipeline function filter_otus_from_otu_table.py [27]. The choice of the threshold k comes from the count that is 0.1% of the minimal library size (the total number of reads in the sample). For example, the minimal library size is often set to 5,000 reads; thus popular filtering rules keep taxa present in at least k = 5 samples. Another popular approach is to remove taxa that are observed in fewer than k% of the samples. The advantage of these methods is that they are simple, intuitive, and easy to communicate to collaborators. However, they lack an explicit loss function and objective criteria for choosing the tuning parameters m and k.
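The rule-of-thumb filter described above is straightforward to express in code. The following is a minimal numpy sketch, not the genefilter or QIIME implementation; the function name and toy taxa table are our own illustration:

```python
import numpy as np

def rule_of_thumb_filter(counts, m=0, k=5):
    """Keep taxa whose count exceeds m in at least k samples.

    counts : (n_samples, n_taxa) array of read counts.
    Returns a boolean mask over taxa (True = keep).
    """
    return (counts > m).sum(axis=0) >= k

# Toy taxa table: 6 samples x 3 taxa; the third taxon is rare.
X = np.array([
    [10, 0, 1],
    [12, 3, 0],
    [ 8, 5, 0],
    [15, 2, 0],
    [ 9, 1, 0],
    [11, 4, 2],
])
keep = rule_of_thumb_filter(X, m=0, k=3)
# Taxon 1 is present in all 6 samples, taxon 2 in five, taxon 3 in
# only two, so a "present in at least k = 3 samples" rule drops taxon 3.
```

The k%-prevalence variant is obtained by replacing `k` with `np.ceil(0.01 * k_percent * counts.shape[0])`.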

Statistical threshold selection
In contrast to the rule-of-thumb approaches, where thresholds for filtering taxa are determined heuristically, the statistical approach selects an empirical filtering threshold based on the information in the data. It extends traditional rule-of-thumb filtering to find the best subset of retained taxa for further analysis by implementing statistical, data-driven significance cut-off thresholds. The current method for this approach is PERFect, a principled filtering test that removes taxa with an insignificant contribution to the total covariance [14]. Specifically, this method ranks taxa by importance, measures their contribution to the total covariance, and quantifies, using permutation tests, the chance that the increase in loss for a set of filtered taxa is due to randomness. One drawback of the permutation filtering method is that it can be computationally expensive. Indeed, given that k permutations are performed for each taxon j = 2, 3, . . . , p, the algorithm requires a total of k(p − 1) permutations, where both k and p are large. Thus, the newer version of the package (see supplementary information) employs parallel processing and an unbalanced binary search algorithm [28] that finds the cut-off taxon j without building the permutation distribution and computing p-values for all p − 1 taxa.
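The notion of filtering loss can be illustrated with a small sketch. Assuming the loss is the proportion of total covariance, measured by the squared Frobenius norm of X'X, that is lost when a set of taxa is removed (the form used in the PERFect paper [14]), a numpy version might look as follows; the function name and toy matrix are illustrative only:

```python
import numpy as np

def filtering_loss(X, removed):
    """Proportion of total covariance lost by removing a set of taxa.

    X       : (n_samples, n_taxa) count matrix.
    removed : indices of taxa proposed for removal.
    Loss = 1 - ||X_kept' X_kept||_F^2 / ||X' X||_F^2 (assumed PERFect form).
    """
    total = np.linalg.norm(X.T @ X, 'fro') ** 2
    kept = np.delete(X, removed, axis=1)
    retained = np.linalg.norm(kept.T @ kept, 'fro') ** 2
    return 1.0 - retained / total

X = np.array([[10., 0.],
              [12., 1.],
              [ 8., 0.]])
# Removing the rare second taxon loses almost none of the covariance signal.
loss = filtering_loss(X, removed=[1])
```

PERFect then asks, via permutation, whether a loss of this size could arise by chance for an unimportant taxon.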

Contaminant removal method
Contaminants in microbiome studies may arise from external sources, such as the body of the study participant or sample collector [29,30] and sample collection instruments and laboratory reagents [17,31,32], or from internal sources (cross-contamination) when samples are mixed with each other during sample processing [31] or sequencing [33]. The recently developed contaminant removal method decontam [16] identifies external contaminants by either: (1) inversely correlating taxa frequencies with sample DNA concentration; or (2) using the prevalence of sequenced negative controls. Our results suggest that decontam removes abundant taxa that are likely contaminants but does not address the issue of rare taxa. A practical limitation of this method is that it requires auxiliary information from DNA quantitation or negative controls that is intrinsic to the sequencing protocol and might not always be available.
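To make the frequency-based idea concrete, here is a deliberately simplified stand-in, not the decontam algorithm itself (which compares competing model fits and reports per-taxon scores): it merely checks whether a taxon's frequency is inversely correlated with sample DNA concentration, the signature expected of reagent contaminants. All names and numbers are illustrative:

```python
import numpy as np

def frequency_contaminant_score(freq, dna_conc):
    """Correlation of log taxon frequency with log total DNA
    concentration; strongly negative values are consistent with the
    contaminant model (frequency inversely proportional to biomass)."""
    return np.corrcoef(np.log(freq), np.log(dna_conc))[0, 1]

conc = np.array([1.0, 2.0, 4.0, 8.0])          # sample DNA concentrations
contaminant = 0.08 / conc                      # frequency halves as biomass doubles
genuine = np.array([0.05, 0.06, 0.05, 0.055])  # roughly constant frequency
```

A genuine community member keeps a roughly constant frequency regardless of biomass, so its score stays near zero, while a reagent contaminant's score approaches -1.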

Statistical analysis methods
Within-sample (alpha) [23] and between-sample (beta) [24] diversity are used to evaluate the effects of filtering on reducing technical variability in the mock datasets. Differences in estimated alpha diversity between processing labs were evaluated using Dunn's test with the Benjamini-Hochberg false discovery rate [34] multiple comparison adjustment, as implemented in R.
A number of methods for disease state prediction and differentially expressed taxa identification commonly used in metagenomic data analysis are considered in studying the effect of filtering. These methods are first performed on the unfiltered and filtered data; then the feature importance for prediction and the differentially expressed taxa selected by each method are compared. For predictive modeling, random forest [35], which is extensively applied in computational biology and genomics [36], is used to identify the set of most predictive taxa based on their mean decrease Gini measures. The classification diagnostic ability of the filtered and unfiltered models is compared using the area under the receiver operating characteristic curve (AUC). To identify differentially expressed taxa, DESeq2 [37] and linear discriminant analysis effect size (LEfSe) [38] are used. DESeq2 fits a negative binomial generalized linear model to each taxon's counts to obtain maximum likelihood estimates of the log-fold change between two classes, uses Bayesian shrinkage to shrink the log-fold change towards zero for taxa with low counts and/or high dispersion, and performs the Wald test on these shrunken log-fold changes for significance testing. LEfSe determines differentially expressed features by coupling non-parametric standard tests for statistical significance with linear discriminant analysis (LDA), allowing researchers to further identify features that are consistent with biologically meaningful categories [38]. However, these methods have some limitations when applied to microbiome data. Our results suggest that, in many instances, each of these three methods tends to flag a taxon as significant when the difference between two classes is driven by outlier counts. For example, a rare taxon that is absent from most samples and present in a few samples of one class can potentially be classified as a differentially expressed taxon.
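The AUC used throughout to compare models can be computed directly from classifier scores via the Mann-Whitney identity, without tracing the full ROC curve. A small self-contained sketch with made-up scores:

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """AUC via the Mann-Whitney identity: the probability that a
    randomly chosen positive-class score exceeds a randomly chosen
    negative-class score, counting ties as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Hypothetical classifier scores for diseased vs. control samples:
# 8 of the 9 (diseased, control) pairs are ranked correctly.
auc = auc_from_scores([0.9, 0.8, 0.6], [0.7, 0.4, 0.3])
```

An AUC of 0.5 corresponds to random ranking; 1.0 to perfect separation of the two classes.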

Results and discussion
PERFect simultaneous and permutation filtering approaches were previously validated [14] on three mock community datasets ([10], [11], and [12]) using the number of contaminant taxa correctly removed as an efficiency criterion. Here, we concentrate on the effects of filtering on downstream analyses, using the two major exploratory analyses in microbiome research, alpha and beta diversity, as well as its impact on identifying taxa that discriminate between disease states.

The MicroBiome Quality Control Data Analysis
One of the main goals of the MBQC project was to understand major differences in technology and methods for analyzing human microbiome data. This was achieved by analyzing the observed taxa variation between: (1) the labs that sequenced samples according to their internal protocols; and (2) the bioinformatics pipelines used to perform taxonomic classification. Here, we concentrate on the effect of bioinformatics processing laboratories on the observed oral mock community data as measured by alpha and beta diversity, two of the most commonly used summaries in microbiome research. Figure 2 (A) shows the Shannon index, the most widely used diversity metric, which accounts for both abundance and evenness of the species present [39], in the unfiltered and filtered data. The Shannon index is defined as H = −Σ_{i=1}^{S} p_i ln(p_i), where p_i is the proportion of the total sample represented by species i and S is the total number of species in the community. This plot and the summary statistics in Table 1 indicate a decrease in the Shannon index between the unfiltered and filtered data, implying a reduction in the diversity and evenness of taxa. Specifically, the number of taxa decreases from 792 to 175 and 222 using the simultaneous and permutation methods, respectively, while the 22 true taxa are retained. Since both filtering methods remove more than 70% of taxa, the distribution of the remaining taxa shifts in favor of the true taxa by increasing their proportions in the samples, resulting in less even communities. As a result, the reduced alpha diversity of the samples tends towards the true alpha diversity, as indicated by the red dashed line.
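The Shannon index can be computed directly from a vector of taxon counts. A minimal sketch (the toy counts are our own):

```python
import numpy as np

def shannon_index(counts):
    """Shannon index H = -sum_i p_i * ln(p_i), where p_i is the
    proportion of the sample made up by taxon i."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]          # the 0 * ln(0) terms are taken as 0
    return -np.sum(p * np.log(p))

even = shannon_index([25, 25, 25, 25])   # maximally even community: ln(4)
skewed = shannon_index([97, 1, 1, 1])    # dominated by a single taxon
```

With S taxa, H is maximized at ln(S) for a perfectly even community and shrinks as a few taxa dominate, which is why removing rare taxa lowers the index.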
In order to study the effect of filtering on differences across bioinformatics processing labs, we applied Dunn's test with a Benjamini-Hochberg correction for multiple testing to all possible pairwise Shannon alpha diversity comparisons between processing labs. Results are summarized in Table 2. Since all samples contained the same mock communities, in the absence of technical variability none of the differences should be significant. For the unfiltered data, 21 out of 28 possible pairs have significant differences in alpha diversity at the 0.05 significance level. Applying simultaneous and permutation filtering decreases the differences in alpha diversity for most pairs. Moreover, a total of 4 and 8 pairwise comparisons are no longer significant at the 0.05 level after simultaneous and permutation filtering were applied, respectively. While filtering does not remove all differences due to processing labs, these results indicate that it dramatically alleviates the differences in alpha diversity estimates caused by lab-to-lab variability.
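For reference, the Benjamini-Hochberg step-up adjustment applied to the pairwise p-values can be sketched as follows (a generic implementation, not the specific R routine used in the paper):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up FDR-adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Toy p-values from four hypothetical pairwise comparisons.
adj = benjamini_hochberg([0.01, 0.02, 0.03, 0.50])
```

Comparisons are declared significant when the adjusted p-value falls below the chosen FDR level (0.05 in the analysis above).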
To assess the effect of filtering on beta diversity, we calculated the pairwise Bray-Curtis distances between samples using a combined taxa matrix consisting of the unfiltered taxa matrix and the filtered taxa matrices of PERFect simultaneous and PERFect permutation, each at the p-value threshold of 0.1. The multidimensional scaling ordination plot for the first two principal components, which explain 30.5% of the variability in the data, is shown in Figure 2 (B). The three filtering methods (unfiltered, simultaneous and permutation PERFect) are arranged in columns, and samples are colored according to the 8 processing institutes. Figure 2 (B) shows that while the data cluster by laboratory in each dataset, the proximity between clusters decreases when simultaneous or permutation filtering is applied. This observation indicates that filtering decreases dissimilarity between samples that contain the same mock communities and slightly alleviates the effects of lab-to-lab variability. Thus, filtering achieves dimension reduction (reduces the number of taxa) while preserving beta diversity.
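The Bray-Curtis dissimilarity underlying these ordination plots has a simple closed form for a pair of count vectors. A minimal sketch with toy vectors:

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i).
    0 for identical compositions, 1 for samples sharing no taxa."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.abs(x - y).sum() / (x + y).sum()

d_same = bray_curtis([10, 5, 0], [10, 5, 0])   # identical samples
d_diff = bray_curtis([10, 0, 0], [0, 0, 10])   # no shared taxa
```

Because the statistic is dominated by the abundant shared taxa, removing rare taxa barely changes it, consistent with the preserved beta diversity reported above.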
The reagent and laboratory contamination data
Figure 2 (C) displays the difference in the Shannon index of the filtered outputs, corresponding to the p-value threshold of 0.1, using simultaneous and permutation filtering among 6 dilution levels and 3 processing institutes. It is expected that as the dilution levels increase, more uncertainty in true taxa identification is introduced into the biological system; thus the proportion of signal taxa decreases whereas that of noise taxa increases. This phenomenon is displayed in the top heatmap in Figure 6 (A), where taxa are arranged from left to right in decreasing abundance order (noise taxa to the bottom right of the heatmap). As the dilution levels increase (rows of the heatmap), each dilution 'band' becomes denser due to the increase in noise taxa counts. Therefore, dilution causes the true signal to become more even with the noise, leading to a higher Shannon index. This effect may cause problems when comparing alpha diversity across groups of samples with variable biomass, because it is more difficult to differentiate between signal and noise taxa in low biomass samples. The filtering methods address this issue by removing noise taxa in highly diluted samples (dilutions 3, 4 and 5), where simultaneous filtering removes more taxa than the permutation algorithm and has more impact on reducing the alpha diversity.
To compare the beta diversity of the filtered outputs, the pairwise between-sample Bray-Curtis distances were calculated using a combined taxa matrix, with a set-up similar to the MBQC data analysis. The multidimensional scaling ordination plot for the first two principal components, which explain 81.3% of the variability in the data, is shown in Figure 2 (D). The six dilution levels and three filtering methods (none, simultaneous and permutation PERFect) are arranged in columns and rows, respectively; samples are colored according to the three processing institutes. Ideally, the samples from all three processing institutes should have the same composition of taxa regardless of the dilution level. However, contaminants that entered the samples during the DNA extraction and PCR process lead to the differences between the three processing institutes. Figure 2 (D) shows that filtering does not dramatically change the samples' pairwise distances in the ordination plots. This is because PERFect, like many other filtering methods, removes taxa with low abundance, which do not contribute to the signal and thus do not dramatically affect the samples' pairwise distances. These observations lead to the important conclusion that filtering reduces the number of taxa considered in the analysis, and thus reduces the dimensionality of the taxa table, without affecting beta diversity.

Alcoholic Hepatitis Data Analysis
Random forest results
The ROC curves of the random forest models for the unfiltered (AUC = 0.826) and filtered (AUC = 0.816) data are shown in Figure 3 (A). The predictive abilities of the filtered and unfiltered models as measured by AUC values are similar, although there is a small decrease of 0.01 in the AUC for the model built on the filtered data. This implies that removing rare taxa has little effect on the classification ability of the random forest model. Further, the most predictive taxa, as measured by the mean decrease Gini criterion, also tend to be abundant (see Supplementary Table 1). Specifically, the ranks of the top 35 predictive taxa in the unfiltered data range between the first and the 81st most abundant taxa out of the total 345 taxa in the unfiltered dataset.
To compare the discrepancy in taxa importance ranks between the two random forest models, we use the elbow method on the taxa mean decrease Gini in the unfiltered data to choose the top 60 most predictive taxa. The taxa are chosen so that the differences between consecutive taxa's mean decrease Gini are no less than 0.001. Their importance ranks are then compared to those from the filtered data. Results indicate that, in general, while there is minor variation, ranks are consistent and strongly predictive taxa keep their classification ability after filtering (see Supplementary Table 1).
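An elbow rule of the kind described here (keep top-ranked taxa while consecutive importance drops stay at least 0.001) can be sketched as follows; the function name and importance values are illustrative, not the paper's actual code:

```python
import numpy as np

def elbow_cutoff(importance, min_gap=0.001):
    """Number of top-ranked features to keep: walk down the sorted
    importance values and stop at the first consecutive drop smaller
    than min_gap."""
    imp = np.sort(np.asarray(importance, dtype=float))[::-1]
    gaps = -np.diff(imp)                 # drops between consecutive ranks
    below = np.where(gaps < min_gap)[0]
    return imp.size if below.size == 0 else int(below[0]) + 1

# Hypothetical mean-decrease-Gini values with a clear elbow after 3 taxa.
k = elbow_cutoff([0.50, 0.30, 0.10, 0.0995, 0.099], min_gap=0.001)
```

Applied to the unfiltered mean decrease Gini values in the study, this kind of rule yielded the top 60 taxa.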

LEfSe results
The LDA scores for all significant taxa identified by LEfSe from the unfiltered and filtered data are shown in Figure 4 (A). For each taxon, the scores from the unfiltered and filtered data are similar, although the values from the unfiltered data tend to be slightly higher (by between 0.01 and 0.75). This indicates that filtering retains the differential expression of almost all taxa; thus taxa that are significant in the unfiltered data tend to be significant in the filtered data.
There are four taxa that are present in the unfiltered but absent in the filtered data results: two identified at the family level (Lactobacillaceae and Coriobacteriaceae) and two at the genus level (Ruminococcaceae Butyricicoccus and Clostridiaceae Proteiniclasticum). At the family level, filtering removes rare taxa from each family (3 out of 6 taxa from Lactobacillaceae and 6 out of 12 taxa from Coriobacteriaceae). The remaining taxa aggregated to each of the two families do not discriminate between the heavy drinking control (HDC, n = 20) and the alcoholic hepatitis (AH, n = 34) groups. At the genus level, Ruminococcaceae Butyricicoccus and Clostridiaceae Proteiniclasticum are flagged as significant in the unfiltered but as non-significant in the filtered data. Both taxa are rarer overall (the 42nd and 179th most abundant of 345 taxa), with relative abundance between 0 and 0.02 (maximum excluding outliers) (see Supplementary Figure 1), one outlier for Ruminococcaceae Butyricicoccus (relative abundance = 0.08), and only a few low relative abundance observations in the HDC group for Ruminococcaceae Butyricicoccus. This suggests that in the presence of taxa with outliers, the difference between groups appears stronger when tested in the unfiltered dataset with a large number of rare taxa. However, the strength of the outliers' effect is reduced when testing is performed on the filtered data, where extremely rare taxa are removed.

DESeq2 results
The DESeq2 method, based on log(Count + 1) transformed data, was used to identify differentially expressed taxa in the filtered and unfiltered taxa tables; results are shown in Figure 4 (B). Taxa colored in black were identified as significant in both the filtered and unfiltered datasets, while the taxon in blue (Lachnospiraceae Anaerostipes) was present in the filtered data but was not significant. Unsurprisingly, this discrepancy was due to small differences in the significance level of this taxon between the filtered and unfiltered data. At the alpha level of 0.1, this taxon was significant in both the unfiltered and filtered data, with raw p-values of 0.028 and 0.035, respectively. After the p-value adjustment step using the Benjamini-Hochberg procedure, it remained significant in the unfiltered data (p-value = 0.093) but became non-significant in the filtered data (p-value = 0.113). Since the change between raw p-values is relatively small and the adjusted p-values are close to the alpha level, we may conclude that for the DESeq2 method there are no major differences due to filtering in this dataset.

Summary of discrimination results
Common significant taxa between the random forest models, LEfSe and DESeq2 results on unfiltered data are shown in Figure 4 (C). Two taxa are significant in the unfiltered but not in the filtered data: Lachnospiraceae Anaerostipes (not significant in DESeq2) and Fusobacteriaceae Fusobacterium (low importance in the random forest). These discrepancies occur for borderline significant taxa.

IBD Data Analysis
Filtering analysis of the IBD data is performed using the same workflow as the Alcoholic Hepatitis data, and the results are shown in Figure 5. We observe patterns similar to those in the Alcoholic Hepatitis analyses: 1) significant taxa tend to be highly abundant; 2) the predictive abilities of the filtered and unfiltered random forest models, as measured by AUC, are similar (unfiltered AUC = 0.852; filtered AUC = 0.853); 3) in the discriminant analysis, the differences between LDA scores for all significant taxa from the unfiltered and filtered data are small; and 4) the significance effect of rarer taxa with outliers is stronger in the unfiltered than in the filtered dataset.
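The AUC comparison above can be read through the rank-sum interpretation of the AUC: the probability that a randomly chosen disease sample is scored higher than a randomly chosen control. A minimal sketch, with invented scores rather than the models' actual predictions:

```python
# AUC via the rank-sum (Mann-Whitney) formulation: the fraction of
# (positive, negative) pairs where the positive sample scores higher;
# ties count one half. Scores below are invented for illustration.

def auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.7, 0.4]  # hypothetical scores for disease samples
neg = [0.6, 0.5, 0.3, 0.2]  # hypothetical scores for controls
print(auc(pos, neg))  # 0.875
```

Under this reading, near-identical AUCs for the filtered and unfiltered models mean the two models rank disease samples above controls almost equally often.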
Compared to the Alcoholic Hepatitis data, DESeq2 identified three additional significant taxa in the filtered data: Lachnospiraceae Eubacterium (p-value = 0.080), Fusobacteriaceae Fusobacterium (p-value = 0.0999) and Enterobacteriaceae Escherichia-Shigella (p-value = 0.070). Since these adjusted p-values are borderline significant in the filtered data, which includes fewer taxa, the same genera are non-significant when DESeq2 is run on the unfiltered data, where a larger number of taxa enter the multiple comparison adjustment.
Comparison with contaminant removal method
Contaminant removal and filtering methods share the goal of identifying features that arise from technical limitations of sequencing and taxonomic classification. However, the main goal of each method is different, which leads to the complementary effects seen in our comparison studies. Filtering concentrates on removing rare taxa, relying mostly on sparsity assumptions and using no auxiliary information about the derived feature data. The contaminant removal methods implemented in the R package decontam use additional information from the sequencing process to apply statistical threshold rules marginally, one taxon at a time. The decontam package implements two methods, each using specific auxiliary information about the derived feature data: (1) the frequency method uses DNA quantitation data recording the concentration of DNA in each sample; and (2) the prevalence method employs a set of "negative control" samples in which sequencing was performed on blanks without any biological sample added.
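The intuition behind the frequency method can be illustrated with a toy sketch. This is only a conceptual illustration of the idea, not the decontam algorithm, and the frequencies and concentrations below are invented: a contaminant contributes a roughly fixed amount of DNA, so its relative frequency shrinks as the true DNA concentration grows.

```python
# Conceptual sketch: does a taxon's relative frequency vary inversely with
# total DNA concentration (contaminant-like) or stay roughly constant
# (non-contaminant-like)? Illustration of the idea only, not decontam.
import math

def sse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

def contaminant_like(freqs, concentrations):
    log_f = [math.log(f) for f in freqs]
    log_c = [math.log(c) for c in concentrations]
    # Non-contaminant model: log frequency is constant (its mean).
    mean_f = sum(log_f) / len(log_f)
    sse_flat = sse(log_f, [mean_f] * len(log_f))
    # Contaminant model: log frequency = intercept - log concentration,
    # i.e. frequency proportional to 1/concentration (slope fixed at -1).
    intercept = sum(lf + lc for lf, lc in zip(log_f, log_c)) / len(log_f)
    sse_inv = sse(log_f, [intercept - lc for lc in log_c])
    return sse_inv < sse_flat  # inverse model fits better

conc = [1.0, 2.0, 4.0, 8.0]                 # total DNA concentrations
contaminant = [0.40, 0.21, 0.099, 0.051]    # roughly halves as conc doubles
genuine = [0.10, 0.11, 0.095, 0.105]        # roughly constant frequency
```

The sketch also makes the instability noted below concrete: with only a handful of non-zero observations for a rare taxon, neither model can be fit reliably.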
Following the methods discussed by Davis et al., 2018 [16], we applied the decontam frequency method to the reagent and laboratory contamination data to compare filtering and contaminant identification in terms of the type of taxa they remove and their effects on diversity. Results are illustrated in Figure 6, which compares heatmaps, alpha diversity and beta diversity for the derived feature data without filtering against the data with taxa removed by the decontam frequency and the PERFect simultaneous filtering methods. Heatmaps in Figure 6 (A) indicate that decontam frequency identifies abundant taxa as contaminants (left oval in the middle panel heatmap), leaving rare taxa to the right of the plot in the data set. In contrast, PERFect removes rare taxa (right oval in the bottom panel heatmap) that decontam was not able to detect, while leaving abundant taxa in the data set. These observations highlight important methodological differences between the two methods. Specifically, decontam frequency fits a regression model to compare a contaminant model, in which expected frequency varies inversely with total DNA concentration, against a non-contaminant model, in which expected frequency is independent of total DNA concentration [16]. For rare taxa, the regression model fit is unstable due to the small number of observations (the few samples where rare taxa appear), and thus decontam returns missing values for the significance of these taxa. Out of a total of 635 taxa, decontam identified 61 taxa as contaminants and was unable to evaluate statistical significance for 221 taxa. The filtering approach has the major limitation of being skewed toward retaining more dominant features; as a result, a persistent contaminant feature may appear in a large number of samples, have a high contribution to the covariance, and thus not be removed from the data set.
Comparison of alpha diversity in Figure 6 (B) reveals that both decontam and PERFect reduce Shannon diversity. In these data, decontam leads to a greater reduction in alpha diversity, which is expected since it removes more abundant taxa.
However, it should be noted that this is a small study with only 18 observations (six from each of the three institutes), and the results may not be conclusive. Figure 6 (C) compares Bray-Curtis beta diversity plots for each method (rows) by dilution level (columns), with samples colored by processing institute. All samples contain the same biological material; thus, under a no-technical-variability scenario, the points should overlap on the plot. This is the case for the undiluted samples (first column, dilution = 0); however, the observed dissimilarity between samples increases with dilution. Davis et al., 2018 [16] showed that removing abundant contaminants (second row in Figure 6 (C)) reduces technical beta diversity. Comparing these results with the filtering output confirms our previous observation, based on the Microbiome Quality Control data set, that removing rare taxa via filtering does not significantly affect beta diversity.
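The two diversity summaries used throughout, the Shannon index (alpha diversity) and Bray-Curtis dissimilarity (beta diversity), can be sketched with toy counts; the samples below are invented for illustration.

```python
# Shannon index and Bray-Curtis dissimilarity on toy count vectors.
# Counts are invented; they illustrate why dropping extremely rare taxa
# barely moves the Bray-Curtis distance between samples.
import math

def shannon(counts):
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def bray_curtis(a, b):
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

sample_1 = [50, 30, 20, 1, 0]   # last two entries are rare taxa
sample_2 = [48, 33, 19, 0, 1]
d_full = bray_curtis(sample_1, sample_2)
d_filt = bray_curtis(sample_1[:3], sample_2[:3])  # rare taxa dropped
```

Because Bray-Curtis weights taxa by their counts, removing taxa with counts of 0 or 1 changes both the numerator and denominator only marginally, consistent with the observation that filtering leaves beta diversity essentially unchanged.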

Conclusions
It is generally believed that filtering rare taxa is an effective quality control approach to remove possible contaminants and sequencing and taxonomic assignment artifacts. The current study supports this paradigm and demonstrates that filtering has a strong potential to reduce lab-to-lab variability between samples that contain similar microbial species but are processed according to different protocols. Moreover, filtering removes rare taxa that contribute little to the signal, thus reducing the dimensionality of the data with minimal information loss. The ability of the methods to detect taxa significantly different across two disease groups is almost unaffected by filtering. Except for a small number of taxa detected as significant in unfiltered but not filtered data (or vice versa), each method produces the same results. Major discrepancies in the taxa identified as significant come from the choice of data analysis method (random forest, LEfSe or DESeq2), not from filtering. To the best of our knowledge, this is the first report on the effects of filtering on statistical analyses of microbiome data.
The statistical methodology literature on quality control for derived feature data is scarce. Most previous studies either recommended filtering without a thorough evaluation of its effects [19,20], or focused on the number of taxa removed from mock artificial community studies [14] and on contaminant identification [16,10]. We have previously demonstrated that filtering methods are effective in identifying true species in mock data [14]. An underlying assumption of filtering is that most rare taxa are not informative in the analysis; the presence of rare taxa in the derived feature data increases sparsity and affects the performance of statistical methods. The current study supports earlier hypotheses and validates that removing rare taxa does not impact the scientific conclusions.
It has also been established that the contaminant removal method implemented in the decontam package [16] is effective in reducing technical variability across processing institutes. Comparison of filtering and contaminant removal on the Salter data [17] reveals that the two methods have complementary effects: decontam removes persistent contaminant features that appear in a large number of samples, while filtering removes rare taxa that appear in a small number of samples. This is not surprising, because these are exactly the assumptions of the two methods; nevertheless, it is a significant finding which suggests that in practice both methods may be used to remove sequencing artifacts from the derived feature data.
Another noteworthy finding is that most significant taxa in the unfiltered data were abundant. The random forest variable importance ranks of the top 35 predictive taxa in the unfiltered data ranged between: (1) the 1st and 81st most abundant taxa out of the total 345 taxa in the alcoholic hepatitis data set; and (2) the 1st and 93rd most abundant taxa out of the total 409 taxa in the inflammatory bowel disease data set. Furthermore, in the LEfSe and DESeq2 discrimination models, taxa found significant in filtered but not unfiltered data (or, similarly, in unfiltered but not filtered data) were overall rarer (present in a small number of samples) and of low relative abundance. This observation may guide researchers' decisions regarding how aggressive filtering should be.
A limitation of filtering is that reducing type I errors (the probability of removing important taxa) inevitably increases type II errors (the probability of keeping unimportant taxa). Indeed, if we are cautious in removing rare taxa to ensure that important taxa remain in the data, we will remove few taxa and will likely retain many unimportant ones; if we remove taxa aggressively, there is a high chance of filtering out important rare taxa. In particular, in studies that aim to discover rare taxa, filtering is not advisable, since it will likely remove the rare but important taxa. This issue can be moderated by a good understanding of the data (where the data are sampled and how they are generated) and by using auxiliary study information that allows us to filter with confidence. For predictive modeling in particular, for example using the random forest approach to predict alcoholic hepatitis, building a model with more abundant taxa may lead to higher reproducibility across studies, as rare taxa may not be observed in another cohort sampled under different conditions.
We would like to stress that the goal of the current study is the evaluation of filtering methods on commonly used microbiome analyses. As part of this study, filtering was compared to a closely related contaminant removal method implemented in the R package decontam, using one of the datasets previously illustrated by the package developers [16]. It would be of interest to perform a thorough comparison of these methods on the other datasets used in this study; however, this is outside the scope of this paper.
In summary, the current study provides information on the effects of removing rare taxa on technical variability and on the scientific conclusions drawn from statistical analyses. We provide insights into the role of filtering in microbiome studies, and highlight the importance of derived feature data quality control prior to scientific analysis.

Competing interests
The authors declare that they have no competing interests.

Availability of data and materials
The datasets supporting the conclusions of this article are included within the article and its additional files. R version 3 was used for the analyses.

Figure 2 (A) … The horizontal dashed line represents the true Shannon index. (B) Beta diversity multidimensional scaling plots of the unfiltered, simultaneous and permutation PERFect filtered data, colored by bioinformatics processing institute. Data source: [13]. (C) Shannon index for the original data and the two filtered data sets, colored by dilution level. (D) Multidimensional scaling plots of the unfiltered and filtered data at different dilution levels, colored by processing institute. Data source: [17].

Figure 3 ROC curves of random forest models. ROC curves of the random forest models from unfiltered and filtered data, differentiated by color. (A) ROC curves from the Alcoholic Hepatitis data [8]. (B) ROC curves from the IBD data [21].

Figure 4 (A) … changes for all significant taxa from LEfSe results from unfiltered and filtered data, differentiated by color; taxa that are present in the filtered data but are not significant are colored in dark blue. (B) Barchart of log(count+1) for significant taxa from DESeq2 results, colored by disease state; taxa that are present in the filtered data but are not significant are colored in dark blue. (C) Barchart of log(count+1) for common significant taxa between the random forest models, LEfSe and DESeq2 results on unfiltered data, colored by disease state. From the filtered data, black taxa are common results with those from the unfiltered data, blue taxa are non-significant in the DESeq2 results, and green taxa are not in the top 60 predictive taxa in the random forest model. Data source: [8].

Figure 5 (A) … for all significant taxa from LEfSe results from unfiltered and filtered data, differentiated by color; taxa that are present in the filtered data but are not significant are colored in dark blue, and taxa that are present in the unfiltered data but are not significant are colored in dark red. (B) Barchart of log(count+1) for significant taxa from DESeq2 results, colored by disease state; taxa that are present in the unfiltered data but are not significant are colored in dark red. Data source: [21].

Table 2 Pairwise comparisons of the Shannon index between laboratories using Dunn's test for each dataset. Data source: [13].