Method development for cross-study microbiome data mining: Challenges and opportunities

During the past decade, a tremendous amount of microbiome sequencing data has been generated to study the dynamic associations between microbial profiles and environments. How to precisely and efficiently decipher large-scale microbiome data and take further advantage of it has become one of the most essential bottlenecks for microbiome research at present. In this mini-review, we focus on the three key steps of analyzing cross-study microbiome datasets: microbiome profiling, data integration and data mining. By introducing the current bioinformatics approaches and discussing their limitations, we outline the opportunities in developing computational methods for the three steps, and propose promising solutions for multi-omics data analysis toward comprehensive understanding and rapid investigation of the microbiome from different angles, which could promote data-driven research by providing a broader view of the “microbiome data space”.


Introduction
Microbiome data provides a unique view for understanding micro-ecology and for investigating the interactions between microorganisms and their surrounding environment [1]. In recent years, a vast number of microbial community specimens have been sequenced to study microbial associations with natural environment dynamics [2,3], human health [4][5][6][7], agriculture [8,9], etc. Therefore, how to efficiently and comprehensively uncover the biological stories hidden in such large-scale data has become one of the most essential bottlenecks for microbiome research at present [10,11]. Newly developed bioinformatics tools are bringing opportunities for deciphering microbiome data, from general-purpose algorithms such as sequence alignment and machine learning (ML) to microbiome-specific approaches like operational taxonomic unit (OTU) picking [12] and phylogeny-based distance metrics [13,14]. On the other hand, the vast volume of microbiome data also poses challenges, especially in the integration of datasets produced by multiple studies and platforms [15][16] and in status or disease classification and prediction by training on large-scale datasets [17,18].
Meta-analysis of cross-study datasets can generate consistent and reproducible results that serve as a foundation for further studies and applications [19][20][21]. Three analytical steps (Fig. 1) play crucial roles in handling microbiome big-data: compositional profiling, which decodes the taxonomical and functional profiles of microbiomes from sequences (Fig. 1a); data integration, which curates, normalizes and unifies existing datasets (Fig. 1b); and data mining, which identifies and classifies the status of a given specimen using microbial features learned from integrated data (Fig. 1c). By reviewing the development of computational methods and tools for microbiome profiling, integration and data mining, in this mini-review we summarize the challenges and opportunities in these three aspects (Table 1 and Table 2), and propose prospective solutions for comprehensive understanding and rapid investigation of the microbiome from different angles by multi-omics data analysis.

Microbiome compositional profiling
DNA sequencing is the primary approach to survey the compositional features of microbial communities [22]. Generally, two sequencing strategies are widely used: amplicon sequencing that employs the marker genes (e.g. 16S rRNA, 18S rRNA or ITS) for taxonomy identification, and shotgun metagenomic whole-genome sequencing (WGS) that captures genome-wide sequences of all organisms in a sample.
For marker-gene-based analysis, taxonomy assignment commonly relies on sequence clustering and OTU picking algorithms based on sequence similarity, such as UPARSE [12] and USEARCH [23]. Amplicon sequence variant (ASV) tools such as DADA2 [24], Deblur [25] and UNOISE3 [26] were further developed to improve the analytical precision of amplicon sequences to the single-nucleotide level, and offer higher reliability, reproducibility and comprehensiveness than regular OTUs [27]. Functional profiles can also be inferred from amplicons using the linkages between marker genes and reference genomes by PICRUSt [28,29], Tax4Fun [30] and other similar software. Most of these approaches have already been integrated into comprehensive pipelines such as QIIME [31,32], Mothur [33] or Parallel-META3 [34], with additional statistical processes for quantitative analysis of the alpha and beta diversity of microbial communities. As a cost-efficient method, amplicon-based analysis has been adopted for large-scale microbiome surveys; however, its accuracy is limited by PCR bias [35], the low resolution of short-read-based markers and the lack of marker-genome associations. For example, taxonomy annotation that targets sub-regions of 16S rRNA with short reads is typically limited to the genus level [36,37], and function prediction is not accurate for environmental microbes that lack reference genomes [28].
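The alpha and beta diversity statistics mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical pair of count vectors over the same four taxa, and uses the Shannon index and Bray-Curtis dissimilarity as two common choices; it is not the implementation used by QIIME, Mothur or Parallel-META3, and the function names are illustrative only.

```python
import math


def shannon_index(counts):
    """Shannon diversity (natural log) of one sample's taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Zero counts contribute nothing, so they are skipped to avoid log(0).
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


# Hypothetical OTU/ASV counts for two samples over the same four taxa.
sample_a = [10, 10, 10, 10]   # perfectly even community
sample_b = [40, 0, 0, 0]      # dominated by a single taxon

print("alpha (Shannon) A:", shannon_index(sample_a))
print("alpha (Shannon) B:", shannon_index(sample_b))
print("beta (Bray-Curtis) A vs B:", bray_curtis(sample_a, sample_b))
```

The even community reaches the maximum Shannon value for four taxa (ln 4), while the single-taxon community scores zero, which is the contrast that alpha diversity is meant to capture.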
Since WGS is more informative, some approaches utilize unassembled WGS short reads for species- or strain-level taxonomy annotation [38,39] (e.g. Kraken [40], mOTUs [41] and MetaPhlAn2 [42]) and direct function parsing (e.g. HUMAnN2 [43]), while binning- or assembly-based tools (e.g. metaSPAdes [44], meta-IDBA [45] and MetaWRAP [46]) are capable of species genome reconstruction, de novo gene prediction and single nucleotide polymorphism (SNP) analysis. Nevertheless, broad application of WGS is limited by an overall cost that is 3-10 fold higher than that of amplicons [28,34,49,50], including sequencing, data storage and sharing, and bioinformatics processing such as read quality control [47,48] and taxonomical and functional profiling [38,43]. A new library preparation protocol for shallow shotgun sequencing obtains species-level taxonomic and functional profiles of microbiomes similar to those offered by regular deep sequencing, making WGS more economical [51]. Rather than targeting specific variable sub-regions by short-read-based amplification, full-length 16S rRNA gene sequencing on PacBio or Oxford Nanopore platforms has the potential to classify individual organisms from microbial communities accurately at species or strain resolution [52]. Meanwhile, since more and more full-length 16S rRNA gene sequences and full genomes have been released [53], mapping markers to unified references also enables high-resolution comparison of microbiome profiles over a wide range. To exploit these advantages of long-read sequencing data, denoising, sequence clustering and annotation algorithms and strategies also need to be updated. Thus, the rapid development of microbiome profiling methods provides the basis for a broader view of the “microbiome data universe”.
Table 2 (excerpt). Search-based approach: status-assumption-free and bio-marker-free; robustness to data heterogeneity and contamination. Deep learning: hardware and system environment support for big-data training; optimization in multi-tag classification; well-implemented script-based packages.

Data repositories and integration
A huge number of microbiome datasets have been produced by studies such as the Human Microbiome Project [54], the Earth Microbiome Project [55] and the American Gut Project [56]. Samples have been deposited in online repositories, e.g. NCBI-SRA [57], MG-RAST [58], EBI Metagenomics [59], JGI-IMG/M [60], MPD [61] and so on. Such massive data provides the “materials” for research on global microbial diversity and distribution, but also creates new problems in data integration and reuse. In these repositories, most samples are organized by study and stored as raw or clean DNA sequences, and metadata are not unified across studies for feature selection and comparison, which makes it difficult to find microbiomes under a targeted condition or with specific features.
To utilize and reuse valuable microbiome big-data for further meta-analysis and comparison, several works have re-organized microbiome samples with a unified metadata format [62,63] and standard operating procedures (SOPs) [64] for sequence processing. GMrepo [65] is a database of well-organized and curated human gut metagenomes with consistently annotated metadata. GcMeta [66] features a data management system that is integrated with data analysis tools and workflows for archiving and publishing data in a standardized way. In addition, Qiita [67,68] allows users to perform meta-analysis across multiple studies, and to retrieve microbiomes that contain a specific feature (e.g. metadata, taxon terms and sequence fragments) by SQL-like queries.
Nevertheless, when new microbiomes are sequenced, it is still difficult to find which existing microbiomes in the repositories or databases have an overall composition similar to theirs, and thus to answer further questions such as predicting environmental conditions or human health status. To tackle this problem, a Microbiome Search Engine (MSE) [69] has been developed for rapid “community to communities” comparison and matching. Using a dynamic indexing strategy and a series of whole-microbiome-level similarity scoring functions [70,71], MSE enables real-time access to targeted microbiomes with a specific structure from a massive volume of data.
Another important barrier to integrating cross-study microbiome datasets is the technical variation of amplicon sequencing data from multiple sources and batches. Technical factors, including DNA extraction, PCR primers for marker genes, the amplified sub-regions of the marker gene, sequencing platforms and types of sequence reads, can significantly affect the comparison among datasets [72]. For biological studies with a large effect size, such as comparing environmental microbiomes from multiple habitat types, or human microbiomes from different body sites and from hosts with different ages, locations and diets, the technical differences can be outweighed by reference-based taxonomy assignment of 16S rRNA (e.g. mapping short reads to full-length 16S rRNA genes) [73,74], which makes cross-study integration meaningful. However, studies of more subtle effects still require unified experimental protocols for producing amplicon datasets. In contrast, shotgun WGS has been shown to be less sensitive to technical differences when studying the disease association and temporal dynamics of the microbiome [19,75], and is therefore an alternative option for integration and comparison of cross-study datasets.
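To make the idea of feature-based retrieval concrete, the following sketch runs a SQL query over a tiny in-memory SQLite database. The two-table schema, the sample identifiers and the `find_samples` helper are all hypothetical and chosen only for illustration; they do not reflect the actual schema or query interface of Qiita, GMrepo or GcMeta.

```python
import sqlite3

# Hypothetical schema: one table of curated metadata, one of taxon abundances.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (sample_id TEXT PRIMARY KEY, study TEXT, body_site TEXT);
CREATE TABLE abundances (sample_id TEXT, taxon TEXT, rel_abundance REAL);
""")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", [
    ("S1", "study_A", "gut"),
    ("S2", "study_A", "oral"),
    ("S3", "study_B", "gut"),
])
conn.executemany("INSERT INTO abundances VALUES (?, ?, ?)", [
    ("S1", "Bacteroides", 0.42),
    ("S2", "Streptococcus", 0.55),
    ("S3", "Bacteroides", 0.03),
    ("S3", "Prevotella", 0.61),
])


def find_samples(connection, body_site, taxon, min_abundance):
    """Return samples from any study that match a metadata term and a taxon cutoff."""
    query = """
        SELECT s.sample_id, s.study, a.rel_abundance
        FROM samples AS s
        JOIN abundances AS a ON a.sample_id = s.sample_id
        WHERE s.body_site = ? AND a.taxon = ? AND a.rel_abundance >= ?
        ORDER BY a.rel_abundance DESC
    """
    return connection.execute(query, (body_site, taxon, min_abundance)).fetchall()


hits = find_samples(conn, "gut", "Bacteroides", 0.10)
print(hits)
```

Such a query only works across studies when metadata fields (here `body_site`) use the same vocabulary everywhere, which is why unified metadata formats matter.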
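A “community to communities” search can be sketched as a top-k similarity query. The example below is a brute-force stand-in under stated assumptions: it scores profiles with one minus the Bray-Curtis dissimilarity rather than the phylogeny-aware scoring functions of MSE, it scans every database entry instead of using an index, and the database contents and function names are hypothetical.

```python
import heapq


def similarity(p, q):
    """1 - Bray-Curtis dissimilarity between two profiles (dict: taxon -> abundance)."""
    taxa = set(p) | set(q)
    diff = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    total = sum(p.values()) + sum(q.values())
    return 1.0 - diff / total if total else 0.0


def search(query, database, k=2):
    """Return the k most similar database entries as (score, sample_id), best first."""
    scored = ((similarity(query, profile), sample_id)
              for sample_id, profile in database.items())
    return heapq.nlargest(k, scored)


# Hypothetical database of relative-abundance profiles.
database = {
    "gut_1": {"Bacteroides": 0.6, "Prevotella": 0.3, "Faecalibacterium": 0.1},
    "gut_2": {"Bacteroides": 0.5, "Faecalibacterium": 0.5},
    "soil_1": {"Acidobacteria": 0.7, "Streptomyces": 0.3},
}
query = {"Bacteroides": 0.5, "Prevotella": 0.4, "Faecalibacterium": 0.1}

hits = search(query, database, k=2)
for score, sample_id in hits:
    print(sample_id, round(score, 3))
```

A linear scan like this costs time proportional to the database size for every query, which is the motivation for an indexing strategy when the database holds a massive number of samples.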
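Once reads from different studies have been assigned to a shared reference taxonomy, their feature tables can be aligned and normalized. The sketch below shows one simple way to do this, assuming hypothetical per-study count tables keyed by taxon name; it uses total-sum scaling as the normalization, which is only one of several options, and it does not correct for batch effects.

```python
def to_relative(counts):
    """Total-sum scaling: convert taxon counts to relative abundances."""
    total = sum(counts.values())
    return {taxon: c / total for taxon, c in counts.items()} if total else {}


def integrate(studies):
    """Merge per-study tables into one matrix over the union of taxa.

    studies: {study_name: {sample_id: {taxon: count}}}
    Returns (taxa, rows), where rows maps (study, sample) to a list of
    relative abundances ordered like taxa; absent taxa are filled with 0.0.
    """
    taxa = sorted({taxon
                   for samples in studies.values()
                   for profile in samples.values()
                   for taxon in profile})
    rows = {}
    for study, samples in studies.items():
        for sample_id, profile in samples.items():
            relative = to_relative(profile)
            rows[(study, sample_id)] = [relative.get(t, 0.0) for t in taxa]
    return taxa, rows


# Hypothetical count tables from two studies with different sequencing depths.
studies = {
    "study_A": {"a1": {"Bacteroides": 30, "Prevotella": 10}},
    "study_B": {"b1": {"Bacteroides": 5, "Faecalibacterium": 15}},
}
taxa, rows = integrate(studies)
print(taxa)
for key, values in rows.items():
    print(key, values)
```

Keeping the study name in each row key preserves the provenance needed to check later whether samples cluster by study rather than by biology.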

Data mining for status identification and classification
Since microbial communities shape the dynamics of ecological systems, ranging from the human gut to the ocean, one potential of the microbiome is to link variation in microbial composition to phenotypic and physiological statuses, which can inspire the development of new techniques for disease diagnosis, ecological dysbiosis detection and treatment evaluation. Previous studies have demonstrated the feasibility of ML methods [18,76] in disease detection and classification with human-associated microbiome data for inflammatory bowel disease (IBD) [77], colorectal cancer (CRC) [19], caries [78], etc., using extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN) and other ML algorithms. As a quantitative approach, ML-based indices have also been designed to assess the risk of potential diseases and to evaluate the effects of different treatments [79,80].
Typically, microbiome-based detection has to make an a priori assumption about a specific status (e.g. a disease) for the given samples, and seeks organismal or functional features (e.g. taxa or genes) that are unevenly distributed between disease and control samples as bio-markers. ML models are then trained and constructed using these bio-markers for disease recognition. Since the detection range of such models is restricted to the given status types, it is difficult to decide broadly whether a sample is healthy or not. Furthermore, extending a disease-specific model to other cohorts can be challenging due to the heterogeneity of microbiome data among populations [81]. In addition, the same bio-markers can be associated with multiple different diseases, which may also cause errors in multiple disease classification [82].
A search-based strategy for disease detection and classification has been developed, which detects abnormal samples by their search-based outlier novelty against a large number of samples from healthy subjects, and then identifies the specific disease type from the top hits of a search against samples from patients [83]. This whole-microbiome-level search and match strategy enables the identification of microbiome states associated with disease even in the presence of different cohorts, multiple sequencing platforms or significant contamination, although the software is currently implemented only for amplicon sequences processed by reference-based OTU picking.
Nowadays, the application of deep learning, such as deep neural networks (DNN) or convolutional neural networks (CNN), has extended from computer vision problems to the field of microbial biology [17]. Accelerated at the hardware level by parallel computing on multi-core CPUs and many-core GPUs, deep learning shows advantages in big-data integration and robustness to data heterogeneity [84], while the particular parameters of model construction still need to be optimized for each question. At the same time, the TensorFlow (https://www.tensorflow.org/) and PyTorch (https://pytorch.org/) packages provide easy implementation of artificial intelligence (AI) techniques in Python, driving the application of deep learning to microbial analysis in taxonomy identification [85], biomarker selection [86], and multiple disease detection and classification [87]. Another potential of deep learning in microbiome research is multi-label classification, which has been widely used in image processing [88]. It is common for a single microbiome specimen to be associated with more than one disease, and such samples have been collected by several studies [56,89]. Since current studies on microbiome and disease mainly focus on single-label classification, in which each sample has only one specific status, this situation could be addressed by further extension of AI techniques in the microbiome field.
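As an illustration of the simplest of the ML algorithms listed above, the sketch below implements a k-nearest-neighbor classifier on relative-abundance profiles with Bray-Curtis distance. The training profiles and labels are hypothetical toy data, the choice of distance and of k is an assumption, and a real study would use a validated library implementation with cross-validation.

```python
from collections import Counter


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance vectors over the same taxa."""
    denominator = sum(p) + sum(q)
    return sum(abs(x - y) for x, y in zip(p, q)) / denominator if denominator else 0.0


def knn_predict(train, query, k=3):
    """Predict a status label by majority vote among the k nearest training samples.

    train: list of (profile, label) pairs.
    """
    nearest = sorted(train, key=lambda item: bray_curtis(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical relative abundances over three taxa, with known statuses.
train = [
    ([0.70, 0.20, 0.10], "healthy"),
    ([0.60, 0.30, 0.10], "healthy"),
    ([0.65, 0.25, 0.10], "healthy"),
    ([0.20, 0.20, 0.60], "IBD"),
    ([0.10, 0.30, 0.60], "IBD"),
    ([0.15, 0.15, 0.70], "IBD"),
]

print(knn_predict(train, [0.20, 0.25, 0.55]))
print(knn_predict(train, [0.68, 0.22, 0.10]))
```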
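The bio-marker screening step described above can be sketched as ranking each taxon by how well its abundance separates disease from control samples. The example assumes a per-feature AUC (the probability that a case value exceeds a control value, equivalent to a scaled Mann-Whitney U statistic) as the separation score; this is one of many possible criteria, the toy data are hypothetical, and no multiple-testing correction is applied.

```python
def auc(case_values, control_values):
    """Probability that a random case value exceeds a random control value.

    Ties count as one half; 0.5 means no separation between the groups.
    """
    wins = 0.0
    for x in case_values:
        for y in control_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def rank_biomarkers(cases, controls, taxa):
    """Rank taxa by |AUC - 0.5|, so enriched and depleted taxa both score high."""
    scores = []
    for j, taxon in enumerate(taxa):
        a = auc([sample[j] for sample in cases], [sample[j] for sample in controls])
        scores.append((abs(a - 0.5), taxon, a))
    return sorted(scores, reverse=True)


# Hypothetical relative abundances (rows: samples, columns: taxa).
taxa = ["Fusobacterium", "Bacteroides", "Roseburia"]
cases = [[0.30, 0.40, 0.05], [0.25, 0.45, 0.02], [0.35, 0.38, 0.04]]
controls = [[0.02, 0.42, 0.20], [0.05, 0.39, 0.25], [0.01, 0.44, 0.18]]

ranking = rank_biomarkers(cases, controls, taxa)
for strength, taxon, a in ranking:
    print(taxon, "AUC =", round(a, 3))
```

Because the ranking depends entirely on the chosen case and control groups, markers selected this way inherit the a priori status assumption discussed above.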
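The two-stage logic of the search-based strategy can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the novelty score is defined here as one minus the mean similarity of the top hits among healthy samples, the similarity is the sum of per-taxon minima of relative abundances, and the threshold, the toy data and all function names are hypothetical.

```python
from collections import Counter


def similarity(p, q):
    """Shared abundance between two relative-abundance vectors (1.0 = identical)."""
    return sum(min(x, y) for x, y in zip(p, q))


def novelty(query, healthy, k=2):
    """Assumed novelty score: 1 - mean similarity of the k best healthy matches."""
    best = sorted((similarity(query, h) for h in healthy), reverse=True)[:k]
    return 1.0 - sum(best) / len(best)


def diagnose(query, healthy, patients, k=2, threshold=0.3):
    """Stage 1: flag outliers against healthy samples. Stage 2: vote among patient top hits."""
    if novelty(query, healthy, k) <= threshold:
        return "healthy"
    top = sorted(patients, key=lambda item: similarity(query, item[0]), reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]


# Hypothetical reference sets over three taxa.
healthy = [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.65, 0.25, 0.10]]
patients = [
    ([0.20, 0.20, 0.60], "IBD"),
    ([0.10, 0.30, 0.60], "IBD"),
    ([0.10, 0.80, 0.10], "CRC"),
    ([0.15, 0.75, 0.10], "CRC"),
]

print(diagnose([0.20, 0.25, 0.55], healthy, patients))
print(diagnose([0.66, 0.24, 0.10], healthy, patients))
```

Note that the first stage needs no disease label at all, which is what makes the approach status-assumption-free: a sample is flagged simply because nothing healthy resembles it.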
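The difference between single-label and multi-label classification lies mainly in the output encoding and the loss. The sketch below, written in plain Python rather than TensorFlow or PyTorch, shows the usual multi-label formulation: a multi-hot target vector, an independent sigmoid per label, and a mean binary cross-entropy loss. The label set and the logit values are hypothetical stand-ins for the outputs of a trained network.

```python
import math

LABELS = ["IBD", "CRC", "caries"]  # hypothetical label set


def encode(statuses):
    """Multi-hot target vector: one sample may carry several statuses at once."""
    return [1.0 if label in statuses else 0.0 for label in LABELS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_loss(logits, target):
    """Mean binary cross-entropy, with each label scored independently."""
    probs = [sigmoid(z) for z in logits]
    terms = [t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
             for p, t in zip(probs, target)]
    return -sum(terms) / len(terms)


def decode(logits, threshold=0.5):
    """Return every label whose independent probability passes the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


target = encode({"IBD", "caries"})
good_logits = [2.0, -3.0, 1.0]   # hypothetical network outputs
bad_logits = [-2.0, 3.0, -1.0]

print(target)
print(decode(good_logits))
print(multilabel_loss(good_logits, target), multilabel_loss(bad_logits, target))
```

A softmax output would force the label probabilities to sum to one and so could report only a single status, whereas independent sigmoids allow any combination of statuses to be predicted for the same specimen.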

Outlook of multi-omics data analysis
Studying “what organisms exist in a microbial community” and “what a microbial community can do” is no longer adequate to fully understand the interactions between the microbiome and its environment. Although profiling by DNA sequencing surveys the functional genes in a microbial community, the functional activities and gene expression of cells, and the metabolite products that reflect their biosynthetic features, remain unclear. Multi-omics data analysis of the microbiome [90] utilizes chemical and biological approaches to provide a comprehensive view of “what a microbial community is doing”, investigating a microbial community from the further dimensions of metatranscriptomics [91], metaproteomics [92], metabolomics [93] and viromics [94]. Some previous works have demonstrated the in-depth and unique insights that multi-omics data offer for understanding the human microbiome [95,96]. Nevertheless, the data types and computational tools are mostly omics-specific, e.g. software for metagenomic sequencing is not compatible with the RNA-seq data of metatranscriptomics or the mass spectrometry data of metabolomics, so that combinations of multiple tools tend to be case-specific, inextensible and irreproducible. Recently, a workflow named IMP (Integrated Meta-omic Pipeline) was released to perform automatic, standardized and flexible analysis that incorporates metagenomic and metatranscriptomic data [97]. This open-development framework strategy enhances the integrated analysis of different data types and the interpretation of results from multiple aspects, and promotes a general paradigm for microbiome multi-omics research.
Sequencing-based analysis is not routinely used in clinical or industrial applications, mainly because data generation by sequencers usually takes at least 2 days [98]. At present, fluorescence-activated cell sorting (FACS) approaches have been developed for rapid functional cell sorting, based on the labeling of cells for target proteins, metabolites or nucleic acids [99]. A new series of label-free, single-cell-level imaging tools using Raman-activated cell sorting (RACS) has also been proposed for the taxonomy or status identification of individual cells in a microbial community [100,101]. Because it is an imaging approach, obtaining the Raman spectrum can be non-destructive to the cell and does not require external labeling or preexisting biomarkers. More importantly, since FACS or RACS takes only seconds to profile each cell, such techniques can be considered single-cell-resolution approaches that monitor the microbiome with high throughput and low time cost.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.