Employing toxin-antitoxin genome markers for identification of Bifidobacterium and Lactobacillus strains in human metagenomes

Recent research has indicated that, in addition to a unique genotype, each individual may also have a unique microbiota composition. Differences in microbiota composition may arise at both the species and the strain level. Knowing the precise composition is especially important for the gut microbiota (GM), since it can contribute to health assessment, personalized treatment, and disease prevention for individuals and groups (cohorts). The existing methods for determining species and strain composition in microbiota are not always precise and are usually not easy to use. Probiotic bacteria of the genera Bifidobacterium and Lactobacillus are an essential component of the human GM. We have previously shown that in certain Bifidobacterium and Lactobacillus species the RelBE and MazEF superfamilies of toxin-antitoxin (TA) systems may be used as functional biomarkers to differentiate these groups of bacteria at the species and strain levels. We have compiled a database of TA genes of these superfamilies for all Lactobacillus and Bifidobacterium species with complete genome sequences and confirmed that in all of these species TA gene composition is species and strain specific. To analyze the species and strain composition of these two genera in the human GM, we developed the TAGMA (toxin antitoxin genes for metagenomes analyses) software, which is based on polymorphism in TA genes. TAGMA was tested on gut metagenomic samples. Our analysis shows that TAGMA can be used to characterize species and strains of Lactobacillus and Bifidobacterium in metagenomes.

disease development. The effect substantially depended on the particular bacterial strain being administered, including Bifidobacterium and Lactobacillus strains [23,24].

We suppose that TASs can provide additional functional markers for metagenomic analysis of species and strain diversity of the genera Lactobacillus and Bifidobacterium [25]. To this end, we created a database of MazEF and RelBE chromosomal toxin and antitoxin genes in all complete genomes of the genera Bifidobacterium and Lactobacillus, together with the TAGMA software, which performs species and strain identification. We tested TAGMA on the identification of species and strains of lactobacilli and bifidobacteria in 147 metagenome samples from the Human Microbiome Project (https://www.hmpdacc.org/HMASM/, subtab "stool") as well as in 5 in-house samples (see Methods: In-house metagenome characterization). The results were compared with those obtained with two other programs, PhymmBL and MetaPhlAn. TAGMA identifies species of lactobacilli and bifidobacteria with prediction quality at least comparable to that of these programs, although it may somewhat underperform PhymmBL. Because it relies on a small number of carefully selected markers, these markers can be obtained with very deep targeted sequencing, and TAGMA runs much faster than PhymmBL or MetaPhlAn. In some cases TAGMA can also identify individual bacterial strains, an option that is not implemented in MetaPhlAn or PhymmBL.

Since some marker positions may not be covered by reads, while others could appear due to sequencing errors, the discernibility matrix G(g x g) is constructed in the fourth stage. Here g is the number of detected genes or gene variants; G(1, 2) = 1 means that gene 1 cannot be distinguished from gene 2 (0 otherwise). Due to fragmentary read coverage, some genes that are theoretically distinguishable by the complete marker set become indistinguishable with the observed marker set. TAGMA reports such cases and outputs the smallest possible group of genomes that can still be distinguished with the observed set of markers (Fig. 1).
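The construction of the gene discernibility matrix can be illustrated with a minimal Python sketch. It is not the TAGMA implementation: the function name `gene_discernibility`, the representation of gene variants as aligned strings, and the example sequences are all assumptions made for illustration, and sequencing errors are not modeled. Two variants are treated as indistinguishable when they agree at every marker position that is covered by reads.

```python
from typing import Dict, List, Set, Tuple


def gene_discernibility(
    variants: Dict[str, str], covered: Set[int]
) -> Tuple[List[str], List[List[int]]]:
    """Return sorted variant names and a matrix G in which G[i][j] == 1
    means variant i cannot be told apart from variant j when only the
    covered positions are observed (0 otherwise)."""
    names = sorted(variants)
    matrix = [
        [
            1 if all(variants[a][p] == variants[b][p] for p in covered) else 0
            for b in names
        ]
        for a in names
    ]
    return names, matrix


# Hypothetical aligned variants of one gene (0-based positions).
example_variants = {"v1": "ACGT", "v2": "ACGA", "v3": "TCGT"}

# Position 0 is not covered by reads, so v1 and v3 become indistinguishable.
names, G = gene_discernibility(example_variants, covered={1, 2, 3})
for name, row in zip(names, G):
    print(name, row)
```

With all four positions covered the same function returns the identity matrix, which corresponds to the case where the complete marker set separates every variant.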

In the fifth step, the strain discernibility matrix St(s x s) is built. Here s is the number of strains that have at least one detected genetic marker. For instance, St(3,4) = 1 means that strain number three is not distinguishable from strain number four. This matrix is not symmetric. One cannot distinguish strain 1 from strain 2 if the coverage of all detected genes in strain 1 is not sufficient to distinguish variants of these genes from those in strain 2 (or if these genes are completely identical). However, strain 2 can still be distinguished from strain 1 if strain 2 contains another set of detected genes or at least one gene variant that has a characteristic marker position. metagenomic samples are not fully covered with reads and information can be missing due to sequencing depth deficiency.
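The asymmetry of the strain matrix can be seen in a short Python sketch. This is an illustrative reading of the description, not TAGMA code: the function `strain_discernibility`, the mapping `group_of` (which assigns each detected gene variant to a hypothetical group of mutually indistinguishable variants), and the strain and gene names are assumptions. Strain a is marked as not distinguishable from strain b when every detected variant group of a also occurs in b.

```python
from typing import Dict, List, Set, Tuple


def strain_discernibility(
    strain_variants: Dict[str, Set[str]], group_of: Dict[str, str]
) -> Tuple[List[str], List[List[int]]]:
    """Return sorted strain names and a matrix St in which St[i][j] == 1
    means strain i is not distinguishable from strain j."""
    names = sorted(strain_variants)
    # Replace each detected variant by its indistinguishability group.
    groups = {s: {group_of[v] for v in strain_variants[s]} for s in names}
    # Strain a cannot be separated from b if b carries all groups seen in a.
    matrix = [[1 if groups[a] <= groups[b] else 0 for b in names] for a in names]
    return names, matrix


# Hypothetical detected variants per strain.
strain_variants = {
    "strain1": {"relB_v1"},
    "strain2": {"relB_v2", "mazF_v1"},
}
# relB_v1 and relB_v2 are assumed indistinguishable at the observed coverage.
group_of = {
    "relB_v1": "relB_groupA",
    "relB_v2": "relB_groupA",
    "mazF_v1": "mazF_groupA",
}

strains, St = strain_discernibility(strain_variants, group_of)
for name, row in zip(strains, St):
    print(name, row)
```

In this example strain1 is not distinguishable from strain2, while strain2 is distinguishable from strain1 because it carries an additional detected gene, so the matrix is not symmetric.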

In the sixth step, sets of indiscernible strains are derived from the strain discernibility matrix. The number of strains that cannot be distinguished from each other and from the target strain is used as the measure of performance. This measure is lower for better detected strains.
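One possible way to derive such sets and the associated measure is sketched below in Python. The text does not specify whether indiscernibility must hold in both directions; this sketch assumes that it must (a one-directional reading would test only `St[target][j]`). The function names and the example matrix are hypothetical.

```python
from typing import List


def indiscernible_with(St: List[List[int]], target: int) -> List[int]:
    """Indices of other strains that are mutually indiscernible from the
    target strain (assumption: both directions of St must equal 1)."""
    return [
        j
        for j in range(len(St))
        if j != target and St[target][j] == 1 and St[j][target] == 1
    ]


def performance_measure(St: List[List[int]], target: int) -> int:
    """Number of strains that cannot be separated from the target strain;
    lower values mean the strain is better resolved."""
    return len(indiscernible_with(St, target))


# Hypothetical strain discernibility matrix for three strains.
example_St = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
]

measures = [performance_measure(example_St, t) for t in range(len(example_St))]
print(measures)
```

Under this assumption strains 0 and 1 form one indiscernible set, each with a measure of 1, while strain 2 has a measure of 0.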

TAGMA can be used for identification of Lactobacillus and Bifidobacterium species and individual strains in metagenomes.

In-house metagenome characterization

We used the gut metagenomes (faeces) isolated from people living in the Central region

We performed a similar analysis for strains. The distribution of toxin and antitoxin genes is markedly strain specific (see Fig. S1). Strains belonging to the same species of bacteria have similar but not identical sets of T and A genes. Consequently, these genes can be used to identify

Table 2 illustrates that TAGMA gives more information than MetaPhlAn2 but less than PhymmBL. The advantage of TAGMA, however, is that it can analyze a metagenome down to the strain level (Tables S1, S2) if the metagenome contains a specific set of TA genes, or at least to the level of a group of strains if these strains contain identical T and A genes.

(Table S3). Table S3 shows the strains (or groups of strains) of Lactobacillus and Bifidobacterium that satisfied the following conditions: the TA genus coverage is more than 60% and the number of markers detected in a strain is more than two (or 50%).
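The selection conditions for Table S3 can be expressed as a simple filter. The Python sketch below is only one reading of those conditions: it assumes that "(or 50%)" means more than half of the markers known for the strain, and the function name, parameter names, and example numbers are hypothetical.

```python
def passes_table_filter(
    coverage: float,
    markers_detected: int,
    markers_total: int,
    min_coverage: float = 0.60,
    min_markers: int = 2,
    min_marker_fraction: float = 0.50,
) -> bool:
    """Return True if a strain (or group of strains) meets both conditions:
    coverage above 60%, and more than two detected markers or, as assumed
    here, more than 50% of the strain's markers detected."""
    enough_markers = markers_detected > min_markers or (
        markers_total > 0
        and markers_detected / markers_total > min_marker_fraction
    )
    return coverage > min_coverage and enough_markers


# Hypothetical strains: (coverage, markers detected, markers in total).
print(passes_table_filter(0.75, 3, 10))  # more than two markers detected
print(passes_table_filter(0.75, 2, 3))   # two of three markers, above 50%
print(passes_table_filter(0.50, 5, 5))   # coverage too low
```

The thresholds are exposed as keyword arguments so that a stricter or looser reading of the conditions can be tried without changing the function body.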

In this report, we show that genes of type II TASs can be used as functional markers for computer-assisted species and strain characterization of lactobacilli and bifidobacteria in the human gut microbiota. A database of toxin and antitoxin genes for these two genera of bacteria has been created, and it has been shown that the distribution of TASs is species- and strain-specific.

Figure 1. The algorithm of the TAGMA software.