AB_SA: Tracing the source of bacterial strains based on accessory genes. Application to Salmonella Typhimurium environmental strains

Partitioning pathogenic strains isolated from environmental or human cases to their original source is challenging. Such pathogens usually colonize multiple animal hosts, including livestock, which contaminate food production settings and the environment (e.g. soil and water), posing an additional public health burden and making the source difficult to identify. Genomic data open new opportunities for the development of statistical models aiming to infer the likely source of pathogen contamination. Here, we propose a computationally fast and efficient multinomial logistic regression (MLR) source attribution classifier to predict the animal source of bacterial isolates based on “source-enriched” loci extracted from the accessory-genome profiles of a pangenomic dataset. Depending on the accuracy of the model’s self-attribution step, the modeler selects the number of candidate accessory genes that best fit the model for calculating the likelihood of (source) category membership. The accessory genes-based source attribution (AB_SA) method was applied to a dataset of strains of Salmonella Typhimurium and its monophasic variants (S. 1,4,[5],12:i:-). The model was trained on 69 strains with known animal source categories (i.e., poultry, ruminant, and pig). The AB_SA method identified eight genes as predictors among the 2,802 accessory genes. The self-attribution accuracy was 80%. The AB_SA model was then able to classify 25 of 29 S. Typhimurium and S. 1,4,[5],12:i:- isolates collected from the environment (considered as unknown source) into a specific category (i.e., animal source), with a probability of more than 85%. The AB_SA method described here provides a user-friendly and valuable tool to perform source attribution studies in a few steps. AB_SA is written in R and freely available at https://github.com/lguillier/AB_SA.

Author Notes
All supporting data, code, and protocols have been provided within the article and through supplementary data files.
Supplementary material is available with the online version of this article.

Abbreviations
AB_SA, accessory-based source attribution; MLR, multinomial logistic regression; SNPs, single nucleotide polymorphisms; GFF, general feature format; AIC, Akaike information criterion.

Data Summary
The AB_SA model is written in R, open-source and freely available on GitHub under the GNU GPLv3 licence (https://github.com/lguillier/AB_SA). All sequencing reads used to generate the assemblies analyzed in this study have been deposited in the European Nucleotide Archive (ENA) (http://www.ebi.ac.uk/ena) under project number PRJEB16326. Genome metadata and ENA run accession IDs for all the assemblies are reported in the supplementary material.

Impact Statement
This article describes AB_SA (“Accessory-Based Source Attribution method”), a novel approach for source attribution based on “source-enriched” accessory genomics data and unsupervised multinomial logistic regression. We demonstrate that the AB_SA method enables the animal source prediction of large-scale datasets of bacterial populations through rapid and easy identification of source predictors from the non-core genomic regions. Herein, AB_SA correctly self-attributes the animal source of a set of S. Typhimurium and S. 1,4,[5],12:i:- isolates and further classifies 84% of the strains contaminating natural environments in the pig category (with high probability ranging between ∼85 and ∼99%).


Tracing the origin of pathogenic microbial strains associated with human diseases or contamination of environmental settings is crucial for identifying targets for interventions in the food production chain from farm to fork. The process of estimating the probability that human cases or environmental contamination cases arise from putative sources of infection (i.e. animal reservoir, food product, and environment) can be referred to as source attribution. A variety of methodological approaches have been developed for source attribution of foodborne pathogens: epidemiological approaches (e.g., outbreak data analysis, case-control/cohort studies), microbial subtyping methods, comparative exposure assessment, intervention studies, and expert elicitations (1; 2).

Some of the source attribution methods based on microbial sub-typing specifically consider genotypic [...] trees from a matrix of proximities for each pair of strains, with these genetic proximities being typically calculated using the methods proposed by (4) or (5). Once the tree is built, it can be 'cut' at a certain point (e.g., after three levels of nodes from the root) to define the different clusters of strains (more or less equivalent to sub-types). Visually exploring the composition of the clusters (i.e., isolates from different backgrounds) provides a general overview for inferring sources and transmission. However, this approach has rarely been applied in source attribution, because inference by phylogeny relies upon the robustness of the tree built on the genetic diversity between isolates, and the strains to attribute and the strains from sources are usually phylogenetically intermixed (6). Indeed, closely related strains can be found in multiple hosts, which challenges the association to a specific source by phylogenetic clustering (7). A particular case showing the utility of phylogenetic methods in the attribution of human salmonellosis to specific sources (e.g., turkey), by using epidemiological and genomic data, has been reported through the investigation of Salmonella Derby genetic diversity (8).

A very different approach relies on the assumption that genetic data (e.g., frequency of different allele numbers at a locus) can be explained by a probabilistic model whose parameters are unknown. Comparing genetic data (frequencies) among different strain populations makes it possible to establish a link between them, e.g., between strains from human cases and different sources. Two structured population genetics models that are currently widely used for source attribution of foodborne diseases are the so-called STRUCTURE approach (9) and the Asymmetric Island Model (AIM) (10). These two models are based on different principles of genetic structuring of microbial populations, but the overall attribution approach is similar. These approaches have been successfully applied for source attribution of human sporadic strains for Campylobacter spp. (11; 12), Salmonella spp. and Listeria monocytogenes (13).

Machine learning approaches are gaining interest for identifying the underlying genetic features associated with traits of microbial pathogens (14), and their use is also discussed for tracing the origin of an outbreak (15). For source attribution, recent studies consider this approach for predicting the source of [...] genotypic variation (e.g., the differential composition of accessory genes of multi-host lineages) to specific zoonotic niches. The objective of this study was therefore to assess the performance of multinomial logistic regression in source attribution. In particular, the method was used to predict to which animal reservoir environmental strains of Salmonella Typhimurium (S. Typhimurium) and its monophasic [...]

High-quality assemblies of 98 bacterial isolates (see "Application to Salmonella Typhimurium and its monophasic variant genomes dataset" below) were generated by the DTU FoodQCPipeline (https://bitbucket.org/RolfKaas/foodqcpipeline). In short, the FoodQCPipeline trimmed the raw reads using bbduk2.

Reads were then de novo assembled using SPAdes v3.11.05 (24) in the last step of the pipeline. FastQC v0.11.5 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was applied at multiple steps of read processing (e.g., before and after trimming), generating a quality control report for each sample. The quality of the de novo assemblies was finally assessed using Quast v4.5 (25). In addition, a maximum-likelihood phylogenetic reconstruction of the 98-genome dataset was built on single-nucleotide polymorphisms (SNPs) identified in the core genome alignments to assess the applicability of the dataset in (26). The annotated tree shows that environmental isolates (i.e. unknown source) were intermixed with potential sources, and the dataset is therefore eligible for source attribution.

The Accessory-Based Source Attribution (AB_SA) method is based on genomic data. The method is a two-step process. First, the accessory genes enriched in the different sources are calculated (see "Preparation of input data for multinomial logistic model pangenome analysis" below). Then, a multinomial logistic regression is developed to predict the probability of animal source membership of environmental isolates based on "source-enriched" accessory genes.
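For reference, the multinomial logistic regression used in the second step can be written in its standard textbook form. The notation below is an assumption made for illustration: x denotes the presence/absence vector of the source-enriched genes for one strain, K is the number of sources, source K is taken as the reference category, and \beta_{k,0} and \beta_k are the intercept and coefficient vector of source k.

```latex
% Standard multinomial logistic regression with K sources (source K = reference).
% Assumed notation: x = gene presence/absence vector, \beta_{k,0} = intercept,
% \beta_k = coefficient vector for source k.
\begin{align}
\ln \frac{P(Y = k \mid x)}{P(Y = K \mid x)}
  &= \beta_{k,0} + \beta_k^{\top} x,
  & k &= 1, \dots, K-1 \\
P(Y = k \mid x)
  &= \frac{\exp\left(\beta_{k,0} + \beta_k^{\top} x\right)}
          {1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} x\right)},
  & k &= 1, \dots, K-1 \\
P(Y = K \mid x)
  &= \frac{1}{1 + \sum_{j=1}^{K-1} \exp\left(\beta_{j,0} + \beta_j^{\top} x\right)}
\end{align}
```

The K probabilities sum to one, so the probability for the reference source follows directly from the K-1 fitted equations.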

Multinomial regression is thus used to explain the relationship between one nominal dependent variable (with more than two levels), that is, the source, and one or more independent variables, i.e. the enriched genes. For a source attribution situation with K sources, the multinomial regression model estimates k [...] For the final source, the probability of association is derived from the K-1 other equations: [...]

When feeding the MLR model with the source-enriched genes, a maximum number of genes to consider for predicting the source is arbitrarily selected. In order to select the optimal set of predictors, different numbers of genes (from 1 to 5) were tested, and for each case, accuracy and AIC were assessed (Table 1). To perform an accurate animal source prediction, it is necessary to select the gene set that best discriminates pig-, poultry- and ruminant-related genomes. When testing the ability to classify strains with known source by randomly selecting 70% and 30% of genomes for training and testing, respectively, accuracy ranged from 0.71 to 0.82 according to the different genes included in the model (Table 1). Yet the confidence intervals of the accuracies are large, and the models could be considered equivalent in accuracy.
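The two criteria discussed above can be illustrated with a minimal Python sketch (AB_SA itself is written in R; this is not the authors' code). It assumes a Wilson score interval for the accuracy, since the interval method is not stated here, and uses the usual definition AIC = 2k - 2 ln(L). The test-set counts and log-likelihoods are hypothetical values chosen only to show the arithmetic: 30% of 69 genomes is about 21 test strains, which is why the intervals are wide.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    p_hat = n_correct / n_total
    denom = 1 + z ** 2 / n_total
    centre = (p_hat + z ** 2 / (2 * n_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z ** 2 / (4 * n_total ** 2)
    )
    return centre - half_width, centre + half_width


def n_parameters(n_genes, n_sources):
    """Free parameters of an MLR: (K-1) equations, each with one intercept
    and one coefficient per gene."""
    return (n_sources - 1) * (n_genes + 1)


def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical test set: 17 of 21 genomes correctly attributed.
low, high = wilson_interval(17, 21)
print(f"accuracy = {17 / 21:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")

# Hypothetical log-likelihoods for candidate models of different sizes
# (number of genes -> log-likelihood), with three sources.
log_likelihoods = {1: -60.0, 2: -52.0, 3: -47.0, 4: -45.5, 5: -45.0}
aic_by_size = {
    genes: aic(ll, n_parameters(genes, 3)) for genes, ll in log_likelihoods.items()
}
best_size = min(aic_by_size, key=aic_by_size.get)
print(aic_by_size, "-> lowest AIC with", best_size, "gene(s)")
```

Because AIC penalizes each added coefficient, a larger model is preferred only when its gain in likelihood outweighs the extra parameters, which makes it a useful tie-breaker when accuracies overlap.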

AIC values help to distinguish the best model among the tested ones. In this study, the model including a total of eight genes as predictors (Table 1) was found to be the best model (with the lowest AIC value).
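Once a gene set has been selected, a fitted multinomial logistic model turns the gene presence/absence profile of a strain into a membership probability for each source. The Python sketch below illustrates that step under stated assumptions; it is not the AB_SA implementation. The gene names, coefficient values, choice of pig as reference category, and the 0.85 decision threshold are all hypothetical.

```python
import math


def source_probabilities(profile, coefficients, reference):
    """Return P(source | profile) for every source.

    profile: dict gene -> 0/1 (absent/present)
    coefficients: dict source -> (intercept, {gene: beta}) for each
                  non-reference source
    reference: name of the reference source (its linear score is fixed at 0)
    """
    scores = {reference: 0.0}
    for source, (intercept, betas) in coefficients.items():
        scores[source] = intercept + sum(
            beta * profile.get(gene, 0) for gene, beta in betas.items()
        )
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {source: math.exp(score - top) for source, score in scores.items()}
    total = sum(weights.values())
    return {source: weight / total for source, weight in weights.items()}


def attribute(profile, coefficients, reference, threshold=0.85):
    """Return (source, probabilities); source is None when no probability
    reaches the threshold."""
    probs = source_probabilities(profile, coefficients, reference)
    best = max(probs, key=probs.get)
    return (best if probs[best] >= threshold else None), probs


# Hypothetical coefficients for two placeholder genes.
example_coefficients = {
    "poultry": (0.0, {"gene_a": 5.0}),
    "ruminant": (0.0, {"gene_b": 5.0}),
}
label, probs = attribute({"gene_a": 1, "gene_b": 0}, example_coefficients, "pig")
print(label, {source: round(p, 3) for source, p in probs.items()})
```

Returning the full probability vector alongside the label keeps low-confidence strains visible instead of forcing them into a category.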

This set of "best" predictive genes is therefore used by the AB_SA model to classify genomes with unknown source. The relative importance of each predictor in estimating the model is calculated (Fig. 2). This statistical measure relates to the weight of each predictor in making a prediction, not to whether the prediction is accurate. Fig. 2 presents the values of the fitted β_k parameters. It shows that some genes have a higher weight than others. For example, the presence of group 6195 is strongly associated with bovine sources. In the same way, group 852 has the highest coefficients for poultry. [...] sources.

Six strains (i.e., 9, 12, 14, 25, 28, and 29) have a very high membership probability, that is, above 99%, for one of the three sources. The majority of the strains (n=19) have a high probability, ranging between 85 and 95%, of being associated with pig sources. For the four remaining strains (i.e., 5, 7, 11, and 24), the probability of being associated with a specific source is lower than 80% (e.g., ranging from ~39% to ~77%).

The environment is not a natural reservoir of Salmonella Typhimurium. This study thus focused on attribution of 29 Salmonella strains isolated from the environment (i.e. river and brackish water, soil, and crab) to potential sources using accessory genes. The animal sources were grouped into three major [...] catalysis for ATP synthesis, transport and motility) (33) as well as DNA packaging and lysis (e.g., DNA breaking-rejoining protein, lysozyme, prophage tail assembly protein) (Table 2).

Interestingly, the majority (n=7/8) of these predictors were located in mobile genetic element regions