Classification of Neisseria meningitidis genomes with a bag-of-words approach and machine learning

Summary Whole genome sequencing of bacteria is important to enable strain classification. Using entire genomes as an input to machine learning (ML) models would allow rapid classification of strains while using information from multiple genetic elements. We developed a “bag-of-words” approach to encode, using SentencePiece or k-mer tokenization, entire bacterial genomes and analyze these with ML. Initial model selection identified SentencePiece with 8,000 and 32,000 words as the best approach for genome tokenization. We then classified in Neisseria meningitidis genomes the capsule B group genotype with 99.6% accuracy and the multifactor invasive phenotype with 90.2% accuracy, in an independent test set. Subsequently, in silico knockouts of 2,808 genes confirmed that the ML model predictions aligned with our current understanding of the underlying biology. To our knowledge, this is the first ML method using entire bacterial genomes to classify strains and identify genes considered relevant by the classifier.

Classification of Neisseria meningitidis genomes with a bag-of-words approach and machine learning

Identification of genes driving classification through an in-silico knock-out approach
Strain identification and genome classification based on entire genomes is not yet possible due to the size of genomes, which for bacteria are typically in the order of magnitude of millions of base pairs.

INTRODUCTION
Whole genome sequencing of bacteria has many applications, including strain characterization, population genomics, monitoring antibiotic resistance, and outbreak investigation.The number of bacterial genomes stored in public archives (>1.2 million 1,2 on March 9, 2023) is increasing exponentially due to decreasing costs and the higher resolution of new technologies. 1,3,4However, despite the tremendous number of bacterial genomes available, it is not yet possible to use entire genomes for classification.][7][8][9][10] In the future, we anticipate that ML will also provide statistical approaches for the fine-mapping of data from genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies, 11 predict the impact of non-coding variants on disease, 12,13 and predict the regulatory activity of sequences based on cross-species data. 14The greatest challenge related to human and murine genomes, as well as other mammalian genomes, is the length of the sequence provided as the input to the classifier: 3.055 and 2.5 billion base pairs (bp), respectively, for humans and mice. 15,16he smaller size of bacterial genomes, from 0.6 to 8 million bp (Mbp), 17 makes them more manageable than mammalian ones; for example, the Neisseria meningitidis genome contains about 2.3 Mbp. 18,19Recent methods have been proposed to input entire prokaryotic genomes into deep neural networks.However, deep learning methods cannot handle extremely long sequences such as bacterial genomes.At the same time, the number of genomic sequences available is rapidly increasing, albeit not at the same pace as observed for human genomes, to the point that we can now use them for classification purposes.
In addition to difficulties due to the genome size, genome assemblies obtained using short-read sequencing techniques are difficult for ML models to handle because the genome is divided into contigs of which the direction is unknown and which can be incomplete or redundant.This variability makes it even more challenging to use sequential approaches unless multiple alignment is applied; however, multiple alignment is computationally too demanding as the number of genomes in public repositories increases.Furthermore, bacterial genomes exhibit great variability, including accessory genes not shared by the whole population, 20 making it difficult to understand which subsequences should be compared among different genomes.
Comparative genomic methods have been developed to tackle these challenges and enable the use of entire genomes for classification purposes.One of these is the phyletic patterns approach that generates a binary presentation (a series of 1s and 0s) of the presence or ll OPEN ACCESS absence of genes in different genomes and then compares the binary presentations. 21The phyletic patterns approach is useful for analyzing prokaryote evolution 22 or identifying alien genes in a genome. 23Another comparative genomic method is the fuzzy profile method that generates profiles based on sets of genes present in a specific genome and compares these sets between genomes. 24Both methods have a low resolution that does not consider subtle changes in genes that can greatly affect gene function.
Many comparative genomic methods have been developed that do not focus on the presence or absence of genes or sets of genes, but instead use entire genomic sequences for alignment-free comparisons.These are mainly word-based methods, based on the frequencies of subsequences of a defined length, and information theory-based methods, based on the informational content of full-length sequences. 25An example of a word-based method is the comparison of prokaryotic genomes using k-mers to reconstruct phenetic trees. 26An example of an information theory-based method is the comparison of relative information between sequences using Lempel-Ziv complexity for building phylogenetic trees. 27,28None of the available alignment-free tools use whole genomes to predict phenotypes.
We developed a comparative genomic method based on the hypothesis that the frequency of genomic subsequences within a genome (and within a collection of genomes) could be used to predict a phenotype of interest.Inspired by SentencePiece (SP), which was developed to identify words in non-segmented languages like Chinese, Korean, and Japanese, we extracted the frequency information.Hereto, we applied tokenization (replacing subsequences with token words) and a ''bag-of-words'' approach (counting the frequencies of the token words) to encode bacterial genomes in their entirety and use them as inputs to standard ML methods for classification.The bag-of-words approach does not require position information, thus allowing for handling whole genomes without the need for alignment or assembly.In contrast to existing approaches, this approach would allow for rapid classification while using all information provided by multiple, possibly distant, genetic elements as well as from intergenic regions, and not being hampered by variability in gene presence or short sequence reads.
Two text tokenizers were chosen to create the vocabularies: SP and a k-mer representation.SP is an unsupervised text tokenizer mainly used in text generation systems based on neural networks where the vocabulary size is predetermined. 29SP allowed end-to-end English-Japanese translation, with performances similar to ML algorithms, thus overcoming the challenges of non-segmented languages.Similarly, we applied SP to genomic sequences as uninterrupted texts.5][36] After creating the vocabularies with either SP or k-mer tokenization, the bagof-words method counts the occurrence of each word (the bag-of-words) in the genome, regardless of relative position, direction, or order.
Our main objective was to determine whether the bag-of-words approach is suitable for classifying phenotypes of N. meningitidis genomes.For this, two classification tasks were set.The first was to characterize a purely genetic problem, the capsule type classification, and the second was the multifactor problem of invasive phenotype classification.The main difference between our approach and sequential approaches is that we did not consider positional information due to the unknown direction of the contigs; this is not feasible for bacterial genomes, as it would require a computationally expensive genome alignment.The secondary objective was to determine whether our ML model is really learning, by studying genes predicted by the model to be important for invasiveness and determining whether these genes are known virulence factor (VF) genes or could indeed be new VF genes based on the genes' known biological functions.

Model selection and evaluation
Byte pair encoding was used to create token vocabularies from a corpus of sequences.Hereto, an initial vocabulary was created where tokens were single characters, the most frequent pair of adjacent tokens was then merged into a new token, which was added to the vocabulary.All the occurrences of the two merged tokens were then replaced by the new token.This procedure was iterated to expand the vocabulary until a desired size was reached.For both capsule and disease classification tasks, experiments were performed with six variants of SP and six variants of k-mer approaches to represent an N. meningitidis genome as a bag-of-words.For each variant, 50 hyperparameter combinations were trained on the training set, and their performances were validated on the validation set.Both approaches exhibited high accuracy; increased accuracy was observed with increasing vocabulary size.
For the capsule task, the SP approach with 8k, 16k, 32k, and 64k words all returned the highest possible validation accuracy (100%), while the k-mers approach only had a high validation accuracy when using an order of magnitude more features (vocabulary size 87,360) compared with the 8k SP model (Figure 1A; Table S1).Therefore, we decided to use the SP approach with an 8k vocabulary size for independent evaluation and subsequent experiments.When using an independent test set, the 8k SP model scored 99.6% accuracy (Table 1).
Similarly, in the disease task, the SP approach with 32k words had the highest validation accuracy (90.2%), while all k-mer models performed poorly in comparison (Figure 1B; Table S1).Therefore, the SP approach with a 32k vocabulary size was selected for independent evaluation and subsequent experiments.When using an independent test set, this approach scored 90.2% accuracy (Table 1).

Capsule locus removal resulted in a capsule group prediction change
In N. meningitidis, all capsule genes are located in the capsule locus. 37In silico knockout of the entire capsule locus (mean number of bases removed was 25,663 bp +/À 697 bp) was performed on 32 genomes of strains for which initial classification predicted a type B capsule with a high level of confidence (R0.99).Following this knockout, ML classification, that was optimized and trained by lightGBM, was applied to predict phenotypes.ML classification with the SP 8k approach found that most strains (n = 30/32, 93.8%) were now predicted to have a non-B capsule type (Figure 2A).In these 32 strains, the mean probability of having a type B capsule went from 1.000 (+/À 2.16E-6) before to 0.088 (+/À 0.245) after the knockout.
As a control, the same number of bp as the capsule locus were removed from the 32 genomes but from randomly chosen contigs.Following this removal, ML classification found that none (n = 0/32, 0.0%) of these strains were predicted to have a non-B capsule type (Figure 2A).In these 32 strains, the mean probability of having a type B capsule went from 1.000 (+/À 2.16E-6) before to 1.000 (+/À 2.34E-6) after the knockout.
Comparing the probabilities that the model assigned to the two groups (capsule knockout vs. control knockout) showed that the mean probability was significantly lower in the capsule knockout group than in the control group (p value 4.66E-10, Wilcoxon paired t-test).

Removal of VFs resulted in a virulence prediction change in some strains
Similar to the capsule locus knockout, in-silico knockout of the entire set of 155 known VFs (mean removal of 152,203 bp +/À 9,981 bp) was performed on the 436 genomes of strains for which initial classification had predicted invasiveness with a high level of confidence (R0.99).Following this knockout, ML classification using the SP 32k approach found that the majority (n = 230/436, 52.8%) of these strains were now predicted to have a carrier phenotype (with a classification threshold of 0.5).The mean probability of invasiveness assigned by the model to the 436 strains went from 0.999 G 0.001 before to 0.468 G 0.416 after the knockout.
As a control, the same number of bp was removed from each of the 436 genomes but from randomly selected contigs.Following this removal, ML classification found that a much smaller proportion (n = 13/436, 3.0%) of these strains were now predicted to have a carrier phenotype (Figure 2B).Note that because the control bases were removed randomly, some relevant genome portions may have been removed by chance; indeed, on average, 6% of the bases removed affected a known VF (data not shown).The mean probability of invasiveness assigned by the model to the 436 strains went from 0.999 G 0.001 before to 0.969 G 0.129 after the knockout, indicating a more limited removal effect (Figure 2B).
Comparing the probabilities between the knockout and control groups showed that the mean probability of invasiveness assigned by the model to the knockout group was significantly smaller than that of the control group (p value 7.12e-72, Wilcoxon paired t-test).

Knockout of individual genes identified unstudied VFs
Next, to verify learning of the ML model, all 2,808 N meningitidis genes were individually knocked out in all 436 genomes for which invasiveness was predicted with high confidence (R0.99).For 306 (the relevant genes) of these 2,808 genes, the prediction delta was above the 95 th percentile of the prediction deltas in the same genome, and this was true for at least 50 genomes.Of the 306 relevant genes identified, 44  were part of the 155 genes in the known VF list, a significant overlap (p value 4.15e-10, Fisher's exact test).Some of the 306 relevant genes were not annotated in PubMLST; 2 therefore, we restricted the number of relevant genes to 291 in the subsequent analysis (Table S2).
Database for Annotation, Visualization, and Integrated Discovery (DAVID) 38 analysis for functional annotation and enrichment of these 291 relevant genes revealed that genes in four annotation clusters were present significantly more often than would be expected at random: cluster 1 contained genes involved in adenosine triphosphate (ATP) and nucleotide binding (enriched by a factor of 4.52), cluster 2 contained C-terminal helicases (enriched by a factor of 2.55), cluster 3 contained P loop containing genes involved in nucleoside triphosphate hydrolase (enriched by a factor of 2.24), and cluster 4 contained genes involved in the ligation of amino acids for protein biosynthesis (enriched by a factor of 1.99) (Table S3).The top-ten genes with the highest mean delta ranking from these in silico knockout experiments are presented in Table 2; four (siaA, siaB, siaC, and lbpA) were already known VF genes. 39The other six are potential VF genes.
The first is secA, which in N. meningitidis is the first gene of an operon with three promoters (type I, II, and III). 41The translocase that secA encodes is part of the general secretory pathway delivering VFs to the extracellular environment for interaction with the host, 40 and is thereby already known to be indirectly involved in virulence.The second, with locus ID NEIS0044, encodes an oligopeptide transporter.Little is known about this gene except that it flanks the capsule locus, 48 which encodes the major N. meningitidis VF.The third, ygfZ, encodes a transfer ribonucleic acid (tRNA)-modifying protein that plays a role in iron-sulfur metabolism 42 and contributes to the secretion of cytotoxic necrotizing factor 1, a VF, in outer-membrane vesicles (OMVs) of Escherichia coli. 43Whether it plays the same role in Neisseria has not been investigated.The fourth, tamB, encodes the inner membrane-anchored, periplasmic protein TamB, which is part of the translocation and assembly module (TAM) complex.The precise role of the TAM complex remains unclear, but it has been suggested that it has a role similar to the b-barrel assembly machinery (BAM) complex, 44 which plays a role in the translocation of autotransporters across membranes. 49The fifth, with locus ID NEIS0418, encodes a putative cytochrome c-type biogenesis protein.Cytochrome c assembly and maturation are critical for the virulence of Bacillus anthracis 46 and were found in Neisseria to be involved in biofilm formation, a growth form that protects bacteria during infection. 50he sixth, fixN, which encodes the cbb3-type cytochrome c oxidase subunit I, is involved in aerobic energy generation and is overexpressed in the N. meningitidis strain MC58 under invasive conditions. 47Based on the known biological functions of the six potential VF genes highlighted here, their role as VF genes is highly plausible.
The bag-of-words classification and knockout approach for knowledge extraction are more informative than the state-ofthe-art methodology in our datasets We constructed phylogenetic trees based on gene presence or absence for both problems to compare our approach with the current stateof-the-art methodology.The phylogenetic trees are presented in Figure S1.The distance on the trees represents the similarity of genomes in terms of variety of gene sets.We then colored the genomes for their classification label to check whether they were clustering on the tree.
For the capsule classification, we observed defined groups of B and non-B genomes next to each other on the tree.This is expected, as from the literature, 51 we know that some genetic backgrounds (sequence types and clonal complexes) are associated with a capsule type, even if the capsule locus does not reflect the evolution of N. meningitidis, as it is subject to recombination events.We also observed a few mixed clusters, that may correspond to a genetic background shared between B meningococci and other capsule groups (e.g., cc-11 (A) 32 genomes of strains for which initial classification predicted a type B capsule with high confidence (original prediction, in yellow) were selected.In silico knockout of the entire capsule locus (capsule knockout, in magenta) changed the prediction of the type B capsule to a non-type B capsule for 30 of the 32 strains (93.8%).In silico knockout of a random contig of the same size (control knockout, in blue) did not change the prediction of the type B capsule for any of the strains (0.0%).(B) 436 genomes of strains for which initial classification predicted an invasive phenotype with high confidence (original prediction, in yellow) were selected.In silico knockout of 155 virulence factor (VF) genes (VF knockout, in magenta) changed the prediction of the invasive phenotype to a carrier phenotype for 230 of the 436 strains (52.8%).In silico knockout of random sequences of the same size (control knockout, in blue) changed the prediction of the invasive phenotype to a carrier phenotype for 13 of the 436 strains (3.0%).Each dot represents one strain.The horizontal dashed line represents the prediction threshold of 0.5. is shared between MenB, MenC and MenW). 52Rarely, we observed points of different color into homogeneous clusters.These may be cases of capsule switching where the capsule operon recombined. 53or the disease classification, a more heterogeneous situation was observed, as the two labels were present in the majority of clusters.Also in this case, the genetic background has been associated with carrier and invasive states in the literature, 54 identifying clonal complexes more prone to carrier or invasiveness.However, usually, an invasive state was preceded by a carrier state in the nasopharynx, thus, the changes that allow the escape in the blood would not dramatically change the genome, but more likely change a few specific loci.
We then performed random forest classification 31 on the table reporting the presence or absence of genes in each genome for the carriage/invasive problem, where the number of genomes was suitable for gene presence or absence retrieval from PubMLST. 2 The accuracy obtained with the test set for the classification was 84.2%.Our model is, therefore, outperforming the classification based on genes' presence or absence.
Word analysis with Shapley Additive exPlanations (SHAP) or feature importance can sometimes help to derive insights in classification.To compare with the knockout approach, we also performed SHAP analysis.We plotted the distribution of SHAP values for the most important words in the dictionary, for the capsule task (Figure S2A) and the disease task (Figure S2B).As can be seen from the plot, some words influence the classification process.An example is the word TCAACTA: a high value associated with this word pushes the model output to classify the genotype of the Capsule B group, while low values seem linked to other genotype groups.The most important words from the SHAP analysis did not overlap with any transcription factor binding sites reported in public repositories.The reason for this is probably that the words might be several bases longer or shorter than transcription factor binding sites and thus are not recognized in the search.In contrast, the knockout experiments were designed explicitly to remove multiple words simultaneously, related to regions of interest in the genome.Therefore, we conclude that for this particular task of biological knowledge extraction, the knockout analysis is more informative than the word importance analysis, due to the multivariate role of words in the classification models, that cannot be observed in univariate analysis, i.e., when inspecting single words.
To compare genome similarity within and between classes, principal component analysis (PCA) was performed, and the first two principal components were visualized in scatterplots.In the case of the capsule task (Figure S3A), the genomes belonging to different classes were slightly misaligned, which indicated that the genotypes were already divided into the two phenotype classes to a certain extent.In contrast, in the case of the disease task (Figure S3B), the two classes almost completely overlapped, indicating that the classification was more difficult.Nonetheless, the ML model was able to pick up relevant patterns in both cases as the knockout experiments showed that, in the capsule case, the patterns corresponded to words belonging to the capsule locus; in the disease case, the knockout experiments showed that the patterns picked correlated with the presence of the virulence factors.

DISCUSSION
We have shown for the first time how an ML model can be used to classify bacterial phenotypes based on their entire genomic sequences.The ML model we developed consists of tokenizing genomes with SP or k-mers, encoding them as bag-of-words histograms, and then analyzing the entire encoded genomes with a downstream ML model for classification.We tested the model by classifying N. meningitidis genomes for capsule group B assignment and disease phenotype and obtained very high classification accuracies.Subsequently, we verified whether the ML model was indeed learning by determining whether predicted potential VF genes had a known biological function that meant a role as a VF gene was plausible.
The mean length of the N. meningitidis genomic sequences was, in accordance with the literature, 18 around 2.2 Mbp, i.e., too large to be used as input for an ML model.Inspired by the approach proposed for promoter region and chromatin-profile prediction in human genomic sequences, 6 we employed SP's Byte Pair Encoding algorithm to generate a vocabulary that segmented entire N. meningitidis genomes into words.Using this ML model, we were able to classify N. meningitidis strains for either capsule B or disease phenotype with very high accuracy (99.6% and 90.2%, respectively).Of the two tokenization approaches applied, the SP approach performed better than the k-mer approach.One possible explanation for this result is that SP vocabularies can encode longer words than k-mer.For example, the 32k SP vocabulary had a mean word length of 17 (mode 8), and about 66% of words (21,121) were longer than eight bases.The k-mer vocabulary, on the other hand, had limited word length as the vocabulary size grew exponentially with k-mer length, so that greater lengths were computationally unfeasible.In addition, the k-mer approach uses all possible words, so it will also include words that have no impact on the classification.While both SP and k-mer approaches resulted in a very high performance, SP obtained better results using only one-tenth (capsule task) and one-third (disease task) of the vocabulary size needed by the best k-mer approach, hence requiring less computational resources.Therefore, we used SP for subsequent analyses.
The knockout experiments showed that removing the capsule locus almost always resulted in a change in the capsule prediction from B to non-B; this was statistically significant compared with removing a random region of the same length.This aligns with biological intuition, because the non-B group comprises various capsule types, while the B signal is very specific.Similarly, the knockout experiments that removed 155 VFs resulted in most cases in a change in the disease prediction, although the effect was not as clear as for the capsule task.Moreover, the controls also sometimes showed a change in prediction.So, it appeared that the random knockout of such large numbers of bases did not always target only genetic regions that were not relevant to the invasive phenotype.Another explanation might be that some of the strains were misclassified or that their classification was based on genome regions not meaningful for the specific problem.Similarly, even though the change in prediction was significantly more frequent when the known VFs were removed, it was not always observed, suggesting that additional VFs might be active in those genomes, or that loci cooperate.These observations probably occurred because the list of VFs known today, although based on state-of-the-art biological knowledge, 39 is not comprehensive, and factors other than VFs may also affect invasiveness.
The ML method resulted in very high performance for both a classification that was strictly genetic (capsule group B assignment) 39 and for a classification that was partly genetic and partly influenced by a host's immunological response (disease phenotype). 55This suggests that our approach enabled joint consideration of genetic information as well as the underlying regulation and interplay.Experimental evidence will need to confirm that the underlying regulation and interplay were indeed considered.
In addition to applying the ML model for the classification of genomes, we verified that the ML model was indeed learning by using the model to identify potential new VFs and determining whether their role as a VF was plausible.Upon knocking out 2,808 genes, of the 306 relevant genes ranking highest for disease phenotype prediction, 44 were known VF genes.For 291 of the relevant genes, PubMLST data were available; these data indicated enrichment in four clusters.The first cluster contained genes involved in ATP or nucleotide binding, essential for nutrient acquisition and for secretion of toxins and antimicrobial agents and hence essential for invasion of a host. 56,57The other clusters contained genes involved in C-terminal helicases, P-loop-containing nucleoside triphosphate hydrolases, and ligation of amino acids for protein biosynthesis.While protein biosynthesis is a general pathway that is not specific for virulence, activation of this pathway is needed for adaptation to a new environment, such as during invasion.Indeed, genes involved in the biosynthesis of amino acids and proteins were found to be upregulated during the growth of N. meningitidis in the blood. 58Experimental evidence is needed to verify that genes in these last clusters also affect virulence.
The ML method we have developed may be a useful tool for the identification of genes that potentially contribute to a specific phenotype, in this case virulence.Other methods have been used previously to systematically identify N. meningitidis genes.One is the combination of open reading frame prediction and whole genome homology searches that became feasible once whole genomes could be sequenced.0][61][62] Those methods did not, however, allow for the systematic identification of genes involved in a specific process or phenotype.A more recent method is the analysis of entire transcriptomes under various experimental conditions, which enables the identification of genes involved in specific processes. 63An older method that has been updated is signature-tagged transposon mutagenesis screening, which can identify genes involved in a specific phenotype. 64The advantage of our ML method is that it identifies genes that are potentially involved in a specific phenotype without requiring laboratory experiments, thus providing an efficient pre-screening tool.
The objective of this analysis of potential VF genes was to evaluate the method and verify the biological significance of genomic regions that have the greatest impact on classification.Therefore, the results obtained were not intended to precisely clarify the invasive phenotype.However, in future work our ML method could be modified to generate a complete description of the genes important for classification, assisting the next generation of GWAS.
To our knowledge, this is the first ML method for the classification of bacterial strains based on their genomes.This method may in future be useful for strain characterization, population genomics, monitoring antibiotic resistance, and outbreak investigation.We showed that in N. meningitidis, the ML method not only had a high level of accuracy in two different classification tasks but can also be used to identify potential new VF genes.The ML method could thus also be used for pre-screening genes that may be linked to a specific phenotype, to reduce the number of subsequent laboratory experiments required.

Limitations of the study
Our study has some limitations.First, having very good classification performances does not necessarily mean that the ML model was able to identify biologically meaningful information, although our results indicate this to some extent.Laboratory knock-in and knockout experiments should be performed to verify the role of the potential VFs.Second, our work is based on the bag-of-words approach, which purposely discards positional information, meaning that some information about how the genome is structured is lost, such as the positional dependency of regulatory elements.Currently, incorporating positional information in the bag-of-words approach would defeat its strength, namely that it is computationally less costly than alignment-based methods.Until future increases in computational power allow for positional information to be included, the bag-of-words approach is a valuable tool.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:  The two datasets were each split in three by randomly extracting 500 and 1,000 samples for the validation and the test sets, respectively.The remaining genomes were used as training sets (see below table).The validation sets were used to select the model's hyperparameters; the test sets were preserved for independent performance evaluation.Before the vocabulary creation using either tokenization method, all characters not unequivocally identified were masked with ''X''.

Byte pair encoding
Byte pair encoding is an algorithm to create token vocabularies from a corpus of sequences.It starts by creating an initial vocabulary where tokens are single characters in a given alphabet.Then, the most frequent pair of adjacent tokens is merged into a new token, which is added to the vocabulary.Finally, all the occurrences of the two merged tokens are replaced by the new token in the corpus.This procedure is iterated to expand the vocabulary until a desired size is reached.

Vocabulary creation with SP
Tailoring the BigBird procedure, 6 we created a training corpus for SP containing 1,000,000 subsequences, with a length between 500 and 1,000 consecutive bp, randomly extracted from all genomes in the training set.SP generated the vocabulary using Byte Pair Encoding, 70 by repeatedly adding the most frequent word pairs present in the corpora as a new vocabulary term.Starting from a vocabulary containing the four bases A, C, T, and G, along with X to represent unknown characters, SP iteratively added new words to the vocabulary by merging the two most frequent words until the desired vocabulary size was reached.Any token with an ''X'' was subsequently removed from the vocabularies.In this study, we experimented with six vocabulary sizes (2k, 4k, 8k, 16k, 32k, and 64k).

Vocabulary creation with k-mers
For a given k and an alphabet of four characters (A, T, C, and G), the number of possible k-mers is 4 k .We constructed six different k-mer vocabularies by varying k from 3 to 8. The overall dimension of the vocabulary was obtained with the formula: Note that for k = 8, we obtain a vocabulary with 87,360 words: V ðkÞ = P 8 k = 3 4 k .

Bag-of-words approach
A bag-of-words approach transforms any linearly ordered set of words, e.g., a sentence, a document, or, as here, a sequence, into an unordered set, hence the name bag-of-words. 71Each sequence can be seen as a sentence, represented in a simplified way as the bag (multiset) of its words, regardless of relative position, direction, and order, but keeping multiplicity.We developed a bag-of-words approach to encode, after SP or k-mer tokenization, entire bacterial genomes.

TF-IDF vectorization
After creating the vocabularies, we transformed the N genomes into a D ˛RN3V matrix, where V is the vocabulary size.The (i, j)-th matrix entry, indicated as d ij , was obtained as follows: Where tf ij is the term frequency (how many times word i appears in genome j), and df i is the document frequency (how many times word i appears across the entire corpus).This representation is called the Term Frequency-Inverse Document Frequency (TF-IDF) 67 and is used here to represent a genome in numerical form throughout our experiments.In short, genomes are represented as histograms of size V, where each bin corresponds to a word, with each word's numerical value being the number of occurrences in the genome, normalized by its relative frequency across the training set.As the sequencing direction of contigs cannot be known without alignment, we achieved directionality invariance by tokenizing each contig in both directions and summing up the two histograms.We further summed all of the contig histograms to obtain a fixed-length representation of each genome, which was used as the input for the downstream classifier.

Gradient Boosting Machines and lightGBM
Gradient Boosting Machines (GBMs) belong to the family of ensemble methods, i.e., methods that aggregate a pool of ''weak'' (in terms of performance) models such that the combined model performs better than the single models in isolation.In particular, each component of a GBM is trained sequentially to minimize the error residual of the previous.Predictions can be obtained by a weighted average of each component's prediction.lightGBM is a library for Gradient Boosting Trees, 65 that is, GBMs that ensemble multiple Decision Trees.It offers a broad series of optimizations and facilities to train Gradient Boosting Trees on large datasets.Currently, it is one of the best-performing GBM libraries available and is widely adopted by the ML community.

PCA analysis
To compare genome similarity within and between classes and find possible biases, we used PCA.PCA is based on applying an orthogonal transformation to the data to derive a set of linearly independent variables called principal components. 69The most important principal components explain most of the variation in the original data and as such can be used to find explanatory patterns.Specifically, we applied PCA to the genome histograms and kept only the first two principal components, meaning that we transformed each genome into two coordinates.These coordinates were then plotted on a 2D plane and colored according to the genome class, to determine whether the data contained patterns that would bias the classification toward one class or the other.

In silico gene knockout
Knockouts of genes (or genomic regions) were used to determine whether the classifier predictions could be linked to the biological processes underlying each classification task and, if so, to what extent.Knockouts were performed on genomes where the classifier was highly confident (prediction R0.99) that they were capsule B (for the capsule task) or invasive (for the disease task).A knockout entailed masking a sequence of consecutive bases (a gene or genomic region of interest) with an equal number of ''X'' characters, which were then discarded before tokenization.This effectively altered the TF-IDF of the masked words and produced a different histogram for the genome.To rule out confounding effects, we created a control for each knockout by removing a different sequence (with the same number of bases) from the same genome.
For the capsule classification task, we knocked out the intergenic region between NEIS0044 and NEIS0068 (on average 25k bases) from all genomes where it was contained in a single contig (n = 32).As control, we removed from the same genomes an equal portion of contiguous bases taken from a randomly chosen contig.
For the disease classification task, as VFs play a role in invasiveness, we removed from a set of 436 genomes a panel of 155 non-consecutive loci (Table S6) for which a PubMLST annotation was available from a panel of 172 known VFs. 39 As controls, we removed 155 non-consecutive loci (with the same number of bases) from random regions of the same 436 genomes.
For both the capsule and disease tasks, we recorded the prediction before and after removing the knockout and control loci.Finally, we used a Wilcoxon signed rank test (significance level alpha = 0.05) to establish whether the knockout predictions were significantly different from the control predictions.

Identification of unstudied VFs plus annotation
To identify relevant genes that influence the prediction of invasiveness, we further performed single-locus knockouts for 2,808 N meningitidis loci annotated on PubMLST. 2 A locus was considered relevant if its prediction delta (the difference in prediction before and after knockout) on a single genome was above the (arbitrarily chosen) 95 th percentile and this was observed in R50 genomes.These criteria ensured that we considered all relevant genes that had a strong effect both on individual genomes and across genomes.
Each gene was then ranked by the number of times it satisfied both conditions across the 436 genomes.The set of relevant genes obtained was subsequently analyzed for enrichment in the list of 155 VFs (Table S6) (Fisher's exact test, significance level 0.05) and DAVID. 38Functional annotations were based on the literature and pathway enrichment with DAVID's Functional Annotation Chart and Functional Annotation Clustering with the lowest classification stringency for gene ontology (GO) enrichment.The significance threshold for the Benjamini-Hochberg adjusted p value on enrichment was 0.05.

QUANTIFICATION AND STATISTICAL ANALYSIS
Statistical analyses were performed using custom python scripts.Details of all statistical analyses can be found above in the relevant subsections of the method details section.Sample number (n) and statistical methods used to assess differences between groups are indicated in the relevant subsections of the Results section.
Bag-of-words coupled with machine learning allowed the classification of entire Neisseria meningitidis genomes for capsule type (B/not-B) and carriage/invasive phenotypesEntire genomesWith conventional methodsWith genome segmentation, bag-of-words, and machine learning

Figure 1 .
Figure 1.Effect of vocabulary size on validation accuracyPlots of validation accuracy when different vocabulary sizes were used in the k-mer and SentencePiece approaches.For the k-mer approach the genome was segmented in k-mers: only 3-mers (vocabulary size: 64), 3-and 4-mers (vocabulary size: 320), 3-to 5-mers (vocabulary size: 1,344), 3-to 6-mers (vocabulary size: 5,400), 3-to 7-mers (vocabulary size 21,824), and 3-to 8-mers (vocabulary size: 87,360).For the SentencePiece approach, six vocabulary sizes were evaluated: 2k, 4k, 8k, 16k, 32k, and 64k.See also TableS1.(A) Classification accuracies obtained with the test set for the capsule group classification.The blue line depicts the k-mer approach and the yellow line the SentencePiece approach.(B) Classification accuracies obtained with the test set for disease classification.The y axis depicts the accuracy score, the x axis shows the vocabulary size.

Figure 2 .
Figure 2. Prediction change of type B capsule or disease phenotype after knockout In silico knockouts were performed to evaluate the prediction of classification of either capsule or disease phenotype.(A)32 genomes of strains for which initial classification predicted a type B capsule with high confidence (original prediction, in yellow) were selected.In silico knockout of the entire capsule locus (capsule knockout, in magenta) changed the prediction of the type B capsule to a non-type B capsule for 30 of the 32 strains (93.8%).In silico knockout of a random contig of the same size (control knockout, in blue) did not change the prediction of the type B capsule for any of the strains (0.0%).(B) 436 genomes of strains for which initial classification predicted an invasive phenotype with high confidence (original prediction, in yellow) were selected.In silico knockout of 155 virulence factor (VF) genes (VF knockout, in magenta) changed the prediction of the invasive phenotype to a carrier phenotype for 230 of the 436 strains (52.8%).In silico knockout of random sequences of the same size (control knockout, in blue) changed the prediction of the invasive phenotype to a carrier phenotype for 13 of the 436 strains (3.0%).Each dot represents one strain.The horizontal dashed line represents the prediction threshold of 0.5.
genomes in the dataset, then by binning resulting values with k-means clustering, and finally by sampling an equal number of genomes for the two classes for each cluster.The final datasets, comprising 19,498 and 9,402 genomes for the capsule and disease datasets, respectively, were thus balanced for the two classes and had similar quality distributions, as can be seen by the overlap of the curves in below figure.
Genome sets after resampling to reduce bias in datasets Histograms of the fractions of each genome in the dataset covered by the 10 longest contigs.(A) The 25,290 genomes of the capsule dataset, light blue bars indicate capsule group non-B strains, dark blue bars indicate B strains, medium blue bars indicate where light and dark bars overlap.(B) The 19,498 genomes of the reduced dataset after resampling capsule group B samples, bars are color-coded in the same way as for (A).(C) The 20,442 genomes of the disease dataset, light yellow bars indicate invasive strains, dark brown bars indicate carrier strains, medium brown bars indicate where light and dark bars overlap.(D) The 9,402 genomes of the reduced dataset after resampling invasive samples, bars are color-coded in the same way as for (C).The y axis indicates the number of genomes; the x axis indicates the fraction of the genomes covered.

Table 1 .
Capsule and disease classification performance of the SP approach on the independent test set Threshold used: 0.5.AUROC, area under the ROC curve; ROC, receiver operating characteristic; SP, SentencePiece

Table 2 .
Annotation of the top-ten ranking potential virulence factor (VF) genes identified through the in silico knockout approach Based on data in the PubMLST Database. 2b Locus code in specific N. meningitidis strains. a

TABLE
d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d METHOD DETAILS B Genome datasets B Byte pair encoding B Vocabulary creation with SP B Vocabulary creation with k-mers Genome datasets divided into training, validation, and test sets