Machine Learning Reveals Key Glycoprotein Mutations and Rapidly Assigns Lassa Virus Lineages

Lassa fever, caused by the Lassa virus (LASV), has led to numerous fatalities in West Africa and to intercontinentally exported cases since its discovery in 1969. Currently, there are no approved vaccines, and recent research has focused on immunotherapy. LASV is grouped into lineages that circulate in specific geographical areas, elicit varying immune responses, and display distinct pathophysiological effects. Investigating the genetic differences between these lineages is therefore crucial. Here, we analyzed the LASV glycoprotein, the only surface protein, using statistics, machine learning, and phylogenetics to identify key differences between Nigerian lineages and those endemic to other West African countries. We found that amino acid positions near the stable signal peptide cleavage site and sites impacting immune recognition, such as those between positions 59 and 76, were highly variable among the lineages. Additionally, we discovered that Lineage II and Lineage III sequences are one codon shorter than Lineage IV sequences because Lineage IV carries a codon insertion at nucleotide positions 178-180, corresponding to amino acid position 60. This may explain the structural and phenotypical differences between the lineages. To quickly identify which lineages cause emerging outbreaks or exported infections, we also developed a highly accurate lineage classification tool based on machine learning.


Introduction
Since its discovery in 1969, Lassa fever (LF) has constituted a major public health threat, with 500,000 cases and around 5,000 deaths annually in West Africa (Garry, 2023; Macher & Wolfe, 2006; Richmond & Baglole, 2003). Although the overall fatality rate in the hospitalized population is relatively low (~15%), it is alarmingly high among pregnant women (80%) and fetuses (up to 100%) (Agboeze et al., 2019; Salami et al., 2022; Simons, 2023). Currently, neither an approved vaccine nor an appropriate therapeutic measure exists, although efforts are ongoing (Garry, 2023). The infectious agent causing Lassa fever is the Lassa virus (LASV). Its main reservoir host is Mastomys natalensis, and several additional hosts with the potential to spread the virus have been discovered recently (Happi et al., 2023; Olayemi et al., 2016). Experimental research involving LASV is confined to Biosafety Level 4 (BSL4) laboratories, and the disease is currently on the World Health Organization's (WHO) top priority list (WHO, n.d.).
Lassa fever, a viral hemorrhagic disease first identified in Northern Nigeria, has caused global concern due to its international spread and pandemic potential. By 1973, notable cases had been reported, including the first case introduced to Britain (Woodruff et al., 1973) and a hospital epidemic in Liberia (Mertens et al., 1973; Monath et al., 1973). Between 1974 and 1977, a special isolation unit was created in London (Woodruff, 1975), and an imported case was reported in the United States involving a Peace Corps worker (Zweighaft et al., 1977). By the 1980s, cases were reported globally, including in Israel (Shlaeffer et al., 1988), Japan (Yanase et al., 1989), and Canada (Mahdy et al., 1989). Fatal cases involving travelers from West Africa to Germany were reported in 2000 and 2016 (Günther et al., 2000; Wolff et al., 2016). A recent review has documented cases worldwide (Wolf et al., 2020), including in Sweden and the Netherlands, emphasizing the need for robust biosecurity measures, rapid diagnosis, effective treatments and vaccines.
To date, there are seven known LASV lineages circulating in distinct geographical locations (Figure 1A). Lineages I, II and III circulate in non-overlapping parts of Nigeria (Ehichioya et al., 2019), while lineage IV circulates in Sierra Leone, Guinea, and Liberia (Andersen et al., 2015), lineage V in Mali and Côte d'Ivoire (Manning et al., 2015), and lineage VII in Togo and Benin (Whitmer et al., 2018; Yadouleton et al., 2020). The strain found in the host Hylomyscus pamfi in Nigeria was designated lineage VI (Olayemi et al., 2016; Whitmer et al., 2018).
Currently, treatment for Lassa fever relies primarily on ribavirin, a drug with severe side effects (De Franceschi et al., 2000; Salam et al., 2022). Immunotherapy is a promising treatment option. Early attempts at passive immunization included the use of convalescent plasma and immune serum, which showed some success (Clayton, 1977; Leifer et al., 1970). Most recently, a monoclonal antibody cocktail composed of three neutralizing antibodies targeting the LASV glycoprotein complex (GPC) has been shown to be effective in non-human primates against representatives of lineages I-IV (Mire et al., 2017). It is now apparent that even this cocktail may fail against some variants or new lineages, as recently reported for lineage VII (Woolsey et al., 2024). This may be due to mutations in the GPC, which may prevent binding of specific antibodies (Enriquez et al., 2022; Robinson et al., 2016). Thus, it is crucial to determine which lineages carry such mutations and to what extent. While this was attempted in previous work (Ibukun, 2020), that analysis was based on a somewhat arbitrary selection of "representative" strains from various lineages rather than on all available data.
The Lassa virus genome is bi-segmented, coding for four proteins: the glycoprotein complex precursor, nucleoprotein (N), matrix (Z), and RNA-dependent RNA polymerase (L). The GPC is post-translationally cleaved into the Stable Signal Peptide (SSP), Glycoprotein subunit 1 (GP1) and Glycoprotein subunit 2 (GP2). Altogether, the GPC is crucial for cell entry, making it a key target for vaccine and therapeutic development (Garry, 2023; Katz et al., 2022). Despite its importance, specific information on LASV GPC, especially related to specific amino acid (AA) differences between lineages, is limited.
Knowing which lineage an emerging LASV outbreak strain belongs to would instantly shed light on the geographical origin of an outbreak, since distinct lineages circulate in distinct locations. Moreover, since the lineages are known to have different immunological (Buck et al., 2022) and pathophysiological effects (Andersen et al., 2015), quick lineage assignment from a patient's sample at the point of care would be beneficial. With the WHO supporting a significant scale-up of the genomic sequencing capabilities in Africa during the COVID-19 pandemic (Akande et al., 2023), genomic pathogen sequencing can become an integral part of the outbreak response.
In this study, we harness robust data, statistics, phylogenetics, and machine learning techniques to determine clade-defining amino acid mutations of possible clinical and evolutionary importance on the LASV GPC. We also provide a rapid lineage assignment tool to enable timely responses to national and international Lassa fever threats.
Figure 1: Description of the data in this study. (A) Geographical distribution of the final glycoprotein (GPC) sequence data in this study. The majority of the data is from Nigeria, followed by Sierra Leone. The only non-West African sequence is from a 2016 local spread of an imported case of Lassa fever to Germany from Togo (Wolff et al., 2016). The counts have been log-normalized for clarity. (B) Phylogenetic tree of the nucleotide sequences annotated by lineage (black arrows). Leaf nodes are coloured by geographical location. The tree topology closely resembles the GPC tree described by Whitmer et al. (2018).

Random forest classification reveals key LASV glycoprotein mutations
To investigate LASV GPC amino acid sites of differences, we downloaded all available LASV nucleotide sequences released before December 1, 2023, and extracted the GPC region (supplementary figure 1). After quality filtering (see Methods), the final dataset contained 753 sequences, including 542 sequences from Nigeria, 141 from Sierra Leone, 11 from Guinea, 1 from Germany, 3 from Togo, 24 from Liberia, 5 from Mali, 13 from Côte d'Ivoire, and 13 from Benin. We translated the nucleotide sequences into amino acid sequences and verified the accuracy of the translation. We then grouped the sequences into two categories: those from Nigeria and those from other countries, consistent with Andersen et al. (2015).
We selected the Random Forest (RF) model due to its robustness in handling imbalanced and multi-feature datasets and based on previous work on similar tasks involving viral sequence classification (Kim et al., 2021). Before fitting the model, we encoded the alignment using a one-hot approach, representing each AA as a 21-dimensional vector (20 amino acids plus a gap).
The model achieved precision, recall, and F1 scores of 100% across the two groups on the test set, which aligns with the literature indicating that LASV is generally location-bound (Andersen et al., 2015; Ehichioya et al., 2019). Subsequently, the aggregated feature importance was extracted from the model, revealing that many amino acid positions contributed minimally to the differences between Nigerian and other sequences (Figure 2, supplementary figure 2). This suggests reasonable conservation across the two groups. We ranked the positions and selected the top 15 AA positions for further analysis.
To verify the positions implicated by the RF method, we employed two additional techniques: Manhattan Distance (MD) and Pearson correlation (Pcorr). In this context, the MD at a given AA position is the sum of the absolute differences between the normalized variant counts of the two groups at that position. We counted the occurrences of amino acids and gaps per position across the groups, normalized by sample number due to data imbalance, and computed the MD between corresponding position vectors. We ranked the positions and selected the top 15 for further analysis.
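The per-position Manhattan Distance described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the toy columns are hypothetical, and symbols outside the 21-symbol alphabet are assumed to count toward the sample size without contributing to the vector.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def normalized_counts(column):
    """Per-symbol frequency in one alignment column (count / sample size)."""
    counts = Counter(column)
    n = len(column)
    return [counts.get(symbol, 0) / n for symbol in ALPHABET]


def manhattan_distance(column_a, column_b):
    """Sum of absolute differences between two normalized count vectors."""
    freq_a = normalized_counts(column_a)
    freq_b = normalized_counts(column_b)
    return sum(abs(a - b) for a, b in zip(freq_a, freq_b))


# Toy column (hypothetical data): group A is mostly gapped, group B is not.
group_a = list("----S")   # gap: 0.8, S: 0.2
group_b = list("SSSST")   # S: 0.8, T: 0.2
distance = manhattan_distance(group_a, group_b)
```

Because each vector is normalized by its own sample size, the distance ranges from 0 (identical variant frequencies) to 2 (no shared variants), regardless of how unbalanced the two groups are.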
Finally, we applied Pearson correlation to assess variant similarity across the groups. Highly correlated positions suggest conservation, while low correlation indicates dissimilarity. We ranked the positions from least to most correlated and selected the top 15 for further analysis.
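A minimal sketch of this ranking is given below, assuming that the correlation is computed between the two groups' normalized variant-count vectors at each position. The helper names are hypothetical, and treating an undefined correlation (a constant vector) as 0.0 is an arbitrary choice made only for this illustration.

```python
from collections import Counter
from math import sqrt

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def frequency_vector(column):
    """Normalized count of each symbol in one alignment column."""
    counts = Counter(column)
    return [counts.get(symbol, 0) / len(column) for symbol in ALPHABET]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: undefined correlation is treated as 0.0
    return cov / (sd_x * sd_y)


def rank_least_correlated(group_a, group_b):
    """Return 1-based alignment positions, least correlated first."""
    scores = []
    for i in range(len(group_a[0])):
        col_a = [seq[i] for seq in group_a]
        col_b = [seq[i] for seq in group_b]
        r = pearson(frequency_vector(col_a), frequency_vector(col_b))
        scores.append((r, i + 1))
    return [position for _, position in sorted(scores)]


# Toy alignments (hypothetical): only the third column differs between groups.
ranking = rank_least_correlated(["MG-", "MG-", "MGS"], ["MGS", "MGS", "MGT"])
```

In the toy example the third column is the only one whose variant frequencies differ between the groups, so it is ranked first.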
The application of these three methods revealed that 11 out of the top 15 positions were consistently implicated across all three methods (supplementary figure 3). The positions, sorted from highest to lowest by cumulative ranking of all three sets, were AA 273, 60, 61, 28, 44, 76, 74, 31, 421, 324, and 482. We hypothesize that these AA positions play a role in immunological (Buck et al., 2022) and pathophysiological processes (Andersen et al., 2015). Indeed, positions 74 and 76, for example, have been implicated as important for the functioning of antibodies 25.10C and 36.1F (Enriquez et al., 2022). This suggests that the virus might be attempting to evade these antibodies by varying these positions. In fact, deletion of amino acid positions 60-75, a range that contains 3 of the 11 most variable positions across the LASV GPC as implicated by our tests, has been reported to disrupt the functioning of all known anti-GPC antibodies (Robinson et al., 2016).
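One way to combine the three rankings is sketched below. It assumes that "cumulative ranking" means the sum of a position's ranks across the three top-15 lists, which the text does not state explicitly; the function name and the toy rankings are hypothetical.

```python
def consensus_positions(rankings, top_k=15):
    """Positions present in the top_k of every ranking, best rank sum first.

    rankings: list of lists of positions, each ordered from most to least
    divergent according to one method.
    """
    tops = [ranking[:top_k] for ranking in rankings]
    shared = set(tops[0]).intersection(*tops[1:])

    def rank_sum(position):
        # Assumption: cumulative rank = sum of the per-method rank indices.
        return sum(top.index(position) for top in tops)

    return sorted(shared, key=lambda p: (rank_sum(p), p))


# Toy rankings (hypothetical values), using top_k=3 for brevity.
rf_rank = [273, 60, 61, 5]
md_rank = [60, 273, 28, 61]
pc_rank = [60, 61, 273, 28]
consensus = consensus_positions([rf_rank, md_rank, pc_rank], top_k=3)
```

Only positions 60 and 273 appear in all three toy top-3 lists, and 60 comes first because its rank sum is lower.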

An indel at amino acid position 60 shortens 98% of Nigerian GP1 sequences to 199 AAs
The Manhattan Distance analysis highlighted position 60 as the most divergent between the groups (Figure 1C), and this position was consistently ranked highly by all three methods. Normalized value counts revealed that 98% of sequences from Nigeria have a gap at position 60, compared to less than 8.1% of sequences from other regions. This indel is particularly noteworthy as it is located in the GP1 region, in the second position after the SSP cleavage site (Figure 2B). Consequently, at least 98% of Nigerian GPC sequences are shortened by one codon (nucleotide positions 178-180) at the nucleotide level and by one amino acid at the protein level compared to most sequences from other regions. As a result, the GP1 of Nigerian sequences is 199 amino acids long, one amino acid shorter than in most sequences from elsewhere, which belong mainly to lineage IV.
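The correspondence between amino acid position 60 and nucleotide positions 178-180 follows from simple codon arithmetic, assuming 1-based coordinates that start at the first nucleotide of the GPC open reading frame. The function names below are hypothetical.

```python
def codon_to_nt_range(aa_position):
    """1-based nucleotide range (start, end) of a 1-based codon position."""
    start = 3 * (aa_position - 1) + 1
    return start, start + 2


def nt_to_aa_position(nt_position):
    """1-based codon (amino acid) position containing a 1-based nucleotide."""
    return (nt_position - 1) // 3 + 1


insertion_range = codon_to_nt_range(60)  # the codon at AA position 60
gp1_length_without_insertion = 200 - 1   # GP1 length when the codon is absent
```

The same arithmetic shows that nucleotide 181 already belongs to amino acid position 61.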
This discrepancy in length may affect protein folding and structure, potentially leading to functional and immunological variations. Additionally, the length inconsistency may cause issues with sequencing and assembly, as well as with polymerase chain reactions (PCR) targeted at the S segment or the GPC.
A normalized count of variants at that position revealed that over 98% of Nigerian sequences lack an amino acid at position 60 (supplementary figure 4), whereas more than 92% of sequences from other countries (denoted as 'elsewhere') have an amino acid at this position. Given that the Nigerian sequences constitute the largest dataset, this raises questions about the general belief that the LASV GPC is 491 amino acids in total and that GP1 is 200 amino acids. (C) Time-calibrated tree showing that the missing position on Nigerian sequences is actually an insertion that predates the emergence of the Sierra Leonean strain. This insertion is seen to be present in all sequences in Lineage IV and Lineage V.

Phylogenetic analysis suggests that position 60 constitutes an insertion that predates the emergence of Lineage IV, the Sierra Leonean strain
To investigate the nature of the indel at position 60 and its potential public health implications, we reconstructed a time-scaled phylogenetic tree of the LASV GPC using the preprocessed data.
The clock rate was estimated by Treetime (Sagulenko et al., 2018) at 8.20e-4 substitutions per site per year.
The results revealed that the indel at position 60 is an insertion in Lineage IV, dating back to when this lineage first diverged from the Nigerian lineages (Figure 2C). The insertion is estimated to have occurred in the 18th century. All sequences on the major branch leading to lineages IV and V possess this insertion. The amino acids observed at this position include the Josiah genotype S60, with variants such as S60T, S60N and S60G. Only two sequences from Sierra Leone have the S60G variant (OQ919514 and KM821773).
On the Nigerian side, lineages II and III almost entirely lack any amino acid at this position, a feature common to their GPC. This is consistent with the view that the Nigerian lineages are ancestral to the Sierra Leonean lineage (Andersen et al., 2015). Lineage VII, the Togo strain, which also includes sequences from the Benin Republic and a 2016 imported case to Germany, also lacks this insertion. However, a sequence closely related to Lineage VI (MK107927) is seen to have the insertion at position 60 (supplementary figure 5).
Lineage I, on the other hand, has the insertion (a T60) but appears instead to have an amino acid deletion at position 62. Interestingly, differences in immune response have been reported for lineage I (Buck et al., 2022).
The insertion appears to have recently re-emerged in lineage II, particularly in sublineage IIb (Ehichioya et al., 2019), which is reported to circulate in the southern part of Nigeria (Figure 3). The same insertion is also seen in GenBank IDs MH887782 and MH053506, which cluster closely with sublineage 2g, a sublineage with growing incidence in Edo and Ondo States in Nigeria (Happi et al., 2023) (supplementary figure 6). These insertions were estimated to have occurred around the last century.
The persistence of the insertion at AA position 60 over the years and across multiple lineages, together with its independent recurrence at this specific position, suggests that it may offer the virus a selective fitness advantage. Given the structural and functional roles of the glycoprotein, this could have significant implications for virus transmission and virulence, potentially affecting public health measures and strategies for vaccine development. It could also indicate potential adaptation to an intermediary host.

Rapid LASV Lineage Classification from GPC sequence
Given that different LASV lineages circulate in specific regions and possess distinct clinical properties, the ability to efficiently assign lineages to unknown sequences offers immense clinical and public health benefits. Consequently, we developed a machine learning-based pipeline for LASV lineage assignment.
To test the pipeline's speed, we downloaded all publicly available LASV sequences and processed them through the pipeline. On an average computer using five cores, it took less than two minutes to process and predict the lineages of all LASV GPC sequences in the dataset (supplementary GitHub code). These figures are auto-generated for every run, enabling rapid evaluation of the resulting lineage classification.

Discussion
In this paper, we provide novel information on the amino acid positions that are most variable by geographical location and lineage. We have used a novel application of mathematical methods, which could be used further in the field for motif finding, even beyond pathogens, especially in cancer. Our results also suggest that a recurring codon insertion may confer a selective advantage on LASV. Finally, we provide a rapid, accurate, and easy-to-use lineage assignment tool for LASV sequences.
Andersen et al. (2015) reported the phenomenon of differing pathophysiology and host codon usage in the LASV circulating in different regions. We have extended this work by showing differing positions of the LASV GPC at the amino acid level. Interestingly, some of the positions are around the cleavage site of the stable signal peptide by the signal peptidase, a protein reported to control the maturation of the GPC (Bederka et al., 2014). The variation in this region may lead to differences in viral replication and maturation.
Although a gap at AA position 60 of the LASV GPC was previously reported (Buck et al., 2022; Perrett et al., 2023), its prevalence was never investigated or discussed. Here, we have shown that this gap is present in over 98 percent of sequences from Nigeria, including all lineage III sequences and almost all lineage II sequences in our dataset. Consequently, Nigerian sequences are shorter than Sierra Leonean sequences (Lineage IV). To the best of our knowledge, this is the first time this has been categorically shown in the literature. Although the majority of Lassa fever cases occur in Nigeria, it is common to use the lineage IV Josiah strain as a reference for genomic analyses. The GPC length of 491 amino acids is widely cited in the literature, and it is important to clarify that the Nigerian lineages, which are the most prevalent, are one codon (nt 178-180) shorter. This may have profound implications for understanding the structural and phenotypical differences between the lineages.
Generally, the effects of mutations vary based on several factors, including the position of the mutation, the importance of the protein to the organism, the type of mutation, and more. In multicellular eukaryotes, many traits (or phenotypes) are controlled by multiple genes, known as polygenic traits. A mutation in one of these genes may not have a noticeable effect. However, this is not the case for monogenic traits, where mutations can have more severe consequences. For example, in the case of sickle cell anemia, a single nucleotide substitution leads to a single amino acid substitution, which can cause catastrophic consequences if both gene copies inherited by an individual are affected (Inusa et al., 2019).
In viruses, which have fewer proteins, mutations can have even more profound effects. This impact may be exacerbated when the affected protein is the only surface protein and the mutation is an indel. Hence, the insertion at position 60 may be a major contributor to the differences in pathophysiology and immune effects observed among LASV lineages. This mutation occurs on GP1, the subunit responsible for binding both LAMP1 and alpha-dystroglycan receptors of the host during viral entry (Cao et al., 1998; Jae et al., 2014). This could cause differences in cell entry between Nigerian lineages and those from other regions, particularly the Sierra Leone strain. Interestingly, Andersen et al. (2015) reported that the Sierra Leonean strain's genome abundance in patients is significantly higher than that of Nigerian lineages. Since viruses rely on host cellular mechanisms for their life cycle, increased cell entry, potentially influenced by this insertion in GP1, might be one of the underlying reasons for the difference in genome abundance.
Given the codon insertion at nucleotide positions 178-180 of the GPC, it is advisable to avoid targeting this region with PCR primers, because binding affinity would vary with the reference sequence used for primer design and could produce inconsistent results across LASV lineages. Similarly, the design of ELISA (Enzyme Linked Immunosorbent Assay) test kits should take into account the variability at this amino acid position (LASV GPC 60) to avoid inaccuracies in detecting the virus. The presence of this insertion can affect the epitope structure, potentially leading to reduced recognition by antibodies used in ELISA, which are designed based on a reference sequence that may not account for such variations. If targeting these positions is unavoidable, it is essential to design test kits that are specific to the respective lineages to ensure accurate and reliable results.
We also showed that the indel at AA position 60 is an insertion that predates the emergence of the Sierra Leonean lineages from the Nigerian lineages. Additionally, this insertion has independently re-emerged at the same position in lineage II in recent times. The persistence of this mutation is strongly suggestive of a selective fitness advantage. This finding will foster a greater understanding of the inter-lineage differences and the evolution patterns of LASV, which is important for surveillance, vaccine, and immunotherapy development. Moreover, an unknown intermediary host could also be responsible for this insertion, a prospect for future research. We hope that future work will clarify the implications of this insertion.
Our finding that positions 60, 61 and 74 are among the top positions varying between lineages is supported by their role in immune evasion (Robinson et al., 2016). Consistent with reports from Enriquez et al. (2022) and Carr et al. (2024), monoclonal antibody 25.10C may be ineffective in Southern Nigeria due to the high prevalence of the immune-resistant mutation at position 76 (supplementary figure 7). It is essential to monitor these positions closely, along with others such as position 273, which have also been identified as immunologically significant (Ibukun, 2020; Robinson et al., 2016) and have been consistently ranked as highly variable in our tests.
Our machine learning-based pipeline offers a novel ability for researchers and clinicians to quickly assign lineages to LASV sequences. Given its built-in cleaning processes, speed, precision, and ease of use, we believe this pipeline will be invaluable for managing endemic cases of Lassa fever in West Africa. It will also assist in the management of imported cases of Lassa fever and in global public health efforts. However, as more data emerges, the model will need to be updated using the workflow we have provided. This tool is a significant contribution to bioinformatics capacity on the African continent, a capacity that several African researchers have been calling for in recent times (Nembaware et al., 2023; Olono et al., 2024; Sharaf et al., 2023).

General preprocessing
LASV sequences released between December 1905 and December 1, 2023, along with their accompanying metadata, were downloaded from NCBI Virus (available at https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). To ensure the inclusion of only field samples, the 'exclude lab strain' filter was applied. Using the GPC gene from the reference ID NC_004296, we extracted and aligned the GPC regions from all sequences using LAST (Kiełbasa et al., 2011) and MAFFT (Katoh et al., 2002), available at https://mafft.cbrc.jp/alignment/server/specificregion-last.html, resulting in 1021 GPC sequences. Sequences in which gaps and ambiguous nucleotides made up more than 5% of the total alignment length were removed, reducing the dataset to 808 sequences. The final stop codon position was removed from the alignment because the stop codon signals the termination of translation and does not encode an amino acid. Sequences lacking sampling dates and locations were excluded, leaving 753 sequences.
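The 5% quality filter can be sketched as below. This is an illustration under stated assumptions, not the authors' script: gaps and ambiguous nucleotides are counted together against the aligned sequence length, anything other than A, C, G or T is treated as ambiguous, and the function names and toy records are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")


def gap_or_ambiguous_fraction(aligned_seq):
    """Fraction of alignment columns holding a gap or an ambiguous base."""
    seq = aligned_seq.upper()
    flagged = sum(1 for base in seq if base not in UNAMBIGUOUS)
    return flagged / len(seq)


def filter_alignment(records, max_fraction=0.05):
    """Keep sequences whose flagged fraction does not exceed max_fraction.

    records: dict mapping a sequence name to its aligned nucleotide string.
    """
    return {
        name: seq
        for name, seq in records.items()
        if gap_or_ambiguous_fraction(seq) <= max_fraction
    }


# Toy alignment of length 20 (hypothetical sequences).
records = {
    "clean": "ACGT" * 5,                # 0/20 flagged
    "borderline": "ACGT" * 4 + "ACG-",  # 1/20 = 5%, kept (not more than 5%)
    "poor": "ACGT" * 4 + "AN--",        # 3/20 = 15%, removed
}
kept = filter_alignment(records)
```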
Alignment visualization and exploration were conducted using Aliview (Larsson, 2014). Manual curation was performed to ensure codon consistency (Happi et al., 2023). Specifically, misalignment involving codons disrupted by 3 gaps was adjusted by moving a single nucleotide to match the two others to ensure proper translation. Nucleotide to amino acid translation was done using Aliview.

Phylogenetic analysis of GPC
Using a pipeline based on Snakemake (Köster & Rahmann, 2012), Nextstrain (Hadfield et al., 2018), and Augur (Huddleston et al., 2021), we reconstructed both a Maximum Likelihood tree and a time tree (supplementary Figure 9). The pipeline begins by accepting an alignment and metadata, which includes sampling dates, country, and host information. A Maximum Likelihood tree was built using IQ-TREE (Nguyen et al., 2015) with the application programming interface (API) provided by Augur. Similarly, using the Augur API, the tree was processed by TreeTime (Sagulenko et al., 2018), along with the metadata, to generate a time-calibrated tree.
The trees and accompanying metadata were parsed using the Augur export command into a JSON file, which was subsequently visualized using Auspice (available at https://auspice.us/), a part of the Nextstrain (Hadfield et al., 2018) toolkit. All phylogenetic images were generated using Auspice and edited for clarity in Figma (available at https://www.figma.com/).

Sequence annotation
Sampling locations were extracted from the accompanying metadata for each sequence from GenBank. Sequences were annotated based on the country column to analyze motif differences between Nigerian sequences and those from other countries. The sequences were then grouped into those from Nigeria (542 sequences) and those from other countries (211 sequences).
For lineage annotation, clades in our reconstructed phylogenetic tree were annotated based on information from the literature (Andersen et al., 2015; Ehichioya et al., 2019; Manning et al., 2015; Olayemi et al., 2016; Whitmer et al., 2018; Yadouleton et al., 2020) (see supplementary GitHub data). The leaves of each clade were then collected. Specifically, the sequences were categorized into the following lineages: 480 sequences for lineage II, 59 sequences for lineage III, 194 sequences for lineages IV and V combined, and 16 sequences for lineage VII. Lineages I and VI were excluded due to insufficient data. Lineages IV and V were grouped together based on the recommendation of Whitmer et al. (2018), who pointed out that the distance between the two lineages is similar to those between other sublineages.

Sequence encoding
The amino acid alignment, comprising 753 sequences and 491 positions, was loaded using the Biopython alignment class. Gaps preceding and following any real amino acids were converted to 'unknowns', as these gaps typically indicate either short sequences or sequencing issues and therefore lack biological significance.
To prevent any biases, all features and targets were one-hot encoded. Each residue of every amino acid sequence in the alignment was encoded into a 21-position vector. Gaps between residues were similarly encoded, while 'unknowns' were encoded with zero values. The encoded features were then flattened, with each position in the vector transformed into an individual column, resulting in a data matrix of dimensions 753 x 10,311.
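The masking and encoding steps can be sketched in plain Python as follows. The symbol order in the alphabet, the use of 'X' for unknowns, and the function names are assumptions made for illustration; the source only specifies a 21-position vector per residue with all-zero vectors for unknowns.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the internal gap
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def mask_terminal_gaps(seq, unknown="X"):
    """Replace leading and trailing gaps with an 'unknown' symbol."""
    if not seq.strip("-"):
        return unknown * len(seq)
    lead = len(seq) - len(seq.lstrip("-"))
    trail = len(seq) - len(seq.rstrip("-"))
    return unknown * lead + seq[lead:len(seq) - trail] + unknown * trail


def one_hot_encode(seq):
    """Flatten a sequence into len(seq) * 21 binary features."""
    flat = []
    for symbol in seq:
        vector = [0] * len(ALPHABET)
        if symbol in INDEX:  # unknowns stay all-zero
            vector[INDEX[symbol]] = 1
        flat.extend(vector)
    return flat


masked = mask_terminal_gaps("--MG-S-")  # internal gap is kept as a gap
encoded = one_hot_encode(masked)
n_columns_full_alignment = 491 * len(ALPHABET)
```

With 491 alignment positions and 21 symbols per position, the flattened row has 10,311 columns, matching the matrix width given above.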
The targets were encoded using one-hot variables, facilitated by pandas' get_dummies() method. We trained two models with different targets: country and lineage.

Model training
The dataset was split into training and testing sets using scikit-learn with a random state of 42, ensuring that the test set comprised 20% of the total dataset and stratification was based on the target variable. Using the training set, we trained Random Forest (RF) models with the scikit-learn package. The RF model was chosen for its robustness in handling multidimensional and imbalanced data. The random state for the RF model was set to 80, and the seed state was set to 42 to ensure reproducibility. A total of 100 decision trees were used in the binary country classification, while a total of 1,000 decision trees were used in the lineage classification.
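A minimal sketch of this setup with scikit-learn is shown below, using the split size, stratification, random states and tree count stated above. The data are synthetic stand-ins (random binary features with a few informative columns), not LASV sequences, and integer class labels are used here instead of one-hot targets for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded alignment (assumption): 100 samples,
# 42 binary features (2 alignment positions x 21 symbols), 2 balanced classes.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = rng.integers(0, 2, size=(100, 42))
X[:, :5] = y[:, None]  # make the first five features informative

# 80/20 split, stratified on the target, random state 42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 100 trees as in the binary country model; the lineage model used 1,000.
model = RandomForestClassifier(n_estimators=100, random_state=80)
model.fit(X_train, y_train)

# Precision, recall and F-score per class, as a dictionary.
report = classification_report(y_test, model.predict(X_test), output_dict=True)
```

Stratifying on the target keeps the class proportions of the full dataset in both the training and the test set, which matters when the groups are as unbalanced as 542 versus 211 sequences.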

Model Evaluation
The models' performance was evaluated using the test dataset, which comprised 20% of the entire dataset. Standard metrics from the scikit-learn package, such as precision, F-score, and recall, were used to assess the models' performance through the classification_report() function.
Since amino acid sequences can be identical between samples due to extreme conservation, we also investigated possible overlapping data points between the training and test sets. Approximately twenty percent of the test data was found to be present in the training set. However, this overlap does not affect the results, as the model achieved 100% across all test metrics.
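The overlap check can be expressed as the fraction of test sequences that also occur, as exact strings, in the training set. The sketch below assumes exact string identity as the matching criterion; the function name and toy sequences are hypothetical.

```python
def overlap_fraction(train_seqs, test_seqs):
    """Fraction of test sequences that appear verbatim in the training set."""
    train_set = set(train_seqs)
    shared = sum(1 for seq in test_seqs if seq in train_set)
    return shared / len(test_seqs)


# Toy example (hypothetical sequences): 2 of 5 test sequences are duplicates.
train = ["MGQ", "MGS", "MAS"]
test = ["MGS", "MTT", "MGS", "MKK", "MLL"]
fraction = overlap_fraction(train, test)
```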

Feature importance selection
During the encoding process, each position was represented by a 21-dimensional vector, and the resulting matrix was flattened, creating 491 x 21 columns (totaling 10,311 columns). Therefore, each position in the amino acid sequence is represented by 21 features.
The Random Forest algorithm outputs a large matrix containing the column index and the feature importance (10,311 rows by 2 columns). To determine the exact position of an amino acid in the alignment, each zero-based column index is divided by 21, and the integer part of the quotient is taken and incremented by 1. The weights of the 21 features belonging to each position were combined into an aggregated feature importance score. Each feature corresponds to an exact position in the GPC amino acid sequence. The positions were subsequently ranked using the aggregated feature importance score from highest to lowest.
An overview of the machine learning workflow demonstrates how the two Random Forest models were created using a similar process. (A) The sampling locations were annotated using metadata. (B) The lineages of each sequence in our dataset were annotated using insights from the literature. To achieve optimal performance from the models, hyperparameter tuning, primarily involving varying the number of decision trees, was conducted.
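The index arithmetic used in this section to map flattened features back to alignment positions can be sketched as follows. Summation is assumed as the way the 21 per-symbol weights are combined, since the text does not name the operation; the function names and toy importances are hypothetical.

```python
def aggregate_importance(importances, vector_size=21):
    """Sum per-feature importances into one score per 1-based position."""
    totals = {}
    for column_index, weight in enumerate(importances):
        position = column_index // vector_size + 1  # zero-based index assumed
        totals[position] = totals.get(position, 0.0) + weight
    return totals


def rank_positions(importances, vector_size=21):
    """Positions ordered from highest to lowest aggregated importance."""
    totals = aggregate_importance(importances, vector_size)
    return sorted(totals, key=lambda position: totals[position], reverse=True)


# Toy importances for 3 alignment positions (3 x 21 = 63 features).
toy = [0.0] * 63
toy[0], toy[20] = 0.1, 0.1  # two symbols of position 1
toy[21] = 0.5               # one symbol of position 2
toy[62] = 0.3               # one symbol of position 3
ranking = rank_positions(toy)
```

In the full 10,311-column matrix, zero-based column indices 1239 to 1259 all map to amino acid position 60 under this scheme.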

Figure 2: Implicated regions of variation on the LASV GPC

Figure 3: Re-emergence of insertion on position 60 in lineage II

Figure 6: Machine Learning training workflow
[...] classify sequences into four groups: lineages II, III, IV & V, and VII, as permitted by existing data. On the initial test set, which was preprocessed similarly to the training set, the model achieved 100% across all test metrics (Precision, Recall, F1-score). The model was subsequently exported into a pipeline, which was validated using recently published data by Adesina et al. (2023) and Bangura et al. (2024). These datasets were not included in the training or test datasets, and the sequences from Adesina et al. (2023) did not cover the length of the GPC. The pipeline accurately assigned the sequences from Bangura et al. (2024) and Adesina et al. (2023) with precisions of 100% (Figure