Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.


Introduction
During the COVID-19 pandemic, whole genome sequencing (WGS) of the SARS-CoV-2 virus has played a crucial role in unraveling important biological information. Through phylogenetic analysis, it has been revealed that SARS-CoV-2 shares 50% and 79% sequence similarity with MERS-CoV and SARS-CoV, respectively, indicating their evolutionary connections [1]. Notably, the genome sequence of SARS-CoV-2 exhibits an 85% similarity to a bat coronavirus, establishing its zoonotic origin within the Coronaviridae family and the Betacoronavirus genus [2]. These genomic data have been instrumental in confirming the virus's source and classification. Recognizing the significance of genetic data from diverse SARS-CoV-2 sequences and variants, researchers worldwide swiftly moved to gather comprehensive genome information [3,4]. The Centers for Disease Control and Prevention's Office of Advanced Molecular Detection (AMD) released details regarding SARS-CoV-2 whole genome sequencing on various platforms, including PacBio, Illumina, and Ion Torrent. Emphasizing the importance of publicly accessible genome sequences, the World Health Organization (WHO) strongly supports their utilization in developing novel public health strategies and conducting research to combat the spread of COVID-19. A valuable resource in this endeavor is the Global Initiative on Sharing All Influenza Data (GISAID), which hosts one of the largest international databases of SARS-CoV-2 genome sequences [5]. Leveraging GISAID, along with the open-source tools NextStrain and NextClade, researchers have made significant advancements in their investigations [6,7]. These resources have proven instrumental in understanding the evolution and characteristics of the virus, aiding in the development of efficient strategies to mitigate the spread of COVID-19 [8][9][10].
Third-generation sequencing technology has emerged as a widely used method for sequencing SARS-CoV-2 during the pandemic. These technologies, known for their ability to generate long reads, are increasingly employed in transcriptomics studies. Advancements in long-read sequencing enable the comprehensive sequencing of RNA molecules, utilizing cDNA or direct RNA protocols from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) [11][12][13]. However, the high error rates associated with long-read technologies pose challenges for accurate and efficient downstream analysis, such as genome assembly. Indels, or insertions and deletions, are the primary error types that complicate alignment processes. While various error correction tools exist, there remains a need for further development in this computational biology domain. To effectively combat the COVID-19 infection and facilitate research, an increased number of SARS-CoV-2 genome sequences are required [14,15]. Researchers worldwide rely on third-generation sequencing technologies to sequence the virus, and virus-tracking efforts in turn rely heavily on these genomic sequences. To analyze genomic data effectively, scientists employ machine learning (ML) and deep learning (DL) algorithms along with embedding methods for classification purposes [16][17][18][19]. ML and DL algorithms have become valuable tools even for novice bioinformatics practitioners and core data analysts who may lack prior knowledge of sequencing technologies and associated challenges. These algorithms enable comprehensive analysis of SARS-CoV-2 sequencing data, contributing to advancements in classification techniques and aiding in our understanding of the virus's genetic characteristics and behavior.
Therefore, it is crucial to establish a robust benchmark report on SARS-CoV-2 genome sequences generated using third-generation sequencing technology, which will serve as a guide for future genomic research involving long-read sequencing.
The current study aims to evaluate the performance of current classification models in handling third-generation sequencer-specific errors present in SARS-CoV-2 genome sequences. Specifically, the study investigates the effectiveness of various embedding methods under specified levels of disturbance. The evaluation of machine learning models on SARS-CoV-2 genomic sequences remains limited, with only a few existing studies in this area. For instance, a previous study [20] conducted a benchmark of ML and DL models using different embedding methods for classifying SARS-CoV-2 genome sequences that included sequencer-specific errors. However, this study did not identify the best ML model for SARS-CoV-2 genome sequence classification. Following a similar approach, our current study focuses exclusively on SARS-CoV-2 genomes generated using long reads obtained from third-generation sequencing (TGS) technologies, such as PacBio and Nanopore, while also considering the possibility of random errors occurring by chance. To assess the effectiveness of machine learning algorithms on SARS-CoV-2 genome sequences, we conducted simulations that accounted for various error types. Our simulations employed two primary approaches: one involved generating SARS-CoV-2 genome sequences with platform-specific errors (PacBio or ONT), while the other introduced random errors. The workflow for these simulations is depicted in Figure 1. To analyze the SARS-CoV-2 sequences, we employed six distinct embedding methods: one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM). Leveraging these embedding methods, we performed supervised analyses using a variety of linear and non-linear classifiers, considering both clean and error-incorporated SARS-CoV-2 sequences.
This comprehensive methodology enabled us to evaluate the effectiveness of these methods in detecting errors and classifying sequences.

Figure 1. Workflow employed for generating a dataset incorporating long-read (PacBio and ONT) specific errors, which was subsequently used to evaluate the robustness of machine learning models employing different embedding techniques. The workflow comprises four distinct steps: 1. collection of high-quality SARS-CoV-2 reference genome sequences from GISAID; 2. generation of long-read sequences tailored to the PacBio and ONT sequencers, incorporating sequencer-specific errors (represented as a star), utilizing the reference genome for the ORF1a gene; 3. alignment of the error-incorporated long reads (reads marked with a star) to the reference genome; and 4. creation of the final compilation of SARS-CoV-2 genome sequences, encompassing long-read sequencer-specific errors (represented as a star) in the ORF1a gene.
The subsequent sections of the current study are described in an arranged manner as follows. Section 2 comprises comprehensive details of the dataset statistics, dataset generation methodology, and various embedding techniques considered to convert SARS-CoV-2 genome sequences to fixed-length numerical representations. Our results for accuracy and robustness are reported in Section 3. Finally, the current study concludes in Section 4.

Dataset and Methodology
This section is devoted to the elucidation of the datasets utilized in this study and the process through which the validation dataset incorporating long-read specific error models (PacBio and ONT) and random errors was generated (refer to Section 2.1). In addition, Section 2.2 provides a succinct overview of the different types of embedding methods employed. The methodology adopted for the development of machine learning classification algorithms and the computation of their accuracy and robustness is presented in Section 2.3. Finally, Section 2.4 expounds on the visualization of the high-dimensional SARS-CoV-2 sequencing data.

Dataset Generation
In this study, four distinct datasets were employed. One of these datasets encompasses genomes from the Global Initiative on Sharing All Influenza Data (GISAID), which have been meticulously curated to ensure their accuracy. The remaining three datasets were derived from distinct error models: two were generated using the PacBio and ONT error models, while the last was produced via a random error model. A detailed exposition of the dataset properties and characteristics is provided in their corresponding sections, namely Sections 2.1.1-2.1.4.

Dataset 1: High-Quality SARS-CoV-2 Genome Sequences
To create a dataset of high-sequencing-quality SARS-CoV-2 whole genome sequences, we analyzed 8172 sequences from GISAID collected between September and December 2021. Our selection criteria focused on complete and high-coverage genome sequences to ensure the collection of high-quality genomes. We specifically limited our sequence collection to those obtained from the human host. Additionally, we gathered lineage information for the sequences, resulting in 41 unique Pango lineages within our dataset. For detailed sequence statistics, please refer to Table 1.

Dataset 2: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating PacBio Sequencing Errors
To generate the second SARS-CoV-2 genome sequence dataset, we simulated long reads with Pacific Biosciences (PacBio) sequencing errors. This was accomplished using PBSIM, a tool specifically designed to simulate PacBio sequencing reads with varying error rates [21]. PBSIM can generate two types of reads associated with the PacBio sequencer: continuous long reads (CLR) and circular consensus sequencing (CCS) short reads. CCS reads generally exhibit lower error rates compared to CLR reads. PBSIM offers two simulation approaches, sampling-based and model-based, which facilitate the generation of PacBio CCS and CLR reads. In the sampling-based simulation, PBSIM considers the quality and length of the input read set, while the model-based simulation incorporates a built-in error model.
In our study, we employed the model-based approach of PBSIM, utilizing the pbsim-v1.0.3 tool to simulate PacBio long reads with errors based on the genomic sequence of SARS-CoV-2. Subsequently, these erroneous long reads were aligned to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22], and variants were called from the aligned reads using bcftools v1.6 [23]. This process resulted in the generation of a consensus sequence, which represents a SARS-CoV-2 genome sequence incorporating typical long-read errors. We generated simulated reads with errors at two distinct depths, namely 5× and 10×, specifically on Dataset 1, leading to the creation of Dataset 2.

Dataset 3: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating Oxford Nanopore Technology (ONT) Sequencing Errors
The third dataset was generated using long reads simulated with an Oxford Nanopore Technologies (ONT) sequencing error profile. To simulate long reads with Nanopore sequencing errors, we employed the Badread software tool, known for incorporating realistic artifacts introduced by Nanopore sequencers, including chimeras, junk reads, glitches, and adapters [24]. Badread utilizes a gamma distribution for read length, allowing for user-specified mean and standard deviation parameters.
We utilized Badread v0.2.0 to simulate ONT long reads with errors based on the genome sequence of SARS-CoV-2. Following this, we aligned these error-prone long reads to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22]. By leveraging bcftools v1.6 [23] to call variants from the aligned reads, we obtained a consensus SARS-CoV-2 sequence that incorporated errors typically associated with Nanopore sequencing technologies. In order to ensure a comprehensive analysis, we generated erroneous simulated reads at two distinct depths: 5× and 10×. These steps were specifically performed on Dataset 1, resulting in the creation of Dataset 3.

Dataset 4: SARS-CoV-2 Genome Sequences Generated from Long Reads Incorporating Random Errors
The fourth dataset of the SARS-CoV-2 genome sequence was generated using long reads with random errors. For this purpose, we utilized the random option in the Badread software tool, known for its ability to simulate long reads with various types of errors, including random errors [24].
By utilizing the random option available in Badread v0.2.0, we simulated long reads with random errors. Subsequently, these long reads were aligned to the SARS-CoV-2 reference genome (GenBank accession number NC_045512.2) using Minimap2 v2.24 [22]. We then performed variant calling using bcftools v1.6 [23] and derived a consensus SARS-CoV-2 sequence by incorporating the variants caused by the random errors. To ensure a comprehensive analysis, we repeated this procedure at two different read depths, specifically 5× and 10×, which aligns with the approach taken for Datasets 2 and 3. These steps were executed on Dataset 1, resulting in the creation of Dataset 4.

Embedding Generation Methods
This section delineates the analytical methodologies used to examine the datasets explicated in the previous section. Six distinct embedding methods, namely one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM), were implemented to transform the sequences into machine-readable, low-dimensional numerical embeddings (also known as feature vectors). The specifics of each method are elaborated upon in their respective sections, namely Sections 2.2.1-2.2.6.

One-Hot Encoding (OHE)
One-hot encoding (OHE) is a common method for generating a numerical embedding from a nucleotide sequence. OHE represents each nucleotide in the sequence as a binary (0-1) vector; in this case, the nucleotides are A, T, C, and G [17,18]. To illustrate this mathematically, consider a function f that maps each nucleotide to its corresponding one-hot encoding vector (Equation (1)), where each nucleotide is drawn from the alphabet Σ = {A, T, C, G}. For instance, f(A) = [1, 0, 0, 0]. The one-hot vectors generated by f for the individual nucleotides of a DNA sequence are then concatenated to obtain the final embedding vector.
Using the above concept, a DNA sequence X of length n can be represented by a final binary vector φ_x, obtained by concatenating the one-hot vectors of its individual nucleotides (Equation (2)): φ_x = [f(X_1), f(X_2), . . . , f(X_n)], where X_i is the nucleotide of X at position i. The dimension of φ_x is |Σ| × n, where |Σ| is the size of the nucleotide alphabet (4 in this case).
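As a concrete illustration, the mapping f and the concatenated vector φ_x can be sketched in a few lines of Python (the function names are ours, not from the original study; any nucleotide ordering works as long as it is fixed):

```python
import numpy as np

ALPHABET = "ATCG"  # Sigma, following the nucleotide order used in the text
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(nucleotide):
    # f: maps a single nucleotide to its 4-dimensional binary vector
    vec = np.zeros(len(ALPHABET), dtype=int)
    vec[INDEX[nucleotide]] = 1
    return vec

def ohe_embedding(seq):
    # phi_x: concatenation of f(X_1), ..., f(X_n); dimension |Sigma| * n
    return np.concatenate([one_hot(c) for c in seq])
```

For example, `ohe_embedding("AT")` yields `[1, 0, 0, 0, 0, 1, 0, 0]`, a vector of dimension 4 × 2.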

Wasserstein-Distance-Based Generative Adversarial Network for Representation Learning (WDGRL)
The Wasserstein-distance-based generative adversarial network for representation learning (WDGRL) is an unsupervised method intended to generate a low-dimensional embedding [25]. It accomplishes the goal by extracting the features from input data using a neural network-based model that takes advantage of the source and encoded target data distribution. The model determines the Wasserstein distance (WD) between the original high-dimensional vector and low-dimensional representation. WDGRL considers one-hot encoded (OHE) vectors as input and generates a very-low-dimensional representation that consists solely of essential features.
Let us consider X one-hot encoded (OHE) SARS-CoV-2 genome sequences as input data D to the neural network model M_θ. The model M_θ with parameters θ maps every OHE vector X_i from X to a low-dimensional representation h_i in R^d. During this process, M_θ learns to generate an h_i that captures the essential features of X_i by considering the encoded distribution of the source and the target data.
Suppose the distributions of the encoded representations h_i for the source and target data are P_s(h) and P_t(h); both can be estimated by a density estimation method. The loss function for WDGRL can then be written as (Equation (3)): L_wd = E_{h∼P_s}[f(h)] − E_{h∼P_t}[f(h)], where f is a 1-Lipschitz function used to approximate the Wasserstein distance between the distributions P_s(h) and P_t(h). The function f(h) can be parameterized by another neural network D_α, which takes the encoded representation h_i as input and outputs a scalar value. The critic D_α is trained to maximize this objective (equivalently, to minimize its negative) with respect to α, while the encoder M_θ is trained to minimize it with respect to θ. The two networks are trained jointly using gradient descent or other optimization methods. The resulting low-dimensional representation captures important features of the input data that are useful for downstream tasks, such as classification.
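To build intuition for the quantity the critic network approximates, the Wasserstein-1 distance between two equal-size one-dimensional samples has a simple closed form: the mean absolute difference between the sorted samples. The sketch below (ours, not part of WDGRL itself) computes this closed form; in WDGRL, the critic learns to approximate the same quantity for high-dimensional encoded representations:

```python
import numpy as np

def wasserstein_1d(source, target):
    # For equal-size 1-D samples, the Wasserstein-1 distance between the
    # empirical distributions equals the mean absolute difference between
    # the sorted samples (the optimal transport plan matches order statistics).
    s = np.sort(np.asarray(source, dtype=float))
    t = np.sort(np.asarray(target, dtype=float))
    return float(np.mean(np.abs(s - t)))
```

Identical samples give a distance of 0, and shifting one sample set by a constant c shifts the distance by exactly c, which is the behavior the WDGRL loss exploits to align source and target encodings.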

String Kernel
The string kernel method operates in the non-Euclidean space and measures the similarity between the SARS-CoV-2 genome sequences by computing a kernel matrix, also known as a Gram matrix [26].
Given a set of SARS-CoV-2 genome sequences X = {X_1, X_2, X_3, . . . , X_n}, the string kernel method first computes the k-mers of length k; in the current scenario, k is 3 for each genome sequence. The term k-mer refers to a substring of length k that occurs in the SARS-CoV-2 genome sequence.
Let M be the matrix of all possible k-mer counts, where M_ij is the number of times k-mer i occurs in sequence j. Then, the similarity between two SARS-CoV-2 genome sequences X_i and X_j can be computed using the kernel function of Equation (5): K(X_i, X_j) = Σ_k M_ki M_kj, where M_ki and M_kj are the numbers of occurrences of k-mer k in sequences X_i and X_j, respectively.
To reduce the computational complexity, the string kernel method uses a locality-sensitive hashing-based approach to estimate the matching k-mers of two sequences at distance m from each other. This approach hashes k-mers into bins according to their locality and then uses the bin information to estimate the matching k-mers between the two SARS-CoV-2 sequences. The resulting kernel matrix K from Equation (5) is symmetric, and K_ij is the kernel value between the genome sequences X_i and X_j.
To generate a low-dimensional representation of the input genome sequences, we performed kernel principal component analysis (PCA) on the kernel matrix K. Kernel PCA is a nonlinear dimensionality reduction technique that maps the input data to a new space defined by the eigenvectors of the kernel matrix. The top components are further considered as the reduced dimensional feature vector for the downstream analysis.
Suppose that, for the kernel matrix K, V is the matrix of eigenvectors and Λ is the diagonal matrix of the corresponding eigenvalues. The projection onto the k-th principal component is then given by V[:, k] · √λ_k, where V[:, k] is the k-th column of the matrix V and λ_k is the k-th diagonal element of Λ. For this analysis, the top 500 principal components were selected using a standard validation approach, and these components form the final feature vector for each SARS-CoV-2 genome sequence for downstream tasks such as classification.
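The pipeline above — exact k-mer counting, the Gram matrix of Equation (5), and an eigendecomposition-based kernel PCA — can be sketched as follows. This is our simplified illustration: it counts k-mers exactly rather than using the locality-sensitive hashing approximation, and the function names are ours:

```python
import numpy as np
from itertools import product

def kmer_counts(seq, k=3, alphabet="ACGT"):
    # Count vector over all |alphabet|**k possible k-mers of one sequence
    cols = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(cols))
    for i in range(len(seq) - k + 1):
        v[cols[seq[i:i + k]]] += 1
    return v

def string_kernel_matrix(seqs, k=3):
    # K[i, j] = sum over all k-mers of count_i * count_j (Equation (5))
    M = np.array([kmer_counts(s, k) for s in seqs])
    return M @ M.T

def kernel_pca(K, n_components=2):
    # Eigendecompose K = V Lambda V^T and project onto the top components:
    # each column of the result is V[:, k] * sqrt(lambda_k)
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

In the study, k = 3 and the top 500 components are retained; here `n_components` defaults to 2 just to keep the sketch small.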

Spaced k-Mers
The spaced k-mers method is used to reduce the sparsity and size of the k-mer representation (nucleotide substrings of length k) of the SARS-CoV-2 genome sequence [19]. Given a SARS-CoV-2 genome sequence S, the spaced k-mers method first computes g-mers, i.e., nucleotide subsequences of length g (where g is an integer > 1). From those g-mers, the method then computes k-mers, where k < g. While generating k-mers from g-mers, the method skips some of the characters (nucleotides), leaving a gap whose size is determined by g − k.
Formally, suppose S is a SARS-CoV-2 genome sequence of length L. The set of g-mers is then defined as G = {S_{i : i+g−1} | 1 ≤ i ≤ L − g + 1}, where S_{i : i+g−1} denotes the genomic subsequence of S starting at position i and ending at position i + g − 1. From this set of g-mers, the method computes the set of k-mers {S_{i : i+k−1}}, i.e., the k-mer at the start of each g-mer. After computing the k-mers, the method generates a numerical vector of length |Σ|^k (where Σ corresponds to the alphabet A, C, G, T).
In our case, the spaced k-mers method uses k = 4 and g = 9, determined using a standard validation approach. This means we compute k-mers of length 4 from SARS-CoV-2 genomic subsequences of length 9, with a gap of 5 nucleotides between adjacent subsequences.
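A minimal sketch of this construction is shown below, assuming the k-mer kept from each g-mer is its leading k characters, so the trailing g − k positions act as the gap (the authors' exact gap placement may differ; the names are ours):

```python
from itertools import product

def spaced_kmers(seq, k=4, g=9):
    # Take every g-mer of the sequence and keep only its first k
    # characters; the remaining g - k positions (here 9 - 4 = 5)
    # form the gap described in the text.
    gmers = [seq[i:i + g] for i in range(len(seq) - g + 1)]
    return [gm[:k] for gm in gmers]

def spaced_kmer_vector(seq, k=4, g=9, alphabet="ACGT"):
    # Frequency vector of length |alphabet|**k over the spaced k-mers
    cols = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(cols)
    for km in spaced_kmers(seq, k, g):
        vec[cols[km]] += 1
    return vec
```

Because only L − g + 1 k-mers are produced instead of L − k + 1, the representation is smaller and less sparse than a plain k-mer spectrum.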

Weighted k-Mers
The weighted k-mers-based spectrum method represents biological sequences as fixed-length vectors that capture the occurrences of all possible k-mers (k represents the length of the subsequences). The method assigns each k-mer a weight calculated from its inverse document frequency (IDF), which measures how uncommon a particular k-mer is across all sequences.
Formally, to compute the IDF weights, we first determine the total number of input SARS-CoV-2 genome sequences, N, and the number of input genome sequences that contain a specific k-mer i, n_i. The IDF weight for k-mer i is then given by IDF(i) = log(N / n_i). Next, we generate a list of all possible k-mers over the nucleotide set {A, C, G, T} of the input SARS-CoV-2 genome sequences. For each SARS-CoV-2 genome sequence, we calculate the frequency of each k-mer and multiply it by the corresponding IDF weight to obtain a weighted frequency: w(i, j) = f(i, j) × IDF(i). Here, w(i, j) is the weighted frequency of k-mer i in SARS-CoV-2 genome sequence j, f(i, j) is the frequency of k-mer i in SARS-CoV-2 genome sequence j, and IDF(i) is the IDF weight of k-mer i.
These weighted frequency values are then used to construct a frequency vector, where each element represents the weighted frequency of a particular k-mer in the sequence. The frequency vector for sequence j is given by V_j = [w(1, j), w(2, j), . . . , w(m, j)], where m is the total number of possible k-mers. In our experiment, we considered k = 3, selected by a standard validation approach.
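The IDF weighting described above can be sketched as follows (our illustration; rows of the returned matrix are sequences, columns are k-mers):

```python
import numpy as np
from itertools import product

def idf_weighted_vectors(seqs, k=3, alphabet="ACGT"):
    cols = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    N = len(seqs)
    F = np.zeros((N, len(cols)))              # f(i, j): raw k-mer frequencies
    for j, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            F[j, cols[s[i:i + k]]] += 1
    n = (F > 0).sum(axis=0)                   # n_i: sequences containing k-mer i
    # IDF(i) = log(N / n_i); unseen k-mers get weight 0
    idf = np.where(n > 0, np.log(N / np.maximum(n, 1)), 0.0)
    return F * idf                            # w(i, j) = f(i, j) * IDF(i)
```

Note that a k-mer present in every sequence receives IDF(i) = log(1) = 0, so ubiquitous (uninformative) k-mers are zeroed out of the final vectors.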

The Weighted Position Weight Matrix (PWM)
The weighted position weight matrix (PWM) technique generates PWM scores for all possible k-mers (k is the length of the subsequences) in SARS-CoV-2 genome sequences [27]. The PWM is produced by a two-step method: in the first step, the method counts the occurrences of each base at every position of the k-mers; in the second step, it computes a weighted score for each k-mer from the log-odds ratio of its observed frequency relative to the background frequency. The background frequency is estimated using a Laplace pseudocount and the assumption of equal probability for each nucleotide.
The PWM score can be written as score(k-mer) = Σ_{i=1}^{K} log2(p(i, b) / b_i), where K is the length of the k-mer, p(i, b) is the (pseudocount-adjusted) frequency of the nucleotide b observed at position i of the k-mer, b_i is the background frequency of the nucleotide b, and log2 is the base-2 logarithm function. The final output is a list of scores for each k-mer in each input sequence, where the score of a k-mer is the sum of the weight scores of its bases. For the current experiment, k = 3 was selected using the standard validation set approach. Our objective is to evaluate the performance of the classification algorithms using two different approaches.
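Before turning to the evaluation approaches, the two-step PWM computation can be sketched as below (our illustration, assuming a Laplace pseudocount of 1 and a uniform background of 0.25 per nucleotide, as described in the text):

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def build_pwm(kmers, pseudocount=1.0, background=0.25):
    # Step 1: count each base at every position of the observed k-mers
    K = len(kmers[0])
    counts = np.zeros((K, 4))
    for km in kmers:
        for pos, c in enumerate(km):
            counts[pos, IDX[c]] += 1
    # Step 2: Laplace pseudocount, then log-odds against the uniform background
    probs = (counts + pseudocount) / (len(kmers) + 4 * pseudocount)
    return np.log2(probs / background)

def pwm_score(pwm, kmer):
    # Score of a k-mer: sum of its per-position weight scores
    return float(sum(pwm[pos, IDX[c]] for pos, c in enumerate(kmer)))
```

With kmers = ["AAA", "AAA"], each position gives probability (2 + 1)/(2 + 4) = 0.5 for A, so the weight is log2(0.5/0.25) = 1 per position and the score of "AAA" is 3.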

Approach 1: Accuracy
Here, we compute the average accuracy, precision, recall, F1 (weighted), F1 (Macro), and ROC-AUC for the entire dataset, including all the class labels mentioned in Table 1. We exclude error sequences from the dataset.

Approach 2: Robustness
Robustness is crucial for machine learning models: it represents their ability to generate reasonable outputs for input examples not included in the training data. As our test set, we consider only the noisy examples incorporating PacBio-, ONT-, and random-protocol-specific errors, whereas the training set uses error-free sequences. We then calculate the average accuracy, precision, recall, F1 (weighted), F1 (macro), and ROC-AUC for the ML models on the test set. Together, these two strategies provide a comprehensive evaluation of the machine learning algorithms' performance, allowing us to compare them and identify the most suitable algorithm for our classification task.
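The robustness protocol (fit on clean embeddings, score on error-incorporated ones) can be sketched as follows. The toy nearest-centroid classifier here is our stand-in; any of the study's models (SVM, RF, LR, ...) would take its place:

```python
import numpy as np

class NearestCentroid:
    # Toy stand-in classifier: predicts the label of the closest class centroid.
    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = np.array(
            [X[np.asarray(y) == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        # Squared Euclidean distance of every sample to every centroid
        d = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        return [self.labels[i] for i in d.argmin(axis=1)]

def robustness_accuracy(model, X_clean, y_clean, X_err, y_err):
    # Approach 2: train on error-free embeddings, evaluate on noisy ones
    model.fit(X_clean, y_clean)
    pred = model.predict(X_err)
    return float(np.mean(np.asarray(pred) == np.asarray(y_err)))
```

The same split logic applies to the other metrics (precision, recall, F1, ROC-AUC); only the scoring function changes.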

Data Visualization
In order to ascertain whether there exists any inherent clustering in our dataset, we employ the t-distributed stochastic neighbor embedding (t-SNE) approach to produce a two-dimensional representation of the feature embeddings [28].

Results and Discussion
This section provides an overview of the outcomes achieved by our methods on the datasets employed in this study. The first subsection, labeled Section 3.1, discusses the accuracy evaluation of machine learning classification algorithms that utilized various embedding methods. The second subsection, labeled Section 3.2, covers the robustness evaluation of machine learning classification algorithms that used different embedding methods. The third subsection, labeled Section 3.3, focuses on the comparison of predictive performance of machine learning models on SARS-CoV-2 sequences with errors obtained from PacBio and ONT sequencers. Lastly, Section 3.4 explores the analysis of coronavirus variants using various embedding vector generation methods with the aid of t-SNE visualization.

Accuracy Evaluation of Machine Learning Classification Algorithms Using Different Embedding Methods
We considered 8172 clean (error-free) full-length SARS-CoV-2 nucleotide sequences from the GISAID database. These sequences were used to evaluate the machine learning models with embedding methods. In order to do that, we split the sequences into training and test sets with a 70/30 ratio. After that, we executed each analysis five times and report the average results in Table 2 and Figure 2. The results show that the machine learning classification algorithms' performance varies significantly depending on the embedding method employed. Specifically, the one-hot embedding method leads to an accuracy of 0.773 for the SVM algorithm, whereas the WDGRL embedding method only results in an accuracy of 0.327. The spaced k-mers embedding method with the SVM, RF, LR, and DT classification algorithms achieves an accuracy of up to 0.956. This method employs g-mers and k-mers to decrease the sparsity and size of the k-mer representation of the genome sequence. As a result, it generates fixed-length vectors that capture the occurrences of all possible k-mers, which are then used to construct frequency vectors representing the frequency of each k-mer in the sequence. This method performs well with the error-free set of SARS-CoV-2 genome sequences. However, the NB algorithm yields the worst results, with an accuracy of only 0.017 when the weighted k-mers embedding method is used. Additionally, some algorithms, such as SVM and LR, have significantly longer training times compared to others. Thus, when selecting an algorithm and embedding method, one should consider both performance and training time.

Robustness Evaluation of Machine Learning Classification Algorithms Using Different Embedding Methods
We considered 8172 clean SARS-CoV-2 sequences and incorporated errors specific to PacBio, ONT, and the random protocol, as described in the methods section. This approach generated three different types of datasets: genome sequences with typical PacBio sequencing errors, ONT sequencing errors, and random errors. To evaluate the robustness of the machine learning models with embedding methods on these three datasets, we train the models with clean SARS-CoV-2 sequences and test them on error-incorporated sequences. Table 3 displays the accuracy values of various machine learning classification algorithms that used different embedding methods on SARS-CoV-2 genome sequence datasets simulated at two different depths, 5× and 10×, with PacBio sequencer-specific errors incorporated. Furthermore, Figure 3 reveals that the accuracy values for machine learning algorithms ranged from 0.001 to 0.276 across all embedding methods. The spaced k-mers embedding method, in general, performed better than the other embedding methods, achieving the highest accuracy value of 0.276 for the largest number of algorithms on the depth-5 sequencing dataset, and a similar trend was observed for the depth-10 dataset. The reason is that the spaced k-mers method employs g-mers and k-mers to decrease the sparsity and size of the k-mer representation of the genome sequence. As a result, it generates fixed-length vectors that capture the occurrences of all possible k-mers, which are then used to construct frequency vectors representing the frequency of each k-mer in the sequence. The accuracy results confirm that as the depth decreases, the error rate increases, resulting in a decrease in the performance of machine learning models. The models' performance did not improve significantly when increasing the sequencing depth from 5× to 10× between the two SARS-CoV-2 genome sequence datasets.

Table 3. A comprehensive analysis of the robustness of 8172 SARS-CoV-2 genome sequences under two different sequencing depths (5× and 10×) and specific errors associated with the PacBio sequencer. The best values are highlighted in bold for ease of interpretation.

Table 4 displays the accuracy values obtained from different machine learning algorithms using various embedding methods on two SARS-CoV-2 genome sequence datasets with depths of 5× and 10×, respectively, which were generated from long reads containing Oxford Nanopore Technology (ONT) sequencer-specific errors. Moreover, Figure 4 presents a heatmap that visualizes the accuracy values, which ranged from 0.001 to 0.276. The weighted k-mers embedding method resulted in the highest accuracy values for the majority of the machine learning algorithms on both datasets, i.e., depths of 5× and 10×. Because each k-mer is given a weight depending on its inverse document frequency under the weighted k-mers technique, this method generates fixed-length vectors that capture the existence of all potential k-mers. These vectors are then used to create frequency vectors that indicate the frequency of each k-mer in the sequence. However, due to the lower sequencing depth with ONT sequencer-specific errors, poor-quality SARS-CoV-2 genome sequences were generated, leading to a significant decrease in the predictive performance of machine learning algorithms.

Table 4. A comprehensive analysis of the robustness of 8172 SARS-CoV-2 genome sequences under two different sequencing depths (5× and 10×) and specific errors associated with the Oxford Nanopore Technology (ONT) sequencer. The best values are highlighted in bold for ease of interpretation.

The Robustness Results for Random-Error-Incorporated Datasets
In this section, we evaluated the accuracy of the machine learning algorithms using the different embedding methods on two SARS-CoV-2 genome sequence datasets generated by incorporating random errors into long reads at depths of 5 and 10. The results, presented in Table 5 and Figure 5, indicate that the weighted k-mers method achieved the highest accuracy of 0.276 across the majority of the machine learning classification algorithms for both datasets. The main objective of incorporating random errors into the SARS-CoV-2 datasets was to compare the performance of the machine learning models on datasets generated by different types of errors, i.e., sequencer-specific errors and random errors. Interestingly, we found little difference in accuracy between these two error types.
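A random-error protocol like the one used for these datasets can be sketched as below: walk over each base and, with some probability, apply a substitution, insertion, or deletion. The per-base error rate and the uniform mix of error types here are illustrative assumptions, not the exact parameters of our protocol.

```python
import random


def add_random_errors(seq, rate=0.1, alphabet="ACGT", seed=None):
    """Introduce random substitutions, insertions, and deletions.

    Illustrative sketch: each base is corrupted independently with
    probability `rate`, and the error type is drawn uniformly.
    """
    rng = random.Random(seed)  # seedable for reproducible datasets
    out = []
    for base in seq:
        if rng.random() < rate:
            kind = rng.choice(("sub", "ins", "del"))
            if kind == "sub":
                # Replace the base with a different one.
                out.append(rng.choice([b for b in alphabet if b != base]))
            elif kind == "ins":
                # Keep the base and insert a random one after it.
                out.extend((base, rng.choice(alphabet)))
            # "del": drop the base entirely.
        else:
            out.append(base)
    return "".join(out)
```

Unlike sequencer-specific simulators, which reproduce platform-dependent error profiles (e.g., ONT's context-dependent indels), this protocol spreads errors uniformly, which is what makes it a useful neutral baseline for the comparison above.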

Comparison of Predictive Performance of Machine Learning Models on SARS-CoV-2 Sequences with Errors from PacBio and ONT Sequencers
Third-generation sequencing (TGS) technologies such as PacBio and Oxford Nanopore Technology (ONT) are widely used to generate long reads, which come with high error rates. PacBio technology, however, sequences a DNA molecule multiple times, whereas ONT sequences it only twice, so PacBio produces higher-quality data with lower error rates than ONT. Through our analysis, we found that the errors specific to the PacBio sequencer have a more pronounced impact on the predictive performance of the machine learning (ML) models on SARS-CoV-2 sequences than the ONT-specific errors. Thus, although PacBio sequences have a lower error rate than ONT sequences, the low predictive power we observed was due to low coverage. We also compared the predictive performance of the ML models on SARS-CoV-2 sequences incorporated with random errors against the other datasets and observed results similar to the ONT scenario.

Analysis of Coronavirus Variants Based on Different Embedding Vector Generation Methods Using t-SNE Visualization
The t-distributed stochastic neighbor embedding (t-SNE) method is a widely used data visualization technique that preserves the pairwise distances between high-dimensional vectors in a lower-dimensional space. In this study, we employed t-SNE to visualize the clustering patterns of different coronavirus variants using the various embedding vector generation methods, including one-hot encoding (OHE), Wasserstein-distance-guided representation learning (WDGRL), string kernel, spaced k-mers, weighted k-mers, and weighted position weight matrix (PWM). Our analysis, depicted in Figure 6, shows the effectiveness of t-SNE in capturing the pairwise distance information and revealing the distinct grouping patterns of coronavirus variants in a two-dimensional space. Specifically, the t-SNE plot based on the OHE vectors grouped the AY.44 variants more clearly than the other variants, while the WDGRL vectors produced smaller variant groups than the OHE vectors. Furthermore, the string-kernel-based t-SNE plot exhibited clearer grouping of AY.44 and the other variants than the OHE-based plot. The spaced k-mers method showed the most distinct grouping of variants among all the embedding vector generation methods. The weighted k-mers vectors grouped the variants similarly to the WDGRL vectors, whereas the weighted PWM vectors showed grouping patterns closer to those of the string kernel vectors.
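The visualization step described above can be sketched as follows: build an embedding matrix (here a simple one-hot encoding over a toy corpus standing in for the SARS-CoV-2 sequences) and project it to two dimensions with scikit-learn's t-SNE. The toy sequences and the perplexity value are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE


def one_hot_embed(seqs, alphabet="ACGT"):
    """Flatten each equal-length sequence into a binary one-hot vector."""
    idx = {b: i for i, b in enumerate(alphabet)}
    n = len(alphabet)
    X = np.zeros((len(seqs), len(seqs[0]) * n))
    for r, s in enumerate(seqs):
        for c, b in enumerate(s):
            if b in idx:
                X[r, c * n + idx[b]] = 1.0
    return X


# Toy corpus standing in for the embedded SARS-CoV-2 variant sequences.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGTTGGTT", "TTGGTTGGTA"] * 3
X = one_hot_embed(seqs)

# Project the high-dimensional embedding to 2-D while approximately
# preserving neighborhood structure; perplexity must be < n_samples.
coords = TSNE(n_components=2, perplexity=5.0, init="random",
              random_state=0).fit_transform(X)
print(coords.shape)  # (12, 2)
```

The resulting 2-D coordinates can then be scatter-plotted with one color per variant label, which is how plots such as those in Figure 6 are typically produced; any of the other embedding methods can be substituted for `one_hot_embed` without changing the t-SNE step.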

Conclusions
In summary, the COVID-19 pandemic has emphasized the importance of transitioning from second-generation to third-generation sequencing technology. Long-read sequencing has emerged as a critical tool for unraveling various genomic features of the SARS-CoV-2 virus. With the ability to read longer DNA fragments, ranging from 5000 to 30,000 base pairs, long-read sequencing addresses a major challenge faced by short-read sequencing methods. This extended read length has enabled researchers to detect complex structural variations, including large insertions/deletions, inversions, repeats, duplications, and translocations. Additionally, long-read sequencing has facilitated the phasing of SNPs into haplotypes and facilitated de novo genome assembly. However, it is important to acknowledge that the high error rate associated with long-read sequencing may impact the interpretation of SARS-CoV-2's biology.
In this study, we have demonstrated that the accuracy of machine learning classification algorithms in analyzing SARS-CoV-2 genome sequences depends strongly on the selection of an appropriate embedding method. Our analysis of simulated SARS-CoV-2 viral sequences underscores the value of robust embedding techniques capable of tolerating errors and accurately categorizing genome sequences under both long-read sequencer-specific and random error types. Specifically, we have identified the spaced k-mers and weighted k-mers embedding methods as superior for classifying both error-free and error-incorporated sequences. These findings highlight the potential of machine learning in analyzing SARS-CoV-2 genomic data, contributing to a deeper understanding of the virus's evolution and spread.
In the future, we plan to explore additional sequence embedding and advanced deep learning methods on SARS-CoV-2 genomic sequences generated at different long-read sequencing depths with third-generation sequencer-specific errors. These experiments will help us develop robust models and improve our ability to adapt long-read sequencing technologies (PacBio and ONT) to produce error-free SARS-CoV-2 genome sequences for answering critical biological questions.

Data Availability Statement:
The data utilized in this research study were acquired from the publicly accessible database known as the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org/ (accessed on 19 May 2023)) for the period spanning September to December 2021. To facilitate replication of the findings, the source codes and pipelines employed in the analysis can be accessed at https://github.com/sarwanpasha/Long_Read_Noisy_Sequences (accessed on 19 May 2023).

Conflicts of Interest:
The authors declare no conflict of interest.