Exploration of hybrid deep learning algorithms for COVID-19 mRNA vaccine degradation prediction system

The coronavirus has caused a global pandemic that has adversely affected public health, the economy, and nearly every aspect of life. Numerous measures have been taken to manage the spread, and vaccination is among the key precautionary steps. Among all vaccines, messenger ribonucleic acid (mRNA) vaccines provide notable effectiveness with minimal side effects. However, mRNA degrades easily, which limits its application. Given the importance of predicting the degradation rate of mRNA vaccines, this prediction study is proposed. In addition, this study compares the hybridizing sequence of the hybrid model to identify its influence on prediction performance. Five models are created for exploration and prediction on the COVID-19 mRNA vaccine dataset provided by Stanford University and made accessible on the Kaggle community platform, employing two deep learning algorithms: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The Mean Columnwise Root Mean Square Error (MCRMSE) was used to assess each model's performance. The results demonstrate that both GRU and LSTM are suitable for predicting the degradation rate of COVID-19 mRNA vaccines, and that hybridization can improve performance. Among Hybrid_1, Hybrid_2, and Hybrid_3, when trained with Set_1 augmented data, Hybrid_3 achieved the lowest training error (0.1257) and validation error (0.1324); the same held when training with Set_2 augmented data, where Hybrid_3 scored 0.0164 and 0.0175 MCRMSE for training and validation error, respectively. The variation in the results obtained by the hybrid models indicates that the hybridizing sequence of algorithms should be taken into account in hybrid modeling.


Introduction
SARS-CoV-2 has caused millions of deaths since the end of 2019, and treatments for it are yet to be invented [1]. Therefore, effective vaccines are needed to control the outbreak. Unfortunately, even though mRNA vaccines have shown promising effects, they have the drawback of rapid degradation. Wadhwa et al. [2] showed that degradation significantly reduces mRNA yields during in-vitro transcription. The half-life of mRNA vaccines might be 900 days under cold chain conditions, with a degradation rate of more than 2% every 30 days [2]. Moreover, the half-life can be drastically shortened to 5 and 10 days, respectively, by a temperature excursion to about 37°C or by a 2-unit drift with respect to the pKa value [2]. In-vitro transcription of the vaccines can be conducted at a temperature of 37°C with magnesium ions (Mg2+), which reduces the half-life to no longer than 2 hours [2]. Even when the temperature, Mg2+ concentration, or pH value is lowered to reduce hydrolysis, the outcome is similar: the mRNA vaccine remains unstable because degradation still occurs throughout the transcription process [2]. Abbasi [3] claimed that this restriction might be circumvented with a second and perhaps a booster dose of the vaccine. However, the degradation concern must not be disregarded, since a vaccine's effectiveness cannot be restored once it has been compromised. Stabilizing mRNA-based vaccines has always been a great challenge, and the search for an optimal solution has troubled vaccine scientists and researchers for decades. Moreover, vaccines that become unstable have led to countless losses of lives [4], especially during a pandemic. This study is therefore important for addressing safety concerns and ensuring that the potency of a vaccine is not adversely affected.
Vaccine functionality and characteristics are easily affected by even minor degradation [5], and the degradation rate is easily altered by both intrinsic and extrinsic factors.
It is important to research the degradation of the mRNA vaccine. However, few studies have been performed on predicting mRNA or vaccine degradation, especially concerning COVID-19 mRNA vaccines. By the end of 2020, Singhal had studied COVID-19 mRNA vaccine degradation prediction using Graph Convolution Network (GCN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM) algorithms. Assessed with root mean square error (RMSE), the GCN-based model (0.249) was the best for reactivity prediction, while the GRU-based model, with an accuracy of 76%, was the best predictor when all target variables were considered [6]. Imran et al. used a regularized LSTM model to forecast the degradation rate of the mRNA vaccine, and it performed better than tree-based algorithms. Different activation functions, including linear, hyperbolic tangent (Tanh), and rectified linear unit (ReLU), were considered for each layer during model development to converge the Mean Columnwise Root Mean Square Error (MCRMSE) losses [7]. GRU-related models were also proposed [8] [9] [10]. Wang et al. developed a modified GRU with a multi-head attention mechanism, using three GCNs to handle three adjacency matrices: the base-pairing probability (BPP), structure adjacency, and distance matrices. The model performance was measured with MCRMSE, achieving a passable score of 0.3489 [8].
Muneer et al. and Qaid et al. used hybrid models to predict the degradation rate [9] [10]. Working in tandem with a Convolutional Neural Network (CNN), the authors of [9] proposed GCN_GRU and GCN_CNN models. Between these models, the GCN_GRU pre-trained embedding model showed the best performance, with an Area Under the Curve (AUC) score of 0.938, indicating that it is better suited than the CNN-based model to base-wise reactivity prediction. The three models proposed by Qaid et al., on the other hand, are LSTM, GRU, and a hybrid LSTM_GRU. Unlike other research, the authors suggested two different encoding methods: the base (0-13) encoding method and the codon (1-434) encoding method. Their results showed that the LSTM trained with the codon encoding method is the best among the proposed models [10]. However, the authors suggested that the base encoding method is preferable to the codon encoding method because it is less prone to overfitting.
Beyond deep learning algorithms, Ing et al. proposed three machine learning algorithms (Random Forest, Light Gradient Boosting Machine, and Linear Regression) for this prediction task [11] [12], showing that both machine learning and deep learning are applicable to this study. From the past studies reviewed, GRU and LSTM are the two most widely used algorithms in this field of research. The review also suggests that hybridizing algorithms to form a hybrid prediction model helps reduce the error. However, previous researchers did not report the results of hybrid models under different hybridization sequences. Hence, this paper presents the prediction results of hybrid models with different hybridizing sequences of the GRU and LSTM algorithms, utilizing the concept proposed by Qaid et al. in [10].

Method
Besides developing reliable models that can forecast the degradation rate of the COVID-19 mRNA vaccine, this paper also focuses on the relationship between the predicted results and the hybridizing sequence of algorithms in hybrid modeling. To ensure comparability, this research utilized the same datasets and followed the same concept and theory as Qaid et al., except that the codon encoding method was excluded. Since the main objective is to develop degradation prediction models with high accuracy and low error rates, only the training and BPPs datasets were extracted from the Kaggle community and Eterna platforms. These datasets, supported by Stanford University [13], allow a supervised rather than a semi-supervised study to be performed. Several features in the form of aggregate functions (Exponential Weighted Average, Maximum, Normalize, Average Position Value, and Summation) were engineered from the BPPs dataset to serve as the numerical features. The pre-processing step removes noise and organizes the data for training and evaluation. Completing the pre-processing stage yields well-encoded, clean data that is ready to be fed into the models, after which five models are trained on the training dataset.
Model performance was evaluated on the validation set with MCRMSE. Fig. 1 illustrates the method workflow.

Dataset
The empirical data for this research, obtained from Kaggle and sourced by Stanford University [14], give rise to a regression study in which the degradation rate of the mRNA vaccine is predicted. The training dataset contributes 2400 samples; combining it with the augmented datasets yields two sets of data, one with 4800 samples and another with 21600 samples.

Augmented Dataset
Deep learning algorithms have an innate tendency to overfit the data [15] [16] [17]. To avoid overfitting, analysts suggest increasing the number of training samples to ensure diversity. Still, data collection requires considerable procedure and resources and is undeniably time-consuming. Therefore, most practitioners prefer to augment existing data by modifying the samples in order to circumvent overfitting [18] [19]. This research utilized two different sets of augmented data. The first set (Set_1) is the augmented data generated with the ARNIE package and offered by Qaid et al. [10], used here to ensure full comparability. The second set (Set_2) is a public dataset retrieved from [20].

BPPs Dataset
The Kaggle platform provides a BPPs dataset comprising 6034 NumPy files, each containing a symmetric square BPP matrix. The 6034 files correspond to the 2400 samples of the training dataset plus the 3634 samples of the testing dataset. However, this research focused on supervised learning and utilized only the training samples. The BPPs dataset is engineered to generate useful aggregate-function features, also known as the numerical features, following Qaid et al.

Data Pre-processing
Erroneous, low-quality, or irrelevant data will degrade prediction accuracy and quality [21] [22] [23] [24]. Data pre-processing is therefore a crucial step: cleaning and simplifying noisy, raw data eases data handling and minimizes the loss of data quality.

Data Cleaning
The first phase of exploratory data analysis, performed with the '.isnull' command, found no missing values in the extracted dataset; however, the 'signal_to_noise' field revealed that the dataset contained noisy samples. Therefore, the dataset was filtered with the SN_filter criteria stated in Table 1 so that only clean samples are preserved. After filtering, a total of 304 noisy samples were removed from the training dataset.
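The filtering step can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes each sample carries a binary `SN_filter` flag that equals 1 when the criteria of Table 1 are met, and the three records shown are hypothetical placeholders.

```python
def keep_clean_samples(samples):
    """Return only the samples whose SN_filter flag equals 1.

    Assumption: SN_filter is a 0/1 flag summarising the Table 1 criteria.
    """
    return [s for s in samples if s.get("SN_filter") == 1]


# Hypothetical records, for illustration only.
samples = [
    {"id": "a", "signal_to_noise": 6.9, "SN_filter": 1},
    {"id": "b", "signal_to_noise": 0.2, "SN_filter": 0},
    {"id": "c", "signal_to_noise": 4.1, "SN_filter": 1},
]

clean = keep_clean_samples(samples)
removed = len(samples) - len(clean)  # number of noisy samples dropped
```

Applied to the 2400 training samples, a filter of this kind would leave the 2096 samples that remain after the 304 noisy ones are removed.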

Label Encoding
Data can come in multiple types, i.e., categorical, ordinal, and numerical, but it is recommended to convert them into numerical form because some algorithms cannot handle non-numerical data [25]. Label encoding is used to encode the three non-numerical inputs: 'predicted_loop_type', 'structure', and 'sequence'. Here, the characters are base encoded as depicted in Table 2.
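A base-level label encoding of this kind might look like the sketch below. The 14-character vocabulary and its ordering are assumptions chosen to give codes in the range 0-13; the actual assignment used in the study is the one in Table 2, which may differ.

```python
# Assumed vocabulary: structure symbols, nucleotides, and loop-type letters.
# The ordering is illustrative; Table 2 defines the real mapping.
VOCAB = "().ACGUBEHIMSX"
TOKEN_TO_INT = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text):
    """Map every character of a string to its integer code (0-13)."""
    return [TOKEN_TO_INT[ch] for ch in text]


def encode_sample(sequence, structure, predicted_loop_type):
    """Encode the three inputs and stack them per base position."""
    if not (len(sequence) == len(structure) == len(predicted_loop_type)):
        raise ValueError("all three inputs must have the same length")
    columns = zip(encode(sequence), encode(structure), encode(predicted_loop_type))
    return [list(col) for col in columns]
```

For instance, `encode_sample("GA", "(.", "SE")` yields one row of three integer codes per base, which is the shape a recurrent layer expects after embedding.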

Feature Engineering
The quality of the inputs, also known as features, determines the capability of a model. High-quality inputs can greatly reduce processing time and storage space. To turn the raw BPPs matrices into a more suitable form of input, the dataset is feature engineered into five aggregate-function inputs, that is, the numerical features introduced in [12]. However, since this study used only the training dataset, in which all samples hold an equal number of bases per sequence (referred to as 'seq_length'), all the numerical features were used as inputs.
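As a rough illustration of aggregate-function features, the sketch below computes two of the five (Summation and Maximum) for each base of a BPP matrix. It assumes the aggregates are taken row-wise over the matrix; the exact definitions of all five features follow the cited work and are not reproduced here, and the 3x3 matrix is a toy example.

```python
def bpp_sum(bpp):
    """Row-wise sum: total pairing probability of each base (assumed definition)."""
    return [sum(row) for row in bpp]


def bpp_max(bpp):
    """Row-wise maximum: strongest pairing probability of each base (assumed definition)."""
    return [max(row) for row in bpp]


# Toy symmetric matrix for a hypothetical 3-base sequence.
bpp = [
    [0.0, 0.25, 0.5],
    [0.25, 0.0, 0.125],
    [0.5, 0.125, 0.0],
]

features = list(zip(bpp_sum(bpp), bpp_max(bpp)))  # one (sum, max) pair per base
```

Each base thus receives a small vector of numbers that can be concatenated with its encoded characters before being fed to the model.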

Deep Learning
GRU and LSTM are the two deep learning algorithms implemented for this degradation rate prediction study. The five models developed with these two algorithms are evaluated with the MCRMSE performance metric.

Gated Recurrent Unit (GRU)
GRU can be regarded as a variant of LSTM [26], which is a type of recurrent neural network (RNN). Although GRU is simpler and more compact than LSTM, it retains the ability to capture context while reducing the training time [27] [28]. Based on the research conducted in [6] and [8] [9] [10] on predicting the COVID-19 mRNA vaccine degradation rate, GRU is indeed an applicable algorithm for this bioinformatics-related, artificial intelligence-based research.

Long Short-Term Memory (LSTM)
Compared to GRU, which has only two gates (update gate and reset gate) to modulate information flow, LSTM has more gates (input gate, forget gate, and output gate) for filtering information [29] [30] [31], giving LSTM higher complexity but better accuracy than GRU. When accuracy is the priority and a large dataset is used, researchers have tended to choose LSTM over GRU. The results presented by Qaid et al. in [10] support this argument.
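The difference in complexity can be made concrete by counting trainable parameters. In the standard formulations, a GRU layer has three weight blocks (two gates plus the candidate state) and an LSTM layer has four (three gates plus the cell candidate). The sketch below assumes one bias vector per block; some frameworks use two, which changes the totals but not the 3:4 ratio. The input size of 128 is a hypothetical value.

```python
def recurrent_params(input_size, hidden_size, n_blocks):
    """Parameters of one unidirectional recurrent layer.

    Each block holds input weights, recurrent weights, and one bias vector
    (assumption: a single bias per block).
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block


def gru_params(input_size, hidden_size):
    return recurrent_params(input_size, hidden_size, n_blocks=3)


def lstm_params(input_size, hidden_size):
    return recurrent_params(input_size, hidden_size, n_blocks=4)


# 256 hidden units per direction, as in the models of this study;
# input size 128 is a hypothetical choice for illustration.
gru_total = gru_params(128, 256)
lstm_total = lstm_params(128, 256)
```

Under these assumptions an LSTM layer carries one third more parameters than a GRU layer of the same size, which is one way to read the "higher complexity" noted above.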

MCRMSE Performance Metric
The performance and effectiveness of the proposed models are evaluated with a performance metric. This is a regression study that aims to forecast the degradation rate of the COVID-19 mRNA vaccine. Therefore, the regression error is analyzed to study the models' prediction performance, and MCRMSE, which stands for Mean Columnwise Root Mean Square Error, is adopted. The regression metric known as RMSE takes the square root of the mean of the squared differences between the predictions and the ground truth to determine the average magnitude of the errors [32] [33]. The RMSE formula is given in (1), in which n denotes the number of occurrences, y_i the ground truth, and \hat{y}_i the prediction.

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1} \]

Meanwhile, MCRMSE is the average of the RMSE values over all predicted targets, yielding a single-number evaluation metric from multiple outputs. The formula for MCRMSE is presented in (2), where N_t is the number of targets scored and the index j runs over the targets.

\[ \mathrm{MCRMSE} = \frac{1}{N_t}\sum_{j=1}^{N_t}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{ij} - \hat{y}_{ij}\right)^2} \tag{2} \]

Equations (1) and (2) show that both RMSE and MCRMSE are negatively oriented scores, meaning that lower values are better. Because the differences are squared, the error ranges from zero to positive infinity.
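The metric described above can be written in a few lines. The sketch below is a plain-Python illustration, assuming the targets are supplied as a list of columns (one list of values per predicted target); it is not taken from the study's code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error over n paired observations."""
    n = len(y_true)
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / n)


def mcrmse(y_true_cols, y_pred_cols):
    """Mean of the per-target RMSE values.

    Each argument is a list of columns, one column per predicted target.
    """
    scores = [rmse(t, p) for t, p in zip(y_true_cols, y_pred_cols)]
    return sum(scores) / len(scores)


# Two targets with per-target RMSE of 1.0 and 3.0 give an MCRMSE of 2.0.
example = mcrmse([[0, 0], [0, 0]], [[1, 1], [3, 3]])
```

Because each target's RMSE is computed before averaging, a target with large errors cannot be hidden by a larger number of observations in another target.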

Results and Discussion
Besides developing models for predicting the degradation rate of COVID-19 mRNA vaccines, this paper concentrates on the effect of the hybridizing sequence on the prediction results. MCRMSE is used as the performance metric to evaluate the models' prediction performance across the five outputs.

Hybridization
Hybridization differs from ensemble approaches, in which algorithms operate independently to produce several outcomes that are then combined through a polling scheme such as max voting, weighting, or averaging to determine a single final result. In hybridization, by contrast, the algorithms operate dependently to produce a single result, with no polling system involved [34].
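The distinction can be illustrated with two toy functions standing in for trained models. Both stand-ins and the helper names are hypothetical; the point is only that an averaging ensemble combines independent outputs, whereas a hybrid chains the stages so that each one consumes the output of the previous one.

```python
def model_a(x):
    """Hypothetical stand-in for one algorithm."""
    return 2.0 * x


def model_b(x):
    """Hypothetical stand-in for another algorithm."""
    return x + 1.0


def ensemble_average(x, models):
    """Each model runs independently; outputs are pooled by averaging."""
    outputs = [m(x) for m in models]
    return sum(outputs) / len(outputs)


def hybrid_chain(x, stages):
    """Each stage depends on the previous one; no pooling step."""
    for stage in stages:
        x = stage(x)
    return x


pooled = ensemble_average(3.0, [model_a, model_b])   # order is irrelevant
chained_ab = hybrid_chain(3.0, [model_a, model_b])   # a first, then b
chained_ba = hybrid_chain(3.0, [model_b, model_a])   # b first, then a
```

In the chained form the order of the stages changes the result, which is why the hybridizing sequence is worth examining.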
This research followed the approach suggested in [10], hybridizing GRU and LSTM in model development.
Taking the three models proposed by Qaid et al. together with two additional hybrid models suggested in this research for exploring the hybridization sequence, a total of five models are used in this study. All the models consist of three bidirectional layers, with each direction having 256 hidden units. The first and second models, i.e., the GRU and LSTM models, have all three layers built from GRU and LSTM, respectively. The remaining three models are made up of two GRU layers and one LSTM layer. The hybrid models are named according to the layer occupied by the LSTM. For example, if the LSTM occupies the first layer and GRU occupies the remaining two layers, the model is named Hybrid_1. The hybridizing sequence of each model is detailed in Table 3.
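The naming rule can be expressed as a small helper that returns the layer types of each of the five models. The function name is hypothetical, and the sequences for Hybrid_2 and Hybrid_3 are inferred from the stated rule; Table 3 remains the authoritative description.

```python
def layer_sequence(model_name, n_layers=3):
    """Return the recurrent layer types, first layer to last, for a model name.

    'GRU' and 'LSTM' use a single cell type throughout; 'Hybrid_k' places
    the LSTM at layer k and GRU elsewhere (inferred from the naming rule).
    """
    if model_name in ("GRU", "LSTM"):
        return [model_name] * n_layers
    prefix = "Hybrid_"
    if model_name.startswith(prefix):
        position = int(model_name[len(prefix):])
        if not 1 <= position <= n_layers:
            raise ValueError("LSTM position must lie within the layer stack")
        layers = ["GRU"] * n_layers
        layers[position - 1] = "LSTM"
        return layers
    raise ValueError(f"unknown model name: {model_name}")


all_models = {
    name: layer_sequence(name)
    for name in ("GRU", "LSTM", "Hybrid_1", "Hybrid_2", "Hybrid_3")
}
```

A list such as `["GRU", "GRU", "LSTM"]` can then drive the construction of the bidirectional layers in whichever deep learning framework is used.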

Prediction Performance
The filtered dataset is split into a training set and a validation set with a 90:10 ratio to ensure comparability with Qaid et al.'s research. When the augmented data is not included, the split produces 1886 training samples and 210 validation samples. When Set_1 augmented data is included, 608 noisy samples are removed from the 4800 samples, and the remaining 4192 samples are split into 3772 training samples and 420 validation samples. When Set_2 is involved, 3076 noisy samples are removed from the 21600 samples, and the remaining 18524 samples are divided into 16671 training samples and 1853 validation samples for model development. Several parameters and hyperparameters are initialized as tabulated in Table 4 to configure the models. Table 5 shows the deep learning models' prediction performance evaluated with MCRMSE. Although the overall results in Table 5 are slightly better than those presented in [10], the difference of about ±0.005 is too small to be meaningful relative to the loss values. Nevertheless, this study addresses the effect of the hybridizing sequence on prediction performance with the mRNA vaccine degradation dataset, while also probing the contribution of the numerical BPPs inputs to the prediction.
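The sample counts above can be checked with a short calculation. The sketch assumes that the validation share is rounded up to the next whole sample; that rounding rule is an assumption, although it reproduces the three reported splits.

```python
def split_90_10(n_samples):
    """Return (n_train, n_val) for a 90:10 split.

    Assumption: the validation size is the ceiling of n_samples / 10.
    """
    n_val = -(-n_samples // 10)  # integer ceiling division
    return n_samples - n_val, n_val


# Samples remaining after the noisy ones are removed.
no_augmentation = split_90_10(2400 - 304)    # 2096 samples
with_set_1 = split_90_10(4800 - 608)         # 4192 samples
with_set_2 = split_90_10(21600 - 3076)       # 18524 samples
```

Rounding the validation share down instead would give 419 rather than 420 validation samples for Set_1, so the reported figures are consistent with rounding up.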
From Table 5, with the dropout value set to 0.5 and Set_1 augmented data included, the LSTM model scored better than the other four models regardless of whether the BPPs numerical inputs were present. However, when the Set_1 augmented data was not used, the hybrid models showed lower error rates than the LSTM model, although they suffered from overfitting. With dropout set to 0.5 and both Set_1 augmented data and numerical BPPs inputs absent, the MCRMSE loss of the LSTM model on the training data (0.1378) was much lower than that of the other four models, yet its validation loss was the highest. In short, under these conditions the LSTM model outshines the other four models on familiar samples but not on unseen samples. These results indicate that even if the overfitting issue is resolved, the LSTM model may not be the ideal model.
It is worth noting that when no augmented data is involved, the hybrid models, although overfitted, show lower loss errors than the GRU model (consisting of three bidirectional GRU layers) and the LSTM model (consisting of three bidirectional LSTM layers). This result shows that hybridization is indeed a practicable way to improve model performance. It also suggests that the claim that LSTM achieves better performance than GRU applies only when a large dataset is involved.
This research involved two sets of augmented data to study the hybridization sequence of models for predicting the degradation rate of the mRNA vaccine. Besides Set_1, the prediction errors of the models trained with Set_2 augmented data are also given in Table 5. When the dropout value is set to 0.5, all models suffer from underfitting when trained with the Set_2 augmented data without the BPPs numerical inputs. The results also show that, when models are trained with Set_2 augmented data at this dropout value, involving the numerical inputs is no better than excluding them: according to the prediction errors in Table 5, apart from the GRU and Hybrid_3 models, the remaining three models are adversely affected.
With a dropout value of 0.5, neither overfitting nor underfitting arises when Set_1 augmented data is used, but virtually all models underfit when trained with Set_2. Therefore, to allow the research to proceed, the dropout value was set to zero and all five models were trained again with Set_2 augmented data. Dropout is a class of stochastic techniques originally introduced by Hinton et al. [35] and used in practice for regularisation, model compression, handling overfitting, and more [36] [37] [38]. Tuning the dropout value thus proved able to address underfitting as well as overfitting.
Dropout operates on the neurons of a neural network [39] [40], and neurons can be described as weight-linked processors [41] [42]. Besides inputs and outputs, weights and activation functions are the two main components of a neuron [43] [44], but the number of neurons is arbitrary: there are no specific rules for determining in advance the number of neurons in each layer of a model. The number of neurons determines the degree of complexity of a model [45]. Although overfitting can be mitigated by dropping some neurons [46], dropping too many neurons induces underfitting, as in the results for training with Set_2 augmented data at a dropout value of 0.5 shown in Table 5.
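To make the role of the dropout value concrete, the sketch below applies dropout to a list of activations at training time. It uses the common "inverted dropout" formulation, in which surviving activations are rescaled by 1 / (1 - rate); this is an illustrative assumption and may differ in detail from the framework implementation used in the study.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest.

    Inverted-dropout sketch for training time; at inference the
    activations would be passed through unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if rate == 0.0:
        return list(activations)  # nothing is dropped
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * scale for a in activations]


rng = random.Random(0)                     # seeded for repeatability
kept_all = dropout([1.0, 2.0, 3.0], 0.0, rng)
half_dropped = dropout([1.0] * 1000, 0.5, rng)
```

With a rate of 0.5 roughly half of the neurons are silenced in every training step, which limits the effective capacity of the model; with a rate of zero the full capacity is available, which is the setting that removed the underfitting observed with Set_2.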
After the dropout value is set to zero, the prediction errors show that all models perform better with the BPPs inputs when Set_2 augmented data is involved. Again, the LSTM model surpasses the other models by achieving the lowest error rate. Among Hybrid_1, Hybrid_2, and Hybrid_3, Hybrid_3 has the lowest errors and ranks second, behind the LSTM model. The difference in the prediction errors of these three hybrid models once again underlines the importance of the hybridizing sequence.
Moreover, as observed from the loss errors in Table 5, the numerical inputs engineered from the BPPs dataset alone do not solve overfitting or underfitting problems. With augmented data, the numerical features improve performance only marginally, reducing the errors by at most 0.002 with Set_1 augmented data and 0.004 with Set_2 augmented data. Focusing on Set_1 augmented data, LSTM surprisingly fits the dataset better without the numerical inputs; although the effect is tiny, the numerical inputs bring no benefit to the LSTM model. Even with Set_2 augmented data, although the BPPs inputs help reduce the error, the improvement is merely about ±0.002 compared with omitting them. Given these results, the use of numerical features whose contribution (±0.002 or ±0.004) is too small to be discerned relative to the loss errors should be reconsidered when computational time and complexity are taken into account.
Among the hybrid models, when augmented data is involved, whether Set_1 or Set_2, the Hybrid_3 model performs better than Hybrid_2, followed by Hybrid_1. When the BPPs numerical inputs are present but augmented data is absent, Hybrid_2 scores better than Hybrid_1 and Hybrid_3; however, Hybrid_3 has a lower prediction error when both augmented data and BPPs numerical inputs are absent. These results indicate that both the training factors and the sequence of hybridizing algorithms in model formation influence prediction performance.

Conclusion
From the results obtained, it may be established that both GRU and LSTM are applicable to predicting the degradation rate of this mRNA vaccine. When data augmentation is not applied, overfitting is more severe in the LSTM model than in all the other developed models. However, when the sample size is doubled or more, the LSTM model outperforms the other models, suggesting that LSTM is better suited to large datasets. Furthermore, it is theorized that a good result can only be achieved if the complexity of the model matches the dataset.
Better pattern recognition and easier model fitting can be achieved with good features and inputs. Still, it is essential to be prudent when adding engineered features. If the features show no promising benefit to prediction performance, it is recommended to exclude them as inputs for model training, since they increase the complexity of the models and lengthen the computation time for little gain. The results in Table 5 support the argument that hybridization is a good approach for performance improvement. Furthermore, the difference in the results of Hybrid_1, Hybrid_2, and Hybrid_3 supports the claim that the prediction performance of a hybrid model depends not only on the factors in the training stage but also on the sequence of the algorithms being hybridized. Therefore, experimentation, along with trial and error, is required to examine the sequencing effect of the algorithms on performance under hybridization. The results also indicate that doubling the number of original samples resolved the overfitting problem, suggesting that increasing the sample size could further improve prediction performance by reducing the loss errors. However, further increasing the sample size could burden a model, and underfitting will be induced if the model cannot accommodate the complexity of the data. Therefore, increasing the number of samples is recommended for future research, provided that the compatibility between model and data is not overlooked, so that both underfitting and overfitting are avoided. Moreover, this study only compared the hybrid models suggested in [10], whereas the tabulated results indicate that LSTM, with its higher complexity, surpasses GRU in accuracy.
Hence, we suggest replacing one of the bidirectional GRU layers with a bidirectional LSTM layer, along with hyperparameter tuning, to improve the hybrid models proposed by [10]. Furthermore, for future work, it is proposed to hybridize other machine learning models with deep learning models to lessen the complexity of a hybrid model while raising the prediction performance.