Developing Novel Robust Loss Functions-Based Classification Layers for DLLSTM Neural Networks

In this paper, we suggest improving the performance of developed activation function-based Deep Learning Long Short-Term Memory (DLLSTM) structures by employing robust loss functions such as the Mean Absolute Error (MAE) and the Sum Squared Error (SSE) to create new classification layers. The classification layer is the last layer in any DLLSTM neural network structure and is where the loss function resides. The LSTM is an improved recurrent neural network that alleviates the vanishing gradient problem, among other issues. Fast convergence and optimum performance depend on the loss function. Three loss functions (the default (Crossentropyex), (MAE), and (SSE)), which compute the error between the actual and desired output, were used in two distinct applications to examine the effectiveness of the suggested DLLSTM classifiers. The results show that one specific loss function among the suggested classifiers, (SSE), outperforms the other loss functions and performs remarkably well. The suggested functions Softsign, Modified-Elliott, Root-sig, Bi-tanh1, Bi-tanh2, Sech, and Wave are more accurate than the tanh function.


I. INTRODUCTION
In recent years, the machine learning (ML) community has come to regard the deep learning (DL) computational paradigm as the gold standard. It has also steadily become the most popular computational strategy in the field of machine learning. This is because it performs several difficult cognitive tasks as well as or better than humans [1]. A Deep Neural Network (DNN) is a particular type of neural network represented as a multilayer perceptron (MLP), which is trained using algorithms to learn representations from data sets without the need for manually designed feature extractors [2].
As the name DL suggests, it has more, or deeper, levels of processing than a shallow learning model, which has fewer layers of units [3]. With a deeper understanding and use of the backpropagation algorithm, self-directed learning was made possible [4]. Deep learning neural networks (DLNNs) are used in a variety of industries for three reasons [5]. First, since DLNN-based classifiers are data-based, they are more resistant to imperfections in real systems. Second, DLNN-based classifiers have minimal computational complexity, requiring only a few simple matrix and vector operations at various levels. Third, with the fast development of parallel processing capability in specialized processors such as graphics processing units (GPUs) [6], DLNN-based techniques are significantly more efficient, because DLNN implementations are simple to parallelize on parallel architectures and easy to implement with low data-type precision. These benefits helped DLNNs take their present form and perform well in many fields [7].
RNNs are a widely used and well-known algorithm in the field of DL. RNNs are primarily utilized in contexts related to speech processing and Natural Language Processing (NLP) [8]. Unlike traditional networks, an RNN processes sequential data. Since the inherent structure of the data sequence provides essential information that several applications depend on, it is important to know the context of a sentence in order to determine the meaning of a specific word. The RNN can therefore be thought of as a short-term memory unit [9].
The Long Short-Term Memory (LSTM) structure of Hochreiter and Schmidhuber has been proven to be efficient for a variety of learning problems, especially those requiring large data sets. The LSTM structure is composed of ''blocks,'' which are collections of units that are recurrently connected to one another. LSTM techniques, structures, and transfer functions have been developed to deal with the issue of vanishing or exploding gradients; these make the network more accurate as it trains deeper [10]. One of the strengths of DL is its flexibility in architectural design, which means there are many ways to encode priors over data into the model and to find the best activation functions, learning algorithms, or loss functions [11].
Frank et al. [12] reviewed some of the recent developments and efforts that used ML in science and engineering. They believed that despite its enormous success over recent years, ML is still in its infancy and will play a significant role in scientific research and engineering over the upcoming years. Poulinakis et al. [13] presented a work that shed light on the limitations of various ML and cubic spline methods when data is sparse and noisy. As a result, they discovered the true function hidden under the noise, thus making ML a valuable tool in practical applications. Additionally, they focused on hypothetical generalized functions with and without noise. The conclusions from this study are beneficial in guiding further research regarding the splines and ML modeling.
A critical component of training a deep learning model is the loss function, one of the hyperparameters that can be adjusted. Loss functions are employed to determine the difference between the actual and desired outputs in terms of accuracy and loss. Based on the loss, the DL network updates the weights of the connections between the neurons of the classifier [14]. The performance of the resulting DL model can be affected by the choice of loss function. In fact, recent works on customized loss functions exhibit strong performance on the selected datasets. It should be noted, nonetheless, that a loss function that performs well in one context need not perform similarly in another. In other words, the number of training traces, the model architecture, and the weight initialization are just a few of the many factors that affect both the achievable performance and the choice of loss function [15].
Farzad et al. [16] compared 23 different activation functions, in which the three gates (the input, output, and forget gates) changed activation functions while the block input and block output activation functions were held constant at the hyperbolic tangent (tanh) so that the block activation functions could be compared. The authors recommended altering the hyperbolic tangent function on the block input and block output as a better alternative to altering the activation functions in the three gates. In addition, they suggested that additional research be done on other components of an LSTM network. Ali et al. [17] presented qualitative research to improve the performance of LSTM-based classifiers by developing the internal structure of LSTM neural networks using 26 state activation functions as alternatives to the traditional hyperbolic tangent (tanh) activation function, while only using the default loss function (Crossentropyex).
In this paper, we expand on our preceding research work [17]. In [17], as an alternative to the conventional (tanh) activation function, we created a conceptual framework for new LSTM-based classifiers that exploit the internal organization of LSTM networks. The findings demonstrated that the suggested LSTM classifiers outperformed the conventional (tanh)-based LSTM classifiers. In this paper, we present qualitative research to improve the performance of the previously developed activation function-based LSTM structures by using robust loss functions such as (MAE) and (SSE) to build new classification layers. The classification layer, the last layer in any LSTM neural network structure, is where the loss function resides, and we use the best suggested DLLSTM-based classifiers from the preceding research work [17].
More precisely, we systematically compare the commonly used loss function (Crossentropyex) and the proposed loss functions (MAE) and (SSE) in the DLLSTM-based classifiers. We evaluate the classification performance and the number of trainable parameters. The different loss functions are evaluated on two available datasets. The proposed DLLSTM classifiers are trained using the adaptive moment estimation (adam) optimizer and different loss functions to obtain the most reliable and accurate performance under the conditions of the classifiers. To the best of our knowledge, this is the first time the DLLSTM neural network has been used to build classifiers with different loss functions. These loss functions are critical because they determine how much the accuracy is altered. We discuss the limitations of the loss functions already used and propose different combinations of loss functions for better accuracy.
The contributions of this study are as follows.
1) We started by compiling a multitude of functions that can be utilized in DLLSTM networks in place of the conventional (tanh) function.
2) Developing robust loss function ((MAE) and (SSE))-based classification layers for the DLLSTM classifiers and comparing their performance with the default (Crossentropyex)-based classification layer.
3) Developing (hard-sigmoid) gate function-based DLLSTM classifiers and comparing their performance with the commonly used (sigmoid) gate function-based DLLSTM classifiers in the presence of the suggested state functions and the (adam) optimizer.
4) Employing the recently developed DLLSTM networks to solve a wide range of real-world classification tasks, including vowel classification and text classification.
This paper is organized as follows. The DLLSTM structure and activation functions are provided in Section II. Section III describes the loss functions and methodology. Section IV presents the simulation results of the proposed approach. The conclusion is presented in Section V.

II. DLLSTM STRUCTURE AND ACTIVATION FUNCTIONS (AF)
The parts that follow provide a quick explanation of the DLLSTM structure and the activation functions used in the network.

A. DLLSTM STRUCTURE
The LSTM network is a recurrent neural network able to capture long-term dependencies between the time steps of sequence data [18]. Numerous LSTM-based methods have been developed for tasks including handwriting recognition [19], audio recognition [9], and online translation, using tools like Google neural machine translation [20] and the Facebook translation system [21]. The simplest DLLSTM, with a single hidden layer, batch normalization, and output units, is used to classify the data. The DLLSTM structure, which consists of the input, hidden, and output units, is shown in Figure 1. The elements in each cell are identified using (1) through (6). The gate function (sigmoid) and the state function (tanh), both found in the DLLSTM memory cells of Figure 1, are the two most prevalent activation functions for the neurons in the memory blocks [19].
The variables of the DLLSTM memory cell are specified by equations (1) to (6).
Equations (1) to (3) describe the forget, input, and output gates of each DLLSTM cell, where $i_t$ denotes the input gate, $o_t$ the output gate, and $f_t$ the forget gate. The block input $C'_t$ in (4) specifies the amount of new information to be stored in the cell at the current time step, $C_t$ in (5) is the updated cell state at time $t$, and, lastly, $h_t$ in (6) is the block output at the corresponding time step [23].
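Since equations (1) through (6) are not reproduced in this excerpt, the standard LSTM cell formulation that they conventionally denote is sketched below for reference; the per-gate weights $W$, $U$ and biases $b$, and the exact placement of the state function, are assumptions, and the definitions in the original may differ in detail.

```latex
\begin{align}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{1}\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{2}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{3}\\
C'_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \tag{4}\\
C_t &= f_t \odot C_{t-1} + i_t \odot C'_t \tag{5}\\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

Here $\sigma$ denotes the gate activation (sigmoid or hard-sigmoid) and tanh is the state activation that the proposed classifiers replace with the Table 1 functions.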

B. ACTIVATION FUNCTIONS (AF)
An AF is introduced into an ANN to support learning from complicated datasets and to allow non-linearity to be inserted into the network without any additional coding requirement.
In the cell model inspired by the human brain, the AF determines what information should be passed to the following neuron at the end of the process in each cell. The cell collects the output of the preceding cell and transforms it into a format that may be used as input for the following cell [2].
A bad choice of functions can cause the NN's gradients to vanish or explode, as well as cause input data to be lost. The training process, the AF used in NN, and the network structure between cells are the three main factors that affect how well networks operate. The effectiveness of the network is significantly impacted by each of these factors [24]. The relevance of the learning algorithm has dominated research on NNs, whereas the activation functions that are used in these networks have been ignored [25].
In this study, the DLLSTM network is reconstructed by substituting one of the functions listed in Table 1 for each of the (tanh) activation functions found in Equations (4), (5), and (6). Furthermore, we evaluate the effect on network performance of employing 17 different functions in place of the (tanh) functions of a fundamental DLLSTM cell. The tanh formula is given in (7):
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{7}$$
The formula of the sigmoid function is $\sigma(x) = 1/(1 + e^{-x})$ [26].
We have compiled a comprehensive list of 17 functions, as shown in Table 1. We experimentally observed that adding a value of 0.5 to some functions makes them applicable as activation functions in the network. Changing the range of the activation functions has been previously observed in other studies [27].
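As an illustration only, a minimal NumPy sketch of a few of the candidate state activation functions is given below (tanh, Softsign, Sech, and a shifted variant); the exact definitions of the remaining Table 1 functions (e.g., Modified-Elliott, Bi-tanh1/2, Root-sig, wave) follow the table and are not reproduced here.

```python
import numpy as np

def tanh(x):
    # Conventional state activation, Eq. (7).
    return np.tanh(x)

def softsign(x):
    # Softsign: x / (1 + |x|), a polynomially saturating alternative to tanh.
    return x / (1.0 + np.abs(x))

def sech(x):
    # Hyperbolic secant: 2 / (e^x + e^-x).
    return 2.0 / (np.exp(x) + np.exp(-x))

def shifted(f, offset=0.5):
    # Some Table 1 candidates become usable activations after adding 0.5,
    # i.e. shifting their output range (see the remark above).
    return lambda x: f(x) + offset

# Quick check of output values on a sample grid.
if __name__ == "__main__":
    x = np.linspace(-4, 4, 9)
    for name, f in [("tanh", tanh), ("softsign", softsign), ("sech", sech)]:
        print(name, np.round(f(x), 3))
```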

III. LOSS FUNCTIONS AND METHODOLOGY

A. LOSS FUNCTIONS
In this paper, supervised learning is implemented using both LSTMs and RNNs. The loss, determined by a loss function, is the difference between the label predicted by the model and the actual label that corresponds to the input. The output of the loss function is used to adjust the network's weights so that the difference between the predicted and actual labels is reduced [36].
We use three different loss functions in the proposed network, compare how well they work, and examine each one to find out which gives the best results.
1) Crossentropyex [37] is commonly used in machine learning as a loss function and is a measurement of the difference between two probability distributions for a given random variable or set of events. Crossentropyex loss, also known as log loss, measures how well a classification model performs when producing a probability between 0 and 1. (Crossentropyex) loss increases as the predicted probability departs from the true label. Given targets and outputs, the (Crossentropyex) function uses additional parameters and optional performance weights to determine how well the network is doing.
The Crossentropyex loss is given by:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c} X_{ij}\,\ln \hat{X}_{ij} \tag{9}$$
2) Sum Squared Error (SSE) [38] is a measure of accuracy in which the errors are squared and summed. The error of each prediction is determined by subtracting the target value from the network output; each error is then squared, and the squared errors are summed over the dataset. SSE is a network performance function: the sum of squared errors is used to measure performance. The regularization of the errors and the normalization of the outputs and targets are controlled by two optional arguments in the call perf = sse(net, t, y, ew, Name, Value). The Sum Squared Error is given by:
$$L_{SSE} = \sum_{i=1}^{N}\sum_{j=1}^{c}\left(X_{ij}-\hat{X}_{ij}\right)^{2} \tag{10}$$
3) Mean Absolute Error (MAE) [39] is calculated by taking the difference between the model's predictions and the actual data, taking the absolute value of that difference, and averaging the result across the entire dataset. MAE is used to measure network performance and is given by:
$$L_{MAE} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c}\left|X_{ij}-\hat{X}_{ij}\right| \tag{11}$$
where $N$ is the number of observations, $c$ is the number of categories, $X_{ij}$ is the target of sample $i$ for the $j$th of the $c$ categories, and $\hat{X}_{ij}$ is the output of sample $i$ for category $j$ [40]. Loss functions provide more than just a static illustration of how well the model is doing; they also act as the foundation on which the algorithm fits the data. Most machine learning algorithms use some kind of loss function to find or optimize the best parameters (weights) for the data [41]. Importantly, the choice of loss function is directly related to the activation function used in the output layer of the neural network; these two design elements are connected.
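For illustration, a small NumPy sketch of the three losses for one-hot classification targets is given below; the function names and the epsilon clipping are our own, and the built-in implementations used in the experiments may differ in normalization details.

```python
import numpy as np

def crossentropy_loss(T, Y, eps=1e-12):
    """Categorical cross-entropy, Eq. (9): T is the one-hot target matrix (N x c),
    Y the predicted class probabilities (N x c)."""
    Y = np.clip(Y, eps, 1.0)          # avoid log(0)
    return -np.sum(T * np.log(Y)) / T.shape[0]

def sse_loss(T, Y):
    """Sum squared error, Eq. (10): squared errors summed over samples and classes."""
    return np.sum((T - Y) ** 2)

def mae_loss(T, Y):
    """Mean absolute error, Eq. (11): absolute errors averaged over the dataset."""
    return np.sum(np.abs(T - Y)) / T.shape[0]

# Example: 3 samples, 2 classes.
T = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
Y = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], dtype=float)
print(crossentropy_loss(T, Y), sse_loss(T, Y), mae_loss(T, Y))
```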

B. METHODOLOGY
We replaced the (tanh) function, which is used to compute the cell input and to update the output, in order to examine the effects of various loss functions on the performance of the DLLSTM classifiers. The suggested DLLSTM classifiers are initially trained using the standard gate function (sigmoid) and are then trained using a (hard-sigmoid) gate function. In each combination, two identical state activation functions (replacing the two (tanh) instances) are used, chosen from the AF list in Table 1, and for each structure the effects of the various loss functions ((Crossentropyex), (MAE) and (SSE)) on the accuracy of the DLLSTM classifiers are compared.
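As a language-agnostic sketch only, not the implementation used in the experiments, the following NumPy cell step shows where the interchangeable gate function and state function enter equations (1)-(6); the weight shapes and the hard-sigmoid slope/offset (0.2, 0.5) are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid (slope/offset assumed as 0.2/0.5).
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def lstm_step(x_t, h_prev, c_prev, W, U, b, gate_fn=sigmoid, state_fn=np.tanh):
    """One DLLSTM cell step following Eqs. (1)-(6).
    W, U, b are dicts keyed by 'f', 'i', 'o', 'c'; gate_fn and state_fn are the
    interchangeable gate and state activation functions studied in this work."""
    f_t = gate_fn(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # (1) forget gate
    i_t = gate_fn(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # (2) input gate
    o_t = gate_fn(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # (3) output gate
    c_bar = state_fn(W['c'] @ x_t + U['c'] @ h_prev + b['c'])   # (4) block input
    c_t = f_t * c_prev + i_t * c_bar                            # (5) cell state update
    h_t = o_t * state_fn(c_t)                                   # (6) block output
    return h_t, c_t

# Tiny usage example with random parameters (hidden size 4, input size 3).
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.standard_normal((H, D)) * 0.1 for k in 'fioc'}
U = {k: rng.standard_normal((H, H)) * 0.1 for k in 'fioc'}
b = {k: np.zeros(H) for k in 'fioc'}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, W, U, b,
                 gate_fn=hard_sigmoid, state_fn=lambda x: x / (1 + np.abs(x)))  # Softsign state
print(h)
```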

IV. SIMULATION RESULTS
The proposed DLLSTM classifiers are trained using the BPTT approach [42] and (adam) optimizer with a variety of loss functions, such as (Crossentropyex), (MAE), and (SSE). The classifiers are developed using 100 hidden neurons for each trial, and the initial parameters are selected at random. Based on the outcomes of the two datasets, the losses and efficiency of each DLLSTM-based classifier are determined.
The evaluation criteria for the classifiers include accuracy. Accuracy measures how much of the test data has been correctly recognized, as defined below:
$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of test samples}} \times 100 \tag{12}$$
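A one-line check of (12), assuming arrays of predicted and true labels:

```python
import numpy as np

def accuracy(y_pred, y_true):
    # Eq. (12): percentage of correctly classified test samples.
    return 100.0 * np.mean(np.asarray(y_pred) == np.asarray(y_true))

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 75.0
```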

A. FIRST SET OF EXPERIMENTS
We used the Japanese Vowels dataset for the initial set of trials. Nine male speakers uttering two consecutive Japanese vowels (ae), recorded as multivariate time series, make up the initial vowel set from the University of Southern California. The utterances were processed by linear prediction analysis with a sampling rate of 10 kHz, a frame length of 25.6 ms, and a shift length of 6.4 ms. In other words, each utterance produces a time series of length between 7 and 29, with 12 features (12 coefficients) present at each point in the series. The entire collection consists of 640 time series [43]. Table 2 lists the structure variables, training options, various hidden neuron numbers, and loss functions for the suggested DLLSTM-based classifiers. In order to increase performance, the batch size has been determined based on experimentation. For each design, the outcomes of the two tests are used to report the accuracy and loss. According to Table 3, the suggested loss function (SSE)-based classification layer outperforms the conventional loss function (Crossentropyex)- and the suggested loss function (MAE)-based classification layers with the (sigmoid) gate function, enabling the DLLSTM classifiers to attain the maximum efficiency. In addition to the tanh function, which reaches an efficiency of 93.5432%, 11 other suggested DLLSTM classifiers provide high accuracy, with efficiencies in the range of 93-98.2378%. Based on the experiments, the wave-DLLSTM classifier outperformed the tanh function with an efficiency of 98.2378%. The performance curves for the suggested wave DLLSTM classifier, which has the best performance, and the conventional (tanh) DLLSTM classifier are shown in Figure 2 and Figure 3 for the (sigmoid) gate function. Table 4 shows how well each classifier performed when the (hard-sigmoid) function was used in place of the (sigmoid) function with the suggested (SSE) classification layer, which outperformed the default (Crossentropyex)- and the suggested (MAE)-based classification layers. In comparison to the (tanh), which achieves an accuracy of 94.323%, the other 17 suggested DLLSTM classifiers provide high accuracy, with efficiencies in the range of 93-98.0162%. According to the tabulated results, 15 of the 17 proposed DLLSTM-based classifiers exceed the (tanh) function, with the wave function having the highest efficiency (98.0162%). The performance curves for the suggested wave DLLSTM classifier, which has the best performance, and the conventional tanh classifier are displayed in Figure 4 and Figure 5 for the (hard-sigmoid) gate function.
In general, the DLLSTM classifiers using SSE-based classification layers perform better than those using the (Crossentropyex)- and (MAE)-based classification layers, with the (hard-sigmoid) function outperforming the (sigmoid) function. Figure 6 and Figure 7 illustrate and analyze the top-performing AF DLLSTM classifiers, which use the (sigmoid) and (hard-sigmoid) functions respectively, trained with the various loss functions ((Crossentropyex), (MAE) and (SSE)).
The wave function outperforms the (tanh) function by achieving a high classification rate of 98.2378%, compared with the latter's 93.4054%, when using the (SSE) loss function. The wave DLLSTM classifier is also the most effective of the suggested classifiers. It still works, but performs substantially worse, when using the suggested (MAE) loss function. The suggested Modified-Elliott, Root-sig, Sech, Wave, Bi-tanh1, Bi-tanh2, and Softsign-based DLLSTM classifiers generally perform better than the tanh function. Additionally, the examined functions that employ the (hard-sigmoid) gate function outperform those that use the standard (sigmoid) function.

B. SECOND SET OF EXPERIMENTS
The second set of experiments is built upon a weather reports classification task. The dataset illustrates how to use bag-of-words models to develop a simple text classifier based on word frequency counts. In this approach, the word frequency count is used as a predictor in a straightforward classifier. The dataset demonstrates how to develop a straightforward classifier model to identify the category of weather reports from the available text descriptions. Table 5 lists the structure variables, training options, various hidden neuron numbers, and loss functions for the suggested DLLSTM-based classifiers. In order to increase performance, the batch size has been determined based on experimentation. For each design, the outcomes of the two tests are used to report the accuracy and loss. Table 6 and Table 7 show the classification rates for each DLLSTM classifier used to classify the weather reports, utilizing the (adam) optimizer, the (sigmoid) and (hard-sigmoid) gate functions, and the (Crossentropyex), (MAE) and (SSE)-based classification layers, respectively, at 100 hidden neurons. The classifier receives the whole dataset in small batches at each trial, serving as the standard configuration of the DLLSTM structure. The observed DLLSTM classification results serve as a baseline for comparison. According to Table 6, the suggested loss function (SSE)-based classification layer outperforms the conventional loss function (Crossentropyex)- and the suggested loss function (MAE)-based classification layers with the (sigmoid) function, enabling the DLLSTM classifiers to attain the maximum efficiency. In addition to the tanh function, which reaches an efficiency of 86.1%, 14 other suggested DLLSTM classifiers provide high accuracy, with efficiencies in the range of 84-89.503%. According to the experimental studies, the Softsign DLLSTM classifier outperformed the (tanh) function with an efficiency of 89.503%. The performance curves for the suggested Softsign DLLSTM classifier, which has the best performance, and the conventional (tanh) DLLSTM classifier are shown in Figure 8 and Figure 9 for the (sigmoid) gate function. Table 7 shows how well each classifier performed when the (hard-sigmoid) function was used in place of the (sigmoid) function with the suggested (SSE) classification layer, which again outperformed the (Crossentropyex)- and (MAE)-based classification layers. The performance curves for the suggested Bi-tanh1 DLLSTM classifier, which has the best performance, and the conventional tanh classifier are shown in Figure 10 and Figure 11 for the (hard-sigmoid) gate function. Overall, the DLLSTM classifiers using SSE-based classification layers perform better than those using the (Crossentropyex)- and (MAE)-based classification layers, with the (hard-sigmoid) function outperforming the (sigmoid) function. Figure 12 and Figure 13 illustrate and analyze the top-performing AF DLLSTM classifiers, using the (sigmoid) and (hard-sigmoid) functions respectively, trained with the various loss functions ((Crossentropyex), (MAE) and (SSE)). The Softsign function clearly outperforms the (tanh) function by achieving a high classification rate of 89.5%, compared with the latter's 86.5%, when using the (SSE) loss function. The Softsign DLLSTM classifier is also the most effective of the suggested classifiers. It still works, but performs substantially worse, when using the suggested (MAE) loss function. The suggested Modified-Elliott, Root-sig, Sech, Wave, Bi-tanh1, Bi-tanh2, and (Softsign)-based DLLSTM classifiers generally perform better than the (tanh) function.
Additionally, the examined functions that employ the (hard-sigmoid) gate function outperform those that use the standard (sigmoid) function.

V. CONCLUSION
In this study, novel robust loss function-based classification layers for Deep Learning Long Short-Term Memory (DLLSTM) networks have been proposed. The suggested classifiers are initially trained with the (sigmoid) gate function and then with the (hard-sigmoid) gate function to visualize the classification issues. Also, a comparative study was performed using three distinct loss functions ((Crossentropyex), (MAE) and (SSE)) and the (adam) optimizer. Our tests with different datasets showed that the suggested (SSE) loss worked much better than the (Crossentropyex) and (MAE) loss functions. The other recently suggested loss function, (MAE), performed substantially worse and appears to be limited to certain neural network architectures. The results additionally indicated that the suggested classifiers that apply the (hard-sigmoid) function were better than those that use the (sigmoid) function. The analysis indicated that certain less popular AFs, including Modified-Elliott, Root-sig, Sech, Bi-tanh1, Bi-tanh2, wave and Softsign, exhibited lower losses than the most popular AFs. This means that classifiers that use these less popular AFs are more likely to achieve good results than those that use the tanh function. Finally, the choice of the (SSE) loss was indeed confirmed as a strong option that improved the accuracy of the DLLSTM-based classifiers, achieving 98.24% accuracy with the wave function on the Japanese Vowels dataset and 89.5% with Softsign on the Weather Reports dataset.
The following ideas are suggested for future research:
1) Analyzing the effectiveness of the suggested DLLSTM-based classifiers using a variety of optimization methods, such as RMSProp, Sgdm, Adadelta, Adagrad, AMSgrad, AdaMax, and Nadam.
2) Analyzing the effectiveness of the DLLSTM using additional loss functions, such as the Huber and Cauchy losses, to provide more robust alternatives to the (Crossentropyex) loss function; the suggested classifiers would then perform better under the constraints of classification in real systems.