Real-World Driver Stress Recognition and Diagnosis Based on Multimodal Deep Learning and Fuzzy EDAS Approaches

Mental stress is known as a prime factor in road crashes. These crashes often cause harm to people, vehicles, and infrastructure. Likewise, persistent mental stress can lead to the development of mental, cardiovascular, and abdominal disorders. Previous research in this domain mostly focuses on feature engineering and conventional machine learning approaches. These approaches recognize different levels of stress based on handcrafted features extracted from various modalities, including physiological, physical, and contextual data. Acquiring good-quality features from these modalities using feature engineering is often difficult. Recent developments in deep learning (DL) algorithms have reduced the need for feature engineering by automatically extracting and learning robust features. This paper proposes different CNN and CNN-LSTM-based fusion models using physiological signals (SRAD dataset) and multimodal data (AffectiveROAD dataset) for two and three levels of driver stress. The fuzzy EDAS (evaluation based on distance from average solution) approach is used to evaluate the performance of the proposed models based on different classification metrics (accuracy, recall, precision, F-score, and specificity). Fuzzy EDAS performance estimation shows that the proposed CNN and hybrid CNN-LSTM models achieved the first ranks based on the fusion of BH, E4-Left (E4-L), and E4-Right (E4-R). Results showed the significance of multimodal data for designing an accurate and trustworthy stress recognition and diagnosis model for real-world driving conditions. The proposed model can also be used to diagnose the stress level of a subject during other daily life activities.


Introduction
Successful driving always requires both mental and physical skills [1][2][3]. Acute stress reduces the driver's ability to handle hazardous situations, which causes significant damage to humans and vehicles every year [4][5][6][7][8]. Dangerous driving situations are triggered by human errors, individual factors, and ambiance conditions [9]. According to the National Motor Vehicle Crash Causation Survey (NMVCCS) in the United States (US), human errors alone caused 94% of crashes, while vehicle defects, ambiance conditions, and other factors collectively caused 6% of crashes during 2005-2007 [10]. Human errors are linked to the driver's perceptual conditions, so a complete understanding of these conditions is crucial for preventing traffic accidents.
To detect and diagnose drivers' different stress levels, physiological, physical, and contextual information is widely utilized [11]. Moreover, different traditional machine learning models based on handcrafted feature extraction methods are utilized for the classification of stress. Extracting the best features using these approaches is always a challenging task, as the quality of the extracted features has a significant effect on the classification performance [12]. These approaches are laborious, ad hoc, less robust to noise, and require considerable expertise [13]. To overcome these challenges, deep learning models have been utilized to automatically produce complex nonlinear features reliably [14][15][16]. In addition to automatic feature extraction from raw data, these models offer noise robustness and better classification accuracy [17][18][19]. Different deep learning algorithms are used in recent research, e.g., CNN, RNN, DNN, and LSTM.
The models proposed in the current work are based on 1D CNN and hybrid 1D CNN-LSTM networks. The proposed models are separately trained using multiple physiological signals (SRAD) and multimodal data (AffectiveROAD) including physiological signals and other information about the vehicle, driver, and ambiance. Multimodal fusion of data based on deep learning approaches can be used to develop a precise driver stress level recognition model with improved performance and reliability.
Contributions of this research study include: (1) proposing 1D CNN and hybrid 1D CNN-LSTM-based real-world driver stress level recognition models using fused physiological signals (SRAD dataset) and fused multimodal data (AffectiveROAD dataset) and (2) ranking the proposed models for the two and three levels of stress using the fuzzy EDAS approach.
The organization of this research article is given below. Analysis of the existing stress recognition models is presented in Section 2. The proposed methodology is elaborated in Section 3 in terms of datasets, data pre-processing, architectures of the proposed CNN and hybrid CNN-LSTM models, and the fuzzy EDAS approach. Performance evaluation of the proposed models is conducted in Section 4. A fuzzy EDAS-based rank estimation of the proposed models for the driver's two and three levels of stress is also presented in this section. Section 5 gives a detailed assessment of the proposed and existing stress recognition schemes. Finally, Section 6 concludes the paper and gives future directions to further explore this research area.

Related Work
This section provides a review of the existing work in the driver's stress analysis domain and underscores the current contribution. Several driver stress level recognition schemes exist in the literature based on simulated and real-world driving environments. These schemes can be broadly categorized as conventional machine learning or deep learning models.
All the mentioned studies are based on feature engineering techniques, and various conventional machine learning algorithms were employed to classify levels of stress. However, handcrafted features are less robust to noise and subjective changes, and require considerable time and effort [8,13,19,34,35,44]. Moreover, capturing the features' sequential nature is difficult due to the absence of explicit features and high dimensionality despite using complex feature selection methods. Likewise, the dependence of the model on past observations would make it impractical to process all the information due to the growing complexity. The feature-level multimodal fusion models proposed by Chen et al. [4], Healey and Picard [26], Haouij et al. [23], Lee et al. [31], Bianco et al. [30], Sun et al. [41], and Can et al. [36] mainly concentrate on pattern learning in individual signals instead of multiple simultaneous signals [18]. Thus, these models are unsuitable for capturing the nonlinear correlation across multiple signals appearing simultaneously. The various linear and non-linear methods employed in these conventional machine learning models have not been able to perform a thorough investigation of such manifold time series signals [19].
To address the issues faced by conventional machine learning models, deep learning methods have been introduced. Deep learning models are developed based on signal preprocessing (noise filtering), designing a particular deep neural network based on the area of interest, network training, and model testing. Deep learning models learn and classify raw data using multilayer deep neural networks [45]. The last fully connected (FC) layers are utilized to obtain the final output. Contrary to feature engineering techniques used in conventional machine learning approaches, deep learning models automatically produce stable features [14,15]. Moreover, deep learning models are more robust to noise and achieve improved classification accuracy [19]. Different deep learning algorithms are used in recent research, e.g., the recurrent neural network (RNN), deep neural network (DNN), LSTM, and CNN. Rastgoo et al. [11], Zhang et al. [46], Kanjo et al. [17], Lim and Yang [47], Yan et al. [48], Hajinoroozi et al. [49], and Lee et al. [50] presented different deep learning models to identify different driver states. Rastgoo et al. [11], Kanjo et al. [17], Lim and Yang [47], and Yan et al. [48] proposed deep learning models based on multimodal data. On the other hand, the models proposed by Hajinoroozi et al. [49] and Lee et al. [50] are based on physiological signals only. The stress recognition model proposed by Zhang et al. [46] is based on facial images only. Apart from driving scenarios, Masood and Alghamdi [51], Cho et al. [52], Seo et al. [53], Hwang et al. [54], and He et al. [55] proposed stress recognition models based on deep learning techniques and physiological signals in academic, workplace, and lab settings. Most of these studies, including [46,49,50,52-56], are based on two levels of stress only. Moreover, the schemes presented by [46,50,52,55,56] are based on images. Likewise, the schemes proposed by [49,52-56] are either based on physiological signals or a single modality. On the other hand, the model proposed by [11] is based on multimodal data collected during simulated driving.
The models proposed in this study are based on the fusion of multimodal data collected during real-world driving (SRAD and AffectiveROAD datasets). Moreover, these models are based on 1D CNN and 1D CNN-LSTM networks to detect two (stressed and relaxed) and three (low, medium, and high) levels of driver stress. The fuzzy EDAS approach is also used to find the performance ranks of the proposed models based on different classification metrics.

Materials and Methods
The proposed unimodal and fusion models for real-world driver stress level recognition are based on physiological signals and deep learning approaches, such as CNN and hybrid CNN-LSTM. The proposed models are implemented in MATLAB 2022a. The stress recognition models are based on the fusion of ECG, HR, hand GSR (HGSR), foot GSR (FGSR), EMG, and RESP signals from the PhysioNet SRAD database, and of breathing rate (BR), GSR, BVP, HR, TEMP, ACCEL, posture, and activity data from the AffectiveROAD database. The data input mechanisms used in this research are based on raw signals, which are preprocessed to obtain cleaned signals.

SRAD Dataset
The ECG, HR, GSR, EMG, and RESP signals analyzed in the current work belong to the SRAD PhysioNet public database [57]. Experiments were performed while driving a customized Volvo S70 series station wagon. Five different sensors were used to acquire physiological signals from the nine drivers during twenty-four drives. The sensors were connected to an embedded computer through an analog-to-digital converter (ADC). The ECG sensor was placed using a modified lead II configuration to decrease the motion artifacts. The EMG sensor was positioned on the shoulder near the trapezius muscle to record the emotional stress. Two GSR sensors were placed on the palm of the driver's left hand and the sole of the left foot. Expansion of the chest cavity was used to measure the RESP signals through an elastic Hall effect sensor fastened around the diaphragm.
All drives comprise rest, highway, and city driving phases on a specific 31 km route in Boston, US. These rest, highway, and city driving phases are assumed to trigger low, medium, and high levels of stress, respectively. Before each drive, the drivers were informed about the travel plan and asked to comply with certain guidelines regarding speed limits and turning off the radio. To avoid rush hours, all drives were performed in the midmorning and afternoon. Two rest intervals of 15 min in the parking area were added at the start and end of each drive to collect the driver's low-stress baseline. Due to stop-and-go traffic in the city area, drivers usually experience high-stress situations. After the toll booth, the city road turns into the highway. Uninterrupted highway driving normally indicates medium-stress conditions. The trip ends after returning to the starting position using the same highway and city routes. The total length of all drives varies from 50 to 90 min, including the two 15 min rest intervals.
The dataset contains information about 17 drives, but some drives have incomplete signals and markers. These incomplete drives are removed from the experiments. Figures 1-5 separately show the ECG, HR, GSR, EMG, and RESP waveforms for the three levels of stress. The figures show that all five signals have distinct waveforms for the three levels of stress.

AffectiveROAD Database
Experiments were performed using wireless sensors networked together inside different cars to collect physiological signals and additional information about the vehicle, driver, and ambiance. The Zephyr Bio-harness (BH) chest strap was placed on the driver's chest to collect HR, breathing rate (BR), posture, and activity information. Two Empatica E4-Left (E4-L) and E4-Right (E4-R) wearable devices were mounted on the driver's left and right arms to capture GSR, BVP, inter-beat interval (IBI), HR, TEMP, and ACCEL data. The Intel Edison developer kit-based environmental platform was placed in the car's rear seat for collecting luminosity, temperature, pressure, and humidity information. A sound meter and microphone were used to obtain sound amplitude and audio signal. Two cameras were placed on the windshield of the car to record inside and outside events. A real-time continuous subjective metric was prepared by an experimenter during each drive to monitor the driver's stress level. The stress metric along with two video recordings were then used by the drivers to correct and validate the experimenter's ratings.
All drives comprise rest, highway, and city driving phases on a fixed route 31 km in length in the Grand Tunis area. Fourteen driving experiments were performed by 10 experienced drivers with valid driver's licenses. Each drive included two 15 min rest periods at the start and end of the session. The whole experiment normally took about 86 min to travel through the zone, city1, highway, and city2, and then travel back in the opposite direction to reach the starting point. The rest, highway, and city drives were supposed to yield low, medium, and high levels of stress, respectively.

Pre-Processing
Physiological data are normally derived from the human body in the form of low-amplitude signals with different frequency ranges. These signals are mostly polluted by different noises and artifacts. To model the driver's stress levels accurately, it is necessary to preprocess the ECG, HR, HGSR, FGSR, EMG, and RESP signals first.
ECG signals normally contain different unwanted components including baseline wander, powerline interference (PLI), and high-frequency EMG noise [58]. Moreover, the PLI adds 50-60 Hz noise components to ECG signals [59]. Likewise, high-frequency EMG noise components caused by muscle contractions contaminate the ECG signals [58]. HR signals are commonly derived from ECG signals, so they inherit some noise and artifacts from ECG signals. To remove the baseline wander from the ECG signals, a band-pass Butterworth filter (5-15 Hz) was used. Similarly, a finite impulse response (FIR) notch filter (59-61 Hz) and an FIR band-pass filter (1.5-150 Hz) were used for noise removal. Min-max normalization is then applied to remove the subject-specific baseline and motion artifacts.
A GSR signal is an effective stress measure that is comparatively less susceptible to noise [60]. Yet, the authors of [61] used a low-pass filter (4 Hz) and a Gaussian filter for denoising the GSR signal. These filters are used in this study too to obtain cleaned GSR signals. The signals are also normalized to the maximum value.
The EMG signal is contaminated by several unwanted signals including motion artifacts, PLI, capacitive effects, and ECG artifact signals. In this work, a band-pass Butterworth filter (0.5-500 Hz) is used to remove the low- and high-frequency noise in the EMG signals. Likewise, PLI is eliminated using a 60 Hz notch filter. Min-max normalization is performed to remove the subject-specific baseline and motion artifacts. EMG signals in the SRAD dataset were originally collected at a relatively low sampling frequency of 495 Hz, although the EMG signal contains information up to 450 Hz. As per the Nyquist theorem, a sampling frequency of at least 900 Hz is required for the EMG signals.
The RESP signal is normally polluted by different undesirable signals including baseline wander, PLI, and motion artifacts. To remove the baseline signal and high-frequency noise from the RESP signal, we applied Butterworth high-pass (0.05 Hz) and low-pass (0.70 Hz) filters, respectively.
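The normalization steps and the Nyquist argument in this section can be illustrated with a minimal Python sketch. It is not the MATLAB code used in this work: the function names and sample values are hypothetical, and "normalized to the maximum value" is assumed here to mean division by the largest absolute sample.

```python
def min_max_normalize(signal):
    """Rescale a sequence of samples linearly to the range [0, 1]."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant signal carries no range information; map it to zeros.
        return [0.0 for _ in signal]
    return [(x - lo) / (hi - lo) for x in signal]


def max_normalize(signal):
    """Divide every sample by the largest absolute value (assumed reading
    of 'normalized to the maximum value')."""
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return [0.0 for _ in signal]
    return [x / peak for x in signal]


def nyquist_rate(max_signal_freq_hz):
    """Minimum sampling frequency needed for content up to the given frequency."""
    return 2.0 * max_signal_freq_hz


# Made-up sample values, for illustration only.
gsr = [1.0, 2.0, 4.0]
print(min_max_normalize(gsr))
print(max_normalize(gsr))

# EMG content up to 450 Hz versus the 495 Hz sampling frequency quoted above.
required = nyquist_rate(450.0)
print(required, 495.0 >= required)
```

The last line makes the EMG limitation explicit: content up to 450 Hz calls for 900 Hz sampling, which the 495 Hz recording does not meet.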

1D CNN Models
CNN-based models were originally developed to learn the internal representation of 2D images and then classify them into certain output classes. The same approach can be utilized for automatic feature learning and classification of time series data [62]. A 1D CNN uses several filters to perform 1D convolution (Conv1D) operations for constructing feature maps from such data. These networks can better match the 1D characteristic of different physiological signals. Increasing the number of convolutional layers can help CNN models to gradually extract distinctive and robust higher-level features. The 1D CNN models used in this research are based on the signal fusion of the SRAD and AffectiveROAD datasets for both the two-stress and three-stress classes. Thus, all signals in the SRAD dataset are jointly used to train the 1D CNN model. The AffectiveROAD dataset consists of multimodal data collected using the BH, E4-L, and E4-R devices, so different 1D CNN models are trained using the BH, E4-L, E4-R, E4-(Left+Right) (E4-(L+R)), and BH+E4-(L+R) datasets. A sliding window approach is used to convert each cleaned signal into equal-size segments. These segments are then fed to a 1D CNN as new training data. The CNN-based driver stress recognition model performs both feature learning and classification tasks.
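The sliding window segmentation mentioned above can be sketched as follows. The window length and stride are not stated in the text, so the values below are placeholders, and `sliding_windows` is a hypothetical helper rather than the routine used in this work.

```python
def sliding_windows(signal, window_size, stride):
    """Split a sequence into equal-length segments.

    A trailing part shorter than window_size is dropped so that every
    segment has exactly the same size.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(signal) - window_size
    return [
        signal[start:start + window_size]
        for start in range(0, last_start + 1, stride)
    ]


# Ten made-up samples, windows of 4 samples with 50% overlap (assumed values).
cleaned = list(range(10))
segments = sliding_windows(cleaned, window_size=4, stride=2)
for seg in segments:
    print(seg)
```

Overlapping windows (stride smaller than the window size) increase the number of training segments; a stride equal to the window size gives non-overlapping segments.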
A 1D CNN architecture is defined using multiple Conv1D blocks, each containing convolution, ReLU, and layer normalization (LN) layers. One-dimensional CNN architectures based on the SRAD, E4-L, E4-R, E4-(L+R), BH, and BH+E4-(L+R) datasets are shown in Table 1. The convolution layer utilizes trainable filters (kernels) to convolve the low-level features of each segment or the previous layer's output to produce a feature map. The number of filters in each convolutional block is set differently depending on the dataset. Causal padding is used in all convolutional layers to produce outputs with the same length as the input. It pads the start of the layer's input with zeros so that values for the early time steps of the frame can also be computed. The convolutional layer is followed by the ReLU layer, which is based on a piecewise linear function. This function returns the input itself for positive inputs and zero otherwise, thus alleviating the vanishing gradient problem [63]. Moreover, the function adds nonlinearity to the model to learn complex patterns in the data. A global average pooling (GAP) layer is added after the four convolutional blocks to produce a single vector output. This layer finds the average output of each feature map generated by the convolutional layers and provides a substitute for the flattening block. The last three layers, namely the FC, softmax, and classification layers, perform the classification task. The vector output of the GAP layer is fed to the FC layer, which is also known as the hidden layer. The FC layer maps the pooled features to one score per output class. The softmax layer converts these scores into probabilities for the low, medium, and high classes of stress, from which the final classification decision is made. The final classification layer uses a cross-entropy loss function to evaluate the performance of the classification model. An increase in cross-entropy loss reflects the divergence of the predicted probability from the actual label and vice versa.
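To make the data flow through these layers concrete, the following pure-Python sketch runs one segment through a causal convolution, ReLU, GAP, an FC layer, softmax, and the cross-entropy loss. It is a simplified illustration, not the architecture of Table 1: it uses a single convolutional block, omits layer normalization, and all filter weights, FC weights, and input values are made up.

```python
import math


def causal_conv1d(x, kernel):
    """1D convolution with zeros padded on the left only, so the output has
    the same length as x and output[t] depends on x[t], x[t-1], ... only."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def relu(values):
    """Keep positive values unchanged and replace the rest with zero."""
    return [v if v > 0 else 0.0 for v in values]


def global_average_pool(feature_maps):
    """Reduce each feature map to its mean, giving one value per filter."""
    return [sum(fm) / len(fm) for fm in feature_maps]


def fully_connected(vector, weights, biases):
    """Map the pooled feature vector to one score per class."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert class scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss for a single example: negative log-probability of the true class."""
    return -math.log(probs[true_index])


# Made-up segment and weights, for illustration only.
segment = [0.1, 0.4, -0.2, 0.7, 0.3]
kernels = [[0.5, -0.5], [1.0, 1.0]]                 # two filters of width 2
feature_maps = [relu(causal_conv1d(segment, k)) for k in kernels]
pooled = global_average_pool(feature_maps)
fc_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 classes x 2 features
fc_biases = [0.0, 0.0, 0.0]
probs = softmax(fully_connected(pooled, fc_weights, fc_biases))
loss = cross_entropy(probs, true_index=2)           # index 2 stands for "high"
print(probs, loss)
```

Because GAP reduces each feature map to a single number, the FC layer's input size equals the number of filters in the last convolutional block, independent of the segment length.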
The classification layer infers the number of classes from the FC and softmax layers.

Network's Training
Before starting the training process, several parameters need to be set. These parameters include the training algorithm, mini-batch size, validation frequency, initial learning rate, and maximum epochs. Parameter settings for different 1D CNN models are shown in Table 1.
A training algorithm is used to reduce the loss function of a learning model iteratively based on a training dataset. Adaptive moment estimation (Adam) is used as the training algorithm. It combines the benefits of RMSProp and AdaGrad by calculating individual adaptive learning rates based on estimates of the first and second moments of the gradients. The mini-batch represents a subset of segments used in a single training iteration. The mini-batch size is set to a small value to ensure the uniform distribution and utilization of the full dataset during a single epoch. The validation frequency represents the number of training iterations between evaluations of validation metrics, while a training iteration is a single step performed by the optimization algorithm to reduce the loss function for a mini-batch. The network's validation frequency is set to 10. An epoch represents a full pass of the optimization algorithm over the entire dataset, and the maximum number of epochs bounds the length of training. All datasets are divided into 80% for training data and 20% for validation data.
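The Adam update described above can be written out for a single scalar parameter as a short sketch. The decay rates `beta1 = 0.9`, `beta2 = 0.999`, and `eps = 1e-8` are the commonly used defaults and are assumptions here, as is the toy loss function; the settings actually used are those of Table 1.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running estimates of the first and second moments of the
    gradient; t is the 1-based iteration counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem (illustrative only): minimise f(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w)
```

On the first iteration the bias-corrected ratio equals the sign of the gradient, so the step size is approximately the learning rate regardless of the gradient's scale; this per-parameter scaling is what the text refers to as individual adaptive learning rates.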

Hybrid 1D CNN-LSTM Models
The LSTM is a particular type of RNN developed by Hochreiter and Schmidhuber [64]. It is useful for discovering and remembering long sequences of data efficiently. Like an RNN, the LSTM is a chain of repeating neural network cells, but the two have different cell structures. The RNN's cell consists of a single neural network layer based on the hyperbolic tangent function, while the LSTM's cell has four interacting neural network layers based on sigmoid and hyperbolic tangent functions and pointwise multiplication operations. The LSTM has several cells connected to each other horizontally. Information can be added to or removed from the cell state using four different gates. Each LSTM cell consists of an input gate, cell state gate, forget gate, and output gate. The forget gate is based on the sigmoid function, which determines which information needs to be forgotten from the cell state. The information is removed if the gate generates an output of zero and retained if it produces an output of one. The cell state gate determines the cell state based on the new information. First, the input gate based on the sigmoid function determines the values to be updated. Next, a vector of new candidate values is created by the hyperbolic tangent activation function. The cell state is updated by combining the results of the two functions. To generate the output of a cell, the output gate first applies the sigmoid function to determine which part of the cell state is output. Next, a hyperbolic tangent function is applied to the cell state and the resulting value is multiplied by the output of the sigmoid function.
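The gate interactions described above can be sketched for a single scalar LSTM cell. This follows the standard LSTM formulation, in which every gate is computed from the current input and the previous hidden state; the parameter names and values are hypothetical and unrelated to the trained models of this work.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps names to weights: w* act on the input x, u* act on the previous
    hidden state h_prev, and b* are biases.
    """
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g          # keep part of the old state, add new content
    h = o * math.tanh(c)            # expose a filtered view of the cell state
    return h, c


# Made-up parameters and inputs, for illustration only.
params = {
    "wf": 0.5, "uf": 0.1, "bf": 1.0,
    "wi": 0.4, "ui": 0.2, "bi": 0.0,
    "wg": 0.9, "ug": 0.3, "bg": 0.0,
    "wo": 0.6, "uo": 0.1, "bo": 0.0,
}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.7]:
    h, c = lstm_cell_step(x, h, c, params)
    print(h, c)
```

A forget gate output near zero discards the previous cell state, while an output near one carries it forward unchanged, which is how the cell retains information across long sequences.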
The hybrid CNN-LSTM model utilizes both 1D CNN and LSTM networks to classify sequenced data. In such a model, the CNN is used as a front end to extract features from physiological data followed by the LSTM layers to perform learning and classification tasks. The hybrid CNN-LSTM model has a similar architecture to the CNN model with additional LSTM cells after the FC layers. The architecture of the CNN model is already discussed in the previous section. The hybrid 1D CNN-LSTM architectures based on the SRAD, BH, E4-L, E4-R, E4-(L+R), and BH+E4-(L+R) datasets are shown in Table 2. Moreover, parameter settings for the proposed models are also shown in the same table.

Fuzzy EDAS Approach
The fuzzy EDAS approach is used to evaluate the performance of the proposed real-world driver stress level detection models based on different modalities. This approach performs the rank estimation of the proposed models in terms of accuracy, recall, precision, F-score, and specificity. Fuzzy EDAS is an eight-step process in which the result of each step is used by the following steps, as elaborated below:
Step 1: First, the "solution of the average value (ψ)" is calculated for all matrices using Equations (1) and (2). The aggregate solution of Equations (1) and (2) gives the average value (ψ_β) of every criterion, i.e., of each performance metric.
Step 2: The positive distances from the average (P_I) of each signal for each of the driver's stress levels are calculated using Equation (3). The (P_I)_αβ in Equation (3) is the positive distance of the αth model from the average value for the βth criterion. It can be found in one of two ways. If the βth criterion is favorable, it is calculated using Equation (4); if the βth criterion is not favorable, it is calculated using Equation (5).
Step 3: The negative distances from the average (N_I) of each signal for each of the driver's stress levels are calculated using Equation (6). The (N_I)_αβ in Equation (6) is the negative distance of the αth model from the average value for the βth criterion. It can be found in one of two ways. If the βth criterion is favorable, it is calculated using Equation (7); if the βth criterion is not favorable, it is calculated using Equation (8).
Step 4: The weighted sum of (P_I)_αβ, denoted (SP_I)_α, is calculated using Equation (9). The aggregate (P_I) is estimated for each signal evaluated using the proposed model for each stress level.
Step 5: The weighted sum of (N_I)_αβ, denoted (SN_I)_α, is calculated using Equation (10). The aggregate (N_I) is estimated for each signal evaluated using the proposed model for each stress level.
Step 6: The normalized values of (SP_I)_α and (SN_I)_α of each signal for the driver's stress level are found using Equations (11) and (12).
Step 7: The appraisal score (λ) of each signal for the driver's stress level is calculated using Equation (13). The appraisal score lies in the range 0 ≤ λ_α ≤ 1.
Step 8: Each signal for the driver's stress level is ranked according to the decreasing values of the appraisal score (λ_α). Thus, the signal with the lowest appraisal score (λ_α) for a particular stress level has the highest performance among the other signals.
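For orientation, the relations behind these steps can be sketched in the standard (crisp) EDAS form. This is an assumption-laden sketch, not a transcription of Equations (1)-(13): the decision matrix entries $X_{\alpha\beta}$, the number of models $n$, the number of criteria $m$, and the criterion weights $w_\beta$ are notation introduced here, with $\alpha$ taken to index models and $\beta$ to index criteria.

```latex
\begin{align*}
\psi_\beta &= \frac{1}{n}\sum_{\alpha=1}^{n} X_{\alpha\beta}
  && \text{average solution per criterion} \\
(P_I)_{\alpha\beta} &= \frac{\max\bigl(0,\; X_{\alpha\beta}-\psi_\beta\bigr)}{\psi_\beta},
\qquad
(N_I)_{\alpha\beta} = \frac{\max\bigl(0,\; \psi_\beta-X_{\alpha\beta}\bigr)}{\psi_\beta}
  && \text{favorable criterion} \\
(P_I)_{\alpha\beta} &= \frac{\max\bigl(0,\; \psi_\beta-X_{\alpha\beta}\bigr)}{\psi_\beta},
\qquad
(N_I)_{\alpha\beta} = \frac{\max\bigl(0,\; X_{\alpha\beta}-\psi_\beta\bigr)}{\psi_\beta}
  && \text{unfavorable criterion} \\
(SP_I)_\alpha &= \sum_{\beta=1}^{m} w_\beta\,(P_I)_{\alpha\beta},
\qquad
(SN_I)_\alpha = \sum_{\beta=1}^{m} w_\beta\,(N_I)_{\alpha\beta}
  && \text{weighted sums} \\
N(SP_I)_\alpha &= \frac{(SP_I)_\alpha}{\max_{\alpha}(SP_I)_\alpha},
\qquad
N(SN_I)_\alpha = 1-\frac{(SN_I)_\alpha}{\max_{\alpha}(SN_I)_\alpha}
  && \text{normalization} \\
\lambda_\alpha &= \tfrac{1}{2}\bigl(N(SP_I)_\alpha + N(SN_I)_\alpha\bigr),
\qquad 0 \le \lambda_\alpha \le 1
  && \text{appraisal score}
\end{align*}
```

In this form the direction of the ranking depends on which distance definitions are applied: with the favorable forms, a model that exceeds the average on every metric obtains the largest $\lambda_\alpha$, whereas applying the unfavorable forms to the same metrics reverses the order, so that the smallest $\lambda_\alpha$ marks the best model.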

Results
The SRAD and AffectiveROAD datasets are randomly distributed into two groups, with 85% and 15% for training and validation, respectively. Results are acquired for the 1D CNN and 1D CNN-LSTM models trained using the SRAD, BH, E4-L, E4-R, and BH+E4-(L+R) datasets. A performance assessment of the proposed driver stress recognition models for the low, medium, and high classes of stress is carried out using different classification metrics. These performance metrics include accuracy (ACC), recall (RCL), precision (PRC), F-score (F1), and specificity (SPC).

Models' Evaluation for the Two-Stress Class
Results of the proposed driver stress recognition models for the two-stress class are shown in Table 3. These results are based on the training data obtained from the SRAD and AffectiveROAD datasets for real-world driving. The training graphs of the proposed CNN models are shown in Figures 1-6. Similarly, the training graphs of the proposed hybrid CNN-LSTM models are shown in Figures 7-12. Results show that the BH+E4-(L+R)-based CNN model outperformed the other models based on the SRAD, BH, E4-L, E4-R, and E4-(L+R) datasets by 2.9%, 6.5%, 9.1%, 7.3%, and 3.25%, respectively, with an overall validation accuracy of 95.6% for the two-stress class. The proposed BH+E4-(L+R)-based hybrid CNN-LSTM model outperformed the other models based on the SRAD, BH, E4-L, E4-R, and E4-(L+R) datasets by 4.79%, 1.1%, 7.76%, 5.94%, and 1.94%, respectively, with an overall validation accuracy of 96.59% for the two-stress class.
Confusion matrices of the proposed CNN and hybrid CNN-LSTM models are shown in Figures 13 and 14. In Figure 13f, 214 relaxed instances are predicted correctly, while 5 relaxed instances are incorrectly predicted as stressed by the model. Thus, the total correct prediction for the relaxed class is 97.7%. Similarly, for the stressed class, 291 out of 309 instances are correctly predicted, which amounts to a total accuracy of 94.2% for the stressed class. In Figure 14f, 233 relaxed instances are predicted correctly, while 15 relaxed instances are incorrectly predicted as stressed by the model. Thus, the total correct prediction for the relaxed class is 94%. Similarly, for the stressed class, 277 out of 280 instances are correctly predicted, which amounts to a total accuracy of 98.9% for the stressed class.
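The percentages quoted for Figure 13f follow directly from the confusion matrix counts, as the short sketch below shows. Treating "stressed" as the positive class is an assumption made here for illustration, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Classification metrics from the four cells of a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)            # share of positives that were found
    precision = tp / (tp + fp)         # share of predicted positives that are right
    specificity = tn / (tn + fp)       # share of negatives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Counts quoted in the text for Figure 13f, with "stressed" as the positive class:
# 291 of 309 stressed instances correct, 214 relaxed correct, 5 relaxed misclassified.
metrics = binary_metrics(tp=291, fn=309 - 291, fp=5, tn=214)
print(round(metrics["accuracy"] * 100, 1))     # 95.6
print(round(metrics["recall"] * 100, 1))       # 94.2
print(round(metrics["specificity"] * 100, 1))  # 97.7
```

The recall and specificity reproduce the per-class figures of 94.2% (stressed) and 97.7% (relaxed), and the accuracy matches the 95.6% overall validation accuracy reported for the BH+E4-(L+R)-based CNN model.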

Models' Evaluation for the Three-Stress Class
Results of the proposed driver stress recognition models for the three-stress class are shown in Table 4. These results are based on the training data obtained from the SRAD and AffectiveROAD datasets for real-world driving. The training graphs of the proposed CNN models are shown in Figures 15-20. Similarly, the training graphs of the proposed hybrid CNN-LSTM models are shown in Figures 21-26. Results show that the proposed CNN model based on the BH+E4-(L+R) datasets outperforms the other models based on the SRAD, BH, E4-L, E4-R, and E4-(L+R) datasets by 6.16%, 6.76%, 9.16%, 8.87%, and 1.72%, respectively, with an overall validation accuracy of 85.66%. Similarly, the proposed hybrid CNN-LSTM model based on the BH+E4-(L+R) datasets outperforms the other models based on the SRAD, BH, E4-L, E4-R, and E4-(L+R) datasets by 2.15%, 0.15%, 11.22%, 5.89%, and 3.82%, respectively, with an overall validation accuracy of 87.95%.
Confusion matrices for the proposed CNN and hybrid CNN-LSTM models are shown in Figures 27 and 28. In Figure 27f, 190 low instances are predicted correctly, while 3 and 6 low instances are incorrectly predicted as medium and high by the CNN model. So, the total correct prediction for the low-stress class is 95.5%. Likewise, for the medium- and high-stress classes, 34 out of 50 and 224 out of 274 instances were correctly predicted, which amounts to total accuracies of 68% and 81.8% for the medium- and high-stress classes, respectively. Similarly, in Figure 28f, 214 low instances are predicted correctly, while 3 and 8 low instances are incorrectly predicted as medium and high by the CNN-LSTM model. Therefore, the total correct prediction for the low-stress class is 95.1%. Similarly, for the medium- and high-stress classes, 47 out of 61 and 199 out of 237 instances were correctly predicted, which amounts to total accuracies of 77% and 84% for the medium- and high-stress classes, respectively.
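As a consistency check, the per-class percentages and the overall validation accuracy for Figure 27f can be recomputed from the counts quoted above. The helper names are hypothetical; the counts are taken from the text.

```python
def per_class_recall(correct, totals):
    """Fraction of each class's instances that were predicted correctly."""
    return {label: correct[label] / totals[label] for label in correct}


def overall_accuracy(correct, totals):
    """Correct predictions over all instances, across all classes."""
    return sum(correct.values()) / sum(totals.values())


# Figure 27f (CNN, three-stress class): correctly predicted and total instances.
correct = {"low": 190, "medium": 34, "high": 224}
totals = {"low": 190 + 3 + 6, "medium": 50, "high": 274}

recalls = per_class_recall(correct, totals)
accuracy = overall_accuracy(correct, totals)
print({label: round(r * 100, 1) for label, r in recalls.items()})
print(round(accuracy * 100, 2))   # 85.66
```

The 190 correct low-stress predictions out of 199 give 95.5% for the low-stress class, and the 448 correct predictions out of 523 instances reproduce the reported 85.66% validation accuracy.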

Rank-Based Performance Evaluation
The eight-step fuzzy EDAS procedure [65] defined in Section 3.6 is utilized here to evaluate the ranks of the SRAD, BH, E4-L, E4-R, E4-(L+R), and BH+E4-(L+R)-based CNN and hybrid CNN-LSTM models for the two-stress and three-stress classes. This procedure is followed separately for each of the driver's stress levels. The classification metrics calculated in Tables 3 and 4 are regarded as the criteria for the proposed CNN and hybrid CNN-LSTM driver stress level classification models for the two-stress and three-stress classes.

Rank Estimation of the CNN Models for Two Levels of Stress
A rank estimation of the CNN models for the two-stress class (relaxed state) is performed in a series of steps. The results of each step are shown in Tables 5-10. The first step determines the cross-efficient values ψ_β using Equations (1) and (2), as shown in Table 5. In the next two steps, the positive distance (P_I) and negative distance (N_I) are separately determined using Equations (5) and (8), as given in Tables 6 and 7. In the fourth and fifth steps, the weighted sums of (P_I) and (N_I) are separately calculated with the help of Equations (9) and (10), as shown in Tables 8 and 9. The sixth step normalizes the weighted sums (SP_I)_α and (SN_I)_α independently to obtain the aggregate scores of the models based on Equations (11) and (12), as indicated in Table 10. Finally, the appraisal score (λ_α) is determined based on the aggregate scores N(SP_I)_α and N(SN_I)_α in the seventh step with the help of Equation (13), as given in Table 10. The eighth step uses the appraisal scores (λ_α) to determine the ranks of the proposed CNN models based on the SRAD, BH, E4-L, E4-R, E4-(L+R), and BH+E4-(L+R) datasets. The model with the lowest appraisal score (λ_α) has the highest performance among the candidate models. Table 10 shows that the proposed BH+E4-(L+R), E4-L, E4-R, SRAD, E4-(L+R), and BH-based CNN models achieved first, second, third, fourth, fifth, and fifth positions for the relaxed state. Likewise, the same eight-step procedure is utilized for the stressed state, and the resulting ranks of each CNN model are given in Table 11. Table 11 shows that the proposed BH+E4-(L+R), SRAD, E4-L, E4-R, E4-(L+R), and BH-based CNN models achieved first, second, third, fourth, fifth, and fifth positions for the stressed state.
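The sequence of steps walked through above can be sketched in Python using the standard crisp EDAS procedure, which is swapped in here for the fuzzy variant of this work. The metric values, the equal weights, and the model names below are made up for illustration and are not taken from Tables 5-11.

```python
def edas_scores(matrix, weights, beneficial):
    """Appraisal scores of the crisp EDAS method.

    matrix[a][b] is the value of criterion b for alternative a, weights[b]
    is the weight of criterion b, and beneficial[b] selects which distance
    definition is applied to criterion b.
    """
    n, m = len(matrix), len(weights)
    # Step 1: average solution per criterion.
    avg = [sum(row[b] for row in matrix) / n for b in range(m)]
    sp, sn = [], []
    for row in matrix:
        p_sum = n_sum = 0.0
        for b in range(m):
            diff = (row[b] - avg[b]) if beneficial[b] else (avg[b] - row[b])
            # Steps 2-5: positive/negative distances and their weighted sums.
            p_sum += weights[b] * max(0.0, diff) / avg[b]
            n_sum += weights[b] * max(0.0, -diff) / avg[b]
        sp.append(p_sum)
        sn.append(n_sum)
    # Step 6: normalization.
    max_sp, max_sn = max(sp), max(sn)
    nsp = [s / max_sp if max_sp > 0 else 0.0 for s in sp]
    nsn = [1.0 - s / max_sn if max_sn > 0 else 1.0 for s in sn]
    # Step 7: appraisal score in [0, 1].
    return [(a + b) / 2 for a, b in zip(nsp, nsn)]


# Hypothetical models with made-up accuracy, recall, and precision values.
names = ["model_A", "model_B", "model_C"]
matrix = [
    [0.95, 0.94, 0.96],
    [0.90, 0.91, 0.89],
    [0.85, 0.80, 0.88],
]
weights = [1 / 3, 1 / 3, 1 / 3]   # equal weights (assumption)

favorable = edas_scores(matrix, weights, [True, True, True])
unfavorable = edas_scores(matrix, weights, [False, False, False])

# Step 8: order the models by appraisal score.
print(sorted(zip(names, favorable), key=lambda item: item[1], reverse=True))
print(sorted(zip(names, unfavorable), key=lambda item: item[1]))
```

With the favorable distance forms the dominant model receives the largest score, while with the unfavorable forms it receives the smallest; both orderings place the same model first, so the ranking convention must match the distance equations that were applied. This sketch assigns distinct positions and does not reproduce the tied ranks reported in the tables.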

Rank Estimation of the CNN-LSTM Models for Two Levels of Stress
For the rank estimation of the SRAD, BH, E4-L, E4-R, E4-(L+R), and BH+E4-(L+R)-based hybrid CNN-LSTM models, the same eight-step procedure is used for the relaxed and stressed states, and the resulting ranks are given in Tables 12 and 13, respectively. Table 12 shows that the proposed BH+E4-(L+R), BH, E4-L, E4-R, E4-(L+R), and SRAD-based CNN-LSTM models achieved first, second, third, third, fourth, and fifth positions, respectively, for the relaxed state. Similarly, Table 13 shows that the proposed fused BH+E4-(L+R), BH, E4-L, E4-R, SRAD, and E4-(L+R)-based CNN-LSTM models achieved first, second, third, third, fourth, and fifth positions, respectively, for the stressed state.

Comparison of the Proposed 1D CNN and 1D CNN-LSTM Models
Comparisons of the proposed 1D CNN and 1D CNN-LSTM models for two and three levels of stress based on training time, accuracy, and fuzzy EDAS ranking are shown in Tables 20-23. All proposed models were executed on a single CPU. The fuzzy EDAS approach performs a comprehensive rank estimation of the proposed models in terms of accuracy, recall, precision, F-score, and specificity. The model with the lowest appraisal score (λ α ) has the highest performance among the candidate models.
The comparison shows that there is a tradeoff between the training time and the performance of the various models, with the exception of the SRAD dataset. For example, Table 22 shows the comparison of the 1D CNN model for the three-stress class. As usual, the proposed BH+E4-(L+R)-based model secured the best EDAS rank at a maximum cost of 72 min and 51 s. On the other hand, the BH-based model has the worst EDAS rank for a minimum computational cost of 3 min and 45 s. However, the SRAD-based model has average performance with the highest training time of 279 min and 6 s. Similarly, Table 23 reveals that the proposed SRAD-based 1D CNN-LSTM model secured the top EDAS rank at the cost of a maximum training time of 171 min and 31 s. The BH-based 1D CNN-LSTM model achieved an average EDAS rank with the lowest training time of 2 min and 2 s. However, the E4-(L+R)-based model has the worst EDAS rank despite its high training time of 71 min and 9 s. The proposed models have a high training time because they were trained on a single CPU. Utilizing GPUs may reduce the training time of the proposed algorithms.
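To make the size of these differences concrete, the short sketch below converts the training times quoted above for the three-stress 1D CNN models into seconds and expresses each as a multiple of the fastest model. Only the times stated in the text are used; the helper name `to_seconds` and the dictionary layout are illustrative assumptions.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min and s' training time into seconds."""
    return 60 * minutes + seconds

# Training times quoted in the text for the three-stress 1D CNN models.
cnn_three_class = {
    "BH+E4-(L+R)": to_seconds(72, 51),
    "BH": to_seconds(3, 45),
    "SRAD": to_seconds(279, 6),
}

fastest = min(cnn_three_class.values())
relative = {name: t / fastest for name, t in cnn_three_class.items()}

for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cnn_three_class[name]} s, {ratio:.1f}x the fastest model")
```

Under these figures, the fused BH+E4-(L+R) model takes roughly 19 times as long to train as the BH-based model, and the SRAD-based model roughly 74 times as long.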

Discussion
The proposed CNN and hybrid CNN-LSTM models were analyzed using the SRAD and AffectiveROAD datasets in the previous section. AffectiveROAD is based on the BH, Empatica E4-L, Empatica E4-R, and E4-(L+R) datasets. The BH and Empatica E4 datasets are used both individually and in combination to train the proposed models for the two-stress and three-stress classes. It is evident from the previous tables that the models trained on multimodal data (AffectiveROAD) achieved the maximum performance compared to SRAD data. This shows the importance of physiological, physical, and contextual information in the domain of stress recognition. Moreover, it is also clear that the hybrid 1D CNN-LSTM models achieved better performance than the 1D CNN models. The fuzzy EDAS procedure also shows that the proposed CNN and hybrid CNN-LSTM models achieved the first rank based on the fused AffectiveROAD BH+E4-(L+R) datasets. The performance of the proposed real-world driver stress recognition models is greatly enhanced compared to the existing schemes, as shown in the comparison in Table 24. The table shows that the proposed models achieved the highest performance for both two and three levels of stress, with one exception: Rastgoo et al. [11] achieved a higher performance than the proposed models for the three-stress class. However, their study was based on simulated driving conditions, while the proposed models were based on real-world driving conditions.

Conclusions
This paper concludes that, in addition to physiological signals, other information regarding the driver, vehicle, and ambiance plays an important role in designing reliable and accurate driver stress recognition systems. It is also evident that the hybrid CNN-LSTM models perform better than the CNN models. Moreover, the fusion of the AffectiveROAD datasets (BH, E4-L, and E4-R) achieved the best performance at a lower computational cost than the SRAD dataset. This is due to several factors, including the utilization of multimodal information (physiological signals and information regarding the driver, vehicle, and environment), the quality of the hardware and software tools used for capturing the data, and accurate sampling of the signals. Thus, hybrid deep learning models and multimodal data have a key role in designing an accurate and reliable stress recognition model for real-world driving conditions.
The fusion models based on 1D CNN and hybrid 1D CNN-LSTM produced promising results, but these models may be further improved by utilizing more complex CNN and LSTM architectures. Moreover, a joint CNN-LSTM architecture may be used to further improve stress level recognition. The current study focuses on the driver's stress level; in the future, these models may be applied to drowsiness, cognitive workload, activity, fatigue, and emotion recognition. The AffectiveROAD and PhysioNet SRAD datasets used in this study are based on real-world driving conditions. Such datasets are usually contaminated by various types of noise and artifacts, so enhanced pre-processing techniques can further improve the performance of the models. Future work may also include stress recognition by training the proposed models on physiological signals acquired using non-contact sensors and smart watches.