Multimodal Hierarchical CNN Feature Fusion for Stress Detection

Stress is one of the most severe concerns in modern life. High stress levels can cause various diseases or a loss of focus and productivity at work. People under stress often fail to recognize their own stress levels, so early stress detection is essential. Recently, multimodal fusion has enhanced the performance of stress detection models based on Deep Learning (DL) techniques. The low, mid, and high-level features of a Convolutional Neural Network (CNN) are all discriminative, and a comprehensive feature representation can be obtained by fusing all three levels. This study focuses on detecting stress by exploiting these advantages through multimodal hierarchical CNN feature fusion. The two physiological signals used in this study are Electrodermal Activity (EDA) and Electrocardiogram (ECG). We develop a hierarchical feature set by concatenating multi-level CNN features for each modality. Multimodal fusion of the two hierarchical feature sets is performed using the Multimodal Transfer Module (MMTM). The experiments are carried out with raw frequency domain data and with features from the frequency bands to study the effectiveness of both. The model's performance is compared across different combinations of hierarchical features from the low, mid, and high levels. To verify generalizability, the proposed approach has been evaluated on four benchmark datasets: ASCERTAIN, CLAS, MAUS, and WAUC. On frequency band features, the proposed method outperformed existing models by 1-2%. The hierarchical feature set from all three levels performed better than all other combinations by 2-4%. As a result, this strategy can be a useful addition to stress detection.


I. INTRODUCTION
Stress is the human body's way of responding to overwhelming demands or challenges from a scenario, and it manifests as emotional, physical, or behavioural changes [1]. The way an individual views the scenario has a significant impact on how stressed they are. When an individual faces a challenge in achieving their goal, they evaluate the scenario in two stages: (i) the need to achieve the desired goal and (ii) the external and internal resources available to meet the challenge [2]. Human stress is classified as positive or negative. Positive or acute stress is the stress that lasts for a short time when an individual's capabilities are sufficient to meet the challenge [3]. Negative or chronic stress is the stress that lasts for a long time when a challenge exceeds an individual's capabilities [4]. At some point in life, every individual is exposed to a stressful scenario and will react accordingly. If an individual can cope with a stressful scenario, the next time a similar scenario arises, its stressful impact on the individual will be smaller [5]. Similarly, if an individual cannot cope with a stressful situation and is repeatedly exposed to a similar situation, the individual will develop chronic stress [6]. Each time the body encounters a stressful scenario, the brain triggers the stress response to sensory input from the ears, nose, and eyes. This response is known as "fight-or-flight" [7]. Instantly, the hypothalamus receives a distress signal from the brain. The hypothalamus is the brain's command center. It regulates involuntary body activities through the Autonomic Nervous System (ANS) [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Crippa.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
During stress, most organs are controlled by the ANS without conscious awareness [9]. The ANS has two main divisions: the Sympathetic Nervous System (SNS) and the Parasympathetic Nervous System (PNS). The stress response is controlled by the complementary interaction of the SNS and PNS under different physiological conditions [10]. The SNS initiates the fight-or-flight stress response, which results in a series of physiological, behavioural, and other changes [11]. On the other hand, the PNS plays an essential role in reducing stress responses in individuals by suppressing the SNS [12]. The initial symptoms that emerge from a stressful scenario are called acute stress reactions [13]. These symptoms appear within minutes of a stressful scenario and settle down quickly. Sweating, difficulty breathing, palpitations, nausea, chest pain, and headaches are among the physical symptoms of acute stress reactions [14]. If the symptoms last longer, they lead to chronic stress reactions. Depression, anxiety, memory loss, heart attack, stroke, high blood pressure, cholesterol, ulcers, weight loss, shortness of breath, and a weak immune system are among the long-term health effects linked to chronic stress [15]. Because of these negative impacts, it is crucial to build an effective stress detection system. A timely and accurate diagnosis of stress can make an individual's life more productive, healthier, and happier [16].
Stress is detected through physiological, psychological, and behavioural markers [17]. Psychological reactions include increased negative feelings such as anger, anxiety, and depression [18]. Self-report questionnaires or an examination by a psychologist are used to conduct a psychological assessment of stress. The disadvantage of such assessments is that they are only performed once the affected person or those around them recognize the intensity of the stress, which is usually too late [19]. Within as little as 24 hours, people can experience memory lapses regarding the day's emotional mood, which leads to inaccurate stress level measurements from self-reports or questionnaires [20]. Stress also affects an individual's behaviour, producing changes in emotions such as irritation, anger, and sadness. However, these are hard to measure, as individuals can hide them [21]. Physiological signals can reveal the strength and quality of an individual's inner affect without any manipulation [12]. These physiological changes are involuntary responses that are difficult to notice externally. Hence, hormone monitoring is widely considered reliable for assessing stress [22].
Physiological signals have several distinct advantages, such as reliability, simplicity, continuous readings, cost-effectiveness, user-friendliness, non-maskability, and non-invasiveness, which make them popular among researchers for stress detection [23]. Common physiological signals for stress detection are EDA, electroencephalography (EEG), ECG, respiration pattern, electromyogram, skin temperature, blood pressure, etc. [24]. In most physiological signal-based stress detection research, ECG and EDA signals are widely used either separately or in combination [25]. The ECG signal records the electrical activity of the heart. As the ANS directly affects the heart rate, the heart rate varies during stress [16]. The EDA signal records changes in the electrical characteristics of the skin. During stress, the body sweats more, which leads to increased skin conductance [26].
The growing application of machine learning (ML), deep learning (DL), and wearable technology is an innovation that directly benefits society in healthcare [27]. For different tasks using physiological signals, ML or DL models are trained on benchmark physiological datasets [28]. Support Vector Machine (SVM), random forest, K-Nearest Neighbour, decision tree, and linear discriminant analysis are common ML methods used for such studies [29]. ML approaches are frequently employed and achieve state-of-the-art results in most stress detection studies, whereas DL methods are less extensively used because they need large amounts of data [30]. CNN, Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) are the DL algorithms commonly used for stress detection [31]. Recently, multimodal fusion using the DL approach was found effective for stress detection [32]. There are three different levels of multimodal fusion: early, late, and intermediate [33]. The early fusion method merges the feature representations of each modality at the feature level before training starts. In the late fusion technique, the different models are trained separately and then integrated at the decision level [34]. Intermediate fusion begins training by fusing higher-level feature representations of each modality from independent modality models [35]. The multimodal fusion model learns the highly linked representation across multiple modalities simultaneously, which enhances the model's performance over unimodal approaches [36].
In the last few years, the popularity of CNNs has significantly increased. A CNN extracts the most discriminative characteristics from the data while learning from it. Recent studies have shown that CNNs can generate statistically relevant results for various applications [37]. To link the input layer to the output layer, a CNN model consists of several layers, including convolution, pooling, and dense layers. The layers of a deep CNN can encode various low, mid, and high-level features. The deep layers learn high-level features, and the shallower layers capture low-level features [38]. The availability of the most discriminative features is one of the most important factors for increased classification accuracy [39]. Furthermore, the probability of achieving high classification accuracy with just one conventional feature extraction method is relatively low. The model's performance could be better with more information. Researchers also point out that information loss in the network may increase as the number of layers increases [40]. For these reasons, recent studies integrate hierarchical features from each level through feature fusion and use them for training. This hierarchical feature fusion retains the relevant information and minimizes information loss [41]. In most end-to-end CNN networks, the feature maps of the last convolution layer, which are mainly global features without hierarchy, serve as the discriminative features. However, the low and mid-level features from the initial layers are also discriminative. With the help of hierarchical features (low, mid, and high-level features), the model can learn a more efficient, quality-aware feature representation [42].
Recent research shows that integrated features are often more efficient than independent features. This motivates us to apply the concept of feature fusion to enhance the efficiency of CNN-based stress detection models. We propose a multimodal hierarchical CNN feature fusion model that uses complementary features from various layers to enhance the performance of stress detection models. To the best of our knowledge, the proposed methodology has not yet been systematically addressed for stress detection. Hence, this paper presents a multimodal hierarchical CNN feature fusion model for stress detection using EDA and ECG signals. Initially, frequency domain features from the EDA and ECG frequency bands, or the raw frequency domain data, are given as input to the CNN model. Inspired by the effectiveness of hierarchical feature fusion on CNNs reported in the literature [38], [39], [40], [41], [42], we concatenate the high, mid, and low-level features from the convolutional layers of EDA and ECG separately to form a hierarchical feature set. Unlike single-level fusion, gradual fusion has shown better performance [43]. So, each hierarchical feature set is used for multimodal fusion using MMTM (gradual fusion). Finally, we perform late fusion on the classification probabilities of each modality. This study also explores the performance of distinct combinations of concatenated hierarchical features from the low, mid, and high levels. The proposed method is examined on four standard datasets: CLAS [44], ASCERTAIN [45], MAUS [46], and WAUC [47]. The major contributions of this study are summarized as follows:
1) Multimodal hierarchical CNN feature fusion: The low, mid, and high-level features from the initial convolutional layers are concatenated separately for each modality, and multimodal fusion is performed on the hierarchical feature set using MMTM.
2) Combinations of hierarchical features: Examine the performance of distinct concatenated combinations of hierarchical features from the low, mid, and high levels.
3) Raw data and frequency band features: Compare the effectiveness of the raw frequency domain data and the features from the frequency bands.
4) Generalization ability: To ensure generalizability, the proposed stress detection model has been evaluated on four benchmark datasets: CLAS, ASCERTAIN, WAUC, and MAUS.
Organization: The remainder of this paper is organized as follows. Section II examines recent works on hierarchical feature fusion and the identified research gap. Details of the proposed framework are provided in Section III. The experimental results are presented in Section IV and compared with existing works. The paper is concluded in Section V. We have defined the key terms used in this paper for better understanding and clarity. The list of abbreviations used in the paper is shown in Table 1.

II. RELATED WORKS
Recently, hierarchical CNN feature fusion methods have been frequently used in image classification tasks. This section briefly reviews such works and their effectiveness.
To classify fruit diseases, Akram et al. [38] proposed a hierarchical pipeline for deep feature fusion and selection. Pre-trained models were used to extract deep features, which were then fine-tuned via transfer learning. Multilevel fusion was performed before feature selection. Fruit diseases were classified with Multi-SVM using the selected features from the plant village dataset [48]. The classification results demonstrated the method's efficiency, with an accuracy of 97.8% and a sensitivity, G-measure, and precision of 97.6%.
A face recognition algorithm with hierarchical feature fusion was proposed by Zhang et al. [41]. The proposed framework learned shallow and deep facial aspects using supervisory information. The features were combined to enhance face recognition efficiency under occlusion and varying illumination. Both the visual geometry group network and the lightened CNN were modified using this method. The proposed approach provided significant recognition results on both the AR face database [49] and the Labeled Faces in the Wild database [50].
Blind quality assessment of in-the-wild images using hierarchical feature fusion was proposed by Sun et al. [42]. The features from the intermediate layers to the final feature representation were hierarchically integrated using a staircase structure. The proposed method allowed the model to fully use visual data at all levels, from low to high. An iterative mixed database training approach was proposed to train the model simultaneously on multiple datasets. The proposed model benefited from the additional training samples and the capacity to learn a more generic feature representation. Experiments were conducted on six real-world image quality assessment datasets, and the results revealed that the proposed model performed significantly better than other state-of-the-art models.
A multiple hierarchical feature fusion for an end-to-end steel surface flaw detection is presented by He et al. [51]. The developed method uses a baseline CNN to produce feature maps to attain good classification abilities at each level. A feature fusion network with multiple levels merges several hierarchical features into a single feature with more details. A region proposal network creates regions of interest based on these multilayer properties. The final detection results are generated for each ROI by a detector composed of a bounding box regressor and a classifier. A defect detection dataset called NEU-DET [52] is compiled to evaluate the proposed method. Using baseline networks, the proposed technique yields 74.8/82.3 mean average precision on the NEU-DET dataset.
A selective feature connection mechanism for concatenating CNN features from multiple layers is proposed by Du et al. [53]. A feature selector created by high-level features links low-level features to high-level features. The proposed method shows universal acceptance, superiority, and efficacy on various challenging computer vision tasks. Ma et al. [54] proposed a multi-layer feature fusion on CNN to classify satellite image scenes. Since combining feature maps of various scales is not practical, the proposed method first transforms each feature map to fit its dimensions. Instead of just the final convolution layer, two methods for fusion were created to combine feature maps of various layers, and these features were given to the next layer or a classifier. Empirical findings showed that the proposed methods perform efficiently on public datasets.
A multiscale and hierarchical feature aggregation network for segmenting medical images is proposed by Yamanakkanavar et al. [55]. Two modules for feature aggregation are used to effectively combine data across end-to-end network layers: Hierarchical Feature Aggregation (HFA) and Multiscale Feature Aggregation (MFA). To learn deeper fusions of the feature hierarchy, the HFA module blends the features iteratively and hierarchically, and the MFA module gradually accumulates features and enriches the feature representation. With an average accuracy score of 0.97 on the UFBA-UESC, PH2, and ISIC-2018 datasets [56], [57], [58], the suggested model outperforms conventional methods for skin-lesion segmentation in terms of segmentation performance.
Li et al. [59] proposed a hierarchical feature aggregation network for deep image compression. Two approaches are put forth: inter-stage and intra-stage feature aggregation. Inter-stage feature aggregation incorporates multiscale data to produce more contextual features. Intra-stage aggregation joins features from the same stage to enhance representations of a single resolution. According to extensive experiments, the proposed method outperformed state-of-the-art methods, showing its effectiveness.
For robust cross-resolution face recognition, a representation learning method using a hierarchical deep CNN feature set is proposed by Gao et al. [60]. The proposed approach adaptively fuses the contextual features from different layers to learn more reliable and discriminative features. A feature set-based representation learning technique was developed to collectively describe the hierarchical features for improved recognition to exploit contextual information effectively. The hierarchical recognition outputs from several phases are combined to enhance recognition performance. Experimental results on several face datasets have proved the efficiency of the proposed approach.
In light of the studies above, hierarchical feature fusion and multimodal feature fusion effectively enhance model performance. Figure 1 depicts the novelty of this study on hierarchical feature fusion. Figure 1-(a) shows the traditional end-to-end deep learning approach. In end-to-end networks, only the features from the very last layer are used as the identification feature. These are frequently more general features that do not exploit the feature hierarchy. For this reason, we built a hierarchical feature learning model for stress detection using physiological signals. As shown in Figure 1-(b), we combined deep and shallow features to form a hierarchical feature set. Later, these hierarchical features are used for multimodal fusion.
In the following subsections, we first give details about the datasets used for this study. Then we go into detail about feature extraction and the architecture of the multimodal hierarchical CNN feature fusion.

A. DATASET DETAILS
This research makes use of the following four benchmark datasets: ASCERTAIN [45], CLAS [44], MAUS [46], and WAUC [47], which contain multimodal physiological signals such as ECG and EDA. A detailed explanation of each dataset is given below.

1) ASCERTAIN
The dataset contains physiological signals and facial activity recordings of 58 subjects. The physiological signals of the subjects were captured while they watched emotional video clips. The study used 36 emotional video clips from [61]. Following previous studies of stress detection using the ASCERTAIN dataset [62], we also used subjective ratings of valence and arousal for stress labelling. In the 2-D valence-arousal plane, high arousal values together with low valence values are considered stressed, and all others unstressed [63]. The average of the valence and arousal scores is used to decide whether a value is high or low [45].
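The labelling rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `label_stress` is hypothetical, the ratings are invented for the example, and the mean is taken over the ratings passed in, which may differ from how the averaging is applied in the dataset.

```python
from statistics import mean


def label_stress(valence, arousal):
    """Return 1 (stressed) for high arousal with low valence, else 0.

    "High" and "low" are decided against the mean rating. Averaging over
    the ratings passed in is an assumption made for this sketch.
    """
    v_mean, a_mean = mean(valence), mean(arousal)
    return [int(a > a_mean and v < v_mean) for v, a in zip(valence, arousal)]


# Hypothetical ratings for four clips (not taken from ASCERTAIN).
valence = [1.0, 4.0, 2.0, 5.0]
arousal = [5.0, 2.0, 1.0, 4.0]
labels = label_stress(valence, arousal)
```

Only the first clip combines above-average arousal with below-average valence, so it is the only one labelled as stressed in this toy example.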

2) CLAS
The dataset contains physiological data from 62 subjects. Emotional video clips were used to evoke the subjects' physiological responses. The study used 16 emotional video clips from [64]. After removing the subjects with incomplete data, we were left with 59 subjects. Stress labels were assigned using the stimulus annotations described in the dataset [44].

3) MAUS
The dataset contains physiological data recorded under different cognitive load conditions. The N-back task was used with 22 participants to induce cognitive load. At the start of the trial, there was a five-minute rest interval. The N-back task required the participant to recall the digit shown N steps earlier in a rapidly displayed sequence of single digits. Whenever the current digit matched the digit shown N steps before it, the subject was asked to respond by pressing the space bar on the computer keyboard. After a short rest period, the N-back task with six testing cases was completed. The complexity of the task serves as the ground truth [46].

4) WAUC
The study included 48 subjects who performed activities at three different levels of exercise. The speed of a non-rotating cycle or rowing machine was changed to manipulate the physical task. During the exercise, sensor signals were captured. The subjects' responses to the NASA Task Load Index questionnaire were encoded into binary values. They are classified as high or low cognitive load using the mean score provided in the dataset as a cutoff. After removing the subjects with incomplete information, we were left with 45 subjects [47].
To increase the sample count, each signal (EDA/ECG) is split into five-second segments. To ensure subject independence, the training and testing sets are assigned by subject ID. The samples of the first 36, 18, 43, and 42 subjects from the WAUC, MAUS, CLAS, and ASCERTAIN datasets, respectively, are used for training. The samples of the remaining 9 WAUC, 4 MAUS, 16 CLAS, and 16 ASCERTAIN subjects are used for testing.
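The segmentation and subject-independent split can be sketched as follows. The helper names `segment_signal` and `split_by_subject` are hypothetical, the 4 Hz sampling rate is invented for the example, and non-overlapping windows with the incomplete tail dropped are assumptions, since the text does not state whether segments overlap.

```python
def segment_signal(signal, fs, window_s=5):
    """Split a 1-D sequence into non-overlapping windows of window_s seconds."""
    size = int(fs * window_s)
    n_windows = len(signal) // size  # the incomplete tail is dropped
    return [signal[i * size:(i + 1) * size] for i in range(n_windows)]


def split_by_subject(subject_ids, n_train):
    """Assign the first n_train subject IDs to training and the rest to testing."""
    ordered = sorted(set(subject_ids))
    return ordered[:n_train], ordered[n_train:]


# Hypothetical 4 Hz recording lasting 23 s: four complete 5 s windows.
fs = 4
recording = list(range(23 * fs))
segments = segment_signal(recording, fs)

# WAUC after filtering: 45 subjects, first 36 for training, 9 for testing.
train_ids, test_ids = split_by_subject(range(1, 46), n_train=36)
```

Splitting on subject IDs before any segment-level shuffling keeps every segment of a given subject on one side of the split, which is what makes the evaluation subject-independent.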
Class imbalance affects the CLAS, ASCERTAIN, WAUC, and MAUS datasets. Real-world datasets frequently exhibit class imbalance, where one class has fewer samples than the other [65]. This has been a topic of interest for more than two decades. To address the problem, improvements continue to be made at the data level, at the algorithmic level, and through hybrid methods [66]. Within the data-level approach, sampling techniques have received the most attention as a way to enhance classification performance. Sampling methods fall into two categories: undersampling and oversampling [67]. Oversampling is the most effective of these because it creates additional samples from the minority class to compensate for the lack of samples [68]. One of the most popular techniques in the literature for generating these new samples is the Synthetic Minority Over-sampling Technique (SMOTE) [69], [70], [71]. It generates a new minority-class data point on the line segment joining a randomly chosen minority data point and one of its K-nearest neighbours [72]. This strategy is widely used since it is simple and works well in practice [73], [74]. In line with the literature, we also applied SMOTE to the training data to address the class imbalance.
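The interpolation step at the core of SMOTE can be illustrated with a short standard-library sketch. It is not the implementation used in this study; the function name `smote_sample`, the toy minority points, and the parameter values are assumptions chosen for illustration.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority, k=2, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class. Oversampling only the training split, as described above, avoids leaking synthetic copies of test subjects into training.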

B. FEATURE EXTRACTION
The following subsections explain the frequency domain features of EDA and ECG on raw data and in the frequency bands.

1) RAW DATA
The Discrete Cosine Transform (DCT) converts the raw EDA and ECG data to the frequency domain. Using the DCT, a signal can be broken down into its essential frequency components [75]. More specifically, the DCT encodes the input signal as a linear combination of weighted cosine basis functions associated with its frequency components. The DCT coefficients are given as input to the model.
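As a small worked example of the transform, the sketch below computes an unnormalised DCT-II directly from its definition. The choice of the type-II variant and of no normalisation is an assumption, since the text does not specify them, and library implementations may scale the coefficients differently.

```python
import math


def dct_ii(x):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_samples = len(x)
    return [
        sum(v * math.cos(math.pi / n_samples * (n + 0.5) * k)
            for n, v in enumerate(x))
        for k in range(n_samples)
    ]


# A constant segment has all of its energy in the zero-frequency coefficient.
coeffs = dct_ii([1.0, 1.0, 1.0, 1.0])
```

For the constant input, the first coefficient equals the sum of the samples and the remaining coefficients vanish up to rounding error, which shows how the transform separates slow trends from faster oscillations in a segment.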

2) FREQUENCY BAND FEATURES
Based on previous research [76], [77], we have identified three main bands in the frequency spectrum for ECG. The low-frequency and high-frequency bands are affected by ANS activity. Therefore, features from these bands are useful for stress detection [80]. The power spectral density (PSD) of the Heart Rate Variability (HRV) derived from each band of the ECG is calculated using Welch's method. The frequency-domain module of the Python library pyHRV [81] is used for this purpose. We collected 51 frequency-domain measures from these PSDs, including relative, absolute, and peak measures. The full list of measures is presented in [81]. The PSD of each EDA band is also calculated using Welch's method. We extracted 40 statistical features from these PSDs (5 bands with eight features each), including max, min, standard deviation, variance, skewness, kurtosis, and median.
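The band-feature idea can be sketched with NumPy alone. This is not the pyHRV pipeline used in the study: `welch_psd` and `band_stats` are hypothetical helpers, the synthetic 0.2 Hz sine, the 4 Hz sampling rate, and the 0.25-0.40 Hz comparison band are invented for the example, and skewness and kurtosis are omitted to keep the sketch free of further dependencies.

```python
import numpy as np


def welch_psd(x, fs, nperseg=256):
    """Welch estimate: average Hann-windowed periodograms with 50% overlap.
    The factor of two for a one-sided spectrum is omitted, since only
    relative band statistics are compared here."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(nperseg)
    scale = fs * np.sum(window ** 2)
    periodograms = []
    for start in range(0, len(x) - nperseg + 1, nperseg // 2):
        seg = x[start:start + nperseg]
        seg = (seg - seg.mean()) * window  # remove the mean, then taper
        periodograms.append(np.abs(np.fft.rfft(seg)) ** 2 / scale)
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, np.mean(periodograms, axis=0)


def band_stats(freqs, psd, low, high):
    """Summary statistics of the PSD restricted to the band [low, high)."""
    band = psd[(freqs >= low) & (freqs < high)]
    return {
        "max": float(band.max()),
        "min": float(band.min()),
        "std": float(band.std()),
        "var": float(band.var()),
        "median": float(np.median(band)),
    }


# Synthetic signal with a single 0.2 Hz component, sampled at 4 Hz.
fs = 4.0
t = np.arange(0, 600, 1 / fs)
x = np.sin(2 * np.pi * 0.2 * t)
freqs, psd = welch_psd(x, fs)
band_b = band_stats(freqs, psd, 0.15, 0.25)  # band containing the component
upper = band_stats(freqs, psd, 0.25, 0.40)   # hypothetical comparison band
```

In this toy case the statistics of the 0.15-0.25 Hz band dominate those of the neighbouring band, which is the kind of contrast the band features are meant to expose to the classifier.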

C. ARCHITECTURE DETAILS
The proposed architecture for stress detection is shown in Figure 2. Phases 1, 2, and 3 are the different levels of features (low, mid, and high-level) from the convolutional layers. In each modality, hierarchical features from different levels of convolutional layers are concatenated and given as input to MMTM [43] for multimodal fusion. MMTM combines multimodal feature information and recalibrates the features. It takes advantage of the computationally efficient and lightweight squeeze-and-excitation block [33]. A joint representation is generated in the MMTM module by combining the ECG and EDA hierarchical features. The joint representation is used to predict the excitation signals, as explained in [43]. For the excitation, two independent, fully connected layers are used for each modality. One fully connected layer uses ReLU activation, while the other uses sigmoid activation. The excitation output is multiplied by the original features of each modality that were given as input to the module.
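The data flow through the hierarchical concatenation and the squeeze-and-excitation style gating can be sketched with NumPy. This is an illustration under several assumptions, not the trained model: the weights are random placeholders, the channel counts and map length are hypothetical, global average pooling is assumed as the way to bring the three phases to a common form before concatenation, and the reduction ratio of 4 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)


def hierarchical_features(phase_maps):
    """Global-average-pool each phase's (channels, length) map and concatenate."""
    return np.concatenate([m.mean(axis=1) for m in phase_maps])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mmtm_like_fusion(feat_a, feat_b, ratio=4):
    """Gate each modality with an excitation predicted from a joint representation."""
    dim_a, dim_b = feat_a.size, feat_b.size
    dim_z = (dim_a + dim_b) // ratio
    # Random placeholder weights; in a trained model these are learned.
    w_z = rng.normal(scale=0.1, size=(dim_z, dim_a + dim_b))
    w_a = rng.normal(scale=0.1, size=(dim_a, dim_z))
    w_b = rng.normal(scale=0.1, size=(dim_b, dim_z))
    # Joint representation from both modalities, with ReLU activation.
    joint = np.maximum(0.0, w_z @ np.concatenate([feat_a, feat_b]))
    # One sigmoid excitation per modality, multiplied onto its own features.
    gate_a, gate_b = sigmoid(w_a @ joint), sigmoid(w_b @ joint)
    return feat_a * gate_a, feat_b * gate_b


# Hypothetical feature maps for phases 1 to 3 of each modality.
ecg_maps = [rng.normal(size=(c, 20)) for c in (64, 128, 256)]
eda_maps = [rng.normal(size=(c, 20)) for c in (64, 128, 256)]
ecg_h = hierarchical_features(ecg_maps)
eda_h = hierarchical_features(eda_maps)
ecg_fused, eda_fused = mmtm_like_fusion(ecg_h, eda_h)
```

Each modality keeps its own feature dimensionality after fusion. The gate only rescales individual features, so information from the other modality enters through the shared joint representation rather than through direct concatenation.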
The model has four convolution layers with 32, 64, 128, and 256 filters, a kernel size of 3 × 3, and ReLU as the activation function. Batch Normalisation (BN) and Max Pool (MP) layers are applied after the convolutional layers. The architecture is completed by fully connected layers FC1 and FC2 and a sigmoid output layer. The Adam optimizer is used for model training, with the default learning rate and a batch size of 64. Binary Cross-Entropy is used as the loss function. An early-stopping strategy shortens the training period if the loss does not decrease for 30 consecutive epochs. The maximum classification probabilities from each model are used to perform the late fusion.
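The early-stopping rule and the late fusion step can be sketched in plain Python. Both functions are hypothetical helpers. Which loss is monitored (training or validation) is not stated in the text, and reading "maximum classification probabilities" as keeping the more confident of the two sigmoid outputs per sample is one possible interpretation, labelled here as an assumption.

```python
def stopping_epoch(losses, patience=30):
    """Return the epoch at which training stops, or the last epoch if the
    loss keeps improving often enough."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses) - 1


def late_fusion(prob_ecg, prob_eda):
    """Per sample, keep the sigmoid output that lies further from 0.5
    (assumed interpretation of fusing on the maximum probability)."""
    fused = []
    for p_ecg, p_eda in zip(prob_ecg, prob_eda):
        fused.append(p_ecg if abs(p_ecg - 0.5) >= abs(p_eda - 0.5) else p_eda)
    return fused


# Hypothetical loss curve: best value at epoch 1, no improvement afterwards.
losses = [1.0, 0.5] + [0.6] * 40
stop = stopping_epoch(losses)

# Hypothetical per-modality stress probabilities for two samples.
fused = late_fusion([0.9, 0.4], [0.6, 0.1])
```

With a patience of 30, the toy loss curve stops training 30 epochs after its best value, and the fused output follows whichever modality is more decisive for each sample.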
Based on our previous study [82], we also perform multimodal hierarchical feature fusion on the highest-performing feature band of ECG (0.15-0.40 Hz, high-frequency band) and of EDA (0.15-0.25 Hz, band b). For these experiments, the architecture is the same as shown in Figure 2, except that max-pooling is excluded and the kernel size is 2 × 2.

IV. RESULTS AND DISCUSSION
The experimental findings are presented and discussed in this section. We ran our experiments on both raw data and frequency band features to study the effect of each. The proposed model's performance is evaluated using accuracy and F1-score, as shown in equations 1 and 2. In our first set of experiments, we compared the performance of raw data against frequency band features. Results obtained from different concatenation combinations on the ASCERTAIN, CLAS, MAUS, and WAUC datasets are shown in Table 2. In our second set of experiments, the highest-performing band features of the ECG and EDA are evaluated with the proposed models on the WAUC dataset, and the results are shown in Table 3.
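For reference, the two metrics can be computed as in the sketch below, which assumes the standard binary definitions: accuracy is the fraction of correct predictions, and the F1-score is the harmonic mean of precision and recall. The function name and the example labels are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1-score for binary labels (1 = stressed, 0 = unstressed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical ground truth and predictions for five segments.
acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Reporting the F1-score alongside accuracy is useful here because the test sets keep their original class imbalance, and accuracy alone can look high for a model that favours the majority class.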

A. MULTIMODAL HIERARCHICAL CNN FEATURE FUSION
Hierarchical feature fusion and multimodal fusion are the two fundamental processes that compose the proposed multimodal hierarchical CNN feature fusion model. Convolutional layers encode information at various levels using different layers, and we use this concept for hierarchical feature fusion. As shown in Table 2 and Table 3, the results show the effectiveness of the proposed methodology, because our hierarchical feature fusion method fully exploits the complementarity between low-level and high-level information. They also indicate that, besides the hierarchical feature fusion, the multimodal fusion at the gradual level and the decision level helped to enhance the model's performance by learning across modalities. The promising results suggest that clinical practitioners can use the proposed model for stress detection.

B. DIFFERENT COMBINATIONS OF HIERARCHICAL FEATURES
We performed the proposed multimodal CNN feature fusion on all hierarchical CNN feature fusion combinations. This experimentation phase is essential to show that the proposed model is stable and to identify the best hierarchical CNN feature combination. The features of each convolutional layer are utilized in the proposed architecture to extract the shallow, intermediate, and deep features. As shown in Table 2 and Table 3, when comparing the performance from phase 1 to phases 1, 2, and 3 (concatenation combinations), we observe a consistent increase in performance as the features extracted from phase 1 to phase 3 are added in sequence to the model. Concatenating the features of all levels (phases 1, 2, and 3) enhanced the model's overall performance more than the other combinations (phases 1, 2, and 3 alone and their pairwise combinations) by 12-15% on raw data, 9-15% on band features, and 12.15% on the highest band features of ECG and EDA on the WAUC dataset. This indicates that shallow features are also important to end-to-end networks, along with deep features, and that the features extracted from all stages contribute to enhancing the model's performance.
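The set of concatenation combinations evaluated in this experiment can be enumerated as in the short sketch below. The phase names are placeholders for the low, mid, and high-level feature sets.

```python
from itertools import combinations

phases = ("phase1", "phase2", "phase3")

# Every non-empty subset of the three phases: singles, pairs, and the full set.
subsets = [combo for size in range(1, len(phases) + 1)
           for combo in combinations(phases, size)]
```

This yields seven configurations in total: three single phases, three pairs, and the full three-phase hierarchical feature set.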

C. RAW DATA AND FREQUENCY BAND FEATURES
We compared the performance of raw data and frequency band features on the proposed model. Overall, the frequency domain features extracted from the EDA and ECG frequency bands improved the model's performance more than the raw data. As shown in Table 2, we observed the same performance difference of 2-4% between raw data and frequency band features in all the datasets. We also made a performance comparison on the highest-performing band features of EDA and ECG with the latest dataset, WAUC. As shown in Table 3, the results were not encouraging compared to the whole frequency band features. This suggests that features across the entire frequency band influence performance enhancement more than the highest-performing EDA and ECG band features.

D. GENERALIZATION ABILITY
Applying DL techniques in the healthcare industry has several benefits, especially when it comes to predictive modelling. The validity and generalizability of a model are being given more consideration as the development of DL-based models continues to advance. In the healthcare industry, this is particularly important because algorithmic results directly impact patient treatment and clinical judgement. We proposed a subject-independent multimodal hierarchical CNN feature fusion stress detection model. Four benchmark datasets gathered from four separate scenarios are used to validate and examine the generalizability of the proposed methodology. As shown in Table 2, the results prove that the presented framework does not overfit a dataset obtained in a specific setting. In all four datasets, we observed a similar performance shift.

E. T-SNE VISUALIZATION
In DL, visualizing the data is a common way to gain insight into it. To visualize the impact of the proposed models, we qualitatively evaluate the proposed hierarchical fusion strategy with the network's feature visualization. This part uses t-distributed stochastic neighbor embedding (t-SNE) to assess the network's visual cognition. Features from the FC-16 layers of the frequency band of the ECG modality are taken and used for visualization. It is evident from Figure 3 that the hierarchical features of the ECG following multimodal fusion are discriminative enough to separate stressed from unstressed conditions. We can see similar clusters in all the datasets. The plot shows the groups formed based on similarity, illustrating the potential capability of the proposed approach for stress detection. The t-SNE visualization map highlights a distinct separation between stressed and unstressed conditions. We have also noticed similar clusters for EDA frequency band features.

F. COMPARISON STUDY
Nowadays, a critical healthcare challenge is the quick and precise diagnosis of stress. Accurate stress detection is a challenge that has been addressed using various techniques. The traditional DL and ML approaches have proven to be the most successful. This work mainly focused on detecting stress from physiological signals (ECG and EDA) using DL. We proposed an efficient multimodal hierarchical CNN feature fusion model for stress detection and compared its performance with several classical ML and DL techniques. This section compares the proposed method's findings with those of existing stress detection research using the four datasets. Table 4 shows a summary of the performance measures. Only a few studies have used the most recent datasets, WAUC and MAUS, in their analyses. Most existing works are carried out in the time-frequency domain [44], [62], [83], [84], [85], [86], [87], [88], are subject dependent [44], [62], [77], [83], [84], [89], and use machine learning models [44], [62], [83], [84], [89], [90]. Few researchers used traditional deep learning techniques [82], [85], [86], [87], [88]. Compared with the existing works, our work focused on utilizing the full features of an end-to-end network, not only the last layer features. The proposed approach performs better than all the reported state-of-the-art subject-independent and subject-dependent studies, except for the CLAS dataset. The results of our predictions indicate that our multimodal hierarchical feature fusion model is highly effective for detecting stress in a subject-independent way.

V. CONCLUSION
This paper presents a multimodal hierarchical CNN feature fusion for stress detection. Raw data and frequency domain features of EDA and ECG signals are utilized in this study. For identification tasks, both the shallow and deep features of the convolutional layers are useful. We integrate the features of each phase of the end-to-end network to increase efficiency and better utilise the features retrieved in each phase. Low, mid, and high-level features of convolutional layers are concatenated to obtain different combinations, and multimodal fusion is conducted on each hierarchical feature set. Additionally, the combination of features can more effectively convey the characteristics of the physiological signals. The proposed approach is tested on four benchmark datasets: ASCERTAIN, CLAS, MAUS, and WAUC. Experimental results show that the proposed approach outperforms previous studies in terms of stress detection in a subject-independent manner. Among the different combinations, concatenating all the phases (low, mid, and high-level features) yields the best performance. The proposed approach to feature fusion is a general one that works well in end-to-end networks. To enhance the feature extraction ability of neural networks, the deep, medium, and shallow features of end-to-end networks can be integrated. Thus, in the future, we intend to (i) expand the studies on hierarchical feature fusion and (ii) explore different multimodal fusion techniques on hierarchical features.