Human Activity Recognition Method Based on FMCW Radar Sensor with Multi-Domain Feature Attention Fusion Network

This paper proposes a human activity recognition (HAR) method for frequency-modulated continuous wave (FMCW) radar sensors. The method utilizes a multi-domain feature attention fusion network (MFAFN) model that addresses the limitation of relying on a single range or velocity feature to describe human activity. Specifically, the network fuses time-Doppler (TD) and time-range (TR) maps of human activities, resulting in a more comprehensive representation of the activities being performed. In the feature fusion phase, the multi-feature attention fusion module (MAFM) combines features of different depth levels by introducing a channel attention mechanism. Additionally, a multi-classification focal loss (MFL) function is applied to classify confusable samples. The experimental results demonstrate that the proposed method achieves 97.58% recognition accuracy on the dataset provided by the University of Glasgow, UK. Compared with existing HAR methods on the same dataset, the proposed method improves accuracy by about 0.9–5.5%, and by up to 18.33% on the classification of confusable activities.


Introduction
HAR is a significant research area in artificial intelligence, with broad applications in human-computer interaction, intelligent surveillance, and other fields. HAR primarily acquires information about human targets through cameras or sensor devices and employs machine learning [1] or deep learning algorithms [2]. Currently, wearable electronic devices, cameras, radars, and other devices are the mainstream devices used for HAR [3][4][5].
Wearable electronic device sensors have the potential to acquire a vast amount of information about human movement [6]. However, these sensors have limitations: they must be attached to the human body, and individuals must wear them at all times for proper use. In contrast, image-based HAR primarily relies on cameras or other image-acquisition devices to obtain information about human activities [7]. High-resolution cameras can accurately identify human activities, but they cannot function under all-round, all-weather conditions and are sensitive to the environment. Additionally, image-based techniques can compromise privacy.
In comparison, FMCW radar sensors can overcome many of the limitations of the devices mentioned above. An FMCW radar sensor preserves users' privacy, has the advantage of working under any lighting conditions, and can function in all weather, even in harsh environmental conditions.

• We propose the multi-domain feature attention fusion network (MFAFN) model for HAR based on the FMCW radar sensor, which enhances the VGG13 architecture by fusing TR and TD maps as the multi-domain feature fusion baseline network (MFFBN) model. Specifically, we introduce the MAFM to unite the 2-D domain spectra more comprehensively by combining single-domain shallow-, medium-, and deep-level attention-weighted features;
• We replace the traditional cross-entropy loss function with a multi-classification focal loss (MFL) function to increase the weight of confusable samples in the MFFBN and MFAFN models;
• We evaluate the effectiveness of the proposed algorithm on a publicly available dataset and find that it slightly improves recognition accuracy compared to existing HAR methods.

Related Work
This section provides a comprehensive review of prior work on HAR using radar sensors, divided into three parts: HAR based on single-domain features, HAR based on multi-domain feature fusion methods, and HAR based on attention mechanisms.

HAR Based on Single-Domain Features
In the field of HAR, researchers commonly utilize TD maps as the 2-D domain spectrum for classification. Taylor et al. combined three machine learning and three deep learning algorithms with principal component analysis (PCA) to implement HAR using TD maps as input; among the six methods, the combination of CNN and PCA achieved the best results [11]. Saeed et al. used a ResNet network to classify TD maps for six activities: falling, sitting down, standing up, walking, drinking, and picking up an object, achieving 100% accuracy for falling activity recognition [17]. Zhu et al. proposed a deep learning model that combines a 1D-CNN and long short-term memory (LSTM) [18]. Shrestha et al. presented a recurrent network architecture based on LSTM and Bi-LSTM [19]. Both studies treat TD maps as time series rather than 2-D images, which differs from how CNNs operate. However, it should be noted that training an LSTM model takes longer [18,19].

HAR Based on Multi-Domain Feature Fusion Methods
Wang et al. utilized a graph convolutional network (GCN) to fuse TR, TD, and RD maps of human activities and then performed HAR as graph classification [29]. Bai et al. proposed a dual-channel deep convolutional neural network (DCNN) approach for radar-based human gait recognition. They fused two TD maps generated using the short-time Fourier transform (STFT) with different sliding window sizes to achieve fine-grained gait recognition [30]. Jia et al. extracted hand-crafted feature maps, phase maps, and TD maps of human activities and fused them using an SVM, in two- and three-map fusions, respectively. The experimental results revealed that the fusion of hand-crafted features and TD maps led to better performance [31].
Numerous MFFNs have been proposed for HAR based on various radar 2-D domain spectral features. Zhao et al. extracted TD maps and the CVD of human activities and fused both feature maps using the CentralNet network, which links the relationship between the two features and combines TD maps and CVD more effectively; however, this approach is only suitable for fusing correlated features [32]. In another study, Chen et al. designed a pre-trained MobileNetV3 lightweight network model and a feature pyramid network (FPN) based multiscale feature extraction model to overcome the challenge of insufficient data [33]. A hierarchical fusion network (HFN) with multi-domain features was proposed in the literature using narrowband radar. The HFN contains an intra-domain network and an inter-domain network: the intra-domain network reduces redundant features, and the inter-domain network fuses high-level features from different domains [34]. Helen et al. proposed a tower CNN that inputs the three red-green-blue (RGB) channels of TD maps of human activities separately, with each color channel image as a parallel input layer. They applied a splicing operation after the convolutional module and then learned the fused features using a 7-layer dense neural network [35]. Ding et al. proposed a novel MFFN model with a summation approach: a combination of a 1-D CNN and an LSTM network extracts features from TR and TD maps, while a 2-D CNN is employed for RD maps. Finally, the three 2-D domain spectra are fused using adaptive weight fusion, and the network model effectively classifies activities [28].
In addition, decision-level fusion is also utilized by training separate models for each 2-D domain spectrum and then combining the results via voting. Chen et al. conducted six preprocessing steps for the TD maps of human activities and utilized CNN models with different training architecture parameters for each feature. They implemented a weighted voting method to fuse the information, and the weight matrix was estimated based on the classification results of the training dataset. This method ultimately improved HAR accuracy effectively [36]. Jokanovic et al. used a stacked autoencoder-based feature extraction for each feature, including the TD, TR, and RD domains. The three parts were combined via weighted voting to obtain activity classification results [37]. Kim et al. proposed a range-distributed convolutional neural network (RD-CNN) architecture for HAR by combining range-time-Doppler (RTD) maps [38]. The TD map of the range dimension is utilized as input to the network. The final classification result is obtained by calculating the sum of probabilities across the range dimensions. In summary, there are various MFFN methods for HAR based on radar sensors, including feature-level fusion and decision-level fusion. These methods combine multiple 2-D domain spectral features to achieve better performance in HAR. However, traditional neural networks treat all inputs equally, which may not effectively distinguish different input information. The introduction of an attention mechanism can help neural networks focus more precisely on the parts that have a greater impact on the results, thus improving the model's performance.

HAR Based on Attention Mechanism
Recent literature has explored the use of attention mechanisms to enhance the accuracy of HAR models. Abdu et al. employed AlexNet and VGGNet networks to extract features from TD maps of human activities [14]. They proposed an efficient channel attention module to enhance the efficiency of the extraction process. Finally, a canonical correlation analysis (CCA) module was used for feature fusion. Du et al. utilized a feature refinement module based on channel and spatial attention to enhance the accuracy of multi-channel feature fusion for HAR [39]. A method was proposed in [40] to extract TD maps of human activities. It combines a multi-head attention mechanism, which captures global information from TD maps, with locally extracted feature maps from a convolutional auto-encoder (CAE). The introduction of the multi-head attention mechanism resulted in higher classification accuracy.
The literature mentioned above faces two main issues. First, the attention mechanism operates on only a single layer of features and does not account for the relationship between multiple layers of features. Second, accurate recognition of human activity based on FMCW radar sensors is highly challenging, mainly due to the high similarity between confusable activities. Although using multiple sensors or radar 2-D domain spectra can effectively improve the accuracy of HAR by exploiting complementary information, the issue of confusable activities still needs to be addressed. Therefore, in this study, we propose the MFAFN model to address the abovementioned issues. The network first incorporates attention-weighted features from single-domain features at the shallow, middle, and deep layers. Then, fusion of the two-domain features is completed after a layer of pooling. Instead of the conventional cross-entropy loss function, the MFL function is utilized. This approach reduces the weights of easily classified samples while increasing the importance of confusable samples, resulting in improved accuracy in HAR.

HAR System Overview
This section presents an overview of the proposed HAR system using the MFAFN model, as shown in Figure 1. The system consists of three main parts: the FMCW radar sensor architecture, the data preprocessing phase, and the activity recognition phase. The FMCW radar sensor architecture and the data preprocessing phase can be collectively referred to as the 2-D domain spectra extraction process. The activity recognition phase in this section focuses on the convolutional neural network and the attention mechanisms used. The dashed part in Figure 1 indicates the innovative part of the proposed method in this paper, which will be described in detail in Section 4.

2-D Domain Spectra Extraction
This section describes FMCW radar sensor architecture and data preprocessing, as illustrated in Figure 1. The FMCW radar sensor transmits a continuous frequency-modulated signal to the target through the TX antenna and receives the reflected signal from the target through the RX antenna. The intermediate frequency (IF) signal is then converted into a digital signal by an analog-to-digital converter (ADC) in the radar equipment, which undergoes several digital signal processing (DSP) steps to obtain the raw radar data.
To obtain a 2-D spectrogram that more accurately represents information on human activity, further processing of the raw radar data is necessary. First, a TR map is obtained by performing a fast Fourier transform (FFT) on the raw radar data, which is a matrix of size sampling points × chirps. A Butterworth high-pass filter is applied to this TR map to remove noise. As human activity is a signal collected within a specific range from the radar, the range selection unit needs to be set manually to select the primary range of human activity. Subsequently, the TD map characterizing the activity is extracted using the short-time Fourier transform (STFT) [41]. The STFT determines the analysis period by selecting the window function, and the spectrum at different time intervals is obtained by sliding the window function. Ultimately, the spectrogram is expressed as follows:

$S(m, n) = \sum_{k=0}^{N-1} s(k)\,\omega(k - m)\, e^{-j 2\pi n k / N}$

where s(k) denotes the input signal, m represents the time index, n represents the frequency index, ω(·) denotes the window function, and N denotes the length of the window.
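The TD-map extraction step can be sketched in numpy as follows. This is a minimal illustration of a magnitude STFT applied to a 1-D slow-time signal from the selected range bin; the window type, length, and hop size here are illustrative choices, not values taken from the paper.

```python
import numpy as np

def stft_spectrogram(s, N=64, hop=16):
    """Magnitude STFT of a 1-D slow-time signal.

    Returns an array of shape (num_frames, N): rows index time m,
    columns index Doppler frequency n.
    """
    w = np.hanning(N)                      # window function w(.)
    frames = []
    for start in range(0, len(s) - N + 1, hop):
        seg = s[start:start + N] * w       # windowed segment
        frames.append(np.abs(np.fft.fft(seg)))
    return np.array(frames)

# Toy example: a tone whose frequency drifts over slow time,
# standing in for a micro-Doppler signature.
t = np.arange(1024)
s = np.cos(2 * np.pi * (0.05 + 1e-5 * t) * t)
td_map = stft_spectrogram(s)
print(td_map.shape)  # (num_frames, N)
```

In practice the same operation is applied per range bin of the filtered TR map; a larger window trades time resolution for Doppler resolution.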

Convolutional Neural Network
CNNs have shown excellent results in various computer vision tasks. Since the emergence of the AlexNet network in 2012 [42], many CNNs have been developed, including VGGNet [43], GoogleNet [44], and others. These models have demonstrated impressive results in image classification.
In the field of HAR based on FMCW radar sensors, the 2-D domain spectrum is commonly utilized as a feature for activity classification. This feature map can be viewed as images with different-sized pixel values. As a result, CNNs are suitable for implementing HAR using radar 2-D domain spectrum.
We chose the VGGNet network model as the reference model for our research. The VGGNet model was the runner-up of the ImageNet competition in 2014 and has lower model complexity than GoogleNet, which achieved the highest classification accuracy that year. Because the HAR dataset based on the FMCW radar sensor is far smaller than image-domain datasets, complex network models are unsuitable for it. To design the baseline for the MFAFN model, we therefore selected the VGG13 model, one of the models in the VGGNet family.
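As a rough sketch of the VGG13-style backbone used as the baseline, the per-block configuration and resulting feature-map sizes can be tabulated. The channel counts 32–512 and the 224 × 224 input follow the MFFBN description later in this paper; the helper name is ours.

```python
# Per-block (num_convs, out_channels) for a VGG13-style backbone;
# each block ends with 2x2 max pooling.
VGG13_BLOCKS = [(2, 32), (2, 64), (2, 128), (2, 256), (2, 512)]

def feature_map_sizes(input_hw=224, blocks=VGG13_BLOCKS):
    """Track (channels, spatial size) after each block:
    3x3 stride-1 convs keep the size; pooling halves it."""
    sizes, hw = [], input_hw
    for _, ch in blocks:
        hw //= 2
        sizes.append((ch, hw))
    return sizes

print(feature_map_sizes())
# [(32, 112), (64, 56), (128, 28), (256, 14), (512, 7)]
```

The printed sizes match the 112 × 112, 14 × 14, and 7 × 7 feature maps quoted in the baseline-network description.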

Attention Mechanism
Typically, deeper neural network structures often lead to better performance, but at the cost of increased time and memory consumption. We tackle these issues by utilizing an attention mechanism that concentrates on the most informative regions of the input data. Specifically, we utilize the SENet architecture [45], which has low computational complexity and is widely used in practice. The output of the attention mechanism is a probability map that assigns different weights to different regions of the input features, depending on their importance for the task of HAR. Figure 2 illustrates that the input feature, denoted F_c^in, has C channels, a width of W, and a height of H. Initially, the feature channels undergo compression using the squeeze operation, which converts the 2-D features of each channel into a single numerical value. This process generates a feature map F_c^avg with a size of 1 × 1 × C:

$F_c^{avg} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} F_c^{in}(i, j)$

where F_c^avg denotes the result of applying global average pooling to the input features. After the squeeze operation, we use the excitation operation to generate weights for each feature channel. Instead of a fully connected layer, we employ a 1 × 1 convolutional layer to reduce the parameters and computational costs, with the channel compression rate set to eight. The weight A_c^avg for each feature channel is calculated as follows:

$A_c^{avg} = \sigma\big(f_{1\times1}(\delta(f_{1\times1}(F_c^{avg})))\big)$

where A_c^avg denotes the attention weight, δ(·) denotes the rectified linear unit (ReLU) activation function, σ(·) represents the sigmoid function, and f_{1×1} represents the 1 × 1 convolution operation. Finally, the weighting operation is performed, where the weights generated in the previous step are applied channel by channel to the input features F_c^in to obtain F_c^out:

$F_c^{out} = A_c^{avg} \cdot F_c^{in}$

where F_c^out denotes the attention-weighted features.
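The squeeze, excitation, and weighting steps can be sketched in numpy. On a 1 × 1 × C input, the two 1 × 1 convolutions reduce to matrix products; the weight matrices below are random placeholders, and the compression rate r = 8 follows the text.

```python
import numpy as np

def se_attention(F_in, W1, W2):
    """SENet-style channel attention on features F_in of shape (C, H, W).

    W1: (C//r, C) and W2: (C, C//r) act as the two 1x1 convolutions.
    """
    # Squeeze: global average pooling -> F_avg of shape (C,)
    F_avg = F_in.mean(axis=(1, 2))
    # Excitation: 1x1 conv -> ReLU (delta) -> 1x1 conv -> sigmoid (sigma)
    hidden = np.maximum(0, W1 @ F_avg)
    A = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))
    # Weighting: scale each channel of the input by its attention weight
    return A[:, None, None] * F_in

rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 8, 8          # compression rate r = 8, as in the text
F_in = rng.standard_normal((C, H, W))
F_out = se_attention(F_in,
                     rng.standard_normal((C // r, C)),
                     rng.standard_normal((C, C // r)))
print(F_out.shape)  # (32, 8, 8)
```

Because the sigmoid keeps every weight in (0, 1), each output channel is a damped copy of its input channel, which is exactly the channel-reweighting behaviour described above.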

HAR Architecture Based on the Multi-Domain Feature Attention Fusion Network
This section details the MFAFN model, whose network architecture is illustrated in Figure 3. The model requires two inputs, TR and TD maps. They are treated as optical images with distinct pixel values, serving as inputs to our proposed HAR model. The TR map effectively captures changes in range over time as the human body moves, making it suitable for distinguishing activities characterized by significant alterations in range. In contrast, in situ activities tend to exhibit minimal range changes, which can lead to confusion between different activities. In contrast, the TD map depicts variations in velocity for each activity throughout the temporal sequence, making it proficient in distinguishing activities that involve pronounced speed changes. However, activities with similar speed alterations may also cause confusion when relying solely on the TD map.
To comprehensively characterize human activities, we incorporate both the range and velocity information of the target. By simultaneously analyzing and learning from the TR and TD maps, our HAR model extracts motion features and accomplishes HAR. We emphasize that the TR and TD maps both play crucial roles in the HAR model; effectively fusing the two features results in higher accuracy in HAR.
Its backbone network follows a symmetric structure designed based on the VGG13 network. In addition, the model incorporates the SENet attention mechanism module, which is applied after the second convolutional layer of the second, third, and fourth blocks. Feature fusion combines the three attention-weighted features and uses a pooling layer to preserve salient feature information. The fused features are then further extracted using two convolutional layers and an additional pooling layer. Finally, the model's feature vector output is fed into a classification module consisting of fully connected and softmax layers, which classifies the features and produces the HAR results. This section introduces the three components of the MFAFN model: the multi-domain feature fusion baseline network, the multi-feature attention fusion module, and the multi-classification focal loss function.
As shown in Figure 4, we visualize the features at the following stages: after the first pooling layer, after the first, second, and third attention mechanisms, after the first concatenation operation, and after the fifth pooling layer. For each visualization, we arrange the data from four channels.
Multi-Domain Feature Fusion Baseline Network
Figure 5 illustrates the proposed MFFBN model, which consists of two input channels for the TR and TD maps. The network has a symmetric structure, so we present only the details of the first channel. The network comprises five modules, each consisting of two convolutional layers and one pooling layer. The convolutional kernel size is 3 × 3 with a stride of 1. The pooling window size is 2 × 2 with a stride of 2, so each pooling operation halves the image size. The five modules differ only in the number of convolutional kernels, which are 32, 64, 128, 256, and 512, respectively. To prevent gradient explosion and vanishing, a Batch Normalization (BN) layer is included after each convolutional layer; this also helps to speed up the training and convergence of the network. Additionally, the ReLU activation function is applied to all convolutional layers:

$f(z) = \max(0, z)$

where f(z) equals the input z when z is greater than 0, and 0 otherwise. The input image is in RGB format with 224 × 224 pixels, and the first module generates a feature map with 32 channels and a size of 112 × 112 pixels. Each subsequent module doubles the number of channels and halves the image size. After the fourth module, a feature map with 256 channels and a size of 14 × 14 pixels is produced. A cascade operation is then applied to channels one and two, as represented by:

$F_{cat}(x) = Con(F_{TR}(x), F_{TD}(x))$

where Con(·) represents the splicing operation, F_TR(x) denotes the TR-map features output by the fourth module, and F_TD(x) denotes the TD-map features output by the fourth module. After the fifth module, a feature map with 512 channels and a size of 7 × 7 pixels is output for the fused features. The fully connected layers have sizes of 512, 128, and 6. To avoid overfitting caused by the deep network, a Dropout layer with a rate of 0.5 follows each fully connected layer; Dropout randomly deletes some hidden neurons, effectively improving the model's generalization ability. The final layer of the network is the softmax layer, which aims to maximize prediction accuracy by calculating the loss between the predicted data and the actual label. The class probabilities are then calculated as follows:

$a_k = \frac{\exp(z_k)}{\sum_{j \in C} \exp(z_j)}$

where a_k represents the predicted probability of each class, C represents the set of categories, W denotes the weight vector, and z_k denotes the value obtained by linearly weighting the features of the sample x with W.
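The cascade (Con) and softmax steps can be sketched in numpy. The shapes follow the text (two 256-channel, 14 × 14 maps spliced into a 512-channel map, and six activity classes); the function names are ours.

```python
import numpy as np

def cascade(F_TR, F_TD):
    """Con(.): splice the TR and TD feature maps along the channel axis."""
    return np.concatenate([F_TR, F_TD], axis=0)

def softmax(z):
    """Class probabilities a_k = exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

# Two 256-channel, 14x14 feature maps fused into a 512-channel map,
# matching the sizes after the fourth module.
F_TR = np.ones((256, 14, 14))
F_TD = np.ones((256, 14, 14))
fused = cascade(F_TR, F_TD)
print(fused.shape)  # (512, 14, 14)

# Toy logits z_k for the six activity classes.
probs = softmax(np.array([2.0, 1.0, 0.5, 0.1, -0.3, -1.0]))
```

The max-shift inside `softmax` does not change the result but keeps `exp` from overflowing on large logits.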

Multi-Feature Attention Fusion Module
To make the model adaptively focus on significant target signal regions and make better use of features, we incorporate an attention mechanism into the MFFBN. This mechanism enables the model to focus on the essential features of the 2-D domain spectrum, reduce feature redundancy by reassigning feature weights, and use feature information more efficiently. In this study, we utilize the SENet channel attention mechanism to focus on regions essential for human activity. The SENet employs a Squeeze-and-Excitation (SE) module architecture to dynamically determine the relevance of each feature map. This is achieved by computing the attention weights for each channel and using these weights to scale the input features.
Low-level features generally have high resolution and more detailed information but weak semantic information. In contrast, high-level features contain more semantic information but have lower resolution and less detail, since the deeper convolutional layers and smaller feature dimensions discard fine-grained information. Although the semantic information of the low-level features is weaker, it is still significant. We propose the MAFM to enhance the feature representation for HAR. The MAFM combines low-level, mid-level, and high-level features and applies a channel attention mechanism to each depth level, allowing the model to focus on the essential features before fusing them across levels. This approach enables the model to more effectively utilize features at varying levels of complexity.
The SENet module is presented in Figure 2 and implemented using a series of operations, including a global average pooling layer, two convolutional layers with a kernel size of 1 × 1, a ReLU activation function, and a sigmoid function. As shown in Figure 3, the attention maps are obtained after the second convolution of the second, third, and fourth blocks of the MFFBN model, respectively. We then resize the three attention-weighted features to 64 × 64 pixels and concatenate them. The optimized feature maps of each domain can be expressed as follows:

$F_d(x) = Con\big(A_L(x) \otimes F_L(x),\; A_M(x) \otimes F_M(x),\; A_H(x) \otimes F_H(x)\big), \quad d \in \{TR, TD\}$

where F_L(x), F_M(x), and F_H(x) represent the low-level, mid-level, and high-level features, respectively; A_L(x), A_M(x), and A_H(x) denote the corresponding attention maps; and ⊗ denotes channel-wise weighting. After fusing the TR and TD maps with a layer of pooling, this module can be expressed as:

$F_{fuse} = pool(Con(F_{TR}, F_{TD}))$

where pool denotes the maximum pooling layer, F_TR and F_TD denote the multi-attention-weighted features of the TR and TD maps, respectively, and F_fuse denotes the fused features.
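An illustrative numpy sketch of the MAFM follows: attention-weight the features at three depth levels, resize each to a common grid, concatenate per domain, then fuse the two domains with max pooling. The channel counts are toy values and the resize is a simple nearest-neighbour stand-in; only the 64 × 64 target size is taken from the text.

```python
import numpy as np

def resize_nn(F, size=64):
    """Nearest-neighbour resize of (C, H, W) features to (C, size, size)."""
    _, H, W = F.shape
    ri = np.arange(size) * H // size
    ci = np.arange(size) * W // size
    return F[:, ri][:, :, ci]

def attention_fuse(feats, atts, size=64):
    """One domain branch: weight each level's features by its attention map,
    resize to a common grid, and concatenate along channels."""
    return np.concatenate(
        [resize_nn(A * F, size) for F, A in zip(feats, atts)], axis=0)

def max_pool2(F):
    """2x2 max pooling with stride 2."""
    C, H, W = F.shape
    return F.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(1)
# Toy low/mid/high-level features for one domain, with per-channel attention.
feats = [rng.standard_normal((c, h, h)) for c, h in [(8, 112), (16, 56), (32, 28)]]
atts = [rng.uniform(0, 1, (f.shape[0], 1, 1)) for f in feats]

F_TR = attention_fuse(feats, atts)                 # multi-attention-weighted TR branch
F_TD = attention_fuse(feats, atts)                 # TD branch, same shapes
F_fuse = max_pool2(np.concatenate([F_TR, F_TD], axis=0))
print(F_fuse.shape)
```

With these toy channel counts each branch yields 8 + 16 + 32 = 56 channels at 64 × 64, and the pooled two-domain fusion has 112 channels at 32 × 32.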

Multi-Classification Focal Loss Function
For HAR based on FMCW radar sensors, distinguishing between certain activities is challenging due to the slight differences in their 2-D domain spectra. To address this issue, we apply the multi-classification focal loss function [46]. The focal loss function was initially developed for target detection to solve the class imbalance problem. The traditional cross-entropy loss function treats all samples equally, leading to high error rates when identifying hard-to-classify samples. The focal loss function introduces a modulating factor that reduces the weight of easy-to-classify samples and emphasizes the importance of confusable samples. Therefore, the focal loss function is well suited to classifying confusable samples.
The focal loss function dynamically adjusts the loss weights based on the difference between the predicted probability and the actual label of each sample, rather than using fixed weights. If a sample is classified correctly with high confidence, its weight decreases; conversely, if it is misclassified, its weight increases. Specifically, by introducing the modulating factor (1 − p_i)^γ, the MFL function can be expressed as:

$MFL = -\sum_{i=1}^{N} t_i (1 - p_i)^{\gamma} \log(p_i)$

where N represents the number of categories, p_i is the predicted probability, and t_i = 1 if i is the actual label and t_i = 0 otherwise. The focus parameter γ controls the rate at which easy-to-classify samples are down-weighted. When γ = 0, the MFL function is equivalent to the multi-classification cross-entropy function. By setting γ = 2, we adjust the loss weighting to increase the model's sensitivity to confusable samples. When a sample is misclassified and p_i is small, the modulating factor tends to 1, so the loss is unaffected. When p_i is close to 1, the modulating factor tends to 0, and the loss of easy-to-classify samples is weighted down.
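A minimal numpy sketch of the MFL (the function name and the toy probabilities are ours):

```python
import numpy as np

def multiclass_focal_loss(p, t, gamma=2.0):
    """MFL = -sum_i t_i * (1 - p_i)^gamma * log(p_i).

    p: predicted probabilities over N categories; t: one-hot actual label.
    gamma = 0 recovers the multi-class cross-entropy loss.
    """
    p = np.clip(p, 1e-7, 1.0)                    # avoid log(0)
    return float(-np.sum(t * (1 - p) ** gamma * np.log(p)))

t = np.array([0, 0, 1, 0, 0, 0])                 # true class among six activities
easy = np.array([.02, .02, .90, .02, .02, .02])  # confidently correct sample
hard = np.array([.15, .15, .30, .15, .15, .10])  # confusable sample
# The modulating factor down-weights the easy sample far more than the hard one.
print(multiclass_focal_loss(easy, t) < multiclass_focal_loss(hard, t))  # True
```

With γ = 2, the easy sample's loss is scaled by (1 − 0.9)² = 0.01, while the confusable sample's loss is scaled by (1 − 0.3)² = 0.49, which is the re-weighting behaviour described above.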
For the focal loss focus parameter, the most commonly used values are 1.5 and 2. Table 1 presents a comparative analysis using three different values; the performance is best when γ = 2. In summary, the flow of the MFAFN model algorithm is shown in Algorithm 1.

Algorithm 1 MFAFN Model
Input: TR and TD maps; Output: Label: 0-5 (six types of activities);
for all training images do
1. Input the TR and TD maps into the first block of the network and obtain the respective features;
2. Apply the channel attention mechanism after the second convolutional layer of the second, third, and fourth blocks, resulting in attention maps for each block (denoted as A_L(x), A_M(x), and A_H(x));
3. Resize the three attention-weighted features of each domain to 64 × 64 and concatenate them to obtain the multi-attention-weighted features (denoted as F_TR and F_TD), respectively;
4. After a pooling layer, concatenate the multi-attention-weighted features of the two domains to obtain F_fuse;
5. Pass F_fuse through block 5 to obtain deep features, and classify the input by feeding these features into a fully connected layer followed by a softmax layer;
6. Calculate the MFL from the predicted and true values, and perform backpropagation to update the network parameters.
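Steps 3-4 of Algorithm 1 (resizing, concatenating, and pooling the attention-weighted features of the two domains) can be sketched as follows; the channel counts in the usage below and the use of bilinear interpolation for resizing are assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def fuse_domains(feats_tr, feats_td):
    """Sketch of steps 3-4 of Algorithm 1: resize each domain's three
    attention-weighted feature maps to 64 x 64, concatenate them along the
    channel axis, apply max pooling, then concatenate the two domains.
    Interpolation mode and pooling stride are assumptions."""
    def multi_scale(feats):
        resized = [F.interpolate(f, size=(64, 64), mode='bilinear',
                                 align_corners=False) for f in feats]
        return torch.cat(resized, dim=1)            # F_TR or F_TD
    f_tr = F.max_pool2d(multi_scale(feats_tr), 2)   # pooling layer
    f_td = F.max_pool2d(multi_scale(feats_td), 2)
    return torch.cat([f_tr, f_td], dim=1)           # F_fuse
```

The result carries both domains' multi-level features in a single tensor that the fifth block can process.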

Dataset Description
The dataset used in this experiment was provided by the University of Glasgow, UK [47]. It was acquired with a C-band 5.8 GHz FMCW radar sensor and consists of six human activities: walking, sitting down, standing up, picking up an object, drinking, and falling.
MATLAB tools were utilized to preprocess the raw radar data of this dataset, producing the TR and TD maps of the six human activities. The numerical matrices were then converted to RGB image format, with the matrix element values used as the pixel values of the images, resulting in a dataset of 1338 images for classification by the proposed network model. Figure 6 shows the TR maps of the six human activities, and Figure 7 shows the corresponding TD maps.

Environment Settings
The experimental setup for radar data preprocessing involved both hardware and software components. For hardware, an NVIDIA GeForce RTX 3080 graphics card was used on a Linux system running Ubuntu 18.04 LTS, and the computer was equipped with an Intel(R) Core(TM) i5-6300HQ CPU. For software, the experiments were conducted in MATLAB 2020a, commercial mathematical software developed by MathWorks (USA), and the data processing was implemented in Python 3.9 using PyTorch 1.13. To improve computation speed, the CUDA 11.0 parallel computing architecture was employed.
During training, the number of epochs was set to 50, with a batch size of 32. The stochastic gradient descent (SGD) optimizer was used, with a momentum of 0.9 and a weight decay of 0.0005. We used a learning rate warm-up strategy to speed up convergence and improve the model's generalization ability; it allows the model to approach the global optimum more efficiently while enabling finer adjustment of the parameters later in training. The initial learning rate was set to 0.0001 and grew linearly with the step size until it reached the maximum learning rate of 0.005, completing the warm-up phase; in the decreasing phase, the learning rate followed an exponential decay strategy.
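The warm-up-then-decay schedule can be sketched as follows; the warm-up length and decay rate are illustrative assumptions, while the initial (0.0001) and maximum (0.005) learning rates are taken from the text:

```python
def learning_rate(step: int,
                  warmup_steps: int = 1000,
                  lr_init: float = 1e-4,
                  lr_max: float = 5e-3,
                  decay: float = 0.999) -> float:
    """Linear warm-up from lr_init to lr_max over warmup_steps, followed by
    exponential decay. warmup_steps and decay are assumptions; only lr_init
    and lr_max come from the paper."""
    if step < warmup_steps:
        # warm-up phase: grow linearly with the step size
        return lr_init + (lr_max - lr_init) * step / warmup_steps
    # decreasing phase: exponential decay from the maximum
    return lr_max * decay ** (step - warmup_steps)
```

In PyTorch this rule could be attached to the SGD optimizer via `torch.optim.lr_scheduler.LambdaLR`.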
For the dataset, the data of the two domains were randomly divided into five equal subsets based on the activity classes. In each iteration of the 5-fold cross-validation, four subsets were used as the training set and one subset as the test set. All experimental results reported in this paper are based on the values obtained from this 5-fold cross-validation process.
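The stratified 5-fold split described above can be sketched as follows; the per-class round-robin assignment is an assumption about how the class-balanced subsets are formed:

```python
import numpy as np

def kfold_indices(labels: np.ndarray, k: int = 5, seed: int = 0):
    """Split sample indices into k folds stratified by activity label,
    mirroring the paper's 5-fold cross-validation protocol. Each class's
    samples are shuffled and dealt round-robin across the folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for i, sample in enumerate(idx):
            folds[i % k].append(sample)
    return [np.array(f) for f in folds]
```

In each of the k iterations, one fold serves as the test set and the remaining k − 1 folds form the training set.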

Experimental Results and Analysis
To verify the effectiveness of the proposed method, we conducted experiments comparing the MFFBN model with the single-domain feature network (SFN), and an ablation experiment by adding the MFL function and the MAFM in the MFFBN model. Additionally, the proposed method is compared with other HAR methods.

Assessment Indicators
Five evaluation metrics, including accuracy, Recall, Precision, F1-Score, and the confusion matrix, were utilized to evaluate the performance of the proposed model in this study. All experimental results were obtained by testing on the same dataset. The metrics are defined as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN), Recall = TP/(TP + FN), Precision = TP/(TP + FP), F1-Score = 2 × Precision × Recall/(Precision + Recall),

where TP and FN represent the numbers of positive samples predicted correctly and incorrectly, respectively, while TN and FP represent the numbers of negative samples predicted correctly and incorrectly. The Recall rate measures the percentage of correctly predicted positive-class samples out of all positive-class samples. The Precision rate measures the percentage of true positive samples out of all samples predicted as positive. The F1-Score is the harmonic mean of Precision and Recall.
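The per-class metrics can be computed directly from a confusion matrix; the sketch below follows the paper's convention of columns as true labels and rows as predicted labels:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class Recall, Precision, and F1-Score from a confusion matrix
    whose columns are true labels and rows are predictions."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=1) - tp   # predicted as class i but actually another class
    fn = cm.sum(axis=0) - tp   # actually class i but predicted as another class
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```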
In addition, the performance metrics of individual activities are shown through the confusion matrix, also known as the error matrix. Displaying the confusion matrix as a visual graph provides a clearer view of the model's classification results for each activity. The column elements of the confusion matrix represent the actual activity type, and the row elements represent the predicted activity type. Specifically, the walking, sitting down, standing up, picking up an object, drinking, and falling activities correspond to labels A1, A2, A3, A4, A5, and A6, respectively.

Comparison Experiment between the MFFBN and the SFN Models
This section compares the effectiveness of the MFFBN and SFN models. The SFN model uses the VGG13 network, while the MFFBN model is an improved VGG13 network that fuses features from the two domains. Table 2 shows that the recognition accuracy for human activities reaches 92.12% with TD maps but only 75.08% with TR maps, while the MFFBN model reaches 93.1%. These results show that the TD maps possess a higher confidence level than the TR maps, and that fusing TR and TD features can compensate for the shortcomings of single-feature recognition. To further analyze the correct and error rates of each activity in the three sets of experiments, confusion matrices were generated for the SFN model using TR maps, the SFN model using TD maps, and the MFFBN model. Figures 8 and 9 show that with TR maps, the recognition accuracy is lowest for the standing up, picking up an object, and drinking activities, while with TD maps it is lowest for the picking up an object and sitting down activities. Overall, the TD maps yield higher recognition accuracy than the TR maps for all activities. However, the TR maps exhibit lower probabilities of misclassifying the sitting down activity as picking up an object and as drinking. Thus, the two domains' feature maps offer different advantages for recognizing the six activities.
Figure 10 shows that fusing multi-domain features yields higher accuracy for the standing up, drinking, and falling activities than for the other three. Compared to the SFN, the accuracy is significantly higher for the standing up and drinking activities. However, the recognition accuracy for picking up an object is lower, and the rate of misclassifying it as drinking is higher. It is observed that the 2-D domain spectra of different activities can be similar. For instance, sitting down, standing up, picking up an object, and drinking are all stationary movements and exhibit similar RT maps; among them, picking up an object and drinking have the highest misclassification rate, 23.11%. Similarly, picking up an object and drinking are both bending activities and have similar DT maps, with a misclassification rate of 12.44%. Since both the RT and DT maps of these two activities are prone to confusion, fusing the two features still leaves the highest misclassification rate, 21.33%, between them. This indicates that directly combining the two 2-D domain spectral features in the proposed multi-domain feature baseline network does not reduce the misclassification rate between these two activities and therefore does not improve the recognition accuracy of the picking up an object activity. The subsequent experiments address this issue by incorporating the MAFM and the MFL function.

Ablation Experiment
To verify the effectiveness of the MAFM and the MFL function, we conducted the following ablation experiments: (1) introducing the MAFM into the MFFBN model, which is equivalent to the MFAFN model; (2) introducing the MFL function into the MFFBN model; (3) introducing the MFL function into the MFAFN model. Table 3 shows that the MFFBN model, which fuses the features of the TR and TD maps, reached an accuracy of 93.1%. Incorporating the MAFM increased the accuracy by 3.34%, and adding the MFL function increased it by 2.2%; thus, adding the MAFM to the MFFBN model yields a larger accuracy gain than adding the MFL function. When both the MAFM and the MFL function were included, the improvement was even more significant, with an increase of 4.48%.

By comparing Figure 11 with Figure 10, we observe that incorporating the MAFM into the MFFBN model improved the recognition accuracy for the walking, standing up, picking up an object, and falling activities. The most significant improvement was for the picking up an object activity, whose accuracy increased by 18.22%, mainly because the rate of misjudging it as drinking decreased by 17.33%.

When comparing Figure 13 with Figure 10, we observe that using the MFL function in the MFAFN model increased the recognition accuracy for the sitting-down activity by 1.34%. The recognition accuracy of the drinking activity remained at 100%, and the remaining four activities were all improved; the largest improvement, 20%, was achieved for the picking up an object activity.

By applying the framework of our proposed multi-domain fusion hybrid neural network model, we found that combining the two domains leverages their respective strengths. The results show that the information from the two domains is complementary: fusing them yields comprehensive information about human activity from the radar signals, reducing classification error rates and effectively improving the accuracy of HAR.
Moreover, the MFL function also effectively improves the activity recognition accuracy of multi-domain feature fusion networks. Figures 14-17 show the Recall, Precision, and F1-Score of each activity for the four networks in Table 3. The MFAFN model with the MFL function has the highest Recall for five activities: walking, standing up, picking up an object, drinking, and falling. Although the MFFBN model has the highest Recall for sitting down, the Recall of the MFAFN model with the MFL function is only 0.01 lower. Precision represents the proportion of true positive samples among the samples classified as positive; a lower Precision indicates a higher probability of misclassifying other samples as that class. The MFFBN model has the lowest Precision for the drinking activity, only 0.82, indicating a high probability of misclassifying other activities as drinking, whereas the MFAFN model with the MFL function increases the Precision for the drinking activity by 0.13.
Regarding the F1-Score, only the MFAFN model with the MFL function achieves a score of 1 for the falling activity, indicating accurate recognition without misclassifying other activities as falling. The F1-Scores for the picking up an object and drinking activities are lowest in the MFFBN model, at only 0.82 and 0.90, respectively, while the MFAFN model with the MFL function improves them by 0.11 and 0.07, respectively.

Figure 18 shows the training and validation losses of the MFAFN model with the MFL function as a function of the number of iterations. This loss curve was obtained during the initial debugging of the model using a training/validation/test split of 6:2:2, corresponding to 810, 264, and 264 samples, respectively; the final experimental results are based on five-fold cross-validation. The curves indicate that after 30 epochs, the training and validation losses stabilize, indicating convergence of the model. Specifically, the training loss remains stable at 0.001, indicating that the model has effectively learned the training data without overfitting. Overall, the convergence and stability of this model demonstrate the effectiveness of the MFAFN model with the MFL function for accurately recognizing human activities.

Noise Sensitivity Analysis
When plotting the 2-D domain spectrograms, we set values below a threshold to zero in the data matrix. Specifically, when plotting the RT maps, we set amplitudes below 60 dB to zero. To demonstrate the impact of different noise levels on the output results, we compared the experimental results with the threshold set at 80 dB. Figure 19 shows the RT maps for walking, sitting down, standing up, picking up an object, drinking, and falling with a threshold of 80 dB. Similarly, when plotting the DT maps, we set amplitudes more than 40 dB below the maximum value to zero, and compared the experimental results with the threshold set at 60 dB. Figure 20 shows the DT maps for the six activities with a threshold of 60 dB. In addition, we used these data as a test set, consisting of 264 RT and 264 DT maps, evaluated the trained MFAFN model with the MFL function on it, and achieved an accuracy of 90.23%. The confusion matrix results are shown in Figure 21. Comparing Figure 21 with Figure 13, it can be seen that the recognition accuracy for the sitting activity decreased by 22.66%, while that for the picking up an object activity decreased by 15.56%, indicating that noise has some effect on both activities.
In addition, the recognition accuracy for the sitting down, drinking, and falling activities changed less, indicating that noise had the least effect on these three activities.
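The amplitude thresholding described in this section can be sketched as follows; treating the threshold as relative to the map's peak value follows the DT-map description, and applying the same rule generically to any spectrogram is an assumption:

```python
import numpy as np

def threshold_spectrogram(spec_db: np.ndarray, drop_db: float = 60.0) -> np.ndarray:
    """Zero out spectrogram bins more than `drop_db` below the peak value,
    as done when plotting the RT/DT maps (60 dB and 40 dB in the paper;
    80 dB and 60 dB in the noise sensitivity comparison)."""
    out = spec_db.copy()                      # do not modify the input map
    out[out < out.max() - drop_db] = 0.0      # suppress low-amplitude noise
    return out
```

Raising `drop_db` keeps more low-amplitude content (and noise); lowering it produces a cleaner but sparser map.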

Comparison with Other HAR Methods
This section compares the proposed algorithm with the latest HAR methods applied to the same dataset, including [11,17,27,28,31,38]. According to Table 4, the proposed approach outperforms the recent studies with an accuracy improvement ranging from 0.93% to 5.58%. Table 4. MFAFN model and other HAR methods for the same dataset.

Method | Type of Data | Model Type | Accuracy (%)
[27] | µD, CVD, TR | CSA | 92
[38] | RTD | RD-CNN | 92.33
[28] | TR, TD, RD | CNN and LSTM | 93.39
[11] | TD | CNN and PCA | 95.3
[17] | TD | ResNet | 96
[31] | H, TD | CNN | 96.65
Proposed | TR, TD | MFAFN | 97.58

Among them, refs. [11,17] are single-domain networks that use TD maps as inputs for HAR. In [11], three deep learning methods were employed: LSTM, Bi-LSTM, and CNN. The CNN consisted of four convolutional layers, four maximum pooling layers, and four dense layers; combining PCA and data augmentation with the CNN achieved the highest recognition accuracy of 95.3%. In [17], ResNet was used to classify human activities, achieving 100% recognition accuracy for walking and falling, while the MFAFN performed better for standing up, picking up an object, and drinking.
In contrast, refs. [27,28,31,38] are multi-domain feature fusion networks. In [38], feature fusion was achieved by incorporating TD maps from different range locations; despite the differences in the selected ranges, accuracy could be improved by combining the features. Ref. [31] obtained an accuracy of 96.65% by fusing hand-crafted (H) features with DT map features extracted by a CNN. In [27], a fusion method based on the Chi-Square algorithm (CSA) achieved 92% accuracy by fusing features from multiple domains, including the spectrogram (µD), CVD, and TR; however, it was only 76.9% and 84.6% accurate in classifying the picking up an object and drinking activities, respectively. Despite the potential of multi-domain feature fusion to enhance activity recognition accuracy, no effective measures were taken to address confusable activities. In [28], 1-D CNNs and LSTMs extracted features from the TR and TD maps, while a 2-D CNN extracted features from the RD maps, accounting for their different 2-D spectral characteristics. However, its accuracy when fusing only the TR and TD maps was 92.24%, and combining all three features reached 93.39%. Both this network and the MFAFN model with the MFL function achieved 100% identification accuracy for the falling activity; nevertheless, our method outperformed this multi-domain feature fusion network for all three of the sitting down, picking up an object, and drinking activities.
We utilized TR and TD maps and employed the MFAFN model to fuse these features effectively, and we applied the MFL function to increase the weights of confusable samples, achieving a classification accuracy of 97.58%. Notably, our method outperforms the latest methods in the literature, particularly for two activities: picking up an object and drinking. Compared to [27], our method improves the accuracy of these activities by 17.32% and 15.4%, respectively; compared to [28], by 7.73% and 18.33%, respectively. Our method can therefore effectively improve the classification accuracy of confusable samples.

Conclusions
This paper presents a multi-domain feature fusion network based on an FMCW radar sensor that improves the accuracy of HAR by fusing the TR and TD domains of