A Rapid Response System for Elderly Safety Monitoring Using Progressive Hierarchical Action Recognition

The global trend of population aging presents an urgent challenge in ensuring the safety and well-being of elderly individuals, especially those living alone due to various circumstances. A promising approach to this challenge involves leveraging Human Action Recognition (HAR) by integrating data from multiple sensors. However, the field of HAR has struggled to strike a balance between accuracy and response time. While technological advancements have improved recognition accuracy, complex algorithms often come at the expense of response time. To address this issue, we introduce an innovative asynchronous detection method called Rapid Response Elderly Safety Monitoring (RESAM), which relies on progressive hierarchical action recognition and multi-sensor data fusion. Through initial analysis of inertial sensor data using Kernel Principal Component Analysis (KPCA) and multi-class classifiers, we efficiently reduce processing time and lower the false-negative rate (FNR). The inertial sensor identification serves as a pre-filter, enabling the identification of filtered abnormal signals. Decision-level data fusion is then executed, incorporating skeleton image analysis based on ResNet and the inertial sensor data from the initial step. This integration enables the accurate differentiation between normal and abnormal behaviors. The RESAM method achieves an impressive 97.4% accuracy on the UTD-MHAD database with a minimal delay of 1.22 seconds. On our internally collected database, the RESAM system attains an accuracy of 99%, ranking among the most accurate state-of-the-art methods available. These results underscore the practicality and effectiveness of our approach in meeting the critical demand for swift and precise responses in healthcare scenarios.


I. INTRODUCTION
We are witnessing a significant global demographic shift. In 2019, 9% of the world's population was aged 65 or older, a share projected to reach 16% by 2050, notably impacting Europe and North America. The number of individuals aged 80 and above is expected to triple from 143 million in 2019 to 426 million in 2050 [1]. This aging population trend transcends borders, affecting advanced economies on a global scale [2]. The consequence is a burgeoning demand for healthcare services, particularly concerning the health and well-being of elderly individuals who often reside independently within residential communities or extensive nursing facilities [3], [4]. The multifaceted challenges they confront include limited access to healthcare services, aggravated by physical limitations that curtail their mobility [5]. These challenges render seniors more susceptible to accidents or medical emergencies. Managing their health, medications, and chronic conditions, especially for those with multiple ailments, poses significant hurdles. Human Activity Recognition (HAR)-based surveillance algorithms have gained prominence in response.
The proliferation of Internet of Things (IoT) technology has introduced fresh avenues for tackling sensor-based HAR challenges, notably utilizing time-series data from wearable devices [6]. Accelerometers and gyroscopes, compact and widespread in low-cost devices, play a key role in HAR. However, inertial sensor-based action recognition, while fast [7], [8], falls short of video-based systems that benefit from richer contextual cues. For example, deep learning-based fall detection achieves just 86% accuracy when relying solely on accelerometer data from wrist-worn devices [9] due to the limitations of single-context accelerometer data lacking 3D information for discerning wrist movements during falls [10].
HAR based on RGB images shows advantages in accuracy [11]. RGB images contain rich information, including color, texture, and spatial relationships, allowing for a comprehensive understanding of human behavior in their surroundings. Deep learning methods such as convolutional neural networks (CNNs) have successfully extracted meaningful features from RGB images and achieved high accuracy in HAR tasks [12], [13]. The ability to capture scene context enables RGB-based HAR systems to recognize complex activities and interactions beyond the limitations of inertial sensor-based approaches. However, HAR systems based on RGB images face challenges in computational complexity and data storage requirements, especially with high-resolution videos. In addition, deploying cameras for human monitoring may raise privacy concerns, limiting the locations where the cameras can be installed.
Therefore, the skeleton image is introduced to avoid these issues. By representing only key joint positions and motions, skeletal data ensures privacy while capturing the underlying spatial relationships and temporal dynamics of human behavior [14]. Skeleton data has a small footprint, improving computational efficiency and reducing storage requirements. Deep learning models, such as graph convolutional networks (GCNs) [15] or recurrent neural networks (RNNs) [16], efficiently extract features from skeleton data for accurate, real-time action recognition. Fusing skeletal data with other sensor inputs, such as inertial data from wearable devices, enables multimodal HAR systems to provide comprehensive insights into human activity while preserving privacy and ethical considerations [17].
Information fusion [18] is a pivotal aspect of HAR, enhancing system accuracy by integrating data from various sources or modalities. Fusion occurs at three levels: data, feature, and decision [13]. Data-level fusion combines raw sensor data, like accelerometer and gyroscope inputs, to create comprehensive datasets. Feature-level fusion merges relevant features from heterogeneous data sources, such as combining the feature vectors of the RGB and depth images into a single vector. Decision-level fusion integrates the decisions obtained from multiple sources to reach a final decision. This fusion strategy enables HAR systems to recognize diverse activities accurately while enhancing system robustness.
In healthcare for elders, a timely response is critical to ensuring their safety and well-being. This paper proposes a Rapid Response Elderly Safety Monitoring (RESAM) system. This progressive HAR method combines the advantages of wearable inertial data and skeleton images while mitigating their respective disadvantages. By integrating information from both modalities, the RESAM system can gain a comprehensive picture of older people's activities, enabling faster and more accurate identification of their behavior. The RESAM system performs action recognition based on the motion details of the wearable device and quickly responds to potentially dangerous behaviors. In addition, skeletal data allows more precise identification results while maintaining privacy. Furthermore, the data fusion of the two recognition methods enhances the overall accuracy and robustness of the system. This holistic approach maximizes the potential of healthcare technology, ensuring prompt and effective responses to support and safeguard the well-being of the aging population. The key features are summarized below.
• Real-time response: The RESAM system prioritizes real-time responses to potentially dangerous behaviors of seniors. The system can quickly detect and identify critical activities by utilizing wearable device data and efficient motion recognition algorithms, enabling rapid intervention and assistance when needed.
• Privacy-preserving human action recognition: Skeleton data integration for human activity recognition ensures privacy preservation while maintaining accurate action recognition. Skeletal data represents key joint positions and motions without processing or storing detailed visual information, addressing privacy concerns and ethical considerations in healthcare settings.
• Enhanced accuracy and robustness: The RESAM system improves the accuracy and robustness of human activity recognition by fusing decisions from multiple modalities, including wearable device input and skeletal data. Combining information from the two sources allows for a more complete understanding of older adults' activities, leading to improved performance and reliability in recognizing and responding to various behaviors.
The rest of the paper is organized as follows. Section II illustrates the rationale and detailed design of the proposed RESAM system. In Section III, we introduce the datasets, evaluation metrics, experimental setup, and comparison of results. Finally, Section IV concludes the paper, summarizing the findings and suggesting future research directions.

II. RESAM: RATIONALE AND METHODS
Figure 1 illustrates the architecture of the RESAM system. It begins with the crucial step of data extraction, where skeletal and inertial sensor data are collected and labeled with corresponding tags. After that, the system performs feature extraction on the inertial sensor data, utilizing it for initial classification. If the detection results signify an emergency, the subsequent action involves obtaining more precise outcomes through the fusion of skeleton-based detection, which aligns with the inertial sensor data labels. Finally, whether it is an emergency or not, all data is stored in the cloud for future long-term health state analysis.
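The progressive flow described above can be sketched in a few lines. This is only an illustrative outline under stated assumptions: the function name `progressive_recognition`, the callables passed in, and the toy labels are hypothetical placeholders, not the classifiers or the cloud storage used in RESAM.

```python
from typing import Callable, Sequence


def progressive_recognition(
    inertial_window: Sequence[float],
    skeleton_clip: Sequence[float],
    fast_classifier: Callable[[Sequence[float]], str],
    fused_classifier: Callable[[Sequence[float], Sequence[float]], str],
    emergency_labels: frozenset,
    store: Callable[[dict], None],
) -> dict:
    """Two-stage flow: cheap inertial pre-filter, then fusion only if needed."""
    first_label = fast_classifier(inertial_window)
    record = {"stage1": first_label, "stage2": None, "final": first_label}
    if first_label in emergency_labels:
        # Only windows flagged as emergencies pay for the slower second stage.
        record["stage2"] = fused_classifier(inertial_window, skeleton_clip)
        record["final"] = record["stage2"]
    store(record)  # every record is archived, emergency or not
    return record


# Toy usage with stand-in classifiers (hypothetical thresholds and labels).
archive = []
result = progressive_recognition(
    inertial_window=[0.1, 9.8, 0.3],
    skeleton_clip=[0.0],
    fast_classifier=lambda x: "fall" if max(x) > 5 else "walk",
    fused_classifier=lambda x, s: "sit_down",
    emergency_labels=frozenset({"fall"}),
    store=archive.append,
)
```

In this toy run the first stage raises a "fall" alarm and the second stage revises it, which mirrors the idea that the fast stage favors sensitivity and the fused stage supplies precision.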

A. Inertial Data Pre-Processing
Thanks to the advancement of IoT technology, modern wearable devices are equipped with multiple sensors, such as accelerometers and gyroscopes, providing rich multidimensional data. Therefore, in motion recognition systems based on inertial sensors, the collected data is usually represented as a d-dimensional vector, where d is the number of sensor channels in the wearable device. Figure 2 shows an example of both normal and abnormal inertial signals, highlighting the difference between them. Because each sensor measures acceleration or angular velocity along three orthogonal axes (x, y, z), a wearable device containing an accelerometer and a gyroscope yields d = 6 channels: three from the accelerometer and three from the gyroscope. Each d-dimensional vector encapsulates the measurements of all sensor channels at a particular time, and the sequence of these vectors forms the raw sensor data collected over time. However, not all of the information contained in the raw data is useful. We need features that distinguish different actions, and we need to remove the part polluted by noise.
The dataset of inertial data includes $n$ samples in $d$ dimensions. Each sample can be represented as a $d$-dimensional vector $X_i \in \mathbb{R}^d$, where $i = 1, 2, \ldots, n$.

B. KPCA-Based Inertial Signal Analysis
Kernel principal component analysis (KPCA) is a nonlinear dimensionality reduction method [19]. The basic idea is to map the original, linearly inseparable data to a high-dimensional space through the kernel method, making it linearly separable there, and then to perform principal component analysis in that space. The first step in KPCA is to compute the kernel matrix $K$, which measures the similarity between pairs of data points in the original space. The most commonly used kernel is the Radial Basis Function (RBF) kernel, defined as:
$$K(X_i, X_j) = \exp\left(-\gamma \, \lVert X_i - X_j \rVert^2\right)$$
where $\gamma$ is a parameter known as the kernel bandwidth, and $\lVert \cdot \rVert$ denotes the Euclidean distance between $X_i$ and $X_j$.
Once the kernel matrix $K$ is computed, the next step is to double-center it, that is, to remove the row and column means so that the mapped data has zero mean in the feature space. The centered kernel matrix is:
$$K_c = \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right) K \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right)$$
where $n$ is the number of samples, $I$ is the identity matrix, $\mathbf{1}$ is a column vector of ones, and $\mathbf{1}^T$ is its transpose. After that, the eigenvalues $\lambda$ and eigenvectors $v$ of $K_c$ are calculated. The eigenvectors represent the principal components of the data in the higher-dimensional feature space, and the eigenvalues indicate the variance captured by each principal component.
Finally, we select the top $k$ eigenvectors corresponding to the $k$ largest eigenvalues to form the projection matrix $W \in \mathbb{R}^{n \times k}$, where $k$ is the desired dimensionality of the feature space. The transformed data in the feature space is obtained as:
$$Y = K_c W$$
where $K_c$ is the centered kernel matrix computed from the original dataset matrix $X$, whose entries are the centered inner products of the feature representations $\varphi(x_i)$ of the data points $x_i$ in the higher-dimensional space.
Using KPCA on inertial sensor data yields a potent feature representation, capturing nonlinear relationships and discriminative information and enhancing action recognition performance. KPCA transforms the data into a higher-dimensional space, unveiling intricate patterns not apparent in the raw data and preparing data for the classifier.
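The KPCA steps above (RBF kernel matrix, double-centering, eigendecomposition, projection) can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the random 6-channel data, the value of `gamma`, and the scaling of the eigenvectors by the square root of the eigenvalues (a common normalization) are all assumptions.

```python
import numpy as np


def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||X_i - X_j||^2) for all pairs of rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kpca_fit_transform(X, k, gamma):
    """Return the n x k projected data and the k largest eigenvalues."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - (1/n) 1 1^T
    Kc = J @ K @ J                           # double-centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    lam = eigvals[order]
    # Assumed normalization: unit-norm components in the feature space.
    W = eigvecs[:, order] / np.sqrt(np.maximum(lam, 1e-12))
    return Kc @ W, lam


# Toy data: 50 samples with d = 6 inertial channels (random stand-in values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Y, lam = kpca_fit_transform(X, k=3, gamma=0.1)
```

The projected columns are centered and mutually orthogonal, so they can be passed directly to any of the classifiers discussed in the next subsection.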

C. Rapid Response for Inertial Signal
The transformed data is prepared for classification after the feature extraction process using KPCA. In this study, multiple classifiers are individually employed to achieve the best performance. The selected classifiers include Support Vector Machine (SVM), Random Forest, and XGBoost. Each classifier is chosen based on its specific strengths and performance characteristics.
1) Support Vector Machine: The core principle of the Support Vector Machine (SVM) is to identify the optimal separating hyperplane that differentiates the categories [20]. Consequently, SVM frequently achieves enhanced outcomes on KPCA-processed data, since both rely on kernel mappings. SVM was initially designed for binary problems, so when faced with multi-class problems, the natural idea is to construct multiple binary classifiers and combine them, which is called the one-against-all (OVA) algorithm. The $m$-th SVM is trained with all of the examples in the $m$-th class with positive labels and all other examples with negative labels. Thus, given $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$, the $m$-th SVM solves the following problem:
$$\min_{w^m,\, b^m,\, \xi^m} \ \frac{1}{2}(w^m)^T w^m + C \sum_{i=1}^{l} \xi_i^m$$
$$(w^m)^T \phi(x_i) + b^m \ge 1 - \xi_i^m, \quad \text{if } y_i = m,$$
$$(w^m)^T \phi(x_i) + b^m \le -1 + \xi_i^m, \quad \text{if } y_i \ne m,$$
$$\xi_i^m \ge 0, \quad i = 1, \ldots, l,$$
where the training data $x_i$ are mapped to a higher-dimensional space by the function $\phi$ and $C$ is the penalty parameter. When data are not linearly separable, the penalty term $C \sum_{i=1}^{l} \xi_i^m$ reduces the number of training errors. After solving the problem for every class, $k$ decision functions are obtained:
$$f_m(x) = (w^m)^T \phi(x) + b^m, \quad m = 1, \ldots, k.$$
When there are multiple decision functions with positive outputs, the input $x$ is classified as the class with the largest decision function value: $\hat{y} = \arg\max_{m} f_m(x)$.
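The one-against-all decision rule can be illustrated with a tiny example. The sketch below assumes the mapping φ is the identity (a linear kernel) and uses made-up weight vectors and biases; it shows only the final argmax step, not the training of the individual SVMs.

```python
def ova_predict(x, weights, biases):
    """Pick the class whose decision function f_m(x) = w_m . x + b_m is largest.

    Assumes phi is the identity map; classes are numbered 1..k.
    """
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


# Hypothetical parameters for k = 3 one-against-all classifiers.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label, scores = ova_predict([2.0, 0.5], weights, biases)
# Two decision functions are positive here (2.0 and 0.5); class 1 wins.
```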
2) Random Forest: The Random Forest classifier operates through an ensemble of decision trees, where each tree independently predicts the class label of an input instance [21]. The final prediction is determined by aggregating the individual predictions through majority voting. The Random Forest construction process involves the following steps:
1) Bootstrap Aggregating (Bagging): From a training set of size N, N samples are drawn with replacement. The resulting bootstrap set forms the foundation for training one decision tree and serves as the samples at its root node.
2) Feature Subsetting: With each sample containing M attributes, at every node split during the formation of the decision tree, m attributes are randomly selected from the M attributes, where m ≪ M.
3) Node Splitting: For each decision tree node, a split attribute is chosen from the selected attributes based on a criterion such as information gain. Each node is split according to this attribute-selection strategy until further splitting is infeasible. The attribute used for splitting in the parent node is avoided for selection in the child nodes.
4) Ensemble Formation: This process is repeated numerous times, generating many decision trees. The ensemble of these trees constitutes the Random Forest.
5) Prediction and Aggregation: During prediction, an input instance is passed through each tree to obtain individual class predictions. The final prediction is determined through majority voting among the trees.
Random Forests create an ensemble of different decision trees, each trained on a different subset of data and with a different choice of attributes. Collective predictions of these trees yield robust and accurate classification models.
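The three randomized or aggregating ingredients listed above (bootstrap sampling, feature subsetting, and majority voting) can be sketched with the standard library alone. Tree induction itself is omitted; the helper names, the seed, and the toy per-tree predictions are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap set per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Choose m distinct candidate attributes out of n_features for one split."""
    return rng.sample(range(n_features), m)


def majority_vote(tree_predictions):
    """Aggregate the per-tree class labels into the forest's final label."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
indices = bootstrap_sample(10, rng)           # indices into a 10-sample training set
features = random_feature_subset(6, 2, rng)   # m = 2 of M = 6 attributes
final_label = majority_vote(["fall", "walk", "fall", "sit", "fall"])
```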
3) XGBoost: XGBoost (eXtreme Gradient Boosting) is a boosting algorithm in the ensemble learning family [22]. It handles sequential data well, making it practical to capture temporal dynamics in human actions. It manages high-dimensional inertial signal data and automatically selects the most informative features, simplifying feature engineering. As an ensemble learning method, XGBoost combines multiple models, typically decision trees, which is advantageous in recognizing complex actions. This approach mitigates the potential biases of individual sensors and sensor noise. XGBoost also offers model interpretability through feature importance scores, aiding in identifying the most relevant sensor measurements for action recognition. Its gradient-boosting mechanism allows iterative error correction, adapting well to the sequential nature of inertial data. Furthermore, XGBoost includes techniques to handle imbalanced datasets, a common challenge in human action recognition, supporting accurate recognition across all action classes. In summary, XGBoost's capabilities in sequential data analysis, feature selection, ensemble learning, interpretability, and adaptation to inertial signal characteristics make it a robust choice for inertial signal-based HAR.
The tree model used is the CART regression tree, which assumes a binary tree structure and repeatedly splits on feature values. For example, a node splits on the value of the $j$-th feature: samples whose value does not exceed the threshold $s$ go left and the others go right, shown as
$$R_1(j, s) = \{x \mid x^{(j)} \le s\}, \quad R_2(j, s) = \{x \mid x^{(j)} > s\}.$$
Following this core concept, the process involves successive feature segmentations to expand the tree. By iteratively adding trees, we learn new functions that fit the residuals of the previous predictions. After training with $K$ trees, predicting a sample's score involves directing it to a leaf node in each tree, with each leaf node representing a score. Ultimately, the predicted value of a sample is the sum of the scores corresponding to each tree. The comprehensive model is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ $(q: \mathbb{R}^m \to T,\ w \in \mathbb{R}^T)$ is the collection of all classification and regression trees, $x_i$ is a feature vector, $q$ is the structure of a tree, which maps a sample to a leaf node, $T$ is the number of leaf nodes of the tree, and each tree corresponds to its own structure $q$ and leaf node weights $w$. The objective function of XGBoost is defined as:
$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$
where $l$ is the selected loss function, which calculates the error between the predicted value $\hat{y}$ and the real value $y$, and the part after the plus sign is the regularization term, which reduces the complexity of the model and alleviates over-fitting. As mentioned above, the newly generated tree needs to fit the residual of the last prediction, so after iteration, the prediction with the $t$-th tree is:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i).$$
Therefore, the objective at iteration $t$ can be expressed as:
$$\mathrm{Obj}^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant},$$
where the constant collects the regularization terms of the first $t - 1$ trees. The next step is to find an $f_t$ that minimizes the objective function. The idea of XGBoost is to approximate it with its second-order Taylor expansion at $f_t = 0$.
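Therefore, the objective function is approximated as:
$$\mathrm{Obj}^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant},$$
where $g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ is the first derivative and $h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$ is the second derivative of the loss. Since the loss of the first $(t - 1)$ trees, $l(y_i, \hat{y}_i^{(t-1)})$, is already fixed and does not affect the optimization of the objective function, it can be removed directly together with the constant. At the same time, each sample eventually falls into a leaf node, so the samples can be regrouped by leaf: let $I_j = \{i \mid q(x_i) = j\}$ be the set of samples in leaf $j$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. With the regularization term $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$, the objective function can be rewritten as a quadratic function of the leaf node scores $w$:
$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T,$$
and the optimal $w$ and the corresponding objective function value follow from the vertex formula:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \quad \mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T,$$
where $T$ is the number of leaf nodes and $\frac{1}{2}\lambda \sum_j w_j^2$ is the L2 regularization of the leaf node scores. When an internal node is considered for splitting, the split is not made if the resulting reduction in the loss is less than $\gamma$; $\lambda$ is a similar penalty coefficient. $G_j$ and $H_j$ are the summations of $g_i$ and $h_i$ over leaf $j$, respectively. After obtaining the final objective function, it is only necessary to continuously generate the optimal classification and regression tree and integrate it into the existing model to form the final XGBoost model for classification.
The fast response these algorithms produce provides an initial judgment on the data obtained from the inertial sensors. However, the captured target actions need further evaluation due to the algorithm's potential inaccuracy. Therefore, the next step is an accurate classification based on the skeleton image captured simultaneously with the inertial sensors.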
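As a worked illustration of the closed-form leaf score, the sketch below evaluates the standard XGBoost expressions for the optimal leaf weight, the leaf's contribution to the objective, and the gain of a candidate split. It assumes a squared-error loss $l = \frac{1}{2}(y - \hat{y})^2$, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the function names and the toy numbers are illustrative, not taken from the experiments.

```python
def leaf_weight_and_score(y_true, y_prev, lam):
    """Optimal leaf weight w* = -G / (H + lam) and its objective contribution.

    Assumes squared-error loss 0.5 * (y - y_hat)^2, so g = y_hat - y and h = 1.
    """
    g = [p - y for y, p in zip(y_true, y_prev)]
    h = [1.0] * len(y_true)
    G, H = sum(g), sum(h)
    w_star = -G / (H + lam)
    score = -0.5 * G * G / (H + lam)
    return w_star, score


def split_gain(left, right, lam, gamma):
    """Loss reduction of splitting a leaf into (G_L, H_L) and (G_R, H_R).

    A non-positive gain means the split does not pay for the penalty gamma.
    """
    (GL, HL), (GR, HR) = left, right

    def term(G, H):
        return G * G / (H + lam)

    return 0.5 * (term(GL, HL) + term(GR, HR) - term(GL + GR, HL + HR)) - gamma


# One leaf holding three samples, previous prediction 0 for all of them.
w_star, score = leaf_weight_and_score([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], lam=1.0)
# G = -6, H = 3  ->  w* = 6 / 4 = 1.5 and score = -0.5 * 36 / 4 = -4.5

# Candidate split: left child with G = 2, H = 2; right child with G = -4, H = 2.
gain = split_gain((2.0, 2.0), (-4.0, 2.0), lam=1.0, gamma=1.0)
```

Note how the regularizer shrinks the leaf weight from the plain residual mean (2.0) to 1.5.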

D. Skeleton Image Pre-Processing
The Kinect sensor can construct a simplified human skeleton model using 20 key points, as shown in Fig. 3, without needing all 206 bones. Each joint point's spatial coordinates are denoted as $P(x, y, z)$, where $x$ and $y$ represent the abscissa and ordinate, respectively, and $z$ corresponds to the distance from the camera to the human body. During movement, the relative positions of these joints change. To better represent the offsets of the limb joint points relative to the hip joint and to remove the camera distance effect, the central node of the hip is used as the origin. The formula to calculate the initial spatial position feature is given by:
$$f = p_n - p_{hip}$$
where $p_n$ represents the other nodes excluding the hip joint, and $p_{hip}$ is the hip-center joint. Therefore, the differences in the X, Y, and Z coordinates are obtained using 3D vector subtraction. These differences form the feature vectors for the $m$-th frame:
$$F^m = [f_x^m, f_y^m, f_z^m]$$
with the size of 19 × 3. An entire action can be represented as the set of these feature vectors for all $M$ frames:
$$F = \{F^1, F^2, \ldots, F^M\}.$$
Due to the varying heights of individuals, the coordinate values of skeletons can differ. Before training on the database, a normalization process is conducted. Point 1 represents the hip center, while point 2 denotes the middle of the spine. The middle spine length is defined as the Euclidean distance between these points. This process involves determining the maximum length of the middle spine across the entire dataset and establishing it as a fixed parameter, Max_Middle_Spine. This normalization aims to ensure that every sample has the same length of the middle spine, thereby canceling out the effect of different heights of individuals. Therefore, the final normalized action space feature vector is:
$$F_{norm} = F \cdot \frac{\text{Max\_Middle\_Spine}}{\lVert p_2 - p_1 \rVert}.$$
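A minimal NumPy sketch of this pre-processing follows. It makes several assumptions that the text does not fix: the joint array is laid out as (frames, 20, 3), array indices 0 and 1 stand for joints 1 (hip center) and 2 (middle spine), and the middle spine length of a sample is taken as its mean over the frames of the clip.

```python
import numpy as np

HIP, SPINE = 0, 1  # assumed array indices of joint 1 (hip) and joint 2 (spine)


def skeleton_features(clip, max_middle_spine):
    """Hip-centered, spine-normalized features of shape (frames, 19, 3).

    clip: array of shape (frames, 20, 3) holding (x, y, z) for each joint.
    """
    hip = clip[:, HIP:HIP + 1, :]                      # (frames, 1, 3)
    offsets = np.delete(clip - hip, HIP, axis=1)       # drop the all-zero hip row
    # Assumption: one spine length per clip, averaged over its frames.
    spine_len = np.linalg.norm(clip[:, SPINE] - clip[:, HIP], axis=1).mean()
    return offsets * (max_middle_spine / spine_len)


# Random stand-in clip: 20 frames, 20 joints, 3 coordinates.
rng = np.random.default_rng(0)
clip = rng.normal(size=(20, 20, 3))
feats = skeleton_features(clip, max_middle_spine=0.5)
```

Because every joint is expressed relative to the hip, shifting the whole skeleton (for example, standing closer to the camera) leaves the features unchanged.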

E. ResNet-Based Accurate Classification
Residual Networks (ResNet) represent a significant advancement in deep learning architectures, specifically designed to address the vanishing gradient problem that can hinder the training of very deep neural networks. ResNet introduces the concept of residual blocks, which allows gradients to flow more effectively during backpropagation, enabling the training of much deeper networks.
Each residual block consists of skip connections, also known as shortcut connections, which bypass one or more layers. This helps mitigate the vanishing gradient problem and enables the training of deeper networks. The output of a residual block is a combination of its input and the output of the internal layers, and can be represented as:
$$y = F(x) + x$$
where $x$ is the input of the block and $F$ represents the transformation performed by the internal layers of the block. This formulation enables the network to learn the residual transformation, making it easier to optimize and to learn more complex features. In terms of layers, a typical ResNet architecture includes several convolutional layers and residual blocks. The specific architecture can vary, but the basic structure involves stacking multiple blocks together. The architecture often starts with initial convolutional and pooling layers, then a series of residual blocks, and concludes with fully connected layers for classification.
ResNet-20 is a specialized architecture designed to work efficiently with small-scale images, which is particularly relevant for scenarios like our skeleton images after processing [23]. In our case, the action represented in the skeleton matrix is 19 × 3 × 20, effectively making it a small image with three channels after transposition. The ResNet-20 architecture is well-suited for such compact images and is optimized for tasks like human action recognition using skeleton data. It allows for effective feature extraction and classification, which is crucial for accurately recognizing actions based on skeletal information.

TABLE I RESNET-20 ARCHITECTURE AND PARAMETER SIZES
Table I shows the parameters and architecture of ResNet-20. "Time" is the number of frames used to represent the whole action. "Input Size" represents the dimensions of the input skeleton image. The "Filter Size/Parameters" column describes the filter sizes and the number of parameters for each layer. "Output Size" shows the dimensions of the output feature maps or layers, while GAP means Global Average Pooling. The parameter count for the Fully Connected (FC) layer depends on the specific number of classes in the classification task. ResNet-20 allows the residual blocks to reach the desired depth while performing well, particularly on small-scale image classification tasks.
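The residual formulation used by these blocks can be shown in isolation. The sketch below is a simplification under stated assumptions: two dense layers with a ReLU stand in for the convolutional layers of a real ResNet-20 block, and the shapes and random weights are arbitrary.

```python
import numpy as np


def residual_block(x, W1, W2):
    """Return F(x) + x, where F is a small two-layer transformation.

    Dense layers replace the convolutions of an actual ResNet block here.
    """
    hidden = np.maximum(x @ W1, 0.0)   # first layer followed by ReLU
    fx = hidden @ W2                   # second layer: the residual F(x)
    return fx + x                      # skip connection adds the input back


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 feature vectors of width 8
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 8))
y = residual_block(x, W1, W2)

# With zero weights in the last layer, the block reduces to the identity,
# which is why very deep stacks of such blocks remain easy to optimize.
identity_out = residual_block(x, W1, np.zeros((8, 8)))
```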

F. Decision-Level Information Fusion
In the last step of our RESAM scheme, the Dempster-Shafer evidence theory (D-S theory) is adopted to integrate decision-level data [24]. The D-S theory is a robust framework for reasoning under uncertainty and for combining data from various sources. In RESAM, we use the basic probability assignment (BPA) function.
Two distinct sources of evidence, I and S, are considered to apply the D-S theory to the decision-level fusion. Each provides degrees of belief regarding various hypotheses within a frame of discernment, denoted as $\Theta$, where each source assigns a BPA $m: 2^{\Theta} \to [0, 1]$ with $m(\emptyset) = 0$ and $\sum_{A \subseteq \Theta} m(A) = 1$. The two BPAs are combined as:
$$m(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_I(B)\, m_S(C), \quad K = \sum_{B \cap C = \emptyset} m_I(B)\, m_S(C),$$
where $m_I(B)$ and $m_S(C)$ are the masses assigned to hypotheses $B$ and $C$ by the Inertial-based HAR (I) and the Skeleton-based HAR (S), and $K$ measures the conflict between the two sources.
Having combined the evidence from both sources I and S with Dempster's rule of combination, we compute the Belief (Bel) and Plausibility (Pl) functions for hypothesis $A$:
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \quad \mathrm{Pl}(A) = \sum_{B \cap A \ne \emptyset} m(B).$$
Here, Bel(A) represents the overall degree of belief in hypothesis $A$, considering both sources I (Inertial Sensor-based HAR) and S (Skeleton-based HAR). On the other hand, Pl(A) indicates the total degree of support that does not contradict hypothesis $A$ across both sources. The final decision-making process involves comparing the maximum Pl(A) and the maximum Bel(A), allowing us to choose the most plausible and believable hypothesis.
These functions are crucial for fusing Inertial Sensor-based HAR and Skeleton-based HAR evidence. Calculating belief and plausibility values enables comprehensive decision-making in human action recognition scenarios, ensuring that the system makes informed and accurate judgments based on multiple sources of evidence.
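To make the fusion step concrete, the following sketch applies the standard form of Dempster's rule of combination to two toy mass functions and then evaluates belief and plausibility. The two-class frame, the mass values, and the function names are hypothetical; they are not the BPAs produced by the RESAM classifiers.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two BPAs given as {frozenset_of_hypotheses: mass} dictionaries."""
    combined, conflict = {}, 0.0
    for (B, mass_b), (C, mass_c) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c        # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}


def belief(m, A):
    """Bel(A): total mass of all subsets of A."""
    return sum(v for B, v in m.items() if B <= A)


def plausibility(m, A):
    """Pl(A): total mass of all sets that intersect A."""
    return sum(v for B, v in m.items() if B & A)


FALL, WALK = frozenset({"fall"}), frozenset({"walk"})
THETA = FALL | WALK                               # frame of discernment
m_I = {FALL: 0.6, WALK: 0.1, THETA: 0.3}          # toy inertial-based BPA
m_S = {FALL: 0.7, WALK: 0.2, THETA: 0.1}          # toy skeleton-based BPA
m = dempster_combine(m_I, m_S)
bel_fall, pl_fall = belief(m, FALL), plausibility(m, FALL)
```

For these toy inputs the conflict is 0.19, and the fused mass on "fall" (0.69 / 0.81) exceeds what either source assigned on its own, which is the reinforcing behavior expected when two sources agree.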

A. Database and Evaluation Method
1) UTD-MHAD Database: Due to the time consistency requirements of the inertial sensor and skeleton data in the system design, the experimental data set must contain two kinds of data collected concurrently. Therefore, the UTD-MHAD data set is selected because it is deliberately designed to include a single Kinect camera and wearable inertial sensor [25]. The Kinect camera captures color images at 640 × 480 pixels and 16-bit depth images at 320 × 240 pixels, running at approximately 30 frames per second. The wearable inertial sensors used in the dataset were developed at the ESSP lab at the University of Texas at Dallas. The sensor consists of a 9-axis MEMS sensor that captures 3-axis acceleration, 3-axis angular velocity, and 3-axis magnetic field strength. It integrates a 16-bit low-power microcontroller, a dual-mode Bluetooth low-energy unit for wireless data transfer to a laptop/PC, and a serial link between the MEMS sensor and the microcontroller for control commands and data transfer.
The UTD-MHAD dataset covers 27 actions, including arm swings, hand waves, clapping, and motion-related movements. The actions were performed with the wearable inertial sensor on the subject's right wrist or right thigh, depending on whether the action was primarily arm- or leg-based. Actions 1 through 21 have the sensor on the right wrist, while actions 22 through 27 use the right thigh position. This database provides a valuable resource for studying and analyzing human motion in various scenarios and activities.
2) Self-Collected Database: While public databases can be used for synchronization purposes, they are not well suited to the specific requirements of elderly safety monitoring. In particular, they lack actions designed to represent behaviors associated with potential risks. To fully validate the fast response mechanism and the precision of our RESAM scheme, we conducted laboratory experiments in a simulated home environment, collecting proprietary data.
In this experimental setup, three Kinect V2 cameras were strategically placed in front of, to the right of, and behind the subject, capturing skeleton images of the actions from different directions simultaneously, as shown in Fig. 4. At the same time, a TicWatch Pro3 was used to collect inertial sensor data. Fifteen volunteers actively participated in the experiment, resulting in 389 data samples. In total, the experiment consisted of 12 different actions. Table II presents the actions and the number of samples in each class. This customized experimental setup allowed us to evaluate the system's efficacy in scenarios that mirror real-life situations, fine-tuning its rapid response mechanism and assessing its accuracy.
3) Evaluation Method: In healthcare, it is crucial to recognize that the accuracy rate alone may not adequately gauge diagnostic efficiency. The False Negative Rate (FNR), often called the missed diagnosis rate, assumes pivotal importance. To illustrate, when a routine action is erroneously classified as an emergency, this error can be rectified in a subsequent retest with minimal harm. In contrast, if a genuinely hazardous action is mistakenly perceived as normal and subsequently disregarded, the repercussions can be severe, including delayed disease diagnosis and the loss of an optimal treatment window.
Therefore, clinical practice emphasizes minimizing the FNR to ensure that critical conditions are not overlooked. It is calculated as in Eq. (26), where FN and TP are the numbers of False Negatives and True Positives:
$$\mathrm{FNR} = \frac{FN}{FN + TP} \tag{26}$$
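As a small worked example of this metric, the sketch below counts false negatives and true positives for a set of labels in which "fall" is treated as the positive (emergency) class. The labels and the function name are illustrative only.

```python
def false_negative_rate(y_true, y_pred, positive_labels):
    """FNR = FN / (FN + TP), with positives being the emergency classes."""
    pairs = list(zip(y_true, y_pred))
    fn = sum(1 for t, p in pairs if t in positive_labels and p not in positive_labels)
    tp = sum(1 for t, p in pairs if t in positive_labels and p in positive_labels)
    if fn + tp == 0:
        raise ValueError("no positive samples in y_true")
    return fn / (fn + tp)


y_true = ["fall", "fall", "walk", "fall", "walk"]
y_pred = ["fall", "walk", "walk", "fall", "fall"]
fnr = false_negative_rate(y_true, y_pred, positive_labels={"fall"})
# Three real falls, one of them missed: FN = 1, TP = 2, so FNR = 1/3.
```

The false alarm on the last sample does not enter the FNR at all, which reflects the point made above that false alarms are cheaper than missed emergencies.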

B. Experimental Results
1) Inertial Sensor Recognition: Table III compares our proposed approach with existing state-of-the-art methods in the inertial sensor-based action recognition field. When applied to the inertial data collected by the TicWatch, our method obtained the highest accuracy. This indicates the effectiveness of our approach and suggests its potential for various applications.
The accuracy for individual action types may be slightly lower than in some existing studies, but it remains within an acceptable range. This adaptability and robustness make our method a versatile tool that recognizes a wide range of human behaviors, even those previously considered more complicated or subtle.
Achieving an FNR as low as 1.2% is a significant milestone, signifying the system's ability to maintain a high level of sensitivity and reduce the risk of missed diagnoses.
2) Skeleton Image Based HAR: Recently, skeleton-based action recognition based on deep learning and neural networks has been widely used, as shown in Table IV. In this rapidly developing landscape, our method offers simplicity and effectiveness. Despite the challenges of skeleton-based action recognition, we have reached an accuracy comparable to that of the most sophisticated algorithms available today.
While complex algorithms can achieve impressive results, it is essential to remember that the balance between simplicity and efficiency is often the top priority in real-world applications.Our method not only meets the strict requirements of modern action recognition but also accomplishes this in a simplified and effective way.The simple algorithm allows our system to deploy in hardware with less powerful computing power, such as home computing centers, and maintain accuracy while responding quickly.
3) Decision-Level Fused Result Analysis: Table V provides a comprehensive comparison with state-of-the-art approaches, covering accuracy as well as power and time consumption. We have achieved superior accuracy while maintaining efficient power usage and rapid processing times.
Table VI reports the running time of the experiment under different hardware configurations. On high-performance GPUs, namely the GeForce 1080, GeForce 4080, and RTX A5000, the system's average response time is 2.04 seconds, 1.40 seconds, and 1.22 seconds, respectively. In contrast, the response time extends to 36.5 seconds when the system runs on a Raspberry Pi 4.
Our RESAM system was initially designed for the safety of older people living alone. In actual application scenarios, our target group may not have access to the high-performance equipment used in the laboratory. A 2017 report showed that the average response time of emergency services after a 911 call in urban communities was about seven minutes [45]. This response time can extend to 15 minutes or longer in rural environments. The results show that even when the system runs on a low-computing-power device such as a Raspberry Pi 4, it takes only about half a minute to respond. Our experimental results demonstrate its adaptability to low-cost edge devices with limited computational and storage capacity, which aligns with our primary goal.
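To put these numbers side by side, the following sketch expresses each reported detection latency as a fraction of the emergency dispatch time quoted above. The helper name `latency_ratio` is hypothetical. The 15-minute rural figure is the lower end of "15 minutes or longer", so the rural ratios are upper bounds.

```python
# Average response times quoted in the text (seconds).
detection_s = {
    "GeForce 1080": 2.04,
    "GeForce 4080": 1.40,
    "RTX A5000": 1.22,
    "Raspberry Pi 4": 36.5,
}
# Emergency dispatch times quoted in the text (seconds).
dispatch_s = {"urban": 7 * 60, "rural": 15 * 60}


def latency_ratio(detect_seconds: float, dispatch_seconds: float) -> float:
    """Detection latency expressed as a fraction of the dispatch time."""
    return detect_seconds / dispatch_seconds


for device, t in detection_s.items():
    urban = latency_ratio(t, dispatch_s["urban"])
    rural = latency_ratio(t, dispatch_s["rural"])
    print(f"{device:>15}: {urban:6.2%} of urban, {rural:6.2%} of rural dispatch time")
```

Under these figures, detection on the Raspberry Pi 4 adds under 9% to the urban dispatch time, and the GPU configurations add well under 1%.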

IV. FUTURE DIRECTION AND CONCLUSION
Action recognition, particularly when involving the fusion of inertial and skeletal data, presents notable challenges, with precise synchronization during data collection being paramount. Maintaining impeccable temporal alignment between these data sources demands meticulous attention and often necessitates specialized hardware setups. Creating a comprehensive database housing synchronized inertial and skeletal data proves essential for achieving accurate fusion-based action recognition results. One promising avenue lies in online medical care within the Metaverse, augmented by digital twins (DT). Harnessing the DT's capabilities, including real-time data integration, historical data storage, and AI-driven enhancements, enables the acquisition of vast information volumes, ultimately yielding more precise recognition outcomes. Moreover, the intricate human-machine interaction within DT models enhances the provision of insightful and actionable recommendations for decision-makers within the context of online healthcare scenarios [46].
In summary, this paper presents a fast and accurate method for human action recognition, harnessing the fusion of inertial sensor data and skeletal information. Our RESAM scheme leverages the unique strengths of both modalities, blending the spatial richness of skeletal data with the temporal dynamics captured by inertial sensors. Experimental results attest to its exceptional performance in precisely identifying diverse human actions. RESAM possesses the ability to deliver high-accuracy results while maintaining a low complexity. Moreover, its adaptability to various hardware configurations underscores its practicality and versatility. We employ the Dempster-Shafer evidence theory to provide a robust framework for decision-level data fusion, capable of integrating information from multiple sources under uncertain conditions, thereby achieving comprehensive and dependable action recognition. RESAM minimizes the risk of missed diagnoses in critical medical applications, where timely and accurate diagnosis is paramount. This study underscores the significant potential of multimodal fusion in enhancing the accuracy and responsiveness of human action recognition systems, with anticipated advancements in healthcare, security monitoring, and beyond.
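For readers unfamiliar with Dempster-Shafer fusion, the sketch below illustrates the standard Dempster rule of combination on a two-class frame (normal versus abnormal). It is a generic illustration, not the RESAM implementation: the mass values assigned to the inertial and skeleton branches are hypothetical, and how classifier outputs are mapped to masses is an assumption left outside this example.

```python
from itertools import product


def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions (frozenset -> mass) using Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the common hypothesis set.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint hypotheses contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("Sources are in total conflict; the rule is undefined")
    # Normalize by 1 - K so the fused masses sum to one.
    return {hyp: mass / (1.0 - conflict) for hyp, mass in combined.items()}


NORMAL = frozenset({"normal"})
ABNORMAL = frozenset({"abnormal"})
EITHER = NORMAL | ABNORMAL  # mass left on the whole frame models uncertainty

# Hypothetical masses for illustration only.
inertial = {ABNORMAL: 0.6, NORMAL: 0.1, EITHER: 0.3}
skeleton = {ABNORMAL: 0.7, NORMAL: 0.2, EITHER: 0.1}

fused = dempster_combine(inertial, skeleton)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

With these example masses, the fused belief in "abnormal" (about 0.85) exceeds that of either source alone, which shows how two moderately confident and agreeing modalities reinforce each other at the decision level.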

Fig. 2. Examples of inertial signal for normal and abnormal activities.

Fig. 4. Skeleton images for the same action from 3 cameras.

TABLE II. COUNT OF ACTIONS FOR SELF-COLLECT DATABASE

TABLE III. PERFORMANCE COMPARISON OF SKELETON-BASED ACTION RECOGNITION IN TOP-1 ACCURACY (%). THE BEST ONE IS IN BOLD, AND THE SECOND ONE IS UNDERLINED