Deep Learning for Computer Vision based Activity Recognition and Fall Detection of the Elderly: a Systematic Review

As the percentage of elderly people in developed countries increases, the healthcare of this group becomes a growing concern, especially when it involves preserving their autonomy. In this direction, many studies are being published on Ambient Assisted Living (AAL) systems, which help to reduce the concerns raised by the independent living of the elderly. In this study, a systematic review of the literature is presented on fall detection and Human Activity Recognition (HAR) for the elderly, as the two main tasks to solve to guarantee the safety of elderly people living alone. To reflect the current trend in how these two tasks are performed, the review focuses on the use of Deep Learning (DL) based approaches on computer vision data. In addition, collections of DL models, datasets, and hardware (e.g., depth or thermal cameras) are gathered from the reviewed studies and provided for reference in future studies. Strengths and weaknesses of existing approaches are also discussed and, based on them, our recommendations for future works are provided.


Introduction
The global population is experiencing rapid growth, accompanied by a significant increase in life expectancy, particularly in developed countries. Bloom and Luca [1] note that life expectancy in China and India has surged by nearly 30 years since 1950. Consequently, a substantial portion of the population in developed nations, approximately 20%, is aged 60 and above, a figure projected to surpass 30% in the next four decades.
With this demographic shift comes a growing concern for elderly care, as the need for assistance and support rises proportionately. Among the myriad challenges faced by the elderly, falls represent a particularly prevalent and perilous occurrence. The World Health Organization highlights alarming statistics on falls, identifying them as the second leading cause of unintentional injury deaths worldwide. Each year, an estimated 684,000 individuals succumb to fall-related injuries globally, with an additional 37.3 million falls severe enough to necessitate medical attention [2]. Apart from the physical harm incurred by the elderly, the economic ramifications are substantial, with fall-related treatment costs comprising a significant portion of healthcare expenditures in various countries such as the USA, Australia, EU15 and the United Kingdom [3].
Automated fall detection for the elderly is feasible through data collected from wearable or environmental devices, such as accelerometers, gyroscopes, and cameras. Furthermore, Human Activity Recognition (HAR) holds promise for diverse applications, ranging from automatic life-logging to identifying patterns indicative of illness [4,5]. Vision data from cameras is increasingly utilized for fall detection and HAR tasks due to its numerous advantages over wearable devices or other sensors. These advantages include the ability to detect multiple events simultaneously, suitability for various subjects, environments, and tasks, as well as ease of installation and visual verification of data [6].
From an algorithmic standpoint, Deep Learning (DL) has revolutionized digital image processing, emerging as the state-of-the-art approach in numerous domains [7]. Over recent years, a plethora of DL architectures have been developed and evaluated
Table 1 Comparison of previous reviews with ours. By columns, important aspects taken into account in this review, and whether they are addressed or not by each review. From left to right: if the review is systematic; focuses on the exploration of DL solutions; targets elderly people as the users of the system; centers on the use of vision data; explores the Fall Detection and the Human Activity Recognition tasks; explores the use of RGB, depth or infrared data; takes privacy as a critical concern; describes the hardware used in the found studies; and if it studies the deployment of the FD and HAR systems in real environments.
While prior reviews have addressed various aspects of our research domain, notable differences underscore the necessity of our study. Table 1 sheds light upon this by displaying the pivotal aspects considered in the current review, along with whether they are addressed or not in the aforementioned reviews.
The sole review exclusively focusing on DL techniques was conducted by Alam et al. [14], which, however, omitted HAR from its scope, thus neglecting a significant portion of studies included in our analysis. In contrast, other reviews encompassed techniques employing handcrafted features or classical vision approaches, reflecting a broader scope than our exclusive focus on DL-based solutions. Furthermore, previous reviews often overlooked the importance of studying DL-related nuances, such as the significance of training datasets, architectural considerations, and feature extraction methods. In our review, we meticulously categorize and elucidate these nuances through a comprehensive taxonomy of identified techniques.
Another notable observation is the limited attention given to HAR in several previous reviews, with some omitting the task altogether. As a result, our review unveils a greater number of studies dedicated to fall detection and HAR in the elderly. Additionally, our analysis delves deeper into the intricacies of these tasks, providing a more comprehensive understanding.
Moreover, only a few reviews explored applications within AAL systems and the associated privacy implications. Hardware specifications, beyond the prevalent use of Kinect cameras, were rarely examined, and the effective deployment of fall detection or HAR systems was not thoroughly explored. In contrast, our review emphasizes these aspects, which are pivotal in facilitating the transfer of these systems to society.
Finally, it is worth noting that, apart from [13], none of the previous reviews adhered to a systematic review process. By rigorously following the systematic review methodology outlined by Kitchenham and Charters [8], our study ensures a robust and unbiased selection and analysis of relevant studies. We conducted a comprehensive search across various databases, employing well-defined search strings aligned with our research questions. Each study underwent careful quality assessment, and strict exclusion criteria were applied to ensure the inclusion of only the most relevant and high-quality literature. This systematic approach minimizes potential biases and ensures that our review is based on a well-rounded selection of literature.

Review Questions
As outlined in [8], specifying the research questions is a critical aspect of any systematic review, as they guide the entire methodology: from the search process identifying primary studies to address them, to the data extraction process extracting the required data items, and finally to the data analysis synthesizing the data to answer the questions. The review questions for this study are presented in Table 2. The first research question, RQ1, aims to identify the methods used to recognize activities or detect falls among elderly individuals. The choice to specifically investigate HAR and fall detection stemmed from an exploratory initial search, where they emerged as the two most relevant recognition tasks in AAL for the elderly. Given that visual data offers numerous advantages over other sensor data types, such as visual verification and simultaneous subject recognition, and DL has become the state-of-the-art approach in computer vision, conducting an in-depth analysis of the most prevalent methods with these characteristics is crucial for informing future research in this domain. Furthermore, three research subquestions are included regarding common data types (e.g., RGB, depth, thermal, etc.), DL architectures (e.g., CNN, RNN, etc.), and datasets found in the reviewed literature. These subquestions aim to delve deeper into the solution choices at different design steps, which are closely related to various requirements such as privacy preservation, result stability, and inference speed.
The second research question, RQ2, emerges as a significantly unexplored area, as highlighted in Table 1 of the Background section. Many previous reviews have focused on the recognition phase of previous studies, enumerating common methods, processing steps, and datasets. However, the effective deployment in real-world scenarios is pivotal for the transfer of such methods to society, and this aspect remains largely unexplored. Works with implementations in real environments, whether through the use of assistive robots or camera-based setups, are expected to be found among the selected studies. Therefore, it is desirable to explore their design choices, setups, and encountered challenges in greater depth. Additionally, privacy is a particularly concerning aspect to consider when dealing with users, especially when utilizing visual data from cameras, and the approaches to addressing it are of interest for future research. For these reasons, RQ2.1 and RQ2.2 delve into common hardware choices and privacy preservation strategies.

Review Methods
In this section, we provide a detailed description of the systematic review protocol followed, based on the guidelines outlined by Kitchenham and Charters [8]. Firstly, we list and analyze the primary data sources used, providing visualization of the distribution of studies among these sources. Next, we define the search strategy, which encompasses search terms, synonyms, and time restrictions. Following this, we establish criteria for inclusion and exclusion of studies, followed by the design of a quality assessment checklist to identify and remove low-quality studies. Finally, in the data extraction and synthesis stage, we define how information from each primary study is obtained and outline the specific attributes considered of interest.

Data sources
For this systematic review, we selected five primary data sources: SCOPUS, Web of Science (WOS), IEEE Xplore Digital Library, ACM Digital Library, and PubMed.
SCOPUS and WOS were chosen as comprehensive digital libraries covering a wide range of disciplines, while IEEE Xplore focuses on engineering and technology, ACM Digital Library specializes in computer science, and PubMed is centered on biomedical studies. This selection ensures the inclusion of relevant literature from diverse domains, maximizing the breadth of content considered in our review.
The distribution of studies retrieved from each source is illustrated in Figure 1. As depicted, the majority of studies were sourced from ACM and SCOPUS, with only a small fraction (110 out of a total of 2,616) obtained from PubMed.

Search strategy
We constructed different query strings tailored to match the syntax of each digital library while minimizing differences and employing consistent synonyms for the concepts being searched. Each query string connected the various concepts using logical AND, while synonyms for each concept were connected with logical OR. To account for inflection of certain keywords, we utilized the "*" operator after the root word to allow for any possible word endings. In the SCOPUS library, the search was restricted to titles, abstracts, or keywords, because searching the full text returned an impractically large number of results, most of them of little relevance. In the remaining databases, the entire text was searched. The primary concepts searched, along with their corresponding lists of synonyms, are as follows:
• Task to perform (activity recognition or fall detection): "action recognition" OR "activit* recognition" OR "fall* detection" OR "behaviour recognition" OR "behaviour detection" OR "physical activity recognition"
• Ambient Assisted Living: "monitoring" OR "assist* living" OR "AAL" OR "smart home" OR "activit* of daily life" OR "activit* of daily living" OR "ADL"
• Target collective (elderly people): "elder*" OR "old* people" OR "senior"
• Kind of data used (Computer Vision): "vision" OR "rgb" OR "video" OR "image" OR "skeleton" OR "depth" OR "camera" OR "gesture"
Initially, we included studies published from 2013 onwards in the search. However, upon further examination, we observed that the majority of relevant studies were published recently. Consequently, we decided to limit the review to the last five years. Figure 2 displays the accumulated relevant studies from 2013 to 2023. As depicted, only 19 relevant articles were found during the first six years, while 151 were discovered in the last five. This trend underscores the increasing significance of DL-based strategies for HAR and fall detection. By focusing on studies published in the last five years, we aim to provide a deeper analysis of recent trends.
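The way the query strings are assembled (synonyms joined with OR inside each concept, concepts joined with AND) can be sketched as follows. This is a minimal illustration only: the `CONCEPTS` dictionary and the `build_query` helper are hypothetical names introduced here, and the library-specific syntax (for instance, restricting the SCOPUS search to titles, abstracts, or keywords) is deliberately left out.

```python
# Concept groups copied from the search strategy; the dictionary keys and the
# helper name are hypothetical and only serve this illustration.
CONCEPTS = {
    "task": ["action recognition", "activit* recognition", "fall* detection",
             "behaviour recognition", "behaviour detection",
             "physical activity recognition"],
    "aal": ["monitoring", "assist* living", "AAL", "smart home",
            "activit* of daily life", "activit* of daily living", "ADL"],
    "target": ["elder*", "old* people", "senior"],
    "data": ["vision", "rgb", "video", "image", "skeleton", "depth",
             "camera", "gesture"],
}


def build_query(concepts):
    """Join synonyms with OR inside each concept and concepts with AND."""
    groups = []
    for synonyms in concepts.values():
        quoted = " OR ".join(f'"{term}"' for term in synonyms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(CONCEPTS)
print(query)
```

Keeping the synonym lists in one place and generating the string from them is one way to keep the queries consistent across libraries; each library would still need its own field restrictions and wildcard conventions applied on top.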
The results of study collection and duplicate removal are illustrated in Figure 3. A total of 2,616 studies were collected from the different sources using the aforementioned queries, of which 633 duplicates were detected and removed, leaving a total of 1,983 studies.

Study selection
After collecting studies from various sources, limiting by year, and removing duplicates, exclusion criteria were applied to eliminate non-relevant studies. The exclusion criteria were as follows:
• Deep Learning: Studies not utilizing DL were considered irrelevant for this review. Including this criterion in the exclusion criteria rather than in the query strings enabled the inclusion of more relevant studies, since many studies did not directly reference DL but instead used the name of a specific model.
• Language: Studies not in English or Spanish were excluded.
• Data Type: Studies using data types other than RGB, depth, or IR were excluded. This includes both videos and images. Skeleton data was also included, but only if computed from the other three types of data. Studies using sensory data along with visual data were also included, allowing for multimodal approaches.
• Accessibility: Studies not accessible for various reasons, such as being part of paid content (e.g., book chapters), source website down, or retracted content, were excluded.
• Redundancy: In cases where a journal article extended a work already presented in a conference, the conference proceedings publications were omitted, as the journal article represented an extension of the same work.
• Task: Studies focused on tasks other than HAR or fall detection, such as velocity estimation, gait trend, level of tiredness, etc., were excluded. However, studies that did not directly perform HAR or fall detection but presented a new dataset for these tasks were included.
• Target Collective: Studies not centered on elderly people were excluded. Merely mentioning the elderly as one of the beneficiaries of the work was insufficient; the study had to either use data from elderly people or have them in mind when designing the experiment.
• Works in Progress: Conference proceedings about works in progress, containing only the initial stages of the study and lacking the experimentation phase, were excluded.
• Quality: Publications with very poor quality (e.g., null reproducibility, highly biased decisions, too small datasets, etc.) were excluded. More information about quality assessment can be found in Section 4.4.
As depicted in Figure 3, only 151 studies remained after applying the exclusion criteria, comprising 89 conference proceedings and 64 journal articles. The conference proceedings were retained for analysis among the relevant studies, as they serve as a standard search strategy to address publication bias, which can lead to systematic bias in systematic reviews unless special efforts are made to address this issue [8].

Quality assessment
Given the absence of a universally agreed-upon definition of study "quality," the proposed guidelines in [8] were adhered to, primarily focusing on bias and validity as measures of quality. Specifically, the following aspects were taken into account:
• Reproducibility: Assessing whether the work can be replicated. This can be achieved by disclosing the dataset used, using external datasets, and either publishing the code used for the model or providing sufficient details to recreate the model.
• Comparison with Other Works: Evaluating whether the performance of the model is compared with the state-of-the-art. It is essential to ensure that comparisons are made under fair conditions, meaning that the models should be trained and tested on the same data to avoid introducing bias.
• Use of External Datasets: Considering whether the model is tested on external datasets to mitigate possible bias from the data and facilitate comparison with other models for the same task. Additionally, using external datasets allows other studies to utilize the results without the need to retrain the model on different data.
These aspects were included in the list of fields during the data extraction phase (last three fields), as discussed in Section 4.5. Moreover, these quality aspects were also used as exclusion criteria, as previously mentioned in Section 4.3.
In addition to these aspects, the type of study, either conference proceedings or journal articles, was also considered as a quality indicator, with journal articles typically being longer and more mature.

Data extraction and synthesis
From each study remaining after applying the exclusion criteria, various data points were extracted to summarize the content and establish taxonomies for various aspects of interest. All data were compiled into a table, with each entry containing the following fields:
• Title
The complete list of relevant studies is provided in Tables 4 and 5, which display only basic information for each study. The remaining information will be synthesized in Section 5 through tables and plots, allowing for an overview of the distribution of works by used data types, DL model families, datasets, etc. Additionally, particularly relevant or interesting aspects of the works will be summarized, and important concepts will be addressed in more detail.

Results
This section provides an overview of the primary studies discovered through the systematic search process and presents the findings. Each study is thoroughly examined, and summaries are presented in the form of tables and graphs where applicable. Subsections are structured to address individual research questions, enhancing readability and organization.

RQ1: Fall detection and Human Activity Recognition
The review primarily focuses on two main tasks: fall detection and Human Activity Recognition (HAR). It is worth noting that fall detection can be viewed as an especially important activity within HAR. As illustrated in Figure 4, fall detection has received the most attention in the past five years, with a total of 72 studies, while HAR has been explored in 52 studies. This discrepancy highlights the significance of fall detection when concerning the elderly population. Many works emphasize the importance of accurately and swiftly identifying falls among the elderly, given the potential for injuries and health implications if prompt actions are not taken. Consequently, several studies mention integrating fall detection into systems or applications capable of alerting medical personnel [41,49]. Only 27 out of 151 studies (approximately 18%) address both tasks simultaneously. This disparity arises from the emphasis placed on fall detection compared to other activities (such as walking or standing up), as well as the limited availability of data concerning fall scenarios, often resulting in an imbalanced problem. However, some studies manage to address both tasks. For instance, in [24,67,104], both tasks are addressed using the UP-FALL dataset [168], which includes five types of falls and six common activities. This balanced dataset allows for the preservation of the importance of accurately detecting falls amidst other activities. A similar approach is adopted in [106], where a custom dataset with egocentric videos is utilized. Nevertheless, there are studies that treat falls as just another activity to recognize [30,47,161].

RQ1.1: data type
Among the studies collected, three types of vision data were considered: RGB, depth, and infrared (IR). The distribution of these data types is illustrated in Figure 5. RGB data were the most prevalent for fall detection and HAR among the elderly (132 studies), followed by depth data (30 studies), with IR data being the least utilized (6 studies). This discrepancy can primarily be attributed to the accessibility of common cameras compared to specialized ones equipped with depth or infrared sensors. Additionally, RGB cameras offer benefits such as lower costs and easier visual data inspection. Notably, infrared cameras are less frequently employed, typically positioned overhead (top-down perspective) and characterized by very low resolutions, allowing for the use of simpler CNN models [45,122], as well as non-convolutional models like LSTM [119,159] and Transformer [119]. Depth cameras are more commonly used than infrared ones, although they are often employed to extract skeleton joints rather than directly performing fall detection and HAR. Specifically, 67% of studies utilizing depth data computed skeleton joints before classification [38,152], while the remaining 33% did not [20,148,155].
Skeleton poses and sequences emerged as prevalent data types across the reviewed studies, with 67 studies incorporating skeleton data in some form. Given the human-centric nature of HAR and fall detection tasks, skeletal data represent logical features, offering efficient information compression while maintaining interpretability. Skeletons are typically represented as ordered sets of coordinates of body landmarks, either in 2D [24,61,107] or 3D [22,38,152] positions, depending on whether they were estimated from RGB or depth data, respectively. When skeleton estimation is performed on videos, the result is a sequence of skeleton poses with an added temporal dimension, enabling exploration of pose evolution over time intervals. Thirty-five studies employed the evolution of one or more body landmarks for fall detection or HAR [49,107], while the remaining 32 studies performed recognition using static poses exclusively [42,118].
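To make the distinction between static poses and landmark evolution concrete, the following minimal sketch represents a 2D skeleton sequence as frames of ordered landmark coordinates and derives one static and one temporal feature. The landmark names, their order, the toy coordinates, and the helper functions are all hypothetical; actual pose estimators define their own landmark sets and coordinate conventions.

```python
# A skeleton sequence as T frames x J landmarks x 2 coordinates (x, y).
# Landmark order and names are hypothetical; real estimators define their own.
LANDMARKS = ["head", "hip", "left_foot", "right_foot"]

sequence = [
    [(0.50, 0.20), (0.50, 0.50), (0.45, 0.90), (0.55, 0.90)],  # frame 0
    [(0.50, 0.30), (0.50, 0.60), (0.45, 0.90), (0.55, 0.90)],  # frame 1
    [(0.55, 0.70), (0.50, 0.85), (0.40, 0.90), (0.60, 0.90)],  # frame 2
]


def static_pose_features(frame):
    """Flatten one pose into a feature vector (no temporal information)."""
    return [coord for point in frame for coord in point]


def landmark_velocity(seq, name, fps=1.0):
    """Per-step (dx, dy) displacement of one landmark, scaled by frame rate."""
    j = LANDMARKS.index(name)
    track = [frame[j] for frame in seq]
    return [((x1 - x0) * fps, (y1 - y0) * fps)
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


pose_vec = static_pose_features(sequence[0])   # static-pose input
hip_vel = landmark_velocity(sequence, "hip")   # temporal input
```

A classifier working on static poses would consume vectors like `pose_vec` one frame at a time, whereas a sequence model would consume tracks such as `hip_vel`, in which a sudden increase in vertical displacement is the kind of cue associated with a fall.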
In addition to vision data, some studies utilized sensor data to enhance system performance, employing different models or strategies for classification and subsequently fusing the results. Fourteen studies, listed in Table 5, utilized at least one of the following types of sensor data: Inertial Measurement Unit (IMU), audio, barometer, luminosity, radar, electrocardiogram (ECG), GPS, and network traffic. IMU data was the most commonly used, featured in 10 of the 14 studies, particularly for fall detection (in 6 out of 10 studies using IMUs), owing to its effectiveness in identifying abrupt movements and subsequent immobility [64,149]. Barometer, GPS, radar, luminosity, and ECG data were consistently employed in conjunction with IMU data. Barometer and luminosity data served to acquire auxiliary or redundant information to enhance recognition consistency [74,149]. ECG data in [47] was utilized to identify inconsistencies in recognition and trigger specific further computations. In [85], four types of data (IMU, audio, radar, and GPS), along with visual data, were used for federated learning, where independent models were trained using different data modalities.
Regarding data fusion, no instances of early fusion were found. Instead, intermediate (7 studies) and late (5 studies) fusion methods were prevalent. Late fusion involved using a model for each data modality to produce a classification result, with the final classification determined using either voting [74,89] or weight attribution methods [44,69,149]. In intermediate fusion, different models extracted features from various modalities, with a final model performing classification based on concatenated feature inputs. Various final models were utilized, including CNN [64], fully connected layers [54,72], SVM [99], and stacked classifiers [77]. In two studies, no fusion was performed, with different options provided for classification using distinct data modalities [47,127].
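The difference between the fusion strategies can be illustrated with a short sketch. It is not taken from any of the reviewed studies: the function names, class labels, probabilities, and weights are assumptions chosen only to show where in the pipeline the modalities are combined.

```python
from collections import Counter


def late_fusion_vote(predictions):
    """Late fusion by majority vote over per-modality class labels."""
    return Counter(predictions).most_common(1)[0][0]


def late_fusion_weighted(probabilities, weights):
    """Late fusion by weighted sum of per-modality class probabilities."""
    classes = probabilities[0].keys()
    scores = {c: sum(w * p[c] for p, w in zip(probabilities, weights))
              for c in classes}
    return max(scores, key=scores.get)


def intermediate_fusion(feature_vectors):
    """Concatenate per-modality features for a single final classifier."""
    return [value for vec in feature_vectors for value in vec]


# Three modality-specific models each output a label (hypothetical values).
vote = late_fusion_vote(["fall", "no_fall", "fall"])

# Two models output class probabilities; the second modality is trusted more.
weighted = late_fusion_weighted(
    [{"fall": 0.6, "no_fall": 0.4}, {"fall": 0.2, "no_fall": 0.8}],
    weights=[0.3, 0.7],
)

# Features from three modalities are joined before the final model.
fused = intermediate_fusion([[0.1, 0.2], [0.9], [0.4, 0.5, 0.6]])
```

In late fusion each modality already has its own decision, so only labels or probabilities are combined; in intermediate fusion the concatenated vector `fused` would be passed to a final model such as the CNN, fully connected, or SVM classifiers mentioned above. With an even number of voters, ties in `late_fusion_vote` resolve to the label seen first, which a real system would need to handle explicitly.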

RQ1.2: DL models
Table 6 provides a summary of all DL models utilized in the analyzed studies. These models are often employed for various specific tasks, including skeleton joints estimation, optical flow computation, and feature extraction. Moreover, the input data for these models encompasses not only images or videos but also features frequently computed by other DL models, such as 2D or 3D skeleton poses and optical flow.
A taxonomy of the identified DL models, based on different characteristics, is presented in Figure 6, offering a total count for each category. There is considerable diversity in the utilization of these models, regarding datasets used for evaluation, data types, and methodology. As demonstrated in the next section, in Table 7, a wide range of datasets was employed across the analyzed studies, with many utilizing custom datasets. Additionally, prominent datasets like URFD and UP-FALL offer various data types, including RGB recordings, depth, skeletons, accelerometers, etc., which may lead to data differences even when studies are evaluated on the same dataset. The methodology for training and testing DL methods also varies across studies, with some employing k-fold cross-validation, leave-one-out cross-validation, or no cross-validation at all. Consequently, due to the lack of standardized conditions for a fair comparison, quantitative metric results were not included in Table 6.

Table 6
Ref.   Model           Task                     Input data
[170]  AlphaPose       2D Skeleton              RGB Image
[171]  MediaPipe       2D Skeleton              RGB Image
[172]  PoseNet         2D Skeleton              RGB Image
[173]  MoveNet         2D Skeleton              RGB Image
[174]  RMPE            2D Skeleton              RGB Image
[175]  PoseFlow        2D Skeleton              RGB Image
-      Baidu AI        2D Skeleton              RGB Image
[176]  FastPose        2D Skeleton              RGB Image
[177]  MobileNet       2D Skeleton, HAR, FD     RGB Image
[178]  DeepHAR         2D Skeleton, HAR, FD     RGB Image
[179]  PoseConv3D      3D Skeleton              Depth Image
[180]  STN             3D Skeleton              RGB-D Image
[181]  Autoencoder     FD                       Different features
[182]  GAN             FD                       Different features
[183]  Siamese CNNs    FD                       RGB or Optical Flow Video
[97]   FallNet         FD                       RGB Video
[184]  DeepFall        FD                       RGB, Depth or IR Video
[185]  Sep-TCN         FD                       Skeleton sequence
[186]  DCF-Net         Features                 RGB Image
[187]  SqueezeNet      Features                 RGB Image
[188]  EfficientNet    Features                 RGB Image
[189]  C3D             Features                 RGB Video
[190]  R-CNN           Features (OD, OS), FD    RGB Image
[191]  YOLO            Features (OD), HAR, FD   RGB Image
[67]   MSSkip          Features (OS)            RGB Image
[192]  PointRend       Features (OS)            RGB Image
[193]  LiteFlowNet     Features (Optical Flow)  RGB Video
[194]  Slowfast        Features, HAR            RGB Video
[195]  InceptionV3     Features, HAR            RGB or Depth Image
-      CNN             Features, HAR, FD        Image, Video or different features
[196]  LSTM            Features, HAR, FD        Skel. sequence or Video features
[197]  VGG             Features, HAR, FD        RGB or Depth Image
[198]  ResNet          Features, HAR, FD        RGB or Depth Image or Skel. Pose
[199]  GCN             Features, HAR, FD        Skel. sequence or Video features
-      RNN             Features, HAR, FD        Skel. sequence or Video features
[200]  I3D             Features, HAR, FD        RGB Video
[201]  GRU             Features, HAR, FD        Skel. sequence or Video features
[202]  Transformer     HAR                      IR Image (8x8)
[203]  TANet           HAR                      RGB Video
[204]  TPN             HAR                      RGB Video
[205]  iCAN            HAR                      Bounding Boxes sequence
[206]  Xception        HAR                      RGB Image
[207]  TSN             HAR                      RGB Video
[208]  VST             HAR                      RGB Video
[209]  TimeSformer     HAR                      RGB Video
[210]  Glimpse Clouds  HAR                      Skeleton sequence
[211]  AIA             HAR                      Video features and Bounding Boxes
-      MLP             HAR, FD                  Different features
[212]  AlexNet         HAR, FD                  Feature Image
[28]   ARFD-Net        HAR, FD                  Skeleton sequence

Fig. 6 Taxonomy of the DL techniques used in the found studies. The number of studies where each category was used is displayed in bold. Note that multiple models were used in many studies, and hence the same study can be counted in more than one category.
As mentioned in Section 5.2, many studies utilize skeleton joints as features for fall detection and HAR. To estimate these joints, various DL models are employed, with OpenPose [169] and AlphaPose [170] being the most prevalent (appearing in 25 and 10 studies, respectively). OpenPose utilizes a non-parametric representation (referred to as Part Affinity Fields) to detect skeleton joints from all humans in the image simultaneously, while AlphaPose performs human detection first and then predicts the skeleton joints for each individual. Subsequently, multiple models are used for fall detection and HAR with these skeleton joints:
• Recurrent Networks: Long Short-Term Memory (LSTM) [93,95,180] and Gated Recurrent Unit (GRU) [95,114] are commonly used, with others grouped as RNN [40,66].
• Graph-Based Network: The Graph-Convolutional Network (GCN) was the only one found, which treats skeletons as graphs rather than sequences [44,107,112]. Additionally, graph-based networks have the potential to perform collective activity recognition by leveraging interactive relations [213].
• Convolutional Networks: Various models like VGG architectures [127,146,165], MobileNet [42,81,141], ResNet family [81,141], among others [118,150,156], are employed.
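The idea of treating a skeleton as a graph, as the graph-based models listed above do, can be sketched with a single graph-convolution step in plain Python. This is an assumption-laden illustration rather than the architecture of any reviewed study: the four-joint skeleton, its edges, and the untrained weights are hypothetical, and the symmetric normalization with self-loops is the commonly used GCN formulation, which the reviewed works may or may not follow.

```python
import math

# Hypothetical 4-joint skeleton: head-hip, hip-left_foot, hip-right_foot.
EDGES = [(0, 1), (1, 2), (1, 3)]
NUM_JOINTS = 4


def normalized_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 as a nested list (self-loops included)."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_hat, features, weights):
    """One graph convolution: ReLU(A_hat @ features @ weights)."""
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


a_hat = normalized_adjacency(EDGES, NUM_JOINTS)
joints_xy = [[0.5, 0.2], [0.5, 0.5], [0.45, 0.9], [0.55, 0.9]]  # (x, y) per joint
w = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]  # illustrative, untrained 2 -> 3 weights
hidden = gcn_layer(a_hat, joints_xy, w)
```

Each row of `hidden` mixes a joint's own coordinates with those of its neighbours along the skeleton edges, which is what distinguishes this family from recurrent models that consume the joints as a flat sequence.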
Object detection was a prevalent task in the reviewed studies (found in 35 studies), with models from two families: R-CNN [190] and YOLO [191]. R-CNN involves a multi-step process including region proposal, feature extraction, object classification, bounding box regression, and non-maximum suppression. Conversely, YOLO focuses on real-time object detection with a single pass through the image. Both models received several improvements in later versions. These models were utilized for various purposes across the studies:
• Obtaining a sequence of bounding boxes from scene objects, which can serve as features in next steps [46,52,99].
• Triggering computation of fall detection or HAR upon detection of human presence, saving computation time [100,105].
• Reducing data complexity by putting the focus on the target person [27,102,161].
• Getting features from the humans in the scene, like height-to-width ratio, used for fall detection or HAR in further steps [31,78,136].
• Direct detection of falls or recognition of activities [25,76,139].
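As an example of the bounding-box features mentioned in the list above, the sketch below computes the height-to-width ratio of a detected person over consecutive frames. The box format, the pixel values, and in particular the `looks_fallen` rule with its threshold are hypothetical and are not drawn from the reviewed studies, which typically feed such features to a learned classifier.

```python
def height_to_width_ratio(box):
    """box is (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    if width <= 0 or height <= 0:
        raise ValueError("degenerate bounding box")
    return height / width


def ratio_sequence(boxes):
    """Height-to-width ratio for each frame's person box."""
    return [height_to_width_ratio(b) for b in boxes]


def looks_fallen(ratios, threshold=1.0, min_frames=2):
    """Hypothetical rule: the last `min_frames` ratios are all below `threshold`."""
    tail = ratios[-min_frames:]
    return len(tail) == min_frames and all(r < threshold for r in tail)


# Person boxes from four consecutive frames (made-up detector output).
person_boxes = [(100, 50, 160, 230), (100, 60, 170, 230),
                (80, 170, 240, 230), (80, 175, 245, 230)]
ratios = ratio_sequence(person_boxes)
fallen = looks_fallen(ratios)
```

A standing person yields a tall box (ratio above 1) and a lying person a wide one (ratio below 1), which is why this simple feature carries useful information for later classification steps.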
Additionally, object segmentation plays a crucial role in several studies. The most commonly used model is Mask R-CNN [214], which extends the capabilities of the R-CNN family to object segmentation. Another notable model is PointRend [192], a neural network module that enhances the granularity of segmentation models by treating image segmentation as a rendering problem. In addition, [67] proposes a novel model, MSSkip, which performs object segmentation as part of a processing pipeline for fall detection and post-fall classification. MSSkip builds upon common ideas from other segmentation models but incorporates multi-scale skip connections and depth-wise separable convolutions in the decoder to minimize computation. Object segmentation serves various purposes in the reviewed studies: in [103], averaged output masks are utilized as spatio-temporal features for further recognition steps; [88] performs direct classification into fall or not fall based on the segmentation of fallen individuals; segmentation masks are fed to a convolutional LSTM in [67] and to a CNN followed by an LSTM in [153] to extract spatial and temporal features for fall detection; in [108], segmentation masks are input to different machine learning models to identify falls. In contrast, in [41], segmentation is used solely to anonymize images before feeding them to an autoencoder for fall detection.
Furthermore, fall detection is frequently approached as a normal/abnormal classification task in the reviewed studies, with normal activities modeled and falls treated as abnormal data. This involves performing feature extraction, either using pre-trained models to extract spatio-temporal features from video/images or utilizing estimated skeleton joints, followed by training a model to identify normal activities. Various approaches are employed for this task, such as utilizing an MPED-RNN network on skeletal data [94], employing DeepFall on multiple data modalities (RGB, depth, and IR) [20], using autoencoders after obtaining spatio-temporal features from other networks [41,111,144], and employing Generative Adversarial Networks (GANs) by utilizing the discriminator as the normal/abnormal classifier [86,103].
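The normal/abnormal formulation can be sketched as follows. To keep the example self-contained, a trained autoencoder is replaced here by a deliberately trivial stand-in (the per-dimension mean of the normal features), and the decision rule (mean error plus `k` standard deviations) is an assumption of this sketch, not a rule reported by the reviewed studies.

```python
import statistics


def fit_normal_model(normal_features):
    """Stand-in for a trained model of normal activity: per-dimension mean."""
    return [statistics.fmean(dim) for dim in zip(*normal_features)]


def reconstruction_error(model, features):
    """Squared distance between a sample and its 'reconstruction' (the mean)."""
    return sum((f - m) ** 2 for f, m in zip(features, model))


def fit_threshold(model, normal_features, k=3.0):
    """Hypothetical rule: mean error on normal data plus k standard deviations."""
    errors = [reconstruction_error(model, f) for f in normal_features]
    return statistics.fmean(errors) + k * statistics.pstdev(errors)


def is_abnormal(model, threshold, features):
    """Flag a sample whose error exceeds the threshold learned on normal data."""
    return reconstruction_error(model, features) > threshold


# Made-up 2D features extracted from normal activities only.
normal = [[1.0, 0.1], [1.1, 0.0], [0.9, 0.2], [1.0, 0.1]]
model = fit_normal_model(normal)
threshold = fit_threshold(model, normal)

fall_like = is_abnormal(model, threshold, [0.2, 1.0])
normal_like = is_abnormal(model, threshold, [1.0, 0.1])
```

The key property shown is that only normal data are needed for training: falls are never seen by the model and are detected because their features are poorly explained by it, which sidesteps the scarcity of real fall recordings.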
Finally, the choice of architecture in the analyzed studies often depends on the data dimensionality, with recurrent neural networks (RNNs) primarily used when considering the temporal dimension and feedforward neural networks (FFNNs) when not. RNNs are well-suited for problems involving sequential data due to their ability to remember input data using internal memory. As such, they are often employed for fall detection and activity recognition from skeleton sequences [49,180] and feature sequences computed frame-wise by CNNs [24,143,148]. While CNNs are commonly used for extracting visual features from images, transformers have also been utilized in the FFNNs category, particularly for tasks involving low-resolution images [119], 3D skeleton data [81], and video by adapting Vision Transformer (ViT) [215] to video formats [53,79]. Additionally, multilayer perceptrons (MLPs) are consistently employed for skeleton data [34,38,92,140] or visual features [108].
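To illustrate the "internal memory" of recurrent models, the following sketch runs a minimal Elman-style recurrence over a sequence of per-frame feature vectors. It is a toy example under stated assumptions: the weights are hand-picked rather than trained, the function name is hypothetical, and the reviewed studies typically use LSTM variants rather than this plain recurrence.

```python
import math


def rnn_last_state(frame_features, w_in, w_rec):
    """Apply h_t = tanh(W_in x_t + W_rec h_{t-1}) and return the final state.

    frame_features: list of per-frame feature vectors (e.g. CNN outputs).
    w_in:  hidden_size x feature_size input weights (assumed, untrained).
    w_rec: hidden_size x hidden_size recurrent weights (assumed, untrained).
    """
    hidden = [0.0] * len(w_rec)
    for x in frame_features:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(u * hi for u, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


# The same two frames in a different order yield different final states,
# which a purely frame-wise (feedforward) model could not distinguish.
state_a = rnn_last_state([[1.0], [0.0]], [[1.0]], [[1.0]])
state_b = rnn_last_state([[0.0], [1.0]], [[1.0]], [[1.0]])
```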

RQ1.3: Datasets
Table 7 provides a comprehensive list of datasets used in the reviewed studies for activity recognition and fall detection. Emphasizing the importance of reproducibility and comparability, only publicly available datasets are included, aiming to facilitate future research in the field. Each dataset is categorized based on several common characteristics:
• Elderly: Despite fall detection and activity recognition often targeting elderly individuals, only a small fraction of datasets (12%) include samples from this demographic. This scarcity highlights the challenge of collecting real-life data from the elderly population, especially genuine fall incidents.
• Falls: The majority of datasets (58%) include falls as a class, with 23% specifically focusing on binary classification between fall and not fall activities, underscoring the significance of this task in eldercare.
• Type: Video data is predominant (85% of datasets), aligning with the temporal nature of activities like falls, where temporal context is crucial for accurate recognition. Furthermore, video allows for the rapid acquisition of a large quantity of images in the form of frames, which can then be utilized by data-driven solutions, such as DL-based methods.
• Data types: While RGB data is ubiquitous, depth frames, skeleton joints, and inertial data are found in 38%, 29%, and 13% of datasets, respectively. Other data types such as infrared data and motion history volumes (MHV) are less common. The presence of RGB data in all datasets allows for the discovery of the exact conditions of the recordings (environment, perspective, users, etc.) and serves as a visual check of the data, a feature not offered by other types of data.
• Samples: Dataset sizes vary significantly, ranging from less than 50 samples (e.g., FDD-Chen) to over 500,000 samples (e.g., Kinetics 700-2020), reflecting the diversity in data availability.
• Classes: The number of classes also varies widely, from binary classification to datasets with hundreds of classes, though the latter are typically not focused on AAL.
• Studies: Half of the datasets are utilized in only one study, while only five are used in more than ten studies, indicating varying degrees of dataset popularity and usage.
The University of Rzeszow Fall Detection (URFD) dataset [216] stands out as the most extensively used, featuring in 40 studies [41,89,153]. Focused on fall detection, URFD offers 70 sequences capturing falls and activities of daily living (ADL) from two perspectives, along with various data modalities including RGB, depth, skeleton joints, and inertial data. The UP-FALL dataset [168], appearing in 17 studies [24,39,103], provides data from 17 subjects performing 11 activities, offering RGB video, infrared images, and inertial data for both fall detection and human activity recognition (HAR). In contrast, the Le2i dataset [217], used in 16 studies [47,93,137], focuses solely on fall detection, featuring 143 videos with falls and 48 with normal activities, with varying actors, scenery characteristics, and illumination conditions. Similarly, the MultiCam dataset [218], utilized in 16 studies [27,30,72], provides RGB video from 24 sequences captured from eight perspectives, facilitating the study of falls and confounding events. The NTU RGB+D dataset [219], used in 14 studies [112,118,131], offers a vast collection of samples from 40 subjects performing 60 activities, recorded using Kinect cameras, thus providing RGB video, depth images, and skeleton joints. An extended version of this dataset also exists: the NTU RGB+D 120 dataset [230], which expands upon it by adding 60 additional classes. However, it is only utilized in two of the reviewed studies [107,135]. The remaining datasets were utilized fewer than 10 times, with approximately half of them being employed in only one study.
While most datasets are collected from real environments, two exceptions are noted: [101] and [135], offering synthetic images and videos, respectively. Despite the advantages of synthetic data, such as ease of acquisition and controlled conditions, models trained solely on synthetic data may lack adaptability to real-world scenarios.
Notably, some studies opted for custom datasets instead of utilizing existing ones. Figure 7 illustrates the proportion of studies using custom, external, or both types of datasets. Only 19 studies provided evaluations on both custom and external datasets, with a greater frequency of evaluations conducted solely on external datasets (86 studies) compared to those exclusively using custom datasets (46 studies).
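As a quick consistency check on these figures, the three counts can be turned into shares of the reviewed studies. The counts are taken from the text above; the variable names are illustrative only.

```python
# Study counts by evaluation dataset type, as reported in the text.
counts = {"external only": 86, "custom only": 46, "both": 19}

# The three groups are disjoint, so they should cover all reviewed studies.
total = sum(counts.values())

# Percentage of studies in each group, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The three groups add up to the 151 reviewed studies, with external-only evaluation accounting for a little over half of them.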

RQ2: Framework integration
In 18 of the reviewed articles, frameworks were proposed to integrate the tasks of HAR or fall detection into real environments, addressing various aspects such as security, utilization of cloud services, client-server configuration, network communications, IoT devices, etc. Below, we provide brief descriptions of the proposed frameworks.
In [42], a custom robot is suggested to integrate the HAR task into the environment, alongside other functionalities like language processing to enable chatbot interactions. In [161], a camera system is employed to capture visual data, which is then sent to a central server for computation. Subsequently, notifications, reports, and alerts are dispatched to a designated "guardian".
In [74], a Docker-based system is proposed to manage the flow between various programs involved in fall detection, distributing resources, and regulating communications. Docker is also utilized in [78], where the NAO robot is suggested for data acquisition and user interaction to prevent falls. In [30,32], an intermediary step between recording and DL computation is introduced to preprocess video data and reduce bandwidth consumption.
In [18,33,49,52,58,93,105], the proposed frameworks integrate the collection of visual data through camera monitoring systems, centralized server-based recognition of falls or various activities, and the triggering of responses based on the severity of the situation, such as contacting health services. For instance, [33] utilizes the third-party service 'Twilio' to send phone messages in case of a fall, while in [105], the system transfers recordings to a computer for human inspection upon fall detection.
In [123,127], activity recognition results, along with recorded video data, are transmitted to a mobile application used for monitoring system users. Similar capabilities are offered in [63], with the addition of face blurring anonymization. [77] conducts all experiments in a connected environment, exploring the use of network traffic from multiple smart appliances combined with visual data to recognize various activities. Additionally, to assess the transferability of their approach across environments, they experimented with a smart residential apartment.
In [85], federated learning is employed to ensure privacy preservation of users. The system incorporates three sensor modalities (depth, mmWave radar, and audio) and was tested in the homes of 16 elderly subjects.
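For readers unfamiliar with federated learning, the sketch below shows one common aggregation rule, federated averaging, in which a server combines model weights trained locally in each home without receiving the raw sensor data. Whether [85] uses this particular rule is an assumption; the function name and the flat-list weight representation are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client model weights, weighted by local sample counts.

    client_weights: one flat list of parameters per client (assumed layout).
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes))
        / total
        for i in range(n_params)
    ]


# Two hypothetical homes; the second holds three times as much local data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only the parameter lists leave each home in this scheme, which is what makes it attractive for privacy-sensitive AAL deployments.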

RQ2.1: Hardware
A list of the hardware used in the reviewed studies (when mentioned) is presented in Table 8. Specialized cameras such as thermal, depth, and wearable cameras, as well as social assistive robots, were included. Information regarding datasets not created in the reviewed studies was excluded. Hardware related to computation or common RGB cameras was omitted due to the wide range of possibilities available in these areas.
For depth video retrieval, the most commonly used camera is the Microsoft Kinect (7 studies), followed by the Orbbec Astra Pro (3 studies), and Intel RealSense (1 study). These cameras share similar specifications, offering RGB-D recording using an IR camera for the depth channel, which provides accurate depth estimation at short distances. Additionally, they enable reliable 3D skeleton joint estimation.
There is less consensus in the use of thermal cameras, with multiple camera models employed. Consequently, there is considerable variation in the retrieved data, including differences in resolution, sensitivity to temperature, maximum and minimum effective distances, etc.
Only five studies deployed HAR or fall detection in an AAL system using a social assistive robot. Among these, two studies utilized the Pepper robot, one employed the NAO robot, and the remaining studies used custom-made robots.

RQ2.2: Privacy protection
Figure 8 illustrates the various privacy protection methods identified in the reviewed studies. Among the 151 studies reviewed, 75 did not address privacy concerns, opting for the use of unmodified RGB video or images of elderly users. Among the remaining studies, the majority employed skeleton data computed from RGB images, while four offered specific methods to anonymize RGB data, and others chose to utilize thermal or depth data instead. The most effective privacy-preserving methods avoid the deployment of RGB cameras in AAL settings. This is typically achieved through the use of visual data types that do not allow for subject identification, such as thermal and depth imaging. Among the collected studies, five exclusively employed thermal data [20,45,119,122,159]. In all cases, DL-based methods utilized CNNs to extract visual features and perform classification. Additionally, 21 studies utilized solely depth data, with 17 of them using it to estimate 3D skeleton poses, as demonstrated in [38,43,81,152]. Notably, Microsoft Kinect was utilized in all 17 studies to estimate skeletons from depth maps through randomized decision forests [251], leaving RGB data unused for this estimation. Four studies exploited depth data without skeleton estimation, instead relying on the extraction of human silhouettes [148] and visual features using CNNs [20,113,117].
A total of 51 studies utilized RGB data at some stage, applying anonymization techniques. In contrast to the aforementioned studies, the input data used by these studies can be used to identify subjects, as conventional video recording is involved at the beginning of the processing pipeline. Among these, 47 studies relied on 2D skeleton estimation methods like OpenPose [169] and AlphaPose [170] to protect privacy, removing visual data that can be used to identify users, as illustrated in [24,44,104,107,137]. There were four studies in which privacy was protected through other methods. In [144], an IR camera is used to detect the face region of frames and remove it from the RGB frames. In [86], the RGB frames are modified in such a way that individuals cannot be identified, while fall detection can still be applied effectively. In [55], a wearable camera providing a first-person perspective is used to avoid recording the user of the system. Human silhouettes are computed in [41] and used for future recognition steps.
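The idea of removing an identifying region from a frame can be sketched as follows. This is a simplified illustration and not the procedure of [144]: the frame is represented as a plain nested list of pixel values, the bounding box is assumed to come from an external face detector, and the function name is hypothetical.

```python
def blank_region(frame, box):
    """Return a copy of the frame with the given box set to zero.

    frame: list of rows, each a list of pixel intensities.
    box:   (top, left, bottom, right), with bottom and right exclusive;
           assumed to be supplied by a separate face detector.
    """
    top, left, bottom, right = box
    return [
        [
            0 if top <= r < bottom and left <= c < right else pixel
            for c, pixel in enumerate(row)
        ]
        for r, row in enumerate(frame)
    ]


# Hypothetical 3x3 grayscale frame with a face detected in the top-right corner.
frame = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
anonymized = blank_region(frame, (0, 1, 2, 3))
```

The original frame is left untouched, so the unmodified recording can be discarded as soon as the anonymized copy has been produced.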

Discussion
This section utilizes the discovered results and the responses provided to the review questions to underscore common strengths and weaknesses of the reviewed studies. It also compiles a comprehensive list of recommendations for future reference based on the findings of this systematic literature review. In Figure 9, the search process and key findings from the reviewed studies are summarized.

Strengths and weaknesses of the reviewed studies
Upon reviewing the 151 relevant studies and addressing the research questions, the main strengths and weaknesses observed are discussed in this subsection, which we believe can provide valuable insights for future studies in the field.
A notable benefit of utilizing skeleton joints is their ability to significantly reduce data size compared to raw image or video data, while also offering user anonymization, maintaining data interpretability, and achieving satisfactory results in fall detection and HAR. Furthermore, there is a growing number of methods to derive human skeletons from RGB or depth data, with 13 different skeleton estimation DL models identified in the reviewed studies (as shown in Table 6).
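The size reduction can be made concrete with a back-of-the-envelope calculation. The frame resolution (640x480) and the skeleton layout (17 joints with 2D coordinates) are assumptions chosen for illustration; actual values depend on the camera and the pose estimation model.

```python
def rgb_values_per_frame(width, height, channels=3):
    """Number of scalar values in one raw RGB frame."""
    return width * height * channels


def skeleton_values_per_frame(num_joints, dims=2):
    """Number of scalar values in one skeleton (one coordinate tuple per joint)."""
    return num_joints * dims


# Assumed setup: 640x480 RGB frames versus 17 two-dimensional joints.
rgb_size = rgb_values_per_frame(640, 480)
skeleton_size = skeleton_values_per_frame(17)
reduction_factor = rgb_size / skeleton_size
```

Under these assumptions, a skeleton frame is several orders of magnitude smaller than the raw image it was estimated from.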
The primary strength of studies employing only depth or infrared data lies in the privacy protection they afford, as RGB footage is not recorded at any point in the system pipeline. However, these studies also face two major weaknesses: a reduced amount of data for detection or recognition tasks, particularly pronounced in the case of IR recordings where resolution tends to be much lower, and less interpretable data, which may pose challenges when manual intervention is required to address errors.
Among the reviewed studies, 27 perform both fall detection and HAR tasks (refer to Figure 4). This integration is particularly significant, as it is often desirable to detect accidental falls while conducting HAR on elderly individuals. It is important to note that while fall detection can be integrated as another class during HAR, it should be computed separately due to its critical nature. Therefore, most studies including fall detection implement it differently than the recognition of other classes.
Numerous studies have overlooked the temporal dimension when conducting HAR, thereby constraining the task significantly. This omission poses a significant weakness, particularly when incorporating activities that are challenging to distinguish without temporal data or are more effectively recognized with it, such as sitting/getting up or putting on/off clothes. Nonetheless, confining the analysis to spatial data typically offers the advantage of being faster and more straightforward.
Regarding the choice of model architecture, convolutional models were found to be predominant. Their primary strengths lie in their effectiveness in processing spatial data and their extensive history, which has led to numerous improvements and architectural refinements across various fields and tasks. Given their suitability for image-based tasks, convolutional models are widely preferred and even have 3D versions tailored for video processing. In contrast, recurrent models excel in handling sequential data, thus complementing the capabilities of CNNs by facilitating the tracking of computed features across different frames. Multi-layer perceptrons, however, do not yield favorable results with spatial or sequential data; they are typically employed for classification based on computed features, akin to fully connected layers in a convolutional neural network. Transformer-based architectures, being relatively newer, are not as ubiquitous. Despite their promise in handling sequential and vision data, their large parameter count presents challenges in training and deploying them on low-specification systems. Nonetheless, they have showcased significant potential across various domains.
Given that fall detection and HAR for the elderly aim to assist this population in AAL settings, studies offering frameworks for deploying systems in real environments are of particular interest. Eighteen studies, described in Section 5.5, fall into this category.
Utilizing only an external dataset may impact the applicability of the technique to specific situations or environments, but it allows for the comparison of different methods on the same data. Conversely, relying solely on a custom dataset yields the opposite effects. The primary drawback associated with using only custom-made datasets is the external validity of the findings, as it becomes challenging to compare results with other studies, especially if the custom data is not disclosed. Including an evaluation on external datasets not only distinguishes studies from previous ones but also enables future studies to build on the obtained performance. While the majority of the reviewed studies evaluate on existing datasets for fall detection and HAR, 46 exclusively perform evaluations on new custom datasets (as depicted in Figure 7), limiting the reliability of the results without comparisons with existing techniques or models. Conversely, 19 studies utilize both custom and external datasets, leveraging the strengths of each approach: specialization on custom data and comparison with other methodologies.
From a data perspective, three common weaknesses are evident in the datasets utilized: the absence of elderly individuals, a limited number of samples, and the inclusion of numerous classes unrelated to activities of daily living (ADL), which may render them less suitable for fall detection and HAR among elderly populations. Primarily, the majority of datasets (88%) lack elderly participants, presenting challenges during deployment, as the elderly are the target users of the system but are not represented in the training data. In this regard, datasets such as ETRI-Activity3D, ToyotaSmartHome, MUVIM, and FPDS-Elderly would be more suitable. Additionally, a limited number of samples may prove insufficient for DL models to generalize effectively. Three of the four most extensively used datasets contain fewer than 200 samples, while the remaining one contains fewer than 600, and approximately half of the utilized datasets contain fewer than 1,000 samples. Instead, datasets such as ETRI-Activity3D, NTU RGB+D (or NTU RGB+D 120), or ToyotaSmartHome, all offering more than 10,000 samples, would yield better generalization results. Lastly, datasets should be tailored to focus on ADL rather than general HAR to avoid classes that are unnecessary for monitoring elderly individuals in their daily lives. For instance, Kinetics (400, 600, or 700) or UCF101 would not be suitable for the considered tasks as they comprise videos collected from the internet, potentially containing irrelevant activities and cuts.

Recommendations for future works
Based on the results of this SLR, we consider the following recommendations particularly important when conducting new studies on the topic.
First and foremost, it is crucial to assess user privacy. As observed, the approach to privacy protection will likely influence the type of data used, ranging from conventional RGB data to modified RGB, depth, IR, or skeleton data, which prevent user identification in the footage. Therefore, we recommend considering privacy protection as a fundamental aspect from the outset of the study.
Selecting an appropriate DL model for fall detection and HAR requires consideration of the deployment conditions. For embedded systems or edge deployments, such as in social robots or mobile applications, compact models such as MobileNet or EfficientNet, well-known CNNs specifically tailored for such devices, are preferred. These models can be augmented with recurrent models like LSTM to accommodate temporal data. Conversely, if model size is not a constraint, 3D CNNs like I3D, TPN, TANet, SlowFast, and C3D are suitable for video data, while GCN can be applied to skeleton data. Alternatively, Transformer-based architectures like TimeSformer or VST are also an option for processing video input data.
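The selection guidance above can be summarized as a small lookup. This is only a restatement of the recommendations in this paragraph, with a hypothetical function name and input encoding; combinations that the text does not cover (for example, skeleton data on edge devices) deliberately return an empty list rather than a guess.

```python
def suggest_models(edge_deployment, data_type):
    """Map deployment conditions to the model families named in the text.

    edge_deployment: True for embedded or edge targets (robots, mobile apps).
    data_type:       "video" or "skeleton" (assumed encoding).
    """
    recommendations = {
        # Compact CNNs, optionally combined with an LSTM for temporal data.
        (True, "video"): ["MobileNet", "EfficientNet"],
        # 3D CNNs and Transformer-based options when size is not a constraint.
        (False, "video"): [
            "I3D", "TPN", "TANet", "SlowFast", "C3D", "TimeSformer", "VST",
        ],
        (False, "skeleton"): ["GCN"],
    }
    return recommendations.get((edge_deployment, data_type), [])


edge_video = suggest_models(True, "video")
server_skeleton = suggest_models(False, "skeleton")
```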
For model evaluation, utilizing a publicly available dataset is essential to enable comparison with existing models or techniques. Prominent datasets for fall detection include URFD, UP-FALL, MultiCam, and Le2i, while for HAR, UP-FALL and NTU RGB+D are commonly used. However, we encourage the adoption of ETRI-Activity3D or ToyotaSmartHome, which offer a more extensive collection of video samples and include elderly participants. Both datasets support HAR, with ETRI-Activity3D additionally containing falls and providing multiple perspectives from elderly users, diverse classes (at least 30), and various data modalities, including RGB, depth, and skeleton joints.
In cases where a custom dataset is provided, authors are encouraged to make it publicly available. This facilitates its use in future studies, either directly or by merging it with other datasets to form a larger dataset, enhances the reproducibility of experiments, and enables comparison with newer models or techniques. RGB-D cameras, such as Microsoft Kinect, Orbbec Astra Pro, and Intel RealSense, are recommended for collecting custom datasets as they facilitate experimentation with various types of data, with depth data offering privacy preservation capabilities.
When deploying the system in a real environment, the most common approach, as indicated by the reviewed studies, involves establishing a camera setup within the environment. This setup records data and transmits it to a central server for processing. It is also the most cost-effective option, depending on factors such as camera type, resolution, and processing requirements. Alternatively, for those preferring to use an assistive robot, both NAO and Pepper robots are viable solutions. These commercial robots come equipped with cameras, speakers, microphones, and other necessary components, offering customizable options to adapt to different projects and environments.

Conclusions
In this systematic literature review, we have investigated fall detection and human activity recognition for the elderly, with a particular focus on deep learning techniques applied to computer vision data. Our study aimed to address two primary research questions related to the implementation of DL methods for these tasks and their deployment in real-world environments, considering hardware and privacy concerns.
Throughout the review process, we analyzed 151 relevant studies, providing a structured overview of the main findings to facilitate accessibility for practitioners and researchers. The findings offer valuable insights into the effective implementation of DL techniques for fall detection and HAR in elderly care, which are becoming increasingly important in the context of Ambient Assisted Living (AAL) systems.
Privacy emerged as a common concern, with 50% of the reviewed studies lacking any measures to address it. The most prevalent privacy protection method identified was the use of skeleton joints estimation, employed in 45% of the studies.
Convolutional DL models were found to be predominant, owing to their effectiveness in processing spatial data and extensive history of refinement. However, we observed a lack of consideration for the temporal dimension in many studies, which limits the recognition of some activities.
Regarding datasets, we identified three common weaknesses: the absence of elderly individuals, a limited number of samples, and the inclusion of numerous classes that are irrelevant for AAL systems. We recommend datasets such as ETRI-Activity3D and ToyotaSmartHome, which offer extensive samples and include elderly participants.
Moving forward, we emphasize the importance of privacy assessment from the outset of studies and recommend selecting appropriate DL models based on deployment conditions. Utilizing publicly available datasets for model evaluation is crucial, and authors are encouraged to make custom datasets publicly available to enhance reproducibility and facilitate future research.
In terms of deployment, camera setups within the environment were the most common approach identified, offering cost-effectiveness and flexibility. Alternatively, assistive robots like NAO and Pepper provide customizable options for deployment in various projects and environments.
Overall, this SLR provides a comprehensive overview of recent advancements in DL-based fall detection and HAR for the elderly, offering valuable insights for researchers, practitioners, and policymakers involved in developing and implementing AAL technologies.

Fig. 1 Number of publications obtained from each database, before duplicate removal and study selection (2616 in total).

Fig. 2 Accumulated publications from 2013 to 2023 (both included) after study selection and quality assessment.

Fig. 3 Articles collected at each phase of the systematic search process, including acquisition from various sources, removal of duplicates, and study selection, which also involved quality assessment.

• Author/s
• Type (journal article or conference proceedings)
• Publication year
• Task (HAR or fall detection)
• Data type (RGB, depth, IR or skeleton)
• Auxiliary sensor data type (accelerometers, gyroscopes, etc.)
• Camera used
• Dataset (name of external dataset/s or "custom")
• DL model/s and task (skeleton joints estimation, feature extraction, classification, etc.)
• Other ML models or computer vision techniques used
• System integration in a robot (yes/no) and which one
• System integration in a framework (yes/no)
• How is privacy preserved? (depth or IR only, low resolution, etc.)
• Reproducible (yes/no)
• Test with external datasets
• Comparison with other approaches

Fig. 4 Distribution of studies by target task: fall detection, HAR or both.

Fig. 5 Distribution of studies by data type used. Note that more than one data type was used in some studies.

Fig. 7 Distribution of studies by dataset used.

Fig. 8 Distribution of studies by method used to preserve privacy. The total does not add up to 151 studies because in some studies different options were given.

Fig. 9 Summary of the search process and found results.

Table 2
Primary and secondary research questions used for this SLR.
RQ1.1 What is the preferred data type?
RQ1.2 What are the most extensively used architectures?
RQ1.3 What are the most extended datasets?
RQ2 How can these tasks be deployed successfully in a real environment?
RQ2.1 What is the most common hardware (cameras, robots, etc.)?
RQ2.2 How is privacy of the elderly preserved?

Table 3
Full list of relevant studies examined in this systematic review. The tasks of the studies may involve FD (fall detection), HAR (Human Activity Recognition), or both (FD, HAR). The CV data column indicates the type of computer vision data used: RGB, D (depth), or IR (infrared). Data types are separated by commas if the study requires all of them or by slashes if only one is necessary. (continued on the next page)

Table 4
Full list of relevant studies examined in this systematic review. The tasks of the studies may involve FD (fall detection), HAR (Human Activity Recognition), or both (FD, HAR). The CV data column indicates the type of computer vision data used: RGB, D (depth), or IR (infrared). Data types are separated by commas if the study requires all of them or by slashes if only one is necessary. (continuation)

Table 5
Studies using multi-modal approaches and type of fusion with visual data.

Table 6
DL models utilized in the reviewed studies, tasks they are employed for, input data they process, and number of studies in which they are featured. Abbreviations are used for Fall Detection (FD), Object Detection (OD), and Object Segmentation (OS). Models of the same family are grouped together, including R-CNN, LSTM, YOLO, VGG, ResNet, RNN, CNN, and GCN.

Table 7
Comprehensive list of publicly available datasets used in the reviewed studies, along with their basic specifications. The columns "Eld." and "Cl." denote the presence of elderly people in the datasets and the number of classes, respectively. The "Studies" column indicates in which studies they appear, or the number of reviewed studies if it is greater than two.

Table 8
Special cameras and social robots found in the reviewed studies.