A Study on the Application of Convolutional Neural Networks to Fall Detection Evaluated with Multiple Public Datasets

Due to the repercussions of falls on both the health and self-sufficiency of older people and on the financial sustainability of healthcare systems, the study of wearable fall detection systems (FDSs) has gained much attention in recent years. The core of an FDS is the algorithm that discriminates falls from conventional Activities of Daily Living (ADLs). This work presents and evaluates a convolutional deep neural network applied to the identification of fall patterns based on the measurements collected by a transportable tri-axial accelerometer. In contrast with most works in the related literature, the evaluation is performed against a wide set of public data repositories containing the traces obtained from diverse groups of volunteers during the execution of ADLs and mimicked falls. Although the method can yield very good results when it is hyper-parameterized for a certain dataset, the global evaluation with the other repositories highlights the difficulty of extrapolating to other testbeds the network architecture that was configured and optimized for a particular dataset.


Introduction
Due to the growing life expectancy and the social changes in the traditional family structure, the population of seniors that live alone in their homes has notably increased during the last few decades. In this context, falls are a major risk for the quality of life and the autonomy of the elderly.
According to the studies reported by the World Health Organization (WHO) [1], falls represent the second leading cause of accidental deaths around the world, producing a particularly high morbidity among people aged 65 and older. For those aged over 80 residing in community settings, the percentage of those persons that experience at least one fall per year climbs to 50% [2], with 40% of them suffering recurrent falls [3]. In the USA, the annual number of fall-related injuries is expected to reach 3.4 million in 2020 and 5.7 million by the year 2030 [3]. As it refers to the economic impact on the sustainability of national health systems, the global medical costs attributable to falls in 2015 totaled about $50.0 billion [4].
Aid response time is a key element to prevent the most serious potential consequences of the comorbidities and disabilities linked to falls. Consequently, the study of systems for the automatic recognition of falls has become an important research topic in the fields of telemedicine and human activity recognition during the last ten years.
The objective of Fall Detection Systems (FDSs) is to continuously monitor the movements of a certain user (or patient) with the aim of transmitting an alarm notification (text message, phone call, etc.) claiming assistance to a remote observation point whenever a fall is suspected. FDSs must be carefully designed to discriminate falls from other routines or ADLs (Activities of Daily Living) so that the number of both unnoticed falls and false alarms (ADLs misidentified as falls) is minimized.
In spite of the variety of existing solutions to the problem of fall detection, FDSs are usually categorized into two groups [5][6][7]: context-aware (vision and/or ambient based) and wearable detectors. In context-aware systems, falls are recognized by processing the signals collected by environmental sensors (such as cameras, depth sensors, microphones, vibration sensors, etc.) located in the vicinity of the subject to be tracked. Hence, the operation of a context-aware architecture is confined to the particular zone (e.g., room, nursing home) where the sensors are deployed and configured. In this zone, the alteration of factors such as the lighting, the disposition of the furniture, or the presence of unexpected elements (occlusions, pets, falling objects, other individuals, spurious sounds, etc.) may impact heavily on the effectiveness of the detection decision [8]. Furthermore, in the case of using audiovisual equipment, the patients may feel their privacy compromised.
On the other hand, wearable systems permit monitoring the patient's movements by means of one or several transportable sensors (mainly accelerometers, but also gyroscopes, and much less frequently, magnetometers or ECG sensors), which are fixed to the clothes or attached to the body through elastic bands.
Wearable FDSs can be easily implemented on smartphones as these popular gadgets natively embed inertial sensors. Alternatively, if the sensing capabilities of the phone are not leveraged, external sensors can connect with a smartphone via a low-power wireless standard (such as Bluetooth Low Energy) in order to benefit from the long-range connectivity (Wi-Fi, 4G/3G) of these personal devices. Thus, wearable FDSs offer a cost-effective alternative to track the movements unequivocally linked to a certain user, practically without any geographical or location restriction. The increasing capacity of wearables to run sophisticated detection algorithms has also fostered the interest in this typology of FDSs within the research community.
Falls are generically and ambiguously defined by the WHO as events that "result in a person coming to rest inadvertently on the ground or floor or other lower level" [1]. Due to the complex dynamics and the broad variety of the types of falls, fall detection algorithms based on machine learning techniques yield much more accurate results than those obtained by 'thresholding' strategies [9,10], which simply compare a certain variable or groups of variables (e.g., the acceleration magnitude) with one or several preset thresholds or limit values to produce the detection decision.
In the domain of machine learning, different classes of architectures based on artificial neural networks such as Recurrent Neural Networks (RNNs) [11][12][13] have been successfully employed as the movement classifier of an FDS. Similarly, Convolutional Neural Networks (CNNs) have also been recently proposed as a promising technology for those HAR (human activity recognition) systems [14,15] and wearable FDSs that process the data gathered by inertial sensors [16][17][18].
CNNs are formed by a sequence of processing layers interrelated through neurons. Unlike other machine learning techniques commonly employed in wearable FDSs (such as Support Vector Machine or k-Nearest Neighbors), CNNs allow modeling the underlying structures of large datasets without requiring human guidance as they are capable of identifying those internal features that optimize the representation of the data with different layers of abstraction [19].
A central issue still under discussion in the study of fall detection algorithms is their evaluation. On account of the evident complexity of testing an FDS in a real scenario with actual falls suffered by elderly patients, fall detection algorithms are overwhelmingly evaluated in the literature against databases of inertial measurements. These data are typically collected by a group of volunteers who wear inertial sensors during the systematic emulation of falls and ADLs executed according to a preconfigured testbed.
In almost all initial studies on FDSs, the resulting evaluation datasets were not made publicly available, so they could not be re-utilized by other authors to compare new proposals. This lack of specific benchmarks was partially remedied over the past years, during which several repositories designed for the evaluation of FDSs have been released. Thus, an increasing use of public datasets as benchmarking tools has been clearly detected in the recent related literature. However, although existing databases strongly differ in many aspects (sampling rate and range of the sensors, number and characteristics of the experimental users, emulated movements, length of the samples, etc.), in the vast majority of the works, the detection methods are parameterized and tested by taking into account only a single dataset. Thus, it is legitimate to question whether the results obtained by these studies for a particular dataset can be extrapolated when other test samples are considered.
Sensors 2020, 20, 1466
In this paper, we evaluate the capability of a CNN to perform as the movement classifier in a wearable FDS. The hyper-parameters of the deep learning architecture are initially designed to optimize the detector performance when a specific dataset is utilized as the evaluation framework. Then, the same architecture is trained and tested with the other 13 datasets. The analysis shows the huge variability of the quality metrics, which deeply depend on the utilized repository.
The rest of this paper is organized as follows. Section 2 reviews the existing datasets, while Section 3 comments on the configuration of the CNN (input features, layers, dimensions). Section 4 presents the numerical results systematically obtained for the different datasets when the same detection method is applied. Finally, the conclusions are recapitulated in Section 5.

Revision and Selection of Public Datasets
Due to the inherent complications of gathering inertial measurements of actual falls suffered by older people, most works have evaluated their proposals for FDSs by creating a testbed in which a set of volunteers transporting one or several inertial sensors systematically execute a predetermined number of activities. These preconfigured activities usually include typical ADLs (such as sitting, walking, running, climbing stairs, etc.) as well as some types of mimicked falls (which are normally carried out on a mat or padded surface to avoid injuries). Unfortunately, in many studies, these measurements obtained in the testbed are not made publicly available to be exploited by the research community to compare new proposals. To overcome this drawback, different datasets have been published (especially during the last four years) as benchmarking tools for cross-comparison of detection algorithms. Table 1 presents a comprehensive list of the authors, reference, institutions, and year of publication of these datasets. Ahmed et al. [20] presented another repository (generated by 140 subjects and intended for assessing fall risk), which was not considered in this analysis as it only includes five falls.
All of these datasets comprise the measurements collected by the inertial sensors worn by the selected volunteers during the preconfigured experiments. The number of obtained samples and considered typologies of the emulated ADLs and falls, the duration of the traces as well as the basic characteristics of the participants (number, gender, and age range) are described in Table 2. Table 3 summarizes, in turn, the type and basic properties (sampling rate, range) of the sensors employed to generate the repositories. The table also indicates the corporal position on which the sensor was located or attached during the experiments (see [21] for a further comparison of some of these datasets). As can be observed from the table, although there are cases where up to seven sensing positions have been considered, most datasets include only a single measuring point. In all cases, the sensor embeds at least an accelerometer and less often, a gyroscope, a magnetometer, and/or an orientation sensor.
These available repositories are being increasingly considered by the recent literature on algorithms for FDSs to evaluate the effectiveness of the detection process. Table 4 lists those studies that have employed public datasets in order to test neural algorithms aimed at detecting falls in wearable systems. The table itemizes the number and name of the utilized repositories as well as the quality metrics achieved by the neural detection methods (mainly sensitivity and specificity or, alternatively, accuracy or AUC, the area under the receiver operating characteristic curve). Table 4 shows that the evaluation of the method in most studies (12 out of 17) was limited to a single dataset. Only three works validated their proposals against more than two datasets. The work by Khojasteh et al. [42] made use of four databases, but two of them (DaLiac [43] and Epilepsy [44] repositories) only included ADLs (which only allows for evaluating the capacity of the system to avoid false alarms). The interesting dataset from the FARSEEING project [35], also used in that study, consists of the traces obtained from 300 real-world falls captured by monitoring a population of hundreds of older people for several weeks. However, only 22 samples of that repository were available (upon request to the project managers). The FARSEEING dataset was also taken into consideration by one of the two studies that employed three datasets: the work by Mauldin in [39], which introduced two similar databases (known as Smartwatch and Notch) collected with wrist sensors (a smartwatch and an external IMU, or Inertial Measurement Unit, respectively).
Apart from the problems related to the difficulties of detecting falls with a wrist-worn device, these databases, also used by Santos in [18], incorporate only a moderate number of fall events, which may hamper a thorough and systematic assessment of the effectiveness of the detector.
From the previous analysis, we can conclude that the use of several benchmarking datasets has not been a major concern in the literature focused on wearable fall detection systems. However, before being applied, all machine learning strategies for fall detection (including neural methods) need to be configured by setting the values of a not-negligible number of parameters (e.g., the number and nature of input features). In most cases, these values are heuristically selected, presumably as a function of the results obtained with a set of testing samples extracted from a very particular dataset. As no other database is utilized, the study of the capability of the configured network to detect falls under conditions different from those in the reference dataset (sensor model, sampling rate, typology of ADL and falls, etc.) actually remains unaddressed. Furthermore, except for some works such as that by Yuwono [45] (which considers an outgroup dataset), the samples applied to test the system were always acquired from the same experimental subjects that provided the training (and validation) samples.
In this regard, by using three repositories (tFall, DLR, and MobiFall), Medrano et al. have already shown in [46] that the performance of an FDS noticeably decreased when the machine learning algorithms were evaluated on a dataset different from that employed for training. In this work, we show that even when the algorithm is trained and tested with data from the same dataset and users, the performance of the same method may vary dramatically depending on the contemplated repository. With this in mind, we selected SisFall as our reference dataset to parameterize the CNN in order to maximize the performance metrics. Then, we analyzed the resulting network configuration when it was trained and tested with 13 other datasets (ticked with a check mark in Table 1). We opted to use SisFall [36] as the basis of our analysis as it is one of the most employed datasets in the literature (see Table 4). In addition, it was generated by one of the largest sets of participants (38 subjects including 19 males and 19 females) with the widest age range (19-75 years). SisFall contains a significant volume of traces (2707 ADLs and 1798 mimicked falls) with a duration (between 10 s and 180 s per movement, with a mean value of 15 s) long enough to apply different analysis strategies. The nature of the emulated activities also exhibits a noteworthy variety: 19 categories of ADLs (ranging from jogging or stumbling to basic movements such as sitting down) and 15 different types of falls (generated as a function of the direction of the fall and the user's initial position). In the testbed deployed to collect the SisFall samples, the volunteers carried the sensing device on a belt. The waist is believed to be a good (and ergonomic) position to characterize the user's movements [60] as it is near the center of gravity of the body and not strongly associated with the individual mobility of a limb (such as the wrist or the ankle).
For comparison purposes, we also selected those datasets where the data were collected on the waist (or at least, on the upper part of the thigh). In order to keep a common evaluation framework, in those cases where the traces were gathered with several sensors simultaneously located on several parts of the body (e.g., Erciyes or UMAFall datasets), the analysis focused on the data obtained on the waist and the data from the other sensors were not utilized.
Nevertheless, we have to remark that the appropriateness of investigating fall detection systems with falls emulated by young healthy participants on a cushioned surface is still a controversial topic that is beyond the scope of this work. In this respect, Klenk found remarkable differences between the mobility patterns of emulated and real-life falls [61]. In contrast, after analyzing the dynamics of actual falls endured by elderly people, Jämsa et al. concluded in [62] that intentional and real-life falls exhibited analogous characteristics.

Selection of the Input Features
The selection of the input features is a key design decision for the performance of machine learning strategies.
In most practical implementations of FDSs, the detector is expected to be (at least partly) implemented on a sensing mote with severe hardware limitations in terms of battery and computing power. Thus, to facilitate the real-time operation of the wearable, input features should be derived from the data collected by the sensors without requiring any complex preprocessing of the signals. In this respect, the architectures of the CNNs are particularly suited to learn the internal structure of the signals directly from the raw sensor data without any previous heuristic extraction of input features. Accordingly, we propose to directly feed the input of the CNN with the raw inertial measurements provided by the repositories instead of using other parameters computed from the data (extreme values, statistical moments, autocorrelation, time between 'peaks' or 'valleys' of the signals, wavelet or discrete Fourier transform coefficients, frequency domain features, etc.). Although some datasets include other types of measurements, in this paper we focus on the analysis of the triaxial accelerometry signals (which are the basis for fall detection in most wearable FDSs existing in the literature).
Falls are normally associated with one or several sudden upsurges of the acceleration magnitude caused by the impacts of the body against the ground [63]. Hence, our analysis will be concentrated on a time interval of fixed duration around the instant in which the maximum value of the acceleration magnitude is detected, implicitly assuming that in the case of the sequence with a fall event, the fall has occurred during this interval.
The acceleration magnitude or Signal Magnitude Vector ($SMV_i$) for the i-th sample can be directly computed from the values measured by the triaxial accelerometer:
$$SMV_i = \sqrt{A_{xi}^2 + A_{yi}^2 + A_{zi}^2}$$
where $A_{xi}$, $A_{yi}$, and $A_{zi}$ define the three components of the acceleration vector for that i-th sample in the direction of the x, y, and z-axis, respectively. These components are periodically measured by the tri-axial accelerometer embedded in the smartphone or the external sensors. The acceleration peak or maximum of the SMV ($SMV_{max}$) is defined as:
$$SMV_{max} = SMV_{t_o} = \max\{SMV_i : i \in [1, N]\}$$
where N indicates the length (number of samples) of the analyzed trace while $t_o$ is the index of the sample at which the acceleration peak is located.
Following the typical fall pattern, a "free fall" period (in which the acceleration magnitude tends to be zero) normally precedes the impact against the floor. Furthermore, the dynamics of a fall are usually characterized by brusque modifications of the body orientation, which are reflected in abrupt changes in the sequence of the three acceleration components. In this context, the typical duration of a fall has been reported to span between 1 s and 3 s [64]. Thus, in order to capture the most significant elements of the dynamics of a fall, we propose setting an observation window of up to ±2.5 s around the instant $t_o$ (four different window sizes will be considered). Hence, the CNN is fed with the raw data collected by the accelerometer during that period. Figure 1 illustrates an example of the evolution of both the acceleration components and magnitude for a particular ADL (climbing and descending stairs rapidly) and a forward (mimicked) fall caused by a trip. Figure 2 represents the time series that will be analyzed by the CNN after extracting the values corresponding to the observation window around the peak magnitude (for two window sizes: 1 and 5 s).
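As a minimal illustration of the two definitions above, the following Python sketch computes the SMV of every sample and locates the acceleration peak. The function names and the toy trace are hypothetical, and the indices are 0-based here, unlike the 1-based notation used in the text.

```python
import math


def signal_magnitude_vector(ax, ay, az):
    """Return the SMV (Euclidean norm) of every triaxial sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def acceleration_peak(smv):
    """Return (t_o, SMV_max): index and value of the largest SMV."""
    t_o = max(range(len(smv)), key=smv.__getitem__)
    return t_o, smv[t_o]


# Toy trace in units of g (illustrative values, not taken from any dataset).
ax = [0.0, 0.1, 2.0, 0.0]
ay = [0.0, 0.2, 1.0, 0.0]
az = [1.0, 0.3, 2.0, 1.0]

smv = signal_magnitude_vector(ax, ay, az)
t_o, smv_max = acceleration_peak(smv)
```

If several samples share the maximum value, this sketch keeps the first one; the text does not state how such ties are resolved.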
The sub-figures, in which the value $SMV_{max}$ is indicated with a square marker, clearly show that window sizes larger than 5 s are not required to capture the variability of the mobility patterns during the fall. In our analysis, we will take into consideration two alternative variants for the input sequences of the CNN:

1. The series of the acceleration modules ($SMV_j$) computed from the samples collected during the observation window:
$$\{SMV_j\}, \quad j \in \left[t_o - \frac{T_W}{2} f_s,\; t_o + \frac{T_W}{2} f_s\right]$$
where $f_s$ indicates the sampling rate of the accelerometer and $T_W$ represents the duration (in seconds) of the window.

2. As the second set of input features, we directly consider the series of the triaxial acceleration components ($A_{xj}$, $A_{yj}$, $A_{zj}$) obtained from the sensor:
$$\{A_{xj}, A_{yj}, A_{zj}\}, \quad j \in \left[t_o - \frac{T_W}{2} f_s,\; t_o + \frac{T_W}{2} f_s\right]$$

In the case of SisFall traces, as the sampling frequency is 200 Hz, a 5 s window encompasses 1001 values of the acceleration magnitudes and 3003 input features when the triaxial components are employed.
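The window arithmetic can be checked with a short Python sketch. It assumes that the window includes both end points, which is consistent with the 1001 samples quoted for SisFall. The function names, the trace length, the peak position, and the order in which the three axes are concatenated are hypothetical; how peaks lying closer than $T_W/2$ to either end of a trace are handled is not described in the text, so the sketch simply rejects them.

```python
def observation_window(t_o, fs, t_w, n_samples):
    """Return the inclusive 0-based (start, stop) indices of the window.

    The window spans t_w seconds in total (t_w / 2 on each side of the
    peak index t_o), so it holds 2 * half + 1 samples.
    """
    half = int(round(fs * t_w / 2))
    start, stop = t_o - half, t_o + half
    if start < 0 or stop >= n_samples:
        # Assumption: edge handling is not specified in the source text.
        raise ValueError("window does not fit inside the trace")
    return start, stop


def window_features(ax, ay, az, t_o, fs, t_w):
    """Concatenate the windowed x, y and z series into one input vector."""
    start, stop = observation_window(t_o, fs, t_w, len(ax))
    sl = slice(start, stop + 1)
    return ax[sl] + ay[sl] + az[sl]


fs, t_w = 200, 5            # SisFall sampling rate (Hz) and window size (s)
n = 3000                    # hypothetical 15 s trace
ax = ay = az = [0.0] * n    # placeholder samples
t_o = 1500                  # hypothetical position of the acceleration peak

start, stop = observation_window(t_o, fs, t_w, n)
smv_inputs = stop - start + 1
triaxial_inputs = window_features(ax, ay, az, t_o, fs, t_w)
```

With these values, `smv_inputs` equals 1001 and `len(triaxial_inputs)` equals 3003, matching the figures given for SisFall.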

Structure of the CNN
The basic objective of a neural network is to autonomously discover and implement a relationship between a set of fixed-size input features (here, the accelerometer data) and a known set of fixed-size output labels (here, a binary decision of 0 or 1, depending on the movement type, ADL or fall).
Classical multilayer perceptrons (MLPs) present full neuron connectivity between contiguous layers (which dramatically increases the number of synaptic weights). In contrast with this repetitive structure of neurons in MLPs, CNNs are composed of specialized layers conceived for different purposes. Thus, some elements are only responsive to a particular zone or 'region' (for images) or interval (for time series) of the original input data. Benefitting from this 'clustering', the initial (convolutional) layers in a CNN are in charge of learning and extracting the 'features' that characterize the different pattern types to enable the discrimination at the final stage.
In MLPs or under other machine learning strategies, the features (or internal representation of the raw data) required to feed the classifiers must be 'manually' or heuristically selected. Hence, the performance of the system strongly relies on the expertise of the designer [19].
In a CNN, the features learned by a certain group of neurons in a layer (computed through simple but nonlinear combinations of the neuron inputs) are the inputs for some neurons of the following convolutional layer. Therefore, the high-level abstraction of the raw data is carried out automatically and with multiple levels of representation. To this end, in the convolutional layers, every neuron convolves the received data with a set of adjustable kernels (or filters) of a predefined size. The coefficients of these filters are adjusted during the training phase to optimize the representativeness of the features. After the convolution, the resulting values are passed through a non-linear activation function. In our case, we used the rectified linear unit (ReLU), which is widely used in CNNs. ReLU can be easily computed as a ramp function: f(z) = max(z, 0) [19].
In order to reduce and compact the feature vectors produced by the convolutional filters into a 'down-sampled feature map', we utilized pooling layers after the convolutional layers. As a result, every element or neuron in the pooling layer is capable of condensing the information generated by a region of neurons of the previous layer. In particular, we employed the popular max-pooling filters, which simply extract the highest value of the input region.
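The chain of convolution, ReLU activation, and max pooling described above can be sketched in plain Python for a single one-dimensional filter. This is only an illustration under simplifying assumptions: the kernel values, the pooling size, and the toy signal are hypothetical, there is no bias term, and the batch normalization step that the network also applies is omitted.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D convolution as used in CNNs (cross-correlation, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Ramp function f(z) = max(z, 0) applied element-wise."""
    return [max(v, 0.0) for v in values]


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial region is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


signal = [1.0, 1.0, 0.0, 3.0, 1.0, 1.0, 1.0]
kernel = [-1.0, 1.0]  # hypothetical filter; real coefficients are learned

feature_map = relu(conv1d(signal, kernel))
pooled = max_pool(feature_map, 2)
```

Here the filter responds only to the sharp rise in the middle of the signal, and pooling halves the length of the feature map while preserving that response.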
After the sequential feature extractors, a final classifier is required to produce the final discrimination decision based on the global features learned by the closing convolutional layer. In our scheme, although other conventional machine learning strategies could have been considered, we used an architecture comprising one fully connected layer and a softmax function, which normalizes the weighted input feature vector into two values that describe the probabilities of detecting an ADL or a fall. The final classification of the movement is simply based on the maximum of these two probabilities.
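The final decision step can be illustrated with a small Python sketch of the softmax normalization followed by the selection of the most probable class. The two input scores stand for the outputs of the fully connected layer; their values and the ordering of the labels are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels=("ADL", "fall")):
    """Return the label with the highest probability and both probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical outputs of the fully connected layer for one movement.
label, probs = classify([0.3, 2.1])
```

Because the two probabilities sum to one, choosing the maximum is equivalent to checking whether the fall probability exceeds 0.5.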
To implement the CNN, we utilized MATLAB [65] scripts by leveraging the Deep Learning MATLAB Toolbox™ [66]. To operate with these scripts, the CNN is directly fed with an equivalent 'image' of (1 × width) 'pixels', where the term width describes the number of acceleration samples contained in the observation window around the acceleration peak (1001 or 3003, for the case of the SisFall dataset). Thus, the system was trained to categorize the 'images' as two different output types (ADL or fall).

Training Procedure
In order to prevent overfitting during the training process, the original traces of the employed datasets were divided into three independent sets of samples: training, validation, and test sets, following the typical ratio of 60% (for training), 20% (validation), and 20% (testing). The repositories were randomly split, but preserving the same proportion of falls and ADLs in the three groups.
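A stratified 60/20/20 split of this kind can be sketched as follows. The helper name, the fixed random seed, the rounding of the group sizes, and the toy label list are assumptions made for the example; the actual splitting code used in the study is not described.

```python
import random


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, val, test) that keep the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy repository: 60 ADL traces and 40 falls (illustrative counts only).
labels = ["ADL"] * 60 + ["fall"] * 40
train, val, test = stratified_split(labels)
```

Splitting each class separately is what guarantees that the ratio of falls to ADLs is the same in the three groups.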
The validation set was used to assess the performance of the network after a certain number of iterations or epochs in which the CNN was trained with the training sample group. As the training progresses, the error (or loss) committed when classifying the validation samples is expected to gradually decrease. Thus, the process continues until this validation loss stops decreasing and keeps increasing for a predetermined number of attempts ('validation patience'). This fact indicates that the network is beginning to overfit the training data and, consequently, that the learning phase needs to be concluded. To reduce the effects of overfitting in deep learning, we also employed two common complementary techniques: dropout and L2 Regularization, aimed at avoiding co-adaptations on training samples and minimizing the sum of the values of the weight coefficients, respectively.
We established a validation patience of three epochs. If this limit is not exceeded, the learning phase stops after 20 epochs. Figure 3 illustrates the rapid convergence of the accuracy and loss for the training and validation sample sets during the training process when the SisFall dataset is employed.
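The stopping rule can be sketched in Python as shown below. The list of validation losses is a hypothetical stand-in for the values that real training would produce after each epoch, and resetting the counter whenever a new minimum is reached is an assumption about how the patience is counted.

```python
def train_with_early_stopping(validation_losses, patience=3, max_epochs=20):
    """Return the number of epochs run before training stops."""
    best = float("inf")
    bad_epochs = 0
    epochs_run = 0
    for loss in validation_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best, bad_epochs = loss, 0  # improvement: reset the counter
        else:
            bad_epochs += 1             # no improvement in this epoch
            if bad_epochs >= patience:
                break                   # validation patience exhausted
    return epochs_run


# Hypothetical validation losses, one value per epoch.
losses = [0.9, 0.5, 0.3, 0.25, 0.27, 0.26, 0.30, 0.2]
stopped_at = train_with_early_stopping(losses)
```

In this toy sequence the loss reaches its minimum at the fourth epoch and fails to improve during the next three, so training halts after seven epochs and the later value of 0.2 is never seen.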

In short, Table 5 summarizes the final characteristics and hyper-parameters of the utilized CNN as well as the procedures considered for the training phase. Through an initial phase of hyperparameter optimization and a systematic evaluation of the architecture, the network was dimensioned and hyper-parameterized to maximize the performance metrics achieved with the SisFall repository.
As previously commented, the complexity of a CNN is determined by the typology and number of layers. A basic architecture with just one or two convolutional layers may be sufficient to learn the features in a small set of unsophisticated data. However, more layers are normally required to detect complex patterns in datasets such as those used in our study. Therefore, as can be seen in Table 5, the CNN consisted of four consecutive feature extraction layers and one final classifier. Every feature extraction layer includes one convolutional layer, followed by one batch normalization layer, a nonlinear ReLU activation function, and one down-sampling max pooling layer (except for the final convolutional layer, which does not incorporate the max pooling operation). Likewise, the final classifying layer consists of one fully-connected layer, one softmax function, and one final classifying step. The training process uses the cross-entropy loss function.

Numerical Results
After training the neural system, the performance of the trained CNN was evaluated by computing the efficacy of the detection decisions when the network was fed with the independent set of test samples of every dataset.
To evaluate the capacity of the CNN to discriminate falls from ADLs, we calculated three traditional quality metrics that characterize the performance of binary classifiers: the sensitivity (Se), hit rate or recall, which describes the ability to recognize falls; the specificity (Sp) or selectivity, which portrays the effectiveness of the FDS in preventing false alarms (i.e., ADLs misinterpreted as falls); and the accuracy (Acc), as a global measurement of the system efficiency.
These quality metrics (defined as percentages) can be straightforwardly computed as:

$$\mathrm{Sensitivity}(\%) = 100 \cdot \frac{TP}{TP + FN}$$

$$\mathrm{Specificity}(\%) = 100 \cdot \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy}(\%) = 100 \cdot \frac{TP + TN}{TN + FN + TP + FP}$$

where TP and TN define the number of 'True Positives' and 'True Negatives' (i.e., falls and ADLs that have been adequately identified), respectively. Similarly, FP and FN describe the number of 'False Positives' and 'False Negatives' (ADLs and falls that have been misidentified).
The results obtained when the network was trained and tested with the reference dataset (SisFall) are available in Table 6 (also presented in [67]). The table displays the quality metrics achieved for an observation window of ±2.5 s for the two alternative input signals (the SMV and the triaxial acceleration components). The results seem to indicate that the effectiveness of the detector (namely the sensitivity) noticeably improves when the 3-axis signals are utilized. This could be explained by the fact that, when compared to the acceleration magnitude, the triaxial components offer better insight into the sudden changes of the body orientation provoked by the falls, which could help the CNN in the detection decision.
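As a minimal sketch, the sensitivity, specificity, and accuracy defined above can be derived from the four confusion counts as follows. The function name `fall_detection_metrics` and the counts in the example are hypothetical and do not correspond to any result in the tables.

```python
from typing import Tuple


def fall_detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) as percentages.

    tp: falls detected as falls      fn: falls missed (classified as ADLs)
    tn: ADLs classified as ADLs      fp: ADLs misinterpreted as falls
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("the test set needs at least one fall and one ADL")
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical test set: 100 falls (10 missed) and 200 ADLs (10 false alarms).
se, sp, acc = fall_detection_metrics(tp=90, tn=190, fp=10, fn=10)
```

In this illustrative case the sensitivity is 90%, the specificity 95%, and the accuracy about 93.3%. Because accuracy pools both classes, it can look acceptable on an imbalanced test set even when one of the other two metrics is poor, which is why all three are reported.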
In any case, the obtained results (with both specificity and sensitivity near 99%) were better than those published by other studies on FDSs that employed the same SisFall dataset as a benchmarking tool [36,52,56,68,69], in which a specificity and a sensitivity above 98% were not attained simultaneously.
However, the performance of the system worsens considerably if we extend the evaluation to the other datasets. For that purpose, we also selected those existing databases that incorporate samples obtained with an accelerometer located in a similar position (waist or, if not possible, thigh). Tables 7 and 8 show the results for the 14 considered repositories when the SMV and the triaxial acceleration components, respectively, are considered as the input features (the results for the reference SisFall dataset are highlighted in bold font). The tables include the metrics achieved for four different observation windows (±0.5 s, ±1 s, ±1.5 s, and ±2.5 s around the acceleration peak). In two cases (UniMiB and DLR), the short duration of the samples prevented the study for the highest values of the window size.
We have to remark that in all cases, the CNN architecture was trained, validated, and tested with data extracted from the same dataset (no cross-validation between different datasets was contemplated) and followed the same procedure as that applied to the SisFall dataset. By the same token, for each repository, the number of inputs of the CNN was dimensioned by taking into account the sampling rate used to generate the dataset and the desired observation window (no sub-sampling or over-sampling of the dataset measurements was performed). In this way, we evaluated the capability of the deep learning architecture to self-adapt to the different conditions under which the traces were collected in each repository.
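The dependence of the CNN input size on the sampling rate and the observation window can be illustrated with a short sketch. It computes the SMV of each triaxial sample, locates the acceleration peak, and cuts a window of ±T_W seconds around it. The helper names, the rounding rule (2·round(f_s·T_W) + 1 samples), the clamping of the window at the edges of the trace, and the sampling rates in the test values are all assumptions made for illustration, not details taken from the study.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float, float]  # (ax, ay, az) of one accelerometer reading


def smv(sample: Sample) -> float:
    """Signal magnitude vector: Euclidean norm of the three components."""
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)


def window_length(fs_hz: float, half_window_s: float) -> int:
    """Number of samples in a window of +/- half_window_s around the peak."""
    return 2 * round(fs_hz * half_window_s) + 1


def extract_window(trace: Sequence[Sample], fs_hz: float,
                   half_window_s: float) -> Optional[List[Sample]]:
    """Cut the observation window around the SMV peak of one trace.

    Returns None when the trace is shorter than the requested window,
    so that window size cannot be studied for this trace.
    """
    n = window_length(fs_hz, half_window_s)
    if len(trace) < n:
        return None
    magnitudes = [smv(s) for s in trace]
    peak = max(range(len(trace)), key=magnitudes.__getitem__)
    half = (n - 1) // 2
    # Assumption: if the peak lies near an edge, shift the window inward
    # instead of padding, so the window is not centred in that case.
    start = min(max(peak - half, 0), len(trace) - n)
    return list(trace[start:start + n])
```

Under these assumptions, a ±2.5 s window at a hypothetical 200 Hz yields 1001 samples per axis, whereas a ±0.5 s window at 50 Hz yields only 51, so the same architecture faces very different input dimensions from one repository to another.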
From the tables, we can conclude that the performance of the system visibly depends on the employed benchmark. No clear trend can be deduced about the behavior of the detector: in some cases, an acceptable specificity (higher than 95%) was achieved at the cost of an inadmissible sensitivity, while for other datasets the detector seems to prioritize the sensitivity (regardless of the number of ADLs and falls included in the repositories). In some cases, the accuracy is even worse than that achieved with a random classification of the movements with balanced data (50%). No conclusions can be drawn either from the analysis of the importance of the observation window or from the choice of the input features.
This 'erratic' performance of the detector can be justified by the huge variability of the factors that affect the generation of the mobility patterns: parameters of the sensor (range, sampling rate), typology of the emulated movements, characteristics of the experimental subjects or configuration of the scenario of the simulations (pads, mattresses, etc.).
Consequently, in contrast to the procedure commonly followed by most works in the literature, we consider it essential to evaluate any proposal of a FDS against diverse databases before stating that the accuracy of the classifier has been validated.
In any case, this issue is just another example of the existence of a global problem associated with the research on fall detection systems: the lack of a commonly accepted methodology to evaluate and benchmark the new proposals on FDSs existing in the literature. This lack of consensus, which has been highlighted and discussed by different authors [70][71][72][73], does not only affect the employed datasets, but also other important operational aspects of the evaluation policy (performance metrics, availability of the code of the proposed algorithms, etc.). Thus, the problem of the proper selection (number and typology of movements, participants, etc.) of the datasets should be approached under a general procedure to generate a consensual framework that eases the comparison of FDSs.
Table 7. Comparison of the obtained performance metrics for the 14 datasets and the four different observation windows (T_W) around the peak when the acceleration magnitude (SMV) was used to feed the convolutional neural network (the results for the reference dataset, SisFall, are marked in bold). Note: * Some observation windows could not be applied to these datasets due to the short duration of the samples.
Table 8. Comparison of the obtained performance metrics for the 14 datasets and the four different observation windows (T_W) around the peak when the triaxial acceleration components were used to feed the convolutional neural network (the results for the reference dataset, SisFall, are marked in bold). Note: * Some observation windows could not be applied to these datasets due to the short duration of the samples.

Conclusions
The last decade has witnessed a considerable number of research efforts that propose new algorithms to detect falls based on the signals captured by wearable inertial sensors.
This work has evaluated the effectiveness of a Convolutional Neural Network (CNN) to discriminate falls and ADLs (Activities of Daily Living) from datasets containing accelerometry signals.
In contrast to other machine learning strategies, the system reduces the preprocessing required for the signals and avoids the 'handcrafted' selection of input features, as it takes advantage of the ability of CNNs to automatically extract knowledge from complex raw time series. Thus, the CNN is directly fed with the signals collected by an accelerometer. In particular, the analysis of the CNN focuses on the acceleration samples gathered during an observation window around the moment at which a peak in the acceleration magnitude is detected.
The performance of the classifier was initially evaluated by using one of the largest public datasets of movements with emulated falls. The achieved performance metrics (with specificity and sensitivity over 98%) revealed that the effectiveness improves when the three acceleration components (instead of the acceleration magnitude) are used as input features by the CNN. Then, the obtained architecture was trained and tested with the samples of 13 other public databases, collected with an accelerometer in a similar position but in different testbeds.
Results show that the dataset employed as a benchmark dramatically impacts the performance, which could be explained by the variability of the sensors and of the test configurations deployed in the testbeds of the different repositories. These results challenge the methodology usually followed by the related literature, by which the proposed algorithms are evaluated against one (or at most two) datasets.
By using a deep learning architecture, capable in principle of automatically extracting the most representative features of the training patterns, we have shown the difficulty of extrapolating the results achieved for a dataset when the same fall detection architecture is evaluated with samples of the same nature, but collected in different scenarios (employed sensors, sampling rate, type of movements, characteristics of the experimental subjects, etc.). Therefore, these results at the very least question the effectiveness of many machine learning mechanisms that achieve excellent performance metrics (with sensitivities and specificities close to 100%) when they are trained and validated with samples recorded in a very particular experimental setup. This conclusion is particularly relevant if we take into account that in a realistic application of a FDS, it is very unlikely that the system can be trained with samples of actual falls of the target user.
In any case, future studies should analyze in detail whether these limitations in the extrapolation capability of the FDS can be resolved if more sophisticated configurations of the CNN are tested or if other machine learning mechanisms (either the traditional feature engineering-based methods or other neural mechanisms such as Long Short-Term Memory (LSTM) networks) are considered.
To the best of our knowledge, this is the first proposal that utilizes up to 14 different public datasets to assess the effectiveness of a FDS. Further studies should be devoted to defining a framework to characterize the quality and representativeness of the existing datasets, so that a commonly accepted procedure to evaluate fall detection systems could be defined.
Author Contributions: Proposed the experimental setup, defined the architecture and the evaluation procedure, co-analyzed the results, elaborated the critical review, and wrote the paper, E.C.; programmed the classifier and executed the tests to evaluate the algorithms, R.L.-R.; proposed the neural architecture, co-discussed the results, and revised the paper, F.G.-L. All authors have read and agreed to the published version of the manuscript.