Automatically Segmenting Physical Performance Test Items for Older Adults Using a Doppler Radar: a Proof of Concept Study

Assessing the performance of physical activities through the modified physical performance test (mPPT) is a known approach for predicting the frailty level in older adults. This study proposes a system comprising a continuous-wave (CW) radar for data acquisition and deep neural network (DNN) models (a convolutional neural network (CNN) and a convolutional recurrent neural network (CRNN)) as classifiers to automatically segment the mPPT items. These two DNN models were trained and evaluated in a leave-one-participant-out (LOPO) cross-validation procedure with a transfer learning method. To segment the mPPT items during recording by the radar, an additional flag activity was employed, which involves having the volunteer wave their hands at the start of each activity. Compared to the CNN, the CRNN achieved better classification performance, with f1-scores ranging from 0.3445 (lifting a book) to 0.9509 (standing balance). The recognition result was then used to segment the time series data and predict each item's duration. The average absolute duration prediction error ranged from 0.78 s (standing balance) to 2.78 s (climbing stairs). These results imply that the system has the potential to segment mPPT items automatically. Future work will focus on evaluating all the scoring criteria automatically, e.g., the steadiness and continuity of steps while turning 360°, and on improving the low classification results of some mPPT items such as lifting a book.


I. INTRODUCTION

A. BACKGROUND
Daily living consists of multiple physical activities, such as walking and climbing stairs. The performance level of these Activities of Daily Living (ADL) reflects a person's physical functional status [1], [2]. Assessing the performance of ADL can be used to evaluate the frailty level [3]. The assessment result is beneficial to the older adults themselves for the evaluation of their ability to live independently [4]. Moreover, it is helpful to caregivers in administering individual healthcare services and to doctors or physiotherapists in determining or changing the course of treatment [5], [6].
Physical activity performance is evaluated via clinical scales, either in the form of a questionnaire, e.g., the Katz scale [7], the Barthel ADL index [8], and InterRAI [9], or a physical test, e.g., the modified Physical Performance Test (mPPT) [10] and the short physical performance battery [11]. Compared to a questionnaire, the result of a physical performance test is more objective and easier to quantify. The primary evaluation standard of a physical performance test is the duration needed to perform a certain task. In contrast, a questionnaire's standard is usually based on the performing frequency or the level of independence at which patients can finish required activities. However, there are still limitations to the physical performance test. It requires a dedicated location, such as a physiotherapist's practice, and the supervision of healthcare professionals. Therefore, physical performance can only be assessed by a clinician, not in the home situation and not sporadically.
To overcome these limitations, this study proposes a system that includes a contactless Doppler radar for monitoring the performance of the test items and a data processing unit for automatic segmentation of the test items. The study is the first stage towards automatic evaluation of the physical performance of older adults in their own home, prediction of their frailty level, and informing the health care providers on performance changes.

B. RELATED WORK
1) Frailty Level Prediction
Many studies have already been conducted to monitor the performance of physical activities by means of technology and to assess the frailty level or older adults' ability to live independently. Various types of sensors have been applied in this field as well. In the study of Tegou et al. [12], beacon sensors were used to collect indoor localisation information. Transition or movement features, e.g., the total number of passed rooms and the average time of room transitions, were extracted to predict the frailty level of older adults. In the study, frailty was grouped into three levels: non-frail, pre-frail and frail, following the Fried Frailty Index [13].
Wearable devices have also been applied in frailty prediction. In the study conducted by Razjouyan et al. [14], a chest band equipped with an accelerometer sensor was applied. This study aimed to find specific patterns which can be used to distinguish pre-frailty from non-frailty and frailty, based on the Fried Frailty Index. Their study proved that frailty was correlated with three patterns: physical activity pattern (walking, sitting, standing), physical activity behaviour (sedentary, light, moderate-to-vigorous) and stepping patterns. Abril-Jiménez et al. [15] used internet of things (IoT) and wearable devices (a wearable band and a smartphone) to assess frailty levels. This study verified that the walking speed, distance and number of steps were sensitive to changes in the level of frailty.
Aside from IoT and wearable devices, Doppler radars have also been used in frailty assessment. Compared to the two sensor types aforementioned, radars have two advantages. First, radars can collect movement signals without contact, which is convenient for the users. Second, a single radar sensor can detect more types of activities than an IoT device. Hence, fewer sensors are needed and the device setup is less complex compared to use cases with IoT devices. Although these studies have proved the feasibility of radars in detecting activities, only a few have attempted to predict frailty levels [16]–[19]. These studies will be discussed separately as follows.
Most of the studies using radars in frailty assessment focused on detecting walking and chair rising up & down activities. For example, Saho et al. [16] used one Doppler radar to monitor the movement of walking and chair rising up & down. Kinematic features were extracted from radar signals. The logistic regression model was then used to classify apathetic and non-apathetic status, which was one type of mental frailty. Features extracted from these two activities were input to the model separately and the classification performance was compared. With an f1-score of 0.822, chair rising up & down was proved to be more effective in detecting apathy than walking (0.696 for the f1-score).
Walking patterns were also analysed in the study of Hayashi et al. [17] and Alshamaa et al. [18]. In the study by Hayashi et al. [17], a continuous-wave (CW) Doppler radar was applied to record radar signals of walking from both younger and older adults who were recruited as participants. The long short-term memory (LSTM) model was used to classify the walking patterns of younger and older adults, with the spectrogram of radar signals as input to the model. The model achieved an accuracy of 94.9%. In the study by Alshamaa et al. [18], three walking patterns namely slow, usual, and fast walking were analysed. In each pattern, the walking speed was predicted, and different phases (walking towards the sensor, away from the sensor and turning) were automatically segmented using the statistical method proposed by researchers.
Other studies related to frailty were focused on detecting indoor physical activities. For example, Li et al. [19] applied a frequency-modulated continuous-wave (FMCW) radar to classify six activities: walking, sitting down, standing up, picking up an object from the floor, drinking water and falling. Three wearable devices attached to the wrist, waist, and ankle were used as well. The bidirectional-LSTM (Bi-LSTM) model was applied and compared with the support vector machine (SVM) model. The result verified that, while only using the radar signal, the Bi-LSTM model outperformed the SVM model, especially for picking up an object (86.2% vs. 66.0% for sensitivity) and drinking water (85.3% vs. 74.0%). After fusion with the wearable device on the wrist, the sensitivity of all the activities was above 91.0%.
It is worth noting that there are several limitations in the current studies:
• In the studies that predicted frailty level, frailty status was normally grouped into three levels, and many of the studies focused on predicting the pre-frailty level; however, the level of frailty severity, which is relevant to changes in healthcare services, has not been studied yet.
• Current studies also focused mostly on monitoring limited types of activities, usually walking, sitting and standing. Thus, they do not represent the physical functional status thoroughly.
• To the best of the authors' knowledge, very few studies [16]–[19] have attempted to automatically segment activity classes in the sequence data.

2) Activity Recognition and Segmentation
Activity segmentation is based on the result of activity recognition. With transfer learning [20], deep neural network (DNN) models designed for image or audio classification have outperformed traditional machine learning techniques in radar-based activity recognition. The detected activities vary from hand gestures with an FMCW radar [21], [22] or an impulse radar [23] to ADL with an FMCW radar [24] or a CW radar [25]. Compared to traditional machine learning techniques, DNN models have the following two advantages: firstly, instead of using handcrafted features from the spectrogram [26], [27], DNN models directly use spectrograms as input features, which simplifies the data pre-processing step; secondly, handcrafted features usually only include low-level statistical features, e.g., mean and standard deviation, and finding representative features is highly dependent on the researchers' experience and domain knowledge [28]. Several studies [24], [25], [29] have verified the potential of DNN models in activity recognition. Among DNN models, the convolutional neural network (CNN) is beneficial for feature extraction considering local dependency, which means that the extracted features at a position are related to adjacent positions [30]. On the other hand, the recurrent neural network (RNN) can extract sequential features. Two common types of RNN units are gated recurrent units (GRU) and LSTM. Li et al. [19] applied a Bi-LSTM model to classify six physical activities, and its performance was better than that of the SVM model. The studies by Du et al. [31] and Chun et al. [29] have also verified that the GRU unit outperforms the LSTM unit in training speed while achieving similar results. In addition, the convolutional recurrent neural network (CRNN) has also been applied to extract both the local dependency and sequence features.
Previously, one study [21] combined 3D CNN and LSTM units for hand gesture recognition, and another study [29] combined CNN and GRU units for recognising physical activities such as boxing, kicking, walking, jumping, running and standing.
In this study, both CNN and CRNN models are developed to classify a wider range of activities, and the classification result is used to segment the activities in the sequence data offline.

C. CONTRIBUTION OF THE STUDY
To sum up, the contributions of this study are:
• Inclusion of more types of activities. This study investigates not only basic physical activities, like walking and sitting, but also daily living activities, like wearing a jacket and picking up a coin. The applied frailty scale is the mPPT [10], which includes more types of activities and classifies the frailty status into more levels. The detailed information of the mPPT is introduced in Section III.
• Investigation of DNN models in recognising mPPT items. Instead of using traditional machine learning algorithms, two types of DNN models (the CNN and the CRNN models) are compared in recognising mPPT items, using the collected radar signals.
• Application of the transfer learning method to overcome the subject-difference impact. This study used the transfer learning method to increase the classification performance in the leave-one-participant-out (LOPO) mode, which is especially effective with a small dataset.
• Classification and segmentation of activities. Based on the segmentation result, this study also predicts the duration of each mPPT item, which is also the major evaluation criterion of this test.

The remainder of this article is organised as follows: Section II discusses the architecture and theory of the radar. The data collection, segmentation and activity recognition are introduced in Section III. Further, Sections IV and V present the results and discussion, followed by the conclusion in Section VI.

II. RADAR SENSOR

A. RADAR ARCHITECTURE
The architecture of the radar sensor employed for this study is shown in Fig. 1. This is a simple CW radar with a transmit path radiating CW signals at 5.8 GHz through the TX antenna. The radiated signal bounces off surfaces, and a portion of the reflected signal is picked up by the receive path of the radar sensor through the receiver (RX) antenna. By analysing the frequency shift of the received signals, the target's speed or its displacement about a fixed position relative to the radar sensor can be estimated. Thus, the received signals of different moving body parts of a human subject will have their own signatures, which can be used in activity recognition.
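The relation between a target's radial speed and the observed Doppler shift can be illustrated with a short sketch. The 5.8 GHz carrier is from the text; the speed values and the helper function are illustrative assumptions:

```python
# Illustrative sketch: Doppler shift seen by a 5.8 GHz CW radar
# for a target moving radially (assumption: pure radial motion).
C = 3.0e8          # speed of light, m/s
F_CARRIER = 5.8e9  # radar carrier frequency, Hz (from the text)

def doppler_shift(v_radial):
    """Return the Doppler frequency (Hz) for a radial speed in m/s."""
    wavelength = C / F_CARRIER            # ~0.0517 m at 5.8 GHz
    return 2.0 * v_radial / wavelength    # f_d = 2 v / lambda

# A slow walk of 1 m/s toward the radar shifts the echo by roughly 38.7 Hz,
# well inside the +/-125 Hz band of the complex baseband sampled at 250 Hz.
```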

The architecture of Fig. 1 was designed and fabricated at the ESAT-WAVECORE laboratory of KU Leuven. The capabilities of this contactless CW radar sensor in use cases such as fall detection, indoor localisation and vital sign detection have been investigated in [32]. Further application areas, such as physical activity analyses, are presented in this study.

B. RADAR THEORY
The transmitted signal of a CW radar is given as [33]

T(t) = cos(2πft + φ(t)),   (1)

where φ(t) is the phase noise of the oscillator, which results in random fluctuations in the signal's phase. The transmitted signal can be reflected off both stationary objects and targets with time-varying displacement x(t); thus, the distance between the transmitter and the reflections can be expressed as

d(t) = d_o + x(t).

The reflected signal received by the receiver antenna of the radar will be a time-delayed version of the transmitted signal, given as [33]

R(t) = T(t − t_d),   (2)

where the time delay is

t_d = 2d(t)/c.   (3)

Substituting t_d of (3) into R(t) of (2) and assuming that x(t) << d_o, the received signal is better expressed as (4):

R(t) ≈ cos(2πft − θ_o − 4πd_o/λ − 4πx(t)/λ + φ(t − 2d_o/c)).   (4)
Here, c is the speed of light, and λ is the wavelength. θ o is the sum of the phase difference between the mixer and the antenna and the phase shift at the target surface.
The received signal R(t) is down-converted in a quadrature mixer to obtain the baseband signal before sampling. Quadrature mixing happens in two steps. First, using a copy of the transmitted signal from the local oscillator, the quadrature mixer 'generates' two signals with phases that are π/2 apart. These signals can be called LI(t) and LQ(t) and are given as [33]

LI(t) = cos(2πft + φ(t)),   (5)
LQ(t) = sin(2πft + φ(t)).   (6)

The signals are then mixed with the received signal to produce two quadrature baseband signals (otherwise known as the in-phase and quadrature signals) that are also π/2 apart. The baseband signals are given as [33]

B_I(t) = cos(θ_o + 4πd_o/λ + 4πx(t)/λ + Δφ(t)),   (7)
B_Q(t) = sin(θ_o + 4πd_o/λ + 4πx(t)/λ + Δφ(t)),   (8)

where Δφ(t) = φ(t) − φ(t − 2d_o/c) is the residual phase noise.
The ADC of the micro-controller on board the radar sensor, as shown in Fig. 1, is then used to sample the I and Q baseband signals on separate channels. The sampling rate is 250 samples per second for each channel, at a 10-bit resolution. Further, the UART protocol is used to transfer the data from the sensor board to the Raspberry Pi microcomputer. Prior to data transfer, byte mapping is done such that four 'do not care' bits are added to the 20 bits from the two channels of the ADC, and the resulting 24 bits are mapped into three bytes. The byte mapping needs to be done carefully, by strategically positioning the 'do not care' bits, to avoid transmitting bytes of 'all zeros' (i.e., eight consecutive zeros). Transmitting more than eight consecutive zeros would violate the requirements of the UART data transfer protocol.
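As an illustration of the byte mapping, the sketch below packs two 10-bit samples into three bytes. The exact positions of the 'do not care' bits are not specified in the text, so the layout here (forcing the most significant bit of every byte, plus one spare bit, to 1) is an assumption that merely demonstrates how all-zero bytes can be avoided:

```python
def pack_iq(i_sample, q_sample):
    """Pack two 10-bit ADC samples into 3 bytes.

    Hypothetical layout: the MSB of every byte and the LSB of the last
    byte are forced to 1, so no transmitted byte can be all zeros.
    """
    assert 0 <= i_sample < 1024 and 0 <= q_sample < 1024
    b0 = 0x80 | (i_sample >> 3)                             # 1 | I[9:3]
    b1 = 0x80 | ((i_sample & 0x7) << 4) | (q_sample >> 6)   # 1 | I[2:0] | Q[9:6]
    b2 = 0x80 | ((q_sample & 0x3F) << 1) | 0x01             # 1 | Q[5:0] | 1
    return bytes([b0, b1, b2])

def unpack_iq(pkt):
    """Recover the two 10-bit samples, discarding the forced bits."""
    b0, b1, b2 = pkt
    i_sample = ((b0 & 0x7F) << 3) | ((b1 >> 4) & 0x07)
    q_sample = ((b1 & 0x0F) << 6) | ((b2 >> 1) & 0x3F)
    return i_sample, q_sample
```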
To prepare the acquired baseband data for analysis, the 'do not care' bits are removed from the received UART packets, and demodulation is performed. The demodulation technique implemented here is the complex signal demodulation proposed by Li and Lin [34]. Unlike arc-tangent demodulation, this technique is not susceptible to the DC offset problem. To remove DC offset effects, the technique subtracts an average signal computed over a defined sliding window of the time-domain signal. A complex signal, S, is generated in the demodulation technique from

S(t) = B_I(t) + j·B_Q(t) = exp[j(θ_o + 4πd_o/λ + 4πx(t)/λ + Δφ(t))].   (9)

This complex signal is then used to generate spectrograms, as discussed in the Data Analysis subsection III-B.
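A minimal sketch of this demodulation step is given below, assuming a 1 s sliding window at the 250 Hz sampling rate for the DC estimate (the window length is not specified in the text):

```python
import numpy as np

def complex_demodulate(i_sig, q_sig, win=250):
    """Sketch of complex signal demodulation (after Li and Lin [34]).

    Forms S = I + jQ and removes the DC offset by subtracting a moving
    average over a sliding window. The window length `win` is an
    assumption (here 1 s at the 250 Hz sampling rate).
    """
    s = np.asarray(i_sig, dtype=float) + 1j * np.asarray(q_sig, dtype=float)
    # Moving-average DC estimate; 'same' keeps the output length equal
    # to the input length (edges see a partially filled window).
    kernel = np.ones(win) / win
    dc = np.convolve(s, kernel, mode='same')
    return s - dc
```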

III. METHODS
A. DATA COLLECTION

1) MPPT Test
As aforementioned in Section I-C, activities included in existing physical tests will be detected. Among the available physical tests, the mPPT [10] is selected for this study. This test comprises nine activity items: standing balance (performed three times, with the feet side-by-side, in semi-tandem and in full tandem, respectively), chair rising up & down, lifting a book, putting on and removing a jacket, picking up a coin, turning 360°, walking 15.24 m, climbing one flight of stairs (= 10 steps), and climbing stairs (a maximum of two times up and down). The duration, i.e., the time taken to perform each activity, is the evaluation standard of most mPPT items, except turning 360° and climbing stairs. Turning 360° is evaluated based on the continuity and steadiness of the steps, and climbing stairs is evaluated by the number of completed flights. Also, the parameters of turning 360° and climbing stairs cannot be completely monitored using only the radar. Thus, this study focuses on segmenting items in the sequence data and predicting the duration of each item. The frailty levels are classified as non-frail, mildly frail, moderately frail and unlikely to function in the community. The mPPT has the following advantages: firstly, it includes activities making use of both upper-body and lower-body movements, and these items are daily activities familiar to older adults; secondly, the whole test is short and can be completed in less than 30 minutes [10]; thirdly, the scoring system of the activities can be easily quantified; lastly, it classifies frailty levels in more detail than other scales.

2) MPPT performance protocol
The performance protocol of each item in the mPPT is outlined as follows, and participants were required to perform the items in the following order:
1) Standing balance: The participant was required to stand on his/her feet for a maximum of ten seconds in three foot positions: side-by-side, semi-tandem and full tandem. During the performance, the participant was required to keep still and avoid making any movement.
2) Chair rising up & down: The participant was required to stand up and sit down repeatedly five times with his/her arms across the chest.
3) Lifting a book: The participant was required to lift a 2.5 kg book from waist height to a shelf 30 cm above shoulder level while sitting. Before starting, the participant sat on the stool with his/her hands placed at the body side.
4) Putting on and removing a jacket (wearing a jacket): The same jacket was used during the experiment. After the start, the participant was required to put on the jacket and then remove it completely.
5) Picking up a coin: One coin was placed 30 cm in front of the participant's foot on the dominant-hand side. The participant was required to use the dominant hand to pick up the coin from the floor and stand upright afterwards.
6) Turning 360°: The participant was required to turn 360°. The duration was not recorded in the original item description, but the steadiness and the ability of the person being evaluated to produce a continuous turning movement were subjectively graded.
7) Walking 15.24 m: The participant was asked to walk 7.62 m straight down a path and then walk back to the starting point along the same straight path.
8) Climbing one flight of stairs (ten steps): The participant was asked to climb up ten steps.
9) Climbing stairs: The participant was required to climb flights a maximum of four times. The score was graded based on the number of completed flights.
To mark the activities, waving hands was included as a flag activity. Before the start and after the end of every item, the participant was asked to wave their hands three times towards the radar. Hence, ten activity classes are to be recognised, and the activities were performed one after the other in the same order without breaks.

3) Experimental Setup
Illustrations of the experimental setup are given in Fig. 2. Fig. 2a and 2b depict the setup in a room for performing items 1-7, and Fig. 2c illustrates the setup on a staircase for items 8-9. In the room, a stool with a height of 0.59 m was placed 1 m in front of the radar sensor. The sensor itself was positioned on a stand one metre above ground level. A bookshelf, on which a 2.5 kg book was placed, stood 0.5 m to the right side of the stool, and a jacket was provided. In the setup on the staircase, the radar was placed midway, which enables it to detect all movements. Items performed in each environment were measured separately, and the data collected in the two environments were finally concatenated, following the order of items in the protocol.

4) Recruitment of participants
Eight volunteers, five males and three females, participated in the experiment. The participants were aged 24 to 33 years, with heights ranging from 160 to 183 cm. During the experiment, they were asked to imitate the walking style of older adults by lowering their walking speed. To obtain an extensive dataset and ensure the reliability of the mPPT results, every participant repeated the test set three times. During the experiment, two researchers were present (outside the radar's field of view), one for recording activities and the other for giving instructions. The experiment was approved by the research ethics committee UZ/KU LEUVEN (EC RESEARCH), with the assigned serial number S62736. All participants voluntarily participated in the experiment by signing the informed consent form.

B. DATA ANALYSIS

Fig. 3 illustrates the procedure of the mPPT items segmentation. With segmentation, the duration of every mPPT item is predicted. The procedure is split into two stages. In the first stage, the radar signal is sliced with a fixed-size sliding window, and activities are recognised at the window level. In the second stage, windows are grouped to segment mPPT items in the sequence stream based on the prediction result of the first stage, from which the duration of each item is calculated.

1) MPPT items Recognition on the window level
The collected radar signals are first sliced based on the annotation record, then demodulated and used to generate the dataset for training the classifiers. Each step is explained in detail as follows:

a: Windowing
After the demodulation introduced in Section II, the radar signals were sliced by a sliding window. Among the activities, the shortest duration was 2 s, for waving hands. Thus, segment window sizes of 0.5 s, 1 s and 2 s were tested. Besides, overlap values of 50%, 75% and 95% were tested. The detailed information can be found in [35]. Initial experiments indicated that the 2-second window with 95% overlap performed best in terms of recognition performance.

b: Spectrogram Generation

Spectrograms are usually derived from the radar signal as they provide both time and frequency domain information. Furthermore, spectrograms generated from radar signals have already been verified to be feasible in activity recognition [24], [25], [36], [37]. The short-time Fourier transform (STFT) is used for transforming the signal to the frequency domain. Within the 2 s window, the STFT is computed with a Hamming window with a length of 128 data points, shifted by 1 data point each time. A sample output spectrogram of the process is shown in Fig. 4a, where it can be seen that the energy is mostly larger than −50 dB over the whole frequency range. This high energy content is due to harmonics and multiple reflections. To filter out the high energy content, −50 dB is used as the energy threshold value. The filtering result is shown in Fig. 4b.
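The spectrogram step can be sketched as follows, using a numpy-only STFT. The 128-point Hamming window and one-sample shift are from the text; the dB reference, normalisation and the clipping of values below −50 dB to the floor are assumptions. Note that a 2 s window at 250 Hz (500 complex samples) yields 500 − 128 + 1 = 373 time frames, matching the input size quoted for the CNN:

```python
import numpy as np

def spectrogram_db(s, nfft=128, hop=1, floor_db=-50.0):
    """Sketch of spectrogram generation from the complex baseband signal.

    STFT with a 128-point Hamming window shifted one sample at a time,
    two-sided (the input is complex), converted to dB and floored at
    -50 dB. The exact filtering rule and dB reference are assumptions.
    """
    win = np.hamming(nfft)
    n_frames = (len(s) - nfft) // hop + 1
    frames = np.stack([s[k * hop : k * hop + nfft] * win
                       for k in range(n_frames)])
    # Two-sided spectrum, centred on 0 Hz (negative = moving away).
    spec = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    power_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    return np.maximum(power_db, floor_db).T  # (freq bins, time frames)
```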
As observed in Fig. 4a and 4b, the energy of the signal around 0 Hz is relatively high (in dark red). This energy indicates the presence of reflections from stationary objects in the background of the environment. As these objects are stationary, their reflected signals contribute to the energy level around 0 Hz. Therefore, this signal masks the signal related to low-frequency activities, which is evident in Fig. 4b, where the DC energy is masking the information of the participant's low-frequency movement. To remove this DC-level energy, the mean value of the complex signal S of (9) is subtracted, with an example spectrogram of each mPPT item shown in Fig. 5.

c: Model Training

As previously stated, two DNN architectures are investigated: the CNN model and the CNN-GRU (CRNN) hybrid model, the architectures of which are shown in Fig. 6.
All models are trained and tested in the LOPO mode, which means that at each iteration, the dataset of one participant is taken as the test dataset, the dataset of another participant is taken as the validation dataset, and the datasets of the other six participants are used for training. The models are trained on the Google Cloud Platform [38], with an NVIDIA TESLA P100 GPU (with 16 GB of memory). The training batch size is 128. During model training, two steps, data balancing and hyper-parameter tuning, are performed.
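The LOPO split can be sketched as below; the rule for choosing the validation participant is not stated in the text, so rotating to the next participant is an assumption:

```python
def lopo_splits(participants):
    """Sketch of the leave-one-participant-out split.

    At each iteration one participant is held out for testing, one other
    is used for validation (assumption: the next participant in order),
    and the remaining six are used for training.
    """
    n = len(participants)
    for i in range(n):
        test = participants[i]
        val = participants[(i + 1) % n]
        train = [p for p in participants if p not in (test, val)]
        yield train, val, test
```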
Data balancing: The data balancing technique is applied to the imbalanced dataset. Considering the computational cost, the spectrogram dataset is balanced by randomly oversampling the smaller classes and undersampling the larger classes, finally achieving the same number (1500) of windows for each class.
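A sketch of the balancing step, assuming plain random sampling (the text does not specify the sampling scheme beyond "randomly"):

```python
import random

def balance_windows(windows_by_class, target=1500, seed=0):
    """Sketch of the balancing step: classes with fewer than `target`
    windows are randomly oversampled (with replacement), larger classes
    are randomly undersampled, so every class ends up with `target`
    windows. The fixed seed is only for reproducibility of the sketch."""
    rng = random.Random(seed)
    balanced = {}
    for label, windows in windows_by_class.items():
        if len(windows) >= target:
            balanced[label] = rng.sample(windows, target)  # undersample
        else:
            balanced[label] = [rng.choice(windows)
                               for _ in range(target)]     # oversample
    return balanced
```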
Hyper-parameter tuning: The CNN model's architecture is the same as the model in [24], with the same kernel size and number of filters in the convolutional and max-pooling layers. As illustrated in Fig. 6, in the CRNN model, one GRU layer is added after the last max-pooling layer. The tuned parameters and their search ranges included: 1) the number of neurons of the first dense layer. The Tree-structured Parzen Estimator (TPE) approach [39] is used for hyper-parameter tuning of the classifiers due to its high efficiency. The optimal combination of hyper-parameter values is tuned for 60 iterations.
The evaluation metric is the f1-score because the dataset is imbalanced. To avoid over-fitting, the early stopping method is used, with the macro-averaged f1-score on the validation data as the monitored metric. When the f1-score has not increased by more than 0.005 after five epochs, model training is stopped.
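The early-stopping rule can be sketched as:

```python
class EarlyStopping:
    """Sketch of the early-stopping rule: training stops once the
    validation macro f1-score has not improved by more than 0.005
    for five consecutive epochs."""

    def __init__(self, min_delta=0.005, patience=5):
        self.min_delta, self.patience = min_delta, patience
        self.best, self.wait = float('-inf'), 0

    def should_stop(self, val_f1):
        if val_f1 > self.best + self.min_delta:
            self.best, self.wait = val_f1, 0  # meaningful improvement
        else:
            self.wait += 1                    # stagnating epoch
        return self.wait >= self.patience
```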
The models are trained in two scenarios: 1) the model is trained using the spectrogram matrices in LOPO mode; 2) transfer learning is applied to the best-performing classifier of Scenario 1. In transfer learning, the dataset of one manually segmented mPPT test of the testing participant is included in the training set, which can be seen as a system calibration. The datasets of the remaining two tests are used for testing.

2) MPPT Items Segmentation
The predicted labels of the windows are processed to calculate the duration of each mPPT item by counting the number of windows each item covers.

a: Processing of Windows Prediction Result
Considering that a performed activity cannot change abruptly, this step is mainly used to filter out prediction errors via majority voting. As illustrated in Fig. 7, a sliding window is shifted through the sequence of predicted labels of the windows. After trials, the voting window size was set to 2.8 s (nine predicted windows), which is the median duration of waving hands. Within one voting window, the number of occurrences of each predicted label is counted. The label of the first predicted window in the voting window is reassigned to the label with the highest occurrence count. If more than one label shares the highest count, the label appearing first in the window is chosen.
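A sketch of the voting filter (the tie-break follows the rule stated above; label names are illustrative):

```python
from collections import Counter

def smooth_labels(labels, vote_len=9):
    """Sketch of the majority-voting filter: a window of nine consecutive
    predicted labels is slid over the sequence, and the first label in each
    window is replaced by the most frequent label; ties go to the label
    that appears first in the window."""
    smoothed = list(labels)
    for start in range(len(labels) - vote_len + 1):
        window = labels[start:start + vote_len]
        counts = Counter(window)
        # Highest count wins; on a tie, the smaller first-occurrence index wins.
        best = max(counts, key=lambda lab: (counts[lab], -window.index(lab)))
        smoothed[start] = best
    return smoothed
```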

b: Segmentation by Flag Activity
A flag activity is included in each measurement to mark the start of the items in the mPPT. As this is an initial study and other unrelated activities are not considered, the item sequence follows the order: flag activity, item 1, flag activity, item 2, flag activity, ..., item 9, flag activity. Waving hands is thus used to segment the mPPT items: between every two adjacent waving-hands events there is exactly one mPPT item. The class of the item is decided using majority voting as well.
For instance, as illustrated in Fig. 7, turning 360° was partly predicted as wearing a jacket. Hence, after segmentation by waving hands, that interval includes two classes; however, after the second round of majority voting, it is identified solely as turning 360°.
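The flag-based segmentation described above can be sketched as follows; the label names are hypothetical placeholders:

```python
from collections import Counter
from itertools import groupby

def segment_by_flag(labels, flag='waving hands'):
    """Sketch of flag-based segmentation: the smoothed label sequence is
    split at runs of the flag activity, and each interval in between is
    assigned its majority label (ties going to the label appearing first).
    Returns (label, window count) pairs, one per mPPT item."""
    segments, current = [], []
    for label, run in groupby(labels):
        if label == flag:
            if current:  # flush the interval between two flag events
                counts = Counter(current)
                best = max(counts,
                           key=lambda lab: (counts[lab], -current.index(lab)))
                segments.append((best, len(current)))
                current = []
        else:
            current.extend(run)
    if current:  # trailing interval without a closing flag
        counts = Counter(current)
        best = max(counts, key=lambda lab: (counts[lab], -current.index(lab)))
        segments.append((best, len(current)))
    return segments
```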

c: Calculation of Duration
After segmentation, the duration of each item is predicted by converting the number of windows it covers into time. The results are shown in Section IV.
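A sketch of this conversion, assuming an item's span covers (n − 1) hops plus one full window; with the 2 s window and 95% overlap this reproduces the 2.8 s quoted earlier for nine windows:

```python
def windows_to_seconds(n_windows, win_len=2.0, overlap=0.95):
    """Sketch of the duration conversion: with 2 s windows and 95% overlap
    the hop between consecutive windows is 0.1 s, so n windows span
    (n - 1) hops plus one full window. Whether the full window length or
    only the hops are counted is an assumption."""
    hop = win_len * (1.0 - overlap)  # 0.1 s for the settings in the text
    return (n_windows - 1) * hop + win_len
```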

IV. RESULTS
A. MPPT GROUND-TRUTH RESULT
During item prediction, climbing stairs is separated into two classes: going upstairs and going downstairs. In addition, climbing one flight is also labelled as going upstairs. As a result, ten classes of activities are investigated in total, including waving hands. Table 1 lists the duration of each activity and the number of sliced windows.

B. RECOGNITION OF MPPT ITEMS
The architectures of the DNN classifiers are shown in Fig. 6. The f1-scores of each classification model are listed in Table 2. Compared to the CNN model, the CRNN hybrid model improved the average f1-score of most activities, including waving hands, while the standard deviation (std) of the f1-score of six activities (standing balance, chair rising up & down, wearing a jacket, turning 360°, walking and going upstairs) also decreased.
In the second scenario, where transfer learning was applied to the CRNN hybrid model by including part of the testing participant's data in the training dataset, the recognition result of each activity increased significantly. Especially for lifting a book, turning 360°, picking up a coin, going upstairs and waving hands, the average f1-score increased by over 0.1 compared to the CRNN hybrid model without transfer learning. The detailed results of each activity using the transfer learning method are illustrated in Fig. 8 and Table 3.

(Fig. 6 caption: The CNN architecture follows [24]. The input size is 128 × 373 × 1, with 128 representing the number of frequency levels, 373 the number of timestamps, and 1 the single channel, as the input is the spectrogram matrix. The kernel size of the convolutional unit is (5, 5), with a stride of 1. The numbers of filters of Conv1, Conv2 and Conv3 are 5, 4 and 2, respectively. Each convolutional layer also includes zero-padding. The kernel size of the max-pooling unit is (2, 2), with a stride of 2.)
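As a quick sanity check of the stated dimensions, the feature-map sizes can be traced through the three conv/pool blocks (assuming the zero-padded, stride-1 convolution preserves the spatial size and the 2 × 2 pooling uses floor division):

```python
def trace_cnn_shapes(h=128, w=373, filters=(5, 4, 2)):
    """Sketch tracing feature-map sizes through the CNN described in the
    text: three blocks of a zero-padded 5x5 convolution (stride 1, spatial
    size preserved) followed by 2x2 max-pooling with stride 2."""
    shapes = [(h, w, 1)]
    for f in filters:
        h, w = h // 2, w // 2  # pooling halves both dimensions (floored)
        shapes.append((h, w, f))
    return shapes
```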
Considering the result of each activity, lifting a book has the lowest f1-score, 0.3445, which implies that the radar is less sensitive to movements orthogonal to its signal transmission direction. Going downstairs and going upstairs are also confused with each other, which is due to the placement of the radar midway up the staircase. It could also be due to the similarity of the radar signals of climbing up or down the stairs, which happen in opposite directions. A limitation of the radar device is its inability to provide information about the direction or angle of arrival; it provides only the distance and speed. This could explain why going up/down the stairs was misclassified as walking. An implementable solution for future studies would be the use of MIMO radars and placement of the radar on the ceiling for a better 3D view.

Table 4 displays the absolute errors of the predicted duration of each activity for every participant. For each participant, two mPPT test sequences were evaluated. Hence, the results shown in Table 4 are the average errors of the two sequences. Among the activities, standing balance has the smallest average error, at 0.78 s. On the contrary, climbing stairs has the largest error, 2.78 s. The errors could be due to the recognition performance. Fig. 7 illustrates this: a part of the turning 360° is misclassified as wearing a jacket.
Comparing among participants, participants 3 and 6 (P3 and P6) had the largest average errors. This results from their low activity recognition results, especially for waving hands: each time these participants waved their hands, they did so with a different movement amplitude and speed. Sample spectrograms are shown in Fig. 9. In addition, waving hands had a shorter duration; thus, the number of training examples was limited.

A. DEVICES SETUP
The results of recognition and duration prediction verified that radars can be used for monitoring mPPT items; however, the current device setup still has limitations. In this study, a single radar setup was used to detect all the indoor activities. In addition, as mentioned above, the radar sensor is less sensitive to movements orthogonal to its signal transmission direction; therefore, the activity recognition performance is affected by the relative movement direction. For example, when lifting a book, the bookshelf was placed to the side of the radar sensor during the experiment, as presented in Fig. 2b, so the movement was orthogonal to the radar signal transmission direction. Therefore, the recognition result for lifting a book is poorer than for other activities. Additionally, the radar has a limited range and angle of detection; hence, participants were required to perform the activities only at specified locations. To solve these problems in a future implementation, the radar can be improved to have a wider field of view, or a multiple-input and multiple-output (MIMO) variant that enables spatial filtering can be used. Another option is to add radars at more strategic locations, such as next to the participant with the radar beam pointing at the bookshelf.
Going upstairs and going downstairs are confused with each other, as shown in the confusion matrix of Fig. 8. This could be related to the hardware limitation on detectable distance, with the radar placed halfway along the stairs. Hence, the signals collected for these two activities are similar to each other: walking towards and then away from the sensor. This can also be solved by using multiple radar devices or by improving the detectable distance of the radar.

B. PARTICIPANTS
As this is a preliminary study, data were collected from younger adults to test the feasibility of the system in a laboratory setting. Although they tried to mimic older adults and imitate their cautious gaits, some movements typical of older adults could not be perfectly replicated. For example, an older adult's body is likely to sway, or their hands may reach for support to keep balance during the standing balance test; the younger participants, with better body function, remained stationary during the test without making any movement. Hence, the efficacy of the system in classifying and segmenting mPPT items in real applications will be investigated in future work by performing a real-life experiment with older adults as participants. Finally, the small number of participants and the slight variance in their frailty status limited the classifiers' training. Future studies will address these drawbacks by obtaining a larger dataset through recruiting older adults at different frailty levels. A software application with instructional videos will also be considered to help older adults learn how to perform the items.

C. CLASSIFICATION
1) Comparison with other studies
To the best of the authors' knowledge, no studies have yet made use of radars to recognise or segment mPPT items. Hence, only the recognition results are compared with previous studies that included some of the activities, as listed in Table 5. All the studies trained the model in user-independent mode, either LOPO or using only part of each participant's data for training. Because the information provided in these papers varies, both the sensitivity and the f1-score are reported.
Compared to other studies, the proposed method does not perform best on all activities. The reasons could be that: 1) the number of participants was much smaller than in other studies, so the dataset was smaller, which affects the training of the DNN models; 2) our study included ten activities, which increases the complexity of the classification task; 3) some activities comprise multiple postures, e.g., chair rising up & down can be split into standing up and sitting down; hence, the classification performance for an activity can be affected by its movement complexity, i.e., how many postures it comprises. This study still has two advantages: 1) more types of activities were included, especially some complex activities like wearing a jacket; 2) the activities in this study are not only recognised but also segmented automatically in continuous sequence data.

2) Comparison between models
With the addition of the RNN layer to form the CRNN model, the classification performance of eight activities improved. This is clearly observed in the f1-scores of wearing a jacket, picking up a coin, and turning 360°, which increased by more than 0.15, indicating the effectiveness of the sequential information. This is also because these activities have similar movement patterns among participants.
With the introduction of transfer learning, the classification performance improved significantly. The reason could be that the trained model is not robust enough to cover all the participants' variability, especially participants at different frailty levels. Furthermore, in actual applications or deployment, a one-time measurement of the mPPT will be sufficient to adapt the classifier to new users. This initial test can be conducted while new users are getting familiar with the system and learning how to perform the test on themselves. The training process can be implemented with a user interface showing instructional videos.
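The described procedure can be sketched as a LOPO split in which a small portion of the held-out participant's data is moved into the training set. The calibration fraction and data layout below are illustrative assumptions, not the study's exact settings:

```python
import random

def lopo_transfer_splits(data_by_participant, calib_frac=0.2, seed=0):
    """Yield (held_out, train, test) for each leave-one-participant-out fold.

    data_by_participant: {participant_id: [examples]}.
    calib_frac: fraction of the held-out participant's examples moved into
    training (the transfer-learning step); 0.0 recovers plain LOPO.
    """
    rng = random.Random(seed)
    for held_out, examples in data_by_participant.items():
        # all data from the other participants forms the base training set
        train = [x for pid, xs in data_by_participant.items()
                 if pid != held_out for x in xs]
        shuffled = examples[:]
        rng.shuffle(shuffled)
        n_calib = int(len(shuffled) * calib_frac)
        train += shuffled[:n_calib]          # calibration data from the new user
        test = shuffled[n_calib:]            # evaluate on the remainder
        yield held_out, train, test
```

With `calib_frac=0.2` and ten examples per participant, each fold trains on all other participants' data plus two calibration examples and tests on the remaining eight, so calibration and test data never overlap.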

3) Comparison among activities
Among the ten activities, standing balance achieves the best performance with an average f1-score of 0.9509, followed by chair rising up & down (0.8510) and walking 15.24 m (0.8406). This could be because the movements of these activities have more distinguishable patterns. As shown in Fig. 5, chair rising up & down is a periodic activity, similar in this respect to walking 15.24 m. In addition, these two activities include only simple postures: getting up and sitting down for chair rising up & down, and walking for walking 15.24 m. Moreover, waving hands also achieves a high f1-score of 0.8336, because participants waved their hands close to the radar sensor (5 cm away); hence, the collected signals also have distinct patterns. In contrast, the results for lifting a book (f1-score of 0.3445) and going downstairs (f1-score of 0.4628) are the lowest. This is suspected to be due to the device setup, as discussed in Section V-A.

4) Comparison among participants
Within the group of eight participants, participants 4 (P4) and 8 (P8) obtained the lowest average results, as shown in Fig. 3. This mainly resulted from low results for lifting a book, going upstairs, or turning 360°. The low scores arose mainly because these two participants performed these activities much more slowly than average, or significantly differently from the others. For example, while going upstairs, participant 8 stopped halfway for a while to mimic the situation in which older adults may feel tired, and then continued. Thus, the spectrogram of that period did not resemble the going-upstairs spectrograms of other participants, leaving only a small number of windows with such special situations for the classifiers to learn from. It could be interesting for future research to recruit participants at different frailty levels and to conduct experiments in more realistic environments to help improve the recognition performance of the classifiers.

D. SEGMENTATION
In the presented study, although the participants were asked to perform the test items in a fixed order, the order was not used during the segmentation of the mPPT items. Thus, in future applications, users can perform the items in their preferred order without restrictions. As the results show, the duration prediction error for most activities is within 1 s; however, this result relies on the high recognition performance of the flag activity, namely waving hands. A future study will aim to increase the recognition performance of some mPPT items, like lifting a book and going up-/downstairs, so that participants no longer need to wave their hands during the mPPT test. Extra attention will be paid to monitoring arm/hand movements. This can be achieved by adding an extra radar to the current setup, placed at 90 degrees to the current radar. The assistance of wearable sensors can also be explored.
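A simplified sketch of flag-based segmentation is given below. It assumes per-window labels already produced by a classifier, a hypothetical window hop of 0.5 s, and illustrative label names; the study's actual pipeline applies majority voting to the DNN's sliding-window outputs in the same spirit:

```python
from collections import Counter

HOP_S = 0.5  # assumed hop between classification windows, in seconds

def smooth(labels, k=5):
    """Majority vote over a centred window of k labels to suppress flicker."""
    half = k // 2
    return [Counter(labels[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]

def segment_by_flag(labels, flag="wave"):
    """Split the label sequence at flag-activity runs and return
    (majority label, duration in s) for each segment between flags."""
    segments, current = [], []
    for lab in labels + [flag]:               # sentinel flag flushes last segment
        if lab == flag:
            if current:
                majority = Counter(current).most_common(1)[0][0]
                segments.append((majority, len(current) * HOP_S))
                current = []
        else:
            current.append(lab)
    return segments

# toy sequence: wave -> standing balance -> wave -> walking
windows = ["wave"] * 3 + ["balance"] * 20 + ["wave"] * 3 + ["walk"] * 30
print(segment_by_flag(smooth(windows)))
```

Because segments are delimited purely by the flag activity, the items can appear in any order, which is why a fixed test-item order is not required.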

E. ASSESSMENT OF MPPT ITEMS
The reported study still has limitations in detecting the different foot positions of standing balance and in extracting the evaluation metrics of turning 360° and climbing stairs. For standing balance, extra sensors such as pressure mats can be added to differentiate the three foot positions (side-by-side, semi-tandem, full-tandem). Such a mat can also be used to evaluate the steadiness and continuity of the steps while turning 360°. For climbing stairs, the situation in which participants feel exhausted and stop halfway must be taken into account. The proposed system directly follows the principles of frailty testing in the mPPT, from detecting the activity to calculating the score and finally computing the frailty level. To improve the system, future work will aim at analysing the typical forms of these activities by mapping them directly to frailty levels. In that case, the intermediate steps of recognising mPPT items and predicting their durations can be skipped, and features directly related to frailty will be investigated.
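Once each item's duration is predicted, computing its mPPT score reduces to comparing the duration against time cut-offs. The sketch below uses purely hypothetical thresholds; the official mPPT scoring tables are not reproduced here:

```python
def score_item(duration_s, thresholds):
    """Map an item's duration to a 0-4 mPPT-style score.

    thresholds: ascending durations (s) awarding scores 4, 3, 2, 1;
    anything slower than the last threshold scores 0.
    The cut-offs passed in are hypothetical, for illustration only.
    """
    for score, limit in zip((4, 3, 2, 1), thresholds):
        if duration_s <= limit:
            return score
    return 0

# hypothetical cut-offs for a timed item such as "walking 15.24 m"
walk_thresholds = (15.0, 20.0, 25.0, 30.0)
```

Summing such per-item scores would then yield the overall mPPT score from which the frailty level is derived.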

VI. CONCLUSION
This study was aimed at developing a system that automatically segments physical performance test items. To accomplish this, a CW radar device setup was used to receive the movement signal of each mPPT item, and a DNN model was applied to recognise and segment items of the mPPT test.
The results verify the feasibility of using the radar device to collect movement signals and of the CRNN hybrid model as a suitable model for segmenting mPPT items. Using the CRNN hybrid model with the majority voting method, a fixed order of performing the mPPT items is not necessary while segmenting the sequence data. This study also demonstrates the effectiveness of the transfer learning method in increasing the robustness of the classification applied to unknown users. However, some limitations still need to be investigated further, such as the low classification performance for some mPPT items (like lifting a book), the requirement to perform the flag activity, and the fact that not all evaluation criteria are yet accomplished automatically. To overcome some of these limitations, further studies will focus on the radar sensors and on the participants.