Metrics for Performance Evaluation of Patient Exercises during Physical Therapy

Functional recovery from neuromotor disabilities, various surgical procedures, or musculoskeletal trauma depends strongly on patient participation in a physical therapy program. Although a large portion of all therapy exercises is performed by patients in a home-based setting, the lack of supervision and of motivation for continued involvement in the therapy program in an outpatient environment contributes to low adherence to prescribed treatment regimens [1]. The work presented in this article was motivated by our belief that the latest progress in machine learning can be harnessed for analysis and monitoring of patient progress toward recovery during in-home physical rehabilitation, and can accordingly benefit both patients and healthcare providers.


Introduction
The recent rapid advancements in artificial intelligence (AI), driven predominantly by its subfield machine learning, are reflected in ubiquitous deployment across a wide spectrum of application domains, ranging from image-, text-, and voice-processing apps in smartphones and computers to autonomous cars and personalized recommender systems.
It is expected that, as the field evolves further in the years to come, AI-enabled systems will have an even more pronounced and transformative impact on society as a whole and on all aspects of our lives as individuals.
In the medical field, the number of machine learning applications has proliferated recently owing to the demonstrated capacity for discovering complex patterns by analysing large numbers of electronic medical records. Not surprisingly, the most notable medical AI success has been in the domain of medical image processing. For example, the medical team at DeepMind has applied deep artificial neural networks (ANNs) to the analysis of digital scans of the eye in the diagnosis of age-related macular degeneration and diabetic retinopathy [2], and to the analysis of radiotherapy scans for detection of oral and neck cancer [3]. Other exemplary AI applications include image processing of skin lesions in screening for and detection of melanoma [4], and image processing of scans for detection of invasive brain cancer cells [5]. Machine learning approaches have also been applied to a variety of other biomedical research problems [6], such as analysis of genomic sequences [7], drug discovery and repurposing [8], and robotic healthcare assistants [9].
The benefits of applying machine learning algorithms to medical data analytics are numerous. They encompass customized and personalized diagnosis and treatment, faster screening, and early detection of conditions, which can potentially lead to improved healthcare quality and patient satisfaction, reduced healthcare costs, and a reduced need for hospital stays.
As more archived traditional medical records are transferred to digital form, and as personal wearable devices and mobile apps unobtrusively collect massive amounts of information about our bodily functions and activities, more training data will become available, which will improve the outcomes of machine learning algorithms and facilitate the extraction of subtle health-related and behavioural patterns. For instance, one creative solution employing images taken with a regular cell phone camera is the mobile app AiCure [10], which uses AI-supported image processing to monitor users' habits in taking prescription medications, with the objective of increasing adherence rates and of updating the respective physician on those habits.
Likewise, application of machine learning algorithms for monitoring and evaluation of patient compliance with a prescribed physical therapy program can improve adherence rates, reduce the time required for functional recovery, and consequently reduce treatment cost. The development of such systems requires hardware components, i.e., a dedicated computer for data processing and a sensory system for capturing patient exercises during rehabilitation sessions. Among the different sensory systems for motion capturing, vision-range sensors of the Microsoft Kinect type are currently an excellent option for the task at hand, considering their affordability (price in the range of $150), their reliability in different research and industrial applications, and the availability of open source libraries with a broad range of capabilities for program development.
Two such existing systems, KiRes (Kinect Rehabilitation System) [11] and VERA (Virtual Exercise Rehabilitation Assistant) [12], utilize the motion capturing feature of Kinect to present an avatar on a computer display that reproduces patient motions in real time and simultaneously displays the desired motions. The visualization of the performance provides instantaneous feedback to the patient, helps in recognizing any need to correct the exercises, and motivates the patient to comply with the prescribed treatment. A comprehensive review of the technical and clinical merits of Microsoft Kinect for motion capturing of patient exercises in physical rehabilitation is presented by Hondori and Khademi [13].
Equally important to the requirement for adequate hardware components is the development of a methodology for computer-driven analysis of patient therapy efforts, related to evaluating the consistency of the performance with the exercises prescribed by the physical therapist (PT), the day-to-day patient progress, and the level of compliance with the prescribed treatment plan. Such a methodology is predicated upon the provision of: (i) efficient mathematical models for representation of bodily movements undertaken during physical therapy exercises [14], and (ii) efficient metrics for quantifying the motions executed by the patient and comparing the performance to the motions prescribed by the PT.
The objective of this article is to present a survey of the current literature on metrics for evaluation of patient performance in physical therapy. The existing practice for evaluation of physical rehabilitation has exclusively relied on assessment by a PT. For instance, a common test for evaluation of motor recovery after stroke is the Fugl-Meyer Assessment [15], in which a PT evaluates a patient's performance on a set of pre-defined movements and assigns a numerical score on a scale of 0 to 2 to each of the movements. Related tests for evaluation of the level of recovery after stroke include the Motor Assessment Scale [16] and the motricity index [17]. Another test, for assessing the ability to perform upper motor movements, is the Wolf Motor Function Test [18], which is a timed test consisting of several functional tasks, each scored on a scale of 0 to 5. These and several other tests for assessment of patient performance and the corresponding level of functional recovery that are currently performed by a trained PT are suitable candidates for automation, since they rely on a set of standard pre-defined movements. The drawbacks of PT-performed assessment are that it is time consuming and that it produces subjective scores, since different PTs can assign different scores owing to the human inability to accurately measure and quantify body trajectories. Automated performance evaluation can overcome these limitations by providing a more accurate and quantified assessment; it can also be used for daily monitoring of the therapy sessions, provide instantaneous corrective feedback, and send the performance data to the respective PT on a daily basis.
With regard to the proposed metrics for automated performance evaluation in the published literature, to the best of our knowledge only the work by Komatireddy et al. [12] has partially addressed this topic. The authors proposed a quantitative metric, related to the number of correctly performed repetitions of an exercise, and a qualitative metric, related to the ratio of optimal to suboptimal repetitions of the exercise. The study does not clearly explain which discriminative approach was applied for distinguishing between optimal and suboptimal repetitions.
This article reviews metrics that have been used, or that can potentially be used, for evaluation of patient therapy motions. Motivated by the work of Komatireddy et al. [12], we employ a taxonomy that classifies the metrics as quantitative and qualitative. Quantitative metrics are further categorized into model-less and model-based metrics. Model-less metrics perform the assessment based on the raw time series of the motions as acquired by a sensory system. Metrics in this category are the root-mean-square distance and the norm of jerk. Model-based metrics calculate the consistency of patient exercises in comparison to a mathematical model of the motion as prescribed by a PT. Metrics in this category include the log-likelihood, Kullback-Leibler divergence, heuristic consistency, and prediction intervals. Other related metrics not explored in this work are the Hellinger distance and the Bhattacharyya distance. While the quantitative metrics evaluate the motions at a low level of abstraction, i.e., at the level of individual measurement points in a sequence, the qualitative metrics evaluate the motions at a high level of abstraction, i.e., at the level of a motion sequence. Metrics in this category include the number of optimal attempts, the Fugl-Meyer Assessment, and the Wolf Motor Function Test.
The article is organized as follows. The next section introduces the mathematical notation used for the human motions. Afterwards, the metrics for patient performance evaluation are described. The reviewed metrics are then compared on the evaluation of five human motions. The last section summarizes the presented study.

Notation
In a physical rehabilitation setting, a PT prescribes a collection of desired therapy motions to a patient, either by performing the motions in front of the patient or by physically moving the patient's body parts along the required paths. It is assumed here that the PT provides several demonstrations of each motion in order to reinforce the patient's perception of the motion, which may relate to the required range, the speed of movement, and other constraints on the execution of the motion. Analogously, let us assume that the patient attempts to perform the prescribed motion O in a home-based rehabilitation program in front of a sensory system for motion capturing. The patient is presumably asked to repeat the motion a predefined number of times over a predefined time period (e.g., 10 times daily). The measured motion examples performed by the patient are denoted R, with R_n denoting a single example.
The metrics for performance evaluation should describe, in a quantitative manner, a qualitative manner, or both, the consistency of the patient-performed examples of the motion R with the PT-prescribed examples of the motion O. Due to musculoskeletal constraints, pain, or other conditions, the patient may not be able to perform the motion correctly at the beginning of the therapy program, and the performance may or may not improve as the therapy program progresses.

Metrics
The reviewed metrics for performance evaluation are classified in this work into two main categories: quantitative and qualitative metrics. Quantitative metrics assign a numerical score to the consistency of the patient performance, whereas qualitative metrics assign either a non-numerical evaluation (e.g., correct versus incorrect performance) or a discrete numerical score from a finite and limited range of values or states.

Quantitative metrics
Quantitative metrics can also be referred to as low-level metrics, since they evaluate the consistency of each measurement with regard to the prescribed sequence of measurements, or with respect to a model of the motion in the form of a probability distribution. The quantitative metrics are further classified into model-less and model-based metrics.

Model-less metrics
The model-less metrics compare the motions captured during a physical therapy exercise by a patient, with the motions captured when prescribing the therapy exercise by the PT. These metrics compare the measured raw trajectories of the body parts as acquired by the sensory system.
The following metrics are classified in this group:
a) Root-mean-square (RMS) distance: obtained from the differences between the points of a captured trajectory R_n and a set of prescribed trajectories O. One constraint of the RMS distance is the requirement that the trajectories have the same length, i.e., the same number of observations T_m. Therefore, the observed trajectories need to be scaled to the same length before the RMS distance is calculated. When the trajectories are linearly scaled to the same length, large spatial differences between temporally corresponding points (for instance, when parts of the motion are performed at a different pace) will result in a large RMS distance between the trajectories. This limitation is typically mitigated by employing approaches for temporal alignment of the trajectories, such as dynamic time warping (DTW) [19].
Another metric that can be derived from the RMS distance for a single motion example R_n is the mean of the RMS distances over all motion sequences in the set R.
b) Norm of jerk: jerk is the third time derivative of position, and its norm is evaluated along the captured trajectory. This metric quantifies the level of smoothness of the movement [20], and a high value of jerk can be indicative of shaky patient movements during the physical exercises. In certain rehabilitation exercises and conditions, it is expected that the patients will produce a high level of jerk at the beginning of the treatment, which will gradually reduce as the recovery improves. Although this metric evaluates only one aspect of the movements, when combined with other metrics it can provide valuable information regarding the level of progress toward functional recovery.
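The two model-less metrics can be illustrated with a minimal Python sketch. It assumes that a trajectory is a non-empty list of equal-dimension coordinate tuples sampled at a fixed interval dt. All function names (euclid, dtw_path, rms_distance, mean_rms_to_set, jerk_norm) are hypothetical, the alignment is a textbook form of DTW rather than necessarily the procedure of [19], and the root-mean-square magnitude of the third finite difference is only one of several possible definitions of the norm of jerk.

```python
import math


def euclid(p, q):
    """Euclidean distance between two measurement points (equal-length tuples)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw_path(x, y):
    """Textbook dynamic time warping: (i, j) index pairs aligning x to y."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = euclid(x[i - 1], y[j - 1]) + step
    # Backtrack from the end of both sequences to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def rms_distance(r, o):
    """RMS distance between a patient trajectory r and one prescribed trajectory o."""
    squared = [euclid(r[i], o[j]) ** 2 for i, j in dtw_path(r, o)]
    return math.sqrt(sum(squared) / len(squared))


def mean_rms_to_set(r, demos):
    """Mean RMS distance from r to every prescribed trajectory in demos."""
    return sum(rms_distance(r, o) for o in demos) / len(demos)


def jerk_norm(traj, dt):
    """RMS magnitude of jerk, estimated with third-order finite differences."""
    if len(traj) < 4:
        raise ValueError("at least four samples are needed to estimate jerk")
    squared_norms = []
    for t in range(len(traj) - 3):
        third = [
            (d - 3 * c + 3 * b - a) / dt ** 3
            for a, b, c, d in zip(traj[t], traj[t + 1], traj[t + 2], traj[t + 3])
        ]
        squared_norms.append(sum(v * v for v in third))
    return math.sqrt(sum(squared_norms) / len(squared_norms))
```

In this sketch a patient trajectory that follows the prescribed path at a different pace is aligned by DTW before the distance is taken, so a difference in timing alone does not inflate the RMS distance.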

Model-based metrics
These metrics rely on a model of the prescribed motions and/or a model of the patient motions. Common methods used for modeling human motions include probabilistic approaches, such as Gaussian mixture models [21] and hidden Markov models [22]. These approaches model the sequences through a set of latent states that describe a statistical distribution of the motion dynamics. Another common approach for modeling human movements employs a set of deterministic latent states connected by weights, as in artificial neural networks [21].
The metrics in this category include:
a) Log-likelihood: expresses the probability P that a motion example performed by the patient is drawn from a model of the motions as prescribed by the PT. For a model described by a set of parameters λ, the log-likelihood of a motion example R_n is calculated as the natural logarithm of the likelihood of all data points given the model parameters λ [21], that is, ln P(R_n | λ).
b) Kullback-Leibler (KL) divergence: a measure of the similarity between two probability distributions [23]. One of the distributions is considered to represent the true theoretical distribution of the data, in this case the empirical distribution of the movements prescribed by the PT, i.e., P(O). The other distribution represents an approximation of the true distribution, which in this case is the distribution of the movements executed by the patient, i.e., P(R). The KL divergence between P(O) and P(R) is defined as the expectation, under P(O), of ln [P(O) / P(R)]. If the probability distributions of the motions are modelled with a parameter set, the KL divergence can be found by calculating the mean probability of the data points in the motion sequences. This metric is also known as relative entropy, and is a measure of the information lost when the probability distribution P(R) is used to approximate the probability distribution P(O). Alternative metrics that have been used to quantify the difference between two probability distributions, and that can also be considered for evaluation of human motion consistency, are the Hellinger distance and the Bhattacharyya distance.
c) Heuristic consistency: a simple measure that determines the proportion of patient movements that are contained within the extrema of the demonstrated movements O. The measure is defined by means of an indicator function.
d) Prediction intervals: the proportion of estimated means from the captured patient trajectory that is contained within the confidence interval is calculated, and averaged over all captured trajectories, to obtain the metric L_6(R, O). If the captured trajectories are consistent with the demonstrated movements, then L_6(R, O) should have a value of approximately 5%.
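The following Python sketch illustrates three of the model-based metrics under a deliberately simplified assumption: each set of motions is modelled by a single Gaussian with diagonal covariance, whereas the approaches cited above use Gaussian mixture models, hidden Markov models, or neural networks. The function names (fit_gaussian, log_likelihood, kl_divergence, heuristic_consistency) are hypothetical. Treating the measurement points as independent draws, and taking the extrema per coordinate over all demonstrated points, are assumptions made for illustration only.

```python
import math


def fit_gaussian(demos):
    """Per-dimension mean and variance over every point of every sequence."""
    points = [p for seq in demos for p in seq]
    dims = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    var = [
        # A small floor avoids division by zero for constant coordinates.
        max(sum((p[d] - mean[d]) ** 2 for p in points) / len(points), 1e-9)
        for d in range(dims)
    ]
    return mean, var


def log_likelihood(seq, mean, var):
    """ln P(seq | mean, var), treating the points of seq as independent draws."""
    total = 0.0
    for p in seq:
        for x, m, v in zip(p, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def kl_divergence(mean_o, var_o, mean_r, var_r):
    """Closed-form KL(P(O) || P(R)) for two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vr / vo) + (vo + (mo - mr) ** 2) / vr - 1.0
        for mo, vo, mr, vr in zip(mean_o, var_o, mean_r, var_r)
    )


def heuristic_consistency(seq, demos):
    """Fraction of points in seq inside the per-coordinate extrema of demos."""
    points = [p for demo in demos for p in demo]
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    inside = sum(
        all(l <= x <= h for x, l, h in zip(p, lo, hi)) for p in seq
    )
    return inside / len(seq)
```

A typical use would fit one Gaussian to the PT demonstrations O and one to the patient repetitions R, then report log_likelihood for each repetition and kl_divergence between the two fitted models.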

Qualitative metrics
Qualitative metrics can be referred to as high-level metrics because they evaluate each motion example performed by the patient as an individual repetition with respect to the prescribed motion examples, as opposed to evaluating the individual sequential measurements at the trajectory level.
The following metrics have been used for qualitative assessment of therapy exercises in previous works in the literature:
a) Number of optimal attempts: used in the work of Komatireddy et al. [12] to assess patient performance. As stated before, it is not clear what type of approach the authors applied in labeling the motions as either optimal or suboptimal. On the other hand, it is possible to use any of the quantitative metrics listed above to calculate a numerical score for each repetition of a motion, and then to label the repetition as optimal if the score is greater than a predefined threshold value (or lower than it, for distance-type metrics where smaller values indicate better performance).
b) Fugl-Meyer assessment (FMA): introduces a series of standardized exercises intended to evaluate the development of motor functions and balance in patients recovering from stroke [15]. The FMA test encompasses five principal domains for assessment: motor function, sensory function, balance, joint range of motion, and joint pain. Each domain involves several assessment steps related to the performance of respective movements. The movements are evaluated by a PT on a scale with 3 grades, with 0 as the minimum and 2 as the maximum grade. The assessment produces a cumulative numerical score representing the progression toward functional recovery of the stroke patient.
This assessment method can be employed in the development of metrics for automated performance evaluation, either by drawing insights from the way the PT evaluator scores the movements, or by training a machine learning algorithm to score in a similar manner by using the PT's scores as training labels.
In addition, the FMA test has been reported to be complex and time consuming [16]. Consequently, an automated version of the test based on machine learning methodology could be a valuable contribution to the domain of physical rehabilitation. Another potential advantage of automated assessment is the provision of a more precise evaluation than the three-grade scale allows.
Several faster alternatives to the FMA have been introduced, including the Motor Assessment Scale [16] and the motricity index [17]. These tests have been frequently used in practice, and can also be exploited in the development of an automated performance metric.

c) Wolf motor function test (WMFT): a timed test of functional tasks used to assess the ability to perform upper motor movements [18]. The test relies on a number of objects used as props, such as a chair, a table, and weights, with which the required motions are performed. The tasks are timed, with each motion given a maximum time of 2 minutes. The performance of each task is scored on a scale from 0 to 5. Summary scores are calculated based on the medians of the timings of the motions, and on the means of the ratings for the functional abilities.
As observed for the FMA test, the WMFT is also suitable for automation and can provide insight into the development of automated performance metrics.
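Two of the qualitative computations described above are simple enough to sketch in Python: labeling repetitions as optimal by thresholding a quantitative score, and forming WMFT-style summary scores from the median of the task timings and the mean of the ratings. The function names (count_optimal, wmft_summary), the higher_is_better switch, and the capping of each timing at the 2-minute limit are assumptions introduced for illustration; they are not taken from [12] or [18].

```python
from statistics import mean, median


def count_optimal(scores, threshold, higher_is_better=True):
    """Return (number of optimal repetitions, ratio of optimal repetitions)."""
    if higher_is_better:
        # E.g. log-likelihood: larger values indicate better consistency.
        optimal = [s > threshold for s in scores]
    else:
        # E.g. RMS distance: smaller values indicate better consistency.
        optimal = [s < threshold for s in scores]
    return sum(optimal), sum(optimal) / len(scores)


def wmft_summary(times_s, ratings, max_time_s=120.0):
    """Median task time (each capped at max_time_s) and mean ability rating."""
    capped = [min(t, max_time_s) for t in times_s]
    return median(capped), mean(ratings)
```

For example, count_optimal could be applied to the per-repetition log-likelihoods of one daily session, with a threshold chosen by the PT from the demonstrated motions.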

Evaluation Dataset
The reviewed metrics were evaluated on the publicly available human motion dataset UTD-MHAD (University of Texas at Dallas Multimodal Human Action Dataset) [24]. The dataset includes 27 actions, each performed 4 times by 8 subjects. A Kinect sensor and a wearable inertial sensor were used for collecting the data.
The following 5 actions were employed here for evaluation purposes: two hands front clap, right arm throw, draw circle clockwise, draw triangle, and tennis serve. Sample images for the actions are presented in Figure 1.

Evaluation Results
The following metrics were evaluated for the five actions: root-mean-square distance, log-likelihood, KL divergence, heuristic consistency, and prediction intervals. The results are presented in Table 1.
The data for the five actions was divided into 2 sets: a training set consisting of 21 sequences for each action, and a testing set consisting of 7 sequences for each action. Both the training and the testing set correspond to actions performed by the same group of subjects. Ideally, the motions would correspond to therapy exercises, and the testing set would include suboptimal examples of the motions. As part of future work, we plan to create a dedicated dataset of motions performed in physical therapy.
The root-mean-square distance was calculated for the recorded trajectories. The motion capture feature of Kinect provides skeletal data, where the human skeleton (shown in Figure 1) consists of 20 joints. The temporal measurements for each joint are spatial 3-dimensional coordinates. Hence, the data comprises 60-dimensional sequences. The recorded motion sequences were scaled to the same number of measurements by using the DTW algorithm. The results in Table 1 present the mean values of the root-mean-square distance for the 7 motion sequences in the testing set.
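The step from 20 joints with 3 coordinates each to a 60-dimensional measurement can be sketched in Python as follows. The function names (flatten_frame, flatten_sequence) and the in-memory layout of a frame as a list of (x, y, z) tuples are assumptions for illustration; the actual file format and joint ordering of the dataset are not assumed here.

```python
NUM_JOINTS = 20  # joints in the Kinect skeleton described in the text
NUM_COORDS = 3   # spatial coordinates per joint


def flatten_frame(joints):
    """Turn 20 (x, y, z) joint positions into one 60-dimensional measurement."""
    if len(joints) != NUM_JOINTS or any(len(j) != NUM_COORDS for j in joints):
        raise ValueError("expected 20 joints with 3 coordinates each")
    return [coord for joint in joints for coord in joint]


def flatten_sequence(frames):
    """Apply flatten_frame to every frame of a recorded motion sequence."""
    return [flatten_frame(frame) for frame in frames]
```

Each element of the returned sequence is then a single point of the trajectory on which the model-less metrics operate.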
The log-likelihood of the testing data was calculated for several different mathematical models of the training data. The dimensionality of the raw observation data was first reduced from 60 to 3 dimensions by employing an autoencoder neural network [25]. Afterwards, the 3-dimensional sequences were modeled using a mixture density network [14], a Gaussian mixture model fitted by expectation maximization, and a hidden Markov model [26]. The mean log-likelihood of the testing set is shown in Table 1.
The mean KL divergence of the testing data is also presented in Table 1. As for the log-likelihood metric, an autoencoder was employed to reduce the dimensionality of the observed data, and a mixture density network was afterwards used to model the data.
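As an illustration of how a mean log-likelihood over test sequences can be computed for a mixture model, the Python sketch below evaluates a Gaussian mixture with diagonal covariances whose parameters are already given. The function names (gmm_log_density, mean_log_likelihood) are hypothetical, the fitting step by expectation maximization is not shown, and the points of a sequence are treated as independent, which is a simplifying assumption.

```python
import math


def gmm_log_density(x, weights, means, variances):
    """ln p(x) under a diagonal-covariance Gaussian mixture (log-sum-exp form)."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)  # subtracting the maximum keeps the exponentials stable
    return top + math.log(sum(math.exp(l - top) for l in logs))


def mean_log_likelihood(test_sequences, weights, means, variances):
    """Average of the per-sequence log-likelihoods over the testing set."""
    totals = [
        sum(gmm_log_density(x, weights, means, variances) for x in seq)
        for seq in test_sequences
    ]
    return sum(totals) / len(totals)
```

Because each per-sequence value is a sum over measurement points, longer sequences yield lower log-likelihoods, so sequences are best brought to a common length before their values are compared.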
The last two columns in the table present the heuristic consistency and prediction intervals metrics.

Conclusion
The article presents a survey of the current literature on metrics for evaluation of patient performance in physical therapy. The metrics are classified into quantitative and qualitative metrics. The quantitative metrics assign a numerical score to the patient performance, and are categorized into model-less and model-based metrics, based on whether a mathematical model of the motions is employed for performance evaluation.
The existing practice in physical therapy predominantly relies on assessment by a physical therapist. Studies related to automated assessment of therapy motions are scarce in the published literature, and consequently little attention has been paid to the development and definition of metrics for performance evaluation. This article reviews some of the metrics reported in the literature. In addition, the article reviews metrics that have been used for evaluation of human motions in other fields. Examples are the root-mean-square distance and the norm of jerk, which have been used in the domain of robotic learning from human demonstrations. Other metrics, such as the Kullback-Leibler divergence and heuristic consistency, have been used more generally for comparison of probability distributions.
The metrics presented in this article can be used for evaluation of human motions in other application domains, or for assessment of sequential data in other fields, where applicable.