Predicting Upper-Extremity Function Recovery From Kinematics in Stroke Patients Following Goal- Oriented Computer-Based Training

Background: After a stroke, a wide range of deficits can occur with varying onset latencies. As a result, predicting impairment and recovery are enormous challenges in neurorehabilitation. Body function and structure, as well as activities, are assessed using clinical scales. For functional deficits of the upper extremities these include the Fugl-Meyer Assessment for the Upper Extremity (FM-UE), the Chedoke Arm and Hand Activity Inventory (CAHAI) and Barthel Index (BI), administered by clinicians. Although these scales are generally accepted for the evaluation of the motor and functional impairment of the upper-limbs, they are time-consuming, show high inter-rater variability, have low ecological validity, and are vulnerable to biases introduced by compensatory movements and action modifications. For these reasons, alternative methods need to be developed for efficient and objective assessment. Computer-based motion capture and classification tools have the potential to collect and process kinematic data to estimate impairment, function and recovery while overcoming these limitations. Methods: We present a method for estimating clinical scores from movement parameters that are entirely extracted from kinematic data recorded during unsupervised rehabilitation sessions performed with the Rehabilitation Gaming System (RGS). RGS is a rehabilitation technology that uses image-based motion capture, goal-oriented individualised training, virtual reality content delivery, and restricts compensatory trunk movements through feedback. The main protocol considered in this study asks patients to use their upper limbs to intercept spheres that are presented in a 3 dimensional virtual reality display. RGS maps the planar physical arm movements onto matching movements by an avatar presented in a first-person perspective. In this analysis, we performed a multivariate regression using clinical data from 98 stroke patients who completed a total of 191 sessions with RGS. Results: Our multivariate regression model reaches an accuracy of R : 0.38, with an error (σ : 12.8), in predicting FM-UE scores. We analyse our model by assessing reliability (r = 0.89 for test-retest), sensitivity to clinical improvements (95% true positive rate) and generalisation to other tasks that involve planar reaching movements (R : 0.39). The model achieves comparable accuracy also for the CAHAI (R : 0.40) and BI scales (R : 0.35). Conclusions: Our results highlight the clinically relevant predictive power of kinematic data collected during unsupervised goal-oriented motor training combined with automated inference techniques and provide new insight into factors underlying recovery and its biomarkers.


Background
Stroke is the second major cause of death and disability worldwide, with about 15 million new cases every year [1]. One third of these cases lead to persistent cognitive and motor disabilities [2]. About 80% of stroke survivors present weakness and partial loss of voluntary control in the upper-extremities [3], or hemiparesis, which is often associated with other sensorimotor alterations, such as hypertonia or tremor.
Although hemiparesis is a highly prevalent symptom and severely limits the independence of affected patients, its causes and recovery dynamics are not fully understood [4]. Recent literature converges onto the core idea that it is mainly due to a combination of residual corticospinal tract capacity and an upregulation of the reticulospinal tract [5,6]. Further, recovery seems to follow a temporal structure where most of the improvement occurs during the first months poststroke [7,8]. However, one dilemma in understanding the mechanism of hemiparesis and its recovery is that it is based on assessment methods such as the Fugl-Meyer and ARAT scales which have in turn their own limitations. Indeed, a recent systematic review [9] investigating a total of 225 studies (N=6197) using 151 different kinematic metrics found that kinematic assessments of upper limb sensorimotor function are poorly standardised and rarely investigate clinimetrics in an unbiased manner. Specifically, using descriptors of accuracy, efficacy, efficiency, movement planning, precision, spatial posture, speed, temporal posture, and range of movement together with clinimetric properties of these descriptors (i.e., reliability, measurement error, convergent validity, and responsiveness), the authors show that the studies analysed exclusively focused on finding correlations between measures of impairment, and only two of the studies reported explicit responsiveness metrics such as correlations in change. Actually, both cross-sectional and longitudinal construct validity have been supported by the relationship between kinematic measures of reaching and the degree of sensorimotor impairment or scores obtained with clinical rating scales [10]. However, there is very limited information regarding test-retest reliability, i.e reproducibility, and responsiveness of kinematic outcome measures of reaching performance.
In order to advance our knowledge about the hemiparesis phenotype and its progression it seems necessary to find alternative methods and measures of motor function and recovery that are objective, reliable and sensitive. One proposal by Murphy et al. [11] explored responsiveness in a number of kinematic descriptors and found a significant covariation of the ARAT scores with movement time (R 2 =0.36), smoothness (R 2 =0.31), and trunk displacement (R 2 =0.35).
Although these results are promising, the study involves a limited number of subjects (N=24) from an highly homogeneous sample (i.e., acute patients only). Further, the ARAT clinical scale presents poor robustness to compensation and is especially vulnerable to the use of explicit strategies to boost performance. Majeed et al. [12] explored the application of models based on LASSO regression to predict changes in motor ability (FM-UE) and motor function (WMFT). These models propose that recovery in both scales can be approximated by the patient's age, the patient's motor control during the execution of fast movements, and other demographic and clinical features, altogether accounting for 65% and 86% of the variability for the FM-UE and WMFT scales respectively. Although these models reach exceptional accuracy, their utility is limited because they make use of kinematic data obtained during the supervised execution of very specific pointing movements, and are based on generic unbounded linear models, with the consequence that their predicted values could be largely outside the meaningful range of the scale (e.g. predict FM-UE scores larger than 66 points).
We propose a new approach towards using kinematic data obtained in unsupervised rehabilitation sessions to predict level of impairment and functional recovery. Data is obtained from patients engaging with goaloriented embodied individualised training with the Rehabilitation Gaming system (RGS). We first explore the clinimetric properties of hand movements that were collected during unsupervised RGS sessions. Secondly, we build and analyse the performance (i.e., external validity, robustness, responsiveness, sensitivity, and generalisation) of a predictive model of motor recovery (1-3 months post-baseline) based on data of acute to chronic stroke survivors.

Subjects
Our retrospective analysis uses data of 191 RGS sessions from 98 hemiplegic patients (age in range [23,87], mean 63; days post-stroke in range [5,3045], mean 400; cf. Table 1) who were recruited between 2010 and 2015 to participate in studies conducted in Barcelona and Tarragona, Spain [13,14,15]. Participants met the following inclusion criteria: 1) ischemic strokes (middle cerebral artery territory) or hemorrhagic strokes (intracerebral); 2) mild-to-moderate upper limb hemiparesis (Medical Research Council scale for proximal muscles > 2) after a first-ever stroke; 3) age between 20 and 90 years old; and 4) the absence of any significant cognitive impairment (Mini-Mental State Evaluation > 22). Table 1 Characteristics of the 191 samples composing the main dataset (single session, Spheroids scenario). The r columns refer to the Pearson correlation coefficients with the FM-UE, CAHAI, and BI clinical scales, respectively. Correlations below the significance threshold r ∼ 0.081 (cf. Fig. 12 in Appendix) are in grey. The clinical scales are measured no more than 4 days before or after the coupled RGS session. The 'instantaneous' variables (6)(7)(8)(9)(10)(11)(12) obtained directly from RGS log-files are in Italic type. The work area is computed as the area of the complex hull of the hand movements using standard methods, e.g. Jarvis' Algorithm [16]. The distance covered refers to the total length of the hand paths during training. The performance success rate is defined as the number of spheres intercepted over the total released during the RGS session.

Protocol
Participants followed a rehabilitation protocol including 3-5 weekly sessions of 30 minutes each for 3-12 weeks using the Rehabilitation Gaming System (RGS) shown in Fig. 1. The joint movements of the user's head, shoulders and elbows are tracked and mapped onto an avatar through a biomechanical model using a custom developed vision based motion capture system. Arm movements are displayed on a screen from a firstperson perspective, realising a rehabilitation paradigm that combines goal-oriented embodied and situated action execution, motor imagery, and action observation. For the RGS sessions in the main dataset, cf. Table  1, the participants are instructed to intercept virtual spheres that move towards them by executing horizontal bimanual movements over the surface of a table ('Spheroids' protocol). The task parameters (the frequency of sphere appearance, their speed, their range, and size) are combined in a single parameter ('difficulty level') and automatically adjusted during the session in order to maintain the user's performance between 70% and 80% success rate [17,18]. The system allows for the storage and extraction of performance parameters as well as hand path trajectories derived from joints' positions and rotations recorded at about 100 Hz.
During the rehabilitation patients are evaluated using standard clinical scales: Fugl-Meyer Assessment for the upper extremity (FM-UE), Chedoke Arm and Hand Activity Inventory (CAHAI) and Barthel Index (BI). When collecting the 191 samples (Table 1), the following measures are taken to improve data quality: • The clinical score measurements (FM-UE, CA-HAI, BI) are coupled to the RGS session closest in Figure 1 The Rehabilitation Gaming System (RGS). The system consists of a PC, a 17 inch LCD touch display, a image-based motion capture device (Kinect 360, Microsoft) positioned on top of the screen [14]. In the inset we show a screenshot from the 'Spheroids' activity during a RGS session. The virtual tasks logic and graphics are implemented using the Torque 3D and Unity 3D game engines. time, with a maximum time separation of 4 days between the measurement and the RGS session; • The first two RGS sessions of a patient at the start of the rehabilitation trajectory are excluded to ensure that patients are familiar with the RGS environment for all collected samples.

Outcomes and Analysis
To analyse the potential of RGS-derived movement descriptors for capturing both impairment and recovery in standardised clinical scales, we first extract a set of variables that are known to correlate with the severity of hemiparesis [9,12]. Next, to execute a convergent validation analysis, we generate a model that combines the information of several variables to estimate the patient's score on a clinical scale. The model includes an estimation of the prediction error. We use repeated cross-validation to avoid overfitting, allocating 50% of the samples to the training set and 50% to the validation set. We conduct a robustness analysis in order to explore test-retest reliability. Additionally, we perform a sensitivity analysis by computing correlations between the change in the movement descriptors and clinical improvements. For this analysis we only consider pairs of RGS sessions from the same patient for which the time-difference between the two sessions is larger than 16 days, for a total of 54 samples (108 RGS sessions and their associated demographical descriptors and clinical assessments). Finally, we explore the potential of the model to generalise to other tasks that involve bimanual 2D planar reaching movements. To study this last property we identify a second dataset of 37 samples from a previous study in which 19 subacute stroke patients with hemiparesis were rehabilitating with a different RGS training scenario which is a variation on the well known arcade game 'Whac-A-Mole' [14].
The prediction analysis we present will focus on the FM-UE score, and we will briefly discuss the generalisation to the CAHAI and BI scales.

Identification of variables
We consider 31 variables in total, 12 are first order variables while the remaining 19 are second order variables and obtained as functions of the first order ones.
We identify two groups of first order variables: 1) demographic and physiological data at recruitment, cf. variables 1-5 in Table 1, and 2) kinematic descriptors extracted for the more affected limb during training sessions, cf. variables 6-12 in Table 1. For the evaluation of all kinematic descriptors, the first and last two minutes of each training session are discarded to avoid the interference of behaviours or events related to the start and the ending of the training session (revision of instructions, postural adjustments, exposure to the final score screen, etc.).
Second order variables include chronicity (i.e. acute, sub-acute and chronic categories) and the difference between the less and the more affected upper-limb in each of the quantitative first order variables, as well as their logarithmic transformations. The descriptive statistics of second order variables are given in Table  5 in the Appendix.
Among the above mentioned variables, most are obvious and/or well-known [9] (cf. caption of Table 1). However, we introduce also the new descriptors 'Smoothness' and 'Total-goal direct movement' (TGDM). Specifically, to extract information on UE motor function we introduce an original kinematic descriptor, J(σ), to assess the patient's movements at a specific temporal resolution, σ. This metric allows us to isolate goal-oriented movements from the hand trajectory in a certain direction, assumed to be stored over time as a function f (t). For the main dataset, we have considered the left/right direction, as it is the principal axes in the movement dynamics of 'Spheroids' protocol. We assume measurements are taken at discrete time points t i = i∆ for i = 0, 1, ..., T − 1, with ∆ being the timestep (for the 'Spheroids' dataset we have ∆ ≃ 0.01 s). We define the total hand displacement during goal-oriented movements J(σ) as the difference between the actual movements and a smoothened version of the discrete movements. The smoothened hand path f σ (t) is obtained using a Gaussian smoothing process with parameter σ where the parameter σ defines how smooth the new trajectory will be, see an example in Fig. 2. Therefore J(σ) is obtained as Following this analysis method we derive the two new variables corresponding to the value of J(σ) at the two peaks in the σ-dependent Pearson correlation with the clinical scales, cf. Fig. 3: 'Smoothness' in correspondence to the high-frequency peak, and 'TGDM' in correspondence to the low-frequency peak. The location of two peaks is weakly dependent on the clinical scale considered, yet it appears to be related to the data structure: the high-frequency peak is linked to the time resolution of the data (∆ ≃ 0.01 s), while the low-frequency peak is related to the typical timescale of the Spheroids protocol, i.e. a spheroid is launched every ∼ 10 s and moving towards opposite sides of the tasks space. We will further interpret these two new variables in the following analysis.

Predictive Models
To combine variables for the prediction of clinical scores of impairment and recovery, we introduce a model that allows for the presence of noise on both the variables Z and the score S, and we hence name it a double-noise parametric model. Its generative func- tional form is where θ = {β, β 0 , A, B, σ 1 , σ 2 } are the p + 5 model hyperparameters to be inferred: β (association parameters of p active variables), β 0 (parametric offset), A e B (range offsets), σ 1 and σ 2 (noise strengths). The  Table 1). The high-frequency peak is at about σ = 0.01 s for all three scales. The low-frequency peak is at σ = 8.8 s for FM-UE, σ = 5.9 s for CAHAI, and σ = 17 s for BI.
The probability of a particular score S, given the variables and the hyperparameters, is given by We use the following (improper) prior distribution over the parameters θ: We take p(σ 1 ) ∼ e −q/σ1 and similar for σ 2 with q being a very small number, typically of the order of the accuracy of the numerical work, e.g. q ∼ 10 −10 . These priors guarantee that the log-likelihood function is bounded from below. We adopt Maximum A Posteriori (MAP) inference: given the dataset D = {(Z 1 , S 1 ), . . . , (Z n , S n )}, the optimal parameters θ correspond to the maximum of the posterior probability or, alternatively, to the minimum of the regularised log-likelihood function: The errors on the inferred parameters θ are estimated from the curvature of the regularised log-likelihood function at the minimum. Two simpler models derived from (6) can be considered corresponding to having either score noise only or covariate noise only: • Score noise model, σ 1 = 0. Taking the limit σ 1 → 0 in (6) gives • Covariate noise model, σ 2 = 0 Taking the limit σ 2 → 0 in (6) gives provided |S i −b| < a for all i, otherwise Ω cov (θ) = ∞. This implies that for each b the minimisation over a is to be carried out strictly over the open interval a > max i |S i −b|.

Identification of features
To explore the convergent validity of RGS-derived kinematic descriptors in comparison to standardised clinical assessments, we compute Pearson correlations between all the variables (the RGS-derived descriptors and the baseline characteristics) and the clinical scores, cf. Fig. 4. By comparing these to a randomised outcome distribution we identify a significance threshold of r ≃ 0.081 for all the Pearson correlations variable-scores (cf. Fig. 12  p = .00049) and CAHAI (r = −0.17, p = .019) scores but positively with BI (r = 0.22, p = .0033). The correlations between FM-UE and CAHAI scores is high (r = 0.89, p < .0001), while the consistency with BI is significantly lower: (r = 0.34, p < .0001 with FM-UE; r = 0.5, p < .0001 with CAHAI). Generally, the correlations of the kinematic descriptors are higher with FM-UE and CAHAI than with BI. All kinematic variables show consistent inter-variables correlation. In particular, 'maximum reaching speed' ('max-sp') correlates highly with 'work area' ('w-area') (r = 0.76, p < .0001). This is interesting as these two variables are measured in a very different way, as the maximum reaching speed is obtained in a single instant while the work area is a function of the whole trajectory generated over the RGS session. The variable 'age' correlates negatively with 'TGDM' (r = −0.31, p < .0001) and 'difficulty' (r = −0.24, p = .00098), while it is uncorrelated with FM-UE and CAHAI scores, but not to BI (r = −0.30, p < .0001). This suggests that age-related technology proficiency may affect the kinematic descriptors. Nevertheless, this effect is relatively weak, i.e., the correlations of 'difficulty' and 'TGDM' with the clinical scores are significantly higher than the ones with 'age'.
In order to better understand the meaning of the two variables obtained with the smoothing techniques ('smoothness' and 'TGDM'), we also extract finite time-windowed variables for comparison. Specifically, we compute the maximum range of movement (left/right direction) within overlapping time windows of size σ, where σ is again fixed by the condition of maximum correlation with the clinical scale of interest, and we average the measurements across all possible windows along the whole RGS session. The resulting values show a very high correlation with the TGDM variable (r = 0.98, p < 0.01) and show very similar Pearson coefficients with the FM-UE (r = 0.54, p < 0.01), CAHAI (r = 0.57, p < 0.01) and BI (r = 0.44, p < 0.01) clinical scales. These results suggest that the TGDM is capturing information about the typical range of movement associated with the scenario events occurring within relevant time windows (i.e. about 10 seconds in the 'Spheroids' scenario). Following the same method, we extract time-windowed maximum reaching speed. The resulting values show a very high correlation with the smoothness variable (r = 0.87, p < 0.01) together with very similar Pearson coefficients with the FM-UE (r = 0.42, p < 0.01), CA-HAI (r = 0.39, p < 0.01) and BI (r = 0.21, p < 0.01) clinical scales. These results suggest that smoothness is linked to the ability of the patient to perform fast movements to complete the RGS tasks.

External validity: prediction of instantaneous scores
In the following we combine information from several variables to estimate clinical scores associated with a single patient's RGS session.
In this section we will focus mostly on the FM-UE scale. The other clinical scales will be discussed in Sec. 'Generalisation to CAHAI and BI scales'. Nevertheless, it is useful to anticipate here that the results for CAHAI are very similar to FM-UE. This similarity is expected as the two scales have high relative correlations and comparable correlations to most of the variables, cf. Tables 1,5. To be quantitative, we can consider the case in which the FM-UE score S FM i of the 'Spheroids' dataset is estimated by simply rescaling the corresponding CAHAI value S CAHAI i by 66/91; this leads to the standard error and a R 2 value of 1 − σ 2 FM,CAHAI /σ 2 FM ≃ 0.62. The prediction of BI scores from kinematic descriptors is instead generally harder. This last point can be explained by the lower correlation of BI scores with most of the variables (cf. Tables 1,5). If we estimate the FM-UE score by a simple rescaling of the BI value by 66/100 we get a standard error of   Table 1 ('Spheroids' scenario), using the covariate noise model with association parameters given in Table 2.
and a R 2 value of 1 − σ 2 FM,BI /σ 2 FM ≃ 0.12. So we see again that BI carries different information than FM-UE (or CAHAI).
In the following, we adopt the covariate noise model for predictions of the FM-UE scale, cf. Eq. (8). Indeed we found that the three models presented in Sec. 'Predictive models' (double noise model, score noise model, covariate noise model) offer comparable performance in prediction of FM-UE scale on the dataset in Table 1. The typical error in the final prediction is of order ∼ 10, while the inferred noise score error σ 2 is typically ∼ 0.1. In this sense the score noise has little impact in this situation and so we prefer the covariate noise model to decrease the number of parameters to be inferred.
Here we aim to estimate the instantaneous FM-UE scores. By instantaneous we mean that we use only patient's baseline characteristics and RGS-derived move-ment descriptors (extracted from a single session logfile). In particular, we rule out the two variables 'session completed so far' and 'time since stroke' and the second order variables obtained from them. In this way, the prediction is intended to anticipate the clinical status of the patient at a given moment without knowledge of the rehabilitation history.
To avoid overfitting, we perform repeated crossvalidation with 50% of samples for training and 50% for validation, obtaining the optimal active set of variables possible for our dataset [1] . The active variables are 6 and shown in Table 2. In total we have 10 parameters (6 association parameters and 4 hyperparameters) inferred from the 191 sessions. The most important variables for the prediction of the instantaneous FM-UE score are 'difficulty' and 'Diff. TGDM' (difference in TGDM between the non-paretic arm and the paretic arm). We note that the resulting active variable set does not contain patient's baseline variables and the estimation of scores on clinical scales are solely obtained from (unsupervised) RGS-derived data. This also means that predictions made by this model for different RGS sessions of the same patient are considered independent measurements. The hyperparameters of the model that predicts FM-UE are given by a = 32.72(0.85), b = 34.55(0.73), σ 1 = 0.551(0.043) and β 0 = 0.370(0.051). To evaluate the accuracy of the final regression model on unseen data, we consider the Leave-one-out cross-validation (LOOCV) for which we obtain an accuracy on the training set of 0.755 while the accuracy on the validation set is 0.746 [2] , so that the difference is about 1.2%.
Eventually, we compare the true and predicted score values, shown in

Robustness: test-retest reliability
To evaluate the test-retest reliability of our model for prediction of the instantaneous FM-UE score, we consider two unseen datasets each composed of 921 RGS sessions, for a total of 1842 unseen RGS sessions. Each session in the first dataset 'test' is associated with a session in the second dataset 'retest' and obtained from the same patient within less than 48 hours. The small [1] We have enforced the presence of 'Diff. distance covered' in the active variable set since it is relevant in the prediction of clinical change, cf. Tables 3,6. [2] Here we define the accuracy as the percentage of points that are correctly estimated above or below the median FM-UE score 47. FM value (test) Figure 6 Predicted FM-UE score of first session (test) vs predicted FM-UE score of second session (retest) for unseen 921 couples of RGS sessions. Each couple is recorded from a same patient within 48 hours. The predictions are obtained using the covariate noise model with association parameters given in Table 2.
time frame makes it plausible that the clinical state of the patient is unchanged between the two test-retest sessions, and so they can be used to assess the reliability of the regression models. These data were collected in the same trials as the main dataset, but they correspond to rehabilitation sessions for which we do not have an associated measurement of a clinical scale (so they cannot be used for training). The Pearson correlation between the 'test' and 'retest' data sets is 0.89, that is "Good correlation" according to Cronbach's alpha score, cf. Fig. 6. In Fig.  6 we also show the interval defined by the standard error of the regression E ≃ 12.7: the large majority of values are within this interval. Indeed we measure an average retest error i (S test i − S retest i ) 2 /N equal to 5.9. This value gives an upper bound to the real value and it is less than half of the average error observed on the true score prediction.
These results support the internal consistency of this assessment method to predict the instantaneous clinical scores of the patients.

Responsiveness: prediction of improvement
Starting from the original dataset ('Spheroids', Table  1), we design a new dataset ('responsiveness' dataset, Table 3) composed of 54 samples where each sample represents a couple of sessions of the same patient for which the time lapsed between the two is larger than 16 days. We observe that the Pearson correlation between change in FM-UE (∆FM-UE) and change in CAHAI (∆CAHAI) is r = 0.68, Pearson between ∆FM-UE and change in BI (∆BI) is r = 0.67, while the Pearson between ∆CAHAI and ∆BI is r = 0.72. For each sample we consider now as variables the change (between the two sessions) of the original variables.
By comparing it to a randomised outcome distribution, we identify a significance threshold of r ∼ 0.14 for the Pearson correlations variable-scores. Several of the variables correlate highly with the change in all three clinical scores, cf. Fig. 7. The highest correlated variable is 'Change in TGDM' (r = 0.48, p = .00024 with ∆FM-UE; r = 0.55, p < .0001 with ∆CAHAI; r = 0.49, p = .0011 with ∆BI). Note that, in comparison to the prediction of a single session's score, the variables 'age' and 'chronic' are more correlated with the outcome when predicting the score change (cf . Tables 1,3 and Tables 5,6 in the Appendix). The initial scores (the clinical scores at the first session) have high correlation with change because of ceiling effect (cf. Table 6 in Appendix). In Fig. 7 we show the correlogram of all the variables and clinical scales. We observe that generally the correlations between kinematic descriptors in a single session (cf. Fig. 4) are preserved also when considering the change between sessions; for example the highest inter-variables correlation is for 'Change in w-area' and 'Change in max-sp' at (r = 0.87, p < .0001).
We utilise this dataset to analyse the responsiveness of the previous model (obtained for predictions of instantaneous scores) in detecting changes of clinical status in the same patient. We therefore adopt here the same set of active variable used for the prediction of instantaneous clinical scores. The association parameters of the model that predicts ∆FM-UE are shown in Table 2 We compare the true and predicted ∆FM-UE in Fig.  8. The Pearson correlation between true ∆FM-UE and predicted ∆FM-UE is 0.76. The value of the coefficient of determination R 2 is 0.57. These results show that the model has a good responsiveness to clinical change with a precision comparable to the one obtained for the instantaneous score.

Sensitivity: prediction of recovery
In order to evaluate the sensitivity (i.e. the true positive rate) of the model to predict ∆FM-UE we first identify 38 out of the 54 samples from the 'responsiveness' dataset (cf. Table 3) for which the associated ∆FM-UE values exceed an MDC of 4 points. Table 3 Characteristics of the 54 samples composing the responsiveness dataset. We select a number of sessions of same patient with a delay of at least 16 days from the main dataset, cf. Table 1. The last three columns report the Pearson coefficient correlation between the variable and the change of clinical score between the two session. Correlations below the significance threshold r ∼ 0.14 are in grey. Characteristics of second order variables for this dataset are shown in Table 6 in Appendix.

Generalisation
For the generalisation analysis, we consider an additional 37 RGS sessions from 19 hemiparetic participants that trained in a different 2D VR-based motor rehabilitation protocol derived from 'Whac-A-Mole' [14]. The observed FM-UE scores in this dataset are in the range [5,60], with mean 37 and SD 14.  Table 3), using the covariate noise model with association parameters given in Table 4.
Unlike the 'Spheroids' protocol, the gameplay of 'Whac-A-Mole' requires movements on the full 2d plane. In response, we utilise the smoothing technique in both cardinal axes of the task. i.e. front/back and left/right directions. Pearson correlations between the clinical scales and the variable J(σ) reveal a similar pattern to the one observed in the Spheroids scenario, with a peak of the Pearson coefficients at about 1s corresponding to the variable TGDM in each direction (cf. Fig. 13 in Appendix). The location of the main peak is again close to the typical timescale of the protocol (that is faster than 'Spheroids'). For the FM-UE score, the highest Pearson coefficient is observed in the frontal direction (r = 0.54 for σ = 1.3s); the lateral hand displacement peak is (r = 0.50 at σ = 1.1s). When predicting clinical scales, we use now only 2 active variables in order to limit overfitting: the variables TGDMfb (Total-goal directed movement for front/back direction) and TGDMlr (Total-goal directed movement for left/right direction). We then infer 6 parameters (2 association parameters + 4 hyperparameters) from the 37 RGS sessions. The two association parameters are (for normalised variables) β TGDMfb = 0.15(0.13) and β TGDMlr = 0.18(0.14). The hyperparameters of the model that predicts FM-UE for 'Whac-A-Mole' scenario are given by a = 31.9(3.5), b = 31.0(2.5), σ 1 = 0.49(0.11), and β 0 = 0.21(0.13).
The FM-UE predictions are shown against the true values in Fig. 9. The Pearson correlation between true FM-UE and predicted FM-UE is 0.63. Average error is E ∼ 11.2. The value of the coefficient of determination R 2 is 0.39.

Generalisation to CAHAI and BI scales
In the previous sections we focused on the FM-UE scale but the CAHAI and BI scores are also available in the main dataset (cf. Table 1); we can then gain some insights on the differences between the three clinical scales from the point of view of the RGS kinematics.
Overall, the CAHAI scale has similar properties to FM-UE in relation to the kinematic descriptors, cf.  Figure 10 True versus predicted CAHAI for 191 data points of 'Spheroids' dataset (cf. Table 1), using the covariate noise model with association parameters given in Table 4. Fig. 4. To stress the generalisation potential of the model, we can then adopt the same model introduced in Table 2 for the prediction of instantaneous FM-UE scores also for the prediction of the instantaneous CAHAI score. The association parameters for CAHAI are reported in Table 4. The most important variables are 'difficulty' and 'TGDM'. The hyperparameters of the covariate noise model that predicts CA-HAI scores are a = 39.158(0.099), b = 51.962(0.079), σ 1 = 0.953(0.064) and β 0 = 0.0319(0.071). The predicted scores are plotted against the true CAHAI values in Fig. 10. The model predicts the CAHAI score with an average error of E CAHAI ∼ 20.1, Pearson r true-predicted of 0.66 and a coefficient of determination R 2 = 0.40. This accuracy is close to what we obtained for the prediction of the instantaneous FM-UE, cf. Fig. 5.
In Fig. 11 we compare FM-UE and CAHAI, both for the true scores and the predicted scores. We note that  Figure 11 True FM-UE versus true CAHAI (green dots) and predicted FM-UE versus predicted CAHAI (purple triangles with errorbars) for 191 data points of 'Spheroids' database (cf. Table 1). We use the covariate noise model with association parameters given in Table 2 for FM-UE and Table 4 for CAHAI.
the relationship between FM and CAHAI is generally well preserved in the predictions; for example the Pearson between FM-UE and CAHAI scores is r = 0.89 for true values and r = 0.88 for predictions. The fact that the variability in the true FM-UE vs true CAHAI is seemingly comparable to the one in the prediction model, reinforces the idea that the precision we achieve is similar to the one of estimating FM-UE directly from CAHAI, as we estimated using Eq. 9. Finally, we observe that the model that predicts FM-UE and CAHAI scores does not work well for the BI. Most kinematics variables have significantly smaller correlation with the BI (in particular 'work area', 'TGDM', 'smoothness') while baseline information and clinical history of the patient are comparatively more relevant (for example the patient's age, cf. Table 1 and Fig. 4). We then devise a different model for the prediction of the BI considering all variables, including not instantaneous ones (such as 'time since stroke' or 'session completed so far'). We use again repeated 50-50 cross-validation to avoid overfitting and select optimal active variable set. The active variable set for BI is composed by 5 variables (β values for normalised variables): 'age' (β = −0.21(0.15)), 'sessions completed so far' (β = 0.24(0.19)), 'difficulty' (β = 1.21(0.72)), 'Log. time since stroke' (β = 0.14(0.14)), 'Log. Difficulty' (β = −0.91(0.70)). Using a double noise model, cf. Eq. 6, we infer the hyperparameters a = 51.56(0.10), b = 49.22(0.12), σ 1 = 0.5948(0.0072), σ 2 = 0.178(0.027) and β 0 = 1.025(0.011). Comparing true and predicted BI scores, we measure an average standard error of E BI ∼ 16.8, a Pearson correlation true-prediction of 0.62 and a coefficient of determination R 2 of 0.35. This accuracy is comparable to the one achieved by the models for FM-UE and CAHAI scores. Nevertheless, the dataset Table 1 is very unbalanced towards high BI scores (mean score 80, with only 3 samples with a score below 25), so that the previous prediction performance will not generalise well to homogeneous unseen BI data (i.e., the precision for low scores is relatively poor).

Discussion and Conclusions
Our understanding of post-stroke motor recovery depends on our capacity to evaluate and characterise impairment and disability. Current standardised assessment methods rely on human criteria and present relevant unsystematic variability due to differences in the evaluators' training, lack of systematicity in the administration of the assessments, and often are excessively focused on one single aspect of the impairment and/or disability.
Different rehabilitation approaches show a preference for using (and even targeting) specific assessment methods for the evaluation of their therapeutic efficacy, and often these methods have been developed by the same team of authors. For example, the effectiveness of Constraint-Induced Movement Therapy [19] is usually evaluated using the Wolf Motor Function Test [20] and the Motor Analog Scale [21], while the effectivity of occupational therapy has been frequently assessed using the Barthel Index [22] and the Functional Independence Measure [23].
Thus, there is an urgent need to establish alternative methods for a common evaluation protocol and characterisation of the hemiparesis phenotype, thus allowing us to identify specific impairment features that could advance our understanding of the recovery dynamics and guide the design of effective rehabilitation therapies. In pursuing this objective, we have conducted a careful analysis of the raw kinematic data from the upper-extremities of 191 individuals with poststroke hemiparesis, and we have constructed a predictive model of instantaneous function and recovery. Our results reveal a new digital biomarker of upper-limbs motor impairment, the Total Goal Directed Movement (TGDM), which relates to the patients range of motion during the execution of meaningful goal-oriented reaching movements. The TGDM strongly correlates with the level of impairment captured by the FM-UE and the level of disability captured by the CAHAI, and also presents predictive power about the patients' progress, showing high correlations with the magnitudes of improvement and deterioration estimated by both scales. This result is especially interesting given the current limited evidence about the responsiveness of kinematic outcome measures of reaching performance in people with hemiparesis after stroke [24]. According to a recent systematic review on the clinimetric properties of kinematic upper limb assessments [9] only two papers captured responsiveness (i.e., ability to capture longitudinal changes in the measured construct), and just nine parameters showed enough evidence to predict recovery (i.e., number of velocity peaks, trunk displacement, task/movement time). The quality of evidence however was very low for all metrics. Further, current recommendations point out that wearables with integrated Inertial Measurement Units and vision-based tracking systems are insufficient to measure the quality of movement and improvement in motor function. However, our findings, together with the growing evidence supporting distance travelled as an accurate and responsive digital biomarker of recovery [25], suggest the opposite.
We built a model of instantaneous motor impairment and recovery as measured by FM-UE that capitalises on the information captured by the TGDM variable, and we validated it in terms of external validity (R 2 : 0.38), robustness (test-retest reliability) (r > 0.89), responsiveness (R 2 : 0.57), sensitivity (TPR : 95%) and generalisation (r > 0.43). The relevance of our results is emphasised by their consistency across clinimetric properties and by their generalisation potential, relying on a large and heterogeneous dataset of patients at different stages post-stroke. We believe that the applicability of the TGDM and its derived models to evaluate impairment and motor recovery is promising for a number of reasons: 1) it can be derived from unimanual displacements executed in the horizontal plane, 2) it does generalise to other tasks involving two-dimensional horizontal reaching movements towards targets, and 3) it can be estimated during unsupervised motor training. Our results provide an early example of how fully digital biomarkers of deficits and recovery post-stroke can provide new digital health methods and technologies for neurorehabilitation that can generalise beyond the clinic and serve continuous high-resolution diagnostics, prognostics and intervention.

Appendix
In Table 5 we report the descriptive statistics of the secondary variables (functions of primary variables in Table 1) for dataset 'Spheroids'.
The distribution of the Pearson correlation coefficients of the FM-UE score with all the variables in Tables 1, 5 is shown in Fig. 12. This is compared with the distribution obtained with randomised outcome, whose standard deviation is r ≃ 0.081.
In Table 6 we report the descriptive statistics of the secondary variables (functions of primary variables in Table 3) for the 'responsiveness' dataset (change of variables and clinical scales between two RGS sessions of the same patient with a delay of at least 16 days).
In Fig. 13 we show the Pearson's r between J(σ) Eq. 2 and the clinical scales as function of the timescale σ for the database 'Whac-A-Mole', cf. Sec. 'Generalisation'. This is a dataset composed of 37 RGS sessions with limited range of the clinical scales: FM-UE range of observations [5,60], mean 37, SD 14; CAHAI range of observations [7,49], mean 32, SD 15; BI range of observations [48, 100], mean 85, SD 13. The 'Whac-A-Mole' scenario requires movements in all 2D place (unlike 'Spheroids' scenario, cf. Table 1), so we define two independent J(σ): one associated to the front/back trajectory and one associated to the lateral (left/right) trajectory. In Fig. 13 we show that in both directions there is a peak in the Pearson's r at about σ = 1 s for all three clinical scales. In correspondence of these two peaks, we define two variables: TGDMfb (Totalgoal directed movement in front/back direction) and TGDMlr (Total-goal directed movement in left-right direction). We stress that the timescale of the peak is roughly equivalent to the timescale of the gameplay of the 'Whac-A-Mole' scenario. Finally, we note that for the same analysis done in the 'Spheroids' scenario we observe two clear peaks in the lateral direction (cf. Fig. 3). In 'Whac-A-Mole' instead the high-frequency peak is not clearly visible. One factor that may affect this difference is that the gameplay of 'Whac-A-Mole' is faster than 'Spheroids' (roughly 1 s instead of 10 s), so that the separation between the two potential peaks is smaller. Other factors that may influence the absence of the second peak are the fact that the gameplay is 2D (so that the speed in one direction is less informative) or the inadequate time-resolution of the camera (different from the one used for 'Spheroids').  Table 5 Characteristics of the secondary variables for the 191 samples composing the main dataset (obtained from first order variables, cf. Table 1). The r columns refers to the Pearson correlation coefficients with the FM-UE, CAHAI, and BI clinical scales, respectively. Correlations below the significance threshold r ∼ 0.081 (cf. Fig. 12 in Appendix) are in grey. From the time since stroke we obtain the categories Acute (5-90 days), Sub-acute (3-12 months) and Chronic (over 1 year). The 'instantaneous' variables obtained directly from game log-file are in Italic type. The 'Diff.' variables are obtained as the difference between the value observed for the less affected arm and the value for the more affected one. The 'Log.' variables are obtained as the natural logarithm of the corresponding first order variables.
Variables  Table 6 Characteristics of the secondary variables for the 54 samples composing the responsiveness dataset (obtained from first order variables, cf. Table 3). We select couple of sessions of same patient with a delay of at least 16 days from the main dataset, cf. Table 1.
The r columns refers to the Pearson correlation coefficients with the FM-UE, CAHAI, and BI clinical scales, respectively. Correlations below the significance threshold r ∼ 0.14 are in grey. The 'instantaneous' variables obtained directly from game log-file are in Italic type. The 'Diff.' variables are obtained as the difference between the value observed for the less affected arm and the value for the more affected one. The 'Log.' variables are obtained as the natural logarithm of the corresponding first order variables. Competing interests P. F. M. J. Verschure leads the research group SPECS that developed RGS, and is the CEO/founder of the spin-off company Eodyne Systems, SL, which commercializes RGS with the goal to achieve a large-scale distribution of low-cost science-based rehabilitation technologies.

Availability of data and materials
The datasets used and analysed in this study are available from the corresponding author on reasonable request.

Consent for publication
Not applicable.

Authors' contributions
Contributors BR, FA, and AC planned the analysis. BR, MM and PV contributed to data acquisition. FA, and AC analysed the data. All authors wrote the draft paper. PV submitted the study and is responsible for the overall content as guarantor.  For the FM-UE score, the peaks are at (r = 0.54 for σ = 1.3s) for the frontal direction and at (r = 0.50 at σ = 1.1s) for the lateral direction. For the CAHAI score, the peaks are at (r = 0.44 for σ = 0.78s) for the frontal direction and at (r = 0.43 at σ = 0.31s) for the lateral direction. For the BI, the peaks are at (r = 0.44 for σ = 1.2s) for the frontal direction and at (r = 0.49 at σ = 1.1s) for the lateral direction.