Smartphone-based remote assessment of upper extremity function for multiple sclerosis using the Draw a Shape Test

Objective: Smartphone devices may enable out-of-clinic assessments in chronic neurological diseases. We describe the Draw a Shape (DaS) Test, a smartphone-based and remotely administered test of upper extremity (UE) function developed for people with multiple sclerosis (PwMS). This work introduces DaS-related features that characterise UE function and impairment, and aims to demonstrate how multivariate modelling of these metrics can reliably predict the 9-Hole Peg Test (9HPT), a clinician-administered UE assessment in PwMS. Approach: The DaS Test instructed PwMS and healthy controls (HC) to trace predefined shapes on a smartphone screen. A total of 93 subjects (HC, n = 22; PwMS, n = 71) contributed both dominant and non-dominant handed DaS tests. PwMS subjects were characterised as those with normal (nPwMS, n = 50) and abnormal UE function (aPwMS, n = 21) with respect to their average 9HPT time (≤ or > 22.7 (s), respectively). L1-regularization techniques, combined with linear least squares (OLS, IRLS), or non-linear support vector (SVR) or random forest (RFR) regression, were investigated as functions to map relevant DaS features to 9HPT times. Main results: Average non-dominant handed 9HPT times were more accurately predicted by DaS features (r² = 0.41, P < 0.05; MAE: 2.08 ± 0.34 (s)) than average dominant handed 9HPT times (r² = 0.39, P < 0.05; MAE: 2.32 ± 0.43 (s)), using simple linear IRLS (P < 0.01). Moreover, the mean absolute error (MAE) in predicted 9HPT times was comparable to the variability of actual 9HPT times within the HC, nPwMS and aPwMS groups, respectively. The 9HPT, however, exhibited large heteroscedasticity, resulting in less stable predictions of longer 9HPT times. Significance: This study demonstrates the potential of the smartphone-based DaS Test to reliably predict 9HPT times and remotely monitor UE function in PwMS.


Introduction
Multiple sclerosis (MS) is a chronic inflammatory disease of the central nervous system, affecting more than 2 million people worldwide [1]. The impairment of upper extremity (UE) function and manual dexterity resulting from sensory and motor deficits is widely reported across all subtypes of MS, although progressive MS is associated with higher prevalence of UE dysfunction and greater impairment of manual dexterity [2,3]. UE dysfunction impacts the ability of people with MS (PwMS) to perform activities of daily living, affecting their independence, work retention and quality of life [4,5].
While various performance tests and patient-reported outcome measures are available [6], the 9-Hole Peg Test (9HPT) is the most frequently used measure of manual dexterity in MS research, clinical trials and clinical practice [7,8]. The 9HPT requires participants to repeatedly place nine pegs into nine holes and then remove them, one at a time, as quickly as possible [7]. Performance is commonly evaluated as the time taken to complete the test.

Methods

Dataset
The NCT02952911 study aimed at assessing the feasibility of using smartphone- and smartwatch-based tests to remotely monitor PwMS and healthy controls (HC) over a 24-week period [31]. The DaS test instructed all study participants daily to draw six different shapes presented on the smartphone screen as quickly and as accurately as possible, within a maximum time of 30 seconds per attempted shape. The six shapes to be drawn were a diagonal line from bottom left to top right, a diagonal line from top right to bottom left, a square, a circle, a figure-8-shape, and a spiral. The drawing had to be performed with the index finger of the tested hand, and subjects alternated each day between their dominant and non-dominant hand. Line shapes were not considered in this study. Figure 1 depicts a demonstration of the DaS test performed by a participant. A total of 93 subjects (HC, n = 22; PwMS, n = 71) contributed both dominant and non-dominant DaS tests used for analysis in this study. PwMS were divided into normal (nPwMS) and abnormal (aPwMS) subgroups with respect to their average combined dominant and non-dominant 9HPT times over the entire study [7]. The threshold for abnormal UE function was defined as an average 9HPT time greater than the mean plus 2 standard deviations of normative data from a healthy population, pooled over dominant (9HPT threshold: 17.8 + 2(2.2) (s)) and non-dominant (9HPT threshold: 18.5 + 2(2.3) (s)) tests [12,32]. Hence, the aPwMS subgroup consisted of PwMS with average 9HPT times > 22.7 (s) and the nPwMS subgroup of PwMS with average 9HPT times ≤ 22.7 (s).
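The threshold arithmetic above can be made explicit with a minimal Python sketch. It assumes that "pooled" means the simple average of the two hand-specific thresholds; this reading reproduces the quoted 22.7 (s) cutoff after rounding, but it is an assumption rather than a confirmed detail of the study. The function names are hypothetical.

```python
# Normative 9HPT times (mean, SD) in seconds, as quoted in the text.
NORMS = {"dominant": (17.8, 2.2), "non_dominant": (18.5, 2.3)}


def hand_threshold(mean_s, sd_s, n_sd=2.0):
    """Mean plus n_sd standard deviations, in seconds."""
    return mean_s + n_sd * sd_s


def pooled_threshold(norms=NORMS):
    """Average of the hand-specific thresholds.

    Assumption: 'pooled' is read as the simple mean of the two thresholds
    (22.2 s and 23.1 s), giving about 22.65 s.
    """
    per_hand = [hand_threshold(m, sd) for m, sd in norms.values()]
    return sum(per_hand) / len(per_hand)


def classify_pwms(avg_9hpt_s, threshold_s=22.7):
    """Label a PwMS subject using the 22.7 s cutoff stated in the text."""
    return "aPwMS" if avg_9hpt_s > threshold_s else "nPwMS"


if __name__ == "__main__":
    print(f"pooled threshold: {pooled_threshold():.2f} s")
    print(classify_pwms(21.4), classify_pwms(25.0))
```

The rule applies to PwMS only; healthy controls keep their HC label regardless of their 9HPT time.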

Raw Data Processing
Raw sensor data was collected from the smartphone touchscreen during the active DaS test and stored as x- and y-screen coordinates with a corresponding timestamp t, (x, y, t). A bespoke MATLAB script extracted attempted and completed shapes from each test, along with the corresponding hand used. All first attempts were used for further feature analysis. All data processing was performed using MATLAB vR2018a (The MathWorks, Natick, MA, USA).

Characterization of Manual Dexterity Performance
Multiple features were extracted from each shape, capturing temporal, spatial and spatiotemporal aspects involved in the drawing task and potentially reflective of manual dexterity. Furthermore, overall test performance statistics were calculated, such as the time taken to complete all shapes and the number of shapes completed. For a full list of the features extracted, please see the accompanying supplementary material. A selection of relevant features is described below and illustrated in figures 2-5.

Temporal Features
Temporal features including drawing velocity, angular and radial velocities and speed distribution measures were computed to assess temporal irregularities, such as delays, smoothness, jerkiness, and rapid finger/hand movement [9,19,33]. The dominant frequency and power spectral density were measured for frequencies between 1 and 7 Hz, with the aim of surfacing potential tremulous actions, such as cerebellar intention tremor, or recording ataxic movements, both commonly exhibited in PwMS [32,33]. Examples of drawing speed and the power spectral density (PSD) estimate of drawing speed are illustrated in figure 2.
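As an illustration of these temporal features, the following Python sketch computes point-to-point drawing speed and the dominant frequency of a speed trace within the 1-7 Hz band. It is not the study's MATLAB implementation. It assumes uniformly sampled touch data with strictly increasing timestamps, and it uses a plain periodogram computed with a direct DFT, whereas the PSD estimator actually used is not specified in the text. All function names are hypothetical.

```python
import cmath
import math


def drawing_speed(xs, ys, ts):
    """Point-to-point speed (pixels per second) from (x, y, t) samples.

    Assumes strictly increasing timestamps (no zero time steps).
    """
    speeds = []
    for i in range(1, len(ts)):
        dt = ts[i] - ts[i - 1]
        dist = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        speeds.append(dist / dt)
    return speeds


def periodogram(signal, fs):
    """Return (frequencies, power) of the mean-removed signal via a direct DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                   for i, c in enumerate(centred))
        freqs.append(k * fs / n)
        power.append(abs(coef) ** 2 / (fs * n))
    return freqs, power


def dominant_frequency(signal, fs, f_lo=1.0, f_hi=7.0):
    """Frequency with the largest power inside [f_lo, f_hi] Hz, or None."""
    freqs, power = periodogram(signal, fs)
    band = [(p, f) for f, p in zip(freqs, power) if f_lo <= f <= f_hi]
    return max(band)[1] if band else None
```

For example, a speed trace carrying a 4 Hz oscillation and sampled at 50 Hz should yield a dominant frequency of 4 Hz. Real touchscreen data may be irregularly sampled and would need resampling before any spectral estimate.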

Spatial Features
Spatial aspects of finger or hand movements were captured by features based on drawing error [9,18,33]. A new approach to compute drawing error is presented in this study, based on a shape-matching measure known as the Hausdorff distance [34,35]. Let X and Y be two non-empty subsets of a metric space (M, d). The Hausdorff distance $d_H(X, Y)$ is defined as
$$d_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y),\; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\}$$
where sup is the supremum, inf is the infimum and the distance d is computed as the Euclidean $L_2$ norm. This metric compares the maximum distance of one set to the nearest point in another set [36], which can be used as a basis to compute the error between the reference way-points (interpolated into a reference shape scaled to the number of pixels drawn) and the subject's drawing attempt. The maximal Hausdorff distance is a measure of the absolute deviation from the reference shape, while the total drawing error can also be defined as the sum of the Hausdorff distances (i.e. the largest minimum distances) between the drawn and reference shape, normalized by the number of touch coordinates drawn. An example of Hausdorff distances can be found in figure 3.
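For finite point sets, the supremum and infimum reduce to a maximum and a minimum, so the Hausdorff distance can be sketched in a few lines of Python. The normalized drawing error below follows one possible reading of the text, namely the sum of each drawn point's distance to its nearest reference point, divided by the number of drawn points. That reading is an assumption, and the function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest neighbour in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite 2D point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def mean_drawing_error(drawn, reference):
    """Sum of nearest-reference distances over the drawn points,
    normalized by the number of drawn points (assumed reading)."""
    total = sum(min(math.dist(p, q) for q in reference) for p in drawn)
    return total / len(drawn)


if __name__ == "__main__":
    reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    drawn = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    print(hausdorff(drawn, reference))           # maximal deviation
    print(mean_drawing_error(drawn, reference))  # average deviation
```

This brute-force version is quadratic in the number of points, which is acceptable for a single drawing but would benefit from a spatial index for large batches.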

Spatiotemporal Features
Digital drawings are unique in that they encapsulate spatial and temporal performance information simultaneously: each pixel point contains 3D data relating to the person's hand movement at that time. This information is exploited to create discretized heat maps of touch events (x, y, t). A heat map not only gives a visual representation of performance but can also be used to extract a further important subset of features which may be sensitive to motor control or disease fluctuations. In order to compare intensity maps for image analysis, each shape drawn is scaled to the same coordinates, while the timescales and related colour intensities are based on global, not local, pixel counts. Pixels are binned into a coarser grid, both for graphical purposes and as a means to allow comparisons between subjects for each shape drawn. A graded single-colour intensity scheme is used to represent pixel densities, which is transposed into an achromatic scale for image feature extraction.
Image features are extracted from both the shape drawings and their transposed heat maps. An example of a discretized heat map is illustrated in figure 4. Pixel intensities measure the structural composition of such images, while the drawings are also compared with an ideal drawing for similarities using measures such as the 2D image correlation coefficient, or Mutual Information (MI) between the two images [37]. Image entropy of the heat map-transposed shape drawings (converted to grayscale) was calculated using
$$E = -\sum_{k} p_k \log_2 p_k$$
where the sum runs over the grey levels k and $p_k$ is the probability associated with grey level k. Entropy is a commonly used measure of disorder in a system and can be used in image analysis for texture mapping [38]. The topography of transposed pixel intensity drawings becomes a function of finger movements, and hence entropy becomes a measure of smooth, non-hesitant drawing. Further features capturing hesitation times and aspects of fine directional changes during drawing were calculated to capture elements of cognitive-motor interference known to affect UE function in MS [4]. Finally, a new measure, celerity, was defined as the ratio of successfully passed waypoints divided by the time taken to complete the shape.
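The heat map, entropy and celerity features can be sketched in Python as follows. This is a minimal illustration with several assumptions: touch samples are binned on a square grid, the raw per-cell counts are treated directly as grey levels (a real pipeline would quantise to a fixed grayscale range), entropy is expressed in bits, and celerity is read as the fraction of waypoints passed divided by the completion time. The function names are hypothetical.

```python
import math
from collections import Counter


def bin_touches(points, width, height, n_bins):
    """Count touch samples per cell of an n_bins x n_bins grid."""
    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in points:
        col = min(int(x / width * n_bins), n_bins - 1)
        row = min(int(y / height * n_bins), n_bins - 1)
        grid[row][col] += 1
    return grid


def image_entropy(grid):
    """Shannon entropy (bits) of the grey-level histogram of a 2D grid."""
    levels = Counter(v for row in grid for v in row)
    total = sum(levels.values())
    return -sum((c / total) * math.log2(c / total) for c in levels.values())


def celerity(waypoints_passed, total_waypoints, duration_s):
    """Fraction of waypoints passed per second of drawing (assumed reading)."""
    return (waypoints_passed / total_waypoints) / duration_s
```

Because a stationary finger keeps being sampled, cells where the subject hesitates accumulate higher counts, which adds distinct grey levels and tends to raise the entropy of the map.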

Data Selection
This study aims to investigate the prediction of clinical 9HPT times using features relating to UE function computed from the DaS test. Consider a simple linear regression model of the form
$$Y = X\beta + \mu$$
In this case, X is the design matrix and contains the median and standard deviation of each feature per subject over all available test days, for dominant and non-dominant hand tests separately. The model errors µ are assumed to be normally distributed with zero mean and constant variance, $\sigma^2$. The response variable, Y, is the average 9HPT time per subject over the entire study (all baseline, week 12 and week 24/study completion observations considered) for dominant and non-dominant handed 9HPTs separately. We assume that drawing performance will generally vary depending on dominance [20,32]; therefore independent models were evaluated for the dominant and non-dominant hands.
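The construction of the design matrix can be illustrated with a short Python sketch that collapses each subject's daily feature values into a median and a standard deviation. The input layout, the column naming and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import median, stdev


def subject_feature_row(daily_values):
    """Summarise one subject and one hand.

    daily_values: {feature_name: [value per test day]} (hypothetical layout).
    """
    row = {}
    for name, values in sorted(daily_values.items()):
        row[f"{name}_median"] = median(values)
        # Sample SD; a single test day carries no spread information.
        row[f"{name}_sd"] = stdev(values) if len(values) > 1 else 0.0
    return row


def design_matrix(subjects):
    """subjects: {subject_id: daily_values}. Returns (ids, columns, X)."""
    ids = sorted(subjects)
    rows = [subject_feature_row(subjects[s]) for s in ids]
    columns = sorted(rows[0])
    return ids, columns, [[r[c] for c in columns] for r in rows]


if __name__ == "__main__":
    demo = {
        "s1": {"speed": [1.0, 2.0, 3.0]},
        "s2": {"speed": [2.0, 2.0, 2.0]},
    }
    print(design_matrix(demo))
```

One such matrix would be built per hand, matching the separate dominant and non-dominant models described above.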

Statistical Analysis
Features were assessed for non-normality by visual inspection. Non-normal features were transformed using Box-Cox transformations [39]. Pre-processing assessment of the response (9HPT) displayed a highly tailed distribution, and as such the 9HPT was also transformed towards Gaussianity to help fulfil the error assumptions of linear prediction [40]. Differences in median clinical metrics (EDSS, 9HPT) and feature values between subject groups (HC, nPwMS, aPwMS) were tested using a Kruskal-Wallis test (KWt). Categorical differences in sex were investigated using a Chi-squared ($\chi^2$) test. A Brown-Forsythe test (BF) was used to evaluate the null hypothesis that the data in each subject group (HC, nPwMS, aPwMS) have equal variances, against the alternative that at least two of the groups do not. The BF test calculates ANOVA on the absolute deviations of the data values from the group medians [41]. Differences in model residuals and model prediction errors between hands and between models were also assessed using a Wilcoxon signed rank test.
Pearson's correlation (R_ps) and Spearman's rank correlation (R_sp) were used to assess the univariate association of features with the 9HPT time. Wilcoxon signed rank tests were used to investigate differences in 9HPT values and features between dominant and non-dominant handed tests. P-values were corrected using the method described by Benjamini and Hochberg [42] in cases where multiple hypothesis testing was performed.
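The Brown-Forsythe statistic described above can be sketched directly from its definition as a one-way ANOVA on absolute deviations from the group medians. The Python sketch below returns only the F statistic. Obtaining a P-value would additionally require the F distribution with (k - 1, N - k) degrees of freedom, which is omitted here. The function name is hypothetical.

```python
from statistics import mean, median


def brown_forsythe_f(groups):
    """F statistic of a one-way ANOVA on |value - group median|."""
    devs = [[abs(v - median(g)) for v in g] for g in groups]
    n_total = sum(len(d) for d in devs)
    k = len(devs)
    grand = sum(sum(d) for d in devs) / n_total
    # Between-group mean square of the absolute deviations.
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in devs) / (k - 1)
    # Within-group mean square of the absolute deviations.
    within = sum((v - mean(d)) ** 2 for d in devs for v in d) / (n_total - k)
    return between / within


if __name__ == "__main__":
    print(brown_forsythe_f([[1, 2, 3], [0, 2, 4]]))
```

Using medians rather than means as the centre makes the test less sensitive to the skewed, heavy-tailed 9HPT distribution noted above.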
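For reference, the Benjamini-Hochberg step-up adjustment can be written in a few lines of Python. This is a generic sketch of the standard procedure, not code from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A feature would then be reported as significant when its adjusted value falls below the chosen level, for example 0.05.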

Model Generalisability
To determine the generalisability of our models, stratified 5-fold subject-wise cross-validation (CV) was employed. This consisted of randomly partitioning the dataset into k = 5 folds, stratified with equal proportions of HC, nPwMS and aPwMS where possible. One set was denoted the training set (in-sample), which was further split into smaller sets for parameter selection (validation) using an internal 5-fold CV approach. The remaining data were then denoted the testing set (out-of-sample), on which predictions were made. CV was repeated 10 times with new random partitions in order to reduce bias in re-sampling and dataset splitting.
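The partitioning scheme can be sketched in Python as below. It is a minimal illustration assuming one row per subject, so that subject-wise splitting amounts to assigning subject identifiers to folds. The round-robin assignment and the seeding scheme are illustrative choices, not details taken from the study.

```python
import random
from collections import defaultdict


def stratified_subject_folds(groups, k=5, seed=0):
    """groups: {subject_id: group_label}. Returns k lists of subject ids."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for subject, label in sorted(groups.items()):
        by_group[label].append(subject)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_group):
        members = by_group[label]
        rng.shuffle(members)
        # Deal subjects round-robin so each fold gets a similar share.
        for j, subject in enumerate(members):
            folds[(offset + j) % k].append(subject)
        offset += len(members)
    return folds


def repeated_cv_splits(groups, k=5, repeats=10, seed=0):
    """Yield (repetition, fold, train_ids, test_ids) for repeated k-fold CV."""
    for r in range(repeats):
        folds = stratified_subject_folds(groups, k, seed + r)
        for i in range(k):
            test = folds[i]
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            yield r, i, train, test
```

The internal validation loop can reuse `stratified_subject_folds` on the training subjects of each outer split, so that parameter selection never sees the held-out test subjects.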

Model Evaluation
In order to reveal the (potentially non-linear) functional relationship between the DaS Test (as represented by the features extracted) and the associated 9HPT, a number of regression models were evaluated based on mean absolute error (MAE) and root-mean-squared error (RMSE), as measured in seconds (s). MAE is defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - \hat{Y}_i \right|$$
where $\hat{Y}$ are the model predictions and N is the number of observations in the training or testing set, respectively. To prevent overfitting and reduce the dimensionality (M) of the DaS features, the 'Least absolute shrinkage and selection operator' (LASSO) method was employed [43]. LASSO allows removal of features by shrinking some feature coefficients, β, in our regression towards zero, filtering towards the most important measures whilst also making selection decisions on sets of collinear features. LASSO imposes the L1-norm penalty on the residual sum of squares over N observations using non-negative values of the shrinkage parameter λ, yielding
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( Y_i - \sum_{j=1}^{M} X_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{M} \left| \beta_j \right| \right\}$$
A top feature ranking table was deduced by interrogating the feature subsets selected by LASSO at each fold and repetition. The relative stability of the features selected was assessed by recording the percentage of times that each feature was selected across folds and repetitions. It has been suggested that bias or prediction error can be decreased by performing a separate regression post-LASSO [44]. As such, features were selected using LASSO and then presented to linear models; this study investigated the performance of ordinary least squares (OLS) and iteratively re-weighted least squares (IRLS), the latter minimising a weighted sum of squares using a 'bisquare' weighting function [40,43].
It is possible that the DaS features do not combine linearly to predict the 9HPT, and as such non-linear regression was also explored. Support vector regression (SVR) is a widely used technique to perform non-linear regression by mapping the feature space to a higher dimension using a 'kernel trick' [45]. In this case, features selected per CV fold by LASSO are presented to SVR models, which are tuned via grid search to determine optimal values of the kernel parameter γ, penalty parameter C, and L1 soft-margin regularization parameter ε. SVR models were tested using linear and Gaussian radial basis function (RBF) kernels. Further to this, non-linear random forest regression (RFR) was also investigated using the whole feature set [46]. Regressors were built on raw features and trained with a split criterion based on mean decrease in RMSE, and optimised over varying numbers of trees and the number of input variables chosen at each node.
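The error metrics and the LASSO selection step can be sketched in Python as follows. The solver is a plain cyclic coordinate descent written for illustration. It assumes centred (and ideally standardised) features and response, so that no intercept is needed, and it is not the MATLAB routine used in the study. The function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within gamma become zero."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|.

    Assumes centred data (no intercept term is fitted).
    """
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residuals for beta = 0
    for _ in range(n_iter):
        for j in range(m):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid))
            new = soft_threshold(rho, lam / 2.0) / norm_sq
            resid = [r - c * (new - beta[j]) for c, r in zip(col, resid)]
            beta[j] = new
    return beta


def selected_features(beta, tol=1e-12):
    """Indices of features with non-zero coefficients."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]
```

Features returned by `selected_features` would then be passed to a separate post-LASSO regression, and counting how often each index is selected across folds and repetitions gives the stability ranking described above.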

Results
PwMS subjects in this study were stratified into those with presumably normal (nPwMS) and abnormal (aPwMS) UE function with respect to their pooled average 9HPT times. Table 1 presents the demographic information per subject group (HC, nPwMS, aPwMS). The different MS phenotypes, namely primary progressive multiple sclerosis (PPMS), secondary progressive multiple sclerosis (SPMS) and relapsing remitting multiple sclerosis (RRMS), are reported in table 1 but their effects are not considered for analysis. The effect of differences in the male to female ratio within each subject group (HC, nPwMS and aPwMS), while imbalanced, was also not considered in subsequent analysis.
In the overall population, 9HPT times were found to be significantly different between dominant and non-dominant hands (P < 0.05). Furthermore, 9HPT times were significantly different between dominant and non-dominant hands for HC (P < 0.05) and nPwMS (P < 0.01), but not for aPwMS (P = 0.46). Two subjects' average 9HPT times (pooled over dominant and non-dominant) were found to be 38.9 and 41.9 (s) respectively, both greater than the mean plus 4 standard deviations (> 38.1 (s)) of the entire study population. These subjects were considered outliers with respect to the available data and were subsequently removed from the final predictive analysis.
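The outlier rule can be expressed as a short Python sketch. It assumes the cutoff is computed from all subjects' pooled average 9HPT times, including the candidate outliers, and uses the sample standard deviation. Both choices are assumptions, and the function name is hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(avg_times, n_sd=4.0):
    """avg_times: {subject_id: average 9HPT time in seconds}.

    Returns (cutoff, sorted ids whose time exceeds mean + n_sd * SD).
    """
    values = list(avg_times.values())
    cutoff = mean(values) + n_sd * stdev(values)
    flagged = sorted(s for s, t in avg_times.items() if t > cutoff)
    return cutoff, flagged


if __name__ == "__main__":
    times = {f"s{i}": 20.0 for i in range(30)}
    times["slow"] = 40.0
    print(flag_outliers(times))
```

With few subjects, an extreme value inflates the standard deviation enough that a 4 SD rule may never trigger, so the rule is only meaningful for cohorts of adequate size.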

Feature Demonstration
A cross-section of relevant features is illustrated in figures 2-5. Each figure shows an example from a representative subject from each subject group: HC, nPwMS, aPwMS. Figure 3, for instance, demonstrates how the Hausdorff distance is calculated for the figure-8-shape; this distance was observed to increase with higher 9HPT times for both dominant (R_sp: 0.49, P < 0.001) and non-dominant handed tests (R_sp: 0.51, P < 0.001).
As an example from the circle shape, drawing speed can be less smooth and more variable in nPwMS and aPwMS than in HC subjects, who tend to draw faster and more consistently (figure 2). The variability in absolute drawing speed, for example, was significantly greater in both PwMS groups (KWt, P < 0.001) for both hands. The respective frequency distribution of drawing speed also revealed dominant peaks at multiple frequencies for more variable shape drawing.
Pixel density maps can be created based upon the relative sampling stability of the smartphone screen. The longer a finger touch pointer stays in a position, the more it will be sampled, and hence a heat map representation can be built from the finger movements both temporally and spatially. Figure 4 illustrates spiral drawings represented as discretized heat maps. Areas of hesitation and non-movement are visually apparent and characterised by dense regions of heat intensity. Image entropy encodes this accumulation of hesitation and irregularity of drawing. It was observed that higher spiral entropy values correlated significantly with higher 9HPT times (dominant, R_sp: 0.40, P < 0.001; non-dominant, R_sp: 0.45, P < 0.001). Figure 5 demonstrates the calculation of hesitation time and overshoot at the corners of square shapes. Values of drawing speed were mapped to the original drawing for visual analysis.

Feature Evaluation
This study found that 311 features were significantly correlated with the 9HPT (Spearman's rank R: P < 0.05). Wilcoxon signed rank tests between these features calculated from dominant and non-dominant handed tests revealed that 70% were significantly different between handed tests (P < 0.05). It was observed that 40% of HC, 73% of nPwMS and 57% of aPwMS subject features differed significantly between hands (P < 0.05). Table 2 describes the top ten features selected by LASSO and the relative frequency with which they were picked for both dominant and non-dominant handed models. Image entropy was the top feature for both handed tests. However, it was computed for the spiral for dominant and for the figure-8-shape for non-dominant handed tests.

Model Evaluation
It was observed that non-dominant handed models more accurately predicted 9HPT times (MAE: 2.08 ± 0.34 (s)) than dominant handed regression models (MAE: 2.32 ± 0.43 (s)), using simple IRLS across 5-fold CV and 10 repetitions (P < 0.01). Figure 6 compares the out-of-sample test MAE between hands as a function of the number of features added to IRLS models across 5-fold CV and 10 repetitions. It can be seen that MAE decreases as more features are evaluated in both dominant and non-dominant handed models. Non-dominant handed tests exhibited lowest MAE with 6 features (2.21 ± 0.04 (s)) compared to dominant handed tests with 16 features (1.93 ± 0.08 (s)). Scatterplots of the raw 9HPT predictions per subject averaged over all CV repetitions using IRLS reveal good agreement with their ground truth for dominant (r² = 0.39) and non-dominant (r² = 0.41) tests (figure 7). Breakdown of average 9HPT predictions within each subject group demonstrated that HC (dominant: 1.81 ± 1.32 (s); non-dominant: 1.93 ± 1.12 (s)) and nPwMS (dominant: 1.98 ± 1.15 (s); non-dominant: 1.62 ± 1.10 (s)) had lower MAE than aPwMS for both handed models (table 3). Subjects considered aPwMS were much more difficult to predict for both dominant (3.81 ± 2.28 (s)) and non-dominant (3.51 ± 1.56 (s)) 9HPTs.
MAE was higher in non-dominant handed models than dominant for HC subjects, whereas non-dominant handed 9HPTs were predicted more accurately than dominant for nPwMS and aPwMS. While the mean absolute error was not significantly different between hands for HC (P = 0.83) and aPwMS (P = 0.94), a stronger trend was exhibited by the nPwMS group (P = 0.09). Visual comparison of table 3 and figure 7 reveals that at higher 9HPT values (i.e. aPwMS) the predictions were less accurate by greater magnitudes. Non-linear techniques also exhibited this pattern. Figure 8 illustrates the intra- and inter-subject variability of the 9HPT. The within-subject variability of the 9HPT increased with higher 9HPT values (r² = 0.54, P < 0.001; R_ps: 0.73, P < 0.001; R_sp: 0.57, P < 0.001). A Brown-Forsythe test for equal variances in Y between subject groups (HC, nPwMS, aPwMS) demonstrated that the between-subject variability in 9HPT also increased with each subject group (P = 0.02). In concordance with table 3, aPwMS were shown to exhibit greater variability than HC and nPwMS, where higher 9HPT times tend to have much greater within- and between-subject variance. Finally, table 4 compares the out-of-sample test error from the four respective models built in this study. There were no significant differences observed between any of the model predictions.

Discussion
The present study examines UE function in PwMS with mild-to-moderate disability in comparison with HC using DaS, a self-administered digital drawing test captured on a smartphone, and demonstrates how modelling of DaS features from a test can reliably predict the average time of the clinician-administered 9-Hole Peg Test (9HPT). It has been proposed that smartphone-based tests developed for repeated assessments in remote settings may offer reliable and objective metrics that could capture a unique window in a subject's disease state and previously unseen or inappropriately estimated characteristics of MS disease [47].
Due to the inherently heterogeneous dissemination in space and time of multiple sclerosis, PwMS experience varying levels of dysfunction or fatigue across different physical domains [48]. As a method to characterise the PwMS population in this study we have divided PwMS into those with presumed UE function abnormality (aPwMS), and those with normal UE function (nPwMS), based on average recorded 9HPT times. Abnormal 9HPT times were considered as 9HPT times greater than two standard deviations beyond hand-matched normative data from a healthy population [32]. While applying hard thresholds on clinically administered scales is a blunt stratification method, distinct attributes of each group were apparent and will be discussed with respect to features and predictions.

Feature Discussion
Previous digital upper extremity function assessments have focused mainly on spiral drawing [9,21,22,26-29] in Parkinson's disease, while those incorporating other types of shapes or drawings have extracted comparatively little information [30]. By considering other shapes such as the circle, square, and figure-8-shape, it was hoped to probe all aspects of hand function along with MS-specific pathological impairments such as ataxia, various tremor types, and spasticity [4,5,17,49]. UE hand-motor impairments manifest differently in PwMS than in Parkinson's disease. This led our study to extract a more exhaustive feature space. Both previously developed and novel features were derived and tested for their clinical validity through multivariate modelling of the 9HPT, a commonly used UE function test in PwMS. Figures 2-5 aim to characterise some of the DaS features developed in this study and how the level of UE impairment may influence each shape drawn and the resultant feature value. Univariate analysis of these DaS features demonstrated moderate-to-strong Pearson's and Spearman's correlations with the 9HPT (table 2), with many coefficients comparable to a range of outcome measures for upper extremity function [7].
Consistent with our study, Feys et al [20] identified handedness as a possibly influential factor on digital drawing performance, although they were unable to test this in a healthy sub-population. Erasmus et al [32] observed a significant difference in drawing error in PwMS subgroups with cerebellar upper limb ataxia, and generally worse performance in non-dominant hands across their feature set. Such differences across hands may be amplified by MS-related impairment, and in this study a greater proportion of features differed significantly between hands for nPwMS (70%) and aPwMS (57%) compared with HC (40%). Therefore, digital tests of upper extremity function that are conditioned on the hand used may be more sensitive to MS disease severity and changes in disease course.

Model Discussion
The DaS testing battery is an information-rich but dimensionally dense test. Similar features can be extracted from six different shapes, quickly accumulating the overall volume of features and contributing to redundancy. Many features exhibited collinearity within and between shapes. Hence, LASSO L 1 -regularization was employed in order to reduce the feature space, minimise the effects of collinearity and identify important predictors of 9HPT time. Recording the relative frequency at which features were selected allows an interpretation of the feature type and shapes that are most useful to probe aspects of MS disease. Novel features such as Hausdorff distance drawing error or drawing entropy calculated from heat-map    accuracy of 9HPT times can be deduced considering that the MAE for HC subjects (1.81 ± 1.32, 1.93 ± 1.12 (s)) was close to the standard deviation of HC 9HPT times (18.3 ± 1.7, 19.1 ± 1.8 (s)) for both dominant and non-dominant handed tests, respectively. Similarly it was shown that MAE for nPwMS (1.98 ± 1.15, 1.62 ± 1.10 (s)) was similar to the variability of their actual 9HPT times (19.4 ± 2.1, 20.1 ± 1.6 (s)).
Higher 9HPT times (those subject observations indicated as aPwMS) were, however, predicted less accurately for both handed tests (3.81 ± 2.28, 3.51 ± 1.56 (s)). Visual examination of the distribution of actual 9HPT times (figure 7) demonstrated that greater 9HPT times exhibited higher variance between subject measurements, representative of an inverse Gaussian distribution. Figure 8 corroborated this heteroscedastic observation, further demonstrating that greater 9HPT times exhibited higher variance within subject measurements (r² = 0.54, P < 0.001). As 9HPT times are measured in seconds and are unbounded, hesitations, incorrect movements and the erratic impact of dropping a peg outside of the board, all of which are more likely in those with greater UE impairment, can compound to greater magnitudes of accumulated 9HPT times. Consequently, higher 9HPT times become more variable and less stable, both between and within subjects with greater UE impairment. Nonetheless, the MAE for aPwMS (MAE: 3.81 ± 2.28, 3.51 ± 1.56 (s)) was still less than the standard deviation of 9HPT for this group (9HPT: 26.3 ± 5.4, 27.2 ± 5.0 (s)).
Comparison of the out-of-sample test prediction error across CV repetitions demonstrated that RFR and SVR, both non-linear techniques, performed slightly better than OLS or IRLS models. RFR, which intrinsically uses non-linear feature selection, in contrast to SVR, which depends on features selected linearly by LASSO, gave the minimum prediction error. However, considering that the unit of measurement for the 9HPT is seconds (s), differences between models of 0.5 seconds can be considered minimal. As such, it can be assumed that a simple linear function can accurately and adequately capture the relationship between 9HPT and DaS features.

Limitations
While the results presented in this study demonstrate the utility of using digitally captured DaS features with high concurrent validity as demonstrated by their capacity to predict clinical 9HPT times, there are a number of important limitations which need to be addressed.
First, a limitation of this work is the reliance on estimating a narrow clinical proxy of UE function as a ground truth. As discussed, the 9HPT aims to evaluate UE impairment in PwMS, but given it is measured in seconds it can exhibit large heteroscedasticity and can be highly variable, especially for longer completion times. It was observed that higher 9HPT times had a higher variance for intra-subject measurements over clinical visits (figure 8). As such, the 9HPT time should not be considered an exact measure of UE impairment, but rather an estimate of a severity range of function that may be impaired. Multiple sclerosis is a heterogeneous disease which can not only manifest differently across people, but whose symptoms may also vary within specific domains, including UE function. Some studies even suggest only moderate test-retest reliability of the 9HPT when examined in a large healthy cohort [12], rather than in the more disabled PwMS of other studies [7]. Reliance should therefore not be placed on one test administered infrequently, such as the 9HPT, to effectively capture all aspects of UE function. HC and nPwMS, for example, could not be significantly discriminated from each other based on 9HPT times (dominant: P = 0.11; non-dominant: P = 0.08). These are all limitations that should be considered when reporting predictions of any model mapping directly to the 9HPT.
The cohort analysed in this study is relatively small (n = 93 subjects). Most MS patients were mildly disabled with respect to their overall and motor-specific clinical scores (table 1), and it was observed that the distribution of the 9HPT was highly tailed and skewed towards shorter 9HPT times. As a result, our models systematically underestimated 9HPT scores for the aPwMS group yet more accurately predicted shorter 9HPT times, where a greater density of similar observations was available, i.e. the HC and nPwMS subjects. The sparsity in the representation of aPwMS, who additionally are characterized by higher intra- and inter-subject 9HPT variability, limited our ability to learn a more accurate global model on longer 9HPT times. This work may therefore be biased by uneven distributions of UE impairment despite CV stratification.
Low subject numbers also create generalisability problems across cross-validation folds. A low standard deviation in MAE and RMSE regression error (± 0.5 (s)) was observed across CV repetitions (table 4), demonstrating that results do not change with different permutations of subjects within CVs. Despite this, it was found that feature distributions may not generalise across training, validation and test sets within CV folds, leading to sub-optimal loss minimisation during the training phase. As a result, spurious feature sets and model parameters may be chosen, which can lead to less accurate 9HPT reconstruction. Furthermore, while cross-validation itself is a popular and robust method to determine model performance and generalisability, independent test sets should ideally be used to obtain unbiased estimates of the relationship between the 9HPT and DaS features.
This study is longitudinal and the data captured can span weeks' worth of testing. A definite limitation is that the temporal aspect of this data is not fully utilised. Instead, subjects' features are summarised as the median and standard deviation across all their available data. While the standard deviation is a coarse measure of subject variance across the study, more specific time-series modelling of the DaS Test may reveal additional detailed insights into the progression and characteristics of PwMS. For example, previous work by Prince et al [50] has shown insights into the longitudinal UE behaviour of patients with Parkinson's disease using a smartphone-based tapping assessment.
Overall, it must be considered that NCT02952911 was a feasibility study with relatively few subjects. While this study helps establish a methodological foundation to construct models that can identify patterns of PwMS UE impairment, further studies, especially with a more heterogeneous and diverse set of subjects, and subsequent analysis will be needed to fully probe the clinical validity of remote smartphone assessments in PwMS.

Conclusion
This study illustrates that UE function can be assessed in remote settings using smartphone technology. The analysis from the Draw a Shape (DaS) test, a smartphone-based UE function test in which subjects trace specific shapes, expands on the feature space developed by similar studies investigating UE function in other disease areas [27-29,51] and contextualises how new and existing features can be used to characterise UE impairment in PwMS. Multivariate modelling of these features was shown to reliably predict 9HPT times.
While perfect reconstruction of the 9HPT was not possible due to the sparsity of the dataset and the inherent limitations of the 9HPT itself, DaS features may contain a greater wealth of information that extends beyond discrete 9HPT scores. Key advantages of digital tests like the DaS test are that they can be administered at high frequency, longitudinally and remotely in free-living environments. More frequent and ecologically valid outcome measures of UE impairment are needed to advance progressive MS research and help make clinical trials more efficient by improving power through sensitive and responsive endpoints. In this respect, the wealth of intra-task UE functional information contained in the DaS feature space, collected at higher and potentially daily frequency, might be better suited to capturing subtle clinical changes seen in relation to the progressive course of MS than the 9HPT, which is typically administered only every 3 to 6 months. This study, together with ongoing further work, therefore establishes the foundation of how digital sensor-based assessments may enable an out-of-clinic objective augmentation of traditional rater-administered assessments of UE impairment in MS and other neurological disorders.