Quantifying Short-Term Dynamics of Parkinson’s Disease Using Self-Reported Symptom Data From an Internet Social Network

Background: Parkinson’s disease (PD) is an incurable neurological disease with approximately 0.3% prevalence. The hallmark symptom is gradual movement deterioration. Current scientific consensus about disease progression holds that symptoms will worsen smoothly over time unless treated. Accurate information about symptom dynamics is of critical importance to patients, caregivers, and the scientific community for the design of new treatments, clinical decision making, and individual disease management. Long-term studies characterize the typical time course of the disease as an early linear progression gradually reaching a plateau in later stages. However, symptom dynamics over durations of days to weeks remain unquantified. Currently, there is a scarcity of objective clinical information about symptom dynamics at intervals shorter than 3 months stretching over several years, but Internet-based patient self-report platforms may change this.
Objective: To assess the clinical value of online self-reported PD symptom data recorded by users of the health-focused Internet social research platform PatientsLikeMe (PLM), in which patients quantify their symptoms on a regular basis on a subset of the Unified Parkinson’s Disease Rating Scale (UPDRS). By analyzing these data, we aim for a scientific window on the nature of symptom dynamics for assessment intervals shorter than 3 months over durations of several years.
Methods: Online self-reported data were validated against the gold standard Parkinson’s Disease Data and Organizing Center (PD-DOC) database, which contains clinical symptom data at intervals greater than 3 months. The data were compared visually using quantile-quantile plots and numerically using the Kolmogorov-Smirnov test. Using a simple piecewise linear trend estimation algorithm, the PLM data were smoothed to separate random fluctuations from continuous symptom dynamics. Subtracting the trends from the original data revealed random fluctuations in symptom severity. The average magnitude of fluctuations versus time since diagnosis was modeled using a gamma generalized linear model.
Results: Distributions of ages at diagnosis and UPDRS in the PLM and PD-DOC databases were broadly consistent. The PLM patients were systematically younger than the PD-DOC patients and showed increased symptom severity in the PD off state. The average fluctuation in symptoms (UPDRS Parts I and II) was 2.6 points at the time of diagnosis, rising to 5.9 points 16 years after diagnosis. These fluctuations exceed the estimated minimal and moderate clinically important differences, respectively. Not all patients conformed to the current clinical picture of gradual, smooth changes: many patients had regimes where symptom severity varied in an unpredictable manner, or underwent large rapid changes in an otherwise more stable progression.
Conclusions: This information about short-term PD symptom dynamics contributes new scientific understanding about the disease progression, which is currently very costly to obtain without self-administered Internet-based reporting. This understanding should have implications for the optimization of clinical trials into new treatments and for the choice of treatment decision timescales.

The following notation is used: the time since diagnosis for each patient is $t_i$, $i = 1, 2, \ldots, N$, where $N$ is the number of observations for each patient, and the combined total Part I and Part II UPDRS value is $y(t_i)$. The regression smoothing is achieved by minimizing the following functional with respect to the estimated trend $u(t_i)$:
$$E = \frac{1}{2}\,\lVert y - u \rVert_2^2 + \lambda\,\lVert D u \rVert_1 \qquad (1)$$
Here, the notation $\lVert \cdot \rVert_p$ is the $\ell_p$-norm. The parameter $\lambda$ is the regularization constant. For each value of $\lambda$, the output can be shown to consist of a series of straight lines joined together at their ends, i.e., it is a piecewise linear spline [1]. When $\lambda = 0$, the first term in Eq. (1) dominates, so the output $u(t_i)$ is the same as the input $y(t_i)$. As $\lambda$ increases, the output $u(t_i)$ becomes progressively smoother. It can be shown that there is a maximum useful value of the regularization constant, $\lambda_{\max}$: if the regularization constant is equal to or larger than this, the output consists of a single, least-squares straight line fit going through the data $y(t_i)$.
The matrix $D$ is a second derivative matrix that takes into account the nonuniform time spacing of the UPDRS data points. It is a tridiagonal matrix encoding a second-order accurate finite difference approximation [2], with entries determined by the local temporal differences $h_i = t_i - t_{i-1}$. After minimization of Eq. (1) to obtain the trend $u(t_i)$, the residuals are $r(t_i) = y(t_i) - u(t_i)$. Because Eq. (1) is in the form of a quadratic program, it is a convex optimization problem for which a unique, globally optimal solution is guaranteed to exist. Special optimization algorithms have been developed for such functionals; here, we use an efficient version of the primal-dual interior-point algorithm [1].
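The smoothing step can be illustrated with a minimal sketch, under stated assumptions. The second derivative matrix below uses the standard three-point stencil for nonuniform grids, which is an assumption about the form of the matrix rather than a reproduction of it. The solver is a plain ADMM iteration, swapped in for the primal-dual interior-point algorithm purely to keep the example short. The function names and the parameters `rho` and `n_iter` are hypothetical.

```python
import numpy as np

def second_difference_matrix(t):
    """(N-2) x N second derivative matrix on a nonuniform time grid.

    Assumption: standard three-point stencil built from the local
    time differences h_i = t_i - t_{i-1}.
    """
    t = np.asarray(t, dtype=float)
    n = len(t)
    D = np.zeros((n - 2, n))
    for k in range(1, n - 1):
        h0 = t[k] - t[k - 1]      # spacing to the previous observation
        h1 = t[k + 1] - t[k]      # spacing to the next observation
        D[k - 1, k - 1] = 2.0 / (h0 * (h0 + h1))
        D[k - 1, k] = -2.0 / (h0 * h1)
        D[k - 1, k + 1] = 2.0 / (h1 * (h0 + h1))
    return D

def soft_threshold(x, kappa):
    """Elementwise shrinkage toward zero by kappa."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def l1_trend_filter(t, y, lam, rho=1.0, n_iter=5000):
    """Minimize 0.5*||y - u||_2^2 + lam*||D u||_1 with ADMM.

    ADMM is a stand-in here, not the interior-point method of the text.
    """
    y = np.asarray(y, dtype=float)
    D = second_difference_matrix(t)
    # Small dense problem: precompute the inverse used in every u-update.
    A_inv = np.linalg.inv(np.eye(len(y)) + rho * D.T @ D)
    z = np.zeros(D.shape[0])      # auxiliary copy of D u
    w = np.zeros(D.shape[0])      # scaled dual variable
    for _ in range(n_iter):
        u = A_inv @ (y + rho * D.T @ (z - w))
        Du = D @ u
        z = soft_threshold(Du + w, lam / rho)
        w = w + Du - z
    return u
```

With a very large `lam` the output approaches the least-squares straight line through the data, matching the limiting behavior described above. Convergence speed depends on `rho`, so both `rho` and `n_iter` may need tuning for real data.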
The regularization constant $\lambda$ determines the smoothness of the output $u(t_i)$, and so must be chosen appropriately. In this study we use cross-validation to choose this parameter [3]. This involves a uniformly random partition of the data for each patient into a training set (80% of the data) and a testing set (the remaining 20%). Eq. (1) is optimized on the training set, then the mean absolute test error is calculated:
$$E(\lambda) = \frac{1}{M} \sum_{i \in T} \left| y(t_i) - u(t_i) \right|$$
where $T$ is the set of indexes, and $M$ is the number of data points, in the test partition. The test set values $u(t_i)$ are obtained by linear interpolation/extrapolation from the smooth output points $u(t_j)$ closest in time to the test time $t_i$; this interpolation is justified by the piecewise linear nature of the smoothing operation. The optimal $\lambda$ is the value that minimizes the test error $E(\lambda)$ (note that this value generally differs from patient to patient). To find this optimal value, we sweep across a wide range of values of $\lambda$ and calculate the curve $E(\lambda)$. To reduce the effects of random partition sampling variation in this curve, we smooth the curve using kernel regression with a Gaussian kernel of bandwidth 100. This makes it straightforward to find the optimal degree of smoothing for each patient. In this study, we sample 2500 values of the regularization constant over the range [0, 1000].
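The cross-validation procedure can be sketched as follows, under several assumptions. A fresh random 80/20 partition is drawn for every value of the regularization constant. The kernel bandwidth is taken to be in the same units as the regularization constant. The smoother is passed in as a callable `smoother(t, y, lam)`. All function names are hypothetical, and `placeholder_smoother` is a trivial stand-in used only so the example runs on its own; it is not the piecewise linear smoother.

```python
import numpy as np

def interp_extrap(t_new, t, u):
    """Linear interpolation inside the range of t, linear extrapolation outside."""
    idx = np.clip(np.searchsorted(t, t_new), 1, len(t) - 1)
    t0, t1 = t[idx - 1], t[idx]
    return u[idx - 1] + (u[idx] - u[idx - 1]) * (t_new - t0) / (t1 - t0)

def cv_error(t, y, lam, smoother, rng, test_frac=0.2):
    """Mean absolute test error for one random train/test partition."""
    n = len(t)
    n_test = max(1, int(round(test_frac * n)))
    test_idx = np.sort(rng.choice(n, size=n_test, replace=False))
    train = np.ones(n, dtype=bool)
    train[test_idx] = False
    u_train = smoother(t[train], y[train], lam)          # fit on training data only
    u_test = interp_extrap(t[test_idx], t[train], u_train)
    return np.mean(np.abs(y[test_idx] - u_test))

def gaussian_kernel_smooth(x, e, bandwidth):
    """Kernel regression (Nadaraya-Watson) of the curve e(x) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ e) / w.sum(axis=1)

def choose_lambda(t, y, smoother, lams, bandwidth, seed=0):
    """Sweep lams, smooth the test error curve, and return its minimizer."""
    rng = np.random.default_rng(seed)
    errs = np.array([cv_error(t, y, lam, smoother, rng) for lam in lams])
    curve = gaussian_kernel_smooth(lams, errs, bandwidth)
    return lams[np.argmin(curve)], curve

def placeholder_smoother(t, y, lam):
    """Stand-in only (NOT the L1 method): shrink y toward its least-squares line."""
    line = np.polyval(np.polyfit(t, y, 1), t)
    return (y + lam * line) / (1.0 + lam)
```

To mirror the settings in the text, one would call `choose_lambda` with `lams = np.linspace(0, 1000, 2500)` and `bandwidth=100`, passing the actual trend estimation routine in place of the placeholder.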

Gamma Generalized Linear Modeling of Residuals
In modeling the residuals of the piecewise linear regression smoothing described in section A.1 above, it is important to take into consideration the distribution of these residuals. It can be shown that minimizing Eq. (1) leads to residuals that are increasingly Laplacian, that is, they become Laplace distributed as $\lambda$ increases [4,5]. From this, it follows that the absolute residuals $|r(t_i)|$ become increasingly exponentially distributed. To model how the magnitude of the residuals varies with time since diagnosis, least-squares linear regression is the simplest approach, but this method assumes that the residuals are Gaussian distributed, which contradicts what we know about the residuals.
However, we can perform linear regression using generalized linear modeling (GLM), which allows the residuals to come from the more general class of exponential family distributions [6].
This class includes the exponential and gamma distributions as special cases. It also allows a monotonic nonlinearity (known as the link function) as part of the regression. In this study, we use the natural logarithm link function, a common choice for the gamma distribution, which includes the exponential distribution as a special case:
$$\ln \mu = \beta_0 + \beta_1 t$$
where $\mu$ refers to the mean of the absolute residuals $|r(t_i)|$ and $t$ is the time since diagnosis. Finding the values of the regression coefficients $\beta_0, \beta_1$ that maximize the likelihood of the data given the coefficients is a convex optimization problem solvable by iteratively reweighted least squares [6].
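The fitting step can be illustrated with a minimal sketch of iteratively reweighted least squares for a gamma model with a log link. The function name, starting values, and fixed iteration count are assumptions made for illustration. For this particular pairing the gamma variance function (proportional to the squared mean) cancels against the derivative of the log link, so the working weights are all equal and each iteration reduces to an ordinary least-squares fit to the working response.

```python
import numpy as np

def fit_gamma_glm_log_link(t, a, n_iter=50):
    """Fit ln(mu) = b0 + b1*t to positive responses a by IRLS.

    t: times since diagnosis; a: absolute residuals (assumed strictly positive).
    Returns the coefficient vector [b0, b1].
    """
    t = np.asarray(t, dtype=float)
    a = np.asarray(a, dtype=float)
    X = np.column_stack([np.ones_like(t), t])     # intercept and slope columns
    beta = np.array([np.log(a.mean()), 0.0])      # start from a constant mean
    for _ in range(n_iter):
        eta = X @ beta                            # linear predictor
        mu = np.exp(eta)                          # inverse log link
        z = eta + (a - mu) / mu                   # working response
        # Unit weights: gamma variance mu^2 cancels (d eta / d mu)^2 = 1/mu^2.
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta
```

The fitted mean magnitude at time `t` is then `np.exp(beta[0] + beta[1] * t)`. A production analysis would add a convergence check and would more likely rely on an established statistics package.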