Well-being Forecasting using a Parametric Transfer-Learning method based on the Fisher Divergence and Hamiltonian Monte Carlo

.


Introduction
Clinical depression is a psychiatric disorder affecting mood in a pervasive manner, leading to a reduced quality of life and daily functioning for the patient [1].It is characterised by sadness, loss of interest or pleasure, feelings of guilt or low self-esteem, disturbed sleep or appetite, feelings of tiredness, poor concentration and even medically unexplained symptoms.It often co-exists with symptoms characteristic of anxiety disorders [2], and there is also evidence of a strong bidirectional association with physical illness [3].In other words, having a physical illness is a strong risk factor for developing clinical depression, while depression is also a risk factor for developing or exacerbating existing physical illness [4,5], and is linked to early death [6].
Smartphones and wearable sensors are increasingly used for prediction and management in healthcare contexts, including depression.However, few systems are based on patient-specific models, despite the fact that these are known to perform better compared to general models.In addition, many of the proposed systems and their purported benefits are often not properly backed up by evidence obtained from appropriate scientific research or clinical studies [7].Research on constructing models to predict future mood-states in patients with depression has shown that, apart from the expected variables that describe patients' historical mood, the pertinent variables that determine model performance tend to be diverse and patient-specific [8].
Personalisation requires the learning and tuning of a model to an individual user.This is a non-trivial task for two reasons.Firstly, in an ideal situation, predictions should be provided from day one; in other words, the moment a patient starts using the system, the model will be expected to start making meaningful predictions for that individual, despite the fact that no, or very limited patient-specific data will be available initially.While it is possible to train/update a model incrementally on the data available at a given time, it is difficult to give reliable predictions when little data are available.Secondly, such datasets are also likely to be 'sparse' (in the broader sense), both because of the nature of the data acquired, and because users may have the option to refuse, or postpone their interaction with the system (e.g. when prompted with a questionnaire, or asked to wear a sensor), thereby exacerbating the problem.
Challenges faced when training a model on such sporadic, limited data with traditional machine learning algorithms include overfitting, difficulties in handling outliers, and inappropriate assumptions of equivalence between training and test data distributions -a concept known as dataset shift [9].As a result, models trained on such 'sparse' datasets run the risk of being meaningless or unreliable in practice.In order to account for the uncertainty surrounding data sparsity, one could allow models to be of sufficient generality, but then one would run the risk of creating models that are of no practical value.Similarly, for any given model, sparsity complicates assessing the effective fit to the data, since there are not enough sample points to help meaningfully differentiate between more specific and more generalised models.
The above conundrum was one that was also faced in the NEVERMIND project [10].NEVERMIND is an EU-funded project, which includes a randomisedcontrolled trial, and is tasked with helping individuals at risk of developing depressive symptoms following a primary clinical event such as cancer, myocardial infarction, amputation, and kidney failure [11].It does so by providing effective, smartphone/wearable-based self-management tools, which in turn complement and guide the patients' clinical support plan.In this setting, both subjective as well as multimodal biomedical data are collected -including a collection of physiological signals, body movement, and speech -through the use of a lightweight sensorised T-shirt and patients' smartphones.For these tools to be effective, predictions need to be produced on a daily basis.However, particularly in the early stages of a patient's enrolment, the obtained data exhibit all the problematic behaviours mentioned earlier, making such personalised daily predictions a very challenging task.
Our approach to overcoming these challenges has been the following.In the early stages of a patient's enrolment, when data are limited and sporadic, rather than risk overfitting with an inappropriately personalised approach, our model relies on a more generalised prediction, borne from prior knowledge; however, as more personal data become available, the model slowly starts transitioning to more personalised predictions, in an organic, data-driven manner.In this way, patients can still benefit from useful, general information early on.The prior knowledge required for the initial generalised predictions was obtained from the other participants in the study, and used to inform the prediction of a specific patient through an appropriate transfer learning mechanism.
Transfer Learning (TL) methods are a recent class of techniques, which enable one to work around the strict requirement that the test and training data should necessarily conform to the same probability distribution [12].These methods can use data from unrelated or partially related tasks [13], and allow the domains, tasks, and distributions used in training and testing to be different up to a certain point, without having to build a completely new model from scratch [14].They rely on the basic assumption that the source and target domains, while not necessarily of the same underlying distribution, may still be related in other ways, i.e. via an explicit or implicit relationship between the feature space of the two domains.The goal of TL is to improve learning in the target domain by leveraging previously acquired knowledge gained in the source domain.These methodologies have also been employed to improve performance in the presence of scarce data [15], and techniques that take into account population heterogeneity have been proposed in domains involving sequential data modelling [16]; for a more general overview of the history, taxonomy, and state of the art in transfer learning methods for classification, regression, and clustering problems, see [14,[17][18][19][20][21].
In this paper, we propose a new parametric TL algorithm, addressing the challenge of creating userspecific models and making predictions of selfreported well-being scores, when training is performed incrementally on limited, sporadic data that become slowly available over time.The proposed method makes use of the Fisher divergence to make predictions about The main contribution of this paper is the following: we propose a personalised Bayesian inference method making use of TL in the context of Hamiltonian Monte Carlo sampling, which allows a population prior to be directly represented in the sampling process through the use of the Fisher divergence.We demonstrate the effectiveness of the above technique in the context of personalised prediction of self-reported well-being scores, using data from the NEVERMIND project [10,11].Our method allows for a seamless transition from generalised to highly personalised models, as data become gradually available.
This paper is an extended version of the work presented in [22].We extend our previous work in three important ways: first, by providing a more comprehensive literature review, particularly with respect to the theoretical background underpinning our approach; second, we have included information on diagnostics, detailing how we assessed the quality and convergence of the Markov chains; and third, we present a more rigorous mathematical derivation of the models involved, which was beyond the scope of [22].
The rest of the paper is organised as follows: Sec. 2 provides a brief overview of relevant work in the field of well-being forecasting, and the motivation for our method.Sec. 3 provides a brief description of the predictive model used in the NEVERMIND project and describes the proposed TL pipeline.Sec. 4 describes the NEVERMIND dataset in some detail, and the experimental protocol used in this work.Sec. 5 shows a rigorous evaluation of the model and an analysis of the factors affecting predictive performance.Finally, Sec. 6, lays out our conclusions and provides some indications of promising future research directions.

Background and motivation
There is evidence to suggest that 'well-being'i.e. the subjective presence, absence, fluctuation, intensity, and nature of mood-states as perceived and reported by the individual -is at least as sensitive a predictor for the risk of adverse outcomes as formal clinical assessment of depression based on established diagnostic criteria [23][24][25][26][27][28].There has therefore been a significant research interest over the past decade in monitoring and predicting mood states through non-invasive means.While some of these approaches involve laboratory-based techniques such as electroencephalography (e.g.[29]), or elaborate, bespoke equipment (e.g. the 'Smart Mirrors' project [30]), smartphones and wearable devices due to their ubiquity and convenience have now become the predominant research focus for the non-invasive collection of information and signals for this purpose (e.g.[31][32][33][34][35][36][37]).
However, the majority of studies in this field focus on mood detection and classification, and only few focus on the more challenging problem of longterm forecasting.In the latter case, studies commonly employ neural network methods for this purpose.E.g.Spathis et al. [36], used smartphones to collect a sequence of self-reported mood states over three weeks, by asking users to select a point from a two-dimensional grid corresponding to 'valence' and 'arousal' dimensions.They then trained a multi-task encoder-decoder recurrent neural network to produce a sequence of valence/arousal forecasts (expressed as points on the same grid) for up to 7 days.Their model performed well, though the authors noted performance was less reliable in participants with high mood variability.Similarly, Yu et al. [37] used data from the SNAPSHOT study [38], which involved detailed data from 251 college students, including data from surveys, mobile phones, wearables and weather information.These were used to define mood, health, and stress scores, on which they compared a series of multi-task learning approaches including Regularized Linear Models and several varieties of Neural Networks, in next-day, and up to 7-day forecast scenarios.Their findings showed good performance for next-day scenarios; however, the authors noted that even after selecting the best-performing algorithm, there was a significant reduction in accuracy when it came to 7-day forecasts.
One limitation of neural-network based approaches like the above, is that any transfer learning component learnt is typically applied as an initial point estimate of the network's parameters.This is typically then either used as an initialization point for subsequent finetuning, or the topmost layers are 'frozen', meaning that they are excluded from subsequent training [39].While this approach can achieve a significant initial speed-up in terms of learning, it is less robust, in that it does not allow for any uncertainty present in the transfer domain to be propagated to the prediction.Expressing the transfer learning component as a prior probability in the context of Bayesian inference methods [40] allows us to make use of this information.However, this is not necessarily a straightforward thing to do: when dealing with complicated distributions defined in highdimensional spaces, obtaining posterior parameter estimates expressed in closed form is typically not feasible, as the integrals involved in the inference process tend to be computationally intractable.
Markov Chain Monte Carlo (MCMC) techniques are an elegant way to work around this problem.MCMC refers to a very general and powerful framework that allows sampling from a large class of distributions and which scales well with the dimensionality of the sample space [41].It can be used to empirically obtain information about complicated distributions, and is particularly useful in estimating posterior distributions in Bayesian inference, even when complicated distributions in high-dimensional spaces are involved.
Therefore, the main motivation for this work, is the use of a suitable prior probability transferlearning component within a Bayesian inference framework, for the long-term prediction of mood states from sporadic/limited data.This prior probability is obtained empirically through the use of MCMC, where the target distribution is obtained through an optimisation process involving the minimization of the Fisher Divergence over all participants in the transfer domain.We explain the steps involved in more detail below.

Linear Dynamical System Model
In NEVERMIND, we propose to model participants and predict their self-reported well-being scores using a Linear Dynamical System (LDS) model [42] and a Bayesian TL approach relying on MCMC sampling [43].The method assumes that the well-being of the user can be represented by a state vector, and that its dynamics can be captured by an LDS of the following form: where x(t) ∈ R n x is the latent state for the model, reflecting the user's underlying state of well-being; y(t) ∈ R n y is a vector of observations corresponding to the measurements collected from the user (including biomedical signal features and self-reported well-being scores); u(t) ∈ R n u is the input vector (representing external interventions, or influences from the external environment, e.g.weather or day of the week); and µ y is the baseline value of the observation vector.Finally, ε x and ε y represent noise (i.e.uncertainty) over the state and observation vectors, and are assumed to be distributed as ε x (t) ∼ N (0, S x ) and ε y (t) ∼ N (0, S y ) respectively.The parameters of this model will be collectively referred to as θ.
The above LDS model's latent state at any time t can be extended to describe an auto-regression of arbitrary order, simply by extending the state-vector to include its most recent values, e.g. by writing x(t) = [ξ(t), ξ(t − 1), ξ(t − 2)] T , where ξ(t) is the original, 'base' latent state, and x(t) is the extended one.

Hamiltonian Monte Carlo
Our approach requires that we sample from the posterior distribution of the parameters, i.e. our beliefs about the parameters after having seen the data for a given participant.This can be achieved using an MCMC sampler, which constructs a Markov chain of samples (i.e.parameter sets), having as their equilibrium distribution the target posterior distribution.
In our previous work [43], we used the affine invariant ensemble sampler for MCMC (emcee) proposed in [44].emcee was chosen as an easy to use, well tested, pure Python module, where the underlying algorithm also has an affine invariance property that allows it to perform equally well under all linear transformations, and therefore be insensitive to covariances among parameters.It is also an ensemble method which relies on multiple walkers (the members of the ensemble) sampling in parallel.The main idea behind emcee's proposal strategy is that, for any given walker in the ensemble, their next position is proposed by randomly selecting another walker's location and performing a 'stretch-move' step towards it, i.e. performing a 'jump' in the proposal space in the direction of the randomly selected walker's location, such that the length of the jump is determined by the relative likelihood of the two proposal values.However, there are cases where the affine-invariant ensemble sampler may not perform well or shows unusual and undesirable properties.In particular, when the target density is a multimodal landscape, the walkers can become stuck in different modes [45] or in lower dimensional subspaces.Furthermore, in high dimensions, the chains can show insufficient convergence and slow mixing, or appear to have converged when they have not [46].
For these reasons, in this work, we decided to utilise a Hamiltonian Monte Carlo (HMC) sampler [47], written in the Stan language [48], and more specifically its adaptive extension, the No-U-Turn Sampler (NUTS) [49].
The HMC approach exploits Hamiltonian dynamics in order to propose future states in the Markov chain.Effectively, the system simulates the movement of particles over a surface, such that the overall energy of the system is conserved, and can be expressed as the sum of two energy components -a 'kinetic energy', and a 'potential energy' component.The kinetic energy component is generated via a pre-determined probability distribution, and thus plays the part of the proposal component in MCMC, whereas the potential energy component maps directly to the underlying probability distribution we are trying to sample from.Standard HMC algorithms generally depend on, and are sensitive to an appropriate choice of hyperparameters, namely the step-size and number of steps to use during exploration of the domain; the NUTS variant modifies the proposal component of the base algorithm slightly, in that it evolves the initial system both forwards and backwards in time to form a balanced binary tree.The system then stops automatically when the algorithm detects that the sampler has started retracing earlier steps (i.e.making a "U-turn"), thus eliminating the need to define a pre-determined number of steps; at the same time, the step-size parameter is adapted on the fly, completely eliminating the need to hand-tune HMC.
The HMC algorithm is advantageous, in that it organically makes use of gradient information, enabling it to move faster toward regions of high probability and explore the parameter space more efficiently compared to standard random walks.Consequently, with this sampler we obtain faster convergence in highdimensional target distributions, while the resulting Markov chain is less correlated.In addition, like in emcee, multiple chains can be allowed to run in parallel.Finally, the use of HMC allows for straightforward scaling up of models to even higher dimensionality and complexity, which may be required in future work.
According to Bayes' Theorem, given a vector of observations y, and a vector of parameters θ, the posterior probability p(θ | y) is related to the likelihood term p(y | θ) and the prior term p(θ) via ( Therefore, given a way to compute the product p(y|θ)p(θ), the HMC sampler allows one to generate K random vectors θ k , distributed according to p(θ|y).One can then use this fact to estimate a posterior expectation In our case, the likelihood term p(y|θ) in ( 2) is the marginal likelihood of the LDS model in Sec.3.1, marginalised over the latent state, i.e., This likelihood term can be readily obtained from the LDS model using a Kalman filter applied to the participant's data (see [42]).
Additionally, we specify a prior probability distribution p(θ) to inform and constrain our model.In this work, we use a simplified version of the LDS model described by (1), which ignores the influences from the external environment; in other words, the inputs u(t) are absent and, therefore, the matrix B is unused.Furthermore, the observation vector y(t) is limited to reflect only self-reported well-being scales.Given the above, we consider a unit-root, third-order autoregressive model, which can be represented by the LDS model in (1) with:

Bayesian Transfer Learning
When making a prediction, we want to take into consideration the information coming from both the individual patient, as well as more general information available from other patients acting as prior knowledge in a general sense.Formally, we want to obtain the posterior predictive distribution p( y | y, Y N ) for a given patient (without loss of generality we consider the one with index N +1), where y is the desired prediction, y represents the patient's existing observations, and Y N = {y 1 , . . ., y N } corresponds to the information coming from all other N participants' observations.In theory, this expression can be obtained by marginalising θ out as follows: Unfortunately, in practice this integral is generally intractable.Both our previous TL approach based on Bayesian Model Averaging (BMA) [43] and the one proposed in this paper, are essentially machinelearning techniques for estimating the intractable integral in (4); however, they do so in a substantially different way.The two approaches are contrasted and explained in more detail in the sections below.

Transfer Learning based on BMA
To explain our motivation for developing a new approach, we first briefly review our previous approach based on BMA (see [43] for more details).Assuming conditional independence with respect to θ across data coming from different participants, we expanded (4) as: with each of the K samples in {θ k } K k=1 distributed according to p(θ | Y N ), which represents our beliefs about the parameters for patient N +1 after observing the data from the other N participants, but prior to observing any data for the new patient.Effectively, p(θ | Y N ) embodies the knowledge transfer from the other N participants to the new one.In our BMAbased approach, p(θ | Y N ) was approximated by running the MCMC sampler on each of the N participants and then pooling together the resulting samples.Under this scheme, assuming each run creates M samples (i.e.M vectors of model parameters), we obtain the K vectors used in (5) via uniform sampling from the mixed sample pool of N •M vectors of model parameters.This corresponds to sampling from the mixture distribution obtained by combining the posterior distributions of the N previous participants.The probabilities p( y | y, θ k ) and p(y | θ k ) are then obtained by using the Kalman filter as described earlier.The fractional term in the summation shown in ( 5) represents the probability that, out of the K models considered, the given model θ k generated the observed data y.Therefore, using (5) to estimate (4) corresponds to performing BMA [50] over the K candidate models.
As reported in [43], this method showed an improvement over previous work, which relied on Maximum Likelihood parameter estimation using a standard Expectation Maximisation approach [42].However, such a BMA approach does not exploit the full potential of the MCMC sampler with respect to estimating the integral in (4).In [43], the BMA approach makes use of MCMC, simply in order to explore and generate samples from p(θ | Y N ); these samples are then weighted using p(y | θ k ) as shown in (5), to obtain the probability p(θ | y, Y N ) required to approximate (4).To make fuller use of the potential of the MCMC sampler for the estimation of the integral in (4), we formulated a new algorithm, which allows MCMC to explore and sample from p(θ | y, Y N ) directly.This improved algorithm is the main contribution of this paper and is described in the next section.

Transfer Learning based on the Fisher divergence
The approach delineated in this paper is a parametric TL method based on the Fisher divergence, which can be used to fit a sample of data points to given probabilistic models defined up to a normalisation constant [51][52][53].In this approach, the HMC sampler uses the data from each participant directly, to create chains of parameter vectors reflecting the posterior probability distribution of their personalised models; however, in doing so, the MCMC process itself makes use of a TL component internally, in the sense that the generated chains are obtained based on a modified prior, where this prior accounts for the knowledge available from all other participants, in a manner reminiscent of empirical Bayes approaches.Stated formally, we estimate (4) as where the samples θ k ∼ p(θ | y, Y N ) are obtained via MCMC using the likelihood p(y|θ) from (3) and a prior p(θ | Y N ) obtained as a mixture of the posterior distributions of the N previous participants: that we approximate in parametric form as: where q β (θ) is a function governed by a vector of hyperparameters β.Note that while p(θ | Y N ) in ( 7) is, from a theoretical point of view, the same as in the BMA approach, the fact that we now consider an approximation of it in parametric form allows us to explore it fully using the HMC sampler, rather than being constrained to using only the N •M samples previously obtained from the other participants.
In this work, the specific q β (θ) used is an exponentiated quadratic w.r.t. a non-linear mapping of θ: where ) and g is a vector function such that its i-th element is log(θ i ) if θ i is a parameter representing a variance (e.g.s x ) and θ i otherwise.The quadratic parameter Q β is chosen in the set of positive semi-definite matrices so that q β (θ) is bounded and q β (θ) p(θ) is a proper prior.The hyperparameters β leading to the best approximation (8) can then be found by minimising the Fisher divergence from q β (θ) p(θ) to The Fisher divergence.The Fisher divergence from a distribution q(x) to a distribution p(x), denoted D F (p q), is defined as: Much like its better known counterpart -the Kullback-Leibler divergence (also known as relative entropy) -the Fisher divergence can be understood as an asymmetric measure of distance between a target distribution p, and an approximating distribution q serving as a model for p.In the same manner that the Kullback-Leibler divergence is tightly linked to the concept of entropy (in that it corresponds to the entropy difference between p and q) the definition of the Fisher divergence is similarly tightly linked to the concept of the Fisher information, defined1 by J(p) = p(x) ∇ x log p(x) 2 dx.
A practical disadvantage of the Kullback-Leibler divergence is that, for the result to be meaningful, it requires that both the target and the approximating function be expressed as appropriately normalised probability density functions.However, when one only has unnormalised quantities to work with, the computation of an appropriate normalisation constant, whose proper evaluation requires integration over the entire domain of the function, tends to be intractable in the context of high-dimensional problems.By contrast, the Fisher divergence obviates the need for computing such a normalisation constant, since the fractional nature of the calculation with respect to both the target and the approximating function, means that a normalization constant would cancel out from either of those two terms anyway, and therefore lack of appropriate normalization does not affect the final result.This makes the Fisher divergence an advantageous measure of distance to use when dealing with high-dimensional, unnormalised probability density functions; this is indeed the case in our TL approach, since both our q β (θ) model, and any distribution represented by the output of an MCMC sampler, will necessarily represent unnormalised quantities.

Minimising the Fisher divergence of the mixture distribution.
As shown in appendix A, for a mixture distribution, the Fisher divergence of the mixture is simply the weighted sum of the individual divergences from q β (θ) p(θ) to the mixture components.From a collection of samples distributed according to p(θ | y n ), obtained by running MCMC separately for each one of the N "prior" patients, we can derive the Fisher divergence for each mixture component as follows: with θ k ∼ p(θ|y n ).Therefore, it becomes possible to obtain an optimal value β * by solving the following constrained optimisation problem: where S + is the set of symmetric positive semi-definite matrices.The problem in ( 12) is an instance of a cone quadratic program, which we solve efficiently using the cvxopt library [54].
The prior q β * (θ) p(θ) so obtained is then used alongside the likelihood provided by the LDS model in the context of MCMC, to produce the K samples required for (6), thus giving rise to our final prediction as the average of K individual prediction components.
The mean and variance of the overall prediction can be obtained at each future timepoint by pooling the means and variances of the individual prediction components (i.e.µ k and σ 2 k respectively) as follows:

Dataset
For the purposes of this work, we used data collected over 112 participants in the context of the NEVERMIND randomised controlled trial [11].This dataset consists of participants aged 18 or older, who have received a diagnosis of a severe somatic disease, including myocardial infarction, breast cancer, prostate cancer, kidney failure and lower limb amputation.The data were collected in Pisa, Turin and Lisbon, with appropriate informed consent obtained from the patients in writing, and experiments approved by local ethical committees.
The NEVERMIND dataset consists of subjective data in the form of questionnaires, as well as other multimodal data, collected over time from individual subjects via a smartphone and a specialised lightweight sensorised T-shirt.The full dataset includes a collection of physiological signals, accelerometer data, and voice recordings; however, for the purposes of this work, as mentioned earlier, we only consider the three selfreported well-being scales that the user is prompted to provide on a daily basis.The resulting daily scores from each scale are fed into the LDS model as the observation vector y(t).Each scale's numerical input is obtained from the participant via a sliding scale, which takes values from 1.0 to 6.0 (at 0.2 increments), where lower values represent better outcomes. 2The three scales correspond to the following questions: -"How are you feeling today?" -the Feel score: a measure of the participant's subjective assessment of their morning / waking mood; -"How was your sleep?" -the Sleep score: a measure of the participant's subjective assessment of sleep quality for the night before; and -"How was your day?" -the Day score: a measure of the participant's subjective assessment of the quality of (potentially stressful) events over the course of the day.
Each question is prompted daily, and the participants may refuse to provide an answer, contributing to the sporadic nature of the dataset; human review may be triggered if no significant interaction has occurred for a certain time interval, according to the clinical protocol used in the randomised controlled trial.Participants for whom there were no available data (e.g.patients who had already been enrolled in NEVERMIND, but had not yet started using the system), or whose total data recording-length was less than two weeks, were excluded from the analysis carried out here.

Experimental protocol
This section describes the specific values and implementation of the model described above, as used in the experiments, as well as the approach used to validate the method.
We have chosen to use weakly informative priors on model parameters, expressing vague or general information; this has the effect that model selection is primarily driven by the likelihood function, such that in the presence of adequate data, the specific choice of prior has a minimal effect on the final inference, relative to the data.
Specifically, with regard to the parameters described in Sec.3.2, we placed a diffuse Gaussian prior over the elements a 1 , a 2 ∼ N (0, 0.5 2 ) of the transition matrix A, and over the coefficients c i ∼ N (0, 1) of the observation matrix C. A diffuse Gaussian prior was also placed over the initial state vector x(0) where ξ 1 ∼ N (1, 2); this distribution was centred away from zero to break the symmetry of the problem and reduce the occurrence of multiple equivalent modes.We further placed an inverse gamma prior over the non-zero diagonal element of the state-noise matrix S x , as s x ∼ Γ -1 (α, β) with shape parameter α=2 and scale parameter β ≈ 0.06.Small values of α lead to wide distributions and in particular α=2 corresponds to a prior with infinite variance, thus allowing the inference mechanism to explore values of s x as large as needed.The mode of the inverse gamma distribution is given by s * x = β(α + 1) -1 .We also set the baseline value of the observation vector to the fixed value µ y =3, and the parameter s y to the fixed value of 0.04 (i.e.0.2 2 ); the latter was chosen empirically, by estimating the variance of the error made by the subjects when they provide answers to the questionnaires, given the fact that they use a slider in order to do so, and that this results in the scales being quantised at a resolution of 0.2.
The HMC sampler was set to compute Markov chains using 8 walkers working in parallel, such that each sample corresponds to a vector θ consisting of the scalar parameters described in Sec.3.2 (i.e. a 1 , s x , etc).Each walker was set to create 275 samples, where the first 150 obtained samples were discarded as 'burnin', leaving 125 representative samples per chain.The individual chains generated from each walker were then combined into a single larger chain, having a total of 1000 samples.
We further monitored the convergence of the chains, by computing the potential scale reduction factor on split chains, typically referred to more concisely as the 'split-R' measure [40,Sec. 11.4].The split-R provides a measure of convergence and mixing quality of the chains in an MCMC simulation, which can be used to gain insight into the rate and degree of convergence, as well as in terms of detecting non-stationarity, allowing for better evaluation of the underlying algorithms.We also obtained the log-posterior density (denoted by the 'lp__' variable in Stan) and summary-statistics for each model parameter, including means, standard deviations (SD) and various quantiles computed from the draws.The summary also reports the Monte Carlo Standard Errors (SE mean ), and the effective sample sizes (n eff ).The Monte Carlo Standard Error is the uncertainty about a statistic in the sample due to sampling error; the smaller the standard error, the closer the mean estimate of the posterior draws of the parameter is expected to be to the true value.The effective sample size, n eff , measures the amount by which autocorrelation in samples increases uncertainty (standard errors) relative to an independent sample; if the samples are independent, the effective sample-size equals the actual sample-size.It is particularly important in terms of gauging the reliability of the split-R measure, as a small n eff can lead to unreliable values for R. Table S1 (see Supplementary Material) presents an indicative example of a summary for the parameters of interest, as estimated from a collection of samples corresponding to one of the participants in the study.The results show that all values for the split-R are approximately 1.0 (above 0.9 and below 1.1) and n eff is well above the minimum recommended value of 100 effective samples per chain [55], indicating that the chains had mixed well and the model had successfully converged.
The predictive performance of the current TL approach -namely the Fisher-divergence minimization approach (referred to as the M fd model henceforth)was evaluated in relation to a number of competing models.In the first instance, we compared this model against the BMA one used in [43] (model M bma ).To ensure a fairer comparison between M fd and M bma , the chains for M bma were created using HMC rather than emcee as previously done in [43].Additionally, we also compared M fd against a Maximum A Posteriori (MAP) model (M map ) and 4 'baseline' models: and d) An ordinary least squares regression model (M r ).The M map model was obtained by running Stan in 'optimization' mode instead of 'sampling' mode, which uses the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization algorithm under the hood [48]; this directly provides a single 'best' estimate for the parameter vector θ, corresponding to the MAP estimate of the posterior distribution, as computed by

Stan.
Given that new data arrive incrementally, it is necessary to rebuild each patient's chains on a daily basis in order to keep the models up-to-date, and ensure that all observations available for that participant are being used.It is important to note that, for any given time-interval, given the fact that the patients are allowed to refuse to answer some or all questions on any particular day, the number of observations present within that time-interval may well differ between patients.
We define the following notation.Let: -L tr be the length of the 'training period', i.e. the number of weeks used for training (regardless of the number of actual observations that happen to be contained within), chosen from the set {1, 2, ..., 10}, -L fc be the length of the 'forecasting period', i.e. the number of future datapoints (one per day) to be forecasted, chosen from the set {1, 3, 7}, µ t and σ 2 t be the mean and variance of the forecasted prediction at timepoint t, r t be the corresponding actual value of a wellbeing score observed at time t, (which may be missing if no answer was provided), used as a target value for validation.
In our experiments we evaluated the forecasting scenarios that result from all possible combinations of training and forecasting period-length pairs {L tr , L fc }; however, for brevity, we only show here a representative subset from these experiments, as explained further in Sec 5. Please note that in each case, the number of participants for which it is possible to obtain predictions, depends on the choice of L tr and L fc .Since our goal is to obtain predictions with an associated measure of uncertainty (i.e.how much we can trust the prediction itself), the quality of our predictive algorithms must be assessed using a measure of accuracy that takes both the mean prediction accuracy and the estimated uncertainty into account.
One such measure is the Log Likelihood (LL), which, for a predictive model M i , is given by where t corresponds to timepoints within the forecasting period for which an actual observation is available.The higher the LL measure, the better the probabilistic predictions are.Note that, while the output of M fd and M bma is not strictly speaking a Gaussian, for the purposes of obtaining an LL measure, we represent them as Gaussians of their respective mean and variance.
While we believe LL is the correct measure to account for both mean prediction accuracy and accuracy in the uncertainty around the prediction, we also calculate the actual forecasting error for the predictions of each individual, using the more traditional Root Mean Squared Error (RMSE) calculated as RMSE = 1 n t (r t − µ t ) 2 1  2 , where t here again corresponds to timepoints within the forecasting period, n is the number of targets present when missing values are excluded, and where r t − µ t def = 0 when r t is missing.Note that the RMSE ignores how accurate our estimates of the prediction variance are.
Furthermore, for any pair of competing models, we can calculate a 'winning percentage', as a measure of predictive superiority for one model over another.This is computed as: where games represents the total number of participants for whom it was possible to obtain predictions given a specific {L tr , L fc } pair, and wins corresponds to the subset of those participants, for whom the model in question performed better than its counterpart, with respect to a particular performance measure (i.e.either LL or RMSE).Furthermore, we used the exact Wilcoxon Signed-Rank test-statistic [56] to make pairwise comparisons for these methods, effectively investigating the extent to which the winning percentages represent a 9 EAI Endorsed Transactions on Bioengineering and Bioinformatics Online First genuine and statistically significant improvement, per pair of competing models.

Model output
The output of the model at each timepoint is a 3-dimensional probability distribution expressing a probabilistic prediction for the values of the three questions involved, i.e. the Feel, Sleep, and Day scores.For a different value of the length-of-training hyperparameter L tr , a different model output is obtained over both the training and forecasting period.
For visualisation purposes, we graph the individual questions independently as three separate graphs, each reporting score as a function of time t.Fig. 1 shows a typical example of the means and variances of the model's probabilistic outputs per timepoint as learned by our model, along with the sporadic self-reported well-being scores for one of our participants, with L tr =1 and L fc =7.In the figure, the mean prediction is represented as a dashed line, and the uncertainty around the prediction is indicated by a shaded ±σ area around the mean.As expected, most (but not all) reported scores fall within the shaded region, even in the forecast phase, where, however, as expected σ grows progressively bigger due to the absence of inputs to the LDS.The LL and RMSE measures (see Sec. 4.2) corresponding to the model outputs over the timepoints in Fig. 1 were -4.41 and 0.5447, respectively.

Comparison against competing models
We compared the performance of the M fd method against the competing models outlined in Sec.4.2 in two ways: a) by analysing the distribution of performance differences directly; b) by analysing 'winning percentages' as per (15).Both analyses were performed using the LL and RMSE measures, separately.
A representative example of the first type of analysis can be seen in Fig. 2 for the LL differences and Fig. 3 for the RMSE differences between models.The results were obtained for L tr =3 and L fc =7, which was the most conservative choice for comparing M fd and M bma (more on this later).It is clear that, for both performance measures, M fd performs better across the board.Also, among all other competitors, M bma is the one with the least spread in terms of pairwise differences over all patients.Similar results were obtained with other values of L tr and L fc .
Figs. 4 and 5 look at the same predictions using the second type of analysis, again when training with 3 weeks of past data, and predicting 7 days ahead.These show that the M fd model scores significantly more wins (at the 5% level, using a one-tailed hypothesis) compared to all of its competitors, when both the accuracy and the uncertainty of the predictions is taken into account (i.e. when using the LL as the performance measure).When only the RMSE is used, M fd scores significantly better than four out of the six competitors, but shows no significant difference compared to its M bma and M a competitors.

Effect of training / testing period length on performance
One would generally expect that increasing trainingperiod length would improve performance over all models, and that predictions further away from the last training timepoint would diminish in accuracy.
An analysis was therefore conducted to confirm this, for training periods L tr ∈{1, 3, 7} and forecasting periods L fc ∈{1, 3, 7}.Table 1 shows the results obtained with respect to the LL evaluation measure. 3The top half of the table shows the median LL values, obtained over all patients for whom data was available for the corresponding {L competing models.Note that because of the nature of these intervals, each {L tr , L fc , } pair will consist of a different number of 'valid' participants (i.e. the participants who have data within this period), and therefore it is important to note that the above medians are calculated over different sets of patients.The last row shows the number of such valid participants per {L tr , L fc } pair.Also note that, pairwise median differences are generally not equivalent to pairwise differences between medians.As shown in the table, some of the methods under comparison initially struggle to make acceptable 7-day predictions, when only one week or three weeks of data are available to them (e.g.M map , M p and M r ); it is not until 7 weeks of training that most models can predict one or three days ahead with reasonable accuracy.By contrast, looking at M fd , we see that, not only is it able to make predictions from week one, but it can even make 7-day predictions with remarkable stability for any number of training weeks.Similarly, as expected and consistent with the findings by Yu Model Type et al. [37], there is some reduction in accuracy going from next-day forecasts to 7-day forecasts, regardless of training length.However, it can be seen that the reduction in accuracy in the M fd model tends to be relatively small compared to the competitors.Finally, as shown in the bottom half of Table 1, we can see that M fd is superior to most methods in most conditions, with the median LL difference being significantly greater than zero in 45 out of 54 comparisons (83.3%), and scoring more 'wins' than its competitors in 50 out of 54 comparisons (92.6%).In fact, focusing on the M fd vs M bma row, we see that the new method is also superior to its predecessor, under all scenarios considered; the choice L tr =3 and L fc =7 corresponding to the graphs shown earlier in Secs.5.1 and 5.2 showing a relatively narrow interquartile range between the two, was simply selected on the basis that it represented the worst-case scenario for M fd in relation to M bma .

Conclusions
The use of smartphones and wearable sensors for quantifying and predicting well-being states through personalised predictions, is an actively growing field, with potential applications in the prevention and self-management of depression and other disorders.Personalisation refers to the ability to learn a model, which is specifically tuned with respect to the individual it is intended to be applied to.However, a major obstacle to success in this field so far has been that the more traditional machine learning approaches to this -which typically require the availability of large datasets of uniformly sampled data -are generally not applicable to this domain.There are two main reasons for this; the first is that the kind of data provided by patients through smartphones and wearable sensors tend to be sporadic or intermittently available.The second is that, realistically, for such a personalised system to be useful, users need predictions virtually from day one, whereas in a typical situation the data available to the system for personalisation will initially be very limited, and acquired incrementally over time.As has been demonstrated in the sections above, it could take weeks before a decent amount of data has accumulated to guarantee reliable prediction.
In this paper, we have proposed a parametric transfer learning approach based on the Fisher divergence in the context of Hamiltonian Monte Carlo sampling and Bayesian inference to address these challenges.Our approach makes it possible to create patient-specific models and make useful predictions of self-reported well-being scores, even when the data available for initial training are sporadic and limited, such that training is performed incrementally as more data become slowly available over time.This approach allows us to make informed predictions even in the early stages of data collection, by leveraging external information coming from other patients, in the form of a prior used within a Markov-Chain Monte Carlo process.
We demonstrated this approach on data obtained by the NEVERMIND clinical trial, and measured its performance against previous work (e.g. the BMA method introduced in [43]), and a number of baseline approaches.Our results show that this approach yields a significant improvement over its competitors, and is particularly useful in difficult training/forecasting scenarios, e.g. when one requires a distant, patientspecific forecast, with only a limited initial amount of patient-specific data available for training.
One limitation of this study is that the Linear Dynamical System model used is limited to a single latent state, and the observations reflected questionnaire responses only.The addition of physiological signals in the observation vector could strengthen the model's predictive abilities further, albeit at the cost of increased complexity.Similarly, a single latent state reflecting wellbeing in a general sense may not be powerful enough to capture the underlying complexity of depressive states.It is possible that an appropriately extended latent state might allow a richer representation of the underlying biological states, as well as allow linking latent states to observations directly.All this will be addressed in future work.
Additionally, in this study we found that the background patient population acting as the source domain for the transfer learning component was informative, as shown by the performance of TL.However, a limitation is that we made no attempt to quantify, or investigate ways in which this background knowledge could be made more informative.Future work will focus on investigating whether applying preprocessing strategies that promote individuals in the population that are known to be similar in some way to the person being modeled, enhance transfer learning.
Finally, we recognise that our method has wider applicability to other domains, such as finance, recommender systems, training initiatives, etc, and generally any scenario where limited or sporadic data arrive in a sequential manner, and a seamless transition from generalised to personalised models is required.Therefore in the first instance, future work will also focus on verifying the performance and generality of this approach, both on the complete NEVERMIND dataset (which will become available at the end of the clinical trial), as well as other known external datasets (such as the MIMIC-III critical care database [57]).
In the meantime, the tools we have produced can already be used by clinicians and carers to monitor patients.However, the broader aim is to also use them in the context of a self-management tool.It is therefore important to recognise that however accurate such tools may be in the absence of user feedback, they also need to be evaluated when users are provided with the system's predictions.Since it is likely that patients will be affected by the system's forecasts, this has not only practical implications (which we believe the learning element of the system will likely be able to deal with automatically) but also ethical ones.It is issues of this type that the randomised controlled trial leg of the NEVERMIND project will be called to address.

Appendix A.
In this appendix we derive a relationship between the Fisher divergence D F (p q) from a given distribution q to a mixture distribution p and the Fisher divergences D F (p i q) from q to p's mixture components.Given a mixture distribution p(x) = i w i p i (x) with i w i = 1 and w i ≥ 0, the Fisher divergence from q to p can be computed as: + p(x) ∇ x log q(x) 2 dx.(A.1) We rearrange equation (A.1) by adding and subtracting a convenient term, then breaking up the mixture distribution and regrouping, obtaining: where J(p) = x p(x) ∇ x log p(x) 2 dx is the Fisher information [53] of p while J(p i ) is that of p i .This is especially useful when looking for the best approximation q β to a mixture distribution p.Since J(p) and J(p i ) do not depend on β, the best approximation to the mixture distribution can be computed by minimising a weighted sum of the Fisher divergences between the approximant and the mixture components: θ|y) dθ with respect to an arbitrary function h(θ), as the sample average 1 K K k=1 h(θ k ), evaluated at the posterior samples θ k [40, Sec.10.1].

Figure 1 .
Figure 1.An example of self-reported well-being score modelling and prediction.The dashed vertical lines mark the last time point available to the model for training and visually separate past observations from future predictions.The solid red circles mark the reported scores that were used by the model, while the empty ones are reported to visually assess the prediction accuracy.The dashed black and blue lines and the associated yellow and green shadows represent the mean and standard deviation, respectively, of the distribution of the model outputs.

Figure 2 .Figure 3 .
Figure 2. Box and whisker graph plots showing the median, interquartile range, and extreme cases of the LL differences when training with 3 weeks of past data and predicting 7 days ahead.Values above 0 represent cases where M is better than its competitors.

Figure 4 .
Figure 4. Comparison of the winning results based on LL for the M against all competing models when training with 3 weeks of past data and predicting 7 days ahead.The * indicates statistical significance using a one-tailed Wilcoxon Signed-Rank test.

Table 1 .
Medians of the LL (top) and of LL-differences (bottom) for various durations of training and forecast periods.
∇ x p i (x) T ∇ x log q(x)+ p i (x) ∇ x log q(x) 2 dx Rows correspond to model parameters, and columns to the various summary metrics.mean denotes the posterior mean, se_mean denotes the Monte Carlo standard error, and sd denotes the posterior standard deviation.The numbers 2.5%, 25%, 50%, 75%, and 97.5% denote quantiles.n_eff denotes the effective sample size, and Rhat denotes the split-R statistic.
arg minβ i w i D F (p i q β ).(A.3)16EAI Endorsed Transactions on Bioengineering and BioinformaticsOnline First