Predicting CDR status over 36 months with a recall-based digital cognitive biomarker

Abstract


Background
Word list recall tests are routinely employed in clinical practice for the assessment of verbal memory ability, including in individuals with Alzheimer's disease (AD) and other dementias [1]. Most commonly employed neuropsychological tests of word-list recall were originally developed to identify individuals with well-defined memory loss predating dementia [2]. However, alongside the promise of disease-modifying drugs for AD, more emphasis is currently being put on identifying individuals who present subtle signs of underlying pathology at the earliest stages, as these individuals may benefit the most from pharmacological interventions. Therefore, together with advancements in neuroimaging and fluid biomarkers, there is also a critical need for the development of more accurate neuropsychological testing, including with word lists, particularly as cognitive assessment is typically cost-effective and requires relatively little training compared to neuroimaging and fluid biomarker capture [3].
Process scoring and latent modeling of cognitive tests, which allow for the identification of underlying neurocognitive mechanisms of test performance, including with word list recall tests, have shown potential to enhance test accuracy without requiring test redesign [3][4][5][6][7]. For example, simple adjustments in the way popular word-list tests are scored have yielded enhanced sensitivity to cerebrospinal fluid biomarkers of AD [8] and post-mortem pathology [9].
One approach to latent modeling of word-list recall tests is hidden Markov modeling (HMM), a class of established cognitive models [10,11]. In the context of word list recall tests, an HMM characterizes episodic memory for a list item as existing in one of a set of latent storage states upon each observation of recall or non-recall during the test. Retrieval parameters are associated with each storage state from which an item is capable of being recalled, and encoding parameters describe the transitions among the storage states [12]. The sequence of encoding and retrieval transitions cannot be directly observed; the parameters governing it are therefore "hidden" or "latent", quantifiable as probabilities that together produce the observed sequence of recall and non-recall.
One HMM that has been applied to word list recall tests is the hierarchical Bayesian cognitive processing (HBCP) model. The HBCP model posits that an item exists within one of three states upon each observation: Pre-task Storage (P), which contains only semantic memory of the non-novel word yet no episodic memory of its presentation during the word-list recall test; Transient Storage (T), which contains temporarily stored episodic memory of the word presentation such that it can be retrieved on immediate free recall (IFR) tasks but not after a delay; and Durable Storage (D), which contains episodic memory of the word presentation that is capable of being retrieved on immediate or delayed free recall (DFR) tasks. Three retrieval parameters quantify the probability of recall from the latter two states: Transient Retrieval (R1), from T on IFR tasks; Durable Retrieval (R2), from D on IFR tasks; and Delayed Retrieval (R3), from D on DFR tasks. Four encoding parameters quantify the probability of an item transitioning from one state to another during each task: One-shot Encoding (N1), from P to D; Transient Encoding (N2), from P to T; Consolidated Encoding (N3), from T to D on a task subsequent to its encoding into T; and Testing Effect Encoding (N4), from T to D upon successful recall from T. In the HBCP, these parameters are employed as probabilities for each branch of a multinomial processing tree that reproduces the observed recall behavior [13]. For example, a particular word for a particular assessment that is not recalled on IFR 1 or 2 or DFR 1, but recalled on IFR 3, will express as increased probability of the encoding parameters that result in the word existing in the T or D state by IFR 3, and of the retrieval parameters R1 and R2, respectively. See Figure 1.
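The state structure described above can be illustrated with a small forward simulation. The sketch below is not the proprietary HBCP implementation: it omits the hierarchical Bayesian estimation, the demographic priors, and any serial position adjustment. It also assumes a specific order of events within each task (encoding at presentation, then a retrieval attempt, then the testing effect), assumes that N1 and N2 are competing branches with N1 + N2 ≤ 1, and assumes that a word stays in T when consolidation fails. The function names and parameter values are made up for illustration only.

```python
import random

# Hypothetical parameter values, chosen only to make the example run.
EXAMPLE = {"N1": 0.2, "N2": 0.5, "N3": 0.4, "N4": 0.3,
           "R1": 0.6, "R2": 0.8, "R3": 0.7}

def simulate_word(p, n_ifr=3, rng=random):
    """Simulate recall (1) or non-recall (0) of one word on n_ifr IFR tasks
    followed by one DFR task, under the assumptions stated in the lead-in."""
    state = "P"  # Pre-task Storage: no episodic memory of the word yet
    recalled = []
    for _ in range(n_ifr):
        # Encoding at presentation.
        if state == "P":
            u = rng.random()
            if u < p["N1"]:                 # One-shot Encoding, P -> D
                state = "D"
            elif u < p["N1"] + p["N2"]:     # Transient Encoding, P -> T
                state = "T"
        elif state == "T" and rng.random() < p["N3"]:
            state = "D"                     # Consolidated Encoding, T -> D
        # Retrieval attempt on the immediate free recall task.
        if state == "T":
            hit = rng.random() < p["R1"]    # Transient Retrieval
            if hit and rng.random() < p["N4"]:
                state = "D"                 # Testing Effect Encoding, T -> D
        elif state == "D":
            hit = rng.random() < p["R2"]    # Durable Retrieval
        else:
            hit = False                     # nothing to retrieve from P
        recalled.append(int(hit))
    # Delayed free recall: only Durable Storage supports retrieval.
    recalled.append(int(state == "D" and rng.random() < p["R3"]))
    return recalled

def recall_rates(p, n_words=10000, seed=0):
    """Monte Carlo estimate of the recall rate on each task."""
    rng = random.Random(seed)
    runs = [simulate_word(p, rng=rng) for _ in range(n_words)]
    return [sum(col) / n_words for col in zip(*runs)]
```

Calling `recall_rates(EXAMPLE)` returns one estimated recall proportion per task (three IFR tasks, then DFR). The actual model works in the opposite direction, inferring the parameters from observed recall sequences rather than generating sequences from known parameters.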
Subsequent to generating the encoding and retrieval parameters, measures of recall ability can be estimated by recombining specific subsets of the multinomial processing tree branches composed of these parameters. Word recall through T (M1) includes the probabilities for N2 and R1; word recall through D on IFR tasks (M2) includes N1, N2, N3, N4, and R2; and word recall through D on DFR tasks (M3) includes N1, N2, N3, N4, and R3. In the above example of word recall on only IFR 3, M1 is the summation of the subset of branches that results in retrieval through R1 on IFR 3, including encoding through N2 into T but excluding branches that encode through N1, N2, N3, and N4 into D; the reverse is true for the generation of M2 via retrieval from D through R2. These memory (M) values represent the generalized recall rate through the various processes across words and tests. For example, an individual with an M3 of .71 is expected to recall 7.1 words out of 10 on average across delayed free recall tests. In unpublished findings, these M parameters have been used to distinguish individuals with mild cognitive impairment from those who were cognitively normal at the time of assessment, with an AUC of .79. Collectively, the N, R, and M parameters are referred to as digital cognitive biomarkers (DCBs). In the present study, parameters are averaged together (e.g., average of M1, M2, and M3 = M) to further generalize recall ability across word-list recall test tasks.
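The averaging step and the interpretation of an M value as an expected number of recalled words amount to simple arithmetic, sketched below. Only the M3 value of .71 comes from the text; the M1 and M2 values and the function names are hypothetical.

```python
from statistics import mean

def composite(values):
    """Average task-specific parameters into one composite (e.g., M1, M2, M3 -> M)."""
    return mean(values)

def expected_words(m, list_length=10):
    """Expected number of words recalled from a list, given a recall rate m."""
    return m * list_length

m1, m2 = 0.35, 0.62   # hypothetical values, for illustration only
m3 = 0.71             # value used in the example in the text
M = composite([m1, m2, m3])
words_on_delayed_recall = expected_words(m3)   # about 7.1 words out of 10
```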
In this paper, we aim to confirm previous internal findings by testing whether HBCP-derived DCBs (N, R, and M, indexing encoding, retrieval, and recall, respectively) are useful predictors of cognitive decline, as measured by the Clinical Dementia Rating (CDR) overall score. In particular, we want to ascertain whether baseline estimates of N, R, and M yield more predictive power than the traditional ADAS-Cog memory measures, such as immediate and delayed recall scores, which are also derived from item response data.

Participants
Data were drawn from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, which has detailed methods [14] reported elsewhere [15,16]. At baseline, 29 participants had a CDR score of 0; 300 scored 0.5; and one participant scored 1. At the 36-month follow-up visit, 64 participants had a CDR score of 0; 214 scored 0.5; 39 scored 1; 12 scored 2; and one participant scored 3. All activities for this study were approved by the ethics committees of the authors' universities and completed in accordance with the Declaration of Helsinki. All participants provided informed consent prior to testing.

Materials
The Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) [17] is a neuropsychological test battery consisting of 13 subtests, including immediate and delayed word recall. In the word recall task, participants are visually and audibly presented a list of 10 words, then asked to recall as many words as possible over 3 tasks: the order of presentation is varied across tasks, and the score is calculated as the mean number of words not recalled across the three tasks. After a 10-minute delay with distraction, participants are once more asked to recall as many words as possible, this time without presentation, and the score is calculated as the number of words not recalled.
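The traditional scoring rule described above can be written out directly from item response data. The sketch below assumes that responses are coded 1 for recalled and 0 for not recalled; the function name and the example responses are hypothetical.

```python
def adas_word_recall_scores(immediate_tasks, delayed_task):
    """Traditional ADAS-Cog word recall scores (higher = worse).

    immediate_tasks: one list of 0/1 item responses per immediate recall task
    delayed_task:    list of 0/1 item responses for the delayed recall task
    """
    # Immediate score: mean number of words NOT recalled across the tasks.
    misses = [len(task) - sum(task) for task in immediate_tasks]
    immediate_score = sum(misses) / len(misses)
    # Delayed score: number of words NOT recalled after the delay.
    delayed_score = len(delayed_task) - sum(delayed_task)
    return immediate_score, delayed_score

# Hypothetical responses for one assessment (10 words per list).
ifr = [
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],   # 4 recalled, 6 missed
    [1, 1, 1, 0, 0, 0, 1, 0, 1, 1],   # 6 recalled, 4 missed
    [1, 1, 1, 1, 0, 0, 1, 0, 1, 1],   # 7 recalled, 3 missed
]
dfr = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # 5 recalled, 5 missed
immediate, delayed = adas_word_recall_scores(ifr, dfr)
```

Both scores collapse the item-level sequence into counts, whereas the HBCP-derived DCBs are estimated from the pattern of recall and non-recall for each word across tasks.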
The Clinical Dementia Rating (CDR) scale [18] is a semi-structured clinical interview that assesses six cognitive areas: memory, orientation, judgement and problem-solving, community affairs, home and hobbies, and personal care. Each cognitive domain is then scored between 0 and 3, and to obtain a global score, the sum of each of the domains is calculated with equal weightings. The global CDR stages are 0, indicating normal cognition; 0.5, indicating mild cognitive impairment; and 1, 2, and 3, indicating mild, moderate, and severe dementia, respectively.

Digital Cognitive Biomarkers
Digital cognitive biomarker scores were generated from ADAS-Cog Word Recall item response data with the HBCP model, using Bayesian inference with a Markov-chain Monte Carlo (MCMC) algorithm [13]. Each assessment's observed sequence of recall and non-recall was used to update prior information on DCB distributions for typical individuals in the general population who come from demographic groups (age, sex, and education level) specific to the participant who performed the assessment. The HBCP model additionally included adjustment for word presentation position effects on each of the three ADAS-Cog English word lists. These DCBs are proprietary (Embic Corporation).

Polygenic Hazard Score
The Desikan AD PHS was computed based on a Cox proportional hazard regression model combining 31 AD-associated single nucleotide polymorphisms (SNPs) with two APOE variants (ε2/ε4), trained with genetic data from an independent cohort. The PHS, composed of a weighted score of 33 risk- or protection-conferring SNPs, was calculated for each participant as previously described [19].

Analysis plan
First, we carried out a longitudinal Bayesian linear regression analysis with CDR score at 36 months as the outcome. Predictors were N (the average of N1, N2, N3, and N4), R (the average of R1, R2, and R3), and M (the average of M1, M2, and M3), together with the traditional ADAS-Cog memory scores (immediate and delayed recall), all measured at baseline; control variables were years of education, gender, polygenic hazard score, age at baseline, and baseline CDR score.
Credible intervals (CIs) were set to 95%. The prior was set to JZS, and the model prior was set to Uniform. One thousand MCMC simulations were conducted to determine parameters and compensate for possible violations of normality, but we also evaluated q-q plots of residuals to estimate normality. Following that, we carried out two sensitivity tests. We conducted a frequentist ordinal regression analysis with the same outcome and covariates but limited predictors to those that emerged from the initial Bayesian regression as strongest.
Finally, to evaluate the clinical validity of these predictive metrics, we conducted a frequentist logistic regression analysis: we used an increase in CDR score of at least 0.5 from baseline to 36 months as the outcome and used the same predictor(s) and covariates (minus the baseline CDR score, already included in the change score) as in the ordinal analysis.
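The binary outcome for the logistic regression can be derived from the two CDR scores as in the sketch below. The function name and the example score pairs are hypothetical; only the 0.5-point threshold comes from the text.

```python
def declined(cdr_baseline, cdr_36_months, threshold=0.5):
    """True if the global CDR score rose by at least `threshold` points."""
    return (cdr_36_months - cdr_baseline) >= threshold

# Hypothetical (baseline, 36-month) CDR pairs, for illustration only.
pairs = [(0.0, 0.0), (0.5, 0.5), (0.5, 1.0), (0.5, 0.0), (0.0, 0.5), (1.0, 2.0)]
outcomes = [declined(b, f) for b, f in pairs]
n_decliners = sum(outcomes)
```

Because the outcome is itself a change score, the baseline CDR score is dropped from the covariates in this analysis, as noted above.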

Results
Table 1 reports demographics, CDR, and memory scores in the cohort under examination.
Results indicated that the CDR score at 36 months was best predicted by a model including only M (extreme evidence: BF10 > 1 billion, BFinclusion = 26.9); this model's odds were about three times as high (BFM = 9.3 vs. BFM = 3.2) as those of the next best model, which included M and R. In comparison to models including the traditional ADAS-Cog immediate and delayed recall scores, the model with M alone performed over 3.5 times better than the model with immediate recall and M combined (BFM = 9.2 vs. BFM = 2.6, respectively), and over 5 times better than the model with delayed recall and M combined (whose BFM was 1.7). The best model without DCBs (including both immediate and delayed recall) had a BFM of 0.1: this finding indicates that the top model, with M alone, had model odds about 90 times greater than the best model with only traditional ADAS-Cog metrics.
The higher the M score was, the lower the CDR score at 36 months (mean coefficient = -2.38, SD = 0.79): a cross-sectional difference of 0.2 M points corresponds to a CDR difference of about 0.5 (95% credible interval for the coefficient: -4.52 to -1.05).
Given that the q-q plot for the analysis above displayed some degree of non-normality, we also carried out the same analysis on square-root transformed follow-up CDR scores, which gave us more linear q-q plots. The overall pattern of results was unchanged.
Finally, the frequentist logistic regression (269 individuals did not increase their CDR score by 0.5 or more, and 61 did) showed that adding M to the model reduced the AIC from 293.95 to 245.24. M was a significant predictor in this analysis (unstandardized coefficient estimate = -14.94, standardized coefficient estimate = -1.23, Wald statistic = 39.75, odds ratio < 0.001, p < .001). Setting the M cutoff at approximately 0.58 yielded the following performance diagnostics: the area under the curve (AUC) was 0.84 (with covariates only and without M, the AUC was 0.71), negative predictive value was 0.87 (258 correct rejections vs. 38 misses), and positive predictive value was 0.68 (23 hits vs. 11 false alarms). Furthermore, specificity was very high (0.96; 258 correct rejections vs. 11 false alarms), while sensitivity was lower (0.38; 23 hits vs. 38 misses). Figure 2 displays the association between the baseline M score and the probability of declining by at least 0.5 CDR points at follow-up (36 months), and Figure 3 reports the receiver operating characteristic (ROC) curve.
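The performance diagnostics reported above follow from the four cell counts of the classification table, and the odds ratio follows from exponentiating the unstandardized coefficient. The short sketch below reproduces that arithmetic; the function name is hypothetical, and all counts and the coefficient are taken from the text.

```python
import math

def diagnostics(hits, misses, false_alarms, correct_rejections):
    """Standard classification diagnostics from confusion matrix counts."""
    return {
        "sensitivity": hits / (hits + misses),
        "specificity": correct_rejections / (correct_rejections + false_alarms),
        "ppv": hits / (hits + false_alarms),
        "npv": correct_rejections / (correct_rejections + misses),
    }

# Counts reported for the M cutoff of approximately 0.58.
d = diagnostics(hits=23, misses=38, false_alarms=11, correct_rejections=258)
rounded = {name: round(value, 2) for name, value in d.items()}

# Odds ratio per one-unit change in M, from the unstandardized coefficient.
odds_ratio = math.exp(-14.94)
```

Rounded to two decimals, these ratios match the reported sensitivity (0.38), specificity (0.96), positive predictive value (0.68), and negative predictive value (0.87). Because M is bounded between 0 and 1, a one-unit change spans its entire range, which is why the odds ratio is so small.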

Discussion
In this analysis of ADNI data, we examined how HBCP-derived DCBs (N, R, and M, indexing encoding, retrieval, and recall, respectively) compared to the traditional ADAS-Cog assessment metrics (immediate and delayed recall scores) in predicting CDR score over a 36-month span. Our analysis included 330 individuals and showed that DCB M was the better overall predictor in the test. These findings are in line with recent efforts demonstrating the validity of process metrics for early detection of cognitive impairment [3][4][5][6][7][21][22][23][24][25].
One observation is that the M metric, comparably to other cognitive tools in recent literature [3,25,26], appears more useful for ruling out individuals who are unlikely to decline than for correctly identifying those who will, as indexed by the high negative predictive value and specificity alongside the lower sensitivity. In other words, individuals scoring at M = 0.58 or higher at baseline were unlikely to have declined after 36 months. While identification of positive cases appears more difficult with process scoring compared to, for example, fluid biomarkers [27,28], a high negative predictive value still yields great utility.
Typically, cognitive assessment is cost- and resource-effective when compared to most biomarker assessments, as cognitive assessments are cheaper (many are non-proprietary), require less administrator training, and are less invasive to deliver. Therefore, especially in addressing global need where biomarker assessment is cost-restrictive, there is value in cognitive screening, which may help exclude individuals who, despite possible subjective concerns, are unlikely to be on a disease trajectory. Future assessments of M and related DCBs should include direct comparisons to state-of-the-art fluid and imaging biomarkers.
Further research should also use the latest in process scoring and latent modeling. In the time since these analyses were performed, a second generation of DCBs was generated and included in the ADNI database as quantified cognitive processes (qCPs). These qCPs account for additional differences in word features across the alternate lists of the ADAS-Cog Word Recall test and include alternate M parameters representative of recall on specific immediate and delayed tasks. Future analyses can be performed to evaluate the predictive capability of these qCPs.
This secondary and preliminary assessment of HBCP-derived DCBs has a definite limitation worth noting. The outcome (CDR score) is based upon clinical assessment of primarily cognitive function, and the predictors (N, R, M, and traditional immediate and delayed recall scores) are also measures of cognitive function, specifically memory, which risks issues of circularity. However, we note that (1) the memory scores are not contributors to the CDR score, and (2) M outperformed other measures of recall performance. In addition to measures of cognitive function, further confirmation of the utility of HBCP-derived DCBs should come from tests comparing this score to measures of pathology. Moreover, further studies should consider adding more demographic variety, such as including younger cohorts (as the present cohort was on average 70+ at baseline) and more ethnic diversity.
To conclude, word list memory tests are widely utilized for evaluating cognitive function, especially in Alzheimer's disease research and screening. These tests vary in their elements, such as list length, number of learning attempts, sequence of presentation across attempts, and inclusion of semantic categories. Traditionally, scoring techniques, such as overall scores and more recently composite scoring, have not adequately addressed differences among these elements or their impact on learning and memory during the test [29]. Recent advancements in process scoring and latent modeling offer promise in overcoming these limitations to provide better ways to assess cognitive performance. In this study, we show a specific example in the HBCP-derived DCB M.
The ADNI is a longitudinal study launched in 2003 with measures of cognitive impairment and AD, including clinical and neuropsychological assessment. To be included in this secondary data analysis study, participants had to have: Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) scores at baseline, from which traditional scores and DCBs were extracted; a CDR score at baseline and after a 36-month follow-up; and a Polygenic Hazard Score (PHS) to determine genetic risk. The initial reference pool comprised 3,418 participants, which was reduced to 330 participants (mean age at baseline = 71.4, SD = 7.2) after applying the inclusion criteria above. Of these, 184 were males and 146 were females. At baseline, 29 of these participants had a CDR score of 0.

Figure 1. Hierarchical Bayesian Cognitive Processing Model. The model has three episodic