Analyzing large Alzheimer's disease cognitive datasets: Considerations and challenges

Abstract
Recent data-sharing initiatives for clinical and preclinical Alzheimer's disease (AD) have led to a growing number of non-clinical researchers analyzing these datasets using modern data-driven computational methods. Cognitive tests are key components of such datasets, representing the principal clinical tool to establish phenotypes and monitor symptomatic progression. Despite the potential of computational analyses in complementing the clinical understanding of AD, the characteristics and multifactorial nature of cognitive tests are often unfamiliar to computational researchers and other non-specialist audiences. This perspective paper outlines core features, idiosyncrasies, and applications of cognitive test data. We report tests commonly featured in data-sharing initiatives, highlight key considerations in their selection and analysis, and provide suggestions to avoid risks of misinterpretation. Ultimately, greater transparency of cognitive measures will maximize the insights they offer in AD, particularly regarding the extent and basis of AD phenotypic heterogeneity.

evident in AD,2 ranging from typical memory-led AD to canonical atypical clinical phenotypes including visuospatial-,4 language-, motor-, or executive-led presentations, and understanding factors associated with cognitive resilience.
Big data collection initiatives offer an unparalleled opportunity to advance these research areas. There are, however, frequent inconsistencies and misconceptions in the use of neurocognitive data. Common methodological and analytical mistakes include overinterpreting the correspondence between an individual test and a specific cognitive domain or function, 5 inappropriate definition of "impairment" based on normative data, 6 and underappreciation of test properties, such as susceptibility to practice, ceiling, and floor effects. 7 Compounding these are the diversity of cognitive domains, AD presentation (typical, atypical) and progression (preclinical, prodromal, syndromic), the enormous array of tests, and their properties and idiosyncrasies.
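The normative-data pitfall can be made concrete with a short calculation. As a minimal sketch (not from the paper, and assuming normally distributed scores and independent tests), consider a rule that labels anyone scoring below -1.5 SD on any single test of a battery as "impaired": the expected false-positive rate in cognitively healthy people grows rapidly with the number of tests administered.

```python
from statistics import NormalDist

# Chance a healthy person scores below -1.5 SD on one test (normality assumed)
p_single = NormalDist().cdf(-1.5)  # ~0.067

def false_positive_rate(k: int) -> float:
    """Probability of falling below -1.5 SD on at least one of k independent tests."""
    return 1 - (1 - p_single) ** k

for k in (1, 5, 10):
    print(f"{k:2d} tests: {false_positive_rate(k):.2f}")  # rises from ~0.07 to ~0.50
```

With 10 tests, roughly half of a healthy cohort would meet such a naive "impairment" criterion, which is why multivariate base rates, rather than single-test cutoffs, are needed when defining impairment from normative data.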
This position paper aims to present common pitfalls and promote best practices for data-driven computational analyses of cognitive measures to maximize their value in the global efforts to understand and manage AD. We highlight key challenges and common pitfalls through examples using cognitive tests commonly available in open access AD datasets.

Cognitive testing
Cognitive tests are used near-ubiquitously to understand the impact of neurodegenerative disease on patients.2 Standardized cognitive tests aim to measure impairment objectively, adjusting for demographic factors that could independently affect scores and minimizing reliance on subjective, self-reported measures. They are also relatively cheap, widely available in English-speaking countries, quick to administer, minimally invasive, and of quantifiable reliability for use in clinical work.
A complete assessment is typically composed of several tasks, each intended to examine a broad function or domain, such as memory, attention, executive function, language, and visuospatial processing.
Cognitive domains can also be conceptualized in the context of altered function and/or structure of particular brain regions or networks. Additional information can come from behavioral observations and qualitative evaluation. It is not possible to completely isolate measurements for individual domains: correspondence between individual tests and cognitive functions is limited, and impairment is multifactorial (eg, poor memory might be attributable to impaired attention or visual processing rather than a primary memory deficit). In clinical practice it is therefore vital that individual test scores are always interpreted within the context of a patient's overall profile,2 rather than in isolation.
We report on cognitive tests that are common among the protocols of the free-access initiatives listed in Table 1. Building an exhaustive picture of all cognitive tests used in AD clinical practice is outside the scope of this work, as test use varies with clinical context, location, and purpose of assessment. However, we report detailed information for measures and test batteries commonly featured in data-sharing initiatives, including assessment descriptions, subscales, and scoring systems (Tables S1-S6 in supporting information).

Contribution from data-driven methods
Analyses afforded by data-sharing initiatives may offer promise in complementing aspects of current, often qualitative, clinical practice.
Data-driven models have been developed intending to identify patterns from unlabeled data while requiring limited or no human input9 (for examples of discriminative, generative, and other approaches, see Figure 1). One example relevant to AD is the event-based model, which has been applied to markers including cognitive test scores and cortical atrophy.4,7

[Figure 1: discriminative models (see Table 1) and generative models, the latter including: models not imposing a structure between available biomarker data (eg, event-based and scalar trajectory/differential equation models); and models for structured data, or data with a well-defined spatial organization (eg, spatiotemporal, network propagation, and dynamical systems models). Dashed arrows indicate revisiting steps, for example, revisiting test selection owing to missing data.]

As with other statistical methods, these models make different assumptions;9 these may include a common disease trajectory across individuals and biomarker/test independence, which may be violated by clinical heterogeneity (typical versus atypical presentation) and by dependency between tests and biomarkers, respectively.
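The independence assumption can be probed directly before model fitting. The sketch below uses synthetic data (hypothetical test names and loadings, not from any real dataset): two tests driven by a shared underlying factor show a large off-diagonal correlation that would violate an independence assumption, while an unrelated test does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)  # common underlying factor (eg, global severity)

# Hypothetical test scores: "memory" and "attention" both load on the shared factor
memory = 0.8 * shared + 0.6 * rng.normal(size=n)
attention = 0.8 * shared + 0.6 * rng.normal(size=n)
language = rng.normal(size=n)  # unrelated test

corr = np.corrcoef(np.column_stack([memory, attention, language]), rowvar=False)
print(np.round(corr, 2))  # memory-attention correlation is large; language is near zero
```

Inspecting such a correlation matrix on the real data (or testing residual correlations after fitting) is a cheap check on whether an independence-assuming model is appropriate.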

PERSPECTIVES
Quantitative and qualitative methods are complementary for advancing our understanding of AD progression. However, quantitative research must maintain clinical relevance, which requires domain knowledge that many data scientists lack. This is particularly important for cognitive test data, one of the primary markers used to track disease progression. In the following sections we present key considerations for selecting tests for inclusion in data-driven modelling studies and for avoiding common misinterpretations. Sections are cross-referenced in Figure 1, which outlines example research questions and processes for test selection, quality control and standardization, and computational/statistical methods. We also outline recent directions in cognitive assessment and make suggestions for improving these data resources.

Considerations for test selection
Batteries of tests are generally rich and diverse, with correspondingly diverse options for tests to select either as input to a model or for validating a model. Taking into account the characteristics of different tests supports best use and more accurate contribution to knowledge (see Figure 1B). Interpretation of some measures may be confounded when they are administered to certain patients, particularly those exhibiting prominent atypical non-memory symptoms; eg, measures of executive function featuring prominent visual components are susceptible to visuospatial impairment.

Considerations in test analysis and interpretation
Cognitive test data depend on multiple factors, such as the scoring system, the task, the domains the task preferentially measures, inter-rater reliability, and numerous other elements related to individual characteristics and psychological status (fatigue and anxiety are good examples). Providing an exhaustive summary of these factors is outside the scope of this article. In this section we summarize considerations to minimize misinterpretation and misuse of cognitive data by computational scientists developing data-driven predictive models of the disease.

Scoring issues
Tests might also include qualitative indicators such as "remembering test instructions," "spoken language ability," "word-finding difficulty," and "comprehension" in ADAS-Cog.

Characteristic effects of cognitive tests
Practice effect. One of the main uses of cognitive tests is repeated administration for tracking progression, for example in clinical trials. It is therefore vital to be aware of practice effects, defined as "the improvement in serial cognitive tests with the same or similar test materials."22 Such effects may be particularly evident on measures of episodic memory, between the initial assessment and first retest (diminishing across subsequent visits), and in MCI and AD patients as well as healthy participants.23 Practice effects can substantially alter the interpretation of findings if inadequately controlled or inappropriately analyzed. To overcome this limitation, many cognitive tests have validated alternative forms administered in a counterbalanced order, although there is evidence that these only attenuate, and do not eliminate, the effect.24 Goldberg et al.25 suggest three approaches to attenuate the consequences of practice effects, with varying advantages and disadvantages: introducing massed practice to increase task familiarity, adopting cognitive science principles to reduce practice-related gains, and developing well-matched alternate forms. While the above efforts aim to mitigate practice effects, there is increasing evidence of the clinical utility of characterizing practice effects themselves, for example in determining their associations with AD risk factors and biomarkers, or in predicting subsequent cognitive decline.26,27
Floor or ceiling effects. These occur when a test cannot measure performance outside its range, which overestimates or underestimates performance and skews score distributions. This is a common issue with brief cognitive tests that measure a limited range of task performance. Patients, particularly at an early disease stage, may make few or no errors on common tests such as the MMSE or ADAS-Cog.28,29 A key challenge is selecting tests on which patients at intermediate disease stages can perform adequately, while the tests remain sufficiently difficult to be sensitive for high-functioning patients and healthy control participants. Tests meeting such criteria might still yield variability in task performance that differs considerably between patients and healthy controls, or between patient groups stratified by severity.
Although not all measures are susceptible to floor and ceiling effects,30 many cognitive tests used for computational purposes might need further analysis or subscale selection31 before comparison with other markers. Approaches less prone to floor or ceiling effects include tests measuring both accuracy and timed components, tests without a fixed maximum score, experimental designs not featured in data initiatives (eg, using a staircase paradigm), and composite measures.3
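A simple screening step can flag these effects before modeling. The sketch below uses hypothetical data, and the 15% cutoff is a commonly cited rule of thumb rather than a criterion from this paper: it reports the fraction of participants at each extreme of a test's range.

```python
import numpy as np

def ceiling_floor_check(scores, min_score, max_score, threshold=0.15):
    """Fraction of participants at each scale extreme; flag if above threshold."""
    scores = np.asarray(scores)
    at_ceiling = float(np.mean(scores == max_score))
    at_floor = float(np.mean(scores == min_score))
    return {"ceiling": at_ceiling, "floor": at_floor,
            "ceiling_effect": at_ceiling > threshold,
            "floor_effect": at_floor > threshold}

# Hypothetical MMSE-like scores (range 0-30) from a high-functioning cohort
mmse = [30, 30, 29, 30, 28, 30, 30, 27, 30, 29]
print(ceiling_floor_check(mmse, 0, 30))  # 60% at maximum -> ceiling effect flagged
```

Tests flagged this way may warrant transformation, subscale selection, or exclusion from analyses that assume approximately continuous, unbounded scores.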

Cognitive composites
There has been a recent surge in composite measures derived from batteries of tests in AD research.5 They have been developed for multiple purposes, including sensitivity to global disease severity,32 individual cognitive domains,2,3 or longitudinal change,33 particularly in the preclinical phase relevant to secondary prevention trials.34 In their recent review, Schneider and Goldberg5 identified 12 composite scales that have been used in clinical trials to assess cognitive functions. Multi-domain composites may mitigate the previously discussed inability to isolate single domains, and may be sensitive to domains affected in the preclinical stages of the disease.34 Various methods have been explored for composite development, including psychometric approaches;3 combinations of statistical, theoretical, and empirical approaches;33 and computationally sophisticated data-driven algorithms.35 However, cognitive composites are still prone to a number of issues.5 Lim et al.36
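To make the basic construction concrete, the following sketch uses equal weights and hypothetical tests and normative values (published composites typically use more sophisticated weighting and norming): each test is z-scored against a control reference sample, tests where lower raw scores indicate better performance are reoriented, and the results are averaged.

```python
import numpy as np

def composite_z(scores, control_mean, control_sd, higher_is_better):
    """Equally weighted composite of z-scores, oriented so higher = better cognition."""
    z = (np.asarray(scores, dtype=float) - control_mean) / control_sd
    z = np.where(higher_is_better, z, -z)  # flip timed tests where lower raw = better
    return float(z.mean(axis=-1))

# Hypothetical two-test battery: word recall (higher better), trail-making time (lower better)
control_mean = np.array([20.0, 75.0])
control_sd = np.array([4.0, 20.0])
higher_is_better = np.array([True, False])

participant = [16.0, 95.0]  # one SD worse than controls on both tests
print(composite_z(participant, control_mean, control_sd, higher_is_better))  # -1.0
```

Even this toy version shows where composite choices matter: the reference sample, the orientation of timed measures, and the weighting scheme all change the resulting score and its sensitivity.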

CONCLUSIONS
Research in AD is moving toward increasing collaboration between disciplines to better understand and address this condition. The creation and sharing of big datasets will be important vehicles guiding this effort in the coming years. In particular, cognitive measures are currently among the most widely used quantitative tools in clinical practice, although not necessarily familiar to non-clinical disciplines. We have aimed to promote understanding and address knowledge gaps around the use and misuse of cognitive tests for a broad audience of researchers from different fields. Ultimately, we hope that a better appreciation of the promise and applications of cognitive data will stimulate timely interdisciplinary advances in our understanding of AD.

ACKNOWLEDGMENTS
We would like to thank the reviewers and editor for their constructive comments on a previous version of this paper. We would also like to thank Jennifer Nicholas for assisting with queries regarding approaches to handle missing data.

FUNDING INFORMATION
This work is supported by the EPSRC CDT in Medical Imaging.

FINANCIAL DECLARATIONS
Nothing to declare.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest.