Word Recall: Cognitive Performance Within Internet Surveys

Background: The use of online surveys for data collection has increased exponentially, yet it is often unclear whether interview-based cognitive assessments (such as face-to-face or telephonic word recall tasks) can be adapted for use in application-based research settings. Objective: The objective of the current study was to compare and characterize the results of online word recall tasks to those of the Health and Retirement Study (HRS) and determine the feasibility and reliability of incorporating word recall tasks into application-based cognitive assessments. Methods: The results of the online immediate and delayed word recall assessment, included within the Women’s Health Valuation (WHV) study, were compared to the results of the immediate and delayed recall tasks of Waves 5-11 (2000-2012) of the HRS. Results: Performance on the WHV immediate and delayed tasks demonstrated strong concordance with performance on the HRS tasks (ρc=.79, 95% CI 0.67-0.91), despite significant differences between study populations (P<.001) and study design. Sociodemographic characteristics and self-reported memory demonstrated similar relationships with performance on both the HRS and WHV tasks. Conclusions: The key finding of this study is that the HRS word recall tasks performed similarly when used as an online cognitive assessment in the WHV. Online administration of cognitive tests, which has the potential to significantly reduce participant and administrative burden, should be considered in future research studies and health assessments.


Introduction
The use of Internet-enabled devices, such as computers, smartphones, and tablets, to conduct cognitive research has increased dramatically over the past decade [1][2][3]. These devices allow researchers to use application-based cognitive assessments that have distinct advantages over more traditional assessment methods (ie, face-to-face interviews), including rapid data collection, reduced participant and administrative burden, and access to diverse or hard-to-reach populations [4,5]. When used in either a community or clinic setting, such online applications may detect cognitive and behavioral information that is missed with face-to-face assessments [6], including millisecond changes in cognitive processes [2]. Furthermore, in light of recent recommendations that cognitive screenings be included as a part of routine personalized health care [7], online cognitive assessments may play an important role in detecting subtle changes in cognitive function for both healthy and clinical populations at times when prevention and intervention strategies may have an optimal impact [8].
Application-based administration of cognitive tests has the potential to significantly advance research examining changes in cognition due to aging or illness. Repeated, short online cognitive batteries can provide a fine-grained assessment of cognitive capabilities in everyday life. For example, studies could examine situations or times of day in which cognitive lapses are most likely to occur (ie, during stress) [9,10], which can be used to devise targeted behavioral interventions to improve cognition. Similarly, more frequent cognitive assessments may help to better understand patterns of cognitive change over time in research cohorts or clinical settings.
Frequent use of cognitive assessments may be particularly important in clinical and primary care settings, where early indicators of mild cognitive impairment can be misdiagnosed as typical age-related declines in as many as 91% of cases [11]. This rate of misdiagnosis may be attributable to the frequent use of the Mini-Mental State Examination, which lacks sensitivity to detect subclinical levels of cognitive decline compared to other assessments [12,13]. Rates of misdiagnosis are further exacerbated by individual subjective memory complaints [14]. Measures that evaluate more specific cognitive domains like episodic memory may be more specific for the detection of early changes in cognitive performance.
Episodic memory is one of the first domains in which people experience subclinical changes in cognitive performance [15,16]. Broadly described as a person's ability to recall temporally related events or dates [17], episodic memory is particularly sensitive to the effects of aging [18][19][20]. This is likely a reflection of age-related neurobiological changes that occur in areas of the brain associated with episodic memory (eg, prefrontal cortex, medial temporal lobes, and hippocampus [19,21,22]), such as the decreased availability of the neurotransmitter dopamine [23], changes in functional connectivity between brain regions [24,25], and volumetric reductions of the hippocampus and prefrontal cortex [21].
Recent evidence indicates that subtle changes in episodic memory can be detected in individuals with normal or slightly impaired cognitive abilities [26]. Examining episodic memory in clinical or research settings may be particularly valuable since lower baseline scores and greater rates of changes in episodic memory are likely to precede the onset of clinical symptoms of cognitive decline [16,27], especially for individuals with a genetic risk for Alzheimer's disease [28]. Recall tests are frequently used to estimate episodic memory as a part of larger interview-based [26,29] and online [3] neuropsychological batteries. Despite the clear advantages and potential benefits of application-based cognitive assessments, researchers often fail to demonstrate equivalence between their application-based assessment and its interview-based counterpart [3]. Ideally, equivalence between assessments (ie, construct validity) would be evaluated using a gold standard measure [30]. In the absence of such a standard, it is preferable to use an internally consistent and valid measure that has demonstrated response stability across samples [31,32].
In response to this gap, the current study opted to replicate the episodic memory tasks (immediate and delayed recall) of the Health and Retirement Study (HRS) in an online survey. These tasks were selected for a number of reasons. First, performance on the cognitive measures of the HRS has been shown to be stable from wave to wave, after controlling for cohort effects and test-retest bias [33]. Second, none of these measures has been adapted for use in application-based assessments and tested for equivalence. Third, the format and presentation of the episodic tasks of the HRS were most easily replicated in an online format and would not require the use of complex computer technology that may be difficult or unavailable for older populations (eg, microphones). Finally, due to the authors' interest in age, this study was further motivated by evidence that episodic memory is more susceptible to increasing age compared to semantic memory (ie, abilities related to vocabulary and general knowledge [34]), which has been shown to remain stable well into later decades of life [18,20]. Given the age range of the online sample in the current study (40-69 years), as well as the previous methodological considerations, the replication of the episodic memory tasks was prioritized over the other HRS measures.
This study examines the performance of an online word recall task that was originally developed as part of the HRS for cognitively healthy adults. Specifically, the results of an online immediate and delayed word recall task in a nationally representative sample of women aged 40 to 69 years were compared to the results of female respondents from waves 5-11 (2000-2012) of the HRS. Using these primary and secondary data, two questions were examined: (1) Do the online word recall tasks demonstrate sufficient equivalence to the HRS word recall tasks? (2) Does word recall performance vary as a function of respondent characteristics and task modality? Ultimately, the results of this study will aid in the evaluation of the potential of cognitive assessments in online surveys and health assessments.

The Health and Retirement Study
Since its launch in 1992, the goal of the Health and Retirement Study (HRS) has been to provide a detailed, national representation of US adults aged 50 years and older. Jointly managed through the National Institute on Aging (U01 AG009740), the Institute for Social Research, and the University of Michigan (IRB Protocols HUM00056464, HUM00061128, HUM00002562, HUM00079949, HUM00080925, and HUM00074501), the HRS is widely cited as an excellent source of data for use in examining cognitive trends and abilities of the aging US population [35]. Data are collected via telephone and face-to-face interviews in 2-year cycles, with new cohorts added every 6 years. The HRS uses a dual modality approach, where initial interviews are conducted face-to-face and the majority of successive interviews are conducted over the telephone (unless participants are older than 80 years of age). Hispanic and black adults are oversampled. Spouses of HRS participants are also included, regardless of age.

The Women's Health Valuation Study
Conducted at Moffitt Cancer Center in Tampa, Florida, the Women's Health Valuation (WHV) study is an Internet-based health valuation study that included health measures and a discrete choice experiment (DCE) where respondents reported their preferences between possible health outcomes. The approach and methods, including its sampling design and survey instrument, were adapted from the PROMIS-29 valuation study (1R01CA160104) [36] and approved by the University of South Florida Institutional Review Board (USF IRB Protocol 8236).
The WHV online survey instrument had four components: screener, health, DCE, and follow-up. Each component had a series of questions distributed across a continuous series of pages, and responses were recorded by clicking or typing answers and then hitting the Next button. Each page included a Back button so the respondent could return to previous pages and change previous answers; however, to discourage participants from returning to previous pages of the survey, the Back button was disabled. To exit the survey, respondents could close their browser at any time. If the browser was closed prior to completing the survey, the data were not recorded. Responses to all questions were mandatory in order to proceed to the next page.
Participants were recruited from a pre-existing national panel of US adults. To promote concordance with the 2010 US Census, participants were sampled according to 6 demographic quotas: age in years (40-54 and 55-69) and race/ethnicity (Hispanic; black, non-Hispanic; white; and other, non-Hispanic). Further details about the methods of this study are available online [37]. Overall, 4474 women completed the survey between April 3, 2013 and April 21, 2013.

Episodic Memory
The cognitive battery of the HRS has been evaluated for internal consistency and validity [38]. Latent factor path modeling has identified three cognitive domains: episodic memory (immediate and delayed recall), mental status (serial 7s, backward counting from 20, naming), and vocabulary (ie, semantic memory) [35]. Measures of episodic memory include an immediate and delayed recall task. Mental status is measured by a serial 7s subtraction test, counting backwards from 20, and naming (the last name of the current president and vice president; two objects [scissors and cactus] based on a brief verbal description; and the current month, day, year, and day of week). Semantic memory is assessed using a baseline measure of vocabulary (5 words) [39].
As a measure of episodic memory, the immediate and delayed recall tasks are drawn from four categorized lists of 10 English nouns that do not overlap in content. Respondents are randomly assigned to one of the four lists at the initial interview. Longitudinally, each respondent is randomly assigned to receive an alternative word list, such that each respondent is assigned to a different set of words for the three successive waves of data collection. With this counterbalanced approach, each respondent is assigned to each word list only once over 4 waves of data collection, and approximately 8 years pass before a respondent is reassigned to the same set of words as in their initial interview.
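The counterbalanced assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the HRS implementation: the list identifiers, the function name `assign_lists`, and the assumption that the same order repeats after the fourth wave are all hypothetical.

```python
import random

# Hypothetical identifiers for the four 10-noun word lists
WORD_LISTS = (1, 2, 3, 4)


def assign_lists(rng: random.Random, n_waves: int = 4) -> list[int]:
    """Return one respondent's word list per wave.

    A random ordering of the four lists is drawn once, so no list repeats
    within a block of four waves. Assumption: the ordering restarts after
    four waves, so the initial list recurs at the fifth wave (about 8 years
    later with 2-year cycles).
    """
    order = list(WORD_LISTS)
    rng.shuffle(order)
    return [order[wave % len(order)] for wave in range(n_waves)]


# Five waves for one respondent; the seed is arbitrary
schedule = assign_lists(random.Random(2000), n_waves=5)
```

Drawing a random first list and then a random alternative at each later wave yields the same distribution as shuffling the four lists once, which is why a single shuffle is used here.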
During the immediate recall task, an interviewer reads a list of 10 words to each respondent at a rate of approximately 2 seconds per word, and the respondent then verbally recalls as many words as possible. Approximately 5 minutes after the immediate word recall test, during which respondents answer questions about their emotional state and complete two mental status tasks (ie, counting backwards and serial 7s), respondents are asked to recall the words from the immediate recall task. For each task, the number of correctly recalled words is scored, with higher scores indicating better performance.

Self-Reported Memory
In addition to episodic memory, HRS respondents are also asked to self-report their memory at the present time (excellent, very good, good, fair, or poor) and compare their current memory to their memory 2 years ago (better, same, or worse).
For the purpose of comparison, this study examines all word recall responses from waves 5-11 (2000-2012) of the HRS. Since the WHV was restricted to female respondents, we excluded male respondents from the HRS to decrease the risk of gender bias. Participants of the HRS who reported using a proxy respondent; refused to respond to word recall tasks; or had missing data on demographic, memory, or word recall variables (less than 2.0% of the sample) were also excluded. After these exclusions were applied, 12,545 women remained, each of whom completed between 1 and 7 word recall tasks, with a median (interquartile range) of 3 tasks (2-5 tasks). These tasks were restructured to represent a cross-sectional dataset with a total of 43,417 word recall tasks.

Episodic Memory
The episodic memory of the WHV replicated the word recall task conducted as part of the HRS. All respondents were asked to recall 10 English nouns immediately after they were presented on-screen (immediate recall) and after a delay (delayed recall). Each respondent received one of four randomly assigned sets of words, which were taken verbatim from the HRS and presented in the same order. Prior to the immediate recall task, respondents were presented with a screen that informed them that they would be shown a set of 10 words and would be asked to recall as many words as they could. These instructions were largely based on those given to HRS respondents but modified for online presentation. Words appeared on the computer screen one at a time for approximately 3 seconds. Respondents were asked to recall the words directly after the presentation of all 10 words (immediate recall) and then approximately 20 minutes later at the end of the DCE component (delayed recall). For each recall, respondents typed as many words as they could remember, in any order, in empty text boxes within the survey. As with the HRS, the primary measure of episodic memory was the sum of correctly recalled words for each task, regardless of order.
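The scoring rule (the sum of correctly recalled words, regardless of order) can be sketched as follows. This is an illustrative sketch only: the function name, the word list, and the choices to ignore letter case, surrounding whitespace, and repeated entries are assumptions rather than details reported for the WHV survey.

```python
def score_recall(typed_responses, target_words):
    """Count the distinct target words that appear among the typed responses.

    Assumptions (not stated in the study): matching ignores case and
    surrounding whitespace, and a word typed twice is counted once.
    """
    targets = {word.lower() for word in target_words}
    recalled = {resp.strip().lower() for resp in typed_responses if resp.strip()}
    return len(targets & recalled)


# Illustrative placeholder list; NOT one of the actual HRS word lists
EXAMPLE_LIST = ["dollar", "ocean", "shoe", "rock", "sea",
                "garden", "window", "candle", "bridge", "pencil"]

# "doller" is an exact-match miss, the repeated "ocean" counts once
example_score = score_recall(["Ocean", " dollar ", "ocean", "doller", ""], EXAMPLE_LIST)
```

Under exact matching, a misspelled entry such as "doller" earns no credit, which is the situation the spelling-correction sensitivity analysis in the Results addresses.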

Self-Reported Memory
The self-reported memory questions of the WHV were replicated from the self-reported memory questions of the HRS. As part of the health component, the self-reported memory questions asked participants to rate their memory at the present time (excellent, very good, good, fair, or poor) and compare their current memory to their memory 2 years ago (better, same, or worse).
Compared to the word recall task in the HRS, the online task in the WHV differed in several ways. The word lists were displayed visually on a computer device/browser as opposed to being spoken by an interviewer (basic literacy skills were required, with less reliance on verbal communication), respondents recalled words by typing them versus speaking them (basic typing skills were required, with less reliance on verbal communication), and words that sound the same can be spelled differently (eg, see vs sea and rock vs roc), which may make the WHV task more specific. In addition, the delay between the immediate and delayed recall tasks was longer (20 minutes vs 5 minutes) and the WHV version was purely cross-sectional, whereas HRS respondents may have completed the tasks up to seven times. Nevertheless, the study took all available steps to replicate the original HRS tasks.

Statistical Analyses
Demographic and descriptive statistics (Table 1) obtained on both groups were analyzed using independent sample t tests, Pearson chi-square, and one-way analyses of variance, where appropriate. In order to estimate the precision and accuracy of the two word recall tasks, Lin's concordance correlation coefficient (ρc) [30] was used to collectively compare the average frequency with which the WHV and HRS participants recalled each word. Unlike Pearson's correlation coefficient, which estimates only the linear covariation between variables, Lin's concordance quantifies the degree of agreement between two measures of the same variable by providing a measure of covariation and correspondence [30]. Finally, multivariate linear regression models adjusted for clustered errors (ie, multiple tasks per respondent) were used to estimate the associations between characteristics of each study sample and number of correctly recalled words for the immediate and delayed recall tasks. All analyses were conducted using Stata 13 software (StataCorp).
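To make the contrast with Pearson's correlation concrete, the following sketch computes Lin's concordance correlation coefficient in its standard form, ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), using 1/n moments. The function name and the example values are hypothetical, and the study's own estimates (including confidence intervals) were produced in Stata, not with this code.

```python
from statistics import fmean


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values.

    Uses population (1/n) variances and covariance. Unlike Pearson's r,
    the denominator penalizes a difference in means and in scale.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov_xy = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Made-up per-word recall likelihoods: the second series is the first plus 0.1,
# so Pearson's r would be 1 while the concordance is lower.
series_a = [0.5, 0.6, 0.7, 0.8]
series_b = [0.6, 0.7, 0.8, 0.9]
shifted_ccc = lin_ccc(series_a, series_b)
```

In this toy case the two series covary perfectly, yet the constant offset pulls the coefficient to roughly 0.71, illustrating why concordance is a stricter criterion than correlation.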

Overview
The WHV online survey had 4474 respondents, each of whom completed 1 word recall task. The HRS survey had 12,545 respondents who completed between 1 and 7 recall tasks. As shown in Table 1, WHV respondents differed significantly from HRS respondents along each characteristic. Overall, WHV respondents were more likely to be white or Hispanic, younger, and better educated and report excellent or very good memory compared to HRS respondents, possibly due to sampling from an online panel. Figure 1 is a scatterplot of the likelihood of immediate recall for each word by modality, which ranges from 0.49 to 0.85 for WHV respondents and 0.33 to 0.91 for HRS respondents. Out of the 40 words, 35 words had greater recall for the WHV versus HRS task with a mean difference of 11.82% (95% CI −0.31 to 0.08). At first glance, Lin's concordance correlation coefficient (ρc=.57, 95% CI 0.42-0.72) indicated mild correspondence. Once the likelihoods were normalized (ie, subtracting the sample mean and dividing by the standard deviation), Lin's concordance correlation coefficient increased to .79 (95% CI 0.67-0.91), indicating strong correspondence. Similarly, the delayed recall task showed Lin's concordance correlation coefficient with and without normalization that suggested strong concordance (ρc=.82, 95% CI 0.72-0.91 and ρc=.86, 95% CI 0.76-0.94, respectively; not shown).
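The effect of normalization on concordance can be illustrated with a small sketch. The values below are made up and are not the study's per-word likelihoods, and the use of the population standard deviation for the z-scores is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def zscores(values):
    """Subtract the sample mean and divide by the (population) standard deviation."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient with 1/n moments."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return 2 * cov_xy / (pstdev(x) ** 2 + pstdev(y) ** 2 + (mean_x - mean_y) ** 2)


# Hypothetical per-word recall likelihoods for two modalities
online = [0.55, 0.62, 0.70, 0.81]
interview = [0.40, 0.50, 0.58, 0.75]

raw_ccc = lin_ccc(online, interview)
normalized_ccc = lin_ccc(zscores(online), zscores(interview))
```

Normalizing each series removes the differences in mean and spread between modalities, so the coefficient then reflects only how similarly the words are ranked and spaced. With z-scores computed this way, the normalized coefficient coincides with Pearson's correlation.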
For the immediate and delayed recall tasks, this study assessed differences in association between the number of correctly recalled words by study sample and word list assignment (Table 2), as well as sociodemographic differences between samples (Table 3). Results from the regression analyses were interpreted using a base scenario that represents the median sociodemographic characteristics of the sample (ie, the average number of words that are correctly recalled by a white female aged 50-54 years who is married, has a high school diploma, and self-reports her current memory as good). For immediate or delayed recall, WHV respondents recalled significantly more words than HRS respondents, except for List 3 in delayed recall. For both WHV and HRS respondents, the number of correctly recalled words varied significantly depending on which list was assigned; however, these differences were small (<0.28 words).

Immediate Word Recall
Immediate word recall was significantly associated with respondent characteristics in WHV and HRS tasks, and there were significant modality differences between the online and HRS studies. Overall, WHV respondents immediately recalled about one more word (0.85) than HRS respondents did, after adjusting for respondent characteristics. In terms of demographics, age was significantly associated with immediate recall for the HRS task but not the WHV task. Specifically, younger respondents recalled more words than older respondents in the HRS tasks but not in the WHV tasks. Being non-white and/or Hispanic was significantly associated with reduced immediate recall for either modality; however, these associations were not significantly different by modality.
Levels of educational attainment were significantly associated with immediate recall for both the HRS and WHV tasks. Detrimental effects were seen for the lowest education level; respondents with less than a high school diploma recalled fewer words. The benefits of obtaining education beyond high school were incrementally significant, with the exception of WHV respondents who earned an associate's degree. Marital status was significantly associated with immediate recall in the HRS tasks but not the WHV tasks. Specifically, respondents who reported being partnered, separated, divorced, or never married recalled fewer words than their married counterparts. However, the only associations that differed significantly between modalities were those for individuals who were never married.
Self-reported current memory was significantly associated with immediate word recall in both modalities. As expected, those who reported their memory as excellent or very good were more likely to recall more words than those with a fair or poor memory. However, it is unclear whether those who reported excellent memory had better recall than those who reported very good memory. The association between a poor memory and immediate word recall was statistically significant with a noteworthy effect (1.53 words less than good memory). The association with fair or poor was greater for the WHV task than the HRS task, possibly because of interviewer biases (eg, slowing the task for persons who reported poor memory).

Delayed Word Recall
As with immediate word recall, the associations between respondent characteristics and delayed word recall were significant, and their associations differed by modality. Adjusting for respondent characteristics, WHV respondents recalled approximately 0.14 more words after a delay than HRS respondents. Like the immediate recall results, the association between age and delayed recall was significant for the HRS task but not the WHV task. For both modalities, respondents who were Non-white and/or Hispanic performed significantly worse on the delayed recall tasks, but the associations did not differ significantly.
Levels of educational attainment were significantly associated with delayed recall for both modalities and differed slightly from what was seen for the immediate recall task. Significant detrimental effects were no longer seen for WHV respondents with less than a high school diploma but persisted for HRS respondents. Higher levels of education beyond high school remained significantly associated with greater delayed recall, with the exception of WHV respondents who earned an associate's or advanced degree. The association between advanced education levels and recall was very strong for HRS respondents, who recalled approximately 0.50 more words compared to similarly educated WHV respondents. Marital status was significantly associated with delayed recall for the HRS modality but not the online modality. HRS respondents who reported being partnered, separated or divorced, or never married recalled significantly fewer words compared to married respondents. The associations between modalities were not significantly different.
Self-reported current memory was significantly associated with delayed word recall in both modalities. Similar to the immediate recall task, respondents who reported their memory as excellent or very good were more likely to recall more words than those with a fair or poor memory. The association between poor memory and delayed recall intensified for WHV respondents, who recalled nearly 2 words less compared to the base scenario and more than 1 word less compared to HRS respondents with a similar memory rating.
In order to explore the possibility that word recall scores for WHV respondents were influenced by literacy level and typing skills (ie, misspelled words would not be counted as correct), the previous analyses were rerun after correcting words that were misspelled by one letter. This arbitrary adjustment was based on the number of WHV responses that appeared to be related to misspellings (eg, doller for dollar) or mistyping (eg, ovean for ocean), and is akin to the best-judgment practice granted to HRS interviewers when determining whether an HRS response should be counted as correct (eg, woman for women or shoe for shoes). When the analyses were rerun using the spell-corrected word counts, no significant differences were seen for any of the results. Therefore, the results reported here are based on the uncorrected word recall responses for WHV respondents.
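One way to express a one-letter spelling tolerance is sketched below. It is a hypothetical reading of "misspelled by one letter": the sketch accepts a single substituted, missing, or extra letter, which may be broader or narrower than the correction actually applied to the WHV responses. The function names are illustrative.

```python
def within_one_letter(typed, target):
    """True if the strings are identical or differ by one substituted,
    inserted, or deleted letter (edit distance of at most 1)."""
    if typed == target:
        return True
    if abs(len(typed) - len(target)) > 1:
        return False
    # Make `short` the shorter (or equal-length) string
    short, long_ = (typed, target) if len(typed) <= len(target) else (target, typed)
    i = 0
    while i < len(short) and short[i] == long_[i]:
        i += 1
    if len(short) == len(long_):
        return short[i + 1:] == long_[i + 1:]  # one substitution
    return short[i:] == long_[i + 1:]          # one missing or extra letter


def score_with_tolerance(typed_responses, target_words):
    """Count target words matched within one letter; each target counts once."""
    remaining = {word.lower() for word in target_words}
    score = 0
    for response in typed_responses:
        cleaned = response.strip().lower()
        if not cleaned:
            continue
        match = next((t for t in sorted(remaining) if within_one_letter(cleaned, t)), None)
        if match is not None:
            remaining.discard(match)
            score += 1
    return score


tolerant_score = score_with_tolerance(["doller", "ovean", "tree"], ["dollar", "ocean", "shoe"])
```

A tolerance of this kind also accepts homophones that differ by one letter (eg, see for sea), so it trades some of the specificity of typed responses for robustness to typing errors.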

Principal Findings
This study compared and characterized the results of the WHV word recall task to those of the HRS word recall task in order to determine reliability for future surveys. The results of this study provide support for the inclusion of online cognitive assessments in health surveys. This is the first study attempting to replicate the HRS word recall tasks in an application-based assessment. The results indicate that the immediate and delayed word recall tasks were equivalent to the HRS tasks, as evidenced by high levels of concordance (precision) and association with self-reported memory (convergent validity). Even after controlling for age, education, and self-reported memory, WHV respondents recalled nearly one more word than HRS respondents for the immediate recall tasks. This difference decreased but remained significant for the delayed recall and may be attributed to study design differences or other unobservable sample selection biases. In summary, both HRS and WHV tasks appear to perform well despite key differences between the studies.
While our normalized results demonstrated a high level of concordance between the WHV and HRS tasks and thus support the primary goal of this study, we did note significant differences between samples that may be related to a number of potential confounders, such as differences in study design. For example, the HRS recall lists were presented verbally, whereas the words of the WHV lists were presented visually. Upon initial review, one may think that differences in how the brain processes auditory versus visual information may contribute to modality differences. However, research has shown that auditory and visual recall tasks activate overlapping regions of the brain, and while the left hemisphere of the brain is activated slightly more during visual tasks, there is no evidence that recall performance is impacted by modality [40].
An additional difference in study design is the length of time and type of activities that were completed by respondents between the immediate and delayed recall tasks. HRS respondents answered questions regarding their emotional state over the past week (eg, levels of motivation, happiness, and loneliness) and completed two mental math tasks (ie, counting backwards and subtracting 7s) for 5 minutes. WHV respondents completed a series of DCE tasks during the 20-minute delay, which may arguably require greater levels of cognitive engagement. These dissimilarities in the amount of delay and the complexity of the tasks completed during the delay may have contributed to the observed modality differences. The regression analysis may control for some of the sample selection issues, but panel and delay attributes may also explain differences by modality.
In addition to modality differences, there is a potential concern that practice effects bias the results of repeated word recall tasks, particularly since such effects can mask true declines in cognitive performance [41]. Practice effects have been associated with the cognitive data of the HRS [33,35]; however, the interpretation of these results is muddied by the complex methodology of the earliest waves of data collection. For example, Rodgers et al examined practice effects by comparing the word recall tasks of the 1993 and 1995 waves of the Asset and Health Dynamics Among the Oldest Old Study (AHEAD) to word recall performance in the 1998 and 2000 waves of the HRS (the AHEAD and HRS were merged in 1998 due to methodological and content similarities) [33]. Although significant practice effects were identified from wave 1 (1993) to wave 2 (1995) and from wave 2 to wave 3 (1998), none were identified from wave 3 to wave 4 (2000) [33]. The authors note these results are difficult to interpret given the considerable methodological changes that were made from wave to wave, the most notable of which is the implementation of the counterbalanced word recall list assignment in wave 2 of AHEAD (1995). Additionally, there is the possibility that the original word list used in 1993 was simply more difficult compared to word lists used in subsequent waves [33].
In a more recent analysis, McArdle et al found evidence of practice effects in cognitive data from earlier waves of the HRS (1992-2004) [35]; however, this result may also be affected by substantive changes in study design. Specifically, the word recall tests of 1992 and 1994 included only one word list with 20 nouns; the counterbalanced approach of randomly assigning one of four lists of 10 words was first implemented with the HRS in 1996. As with the results of the previous study, the presence of practice effects could be attributed to respondents receiving the same list of words in 1992 and 1994. Additionally, greater levels of recall in subsequent waves could be attributed to the fact that respondents may find it easier to recall 10 words as opposed to 20.
These methodological changes clearly restrict the interpretability of potential practice effects noted within the HRS. The results of the current study are less subject to such biases since the analyses are restricted to the 2000-2012 waves of the HRS (ie, the counterbalanced assignment of word recall lists is uniform across waves). Despite this counterbalanced approach, it is not possible to completely rule out the potential influence of practice effects. Future studies should attempt to measure the presence and impact of practice effects in the HRS using only the waves with identical methodological approaches.
We also found several interesting associations between episodic memory performance and sociodemographic characteristics. The effect of marital status on word recall was significant only for HRS respondents; individuals who were partnered, separated or divorced, or never married performed worse compared to those who were married. The presence of significant results in the HRS sample but not the WHV sample may be related to the fact that married/partnered HRS respondents are often interviewed one after the other. Previous research has indicated that spouses who are interviewed second may be at a disadvantage in free recall tasks [42], possibly due to the fact the first interviewed spouse may be healthier. Another possible explanation of these results is that those who are partnered have been shown to perform better on episodic memory tasks in general compared to non-partnered individuals [35].
Education was another sociodemographic characteristic that was significantly associated with word recall performance, with higher levels of education significantly predicting higher episodic memory performance. Higher levels of education are thought to influence cognitive function by increasing individual levels of brain and cognitive reserve [43]. Brain reserve refers to the inherent efficiency and capability of the brain to support and execute cognitive functions [43]. In contrast, cognitive reserve represents the brain's ability to maintain this efficiency despite the accumulation of structural and neural damage that occurs as a result of natural aging, disease, or injury [43]. Increased levels of cognitive reserve may be particularly beneficial during later stages of life [44][45][46]. Previous researchers have argued against controlling for the impact of education, stating that growing levels of education represent cohort trends that contribute to overall increases in cognitive performance [33]. However, it is possible that other factors associated with higher education (eg, increased socioeconomic status, better nutrition, greater availability of resources) may have contributed to this positive relationship.
Several computer-based cognitive batteries have been developed to date [47,48], but these have lacked correspondence to the HRS tasks used in large cohort studies. The goal of the current study was to develop an application-based cognitive measure for episodic memory that could be easily used in future research studies and health assessments. The potential benefits of such online tasks can be inferred from evidence showing that including short cognitive tests as part of a routine evaluation in the clinical or community setting aids in the early detection of cognitive decline. Individuals who self-report problems with memory may be more aware of adverse changes in cognitive performance [49]. Additionally, older adults who report problems with memory but perform normally have been shown to have structural brain changes similar to those seen in mild cognitive impairment [50].

Future Research
Future research should assess additional cognitive tasks included in the HRS. This type of research might expand the results of the current study to investigate the effects of setting (eg, waiting room, hospital room, home use of online tasks) or to support the use of routine online cognitive assessments to track cognitive change in healthy older adults or clinical populations. Furthermore, clear standards for measurement using online tasks, similar to those in the electronic patient-reported outcome literature, should be created [51]. Development of such standards is likely complicated by the fact that device and software technology continues to evolve and that age-related rates of cognitive change vary across a range of domains and birth cohorts with varying computer aptitudes [52,53].

Limitations
A key limitation of the study is the use of an existing panel in the community setting. While some may argue that sampling bias is introduced by using research panels whose members demonstrate high levels of technological capability (ie, use of computers, smartphones, tablets), it has also been noted that such panels allow researchers to collect large amounts of data from diverse populations [2]. A further limitation is the lack of access to medical records that would verify the quality of self-reported health. Older individuals tend to rate their health more highly than younger individuals despite increases in chronic medical problems [54][55][56], and this overestimation of health may inadvertently bias results. The biases associated with self-reported health and behavior measures are well documented; however, expanding the current research into clinical settings would alleviate this issue. Also, the community setting entails a lack of environmental control (eg, interruptions) that may increase variability. A future project may compare interview-based and application-based tasks in a clinical population (eg, Alzheimer patients) during set times. Additionally, the current study focuses on episodic memory; in order to obtain a more robust estimation of cognitive abilities, future efforts should identify the correspondence between interview-based and online versions of other cognitive assessments, such as measures of semantic memory and vocabulary.
The inability to monitor respondent behavior is a limitation of online and telephone surveys [1]. For example, respondents to online or telephone word recall tasks could have written down the words on paper as they were presented. Examination of eye-tracking or client-side paradata [57] (ie, information about respondent behavior recorded by respondents' computers, such as the number of times and locations of mouse clicks) has the potential to be extremely valuable in the analysis of online survey data. Nevertheless, further technological advancements are needed before such evidence can be incorporated into cognitive measures.
In summary, this study found a high level of convergent validity between the WHV and HRS word recall tasks, after controlling for age, education, and self-reported memory. Use of application-based cognitive assessments should continue to expand in community research and clinical settings, but greater efforts need to be made with regard to validating such online measures. Additionally, researchers should be wary of a number of potential biases, including modality differences, retest effects, and gender differences in cognitive performance.