Episodic and semantic memory impairments in (very) early Alzheimer’s disease: The diagnostic accuracy of paired-associate learning formats

Abstract
Paired-associate learning (PAL) paradigms measure memory processes sensitive to the medial temporal lobe, which shows atrophy in early Alzheimer’s disease (AD). PAL tests are not yet part of standard clinical procedure, nor are semantic memory tests. In early AD, impairments are relatively subtle. A literature review indicates that standard neuropsychological tests may not measure these impairments accurately. Therefore, I constructed new episodic and semantic memory tests. I investigated the diagnostic accuracy of these tests in 37 patients with amnestic mild cognitive impairment (aMCI; of whom 21 had converted to AD at 1.3-year follow-up), 43 early AD patients, and 80 non-demented controls. Main questions: (1) which tests best differentiate aMCI and AD from normal aging, most sensitively and most specifically?; (2) do PAL paradigms and/or semantic memory tests (fluency; naming) contribute to this differentiation? A free recall (non-PAL) test of unrelated words was most sensitive to aMCI and AD (91%), whereas a PAL recognition test (of semantically related word pairs of moderate association strength, including strongly related foils) was most specific (96%). Stepwise logistic regression analysis showed that differentiation was improved by a subordinate semantic fluency test. I conclude that a combination of episodic and semantic memory components best predicts AD. Future research should focus on comparing semantic and visuospatial PAL tests.

ABOUT THE AUTHORS
Pauline E.J. Spaan works part-time as an assistant professor of Clinical Neuropsychology at the University of Amsterdam, and part-time as a senior Clinical Neuropsychologist, conducting neuropsychological assessment and treatment, at the Department "Psychiatry and Medical Psychology" of the OLVG hospital, Amsterdam, and in her private practice.
Her research (PhD 2003) focuses on the early assessment of dementia and how various memory components may improve prediction. In particular, she is interested in the nature of semantic processing deficits in (early stage) Alzheimer's disease and how this may contribute to more reliable differential assessment.
Furthermore, she is interested in explaining patterns of age-related cognitive deficits within the normal aging spectrum: from young-old (55+) to very old age (up to 96 years old). More specifically, she studies (by means of structural equation modeling) the interplay between episodic and semantic memory components, on the one hand, and processing speed and executive functioning, on the other hand.

PUBLIC INTEREST STATEMENT
In very early Alzheimer's disease (AD), memory deficits differ only subtly from the memory problems seen in normal aging, non-AD dementia, or depression. Rather than administering the same tests that have been used in clinical practice for many years, we need to develop tests that capture this subtlety. To do this, we may benefit from insights from experimental psychology, especially from research into the episodic and semantic memory deficits that are specific to early AD. I first reviewed the literature in this field to explain why I constructed my tests the way I did. This particularly involved recall and recognition of word pairs that are semantically related (e.g. "pipe"-"cigar") but not "too obvious" (e.g. "pipe"-"smoke"), because overly obvious pairs would evoke correct answers as a result of guessing instead of actually remembering.
Results largely confirmed my theoretical expectations. This encourages further research and clinical application of these insights to improve early assessment of AD.

Introduction
The neuropsychological differentiation between normal aging and the early stages of Alzheimer's disease (AD) has been investigated for many years, both from the perspective of clinical neuropsychology and from a theoretical or experimental point of view. The clinical relevance of research into the predictors of AD is obvious: early identification of AD is crucial for optimal pharmacological treatment (e.g. Flicker, 1999; Jack & Holtzman, 2013), as well as for timely provision of (psycho)social care. The first symptoms of AD are neuropsychological deficits, which may be present up to several years before a clinical diagnosis of dementia can be made (e.g. Bäckman, Jones, Berger, Laukka, & Small, 2005; Salmon, 2012; Schmand, Huizenga, & van Gool, 2010). Therefore, the search for cognitive tests that most accurately detect the underlying pattern of brain dysfunction is of great importance. This search is of theoretical importance as well: it provides knowledge about the nature of the transition from normal aging to AD by offering information on disturbed cognitive processes and neuropathology.
Spaan et al. (2005) found that a verbal paired-associate learning (PAL) test, requiring cued recall of semantically related word pairs of moderate association strength, best predicted dementia (most likely AD) two years before diagnosis. This study will be discussed in more detail in Section 1.1.2. Several questions arose from this research. First of all, will the high predictive accuracy of the original PAL test be replicated when sample size is larger, and when cases are indeed in an early stage of AD, rather than dementia in general? Secondly, will the differentiation be improved by the addition of other types of episodic and/or semantic memory tests?
To investigate this, the original test battery was expanded with several newly developed episodic as well as semantic memory tests. This expansion was based on a review of the literature (Section 1.1), followed by a justification of the construction of these tests based on the literature (Section 1.2). Subsequently, in Section 1.3, I will describe the specific hypotheses of the current study.

Episodic memory: the differential value of PAL tests
Various large-scale longitudinal studies have reported on the predictive value of well-known clinical tests of episodic memory (e.g. Bennett et al., 2002), particularly of measures of delayed (free) recall (e.g. Bäckman et al., 2005; Perri, Serra, Carlesimo, Caltagirone, & The Early Diagnosis Group of the Italian Interdisciplinary Network on Alzheimer's Disease, 2007). The diagnostic and prognostic accuracy of memory tests may increase if modern paradigms are used (Rentz et al., 2013). In particular, PAL paradigms measure memory processes sensitive to the functional integrity of the medial temporal lobe (MTL) (e.g. Blackwell et al., 2004; Deweer et al., 1995; Lindeboom, Schmand, Tulner, Walstra, & Jonker, 2002; Lowndes & Savage, 2007; de Rover et al., 2011; Troyer et al., 2008). This brain area is the prime location of early AD pathology (Braak & Braak, 1991). Paired-associate learning entails the binding of many features of an experience to form a memory trace (e.g. Cohen et al., 1999; Eichenbaum, 1999).

Verbal paired-associate learning tests
The Wechsler memory scale (WMS; Wechsler, 1987, 1997b, 2009) contains a verbal PAL test. The WMS-R subtest (1987) requires cued recall of both semantically related and unrelated word pairs, whereas in the WMS-III subtest (1997b) the related pairs were omitted to increase diagnostic sensitivity. The reduced sensitivity of the related word pairs may be due to the too obvious associations (e.g. rose-flower). Targets with such overlearned associations to the cue should be avoided, because the answer can easily be guessed (Elias et al., 2000; Lowndes & Savage, 2007). Moreover, superordinate semantic category knowledge (e.g. rose-flower) is intact in AD (Rogers & Friedman, 2008), which may explain the normal performance of AD patients on the strongly related semantic word pairs of the WMS.
Nonetheless, in the WMS-IV PAL subtest (2009), four "easy" (semantically related) pairs (door-open; sky-cloud; city-town; street-road) were included again, in addition to ten "hard" (semantically unrelated) pairs (e.g. hot-quiet, bag-truck, or day-box). It should be noted that the easy and hard conditions differ in three respects here: degree of semantic association, degree of visual imaginability, and grammatical classification (nouns vs. adjectives or adverbs). The easy condition includes strongly semantically related pairs, 87.5% are visually imaginable words, and also 87.5% are nouns. The hard condition includes semantically unrelated pairs, only 50% are visually imaginable words and 75% are nouns. This mixture of word characteristics complicates the search for cognitive tests that most accurately detect the underlying pattern of brain dysfunction in early AD. If one condition appears to be more sensitive to early AD than the other, it is impossible to tell which of the word characteristics is responsible for the difference in sensitivity. In addition, the easy condition reintroduces the disadvantage of too obvious associations. Not surprisingly in view of the aforementioned argument, Pike et al. (2013) found that the delayed recall trial of the California Verbal Learning Test (CVLT-II; Delis, Kramer, Kaplan, & Ober, 2000) discriminated better between amnestic mild cognitive impairment (aMCI) and normal aging than the WMS-IV PAL subtest. Pike et al. nevertheless suspected that (these) word pairs are less suitable in taxing the MTL system than, for example, a visuospatial PAL test.
As was already mentioned above, but contrary to Pike et al., Spaan et al. (2005) found that a verbal PAL test best predicted the transition to dementia within two years in a heterogeneous sample of initially non-demented elderly participants (N = 119). At baseline, memory performance of the persons who later converted to dementia (N = 9; most likely AD) was characterized by a reduced ability to benefit at recall from semantic relations between words. The strength of semantic relations between cues and targets was moderate according to word association norms (de Groot, 1980). In terms of Rogers and Friedman (2008), the word pairs never had a superordinate category relationship, but were usually "attributes" (e.g. tap-bath) and sometimes "category coordinates" (e.g. pipe-cigar). In this way, correct answers as a result of overlearned associations and guessing were prevented. This verbal PAL test had better predictive validity than tests of non-PAL episodic memory (i.e. free recall and recognition of unrelated words: the 10-Word List-Learning Test (10WLLT) and a task similar to the 10-Word-Recognition Test (10WRT), also included in the present study), category fluency (similar to the Main-Category Fluency Test (MCFT) included in the present study), working memory (digit span and block span), and implicit memory (two priming measures derived from a word-stem completion task and a perceptual identification task, and a skill learning (mirror-reading) task). More detailed information is described in Spaan et al. (2005).

The role of semantic memory
The finding that early AD patients are less able to take advantage of semantic relations between words may be explained by poor semantic encoding (e.g. Buschke, Sliwinski, Kuslansky, & Lipton, 1997). More precisely, Sailor, Bramwell, and Griesing (1998) suggested that AD patients have a deficit in the ability to evaluate semantic relations. They may no longer be able to discriminate between two related concepts, because the attribute knowledge (i.e. physical features or function) that distinguishes these two concepts has been lost. This seems consistent with the findings of Rogers and Friedman (2008) that AD patients encode information only at a rather global or superordinate level. In other words, memories become schematized, semanticized, or gist-like, with a loss of contextual details (Winocur & Moscovitch, 2011). These theoretical accounts may also explain why AD patients often produced semantically related but incorrect alternatives on a PAL test: e.g. smoke instead of cigar in response to the cue word pipe, or water instead of bath in response to tap. These intrusions are overlearned associations to the cue, and consequently have higher word association frequencies than the target words (de Groot, 1980). This raises the question whether specific tests of semantic memory, such as naming or category fluency, contribute to the differentiation between AD and normal aging, in addition to episodic memory tests. This was found in several studies in which MCI or preclinical AD patients were compared to controls (e.g. Bennett et al., 2002; Dudas, Clague, Thompson, Graham, & Hodges, 2005; Hirni, Kivisaari, Monsch, & Taylor, 2013; Lambon Ralph, Patterson, Graham, Dawson, & Hodges, 2003; Mickes et al., 2007). These results suggest dysfunction of the medial perirhinal cortex (Hirni et al., 2013) and beyond the MTL, extending to the lateral temporal lobes, in an early stage of AD (e.g. Domoto-Reilly, Sapolsky, Brickhouse, Dickerson, & for the Alzheimer's Disease Neuroimaging Initiative, 2012; Dudas et al., 2005; Hänggi, Streffer, Jäncke, & Hock, 2011; Mickes et al., 2007).

Integration of episodic and semantic memory: in a PAL format
Alternatively, one could argue that the essence of memory decline in early AD is not a matter of dysfunction of episodic and/or semantic memory, as separate or parallel impairments. Nowadays, memory theorists consider these two forms of memory (initially proposed by Tulving, 1972) as interdependent (e.g. Irish, Addis, Hodges, & Piguet, 2012). Both at encoding and at retrieval, the interdependencies between episodic and semantic memory have been demonstrated (see Greenberg & Verfaellie, 2010, for a review). For instance, new episodic learning is facilitated by an intact semantic knowledge base (e.g. Kan, Alexander, & Verfaellie, 2009), and damage to the semantic network complicates the acquisition of new episodic memories, at least in the verbal modality (e.g. Graham, Simons, Pratt, Patterson, & Hodges, 2000). Semantic memories are the basic material from which complex, meaningful, and detailed episodic memories are constructed, whether one is engaged in autobiographical retrieval of the past, or in constructing a plausible scenario of an event in the future (Irish & Piguet, 2013). Consequently, the degeneration of semantic memory is associated with impoverished and overgeneral episodic memory (e.g. Maguire, Kumaran, Hassabis, & Kopelman, 2010).
Winocur and Moscovitch (2011) also emphasized the dynamic nature of memory and its underlying functional and neural interactions. Their transformation hypothesis proposes that the progression of memories from hippocampal to extra-hippocampal structures entails a loss of detailed, contextual features; as a result, memories become semanticized or gist-like, as was noted above. Memories continue to need the hippocampus to retain contextual details, which is problematic in early AD.
When we adopt these theoretical accounts, we may better understand memory problems of early AD patients as a disturbance of the interactive process between episodic and semantic memory, instead of (only) as isolated episodic and semantic memory deficits. A verbal PAL test is ideal to capture this interaction. Thus, a PAL test that uses semantically associated words (but not word pairs that represent overlearned or superordinate category relationships, or other very strong associations) may simultaneously capture three aspects of memory functioning that are impaired in early AD: (1) the episodic, MTL- or hippocampus-based memory process, (2) the semantic, extra-hippocampally based memory process, and (3) the interaction between the two.
In other words, I recommend using tests that require paired-associate learning of semantic relations that are not overlearned or superordinate. Instead, one should use word pairs that are moderately associated, such as category coordinates or attributes, rather than unrelated word pairs. Based on the literature review above, these task characteristics may optimize sensitivity to early AD.

Episodic memory: the differential value of recognition formats
Another refinement can be made in cued recall PAL tests to improve their specificity for MTL-related memory (encoding) deficits (e.g. Spaan, Raaijmakers, & Jonker, 2003). Because information must be encoded to be later recognized as familiar, recognition formats measure encoding rather than retrieval capacity. This results in normal or near-normal recognition performance in neurological or neuropsychiatric conditions that do not primarily involve MTL dysfunction (e.g. Parkinson's disease, depression) vs. reduced recognition performance in patients with MTL-related memory deficits such as AD (e.g. Lowndes & Savage, 2007) or amnestic MCI (e.g. Troyer et al., 2012). In persons suffering from non-MTL-related memory problems, free and cued recall usually show greater impairments than recognition memory. Recognition formats may, therefore, improve specificity of assessment, relative to a cued recall format (e.g. Bennett, Golob, Parker, & Starr, 2006; Lowndes & Savage, 2007; Lowndes et al., 2008; Pike, Rowe, Moss, & Savage, 2008). Lowndes and Savage (2007) argued that a PAL task with a recognition format (i.e. a paired-associate recognition test: PART) will be both sensitive and specific to early memory change in AD. Such a test requires structural integrity of the hippocampus (Troyer et al., 2012). Therefore, I further refine my expectations regarding PAL tests as follows. Paired-associate recognition tests that call upon the encoding of word pairs of moderate semantic association strength (rather than overlearned or superordinate associations) might be particularly efficacious in the prediction of early AD. The recognition format ensures that persons without MTL damage will produce the correct answers, resulting in high specificity. I recommend the use of distractors that are superordinate associations or words that are strongly related to the cue. This will invite AD patients, and patients with any other MTL damage, to select the wrong answers, resulting in higher sensitivity.

From theory to practice: efforts to improve construct validity of memory tasks
For a valid assessment of the component processes of memory, it is crucial to administer tests that reflect these processes as purely as possible, preferably in a form that is well tolerated by elderly participants. Unfortunately, tests that are commonly used in clinical practice and in large-scale longitudinal studies do not always meet these requirements. I will discuss several disadvantages of existing tests and ways in which they may be overcome.
First of all, immediate recall measures of episodic memory such as list-learning tests are often confounded by short-term memory or attentional components as a result of recency effects. In early AD, short-term memory is unimpaired (e.g. Spaan et al., 2003). Therefore, sensitivity will improve when a test is constructed that minimizes the impact of recency effects and, instead, measures long-term memory as purely as possible. For the same reason, delayed recall measures are better predictors of AD than immediate recall measures (e.g. Bäckman et al., 2005; Perri et al., 2007). From clinical as well as empirical evidence (e.g. Bengner & Malina, 2007; Carlesimo, Marfia, Loasses, & Caltagirone, 1996), it is known that items presented last on the list are often not recalled at the delayed recall trial. However, a delayed recall trial requires a time interval that, in clinical practice, is filled in various ways: e.g. by other tests and/or breaks, which inadvertently cause varying lengths of the interval as well as varying sorts of interference, ultimately increasing measurement error. Yet, recency effects may easily be avoided in elderly subjects, for example by a 20-s Brown-Peterson distraction task between the study and the test phase. This task requires participants to count backwards by threes. This activity itself draws on short-term memory and removes the most recent items from short-term memory, leaving recall of items from long-term memory (e.g. Baddeley & Hitch, 1977). In this way, the task purity of the episodic memory test is improved, delayed recall testing is no longer necessary, and its accompanying disadvantages are avoided. These task characteristics were already implemented in the study by Spaan et al. (2005).
Secondly, the examination of naming to confrontation (as a measure of semantic memory, which reflects an expansion compared to Spaan et al., 2005) could be improved as well. Naming tasks mostly utilize pictures (as in the Boston naming test), which may introduce the unintended impact of visuoperceptual deficits (e.g. Au et al., 1995; Goulet, Ska, & Kahn, 1994). Additional administration of a task that requires naming in response to verbal definitions may contribute to the diagnostic evaluation and may also be more representative of word finding complaints in daily life. Furthermore, speeded tasks that record naming latencies might be more sensitive measures of word finding capacities than unspeeded naming (e.g. Eustache, Desgranges, Jacques, & Platel, 1998).
Thirdly, category fluency tests may better reflect semantic memory functioning when more specific, subordinate semantic cues are provided (e.g. birds) instead of the usual superordinate cues (e.g. animals). In this way, a purer semantic memory measure is obtained, on which early AD patients might perform more poorly than on superordinate semantic categories (consistent with the results of Rogers & Friedman, 2008). In addition, a subordinate category fluency test is less influenced by self-initiated retrieval or executive control processes (e.g. Mayr & Kliegl, 2000). These task characteristics also reflect an expansion compared to Spaan et al. (2005).

This study
The present study addresses the question which types of memory test differentiate best between normal aging and aMCI or the early stage of AD. I investigated which type of episodic memory test (free recall (i.e. non-PAL), cued recall, or recognition (PAL-based or not), with or without semantic associations; see Section 1.1) and which type of semantic memory test (naming pictures or verbal definitions, with accuracy and latencies as measures; superordinate or subordinate category fluency or phonemic fluency; see Section 1.2) differentiate best between normal aging and (pre)clinical AD. More specifically (as an expansion of the current study in comparison with the study by Spaan et al., 2005), I examined whether the newly developed paired-associate recognition test (PART; see Section 1.1.5), and perhaps also the additional semantic memory tests (see Section 1.1.3), contribute to the differentiation.
I expected that PAL of moderately semantically related words (see Section 1.1.4), rather than non-PAL episodic memory tests or semantic memory tests per se (i.e. tests without an episodic component), would be most sensitive to early AD, and that paired-associate recognition with strongly related distractors (i.e. the PART; see Section 1.1.5) would be most specific for early AD. This hypothesis was tested by comparing test performance of early AD and aMCI patients vs. elderly normal controls (NC).

Normal controls (NC)
Participants were 171 community-dwelling volunteers between 55 and 96 years of age, recruited from different municipalities in the Netherlands through flyers and referrals from other participants.
Participants were screened for history of stroke (N = 5), traumatic brain injury (N = 1) or other neurological (N = 1) or psychiatric causes of cognitive dysfunction including substance abuse (N = 3). These subjects were excluded. Persons who used psychotropic medication (N = 4), did not have Dutch as native language (N = 1), had impaired vision that interfered with test performance (N = 1), or had missing values on more than two measures (N = 4) were also excluded. From the remaining sample of 151 NC, participants were selected who could be individually matched for age, education, and gender to individuals of the included sample of patients that is described below. Matched pairs were allowed to differ by a maximum of five years in age and one code of education (see Note "a" in Table 1). Prior to the test session, participants gave written informed consent. The institutional ethical review board of the department of psychology of the University of Amsterdam approved the study. Table 1 presents the characteristics of the final sample.
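The individual matching described above can be illustrated with a minimal sketch. It is not the procedure that was actually used: the greedy nearest-age strategy, the `Person` record, the numeric education codes, and the function name `match_controls` are assumptions introduced here for illustration. Only the tolerances (same gender, at most five years difference in age, at most one education code) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    pid: str
    age: int
    education: int  # ordinal education code (hypothetical coding)
    gender: str


def match_controls(patients, controls, max_age_diff=5, max_edu_diff=1):
    """Greedily pair each patient with the closest unused eligible control.

    Returns a dict mapping patient id to control id (or None if no
    control satisfies the tolerances).
    """
    available = list(controls)
    pairs = {}
    for p in patients:
        eligible = [
            c for c in available
            if c.gender == p.gender
            and abs(c.age - p.age) <= max_age_diff
            and abs(c.education - p.education) <= max_edu_diff
        ]
        if not eligible:
            pairs[p.pid] = None
            continue
        # Prefer the smallest age difference, then the smallest education difference.
        best = min(
            eligible,
            key=lambda c: (abs(c.age - p.age), abs(c.education - p.education)),
        )
        available.remove(best)  # each control is used at most once
        pairs[p.pid] = best.pid
    return pairs
```

A greedy pass of this kind is simple but order-dependent; an optimal assignment method could yield more matched pairs from the same pool.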

Patients
Patients were recruited from various hospitals, to which they had been referred because of memory complaints. All patients (N = 94) underwent comprehensive clinical, neuropsychological, neuroradiological, and laboratory assessment to objectify possible causes of memory complaints. All patients were administered a comprehensive battery of standard neuropsychological tests for the purpose of clinical assessment. The administration of these tests was independent of the administration of the computerized battery of episodic and semantic memory tests that was the focus of the present study. Consistent with the criteria of the above-mentioned ethical review board, all patients gave informed consent prior to the test session.

Amnestic mild cognitive impairment:
For patients to be included in the aMCI group, the following diagnostic criteria (Albert et al., 2011; Petersen et al., 1999, 2001; Winblad et al., 2004) had to be met: (1) subjective memory complaints (reported by the patient and/or a significant other in a semi-structured clinical interview); (2) objective memory decline (according to the standard neuropsychological evaluation; see description below); (3) no significant decline on other cognitive functions (according to the standard neuropsychological evaluation); (4) no significant impact on carrying out daily tasks that affects independent functioning (as reported by a significant other of the patient); (5) no dementia. Exclusion criteria were a history of stroke (N = 2); neurological disease other than AD (N = 1); serious medical illness (N = 2) or psychiatric conditions (N = 1) that could cause or worsen cognitive dysfunction. Ultimately, 37 aMCI patients were included. Scores on the Mini-Mental State Examination (MMSE; Folstein, Folstein, & McHugh, 1975) ranged between 23 and 30 (M = 26.11, SD = 1.96).
Consistent with the MCI criteria (e.g. Albert et al., 2011), the aMCI patients performed deficiently on the standard memory tests (criterion 2; i.e. at least 1.5 SD (pct. < 7) below age and education corrected normative data), whereas they were not impaired on other cognitive domains (i.e. they performed on (low-)average level according to age and education corrected normative data; criterion 3; data not shown for reasons of brevity, but these can be provided on request). The patient's neurologist or geriatrician made the final diagnosis according to the MCI criteria based on the standard clinical and neuropsychological evaluation, neuro-imaging results (MRI or CT), and lab results to exclude other causes of memory decline. This assessment took place after the entire test administration. The neurologist or geriatrician was blind to performance on the episodic and semantic memory measures that were the focus of the present study.
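The correspondence between the cut-off of 1.5 SD below the normative mean and a percentile below 7 can be checked with a short sketch. It assumes normally distributed normative scores; the function names and the example normative mean and SD are hypothetical and are not taken from the test manuals used in the study.

```python
from statistics import NormalDist


def percentile_below(z):
    """Percentage of a standard normal population scoring below z."""
    return 100 * NormalDist().cdf(z)


def meets_memory_criterion(score, norm_mean, norm_sd, cutoff_z=-1.5):
    """True if the score lies at least 1.5 SD below the normative mean."""
    z = (score - norm_mean) / norm_sd
    return z <= cutoff_z


# A z-score of -1.5 corresponds to roughly the 6.7th percentile, i.e. pct. < 7.
print(round(percentile_below(-1.5), 1))
```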
The aMCI patients were suspected to develop AD in the upcoming years. This was supported by the observation that, at the time of writing and according to a review of the clinical files, 21 out of 37 patients (57%) had converted to AD according to consensus diagnostic criteria (McKhann et al., 1984; mean interval: 1.3 years). These patients are labeled in this study as the "preAD" (preclinical AD) patients. No systematic follow-up could be organized due to practical circumstances. Therefore, the data of this sub-sample should be regarded as exploratory. To be more specific, relevant diagnostic follow-up information was unknown for several patients because they had their medical checkup at another institution or initial assessment took place too recently (N = 10), or because other (medical) conditions emerged that were more urgent (N = 3). The remaining 3 aMCI patients did not return to the hospital (for reasons unknown).

Alzheimer's disease (AD):
The diagnosis of AD was made by the patient's neurologist or geriatrician according to the NINCDS-ADRDA criteria of AD (McKhann et al., 1984), based on standard neuropsychological evaluation, neuro-imaging results (MRI or CT), and lab results to exclude other causes of memory decline. Excluded were patients with a history of stroke (N = 2); neurological disease other than AD (N = 1); serious medical illness (N = 3) or psychiatric conditions (N = 2) that could cause or exacerbate cognitive dysfunction. Ultimately, 43 AD patients were included (MMSE scores ranging between 20 and 29: M = 23.58, SD = 2.26). Table 1 reports the characteristics of the final sample of 37 aMCI, 43 AD patients, and 80 matched NC.

General test procedure
Patients were tested in the hospital by a clinical neuropsychologist or a certified test technician. The NC participants were tested in their home environment by a trained neuropsychology student. The episodic and semantic memory tests were administered by means of a computer program that was operated by the experimenter. The participant only had to look at the screen, on which the stimuli were presented. The participant had to respond orally; the experimenter registered the responses. Interference between tests was prevented: e.g. there were no overlapping stimuli between tests; tests of verbal and visual nature were alternated. The sequence of administration of the tests was the same for all participants.

Test battery
Below, the episodic and semantic memory tests of the battery are briefly described. A detailed description can be found in Appendix A. Stimuli of the newly developed episodic memory tests are presented in Appendix B. See Section 1.2 for a more detailed justification of the construction of these tests, based on the literature.

Episodic memory

10-Word list-learning test.
This test required free recall of 10 unrelated words in three trials. Between presentation and recall phase, a 20-s distraction task prevented recency effects.

10-Word-recognition test.
This test involved explicit recognition ("yes" or "no") of the words of the 10WLLT among 20 related distractors.

Paired-associate learning test: semantic pairs.
This measure consisted of cued recall of five related word pairs, constructed in the same format as the 10WLLT.

Paired-associate learning test: non-semantic pairs.
This measure consisted of cued recall of five unrelated word pairs.

Paired-associate recognition test.
This test involved explicit recognition (forced choice) of the target of the PALT, among three related distractors, in response to the cue.

Main-category fluency test.
This measure required the generation of exemplars from the superordinate categories of animals and products to buy in a shop within 60 s per category.

Sub-category fluency test.
This measure required the generation of exemplars from the subcategories of birds, fishes, insects, products to buy at the greengrocery, clothes shop, and do-it-yourself shop within 30 s per subcategory.

[...] and required naming of 48 descriptions of words.

Verbal naming test: response time.
This measure represented the average response time over the correctly named descriptions of the verbal naming test.

Statistical analysis
First, paired t tests were performed to test for significant performance differences between the patients (the 37 aMCI and 43 AD patients separately) and their matched controls on the episodic and semantic memory measures described above. Cohen's d was calculated to provide information on effect size of each measure. In addition, sensitivity and specificity of classification by each measure was determined by receiver operating characteristic (ROC) analyses, providing d′ as well as area under curve (AUC) values. These analyses were performed over the largest groups of patients (the 37 aMCI and the 43 AD patients together) and their controls.
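The measures named above can be illustrated with a minimal sketch using only the Python standard library. The formulas are textbook definitions and are not taken from the statistical package used in the study; in particular, the pooled-SD form of Cohen's d, the convention that lower scores indicate worse performance, and all function names are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev


def cohens_d(controls, patients):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(controls), len(patients)
    pooled_var = (
        (n1 - 1) * stdev(controls) ** 2 + (n2 - 1) * stdev(patients) ** 2
    ) / (n1 + n2 - 2)
    return (mean(controls) - mean(patients)) / pooled_var ** 0.5


def sensitivity_specificity(controls, patients, cutoff):
    """Scores at or below the cut-off are classified as 'patient'."""
    sensitivity = sum(s <= cutoff for s in patients) / len(patients)
    specificity = sum(s > cutoff for s in controls) / len(controls)
    return sensitivity, specificity


def auc(controls, patients):
    """Probability that a random control outscores a random patient.

    Ties count as 0.5; this equals the area under the ROC curve.
    """
    wins = sum((c > p) + 0.5 * (c == p) for c in controls for p in patients)
    return wins / (len(controls) * len(patients))


def d_prime(sensitivity, specificity):
    """Signal-detection d': z(hit rate) minus z(false-alarm rate).

    Undefined when either rate is exactly 0 or 1.
    """
    z = NormalDist().inv_cdf
    return z(sensitivity) - z(1 - specificity)
```

For example, with hypothetical scores `controls = [8, 9, 7, 10, 9]` and `patients = [4, 6, 5, 7, 3]`, a cut-off of 6 classifies four of the five patients and all five controls correctly.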
Lastly, stepwise logistic regression analyses (using a forward likelihood ratio method) were performed to determine which combination of tests differentiated best between normal aging and early AD to answer the central research questions (see Section 1.3). In the "stepwise" method, the test measure that most successfully differentiates patients from controls is selected first. Subsequently, the program selects any further test measure(s) that significantly improve(s) the prediction (or group classification). Selected test measures will be presented in decreasing order of differential ability. Four sets of comparisons were made: (1) 43 AD patients vs. 43 matched NC participants; (2) 43 AD patients plus 37 aMCI patients vs. 80 matched NC participants; (3) 21 aMCI patients who retrospectively converted to AD according to clinical assessment at follow-up (i.e. the exploratory sub-sample of "preAD" patients) vs. 21 matched NC participants; and (4) 37 aMCI patients vs. 37 matched NC participants. Effect size of all final models was measured by Nagelkerke R² (Bewick, Cheek, & Ball, 2005).
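For readers unfamiliar with the effect size used here, the sketch below shows how Nagelkerke R² can be derived from the log-likelihoods of an intercept-only model and a fitted logistic regression model. The null log-likelihood follows from the balanced design of comparison 2 (80 patients vs. 80 controls, so the intercept-only model predicts a probability of .5 for everyone). The fitted log-likelihood of -30.0 is a made-up value for illustration, not a result from this study, and the function name is hypothetical.

```python
from math import exp, log


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2: the Cox-Snell R^2 rescaled by its maximum
    attainable value so that a perfect model reaches 1."""
    cox_snell = 1 - exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - exp(2 * ll_null / n)
    return cox_snell / max_cox_snell


n = 160                      # 80 patients + 80 matched controls
ll_null = n * log(0.5)       # intercept-only model in a balanced design
ll_model = -30.0             # hypothetical fitted log-likelihood

r2 = nagelkerke_r2(ll_null, ll_model, n)
lr_chi2 = 2 * (ll_model - ll_null)   # likelihood ratio statistic vs. null
print(f"Nagelkerke R2 = {r2:.3f}, LR chi-square = {lr_chi2:.1f}")
```

The same likelihood ratio statistic, computed between nested models, is what a forward likelihood ratio procedure uses to decide whether adding a test measure significantly improves the prediction.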

Treatment of missing data
The NC participants had no missing values; 0.9% of the data of the aMCI patients and 2.3% of the data of the AD patients were missing because of fatigue. The missing values were estimated using a regression approach, based on the other variables that best predicted performance on the measure concerned.
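The general idea of regression-based imputation can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single predictor and ordinary least squares, whereas the text only says that the best-predicting other variables were used. The data and function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute(predictor, target):
    """Replace missing target scores (None) with the value predicted
    from a regression fitted on the complete cases."""
    complete = [(x, y) for x, y in zip(predictor, target) if y is not None]
    slope, intercept = fit_line(*zip(*complete))
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]


# Hypothetical scores on two measures; the third target score is missing.
predictor = [2, 4, 6, 8, 10]
target = [5, 9, None, 17, 21]

filled = impute(predictor, target)
print(filled)
```

Observed scores are left untouched; only the missing entries are replaced by model predictions.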

Differential characteristics of each episodic and semantic memory measure
Performance on the episodic and semantic memory measures is presented in Table 2 for the samples of 37 aMCI patients, 43 AD patients, and their 80 matched NC. The estimated reliability per memory measure is also presented because I used several newly developed tasks or adapted versions of well-known tasks. All measures had a high level of internal consistency or test-retest reliability. All measures showed a significant performance difference between the patients and their controls in the expected directions. The sample of AD patients, as well as the combined group of AD and aMCI patients, performed worse (p < .001) on all measures. Effect sizes were large for all episodic memory measures, both category fluency measures, and the total correct score on the visual naming test (aMCI + AD vs. NC: Cohen's d values ranging from .82 to 1.93) and medium for the remaining measures (Cohen's d values ranging from .51 to .68). The aMCI patients also showed large performance differences on most measures compared to their controls (p < .001), with smaller differences on naming and on the letter fluency test (LFT). Effect sizes were large for all episodic memory measures, both category fluency measures, and the response time on the visual naming test (Cohen's d values ranging from .84 to 1.62) and small for the remaining measures (Cohen's d values ranging from .36 to .42).
Table 3 presents the classification characteristics of each memory measure for the aMCI and AD patients together vs. their controls. The paired-associate recognition test (PART) had the highest differential ability. Consistent with my hypothesis described in Sections 1.1.5 and 1.3, this was mainly due to the very high level of specificity, whereas sensitivity was still satisfactory. Only the 10-word list-learning test (10WLLT) and the number of semantic pairs recalled on the paired-associate learning test (PALT-s) had higher sensitivity (the latter, but not the former, is also consistent with my hypothesis described in Sections 1.1.4 and 1.3). The 10WLLT also had high specificity. The "Paired-Associate Learning Test: non-semantic pairs" (PALT-ns) differentiated less well. Performance of aMCI and AD patients on this test was close to floor level, whereas performance of NC participants was rather variable (see Table 2; this seems consistent with the rationale described in Sections 1.1.2 and 1.1.4). The other recognition test (10WRT) had a sensitivity level similar to the PART, but its specificity was lower.

Table 2. Mean performance and estimated reliabilities of the episodic and semantic memory measures, per clinical group
Notes: Possible ranges per memory measure are presented between parentheses. 10WLLT = 10-word list-learning test. PALT-s = paired-associate learning test: semantic pairs. PALT-ns = paired-associate learning test: non-semantic pairs. 10WRT = 10-word-recognition test. PART = paired-associate-recognition test. MCFT = main-category fluency test. SCFT = sub-category fluency test. LFT = letter fluency test. VsNT-tc = visual naming test: total correct. VsNT-rt = visual naming test: response time. VbNT-tc = verbal naming test: total correct. VbNT-rt = verbal naming test: response time. NC = normal controls. aMCI = amnestic mild cognitive impairment. AD = Alzheimer's disease.
a Cronbach's alpha determined over a sample (N = 341), also including excluded participants.
b Because Cronbach's alpha could not be calculated for these measures, test-retest reliability (including level of significance) was calculated instead, derived from 24 NC that were retested two years later.
c 95% confidence interval for the mean.
Paired t tests over NC, aMCI, and AD groups. *Significant difference at p < .05. **Significant difference at p < .01. ***Significant difference at p < .001 between: 1 NC and aMCI; 2 NC and AD.

Concerning the semantic memory measures, only the sub-category fluency test (SCFT) showed adequate levels of sensitivity and specificity. This seems consistent with the rationale described in Sections 1.1.3 and 1.2. The MCFT and particularly the LFT had much lower levels of sensitivity and specificity. Sensitivity of the naming measures was also low, although specificity of the total correct score on the visual naming test (VsNT-tc) was reasonable.
For reasons of comparison, the classification characteristics of the MMSE were also determined (see Table 3). Note that the MMSE is an important variable in clinical dementia assessment, and group classification is partly dependent on the MMSE. Nevertheless, the differential ability of the MMSE does not (clearly) exceed that of the PART, 10WLLT, PALT-s, and SCFT. In particular, the sensitivity of the 10WLLT and the specificity of the PART are clearly better than those of the MMSE.

Memory measures best predicting Alzheimer's disease
Finally, I investigated which combination of tests most accurately classified the early AD patients among non-demented controls. Table 4 shows the best differentiating tests, in decreasing order of differential ability, according to "stepwise" logistic regression analyses. These analyses were performed over various subsamples of aMCI and AD patients and their matched controls, to investigate whether the best differentiating tests varied according to severity of symptoms (i.e. the proportion of clinically diagnosed AD patients).
In comparisons of AD patients vs. controls and AD plus aMCI patients vs. controls (comparisons 1 and 2; Table 4), the PART was selected first by the "stepwise" method. Thus, it may be concluded that the PART differentiated best when the patient sample included a relatively high proportion of clinically diagnosed AD patients. Also in aMCI patients who were retrospectively diagnosed with AD (comparison 3: the preAD subgroup), the PART contributed significantly to the differentiation. This seems consistent with its high specificity (see Section 3.2), and with the second part of my hypothesis. In comparison 4, the PART was not selected as one of the best differentiating tests. A closer look at the results showed that the PART did contribute significantly to aMCI classification accuracy, but at step 1 the 10WLLT (score 39.55, p < .001, vs. 38.52, p < .001) contributed slightly better, and at step 2 the PALT-s (score 13.94, p < .001, vs. 10.05, p = .002) contributed slightly better (see Table 4, Note b). Therefore, I conclude that the PART is certainly not irrelevant, but slightly less successful in detecting AD at a very early or preclinical stage. It is not certain, however, that all aMCI patients were in fact in a preclinical stage of AD (as is certain in the preAD subgroup).

Table 4. Combination of episodic and semantic memory measures best differentiating between each of four samples of patients with (preclinical) Alzheimer's disease or amnestic mild cognitive impairment and cognitively healthy elderly controls (matched for age, education, and gender) and their classification accuracy and the effect size of the final predictive model (measured by Nagelkerke R²)
Notes: AD = Alzheimer's disease. NC = normal controls. aMCI = amnestic mild cognitive impairment. preAD = preclinical AD (i.e. aMCI patients that were retrospectively formally diagnosed with AD). PART = paired-associate-recognition test. 10WLLT = 10-word list-learning test. SCFT = sub-category fluency test. PALT-s = paired-associate learning test: semantic pairs. AUC = area under curve (of the ROC curve).
a In decreasing order of differential ability, according to stepwise logistic regression analysis.
b 10WLLT and PALT-s were of slightly higher significance to enter the equation at steps 1 and 2, respectively, than PART.
c Specific combination of measures, entered together (not stepwise) in the logistic regression analysis, on an exploratory basis.

Compared to the PART, the 10WLLT seemed better able to detect AD at the earliest (preclinical) stage (comparisons 3 and 4; see Table 4). In comparisons 1 and 2, it was not selected first, but still contributed significantly to the classification, whereas in comparisons 3 and 4 it was the best predictor of group classification. Furthermore, a semantic memory measure, the SCFT, significantly improved the differentiation in three out of four comparisons (the exception being comparison 3, focused on the 21 preAD patients). Thus, the differentiating value of the SCFT does not seem to be affected by the proportion of clinically diagnosed AD patients. The same was true for the PALT-s.
Classification accuracy was very high in all four comparisons. In addition, effect size of all final models, as measured by Nagelkerke R², was large (> .80). This indicates that the explanatory variables are highly useful in detecting early AD (e.g. Bewick et al., 2005). Consistently, the set of best differentiating tests is similar over all four comparisons.
On an exploratory basis, I conducted a few additional analyses over the largest groups of participants (comparison 2), entering the best differentiating tests together (instead of using the stepwise method). As is shown in the lower part of Table 4, the combination of the two best differentiating measures (PART and 10WLLT) showed an excellent accuracy of classification. The same level of classification success was achieved by the combination of the PART, PALT-s, and PALT-ns (and nearly the same level when the PALT-ns was left out).

Discussion
In the present study, the diagnostic accuracy of a computerized battery of episodic and semantic memory tests was investigated in a group of amnestic MCI and early AD patients, relative to cognitively healthy elderly controls. The tests were constructed to measure specific memory components as purely as possible by minimizing the impact of short-term memory on episodic memory and of executive control processes on semantic memory. In addition, the tests were selected to create a varied battery involving: free recall; cued recall of semantically unrelated and related words (avoiding overlearned or superordinate associations); recognition (in forced choice and yes/no-format); superordinate and subordinate category fluency and phonemic fluency; accuracy and speed of naming of pictures and verbal descriptions of words.

Episodic and semantic memory impairments in (pre)clinical Alzheimer's disease
The main question was which type of memory tests differentiated best between normal aging and early or preclinical AD. A previous study found that dementia was best predicted, two years before diagnosis but in a very small sample (N = 9), by reduced benefit of semantic relations (of moderate association strength) in a paired-associate learning test (the PALT-s measure in the current study). In the current study, I investigated whether the differentiating value of a (semantic) PAL paradigm could be replicated in a much larger sample of patients, who were actually diagnosed with AD, or who (probably) were in a preclinical stage (the aMCI patients). More precisely, I examined whether the newly developed paired-associate recognition test (PART) and perhaps also semantic memory tests improved the differentiation, particularly concerning specificity of classification.
The results showed that the PART (indeed) and the 10-word list-learning test (10WLLT; contrary to the hypothesis) were, respectively, most specific and most sensitive to AD. However, the combination of the PART and the PALT-s was nearly as successful as the combination of the PART and the 10WLLT. From a practical point of view, the former combination is more time-efficient than the latter, because administration of the PART requires the preceding administration of the PALT, during which the word pairs are studied. In any case, the PALT and the PART classified early AD more accurately than the MMSE, despite the fact that the MMSE partly determined group classification, whereas the PALT and the PART did not.
http://dx.doi.org/10.1080/23311908.2015.1125076

The diagnostic accuracy of PAL formats
Consistent with the literature reviewed in Section 1.1, PAL paradigms indeed appear to be very sensitive to MTL-related memory deficits characteristic of AD (e.g. Lowndes & Savage, 2007; Troyer et al., 2008). The 10WLLT proved to be highly sensitive as well, but it was less specific, presumably because it does not involve the binding of to-be-learned stimuli as in PAL tests. Instead, the 10WLLT demands self-initiated effortful search strategies that may be independent of MTL functioning. These processes may be sensitive to disturbances within, for example, the prefrontal cortex that also occur in normal aging, especially at very old age (e.g. Crawford, Bryan, Luszcz, Obonsawin, & Stewart, 2000; Dempster, 1992). This may explain the relatively low specificity of the 10WLLT.
The results indicate that episodic memory tests that call upon efficient semantic association of words are most accurate in the prediction or classification of AD. This may also explain why the PART obtained a higher specificity than the 10-word-recognition test (10WRT). The latter test less easily evokes semantic associations (and thereby false recognitions, common in early AD), as these were not induced during the study phase in the 10WLLT. Furthermore, cued recall of semantically unrelated word pairs (PALT-ns) was less accurate in classifying AD. This is in contrast with some studies, which intentionally tested recall of semantically unrelated words in order to increase sensitivity (e.g. Lowndes et al., 2008; Wechsler, 1997b). However, the type of semantic association between words in a PAL paradigm may be crucial: whereas overlearned or superordinate semantic knowledge is intact in AD, adequate encoding of moderately associated words (representing an "attribute" or "category coordinate" relationship) causes problems (e.g. Rogers & Friedman, 2008; Sailor et al., 1998).

The role of semantic memory
From the findings described in the previous section, it may be concluded that poor performance on episodic memory tests that require a sufficiently deep and detailed, and a less gist-like level of semantic processing is most predictive of early AD (see also Winocur & Moscovitch, 2011). In addition, it is probably not a coincidence that the SCFT contributed to the differentiation. This is a more purely semantic memory test than standard superordinate category fluency tests (i.e. the MCFT). The MCFT did not contribute in the multivariate models (Table 4) and it had a clearly lower differential ability than most of the other tests (Table 3). The diagnostic value of the SCFT is consistent with studies reporting additional differential ability of certain semantic memory tests (e.g. Dudas et al., 2005; Hirni et al., 2013; Mickes et al., 2007).
Thus, a combination of episodic and semantic memory components seems to detect early AD best. This is consistent with recent theoretical views on memory fractionation stating that episodic and semantic memory are interdependent (Greenberg & Verfaellie, 2010) or dynamically interacting (Winocur & Moscovitch, 2011). I argue that the PALT-s and the PART are memory tests in which this interdependence is crucial. Both tests require adequate binding of contextual (i.e. episodic) information and information stored in semantic memory (e.g. Reder, Park, & Kieffaber, 2009). The additional value of the PALT-s and the PART, above traditional memory tests including most PAL paradigms, is owed to the selection of the stimulus materials: word pairs of moderate semantic association strength in the first test, and distractor words that were strongly related to the cue word in the second test. These task characteristics require a high quality of binding between episodic and semantic information to prevent mistakes. This probably renders these tests more appropriate than conventional tests, such as the 10WLLT, PALT-ns, and 10WRT (or visual PAL tests reviewed in Section 1.1.1, although this was not investigated in this study), to detect the memory problems of early AD but not of normal aging.

Strengths and limitations of the present study and future research directions
A strength of the present study is my attempt to improve construct validity and increase task purity of the tests I administered. Most important in this respect, and innovative compared to previous studies, was the critical selection of word pairs and distractors of the PALT-s and the PART, and the avoidance in both recall tests (10WLLT and PALT) of the unintended impact of short-term memory as a result of recency and sequential learning effects. In addition, the fact that the test battery was computerized probably reduced measurement error, thus increasing test reliability (Snyder et al., 2011). Rentz et al. (2013) also recommended computerized testing.
However, it should be noted that this study was not specifically aimed at systematically investigating which specific task conditions (mentioned above) better differentiated between normal aging and early AD than others. This could still be explored in future research. Although these kinds of issues were not investigated within an experimental research design, the relatively better differential ability of some tasks compared to other tasks is still clinically and theoretically interesting.
In addition, it would be interesting to directly compare the differentiating value of the PAL paradigms that were found useful in the current study (i.e. PART and PALT-s, involving carefully chosen semantic memory components), with visuospatial PAL paradigms (as in the CANTAB; Morris et al., 1987), or visual PAL paradigms, as in the visual-association test (Lindeboom et al., 2002). It may be noted that the PALT predicted demented cases more accurately than the visual-association test in the previous study. Nonetheless, a new direct comparison in a larger study would still be interesting, especially when a visuospatial PAL test is also included. In any case, one should avoid circularity between independent (diagnosis) research variables and dependent (test performance) research variables.
Limitations of my study may be the moderate sample size and the lack of a systematic follow-up assessment procedure in the aMCI subsample. However, the effect sizes of all differentiating tests and of all final predictive logistic regression models were large. In addition, the set of best differentiating tests was highly similar across the four analyses or (sub)samples. Although I have no (systematic) follow-up information for all MCI patients, recent literature (Albert et al., 2011; Dubois et al., 2007) shows that it is likely that this group contains a large proportion of very early AD patients. This was also indicated by the exploratory subsample of 21 preAD patients.

General conclusions and recommendations for clinical practice
My results suggest that a carefully chosen combination of episodic and semantic memory components, integrated within single test paradigms, is able to almost perfectly separate amnestic MCI and early AD from normal aging. This is in accordance with recent theoretical views on memory organization that emphasize the interdependence and dynamic interaction between episodic and semantic memory (Greenberg & Verfaellie, 2010; Winocur & Moscovitch, 2011). A verbal PAL paradigm, particularly one using a recognition format, was found to be most adequate. The type of semantic associations incorporated in a PAL paradigm seems crucial for obtaining high sensitivity. Moderately associated words should be selected rather than more strongly related words that represent overlearned or superordinate semantic associations. Whether this type of (semantic) PAL paradigm predicts AD better (or not) than, for example, visuospatial PAL paradigms (e.g. De Jager et al., 2002; Fowler et al., 2002; Mitchell et al., 2009) still has to be investigated in future research, also taking into account the heterogeneity of Alzheimer's disease.
Experimental-neuropsychological studies have reported numerous interesting findings that could be useful for clinical practice. I hope to have shown with this study that the search for memory measures that best detect early AD benefits from joining knowledge and methods from clinical neuropsychology and experimental memory psychology (e.g. Lowndes & Savage, 2007; Spaan et al., 2003).

A.1.1.4. Paired-associate learning test: non-semantic pairs (PALT-ns). This measure consisted of cued recall of five semantically unrelated word pairs, as part of the paired-associate learning test described above. The score was the sum of pairs reproduced over three trials (range: 0-15).

Paired-associate-recognition test (PART). This test involved explicit recognition (forced choice) of the target words of the paired-associate learning test in response to the presented cue words. In each trial, three distractor words were simultaneously presented on the screen together with the target and the cue (see Figures 1a and 1b). Semantic associations were derived from the same Dutch word association norms (de Groot, 1980). Words were matched for word length. The score was the sum of correct answers (range: 0-10).

A.1.2. Semantic memory measures
A.1.2.1. Main-category fluency test. This measure required the participant to generate as many exemplars that belonged to the categories of "animals" and "products to buy in a shop" as he/she could think of within 60 s per category. The score was the sum of correct and unique answers over both categories.
A.1.2.2. Sub-category fluency test. As part of the category fluency test mentioned above, the participant was required to generate as many exemplars that belonged to the subcategories of "birds",

Notes: pipe-cigar. The first distractor ("smoke") was more strongly semantically related to the cue ("pipe"), compared to the association between the target ("cigar") and the cue, whereas the second distractor ("tobacco") was of similar association strength; the third distractor ("soup") was semantically unrelated to the cue.
Notes: nail-butter. The first distractor ("finger") was strongly semantically related to the cue ("nail"), whereas the second distractor ("bread") was strongly semantically related to the target ("butter"); the third distractor ("polish") was moderately related to the cue.