The nature of specific autobiographical memories

The specificity of autobiographical memories has been studied extensively over the past three decades in the fields of cognitive, social, and clinical psychology. An autobiographical memory (AM) refers to “a memory that is concerned with the recollection of personally experienced past events” (Williams et al., 2007, p. 122). The level of episodic specificity with which such personal memories are recalled has attracted increasing interest in relation to topics such as aging, cultural differences, and—most importantly—psychopathology (e.g., Addis, Wong, & Schacter, 2008; van Vreeswijk & de Wilde, 2004; Wang, Hou, Tang, & Wiprovnick, 2011). Specificity is important not only for vividly reexperiencing past events, but also for clearly imagining future events. In fact, remembering past events and envisioning possible future events share similar cognitive functions and neural substrates (Addis, Wong, & Schacter, 2007; D’Argembeau & Van der Linden, 2006; Okuda et al., 2003). Thus, AM recall, and particularly the retrieval of episodic specificity, is assumed to play a central role in human functioning, because AM specificity contributes to a sense of self and serves as a source for future planning and goal pursuit (Williams et al., 2007).

Reduced AM specificity is considered particularly important in clinical psychology. Empirical studies in this area typically use the Autobiographical Memory Test (AMT; Williams & Broadbent, 1986), which has been employed in more than 100 clinical and nonclinical studies since 1986 (for reviews, see Griffith, Sumner, et al., 2012; Sumner, Griffith, & Mineka, 2010; van Vreeswijk & de Wilde, 2004; Williams et al., 2007). Such studies have repeatedly shown that people with major depressive disorder (American Psychiatric Association, 2013) find it difficult to retrieve memories of specific personal events (Dalgleish & Werner-Seidler, 2014; Williams et al., 2007). In the AMT, participants are asked to recall specific personal memories in response to emotional cue words (e.g., happy, sad). A specific memory refers to an event that occurred on a particular day and did not last longer than a day (e.g., “The day I turned 18, I got new shoes from my girlfriend”). In the AMT, healthy controls are able to provide specific memories on more than 80% of the occasions. People with depression, however, tend to respond relatively more often with a memory that summarizes a category of similar events (e.g., “Getting presents from friends and family”; Williams et al., 2007). This lack of specificity, or overgeneral memory, is considered a hallmark of depressive cognition. Overgeneral memory is relatively stable over time, despite clinical improvement in depressive symptoms, and is associated with a poor prognosis for depression and PTSD (Brittlebank, Scott, Williams, & Ferrier, 1993; Kleim & Ehlers, 2008; Peeters, Wessel, Merckelbach, & Boon-Vermeeren, 2002; for a review, see Sumner et al., 2010).

Memory and language: Text-mining and machine-learning approaches

The assessment of AM largely relies on participants’ subjective narratives. Therefore, the linguistic analysis of written AMs could be a useful tool for gaining further insight into those aspects of the architecture of AM that may have psychological or psychopathological relevance. For example, Park, St-Laurent, McAndrews, and Moscovitch (2011) focused on the use of the present tense when referring to a past action (i.e., the historical present) in autobiographical narratives, and found that patients with unilateral temporal lobe excisions used the historical present less often. This result was interpreted as an indication of a reduced sense of reliving one’s AMs. Other psychoneurolinguistic studies have shown altered use of reported speech (i.e., direct quotes or indirect paraphrases of someone’s thoughts and words) and reduced creative use of language (i.e., verbal play) in amnesia, which is regarded as evidence of the interdependence of language, memory, and brain functioning (Duff, Hengst, Tranel, & Cohen, 2007, 2009).

Recent technological developments in natural language processing (e.g., Nadkarni, Ohno-Machado, & Chapman, 2011) have enabled researchers to analyze respondents’ narratives automatically and comprehensively in order to find specific linguistic features that are relevant to psychopathology, such as depression and suicidal ideation (e.g., Rosenbach & Renneberg, 2015; Rude, Gortner, & Pennebaker, 2004; Stirman & Pennebaker, 2001). Furthermore, studies combining text-mining and machine-learning approaches have started to develop computerized algorithms that distinguish between the writings of people with and without psychopathological problems, for example, genuine versus elicited suicide notes (Pestian, Nasrallah, Matykiewicz, Bennett, & Leenaars, 2010) and tweets of depressed versus nondepressed individuals (De Choudhury, Counts, & Horvitz, 2013). Machine learning (e.g., Mitchell, 1997) applied to text categorization is a statistical approach that generates a text-classification rule by learning associations between characteristics of language use (e.g., word choice, sentence length) and specific categories of documents. As Sebastiani (2001) argued, the algorithms generated by such machine-learning approaches often achieve accuracy comparable to that of human experts. Automatic text classification by machine learning is used in various fields of text analysis, such as scoring student essays (Hastings, Hughes, Magliano, Goldman, & Lawless, 2012), testing the readability of texts (Sung et al., 2015), and investigating the grammatical development of child language (Hassanali, Liu, Iglesias, Solorio, & Dollaghan, 2014). Attempts to develop classifiers of suicide notes and of depressed tweets have likewise been successful, achieving approximately 80% prediction accuracy (De Choudhury et al., 2013; Pestian et al., 2010). However, to the best of our knowledge, no study has applied such approaches to the clinically and theoretically important phenomenon of reduced AM specificity.

The present study

The present study aimed to explore the linguistic nature of specific memories generated on the AMT by developing an automatic classifier that distinguishes between specific and nonspecific memories, conforming to the traditional dichotomous use of AMT responses in the field (Studies 1 and 2), and one that discriminates between the distinct types of nonspecific memories in a five-category classification of the AMT (Study 3). Given the nature of the written version of the AMT, all memories are recorded as sentences, which human raters then classify as specific or nonspecific. However, because most of the information needed for the specific/nonspecific categorization is contained in the actual sentences that respondents write down, analyzing the linguistic nature of the written memories could reveal the unique characteristics that make a memory (more or less) specific.

In the present study, we followed standard procedures for text classification using machine-learning techniques (Ikonomakis, Kotsiantis, & Tampakas, 2005; Joachims, 2005; Sebastiani, 2001). These include several preprocessing steps, namely (a) tokenization and lemmatization, (b) part-of-speech (POS) tagging, and (c) feature selection. First, we decomposed all the written memories into the smallest grammatical units, or morphemes (tokenization and lemmatization). For example, a sentence like “I played the guitar” can be decomposed into “I/play-ed/the/guitar.” Second, each morpheme (often referred to as a “token”) was tagged with an appropriate POS on the basis of its definition and context (noun/verb-suffix/article/noun, for this example). Third, we compared the frequency of the morphemes between observer-rated specific and nonspecific memories in order to extract the linguistic features that are most relevant to “memory specificity” (feature selection). Despite the exploratory nature of the present study, we hypothesized that a specific memory would contain a greater number of morphemes. A specific (vs. nonspecific) memory normally contains a more detailed description of a particular event, that is, contextual information such as temporal, spatial, and sensory–perceptual details, given that a specific memory is defined as a memory referring to a personal past event that occurred at a particular place and time and lasted for less than one day (Williams et al., 2007). In contrast, expressions referring to extended durations of time (e.g., life and everyday) and to the repeatedness of an event (e.g., always and often) could in turn be a strong indication of a nonspecific memory. Feature selection is also of practical importance for machine-learning methods, because the reduction in dimensionality (removing features that are irrelevant for classification) has a number of advantages, such as saving computational resources, shrinking the search space, and, most importantly, improving classification accuracy (Ikonomakis et al., 2005).
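To make these preprocessing steps concrete, the sketch below builds a small document–term matrix in R (the language used for our analyses). The whitespace tokenizer and English toy sentences are simplifications for illustration only; our actual pipeline performed Japanese morphological analysis with the ChaSen parser via KH Coder (see Study 1).

    # Toy memories; in practice, Japanese responses parsed into morphemes
    memories <- c("I played the guitar yesterday",
                  "I always play the guitar")

    # Tokenization (naive whitespace split as a stand-in for morpheme parsing)
    tokens <- strsplit(tolower(memories), "\\s+")

    # Document-term matrix: rows = memories, columns = morpheme types
    vocab <- sort(unique(unlist(tokens)))
    dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
    rownames(dtm) <- paste0("memory", seq_along(memories))
    dtm  # term frequencies per memory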

Next, we developed a text classifier by training a support vector machine (SVM; Cortes & Vapnik, 1995), a machine-learning technique that shows high performance on two-group classification problems. Using a set of training samples whose categories (i.e., the teacher signal) are already known, an SVM training algorithm develops a “rule” over the input features that best classifies a new sample into one category or the other. The SVM was trained to distinguish between the specific and nonspecific categories using a document–term matrix of morpheme frequencies as input.
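For illustration, a minimal SVM fit with the R e1071 package (which we used throughout) looks as follows; the toy feature matrix and labels are invented solely to show the supervised-learning logic of training on known categories and then predicting new samples.

    library(e1071)  # provides svm()

    set.seed(1)
    # Toy input: 40 documents x 3 features (e.g., morpheme frequencies)
    x <- matrix(rpois(120, lambda = 2), nrow = 40)
    y <- factor(rep(c("specific", "nonspecific"), each = 20))  # teacher signal

    fit <- svm(x, y, kernel = "radial")  # learn a classification "rule"
    predict(fit, x[1:5, ])               # classify (new) samples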

SVM performance is often assessed by prediction accuracy for “unknown” test samples that were not used in the SVM training process. Our performance test was twofold. First, we assessed performance on a test dataset comprising AMs that were recalled in the same way (using the same cue words) as the AMs in the training dataset (Study 1). Second, the SVM was tested on another test dataset, comprising AMs that were recalled in response to a cue word set different from that of the training dataset (Study 2). This two-step procedure enabled us to examine the general versatility and robustness of the SVM—the extent to which the SVM can predict AM specificity accurately across different memory contents. This performance test is of theoretical importance, as it could reveal whether memory specificity is determined by “local” words that are specifically associated with the types of cue words, or by general linguistic features that exist across memories triggered by different (sets of) cue words.

We also tested the performance of SVM classifiers that were trained only on specific components of language features, such as sentence length, function words, and content words (Footnote 1). By comparing the full SVM trained on all the language features with these restricted models, we can explore which linguistic features specifically contribute to the nature of specific memories, and which part of the structural or semantic information of a written AM is important for the accurate classification of AMT responses.

We started with the binary classification between specific and nonspecific memories (Studies 1 and 2), although the AMT has four additional categories for the nonspecific memories: extended memories (referring to an event that lasted for more than one day), categoric memories (a category of events, or an event that occurred repeatedly), semantic associates (general semantic information), and omissions (failure to make a response). Given that most previous studies using the AMT have utilized the proportion of specific memories as an index of memory functioning (i.e., memory specificity), we focused on developing a classifier that detects specific memories among the other four types of responses. Indeed, Griffith, Sumner, et al. (2012, p. S22), in their review on the assessment of memory specificity, concluded that in most studies in this field, AMT responses are scored and used dichotomously as specific versus nonspecific. In Study 3, we evaluated the full five-class classification of the AMT. We expected that the similarities and ambiguities among “nonspecific” memories might make the five-class problem challenging (e.g., “Attending a rock fest” could be an extended memory, a categoric one, or even a semantic associate, depending on the context; see Footnote 2).

Study 1

Method

Participants and procedure

Participants (N = 1,240; 620 men and 620 women; mean age = 44.6 years, SD = 15.1 years) were recruited from a Japanese community population by an online survey company (Macromill Inc., Tokyo, Japan). Because all participants had to complete online registration via the website of the survey company, we can assume that most participants had at least minimal knowledge of computers and the internet. The survey company invited participants from its list of more than 1 million potential respondents, with a small incentive (a coupon for online shopping). Participants received explanations about the aims and protocols of the present study and provided informed consent at the beginning of the survey. All procedures were approved by the institutional review board of Kansai University.

Participants were asked to complete the AMT via the internet, with minimal instructions and without any time constraints on recalling memories. The Japanese version of the AMT consists of ten emotional (five positive and five negative) cue words that are matched in terms of valence, frequency, and imageability (Gotoh & Ohta, 2001): self-assured, promising, succeeded, lucky, and content are the positive cues; discontent, desperate, fatigued, painful, and failed are the negative cues (Takano, Ueno, Mori, Nishiguchi, & Raes, 2016). In response to each cue word, participants were asked to recall and write down an event that they had personally experienced in the past. We employed the shortened and simplified version of the instructions used in Heron et al. (2012), which did not explicitly mention that the memory should be specific (see also Debeer, Hermans, & Raes, 2009; Raes, Hermans, Williams, & Eelen, 2007). This minimal set of instructions has been shown to be more sensitive to variations in memory specificity in nonclinical samples than the traditional instructions, in which respondents are explicitly asked to retrieve specific memories (Griffith et al., 2009; Heron et al., 2012). The authors coded all written memories into five categories before statistical analysis, following the coding rules established in previous studies that used the written version of the AMT (Heron et al., 2012). The categories were as follows: specific memory, which refers to an event that occurred on a particular day (e.g., Last Sunday, I won 150 euros in the National Lottery); extended memory, which refers to an extended period of time (i.e., longer than a day; e.g., When I was married to my first wife); categoric memory, which refers to a category of similar or repeated events (e.g., I used to buy lottery tickets every Sunday); semantic associate, which is not a memory but a mere association to the cue, such as an evaluative description of a person or an object (e.g., I’m a lucky person); and omission, in which participants did not write down a response. The hand-scoring of the AMT responses identified 3,531 (28%) specific memories, 1,306 (11%) extended memories, 1,644 (13%) categoric memories, 3,406 (27%) semantic associates, and 2,513 (20%) omissions. In order to simplify the classification problem, and because memory specificity is the most important class in the literature (e.g., Williams & Broadbent, 1986), all AMT responses were binary-coded as specific versus nonspecific in the analyses of Studies 1 and 2. The interrater consistency of the manual coding was tested on 200 memories that were independently rated by two raters. Interrater agreement was 91% for the two-class categorization, and kappa = .73 for the five-class categorization. Disagreements were resolved through discussion among four of the authors.

Parsing and morpheme analysis

All written memories were decomposed into morphemes (or unigrams) using text-mining software (KH Coder; Higuchi, 2004, 2015), which integrates a Japanese text parser (ChaSen; Matsumoto, 2000), a database, and tools for multivariate statistical analyses. We used the software to obtain a document–term matrix through tokenization (decomposing a sentence into morphemes), lemmatization (standardizing different inflected forms of a word), and POS tagging.

All written memories were collapsed across the ten emotional cues, because we confirmed that the specificity of a memory was correlated across different cues and was best represented by a unidimensional, or unifactorial, structure (Griffith, Kleim, Sumner, & Ehlers, 2012; Griffith et al., 2009; Heron et al., 2012; Takano et al., 2016; Footnote 3). Across the total of 12,400 memories, the software extracted 6,729 different types of morphemes. At the same time, POS tags were assigned to each word in each memory. After parsing, we calculated three linguistic statistics for each memory: (a) the number of morphemes (i.e., token number), which reflects the length of the written memory; (b) the number of types of morphemes (i.e., type number), which indicates the vocabulary used in the memory; and (c) the type–token ratio (i.e., the number of types of morphemes divided by the number of morphemes), which reflects the richness of the vocabulary after controlling for the length of the sentence (e.g., Holmes, 1985). We did not include identifiers of the cue words in the input vector, because such information (which memory was generated in response to which cue word) did not improve the prediction accuracy of the SVM.
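These three statistics are straightforward to compute once each memory is available as a vector of tokens. A minimal sketch in R, assuming a list tokens holding the parser output for each memory (a placeholder name):

    # tokens: list with one vector of morphemes per memory (parser output)
    token_n <- sapply(tokens, length)                           # (a) token number
    type_n  <- sapply(tokens, function(tk) length(unique(tk)))  # (b) type number
    ttr     <- type_n / token_n                                 # (c) type-token ratio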

Training of the SVMs

We divided the 12,400 memories into training and test datasets. Of these, 8,200 randomly sampled memories were used as the training dataset, on which the SVMs were trained. The remaining 4,200 memories (around one third of the total; Statnikov, Aliferis, & Hardin, 2011) were used as the test dataset, on which the classification accuracy of the SVMs was assessed. This division between the training and test datasets was fixed across the different SVMs, which allowed us to compare model performance on a constant test dataset. We selected morphemes that should be important for classifying specific and nonspecific memories from the training dataset only. The selection criterion was the relative difference in term frequency (TF; how many times a word was used in a certain group of texts; e.g., Manning, Raghavan, & Schütze, 2008) between the observer-rated specific and nonspecific memories, indexed by chi-squared statistics (see also the individual morpheme analysis in the Results and Discussion section). This type of feature vector is often referred to as a document–term matrix, which describes the frequency of the terms (or morphemes) that occur in a corpus, with rows as documents (i.e., AMs) and columns as types of morphemes. Four different feature vectors were prepared as input signals, consisting of the term frequencies of 50, 100, 200, and 500 morphemes (Footnote 4), respectively. We selected the morphemes that had (a) the highest chi-squared statistics (largest differences in TF between the specific and nonspecific categories) and (b) sufficient frequencies (more than four occurrences in the training dataset). In addition to the morpheme frequencies, each of the four feature vectors included the token number and character count of each AM. The character count was included as an additional measure of sentence length, because compound (or hyphenated) words are counted as a single token but have more specific meanings and higher character counts.
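A sketch of this feature-selection step in R, assuming a document–term matrix dtm for the training dataset, a logical vector is_specific holding the observer ratings, and per-memory character counts char_count (all placeholder names):

    # Chi-squared statistic per morpheme: occurrence vs. specificity
    chi <- apply(dtm, 2, function(tf) {
      unname(suppressWarnings(chisq.test(table(tf > 0, is_specific))$statistic))
    })

    # Keep the 200 morphemes with the largest TF differences between classes,
    # restricted to morphemes occurring more than four times in the training set
    eligible <- colSums(dtm) > 4
    top200   <- names(sort(chi[eligible], decreasing = TRUE))[1:200]

    # Input vector: selected term frequencies plus sentence-length measures
    input <- cbind(dtm[, top200], token_n = rowSums(dtm), char_n = char_count)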

We employed a Gaussian-kernel SVM, which was trained using five-fold cross-validation within the training dataset. Cross-validation was used to avoid overfitting the SVM to a single training dataset. To optimize the SVM, we performed a grid search over the cost and gamma parameters. However, we found that this parameter optimization did not appreciably improve SVM performance. Furthermore, because we needed to compare the SVMs against one another in the performance tests, we report SVMs trained with the default values (cost = 1; gamma = 1/number of data dimensions). Because our training dataset was not balanced between the two classes, we put greater weight on the “specific” than on the “nonspecific” category (i.e., a weight balance of 2:1) in the SVM training.
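In e1071, this training configuration can be sketched as follows (input and labels stand for the training feature matrix and observer ratings from the previous step; the commented tune() call shows the kind of grid search we explored):

    library(e1071)

    fit <- svm(input, labels,
               kernel = "radial",                 # Gaussian kernel
               cost = 1,                          # default cost
               gamma = 1 / ncol(input),           # default gamma = 1/dimensions
               cross = 5,                         # five-fold cross-validation
               class.weights = c(specific = 2,    # 2:1 weights to offset the
                                 nonspecific = 1))  # unbalanced classes

    # Grid search over cost and gamma (did not improve performance appreciably)
    # tuned <- tune(svm, train.x = input, train.y = labels,
    #               ranges = list(cost = 2^(0:4), gamma = 2^(-6:-2)))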

As a performance test, the trained SVM was used to predict the outcomes (i.e., memory specificity) in the held-out test dataset. Data splitting was performed because (a) the generalizability of a trained model is not guaranteed without validation on an “unknown” test dataset that is separated from the model training process (cf. Esbensen & Geladi, 2010), and (b) we aimed to compare SVM performance between two different contexts of AM retrieval (i.e., same vs. different cue words; Studies 1 and 2).

SVM performance was evaluated by a receiver operating characteristic (ROC) analysis. In this performance test, the hit rate (the ratio of specific memories correctly identified by the SVM relative to all observer-rated specific memories) and the correct rejection rate (nonspecific memories correctly predicted by the SVM relative to all observer-rated nonspecific memories) were calculated. ROC curves were constructed by plotting the hit rate against the false alarm rate (= 1 − correct rejection rate) at various thresholds of the SVM decision values. Note that decision values are continuous scores, which are transformed into binary outputs (i.e., the specific vs. nonspecific categories) via a discriminant (or sign) function. Each possible threshold of this discriminant function provides a single point on the ROC curve (e.g., Rakotomamonjy, 2004).

The area under the curve (AUC), which indicates how well a classifier can distinguish between two groups, was computed by integrating the area under the ROC curve across all possible thresholds, with .5 indicating chance level and 1 indicating perfect separation. Unlike the hit and correct rejection rates, the AUC depicts the general behavior of the model, independent of the thresholds on the SVM decision values. We utilized the R e1071 package (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2014) for training the SVMs, and the ROCR package for the ROC analysis (Sing, Sander, Beerenwinkel, & Lengauer, 2005; Footnote 5).
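A sketch of this ROC analysis with ROCR, assuming the trained model fit, a test feature matrix test_input, and observer ratings test_labels (all placeholder names):

    library(ROCR)

    # Continuous decision values rather than binary class predictions
    dec <- attr(predict(fit, test_input, decision.values = TRUE),
                "decision.values")

    # Sign convention of the decision values depends on the factor level order
    pred <- prediction(as.numeric(dec), test_labels == "specific")
    roc  <- performance(pred, "tpr", "fpr")   # hit rate vs. false alarm rate
    plot(roc)

    performance(pred, "auc")@y.values[[1]]    # area under the curve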

Results and discussion

Token and type numbers

As a first step, we tested for differences in the basic linguistic statistics (token number, type number, and type–token ratio) between observer-rated specific and nonspecific memories. Means and standard deviations for these statistics are presented in Table 1. The results of an analysis of variance (ANOVA) indicated significant differences in token and type numbers among the five AMT classes, implying that memory responses (specific, extended, and categoric) tend to be written in longer sentences and to contain a richer vocabulary than nonmemory responses (semantic associates and omissions). Omissions and semantic associates had greater type–token ratios than the other three types of responses, indicating incompleteness of the sentence rather than richness of the vocabulary; omissions and semantic associates typically contain fewer words, and especially fewer function words (e.g., auxiliary verbs and particles), than specific, extended, and categoric responses.
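A sketch of this comparison in R, assuming a data frame stats with one row per memory, its linguistic statistics, and its five-level AMT category (placeholder names):

    # Omnibus test of token number across the five AMT classes
    fit_aov <- aov(token_n ~ amt_class, data = stats)
    summary(fit_aov)

    # Pairwise comparisons between classes
    TukeyHSD(fit_aov)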

Table 1 Basic linguistic statistics for Studies 1 and 2

Individual morpheme analysis

To reveal predictive morphemes that can discriminate between specific and nonspecific memories, we calculated chi-squared statistics for each morpheme, indicating the extent to which its TF differed between specific and nonspecific memories. The morphemes are sorted and ranked by the magnitude of their chi-squared statistics in Table 2 (higher chi-squared scores correspond to greater differences in TF between the specific and nonspecific memory categories). This morpheme analysis was performed on the training dataset.

Table 2 Morphemes that had the largest difference in frequency (indexed by chi-squared statistics) between specific and nonspecific memories

Past-tense auxiliary verbs (Footnote 6) (“ta” and “da”) had the largest chi-squared values. On average, past-tense expressions were used more than once per specific memory, whereas their average frequency was only 0.21 in nonspecific memories. After these past-tense expressions, negations (no and not) had a greater frequency in nonspecific than in specific memories, because negative expressions were often associated with the omission type of response (e.g., “nothing” and “cannot remember”) and with semantic associates (e.g., “my salary is not enough”). In many cases, expressing the nonexistence of something does not refer to a specific experience, which should include the details of actions, events, and objects that actually took place or existed in the past. For example, “I attended my friend’s birthday party” would be a specific memory, whereas “I did not attend his birthday party” (or “I have never attended his birthday party”) would be scored as a semantic associate unless an alternative action (“because I had to attend a funeral”) were indicated. Furthermore, negation had a strong collocation with particular or particularly, in the forms “nothing in particular” and “not particularly.” Because this collocation often appeared in the omission category in the present corpus, the term particular(ly) had a relatively large chi-squared value.

Furthermore, some particles (e.g., “de,” “wo,” “ni”) had relatively large chi-squared statistics (Table 2; see also Table A1 in the Appendix), indicating a greater frequency of use in specific memories. In Japanese, a particle combines with a noun to form an adjectival or adverbial phrase, which adds information to a sentence. For example, the particles “de” and “ni” have a function similar to English prepositions (e.g., at, on, in, and to) that define the time, direction, or target of an action when combined with a noun, as in “I worked on February 10th” and “I went to the park.” This means that memory sentences that include a greater number of adverbial phrases formed by particles (or prepositions in English) are more likely to be judged as specific, probably because such sentences often carry richer information about temporal and spatial details.

Figure 1 shows the proportions of the major POS tags (verbs, nouns, adjectives, adverbs, adjectival nouns (Footnote 7), auxiliary verbs, and particles) in the four input vectors of 50, 100, 200, and 500 morphemes. The vector of 50 morphemes consisted of 38 (76%) independent words, which have meaning by themselves (i.e., verbs, nouns, adverbs, adjectival nouns, and adjectives), and 12 (24%) function words, which must be combined with independent words to impart meaning (i.e., auxiliary verbs and particles). This smallest input vector covers a minimal vocabulary (see also Table 1) of frequently used independent words (e.g., the verb do), function words (e.g., auxiliary verbs and particles), and AMT cue words (e.g., failure and success). The larger input vectors had greater proportions of independent words but smaller proportions of function words, which implies that the input vector of 500 morphemes covered the richest vocabulary, containing less frequently used words that convey specific and unique meanings compared to the words in the other input vectors. Because the number of types of function words is limited by the nature of the language, auxiliary verbs and particles became less dominant in the larger input vectors.

Fig. 1 Proportions of major part-of-speech tags in the four input vectors of 50, 100, 200, and 500 morphemes. Adv = Adverb, Adj n = Adjectival noun, Auxi v = Auxiliary verb

Performance test of the support vector machines

We trained four different SVMs using the different numbers of morphemes (50, 100, 200, and 500) in the document–term matrices, with sentence-length information (token and character counts) added to the input vectors and the observer-assigned ratings (specific vs. nonspecific memory) as the categorical outcome. Each SVM was trained on the training dataset, and its performance was then tested on the test dataset. The results of the performance test are shown in Table 3. Each model showed good classification performance on the test data, with 86%–87% correct rejections, 83%–87% hits, and AUC values of .91–.92. These results suggest that our SVMs, even the one trained on the smallest input vector of 50 morphemes plus sentence-length information, performed well in classifying the written AMs into the specific and nonspecific categories. As the number of input morphemes increased, model performance improved slightly, although there was no change (or a decrease) in performance between 50 and 100 morphemes. This small improvement implies that for accurate classification, some memories (around 1% of the total memory samples, in terms of the AUC) require additional knowledge and vocabulary regarding less frequently used words that were not included in the base 50 input morphemes.

Table 3 Results of receiver operating characteristic analysis in Study 1 (N = 4,200)

Although the AUC identified the SVM with 500 input morphemes as the best model, this model had the lowest hit rate among the four SVMs. This reduction in hit rate implies that the model with 500 morphemes might have overfit the training data, because the input vector could include morphemes that were locally optimal for the training dataset. A similar performance reduction was also found in Study 2. We therefore regarded the SVM trained on 200 input morphemes as the best model in terms of classification accuracy, and accordingly, this SVM was used in the further analyses. The performance of the SVM with 200 morphemes was also tested for each cue word, with hit rates ranging from 77.3% to 93.8%, correct rejection rates from 78.6% to 93.0%, and AUCs from .87 to .96 (see Table 4).

Table 4 Results of receiver operating characteristic analysis for each cue word in Study 1

We also examined which linguistic features contributed to the classification accuracy by training SVMs on restricted information from the 200 morphemes: (a) sentence length (character and word counts), (b) the 177 content (independent) words (verbs, nouns, adjectives, adjectival nouns, and adverbs), and (c) the 23 function (dependent) words (auxiliary verbs and particles). The performance of these three SVMs (see Table 3) suggests that (a) sentence length alone carried the weakest signal for the classification of memory responses (AUC = .751); (b) function words alone classified responses well above chance and provided the strongest signal among the restricted models (AUC = .880); (c) content words alone classified responses better than sentence length alone, but less well than function words (AUC = .833); and (d) content words provided a classification signal beyond that provided by function words alone, since the model including all words (AUC = .917) outperformed the model using only function words.
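These restricted models amount to training the same SVM on column subsets of the input matrix. A sketch, assuming the 200 selected morpheme columns dtm200, a vector pos of their POS tags, and the sentence-length measures token_n and char_n (all placeholder names):

    content_cols  <- pos %in% c("verb", "noun", "adjective",
                                "adjectival noun", "adverb")   # 177 columns
    function_cols <- pos %in% c("auxiliary verb", "particle")  # 23 columns

    svm_length   <- svm(cbind(token_n, char_n),  labels, kernel = "radial", cross = 5)
    svm_content  <- svm(dtm200[, content_cols],  labels, kernel = "radial", cross = 5)
    svm_function <- svm(dtm200[, function_cols], labels, kernel = "radial", cross = 5)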

Figure 2 shows the ROC curves for the full and restricted SVMs. The full SVM on 200 morphemes outperformed the three restricted models, which underscores the importance of integrating both functional and semantic information in the input feature vector. However, because the function words contributed more to the prediction accuracy than the content words did, the binary classification of specific versus nonspecific memories might depend more on the structural than on the semantic information of AMs.

Fig. 2 Receiver operating characteristic curves for the support vector machines trained on input vectors of sentence length (red), 177 content words (blue), 23 function words (green), and all of the information (sentence length and 200 morphemes; black)

Study 2

The results of Study 1 suggest that the SVM trained on 200 input morphemes has sufficient accuracy in classifying AMs into specific and nonspecific memories. One important limitation is that the SVM was trained and tested within a single context in Study 1; that is, both the training and test datasets comprised AMs that were retrieved in response to the same ten cue words. Therefore, it remained unclear whether the SVM we developed would also show sufficient accuracy in classifying AMs retrieved in response to a different cue word set. Testing the robustness of the SVM in classifying memories generated with different (sets of) cue words is especially important, given that the cue words used in the AMT literature often differ across studies (for a review, see Griffith, Sumner, et al., 2012). Furthermore, clinical interventions such as Memory Specificity Training (MeST; Neshat-Doost et al., 2013; Raes, Williams, & Hermans, 2009), developed to increase the specificity with which (depressed) individuals retrieve personal memories, use a wider range of cue words than a single AMT assessment does.

At the same time, this robustness test of the SVM is of theoretical importance, because high performance on it would imply that the SVM captures universal linguistic features of “specificity” that exist across memory contents, independent of the particular cues or triggers. Conversely, if SVM performance were substantially worse than in Study 1, this would indicate content dependency (or cue word dependency) of memory specificity judgments. By testing the general versatility of the SVM, we could reveal whether memory specificity is coded by individual words and specific vocabulary, or whether it depends more on the structural information of a sentence, such as tense and function words. Thus, in Study 2, we tested whether the SVM developed in Study 1 retains sufficient accuracy in classifying newly collected AMs that were retrieved in response to cue words different from those used in Study 1.

Method

Participants and procedures

Written AMs were collected through an online survey under the same AMT instructions as in Study 1. Participants (N = 314) were recruited by a survey company (Macromill Inc., Tokyo, Japan) from a Japanese community population, balanced in age (M = 44.7 years, SD = 18.9 years) and gender (154 males, 160 females). Instead of the ten cue words used in Study 1, new cue words (with no overlap with Study 1) were used, in accordance with previous memory studies and emotional word lists (e.g., Gotoh & Ohta, 2001; Raes et al., 2009). Among the 27 cue words, seven were nouns (bicycle, animal, car, gift, party, house, and restaurant) and the other 20 were adjectives (satisfied, healthy, proud, friendly, active, effortful, cheerful, favorable, self-possessed, irritated, assertive, passive, clumsy, handy, happy, sad, hopeful, disappointed, secure, and insecure). Although a previous memory study (Raes et al., 2009) used more than 50 cue words, we roughly halved the number of cues to reduce the burden on participants. Each participant was asked to write down a memory in response to every cue. Manual coding of memory specificity was completed by two of the authors (interrater consistency between the two raters on 200 memories was 95% for the binary coding). The manual coding identified 1,993 (23.5%) specific memories, 1,173 (13.8%) extended memories, 1,326 (15.6%) categoric memories, 2,822 (33.2%) semantic associates, and 1,164 (13.7%) omissions. We confirmed the single-factor structure of the 27 cue words in terms of specificity (binary-coded, with 1 for specific and 0 for nonspecific memories). An exploratory factor analysis on the tetrachoric correlations among the cue words revealed eigenvalues of 11.00, 1.95, 1.68, and 1.45 for the first four factors. All the cues had medium-to-large loadings on a single factor (>.45), except for the cue words satisfied (.40) and effortful (.31). Two thirds of the AMs were randomly selected as training data (n = 5,600), and the rest were used as test data (n = 2,878). On the training dataset, SVMs were trained as benchmark models that were optimally fitted within the memory corpus of the new cue words, and their performance was compared with that of the SVM developed in Study 1.
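This factor-analytic check can be sketched with the R psych package (one possible toolchain, not part of our reported pipeline; spec01 is a placeholder for the 314 x 27 matrix of binary specificity codes):

    library(psych)

    # Tetrachoric correlations among the 27 cue words
    tc <- tetrachoric(spec01)$rho

    # Eigenvalues to judge the number of factors (cf. 11.00, 1.95, 1.68, 1.45)
    eigen(tc)$values[1:4]

    # Loadings on a single-factor solution
    fa(tc, nfactors = 1)$loadings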

The language analysis was identical to that of Study 1, utilizing the KH Coder for the Japanese morpheme analysis (Higuchi, 2004, 2015), the R e1071 package for SVM development (Meyer et al., 2014), and the R ROCR package for the ROC analysis (Sing et al., 2005). The morphemes used to develop the benchmark SVMs were selected from the training dataset using chi-squared statistics as an index of the TF differences between specific and nonspecific AMs. In addition, we excluded morphemes with low frequencies (fewer than five occurrences in the training dataset of Study 2) to avoid constant predictors in the cross-validation routine of the SVM training.

After this feature selection, benchmark SVMs were trained on three different sets of input morphemes (100, 200, and 500). Model performance was tested on the test dataset, which was not used in the SVM training process. All the SVMs were trained with a Gaussian kernel, five-fold cross-validation, and class weights (3:1 for the specific and nonspecific categories).

Results and discussion

Of the total 8,476 memory responses, 1,993 were rated as “specific” and the other 6,483 as “nonspecific” by the manual coding. Similar to the results of Study 1, the basic linguistic statistics (Table 1) indicate that a specific memory is characterized by a greater number of words (token number) and a richer vocabulary (type number). Table 5 shows the results of the performance tests of (a) the SVMs trained on the training dataset of Study 2 and (b) the SVM imported from Study 1. Among the SVMs trained within the dataset of Study 2, the SVM with 200 morphemes exhibited the highest performance in predicting manually coded memory specificity, achieving an AUC of .92. Although the model trained in Study 1 (on 200 input morphemes) performed slightly worse than that of Study 2, it still performed well (AUC = .888), despite the cue word differences. However, the SVM trained in Study 1 showed relatively variable performance across individual cue words, ranging from 55.6% to 95.0% in hit rate, from 57.5% to 95.1% in correct rejection rate, and from .78 to .98 in AUC (Table 6). The worst performance was observed for the cue words satisfied and effortful, which had the lowest factor loadings in the exploratory factor analysis. Responses to these cue words differed from those to the other cue words, which might have influenced the prediction accuracy of the SVM. The hit rate was particularly low for the cue words assertive (57.1%) and insecure (55.6%). However, because the proportions of specific memories were relatively small for these two cue words (12% and 8%), as compared to the overall average (23.5%), sampling biases as well as content differences may have influenced the SVM performance.

Table 5 Results of the receiver operating characteristic analysis in Study 2 (N = 2,878)
Table 6 Results of receiver operating characteristic analysis for each cue word in Study 2

Regarding the performance of the restricted models, the SVM trained on content words performed worse in Study 2 than in Study 1 (AUCs = .765 and .833, respectively). By contrast, the SVM trained on function words exhibited similar levels of performance in Studies 1 and 2 (AUCs = .880 and .874, respectively). These results suggest that (a) the versatility of the full SVM on 200 morphemes can be mainly attributed to the function words, and (b) the content words should be selected according to the set of cue words in order to improve the prediction accuracy of the SVM.

Among the 200 input morphemes used to train the SVMs in Studies 1 and 2, 72 morphemes (17 of them function words) were common to the two SVMs (a full list of these morphemes is provided in the Appendix). These common morphemes reflect general features of specific memories: past-tense auxiliaries, prepositions (particles in Japanese), negations (no and not), and temporal expressions (e.g., life, always, and everyday), which exist across the diverse contents of memories generated in response to different cue words. The other 128 morphemes, newly selected in the feature selection of Study 2, slightly improved the classification accuracy of the SVM developed in Study 2. This result suggests that for some memories, extra vocabulary specifically associated with the cue words enhances classification accuracy; for example, the phrase “in a wedding ceremony” in a memory cued by the word party is a clear indication of specificity, because human raters have the a priori knowledge that a wedding ceremony normally takes place within a single day.

Study 3

Studies 1 and 2 demonstrated the performance of the SVM in the binary discrimination between specific and nonspecific memories. To evaluate our machine-learning approach in a five-class classification of the AMT, we repeated similar analyses on the same datasets, but employing a multiclass SVM algorithm in Study 3.

Method

Feature selection

To select the morphemes to be used in the multiclass SVM training, we first computed chi-squared statistics for each morpheme in a one-versus-all comparison (e.g., specific memories vs. the other four categories) on the whole dataset of Study 1. Next, we selected potentially predictive morphemes using the Round-Robin algorithm (i.e., selecting an equal number of candidates from each of multiple categories; Forman, 2004, 2007). The one-versus-all comparison was repeated for each of the five AMT categories, and in each comparison loop (extended vs. the other four, categoric vs. the other four, etc.), the same number of morphemes was nominated on the basis of the chi-squared statistics. For each AMT category, this nomination process yielded 186 morphemes with the largest chi-squared statistics (>3.0) and sufficiently high frequencies for analysis (>4). The one exception was the omission category, for which only 78 morphemes were available, due to its limited vocabulary. This feature selection resulted in 514 morphemes after the exclusion of overlaps in the nominations across multiple classes. Among these 514 morphemes, 192 had also been used in the binary SVM trained on 200 morphemes in Study 1.
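A sketch of this Round-Robin selection in R, assuming the document–term matrix dtm and the five-level factor amt_class from the Study 1 data (placeholder names):

    classes <- levels(amt_class)

    nominate <- function(target, n_max = 186) {
      one_vs_all <- amt_class == target   # e.g., specific vs. the other four
      chi <- apply(dtm, 2, function(tf) {
        unname(suppressWarnings(chisq.test(table(tf > 0, one_vs_all))$statistic))
      })
      ok <- chi > 3.0 & colSums(dtm) > 4  # thresholds used in Study 3
      head(names(sort(chi[ok], decreasing = TRUE)), n_max)
    }

    nominated <- unlist(lapply(classes, nominate))
    features  <- unique(nominated)        # 514 morphemes after removing overlaps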

SVM training

We found that the unbalanced sample sizes across the five AMT categories caused a serious bias in SVM training; therefore, we prepared a balanced dataset by randomly sampling 1,000 responses from each AMT category (840 for extended memories) as a training dataset (4,840 responses in total), and 300 from each category as a test dataset (1,500 in total). These sample sizes were determined because the dataset of Study 1 contained 1,306 extended memories (840 in the training and 466 in the test data after the random data splitting), the smallest class among the five AMT categories. To test the general versatility of the five-class SVM across different sets of cue words, we also tested its performance in predicting the data collected in Study 2. For ease of interpretation and comparison between the two datasets, we randomly sampled 300 responses from each of the five classes in the dataset of Study 2. Because the test data were balanced across the five classes, we used accuracy (correct predictions relative to the total number of samples) as the performance index in Study 3.

Because an SVM can natively solve only binary problems, the five-class SVM was trained using the “one-against-one” approach. In this training algorithm, binary classifiers are trained for each pair of classes, and each binary classifier “votes” to determine the appropriate class for a test sample (Meyer et al., 2014). In this voting system, ten binary classifiers, trained to distinguish between the two given categories for each of the ten possible combinations (e.g., specific vs. extended, specific vs. categoric, etc.), determined the most appropriate class for each memory response by majority rule. The other settings were the same as in the SVM training of Studies 1 and 2: we used the R e1071 package (Meyer et al., 2014), and the SVM was trained with a Gaussian kernel and five-fold cross-validation.
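In e1071, this one-against-one scheme is applied automatically whenever the outcome factor has more than two levels, so the five-class model reduces to a single call. The sketch below assumes a balanced training matrix input5 with its five-level labels amt_class5, plus test-set counterparts (placeholder names):

    fit5 <- svm(input5, amt_class5, kernel = "radial", cross = 5)

    pred5 <- predict(fit5, test_input5)
    table(predicted = pred5, actual = test_class5)  # confusion matrix (cf. Table 7)
    mean(pred5 == test_class5)                      # overall prediction accuracy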

Results and discussion

The five-class SVM achieved 65.1% prediction accuracy, averaged over the five-fold cross-validation in the training dataset. Given that chance level is 20% in a five-class problem, this accuracy means that the SVM separated the classes considerably better than random classification would. In predicting the test dataset, the SVM classifier showed 64.7% accuracy (see Table 7). This performance was retained in predicting the test dataset from Study 2, with an accuracy of 63.7%. However, the sensitivity for detecting extended memories dropped from 47.3% in Study 1 to 42.7% in Study 2. This reduction in sensitivity suggests that the identification of extended memories requires more cue-dependent information and a priori knowledge about the tokens (see the “wedding ceremony” example in Study 2) than does the identification of the other types of memories. Importantly, the patterns of classification errors appear to reflect the similarities between the types of memories. For example, specific memories were often misclassified as extended memories, and vice versa. Furthermore, categoric memories and semantic associates tended to be confused with each other, which suggests that categoric memories share some language features with nonmemory responses, and vice versa. The latter finding supports an earlier claim by Raes et al. (2007, p. 496) that a large portion of semantic associates actually could, or should, be regarded and treated as a special class of “overgeneral categoric” responses.

Table 7 Confusion matrix of predicted and actual categories by the five-class support vector machine (SVM)

Overall, the performance of the five-class SVM classifier was better than chance, but its prediction accuracy was still only moderate. A possible reason for the lower performance of the five-class SVM relative to the binary classifier is that misclassifications often occurred among the three types of “nonspecific” responses (i.e., extended memories, categoric memories, and semantic associates), which accounted for more than half of the total classification errors (51.8%). Furthermore, the proportion of input features specifically selected to detect specific memories was relatively small in the training dataset of the five-class model, because we needed to add extra features associated with the categories other than specific memories. This proportional reduction of “specific” features likely affected the sensitivity for detecting specific memories, which decreased in the five-class SVM (62.0%) as compared to the binary model (around 80%).

General discussion

In the present study, we sought (1) to reveal the linguistic features of specific memories on the AMT and (2) to develop an automatic categorization algorithm for memory specificity by employing text-mining and machine-learning approaches. From the morpheme analysis, we learned that specific memories tend to contain a greater number of words than do nonspecific memories, and that some types of morphemes (e.g., past-tense auxiliary verbs) are specifically associated with specific memories. The SVM classifier that we developed showed good performance, with AUCs of .89–.92 in categorizing the specific and nonspecific memories of “unknown” test datasets (Studies 1 and 2), and 64%–65% accuracy in the full five-class categorization (Study 3). These results demonstrate the feasibility of categorizing AMT responses as specific versus nonspecific with a computerized classification algorithm, although the five-class classification, with its more fine-grained distinctions between categories of memories, is more difficult than the binary classification.

Our results have four main implications regarding the linguistic nature of specific AMs. First, most specific memories are written in the past tense. Past-tense expressions formed by the auxiliary verbs “ta” and “da” in Japanese (cf. -ed in English) were observed more than once per specific memory, whereas nonspecific memories contained past-tense expressions only 0.21 times on average. Given that a memory represents a past event, this result is quite evident and straightforward. Viewed the other way around, this finding suggests that a substantial number of “memory” responses had no grammatical indication of past tense. Such responses could be categoric memories, semantic associates, or omissions, which are often described in the present tense (“whenever other people disappoint me,” categoric memory; “I’m a lucky person,” semantic associate) or in an incomplete one-word sentence (“nothing,” omission).

Second, our results suggest that memory responses (specific, extended, and categoric memories) contain a greater amount of information than do nonmemory responses (semantic associates and omissions), as memories are written in longer sentences (i.e., token number) with a richer vocabulary (i.e., type number). Although sentence length can serve as a filter for screening out semantic associates and omissions, the model trained only on sentence length performed less well in differentiating specific memories from the other categories of responses than did the models trained on content and/or function words.

Related to this point, adverbial phrases led by prepositions (particles in Japanese) are important indicators of memory specificity. Some prepositions, such as to and in (“ni” and “de” in Japanese), add detailed information about time and location to a sentence, which in turn increases the probability that a written memory will be classified as specific. For example, a simple sentence such as “I went” can be described in more detail by adding prepositional phrases, as in “I went to the park on February 10th,” which is more likely to be judged as specific. Thus, including information on when, where, and what, along with other modifiers, appears to be a critical factor in the specificity judgment, although the specific objects governed by the prepositions (or particles) are not necessarily important.

Third, the results of the performance tests suggest that most of the information needed for an accurate judgment of memory specificity is already included in the smallest input morpheme vector, as the SVM trained on the input vector of 50 morphemes showed an AUC of .91. Although the prediction performance improved when the feature vectors were extended to 100 and 200 morphemes, the specificity of 80%–90% of memories can be correctly judged from the usage pattern of 50 basic morphemes, consisting of 12 function words (e.g., auxiliaries and particles) and 38 independent words (e.g., adjectives and nouns).

Fourth, the results of the restricted models suggest that the function words are more predictive of AM specificity than are content words. The SVM trained only on content words showed worse performance in Study 2 than in Study 1, whereas the SVM trained only on function words exhibited similar levels of performance between Studies 1 and 2. These results suggest that the versatility of the full SVM on 200 morphemes can be mainly attributed to function words. In other words, memory specificity is more likely to be determined by functional and configurational linguistic features such as verb tense, negation, and the number of particles (how the memory was described) than by the individual meanings of words and phrases (i.e., the actual content of the memory).

Among the content (or independent) words were several time-related words, such as today and everyday, that are predictive of memory specificity. According to the standard AMT coding procedure (Williams & Broadbent, 1986), a specific memory should refer to an event that happened on one particular day. In line with this coding criterion, the words today, the other day, and yesterday were observed more frequently in specific than in nonspecific memories, whereas everyday, life, daily life, and days were observed more often in nonspecific (extended and categoric) than in specific memories. These content words nevertheless add important (although limited) information beyond the function words for the classification of specific and nonspecific memories, because the classification accuracies of the full SVM and the content-word SVM from Study 1 were lower than those of the SVMs optimally trained on the dataset of Study 2. This performance reduction means that for some memories, additional vocabulary and a priori knowledge about the words (e.g., that a wedding ceremony normally does not last longer than a single day) need to be incorporated into the model. This point is particularly important for the five-class classification in Study 3, as the sensitivity for detecting extended memories differed considerably between the two test datasets.

It is also noteworthy that the morphemes corresponding to the cue words (failure, success, contentment, despair, lucky, discontentment, and self-confidence) were listed among the most informative words for distinguishing between specific and nonspecific AMs (see Table 2 and the Appendix). Some cue words (failed and succeeded) were more likely to evoke specific memories, whereas others (discontent, self-assured, promising, and painful) were more strongly connected to nonspecific memories or to unsuccessful retrieval of AMs. Previous studies have already revealed such variability in sensitivity and discriminatory power among different cue words, using item response theory (Griffith, Kleim, et al., 2012; Griffith et al., 2009; Heron et al., 2012). Thus, importantly, SVM learning would be influenced by the set of cue words. Therefore, to some extent, our trained SVM was localized to the cue words used in the present study, although it should be noted that the SVM did show a high level of versatility across the different cue word sets.

From these analyses, it is possible to speculate about the memory–language architecture underlying specific AM retrieval. The influential memory model of Conway and Pleydell-Pearce (2000) suggests that overgeneral memory results from a truncation (or dysfacilitation) of a top-down memory search that gets “stuck” at too high a level of a hierarchical memory structure (Williams et al., 2007). In such a case, the generation phase of retrieval is terminated before a detailed memory representation is formed, and thus only general descriptive information can be accessed (Conway & Pleydell-Pearce, 2000; Williams et al., 2007). Consistent with this theoretical account, our observations suggest that a nonspecific memory has a shorter sentence length and fewer adverbial phrases led by particles (or prepositions) than a specific memory does. The lack of adjectival and adverbial expressions conveying experiential, temporal, and spatial details could reflect the hypothesized truncation of the memory search. However, another possibility remains—namely, that a failure in verbalization, rather than a truncation of retrieval per se, is associated with the less detailed description of a “nonspecific” memory. In the AMT, and particularly in its written version, participants can easily choose to suppress the verbal report of experiential details, even when they have successfully and vividly retrieved such information from memory.

When interpreting our results, the following limitations should be borne in mind. First, our analysis is largely dependent on the structure of the Japanese language. Although we have already confirmed that the same text-mining and machine-learning approach works well for Dutch written memories (Takano, Ueno, Mori, Nishiguchi, & Raes, 2015), future research needs to test whether similar results can be obtained for other languages. Second, computer-based morpheme decomposition is not always accurate. In Japanese, in particular, words are not separated by spaces, unlike in European languages, which sometimes causes errors in detecting individual morphemes. Although the software used in the present study is highly reliable for analyzing formal texts (Higuchi, 2004), it was not able to cover all types of compounds and slang. Third, there is some ambiguity in the original coding rules of the AMT. Particularly in the written version of the AMT, most memory responses consist of only one short sentence (sometimes two). Furthermore, because some memories lack the information needed for accurate coding, human raters often have to use their own judgment. This ambiguity results in coding discrepancies between independent raters. Although interrater reliability normally lies around a 90% concordance rate (kappa = .80) in the written version of the AMT (e.g., Heron et al., 2012; Raes, Hermans, de Decker, Eelen, & Williams, 2003), around 10% of the memories are difficult to categorize even with manual coding. Thus, we speculate that technical errors in creating the input vectors (morpheme detection) and ambiguity in the teacher signals (the categories assigned by manual coding) might have influenced SVM learning, and at least partly contributed to the classification errors observed in our ROC analysis. Fourth, we did not include any questions to screen out participants’ satisficing (e.g., Krosnick, 1991), that is, the tendency to provide merely satisfactory answers without investing sufficient attention and effort. Although we believe that the quality of our language data is supported by the observer ratings, there might have been sampling biases due to such compliance issues (e.g., intentionally shortening a response to complete the survey quickly).

Notwithstanding these limitations, we have demonstrated that computerized classifiers developed from several linguistic features of specific memories achieve high classification accuracy when categorizing a new memory as “specific” or “nonspecific.” This high performance of the SVM suggests that memory specificity can be accurately judged from the usage pattern of a relatively small set of words (around 200 morphemes), implying that memory specificity is primarily determined by the functional and configurational information of a sentence (e.g., auxiliary verbs and particles), and to a lesser extent by the semantic elements of a written memory. As a practical implication, our SVM could contribute to the automation and standardization of AMT coding protocols. As we mentioned in the introduction, an increasing number of studies are using the AMT, given the clinical and theoretical importance of memory specificity. Although high accuracy in the present studies was limited to the two-class (specific vs. nonspecific) classification, the SVM classifier would save expert manpower and enable researchers to make objective and standardized judgments of AMT data. Furthermore, such automation opens new avenues for online delivery of clinical interventions directed at the remediation of overgeneral memory, such as MeST (Raes et al., 2009).