We are commonly faced with the situation of trying to recall a set of items learned in a given context, without regard to the order in which the items were experienced. In the laboratory, this type of memory is studied using the free recall task; after studying a list of items (typically words), participants are asked to recall the list items in any order. The unconstrained nature of the free recall task provides a rich source of information on the nature of the retrieval cues used during memory search. The analysis of the dynamics of free recall has revealed the importance of semantic relatedness and temporal contiguity in guiding recall (e.g., Howard & Kahana, 2002; Kahana, 1996). Such analyses have also played an important role in developing and testing theories of memory retrieval (e.g., Davelaar, Goshen-Gottstein, Ashkenazi, Haarmann, & Usher, 2005; Kimball, Smith, & Kahana, 2007; Laming, 2010; Polyn, Norman, & Kahana, 2009; Sederberg, Howard, & Kahana, 2008).

Although much attention has been paid to the way people make transitions from one response to the next, two other components of the recall process are also of critical importance: recall initiation and recall termination. Recall initiation was a major focus of Deese and Kaufman’s (1957) classic study of the serial position effect in free recall. They documented the relation between the recency effect (superior recall of end-of-list items) and participants’ tendency to initiate recall with items from the end of the list. Subsequent analyses of recall initiation have enriched our understanding of the recency effect in both immediate free recall and free recall following various distractor schedules (Bhatarah, Ward, & Tan, 2008; Davelaar et al., 2005; Laming, 1999, 2010; Sederberg et al., 2008).

Much less, however, is known about the factors responsible for recall termination. In a study of interresponse times (IRTs) in free recall, Murdock and Okada (1970) found that the IRT prior to the final correct response tended to be approximately 8–10 s, regardless of how many items the participant recalled. They also showed that IRTs increased exponentially with output position (i.e., the position of a response in the sequence of recalls; see also Polyn et al., 2009; Wixted & Rohrer, 1994). This suggests that participants may terminate recall following a long period in which no new items are successfully retrieved.

Whereas Murdock and Okada (1970) inferred recall termination on the basis of the final correct response given in a fixed recall period, in a more recent study, Dougherty and Harbison (2007) assessed recall termination by asking participants to press a key when they could not remember any additional items. Dougherty and Harbison found that the duration between the last successful retrieval and the termination response (exit latency) decreased as the total number of items recalled increased. Furthermore, the researchers showed that variability in exit latencies was closely related to participants’ decisiveness, with participants who scored high on a decisiveness scale terminating recall more quickly.

An important feature of the recall process not considered in these previous studies concerns the nature of the responses themselves. Although most recalled items are correct responses (i.e., items studied on the target list), participants also occasionally commit errors by recalling items studied on an earlier list but not on the current list (prior-list intrusions), recalling items not studied on the current list or any earlier list (extralist intrusions), or repeating already recalled items. It is known that errors tend to occur late in recall (Kimball et al., 2007; Roediger & McDermott, 1995) and that they elicit subsequent errors (Zaromb et al., 2006). As such, one might hypothesize that whatever process contributes to recall errors may also play a role in recall termination. To test this hypothesis, we asked whether the conditional probability of stopping differed following various types of recall events. Because these events occur with different frequencies during the recall process, we compute these conditional probabilities separately as a function of output position. For this purpose, we have carried out secondary analyses of the raw trial-by-trial data culled from 1,079 participants across 14 large free recall experiments, comprising a total of 28,015 recall trials (for a description of each experiment, see the Appendix). By pooling raw data from so many trials, we were able to look at relatively rare events that happen during recall and to see how these events predicted recall termination. To foreshadow our results, we found that retrieval is more likely to terminate following recall errors than following correct responses, and that this effect appears consistently throughout the recall period. The increased tendency to terminate recall after committing an error varied significantly across the three types of recall errors that we studied: prior-list intrusions, extralist intrusions, and repetitions. 
After reanalyzing these prior data sets, we further validated our results by showing that the same pattern of increased termination following errors can be seen in a single new experiment, reported below.

Meta-analysis methods

We reanalyzed individual-trial data from the 14 experiments listed in Table 1. Our criteria for inclusion were stringent. First, we limited our secondary analysis to studies for which we could obtain individual-trial data for each participant. Second, we required those data to include information on the order of individual responses on each trial, including errors. Third, we excluded studies for which the nature of the recall errors was not classified according to the three key categories: prior-list intrusions, extralist intrusions, and repetitions. Finally, we further limited our analyses to studies reporting the timing of individual responses. Nonetheless, we were able to include data from 10 experimental conditions reported in seven published articles, and an additional 4 studies reported in working papers. In each of the included studies, the lists consisted of between 10 and 25 common words (often nouns selected from the Toronto Word Pool; see Friendly, Franklin, Hoffman, & Rubin, 1982) and recall was vocal, with speech being digitized and latencies recorded. The Appendix provides brief descriptions of the methods used in each of the experiments we analyzed.

Table 1 Summary of experimental conditions

In free recall tasks, participants are typically given a fixed amount of time to recall the list items. As such, these studies do not tell us when recall actually terminates. For example, one may ask whether the recall period has terminated while the participant is still actively recalling words, whether the participant has given up early in the recall interval, or whether the participant is trying hard to recall items, but nothing is coming to mind. Another possibility is that recall terminates because participants have already recalled all of the list items. However, the latter situation almost never happens with the long lists used in these (and most) free recall studies. What we do know, on the basis of recall latencies, is that participants make most of their responses early in the recall period and that the time between successive recalls increases approximately exponentially with output position (Murdock & Okada, 1970; Polyn et al., 2009; Rohrer & Wixted, 1994).

In the present study, we define recall termination as occurring when the time between the last recalled item and the end of the fixed recall period was both longer than every IRT on the current trial and exceeded a criterion of 12 s. This value was chosen to exceed the mean exit latency of 10 s reported by Dougherty and Harbison (2007) in an open-ended retrieval period. Out of 28,015 trials, 18,829 met these criteria (67.21%). In the included trials, there were a total of 127,240 responses: 111,211 (87.40%) were correct, 3,589 (2.82%) were repetitions, 6,000 (4.72%) were prior-list intrusions, and 6,440 (5.06%) were extralist intrusions. Of the prior-list intrusions, 41% had been correctly recalled on their initial presentation list. We repeated our analyses without excluding any trials and obtained nearly identical results.
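The inclusion rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `trial_terminated`, the representation of a trial as a list of response times (in seconds from the start of the recall period), the choice to compute IRTs between successive responses only, and the treatment of trials with no responses are all hypothetical and are not taken from the original analysis code.

```python
def trial_terminated(response_times, recall_period, criterion=12.0):
    """Return True if a trial meets the termination criteria described in the text.

    response_times: times (s) of each response, measured from recall onset (assumed format).
    recall_period:  length (s) of the fixed recall period.
    criterion:      minimum gap (s) between the last response and the end of the period.
    """
    if not response_times:
        # Assumption: a trial without responses is not counted as a termination trial.
        return False
    times = sorted(response_times)
    # Interresponse times between successive responses (assumption: the latency
    # of the first response is not treated as an IRT).
    irts = [later - earlier for earlier, later in zip(times, times[1:])]
    final_gap = recall_period - times[-1]
    # The final gap must exceed the criterion and every IRT on the trial.
    return final_gap > criterion and all(final_gap > irt for irt in irts)
```

For example, with a 45-s recall period, responses at 2, 5, 9, and 20 s leave a 25-s final gap that exceeds both the 12-s criterion and every IRT, so the trial would be included.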

Meta-analysis results

Figure 1A shows the conditional probabilities of recall termination following correct responses and each type of recall error, as a function of output position (for the first eight output positions during recall). We defined recall errors as repetitions of an already recalled item, as prior-list intrusions (PLIs), or as extralist intrusions (ELIs). We determined each participant’s probability of recall termination by dividing, separately for each output position and response type, the number of responses that were the final response in a trial by the total number of responses of that type. When calculating the mean probabilities for each response type and output position, the participants’ data were weighted according to the number of responses they contributed. To assess differences in the probability of recall termination following the four response types, we calculated bias-corrected and accelerated bootstrap 95% (two-tailed) confidence intervals (Efron & Tibshirani, 1993) for all six possible differences at each output position (see Fig. 1B). We considered differences with confidence intervals that did not include zero to be significant.
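The computation of termination probabilities described above might look roughly like the following Python sketch. The data layout (each trial as a participant label plus an ordered list of response-type codes), the function name `termination_probabilities`, and the interpretation of the weights as the number of responses a participant contributed to a given output position and response type are assumptions made for illustration.

```python
from collections import defaultdict

def termination_probabilities(trials):
    """Weighted mean probability that a response is the last one on its trial.

    trials: iterable of (participant, responses), where responses is the ordered
            list of response types on one trial, e.g. ["corr", "corr", "pli"].
    Returns a dict mapping (output_position, response_type) to a probability.
    """
    # counts[participant][(position, type)] = [n_final, n_total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for participant, responses in trials:
        last = len(responses)
        for position, rtype in enumerate(responses, start=1):
            cell = counts[participant][(position, rtype)]
            cell[1] += 1              # one more response of this type at this position
            if position == last:
                cell[0] += 1          # ...and it was the final response of the trial

    weighted_sum = defaultdict(float)
    weight_total = defaultdict(int)
    for cells in counts.values():
        for key, (n_final, n_total) in cells.items():
            p = n_final / n_total     # this participant's termination probability
            weighted_sum[key] += p * n_total   # weight by responses contributed
            weight_total[key] += n_total
    return {key: weighted_sum[key] / weight_total[key] for key in weighted_sum}
```

With weights defined this way, the weighted mean equals the pooled proportion of final responses across participants; a different weighting scheme would give different values.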

Fig. 1

(A) Termination probabilities following correct recall, extralist intrusions, prior-list intrusions, and repetitions (“corr,” “eli,” “pli,” and “rep,” respectively, in panel B): Aggregate data from 14 free recall experiments. (B) Differences in the probabilities of termination, p(t), between the various response types, along with the corresponding 95% (two-tailed) confidence intervals (CIs; determined by bias-corrected and accelerated nonparametric bootstrap: Efron & Tibshirani, 1993). The dashed lines indicate zero difference (CIs that do not include zero indicate statistically significant differences)
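For readers unfamiliar with the bias-corrected and accelerated (BCa) bootstrap named in the caption, the following is a minimal standard-library sketch of the general procedure (Efron & Tibshirani, 1993). It is not the analysis code used for the figure. The function name `bca_interval`, the default of 2,000 resamples, the fixed seed, and the idea of feeding it one difference score per participant are assumptions for illustration only.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Return a (lower, upper) BCa bootstrap confidence interval for stat(data).

    data must be a list with at least two observations.
    """
    rng = random.Random(seed)
    n = len(data)
    norm = NormalDist()
    theta_hat = stat(data)

    # Bootstrap distribution: recompute the statistic on resamples drawn with replacement.
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction: proportion of bootstrap replicates below the observed
    # estimate, expressed on the standard normal scale (clamped away from 0 and 1).
    below = sum(b < theta_hat for b in boot)
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(prop)

    # Acceleration: estimated from leave-one-out (jackknife) values of the statistic.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(p):
        # Shift the nominal percentile using the bias and acceleration terms.
        z = norm.inv_cdf(p)
        adj = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        index = min(max(int(round(adj * (n_boot - 1))), 0), n_boot - 1)
        return boot[index]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)
```

Under this sketch, a difference in termination probability would be treated as significant when the returned interval excludes zero, matching the decision rule stated in the text.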

Across the first eight output positions, which subsume the majority of the recall data across these experiments, participants were more likely to terminate recall following PLIs than following either ELIs or correct responses. For the later output positions (5–8), participants were also very likely to terminate recall following repetitions: The termination probabilities were similar following either repetitions or PLIs and exceeded the probabilities following both ELIs and correct responses. Recall termination rates following ELIs were generally intermediate between the rates for PLIs and correct responses. Recall termination was significantly more likely to occur following ELIs than following correct responses (for Output Positions 3–8) and significantly less likely than recall termination following PLIs (at all output positions) or repetitions (Output Positions 5–8). The pattern of results seen in the figure is thus quite reliable in our large sample of data: People are more likely to terminate recall following errors than following correct responses, and among the errors, recall termination is generally higher following PLIs and repetitions than following ELIs. One exception is the significantly lower probability of terminating recall following repetitions than following PLIs at Output Positions 3 and 4.

To further evaluate these results, we determined for each participant the earliest output position after which recall stopped, and then we aggregated the corresponding probabilities for the different response types across participants (e.g., if one participant always recalled at least four items and another always recalled at least six items, we aggregated the probabilities for Output Positions 4 and 6 for these participants). We aligned output positions at both the individually determined first and last stopping positions, and in both cases, we observed an ordering of probabilities of stopping that was consistent with that shown in Fig. 1: Probabilities of stopping tended to be lowest following correct recalls, greater after ELIs, and even greater after PLIs and repetitions. Repeating all of the above analyses without excluding any trials on the basis of our recall termination criteria yielded virtually identical results.

Although the results of the meta-analysis seem clear, some readers might not be at ease with analyses aggregated across so many diverse data sets. We therefore sought to validate these results in a single, large experiment. Fortuitously, at the time of this writing we were in the midst of conducting a large-scale study on the electrophysiological correlates of memory encoding and retrieval in free recall (Long, Miller, Sederberg, & Kahana, 2011). With 80 participants having completed seven experimental sessions, each involving free recall of 16 study–test lists, we had sufficient power to assess whether the patterns observed in the meta-analysis would replicate in a single study.

Experiment

Method

Participants

A group of 80 participants performed a free recall experiment consisting of one practice session and six subsequent experimental sessions. The participants provided informed consent according to the University of Pennsylvania’s Institutional Review Board protocol and were compensated for their participation. Each session lasted approximately 1.5 h.

Procedure

Each session consisted of 16 lists of 16 words presented one at a time on a computer screen. Each study list was followed by an immediate free recall test, and each session ended with a recognition test. Half of the sessions (randomly chosen) included a final free recall test before recognition, in which participants recalled words from any of the lists from the session. This experiment was part of a larger study that included electroencephalogram recordings and further manipulations of the recognition and recall periods (Long et al., 2011).

Items either were presented with a task cue indicating what judgment the participant should make about the word, or were associated with no encoding task. The two encoding tasks were a size judgment (“Will this item fit into a shoebox?”) and an animacy judgment (“Does this word refer to something living or not living?”), and the current task was indicated by the color and typeface of the presented item. There were three conditions: control lists (no task), task lists (all items were presented with the same task), and task shift lists (individual items were presented with either task). List and task order were counterbalanced across both sessions and participants. Additionally, on the basis of a prior norming study, only words that were clear in meaning and that could be reliably judged in the size and animacy encoding tasks were included in the pool.

Each word was drawn from a pool of 1,638 words. The lists were constructed such that varying degrees of semantic relatedness occurred at both adjacent and distant serial positions. Semantic relatedness was determined using the word association space (WAS) model described by Steyvers, Shiffrin, and Nelson (2004). WAS similarity values were used to group words into four similarity bins (high similarity, cos θ between words > 0.7; medium-high similarity, 0.4 < cos θ < 0.7; medium-low similarity, 0.14 < cos θ < 0.4; and low similarity, cos θ < 0.14). Two pairs of items from each of the four groups were arranged such that one pair occurred at adjacent serial positions and the other pair was separated by at least two other items.
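The four similarity bins can be expressed as a simple threshold function. In this Python sketch the function name `similarity_bin` is hypothetical, and because the text states the bin edges with strict inequalities, the handling of values that fall exactly on an edge (0.14, 0.4, or 0.7) is an assumption: here they are assigned to the lower bin.

```python
def similarity_bin(cos_theta):
    """Map a WAS cosine similarity to one of the four bins described in the text."""
    if cos_theta > 0.7:
        return "high"
    if cos_theta > 0.4:
        return "medium-high"
    if cos_theta > 0.14:
        return "medium-low"
    # Assumption: values at or below 0.14 (including exact edge values) are "low".
    return "low"
```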

Each item was on the screen for 3,000 ms, followed by a jittered 800- to 1,200-ms interstimulus interval (uniform distribution). If the word was associated with a task, participants indicated their response via a keypress. After the last item in the list, there was a 1,200- to 1,400-ms jittered delay, after which a tone sounded, a row of asterisks appeared, and the participant was given 75 s to attempt to recall any of the just-presented items.

Results

Before reporting on recall termination following various types of errors, we will first show the results of more standard analyses applied to this data set. Standard serial position effects were observed, with marked recency, as would be expected in any immediate free recall task, and a moderately strong primacy effect extending about four or five serial positions into the list (Fig. 2A). Related to the recency effect, participants exhibited a strong tendency to begin recall with one of the last few items—a tendency that slowly dissipated across subsequent recalls (Fig. 2B).

Fig. 2

Serial position, temporal contiguity, and semantic contiguity effects for the data from our experiment. The shaded regions are 95% confidence intervals. (A) Probabilities of recall for items in each serial position. (B) Probabilities of recalling presented items in Output Positions 1, 3, 5, and 7; the probability for Output Position 1 represents the probability of first recall (PFR). (C) Lag conditional response probabilities, which show the conditional probabilities of recalling items presented in serial position i + lag, where i is the serial position of the just-recalled item. (D) Semantic conditional response probabilities, which show the conditional probabilities of recalling items from a given level of semantic relatedness

The dynamics of free recall are largely characterized by the contiguity (or lag recency) effect and by the semantic proximity effect: That is, recall of an item tends to be followed by recall of a neighboring or similar item. The contiguity effect in this experiment, as shown in Fig. 2C, showed the usual forward asymmetry (Kahana, 1996). The semantic proximity effect in this experiment, shown in Fig. 2D, was similar whether semantic relatedness was defined by WAS similarity or latent semantic analysis (Landauer & Dumais, 1997). Because the results described above were only minimally affected by the different encoding conditions, we report all analyses collapsed across these conditions.
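The lag conditional response probability (lag-CRP) used to measure the contiguity effect can be sketched as follows, following the general logic of Kahana (1996): each observed transition is counted against all transitions that were still possible at that point in recall. The function name `lag_crp`, the coding of each trial as a list of 1-indexed serial positions with `None` marking intrusions, and the exclusion of transitions to or from intrusions and repetitions are assumptions of this sketch rather than details taken from the analysis reported here.

```python
from collections import Counter

def lag_crp(trials, list_length):
    """Compute lag-CRP values from recall sequences.

    trials: list of recall sequences; each entry is a serial position (1..list_length)
            or None for an intrusion (assumed coding).
    Returns a dict mapping lag to actual transitions / possible transitions.
    """
    actual = Counter()
    possible = Counter()
    for recalls in trials:
        recalled = set()
        prev = None
        for pos in recalls:
            # A response is usable only if it is a list item not yet recalled.
            valid = pos is not None and pos not in recalled
            if valid and prev is not None:
                actual[pos - prev] += 1
                # Every not-yet-recalled list item was a possible next recall.
                for cand in range(1, list_length + 1):
                    if cand not in recalled:
                        possible[cand - prev] += 1
            if valid:
                recalled.add(pos)
                prev = pos
            else:
                prev = None   # skip transitions involving intrusions or repetitions
    return {lag: actual[lag] / possible[lag] for lag in sorted(possible)}
```

For a three-item list recalled in order 1, 2, 3, both transitions have lag +1 and both were available, so the sketch returns a probability of 1.0 at lag +1 and 0.0 at lag +2.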

Recall termination effects were analyzed in the same manner as in the meta-analyses described above. Because there were very few trials with fewer than 4 correct responses (3.7%) or more than 12 correct responses (26%), we limited our analyses to Output Positions 4–12. Of the 9,122 trials, 6,527 met our inclusion criteria for selecting trials in which participants were likely to have terminated recall. The included trials comprised a total of 67,671 responses: 64,348 (95.09%) were correct, 1,570 (2.32%) were repetitions, 563 (0.83%) were prior-list intrusions, and 1,190 (1.76%) were extralist intrusions.

As is shown in Fig. 3A, the tendency to terminate recall was greater following PLIs, ELIs, and repetitions than following correct responses. Furthermore, the ordering of the termination probabilities was identical to that in the aggregate analyses—being highest following PLIs, next highest following repetitions, lower following ELIs, and lowest following correct responses. With the exception of the comparison between ELIs and correct responses, each of the other comparisons was statistically significant in the predicted direction for a majority of output positions between Positions 4 and 12 (see Fig. 3C). Additionally, we performed the previously described aligned output position analysis on these data and observed an ordering of probabilities matching those shown in Fig. 3A. We also repeated the analyses without excluding any trials. As is shown in Figs. 3B and 3D, these results were nearly identical to those based on our trial exclusion criteria.

Fig. 3

Recall termination following errors and correct responses. (A) Termination probabilities following correct recall, extralist intrusions, prior-list intrusions, and repetitions: Data from the reported experiment. (B) Termination probabilities following correct recall, extralist intrusions, prior-list intrusions, and repetitions: Data from the reported experiment with no trials excluded. (C and D) Differences in probabilities of termination, p(t), between the various response types (designated “corr,” “eli,” “pli,” and “rep,” for correct, extralist intrusion, prior-list intrusion, and repetition, respectively) along with the corresponding 95% (two-tailed) confidence intervals (determined by bias-corrected and accelerated nonparametric bootstrap: Efron & Tibshirani, 1993). The dashed lines indicate zero difference. Panel C corresponds to data from panel A, and panel D corresponds to data from panel B

A somewhat unusual feature of the present study, and also of several studies in the meta-analysis described above, was the high level of experience that participants obtained with the free recall task. One may therefore wonder whether these results reflect strategies that developed through extensive practice, or whether they are typical of the results that would be obtained with less highly practiced participants. We addressed this question by separately analyzing data from the first and last sessions of the reported experiment (Sessions 1 and 7). As is shown in Fig. 4, recall termination was more likely after incorrect than after correct responses for both the first and last sessions. Additional analyses revealed that the order of the probabilities for correct responses, ELIs, repetitions, and PLIs matched those shown in Figs. 1 and 3 for both Sessions 1 and 7.

Fig. 4

Termination probabilities following correct recall and incorrect responses. The data are from the first session (S1; filled circles) and the last session (S7; open squares) of the experiment

Discussion

Understanding recall termination is particularly important because whatever accounts for recall termination determines the total number of items that are ultimately recalled. Although previous research has revealed a great deal about how people initiate recall and how they transition between successively recalled items, much less is known about the correlates of recall termination.

Through a reanalysis of individual-trial data from 14 experiments in previous studies, as well as from a newly reported study, we found that termination is consistently more likely to occur after an error than after a correct recall, and that this tendency to terminate recall following an error depends on the kind of error that is made. Recall termination is most likely to follow prior-list intrusions and repetitions of already recalled items, less likely to follow extralist intrusions, and least likely to follow correct responses.

Models of free recall in which retrieval of an item serves as a cue for the next response (e.g., Howard, Kahana, & Wingfield, 2006; Kimball et al., 2007; Metcalfe & Murdock, 1981; Polyn et al., 2009; Raaijmakers & Shiffrin, 1980; Sederberg et al., 2008) have suggested that the increased tendency to terminate recall following errors may reflect a fundamental memory process. Specifically, these models assume that neighboring items are associated during study, and that recall of an item tends to also retrieve items studied in proximate list positions. In this way, the models account for the well-known contiguity effect, which is seen in people’s strong tendency to successively recall items studied in neighboring list positions (Kahana, 1996). By the same logic, these models predict that intrusions will tend to be poor cues for subsequent correct recalls. For example, the recall of an item presented on an earlier list is likely to be a good cue for other items from the prior list, which compete with items from the current list. Zaromb et al. (2006) provided empirical support for this proposition. They found that participants were significantly more likely to commit PLIs following other PLIs, and further, that such intrusions tended to come from the same prior list. Zaromb et al. also found that when an item on the current list had been presented on an earlier list, recall of that item was more frequently followed by a PLI than by recall of a current-list item. Thus, PLIs tend to retrieve contextual information that is inappropriate to the current list, and therefore lead to further recall errors and recall termination. By the same logic, ELIs are also poor recall cues. Although such responses do not evoke specific competition from recently studied items, they are nonetheless poor retrieval cues, insofar as their associated temporal context will not serve as an effective cue for current-list items.

To the extent that the specific competition from recent (prior) list responses is greater than the competition associated with extralist associations (as discussed above), one would expect the probability of termination to be greater following PLIs than following ELIs, as we have observed. It is somewhat less obvious, however, why participants would be nearly as likely to terminate recall following repetitions as following PLIs. Such items do not harbor strong associations to items on earlier lists and are not strongly associated with temporal contexts that are unrelated to other list items. On the other hand, the fact that repetitions had previously been recalled suggests that the list items that were effective at cuing these repeated items were likely to have already been recalled as well. This account also suggests an explanation for the lower probability of terminating recall following repetitions at early output positions, when only a few items have been recalled, since it is likely that items cued by the repeated item are still available for recall, and thus the detrimental effects of repetitions should be limited to later output positions. One might expect that repeating an already recalled item at later output positions would activate other recalled items, which would simply consume retrieval time without leading to another correct response (and thereby leading to increased termination probabilities). The idea that resampling and rejecting previously recalled items consumes retrieval time, and thus predicts recall termination in a fixed-interval task, forms the basis of accounts of the exponential growth of IRTs in free recall (Murdock & Okada, 1970; Rohrer & Wixted, 1994).

The finding that people are more likely to terminate recall following errors than following correct responses, even when controlling for output positions and recall time, adds to a growing body of evidence that recall of an item evokes contextual information previously associated with that item and that this contextual information can serve to either support or hinder subsequent recall (for a review, see Kahana, Howard, & Polyn, 2008). Whereas earlier evidence for this process was based solely on recall transitions, the present study suggests that recall termination depends on the loss of appropriate retrieval cues.

Although the observed pattern of results suggests a causal relation between recall errors and recall termination, one cannot strictly rule out the possibility that these results arise from some other endogenous aspect of the recall process that gives rise to both recall error and recall termination. Future research will be able to better adjudicate between these theoretical accounts by testing sophisticated process models of recall that can simultaneously fit data on recall initiation, recall transitions, and recall termination. The serious consideration of recall termination data in these models will in turn enable the models to speak more clearly to the memory mechanisms that underlie recall impairments in both healthy aging and neurological disease (see, e.g., Dubois & Albert, 2004; Golomb, Peelle, Addis, Kahana, & Wingfield, 2008; Grober, Lipton, Hall, & Crystal, 2000; Kahana, Howard, Zaromb, & Wingfield, 2002).