The Effects of Test Trial and Processing Level on Immediate and Delayed Retention

The purpose of the present study was to investigate the effects of test trial and processing level on immediate and delayed retention. A 2 × 2 × 2 mixed ANOVA was used with two between-subject factors of test trial (single test, repeated test) and processing level (shallow, deep), and one within-subject factor of final recall (immediate, delayed). Seventy-six college students were randomly assigned first to the single test trial (studied the stimulus words three times and took one free-recall test) or the repeated test trial (studied the stimulus words once and took three consecutive free-recall tests), and then to the shallow processing level (asked whether each stimulus word was presented in capital or small letters) or the deep processing level (asked whether each stimulus word belonged to a particular category) to study forty stimulus words. The immediate test was administered five minutes after the trials, whereas the delayed test was administered one week later. Results showed that participants in the single test trial recalled more words than those in the repeated test trial on the immediate final free-recall test, and that participants in deep processing performed better than those in shallow processing in both immediate and delayed retention. However, the dominance of single test trial and deep processing did not hold in delayed retention. Additional study trials did not further enhance the delayed retention of words encoded in deep processing, but did enhance the delayed retention of words encoded in shallow processing.

Evidence for the testing effect in promoting learning comes from laboratory studies (e.g., Wheeler, Ewers, & Buonanno, 2003), educationally related studies (e.g., Nungester & Duchastel, 1982), and classroom studies (e.g., Leeming, 2002). Laboratory studies typically use word lists as the material and free recall as the test. For example, Wheeler et al. asked participants to study a 40-word list presented at a rate of one word every 3 seconds. After the first presentation, participants in the repeated test conditions took a recall test in which they wrote down as many of the words as they could recall from the list, and this process was repeated four times with a 1-minute break after each recall test. Participants in the repeated study conditions, by contrast, studied the words again at the same rate after the first presentation, and this process was repeated four times with a 1-minute break after each study. In both the repeated test and repeated study conditions, participants in the 5-min delay conditions took a recall test for the study list after five minutes, and those in the 7-day delay conditions took the recall test after 7 days. Results revealed a large advantage for repeated study trials on the immediate free-recall test, but repeated test trials were found to be favorable on the final free-recall test given a week later.
Other laboratory studies showed how the number of test trials at retrieval affects retention. Roediger and Karpicke (2006a) had participants either study a passage three times and take one test or study a passage once and take three tests. Results showed that those who had one test trial recalled more than those who had three test trials in immediate retention, but the opposite happened in delayed retention. Wheeler and Roediger (1992) also reported that taking three tests immediately after studying a list of pictures greatly improved retention on a final test relative to taking a single test. Dempster (1997) identified two hypotheses to account for the positive effects of test trials on learning. The first hypothesis stated that the testing effect was a result of additional exposure to material and overlearning of the material during the test trials (e.g., Thompson, Wenger, & Bartling, 1978). However, when Roediger and Karpicke (2006b) reviewed experiments with equal exposure to the material in the study trials when participants were asked to study the material several times, and in the test trials when participants were given a test several times, they still found testing effects. In addition, Wheeler et al. found that overlearning of the material with additional studying only produced better retention in the short term than repeated testing did, even though testing produced better long-term retention. If additional exposure and overlearning cannot explain the testing effect, an alternative is needed.
The second hypothesis stated that the testing effect was a result of the retrieval processes that increased the elaboration of a memory trace and multiplied retrieval routes (e.g., Bjork, 1975; Jacoby, 1978). Since recall tests that required production led to greater testing effects than recognition tests that involved identification, Bjork argued that recall tests required greater retrieval effort than recognition tests. The effortful retrieval increased the elaboration of the memory trace and enhanced the testing effect. In addition, McDaniel and Masson (1985) manipulated whether studied words were processed with semantic or phonemic encoding tasks. The testing group was given a first cued-recall test with semantic or phonemic cues that either matched or mismatched the type of encoding, and the control group was dismissed. All subjects took a final cued-recall test the next day. They found that the testing group performed better on the final test when the cues on the first test mismatched the original encoding than when they matched it. The effortful retrieval increased the types of retrieval routes to the memory trace and enhanced the testing effect.
Therefore, effortful retrieval processes that increased the elaboration of a memory trace and multiplied retrieval routes are better able to account for the testing effect.

Test Trial and Processing Level on Retention 130
With a sizable body of research on the testing effect, several variables have been investigated: the material to be learned (e.g., Roediger & Karpicke, 2006a), the format of the test trial and final retention test (e.g., Carpenter & DeLosh, 2006), the feedback received on the test trial (Karpicke & Roediger, 2007), the time interval between study and test trials (e.g., Carpenter & DeLosh, 2005), and the interval before the final retention test (e.g., Wheeler et al., 2003). However, how the material is studied, or encoded, has received less attention.
One way to study the material is to encode it at different levels of processing (LOP). Craik and Tulving (1975) conducted a series of experiments to explore the LOP effect on memory. To process words at different depths, ... To further investigate shallow and deep processing, Morris, Bransford, and Franks (1977) had participants encode words phonemically (shallow processing) or semantically (deep processing). They found that semantic encoding led to greater recognition than phonemic encoding on a standard recognition test. However, phonemic encoding was superior to semantic encoding on a rhyming recognition test. Encoding manipulations that directed subjects to attend to the rhymes of inputs resulted in better performance on a rhyming test than did encoding activities that prompted subjects to process the semantic meaning of inputs.
In a separate study, Kuo and Hirshman (1997) manipulated the LOP (semantic vs. letter) by asking subjects to say aloud a word that was either related in meaning to the initial word (deep processing) or shared the first letter of the initial word (shallow processing). Subjects studied a list consisting of 48 context words (exception words or pronounceable nonsense words) and 19 regular words. A free recall test was given after five minutes.
Results showed that the mean proportions of regular words correctly recalled were significantly higher in the deep processing condition than those in the shallow processing condition. The LOP effect was approximately equal in the nonsense and exception word context.
In addition, Jacoby, Shimizu, Daniels, and Rhodes (2005) investigated if recognition memory was based on trace strength or familiarity, or depth of processing. In Phase 1, subjects made pleasantness judgments for 36 words in one list (deep processing) and vowel judgments (whether a word included an O or U) for 36 words in another list (shallow processing). In Phase 2, subjects received deep and shallow recognition memory tests.
For the deep recognition memory test, words whose pleasantness had been judged were mixed with an equal number of new words (i.e., foils). Subjects were correctly informed that all of the "old" words in the test list were from the pleasantness-judged list. For a separate, shallow recognition memory test, the subjects were correctly informed that all "old" words in the test list were presented in the vowel judgment list. In Phase 3, three types of words appeared in a recognition memory test of foils: 36 deep foils (presented as new items in the deep recognition memory test); 36 shallow foils (presented as new items in the shallow recognition memory test); and 72 new foils (words that were not presented earlier). The subjects were instructed to judge a word as "old" if it had been presented earlier during any phase of the experiment, and to respond "new" only if the word had not been presented earlier. Results showed that attempting to recognize old items that were deeply processed during study resulted in greater depth of processing at retrieval and thus better memory for foils than did attempting to recognize items that were shallowly processed during study. In contrast to formal models of recognition memory that highlighted the importance of quantitative criteria (e.g., strength of global familiarity in the model), specifying the source of old items or source-constrained retrieval could produce a qualitative change in the type of information used for memory judgments.
Only one study was found that examined deep processing and the testing effect. Karpicke and Smith (2012) investigated whether another type of deep processing (elaboration) at encoding contributed to the testing effect of repeated retrieval.
Elaboration is the process of encoding more features or attributes of an event, producing distinctive representations and multiple retrieval routes for later retrieval. They asked participants to learn word pairs across alternating study and test trials. In elaborative study conditions, participants used an imagery-based keyword method, a verbal elaboration method, or a semantic elaboration method to encode items during study trials. In the imagery-based keyword method, a mental image of a meaningful interaction (e.g., an ant drinks poison) between the keyword (ant) and the definition of the vocabulary word (antiar means poison) was produced. In the verbal elaboration method, subjects were shown a word pair (e.g., wingu-cloud) and were told to type a word (e.g., bird or sky) that would help them relate the vocabulary word and the English word (e.g., a bird flying in the sky). In the semantic elaboration method, the word pairs were identical (e.g., castle-castle) so that the production of verbal elaborations relating the identical word pairs would be restricted or prevented. On a criterial test one week after the learning phase, repeated test trials produced better long-term retention than repeated study trials regardless of the elaborative encoding conditions.
Karpicke and Smith (2012) did not find any type of elaborative encoding to be accountable for the testing effect of repeated retrieval. Without comparing shallow processing to deep processing, it is still unclear if LOP would be a factor affecting the benefits of retrieval practice. In addition, when McDaniel and Masson (1985) asked subjects to process words with semantic or phonemic encoding, they found the semantic or phonemic cues on the first test could affect how much subjects remembered on the second test. With additional time exposed to the material in the first test, the testing effect could not be attributed to LOP. Further study is needed to examine whether LOP would be a factor mediating the testing effect.
Therefore, the purpose of the present study was to investigate the effects of test trial and processing level on immediate and delayed retention. Research questions included: (a) Was there any difference between single test trial and repeated test trial on immediate and delayed retention? On the basis of the testing effect, single test trial was expected to enhance immediate retention, whereas repeated test trial was expected to enhance delayed retention (e.g., Wheeler & Roediger, 1992). (b) Was there any difference between shallow and deep processing on immediate and delayed retention? On the basis of the level of processing effect, deep processing was expected to enhance both immediate and delayed retention (e.g., Craik & Tulving, 1975). (c) Was there any interaction between test trial and processing level on final recall? On the basis of previous studies, an interaction among test trial, processing level and final recall was expected (e.g., Karpicke & Smith, 2012; McDaniel & Masson, 1985).

Method

Participants
Seventy-six college students (mean age = 21.3 years; range = 19-27 years; male = 8; female = 68) completed the immediate and delayed tests in partial fulfillment of a psychology course requirement. Initially, ninety-one college students were invited to participate in the present study. Data from fifteen participants were discarded: nine were over 27 years old (excluded to keep the age range within 10 years), two did not show up for the delayed free-recall test, and four did not follow instructions and so did not provide complete data. The procedures met all American Psychological Association (APA) ethical principles for use of human subjects (APA, 2002), and participants provided informed consent in accordance with guidelines set by the Institutional Review Board of the university.

Materials
Forty stimulus words were taken from the words used by Craik and Tulving (1975, Experiment 9; see Table 1). From the MRC Psycholinguistic Database (Wilson, 1998) ...
A 2 × 2 × 2 mixed ANOVA was used with two between-subject factors of test trial (single test, repeated test) and processing level (shallow, deep), and one within-subject factor of final recall (immediate, delayed).
Participants were randomly assigned to the single test and the repeated test trials. They were then randomly assigned again to the shallow processing level and the deep processing level. Therefore, there were 38 participants in each test trial (single test and repeated test) and each processing level (shallow and deep).
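The two-step random assignment described above can be illustrated with a minimal sketch. It assumes equal cell sizes (19 participants in each of the four test trial by processing level cells), which is consistent with, but not stated by, the 38 participants reported per level; the function name, participant IDs, and seed are hypothetical.

```python
import random
from collections import Counter

TEST_TRIALS = ("single", "repeated")
PROCESSING_LEVELS = ("shallow", "deep")


def assign_conditions(participant_ids, seed=0):
    """Randomly assign participants to the four between-subject cells.

    Assumption: cells are filled in equal numbers (not stated in the paper).
    """
    cells = [(t, p) for t in TEST_TRIALS for p in PROCESSING_LEVELS]
    if len(participant_ids) % len(cells) != 0:
        raise ValueError("participants must divide evenly across the cells")
    per_cell = len(participant_ids) // len(cells)
    slots = cells * per_cell          # one slot per participant
    rng = random.Random(seed)         # fixed seed keeps the sketch reproducible
    rng.shuffle(slots)
    return dict(zip(participant_ids, slots))


# Hypothetical participant IDs 1..76
assignment = assign_conditions(list(range(1, 77)))
per_test_trial = Counter(t for t, _ in assignment.values())
per_level = Counter(p for _, p in assignment.values())
per_cell = Counter(assignment.values())
print(per_test_trial)
print(per_level)
```

With 76 participants, this yields 38 per test trial condition and 38 per processing level, matching the marginal counts reported above.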
The design for the test trial was based on that in Roediger and Karpicke (2006a) and Wheeler and Roediger (1992). In the single test trial, participants studied the stimulus words three times and took one free-recall test in ... In the repeated test trial, participants studied the stimulus words once and took three consecutive free-recall tests. In the shallow processing level, participants were asked whether each stimulus word was presented in capital or small letters. In the deep processing level, participants were asked whether each stimulus word belonged to a particular category (see Table 1). The final immediate free-recall test was administered five minutes after the 12 study and test trials, whereas the delayed free-recall test was administered one week later.

Procedure
Participants were tested in groups of five or fewer. They were told to study and recall a list of words, and answer some questions to help them remember the words. The task was programmed by E-prime experimental software (Version 1.1; Schneider, Eschman, & Zuccolotto, 2002). Before the word list was presented, participants were given a practice list of two words to familiarize themselves with the task and the presentation rate, and a practice recall test to familiarize themselves with the testing procedure.
There were a learning phase and a testing phase after the practice. The learning phase consisted of 12 study and test trials and took about 30 minutes. At the beginning of each study trial, participants were asked to rest one hand on a key labeled "yes" and the other on a key labeled "no" on the computer keyboard. First, a "Ready" prompt was shown on the computer screen for 1 s. The typescript question or category question was then shown for 1 s, and participants were asked to answer the question by pressing the appropriate key. The typescript question was asked in the form, "Is the word in capital letter?" or "Is the word in small letter?" The category question was asked in the form, "Is the word (a category)?" Both typescript and category questions were counterbalanced, so that half of the answers to the questions were "yes" and half were "no." The purpose of the question was to induce the participant to process the word at a relatively shallow level (typescript questions) or at a relatively deep level (category questions). Regardless of whether participants answered the typescript or category questions, stimulus words were presented on a computer at 2 s per word, and the screen proceeded to the next word after 2 s. With 40 stimulus words, the total time for one study trial was 80 s.
Participants who did not answer more than 80% of the questions correctly were excluded from the analysis.
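The counterbalancing, the study trial timing, and the accuracy criterion described above can be sketched as follows. This is an illustration under stated assumptions, not the E-Prime script used in the study: the function names are hypothetical, the order of "yes" and "no" answers is assumed to be randomized, and the criterion is read as strictly more than 80% correct.

```python
import random

N_WORDS = 40            # stimulus words per study trial (from the Procedure)
WORD_DURATION_S = 2     # presentation time per word (from the Procedure)


def counterbalanced_answers(n_items, seed=0):
    """Return a shuffled answer key with half "yes" and half "no" answers."""
    if n_items % 2 != 0:
        raise ValueError("n_items must be even to split answers in half")
    answers = ["yes"] * (n_items // 2) + ["no"] * (n_items // 2)
    random.Random(seed).shuffle(answers)   # assumption: random order
    return answers


def study_trial_duration(n_words=N_WORDS, word_duration_s=WORD_DURATION_S):
    """Total word presentation time for one study trial, in seconds."""
    return n_words * word_duration_s


def passes_accuracy_criterion(responses, answer_key, threshold_percent=80):
    """True if strictly more than threshold_percent of responses are correct.

    Assumption: "over 80%" is interpreted as a strict inequality.
    Integer arithmetic avoids floating-point edge cases at exactly 80%.
    """
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return correct * 100 > threshold_percent * len(answer_key)


answer_key = counterbalanced_answers(N_WORDS)
print(answer_key.count("yes"), answer_key.count("no"))
print(study_trial_duration())
```

Under this reading, a participant with exactly 32 of 40 correct answers (80%) would not meet the criterion, whereas one with 33 correct would.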
The beginning of each test trial was indicated by a tone (presented over headphones for 0.5 s) and a "Recall" prompt that remained on the computer screen throughout the test. During each test trial, participants were given 80 s to write down as many of the words as possible, in any order, in a response booklet. Therefore, the time of exposure to materials on study trials and test trials was equated (both were 80 s). The transition from one test trial to another (in the repeated test condition) was indicated by a tone as well as a change in the background color on the computer screen: The background was blue during the first test, green during the second test, and red during the third test. At the end of each test trial, participants were instructed to turn to the next page in their response booklets and not to look back at any of their previous responses at any time during the learning phase.
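The equated timing can be checked with a small sketch of the learning phase schedule. The trial counts (9 study and 3 test trials in the single test condition; 3 study and 9 test trials in the repeated test condition) follow the Discussion, and the three-cycle structure follows the Procedure; the order of trials within a cycle is an assumption, and all names are hypothetical.

```python
TRIAL_DURATION_S = 80   # each study or test trial lasted 80 s
N_CYCLES = 3            # the learning phase comprised three cycles

# Assumption: within a cycle, study trials precede test trials.
CYCLE_PATTERNS = {
    "single": ["study", "study", "study", "test"],
    "repeated": ["study", "test", "test", "test"],
}


def learning_phase_schedule(condition, n_cycles=N_CYCLES):
    """Return the ordered list of trial types for one condition."""
    return CYCLE_PATTERNS[condition] * n_cycles


def summarize(condition):
    """Count trials by type and total the 80-s trial time for a condition."""
    schedule = learning_phase_schedule(condition)
    return {
        "trials": len(schedule),
        "study_trials": schedule.count("study"),
        "test_trials": schedule.count("test"),
        "total_time_s": len(schedule) * TRIAL_DURATION_S,
    }


for condition in CYCLE_PATTERNS:
    print(condition, summarize(condition))
```

Both conditions come to 12 trials and the same total trial time; only the split between study and test trials differs. The sketch counts only the 80-s trials themselves and ignores prompts, questions, and transitions.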
After the learning phase (three cycles comprising 12 study and test trials in total), participants proceeded to the testing phase and were asked to complete mazes for five minutes. Participants were then given an immediate free-recall test to write down as many of the words as they could recall in 10 minutes, and were instructed to draw a line on their recall sheet to mark their progress at one-minute intervals. This procedure ensured that participants had exhausted their knowledge by the end of the 10-minute recall test and allowed the researcher to measure the number of words recalled.
All participants except two returned for the delayed free-recall test one week later. They were given 10 minutes to write down as many of the words as they could recall, and were instructed to draw a line on their recall sheet to mark their progress at one-minute intervals. Finally, participants were asked whether they had expected to be given a test in the second session and whether they had consciously rehearsed the test items after the first session. At the end of the delayed free-recall test, participants were debriefed and thanked for their participation.

Results
The mean number of correct words recalled out of 40 words on the immediate and delayed free-recall tests is presented in Figure 1, as a function of test trial (single test, repeated test) and processing level (shallow, deep).
There was an interaction between final recall and test trial, F(1, 72) = 7.119, p = .009, partial η² = .09. There was an interaction between final recall and processing level, F(1, 72) = 39.454, p < .001, partial η² = .354. No interaction was found between test trial and processing level, F(1, 72) = 1.762, p = .189, partial η² = .024.
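As a consistency check, the reported effect sizes can be recovered from the F ratios and their degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three interactions reported above; the function name and labels are hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error) as reported in the Results
reported = [
    ("final recall x test trial", 7.119, 1, 72),
    ("final recall x processing level", 39.454, 1, 72),
    ("test trial x processing level", 1.762, 1, 72),
]

effect_sizes = {
    label: partial_eta_squared(f, df1, df2) for label, f, df1, df2 in reported
}
for label, value in effect_sizes.items():
    print(f"{label}: partial eta squared = {value:.3f}")
```

Rounded to three decimals, the computed values agree with the reported .09, .354, and .024.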
Simple main effects analysis was then conducted to investigate the interaction between final recall and test trial (...).
Simple main effects analysis was also conducted to investigate the interaction between final recall and processing level (Table 3).
Another simple main effects analysis was conducted to investigate the interaction among final recall, test trial and processing level (Table 4). In single test trial and immediate recall, ...

Discussion
The present study investigated the effects of test trial and processing level on immediate and delayed retention.
There was an interaction between final recall and test trial; between final recall and processing level; and among final recall, test trial and processing level. However, no interaction was found between test trial and processing level.
The finding that participants in single test trial recalled more words than those in repeated test trial on the immediate final free-recall test was consistent with previous studies showing that single test trial produced more short-term benefits than repeated test trial (Roediger & Karpicke, 2006a; Wheeler et al., 2003). However, the dominance of single test trial over repeated test trial in delayed retention differed from previous studies, in which repeated test trial produced more long-term benefits than single test trial (Roediger & Karpicke, 2006a; Wheeler et al., 2003).
Participants in the single test trial were exposed to the words in 9 study trials, and those in the repeated test trial were exposed to the words in 3 study trials. The additional exposure to the words in single test trial may have led to overlearning of the words and better retention on the immediate and delayed tests.
On the other hand, participants in the repeated test trial took 9 test trials and those in the single test trial took 3 test trials. The additional test trials were supposed to give participants in the repeated test trial more retrieval practice. Bjork (1975) stated that the retrieval process increased the elaboration of memory trace and enhanced the testing effect. However, Duchastel (1981) noted that the free-recall test contained no cues to assist participants in answering the test and might therefore result in recall of only part of the contents, or a lesser testing effect. With less exposure time to the words in the study trials and no cues to assist free recall in the test trials, participants in the repeated test trial failed to produce greater benefits on the delayed recall test.
The finding that participants in deep processing performed better than those in shallow processing in both immediate and delayed retention was consistent with previous studies showing that deeper encoding led to higher levels of performance on subsequent retention tests (Craik & Tulving, 1975; Jacoby et al., 2005). The effort participants put forth to decide whether each stimulus word belonged to a particular category promoted deep processing of the words, whereas the effort to decide whether the words were presented in capital letters encouraged shallow processing. Craik and Tulving explained that memory performance depended on the elaborateness of the encoding, and retention was enhanced when the encoding context was more fully descriptive.
Even though no interaction was found between test trial and processing level, there was an interaction among test trial, processing level and final recall. In both shallow and deep processing, participants in single test trial performed better than those in repeated test trial in immediate retention. The advantage of single test trial over repeated test trial carried over to delayed retention in shallow processing, but not in deep processing. Similarly, in both immediate and delayed retention, participants in deep processing performed better than those in shallow processing in repeated test trial. The advantage of deep processing over shallow processing carried over to single test trial in immediate retention, but not in delayed retention.
The most interesting finding was the delayed retention of participants in the deep processing and single test trial. The current study showed a dominance of single test trial over repeated test trial and deep processing over shallow processing in retention. However, the dominance of single test trial and deep processing did not happen in delayed retention.
Even though participants in the repeated test trial were only exposed to the words in 3 study trials, their delayed retention was the same as that of participants in the single test trial, who were exposed to the words in 9 study trials. When participants studied the words in deep processing, the number of study and test trials did not matter. No testing effect was found because the repeated test trial still did not outperform the single test trial in delayed retention. Kang, McDermott, and Roediger (2007) pointed out that testing could be of little help when very few items were successfully retrieved on test trials. A further look at the recall performance in the learning phase found that participants in the repeated test trial did not recall more items than those in the single test trial. Despite the disadvantage of fewer items retrieved in the learning phase, participants in the repeated test trial managed to perform the same as those in the single test trial in delayed retention, but did not perform well enough to produce a testing effect.
Retrieval practice was only beneficial to memory when retrieval was successful.
The advantage of deep processing over shallow processing prevailed when participants only studied the words 3 times in repeated test trial. However, when participants studied the words 9 times in single test trial, the deep processing advantage disappeared in delayed retention. When participants studied more times, the depth of processing did not mediate the delayed retention. Craik (2002) stated that initial encoding determined the potential for later retrieval, while the retrieval environment determined the degree to which that potential would be realized. Deep processing has the potential to assist later performance, but the retrieval environment determines whether that potential is realized. Even though shallow processing does not have the same potential for retrieval, the additional study trials may have improved the odds of successful retrieval.

Conclusion
The present study found the level of processing effect, that is, the superiority of deep processing over shallow processing on subsequent retention tests, but did not find a testing effect, that is, the superiority of repeated testing over single testing on subsequent retention tests. Even though a testing effect was not found in delayed retention, the depth of processing did mediate the delayed retention.
In deep processing, participants managed to perform the same no matter whether they were in single or repeated test trial. It showed that the number of study and test trials did not affect the delayed retention when participants studied the words in deep processing. Once participants established a connection of the word to the category in 3 study trials, the additional 6 study trials did not further enhance retention.
In single test trial when participants studied the words 9 times, they performed the same no matter whether they processed the words in shallow or deep encoding. It showed that the number of study and test trials affected the level of processing effect. Even when participants processed the words in shallow encoding, they could perform as well as those in deep processing when both studied the words 9 times.
In conclusion, additional study trials did not further enhance the delayed retention of words encoded in deep processing, but did enhance the delayed retention of words encoded in shallow processing.

Funding
The author has no funding to report.