When presented with novel activities, we must identify which behaviors or strategies will lead to successful outcomes. In most cases, this is not a matter of identifying a single “perfect” action that will always result in success; instead, many activities demand that behavior display an ongoing degree of variability. In this respect, the mastery of a task consists not only of minimizing “errors,” but also of sustaining appropriate levels of variability in our actions (Stokes, 2001).

In behavioral paradigms, variability has often been characterized in terms of “operant response classes”: Rather than reinforcing a single discrete behavior, schedules instead reinforce those behaviors that belong to broad classes that vary along multiple dimensions, such as timing and response topography. The degree of variability of these various dimensions can also be shaped by feedback (Shahan & Chase, 2002).

Organisms can readily increase or decrease their behavioral variability, whether responses are constrained to a narrow class (Davison & Baum, 2000) or widened to broad conceptual categories (Neuringer & Jensen, 2010). Indeed, behavioral variability seems not only inescapable, but also often manifests at the precise levels appropriate to a given context; response variability not only adapts, but is also adaptive (Neuringer, 2002).

The experimental evidence suggests that strategic increases in variability precede the discovery of new problem-solving strategies. Greater variability during skill acquisition is associated with greater learning (Stokes, Lai, Holtz, Rigsbee, & Cherrick, 2008). This pattern of “variability-as-path-to-discovery” is observed in children mastering grammatical rules (Bowerman, 1982), learning to solve arithmetic problems (Siegler & Jenkins, 1989), or acquiring novel concepts (Goldin-Meadow, Alibali, & Church, 1993). Furthermore, children who use more strategies when first learning a task acquire the correct strategy more often than those who use fewer initial strategies (Siegler, 1995). This has also been seen in adults making novice-to-expert transitions in radiology (Lesgold, Rubinson, Feltovich, Klopfer, & Wang, 1988) or cardiology (Johnson et al., 1981), in whom greater variability precedes acquisition of advanced diagnostic expertise. This makes modulating levels of variability central to exploration/exploitation strategies (March, 1991).

The increase in variability under extinction protocols is well-established (see Balsam, Deich, Ohyama, & Stokes, 1997, for a review), and this may simply be a basic principle of behavior. Natural selection plausibly favors mechanisms for generating variability in the face of failure, and such a mechanism would have relevance to a wide range of problem domains (Neuringer, 2002). In studies of extinction, reported changes in behavior are often more quantitative than qualitative. Neuringer, Kornell, and Olufs (2001) reported, for example, that although extinction increased the frequency of rare response sequences, relatively common sequences were still exhibited more often than their uncommon counterparts. In addition to extinction effects, intermediary levels of variability are observed when reinforcement is reduced without being entirely extinguished. Stahlman and Blaisdell (2011) reported that variation in response form increases as the probability of reinforcer delivery is lowered, as well as when the magnitude of the reward is reduced or the delay to reward delivery is increased.

According to associative theories of Pavlovian conditioning, learning (and the resulting changes in behavior) depends on surprising outcomes that are processed differently from the status quo (Kamin, 1969; Rescorla & Wagner, 1972; Wagner & Brandon, 1989). Surprising events (whether they be positive or negative) thus result in the “prediction error” necessary for discovering causal relationships (Elsner & Hommel, 2004). This behavioral literature complements findings that valence-independent prediction error signals can be observed in the brain (Schultz, 2006; Wang & Tsien, 2011). A Pavlovian account of variability in response to novel events might thus begin by examining whether a prediction error might result in a shift in behavioral variability.

In practice, however, the simple “valence-independent” symmetry of early prediction error models (in which unexpected reinforcement has an effect equivalent to that for an equally unexpected failure to obtain reinforcement) requires revision in order to accommodate the experimental evidence. Even when outcomes are programmed using Pavlovian schedules, infrequent reinforcement corresponds to increased behavioral variability (Stahlman, Young, & Blaisdell, 2010). When reinforcement is infrequent, the overall uncertainty is lower (because most trials are correctly predicted to be unreinforced); despite this, an increase in variability is observed. Whether or not trial-specific prediction errors play a role, results such as these suggest that a degree of “induced variability” can be expected, independent of whether the schedule directly reinforces “functional variability.”

Amsel’s (1992) frustration theory was proposed as a mechanism that may increase variability during extinction and other schedules with downshifted outcomes. If afferent feedback from formerly reinforced responses becomes aversive in extinction, then variants that do not produce this feedback will be negatively reinforced. Additionally, this “nonreward frustration” changes the general stimulus conditions, thus altering the relative strengths of different responses and/or switching attention to different stimuli that might control different responses. Amsel’s account draws on a large body of experimental work (Killeen, 1994) and has been invoked to explain behavior in a wide range of species (Papini, 2002).

Surveying this literature suggests a variety of hypotheses. The classical prediction-error view might suggest that any uncued change in the explicit value of an outcome should impact variability; a more nuanced interpretation might suggest that only the initial (very unexpected) changes might have an effect, as subsequent changes come to be expected (and thus correspond to less dramatic prediction errors). On the other hand, an account that places special importance on downshifts in outcome value (such as Amsel’s frustration theory) might predict increased variability only in cases in which the value of the outcome is reduced. It is also unclear whether extinction differs from more general downshifts, such that reducing an outcome’s value to zero might have an effect distinct from other reductions.

Because the literature has primarily emphasized the effects of low probabilities of informative feedback, we examined the effects of varying explicit reward magnitudes. In our experiment, participants generated arbitrary strings using a keyboard and were presented with different surprising changes in the values of response outcomes. After participants repeatedly earned points (delivered 10 at a time), they experienced one of three conditions: extinction (in which feedback was shifted to 0 points), downshift (in which feedback was shifted to 1 point), or upshift (in which feedback was shifted to 100 points). These shifts in point values were introduced unexpectedly and then, after a brief period, were revoked. We examined response variability as a function of this brief exposure to a surprising condition.

Method

Participants

The participants were 30 Barnard undergraduates (all female) who participated in the experiment to fulfill an introductory psychology class requirement.

Apparatus

The participants made responses using a personal computer enclosed in a 1.5 × 3.5 m experimental room. Participants used a modified QWERTY keyboard, with all of the keys covered except for the Space key, the Enter key, and the eight characters in the string “kl;’m,./” (for clarity, denoted as ABCDEFGH), as depicted in Fig. 1. The eight symbolic keys (that is, those other than the Enter and Space keys) are here collectively referred to as the “Alpha” keys. Any keys unlabeled in Fig. 1 were blocked from view and could not be used.

Fig. 1

Keyboard layout used in the experiment. The key letters are labeled “A” through “H” for clarity. Unlabeled keys were covered and could not be pressed

The apparatus was identical to that used by Stokes, Mechner, and Balsam (1999), who described it in more detail.

Procedure

The participants were randomly assigned to three groups: extinction, downshift, and upshift. These groups underwent identical training before beginning the experimental component. Throughout the experiment, participants were given feedback by means of “points” awarded on the computer screen, presented as black numbers in a white rectangle. Note that an explicit award of 0 points was distinct from receiving no feedback at all.

Training component

Participants were instructed to earn points by pressing keys and to use the ten keys depicted in Fig. 1; aside from these two statements, they were given no further verbal instruction, learning the remaining details of the task by trial and error. During training, each reinforcer was worth 10 points. Their responding was shaped in seven stages, with each stage persisting until 10 reinforcers were collected, except where noted.

  1.

    A blue rectangle was presented on screen. Reinforcement was delivered each time that Enter was pressed.

  2.

    A blue rectangle was presented on screen. After a press to any Alpha key, the upper-right corner of the rectangle turned white. Pressing Enter produced a reinforcer when the white corner was visible.

  3.

    A red rectangle was presented on screen. The rectangle remained red until Space was pressed, which turned the rectangle blue. After any one press to an Alpha key, the blue rectangle’s white corner indicated that a reinforcer could be earned by pressing Enter.

  4.

    Identical to Stage 3, except that at least three Alpha keypresses were required to make the white corner visible. Repeated responses to an Alpha key were counted toward this requirement.

  5.

    Identical to Stage 4, except that at least six Alpha keypresses were required to make the white corner visible.

  6.

    Identical to Stage 5, except that at least ten Alpha keypresses were required to make the white corner visible.

  7.

    Identical to Stage 6, with the exception that the white corner ceased to appear, so participants were no longer given an explicit cue indicating that they had made a sufficient number of responses.

Experimental component

In this component, the white corner never appeared, so participants were not given an explicit cue that they had made a sufficient number of responses. However, if their “response sequence” (the series of responses made between the initial Space response and the final Enter response) contained fewer than ten Alpha keypresses, the task went directly to presenting the red rectangle without providing feedback. As such, participants were still given a cue indicating that at least ten Alpha responses were required. Beyond the requirement that response sequences should consist of at least ten Alpha responses, preceded by a Space and followed by Enter, any combination of responses was permitted, including repeating the same Alpha key ten times. Reinforcement was not contingent on which Alpha responses were emitted.
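For concreteness, the acceptance rule just described can be sketched in code. This is a hypothetical reconstruction, not the software actually used in the experiment; the function name `meets_criterion` and the relabeled key names are ours:

```python
ALPHA = set("ABCDEFGH")  # the eight Alpha keys, relabeled for clarity

def meets_criterion(sequence):
    """Return True if a trial's keypresses satisfy the schedule:
    an initial Space, a final Enter, and at least ten Alpha
    keypresses in between (any combination, repeats allowed)."""
    if len(sequence) < 2 or sequence[0] != "Space" or sequence[-1] != "Enter":
        return False
    return sum(1 for key in sequence[1:-1] if key in ALPHA) >= 10

# Repeating a single Alpha key ten times is acceptable:
meets_criterion(["Space"] + ["A"] * 10 + ["Enter"])  # True
```

Sequences failing this test led directly back to the red rectangle with no feedback, which itself served as the cue that at least ten Alpha responses were required.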

In all three groups, participants made responses until 40 reinforcers had been earned. We will henceforth refer to this as “Phase 1” of the experiment. As in training, each reinforcer earned participants 10 points, so Phase 1 consisted of a total of 400 points earned.

Phase 2 of the experiment was the “surprise phase,” which consisted of ten consecutive reinforcers. In the extinction group, participants were given an explicit reinforcer worth 0 points (although the requirement to emit at least ten responses to receive explicit feedback was still in effect). In the downshift group, participants received reinforcers worth 1 point. In the upshift group, participants received reinforcers worth 100 points.

In Phase 3, participants returned to earning 10 points per reinforcer, as in Phase 1. This persisted for a total of 50 reinforcers, or 500 points. Across all phases, the experimental component consisted of as many trials as were necessary to earn 100 reinforcers.

Results

In order to compare the ten reinforcers during Phase 2 with the 40 reinforcers in Phase 1 (and the 50 reinforcers in Phase 3), the history of responses was divided into “subphases” of ten consecutive reinforcers apiece. From this point forward, “Phase 1” will refer to the 40 reinforcers in their entirety, whereas “Subphase 1–1” will refer to the first ten reinforcers, “Subphase 1–2” to the second ten, and so forth; the same subdivision will be used for Phase 3.
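The subdivision itself is straightforward; as a minimal sketch (a hypothetical helper, not the original analysis code), each phase's reinforcer-by-reinforcer records are simply cut into consecutive blocks of ten:

```python
def subphases(reinforcers, size=10):
    """Split a phase's reinforcer-by-reinforcer records into
    consecutive blocks of ten, matching the subphase scheme
    used in the analyses below."""
    return [reinforcers[i:i + size] for i in range(0, len(reinforcers), size)]

# Phase 1 (40 reinforcers) yields Subphases 1-1 through 1-4:
[len(block) for block in subphases(list(range(40)))]  # [10, 10, 10, 10]
```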

The mean length of the response sequences emitted in each subphase was calculated for each participant. Figure 2 presents the grand means (across participants) of those means. A mixed-model repeated measures analysis of variance (ANOVA) was performed comparing the effects of subphase (within subjects) and condition (between subjects) on participants’ mean string lengths. A significant effect was found for subphase [F(9, 243) = 4.19, p < .0001]. In a post-hoc Tukey test, Subphase 1–1 was found to be significantly different (p < .04) from all other subphases: Participants had significantly shorter string lengths in Subphase 1–1 than in the subsequent subphases of the experiment. Otherwise, no significant differences were found, including any effect resulting from the “surprise” manipulation.

Fig. 2

Mean lengths of strings entered via the keyboard during each phase. Means were calculated for each participant, and grand means were calculated from those means

In addition to performing an analysis comparing the mean lengths in each subphase, we examined the mean of the participant standard deviations for each subphase. Figure 3 shows these across-participant means of the within-subphase standard deviations and suggests a considerable increase in the variance during the surprise manipulation in the extinction and downshift conditions, but no such change in the upshift condition. We performed a mixed-model repeated measures ANOVA comparing the effects of subphase (within subjects) and condition (between subjects) on participants’ within-subphase standard deviations, and found a significant effect for subphase [F(9, 243) = 10.5, p < .0001]. We also found a significant interaction between subphase and condition [F(18, 243) = 4.9, p < .0001]. In a post-hoc Tukey test, we found that the surprise phase in the extinction condition was significantly different (p < .002) from all other subphases, with the exception of Subphase 3–1. In the downshift condition, the surprise phase was significantly different (p < .01) from all subphases except Subphase 1–1. Additionally, a significant difference (p < .03) emerged between Subphases 1–1 and 1–4 in the downshift condition. However, the only significant result from the upshift condition was that Subphase 1–1 differed (p < .02) from all other subphases; the surprise phase in this condition was indistinguishable from any subphase apart from the first.

Fig. 3

Standard deviations (SDs) of string lengths during each phase. SDs were calculated for each participant, and grand means were calculated from those SDs

These results suggest that the surprise manipulation had a distinctive effect on responding, but only when the points awarded were unexpectedly reduced from their previous levels. The return to 10-point rewards in Phase 3 had no discernible effect. However, an increase in the variance of length during the surprise phase might have been independent of increased variability in the content of those strings.

To determine whether the content of the response strings changed as a result of the surprise manipulation in Phase 2, we compared strings in terms of “Levenshtein distance” (Levenshtein, 1966). Levenshtein distance, described in detail in Appendix A, is a metric of the “edit distance” between two strings: the minimum number of discrete operations (insertions, deletions, and substitutions) necessary to change one string into another.

For each participant in each subphase, we calculated the Levenshtein distance between each consecutive pair of strings (first to second, second to third, etc.). We then computed the mean distance in each subphase as an overall estimator of how much participants varied their responses as each subphase progressed. Figure 4 presents these grand means of the within-subjects mean Levenshtein distances for each subphase.
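A standard dynamic-programming implementation of the metric, together with the consecutive-pair summary just described, can be sketched as follows (our own illustrative code; the original analysis scripts are not reproduced here):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b, computed with
    the standard dynamic-programming recurrence (two rows at a time)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from a[:0] to each prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

def mean_consecutive_distance(strings):
    """Mean Levenshtein distance between each consecutive pair of
    response strings (first to second, second to third, ...)."""
    pairs = list(zip(strings, strings[1:]))
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)

levenshtein("kitten", "sitting")  # 3
```

Applying `mean_consecutive_distance` to each participant's strings within a subphase yields the per-subphase estimates that Fig. 4 summarizes.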

Fig. 4

Mean Levenshtein distances between consecutive strings during each phase. These distances were averaged for each participant, and grand means were calculated from those individual means

As with string length, we performed a mixed-model repeated measures ANOVA comparing the effects of subphase (within subjects) and condition (between subjects). We found a significant effect for subphase [F(9, 243) = 5.6, p < .0001], as well as a significant interaction between subphase and condition [F(18, 243) = 3.9, p < .0001]. In a post-hoc Tukey test, we found significant differences in the downshift condition: The surprise phase was significantly different from all other phases (p < .004). In the extinction condition, the surprise phase differed significantly from Subphases 1–1, 1–4, 3–3, 3–4, and 3–5 (ps < .05). All other subphase comparisons, including all comparisons from the upshift condition, were nonsignificant.

In order to confirm that consecutive Levenshtein distances were representative of overall behavioral variability (as opposed to, e.g., merely being the result of switching between two sequences), we calculated the average distance between all pairs of response strings (Pinheiro, de Souza Pinheiro, & Sen, 2005). The resulting means in Phase 2 (μ_ext = 14.00, μ_down = 10.16, μ_up = 5.36) were similar to those in Fig. 4, as were those in Subphase 1–4 (μ_ext = 9.07, μ_down = 6.44, μ_up = 7.26) and Subphase 3–1 (μ_ext = 8.91, μ_down = 5.69, μ_up = 6.64). The mean distance between pairs is a Hoeffding (1948) U statistic of level 2, and as such, its full statistical analysis is beyond the scope of this article. To confirm that these differences were significant, a rank transformation of the means was performed, allowing ANOVA to be used as a robust nonparametric test (Conover & Iman, 1981). When this technique was applied to the data from Phase 2, the significant effect of the change in point value was confirmed [F(2, 26) = 6.3, p < .006].
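The rank-transformation procedure is simple to state: pool all observations, replace each value with its (mid)rank, and then run an ordinary one-way ANOVA on the ranks. A self-contained, pure-Python sketch of this test (illustrative only; the function name is ours) is:

```python
def rank_transform_anova_F(*groups):
    """Conover & Iman (1981) rank transformation: pool all observations,
    replace each with its midrank, then compute the one-way ANOVA F
    statistic on the ranks, yielding a robust nonparametric test."""
    pooled = [x for g in groups for x in g]
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average (1-based) rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    # split the ranks back into their original groups
    out, idx = [], 0
    for g in groups:
        out.append(ranks[idx:idx + len(g)])
        idx += len(g)
    # ordinary one-way ANOVA F statistic, computed on the ranks
    N = len(pooled)
    grand = sum(ranks) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in out)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in out for x in g)
    df_b, df_w = len(out) - 1, N - len(out)
    return (ss_between / df_b) / (ss_within / df_w)

rank_transform_anova_F([1, 2, 3], [10, 11, 12], [20, 21, 22])  # 27.0
```

Because the test operates on ranks rather than raw values, it is insensitive to the non-normality expected of U-statistic means.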

Discussion

We trained participants to enter strings of responses, requiring only that these strings begin with Space, end with Enter, and consist of at least ten intervening Alpha keypresses, chosen from a bank of eight alternatives. Each sequence meeting these criteria was awarded 10 points in the first and third phases of the experimental component. Between these was a “surprise” phase, during which the number of points awarded was changed to one of three values: 0 (in the extinction group), 1 (in the downshift group), or 100 (in the upshift group).

We observed that response variability increased during the surprise phase for the extinction and downshift groups, but not for the upshift group. This is consistent with results showing that “unexpected downshifts” generally elicit variability, while also confirming that a surprising change in rewards is not in itself sufficient to do so. Additionally, the upshift group experienced a tenfold downshift in the value of the reward at the end of the surprise phase, but this did not have any detectable impact on their behavior. This suggests that the unexpected nature of the initial downshift is an important characteristic of the manipulation, because it introduces a new kind of change to the participant’s learning history.

Unlike in traditional extinction schedules, we did not withhold information from participants. Rather than use probabilistic reinforcement (e.g., da Silva Souza, Abreu-Rodrigues, & Baumann, 2010; Stahlman & Blaisdell, 2011) or outright extinction (Kinloch, Foster, & McEwan, 1981; Neuringer et al., 2001), our procedure was more akin to the “successive negative contrast” effects observed when outcomes unexpectedly worsen (Freidin, Cuello, & Kacelnik, 2009). In our extinction group, participants were given explicit feedback that “0 points” were earned upon success, whereas they were given no feedback at all upon failure. Thus, the informative value of the feedback regarding whether each trial was correct was identical across conditions. It is inappropriate to interpret the point values awarded by this feedback as corresponding to their “reinforcement value,” in the classical sense, because the numbers of points awarded were independent of how informative the feedback was about whether a string was deemed acceptable according to the schedule.

Another benefit of the Alpha-sequence paradigm (previously described in Stokes et al., 1999) was that response strings had many degrees of freedom: Given eight Alpha keys, participants had over one billion “ten-response” strings to choose from. Tasks that constrain possible variability can be insensitive to differences in behavior, especially over short windows of time. This is why many extinction studies require hundreds or thousands of trials in order to obtain parametric estimates. Because of the combinatorial growth of possible strings, a sequence of discrete responses can easily entail much greater uncertainty than can any single response sampled from a continuous multivariate space.
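The “over one billion” figure follows directly from the combinatorics: eight alternatives at each of ten positions already exceed a billion strings, before counting the longer sequences that were also permitted.

```python
# Eight Alpha keys, sequences of exactly ten responses:
print(8 ** 10)  # 1073741824 -- over one billion
# This is a lower bound on the available strings, since sequences
# longer than ten responses were also acceptable to the schedule.
```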

For example, spatial tasks (e.g., Stahlman & Blaisdell, 2011) constrain the range of behaviors judged to be “effective” because they only have a few dimensions along which to vary. Stahlman et al. (2010) reported gradual shifts in the standard deviations of recorded behaviors on the order of at most 25 %. By contrast, our reported variability changes were substantial: Our two downshifted groups at least doubled their standard deviations, and did so abruptly (Fig. 3). We hesitate to directly compare the amounts of variability that we observed to those present in animal studies, but the increase in variability that we observed appeared more acute and pronounced than the “induced variability” that arises in more constrained response paradigms.

The Levenshtein distance provided a way to analyze these complex responses. As Nickerson (2002) pointed out, “variability” has several technical definitions. According to his classifications, Levenshtein distance is a measure of compressibility, in contrast to more common measures of behavioral entropy (Neuringer, 2002). Entropy estimates require many observations, because estimators of entropy that are based on observed frequencies are biased, having errors of approximately −(k − 1)/2n for k sequences and n observations (Roulston, 1999). Given a rule of thumb that bias can be kept small by requiring that n ≥ 5k, our surprise phase did not consist of enough observations to obtain a reliable estimate, even in pairs of keypresses (for which k = 8² = 64). Consequently, entropy estimates were not appropriate, given the short duration of our manipulation. By contrast, the Levenshtein distance reliably measures variability in arbitrarily long strings, a property exploited by computational biologists measuring mutation in genetic data (Gusfield, 1997). Future studies examining the effects of schedule- or prediction-error-induced variability can use this metric to complement findings from entropy-based metrics, as well as branch out to paradigms that are ill-suited to entropy estimation.
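The sample-size argument can be made concrete with a short sketch using Roulston's bias approximation and the n ≥ 5k rule of thumb cited above (function name ours; the pair counts are rough, back-of-the-envelope figures):

```python
def plugin_entropy_bias(k, n):
    """Approximate bias (in nats) of the naive plug-in entropy estimator
    for k categories and n observations (Roulston, 1999); the negative
    sign reflects systematic underestimation of the true entropy."""
    return -(k - 1) / (2 * n)

k = 8 ** 2          # 64 possible ordered pairs of Alpha keypresses
n_needed = 5 * k    # rule-of-thumb minimum: 320 observed pairs
# Ten sequences of roughly ten responses yield on the order of 90
# consecutive pairs -- well short of 320 -- so plug-in entropy
# estimates over the surprise phase would carry substantial bias.
print(plugin_entropy_bias(k, 90))
```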

Previous studies have suggested that Pavlovian learning about causal relationships is unlikely to be driven by mere contiguity, but instead depends on statistical contingency (Moore et al., 2009). According to this view, downshifted participants may have increased their variability in order to investigate the relationship between points and the input string.

Such an account still does not eliminate the role of prediction error, however, because the upshift group also experienced a tenfold downshift at the end of the surprise phase, but did not change their behavior at that time. This suggests that increasing variability as a form of “hypothesis testing” may depend on the relative unfamiliarity of the task conditions. This interpretation is compatible with a signal detection account of learning, in which the perceived degree of contingency between outcomes is modeled in psychophysical terms (Allan, Hannah, Crump, & Siegel, 2008).

These results contribute to a growing literature examining how the properties of a conditioned stimulus are interpreted. For example, the information conveyed by a stimulus depends on the multiple layers of conditional probability used by an organism to build predictions (Bromberg-Martin, Matsumoto, & Hikosaka, 2010). Any theory invoking “prediction error” must thus account for the complexity of an organism’s prediction model. Similar results have been observed in studies comparing how different learning histories lead to distinctly different behaviors under an otherwise identical schedule (da Silva Souza et al., 2010; Stokes et al., 1999).

The interpretation that variability arises from primary drives, such as frustration (Amsel, 1992), suggests a very different underlying mechanism, wherein variability manifests explicitly as a component of an exploratory strategy (Freidin et al., 2009). According to this view, the dramatic effect in the downshift groups (as compared to the lack of any effect at all in the upshift group) would not necessarily reflect changes in judgments of response dependency, but might instead point to framing effects: The 100-to-10 transition at the end of the surprise upshift would constitute a return to the norm, rather than an aversive downshift in reward value. An important future direction in identifying these relationships will be to determine the degree to which learning mechanisms (and their resulting behavioral manifestations) depend on both contingency detection and contextual outcome valence (Bromberg-Martin et al., 2010; Wang & Tsien, 2011).

In conclusion, we found that the way in which participants reacted to a surprising change in feedback depended on whether the change improved or worsened conditions. Although this result does not preclude a role for prediction error, it rules out the claim that any unexpected change would be sufficient to induce variability. These results also rule out the claim that increased variability necessarily follows from downshifts, as the 100-to-10-point transition in the upshift group did not result in a change in behavior. In our experiment, both the unexpected nature of the shift and its direction appeared to play a role. This emphasizes the importance of interpreting task cues not according to their objective properties, but rather as they relate to an organism’s learning history.