Timing-dependent valence reversal: a principle of reinforcement processing and its possible implications

Punishment feels bad, but relief upon its termination feels good. As a consequence of such timing-dependent valence reversal, memories of opposite valence can result from associating stimulus A with, for example, the occurrence of punishment (A−) versus punishment termination (−A): A− training results in aversive memory, but −A training in appetitive memory (corresponding effects exist for reward occurrence and termination). Whereas learning through the occurrence of punishment is well studied, much less is known about learning through its termination. Current research investigates how dopaminergic system function contributes to these processes in Drosophila, rats and humans. We argue that dopamine-related psychopathology may entail distortions in learning through punishment termination, and that this may contribute, for example, to non-suicidal self-injury or post-traumatic stress disorder.


Introduction
Avoiding punishment and obtaining reward are evolutionarily deeply rooted behavioral goals. However, half of the processes through which punishment and reward affect behavior are notably understudied. That is, past research has provided detailed insight into how we learn about predictors of the occurrence of punishment and reward [1-4], but much less is known about how we learn about their termination [5]. Since the occurrence of punishment feels bad, but its termination feels good (Figure 1), associating stimuli with these experiences results in memories of opposite valence, namely in aversive and appetitive memory, respectively. Conversely, the occurrence versus the termination of reward supports appetitive versus aversive learning. There are thus four, not just two, predictive relations between stimuli and the reinforcement paired with them (Figure 1) [5]. This reversal in memory valence through a switch in event timing is called timing-dependent valence reversal. The present paper argues that timing-dependent valence reversal reflects a general principle of how reinforcement is processed; it details what has recently been revealed about the role of dopaminergic neurons in timing-dependent valence reversal; and it explores the possible impact of distortions in timing-dependent valence reversal on mental health.
[...] secondary and oppositely-valenced after-affect upon their termination. This goes both for events of a primary-aversive and of a primary-appetitive nature (Figure 1) [6]. These four types of affect can induce memories for stimuli associated with them ([5,7,8] and references therein). Typically, stimuli associated with the occurrence of punishment (A−) acquire negative learned valence, whereas stimuli associated with the termination of punishment (−A) acquire positive learned valence.
Using punishment, this has been observed, for example, in startle-modulation by Pavlovian learning in humans and rats, eyelid conditioning in rabbits, Sidman avoidance learning in rats, and in olfactory as well as visual Pavlovian conditioning in the fly Drosophila. This called for an extension of the threat imminence model [9,10] to include the effects produced upon the termination of punishment ('post-strike' phase: [5]). Likewise, stimuli presented with reward occurrence (A+) and termination (+A) acquire positive and negative learned valence, respectively, as shown for odor-sugar associative learning in honeybees. These findings of timing-dependent valence reversal across species, paradigms, stimulus modalities and valence domains prompted major revisions of the prediction-error learning rule [11]. It remains unresolved how these behavioral effects, which are observed across notably different time scales in the various species and paradigms, relate to molecular coincidence detection and/or to spike-timing-dependent synaptic plasticity observed in the second to millisecond range (discussion in Refs. [5,12]).

Timing-dependent valence reversal in Drosophila
In the most widely used Pavlovian conditioning paradigm for Drosophila, the flies are presented with an odor A followed by electric shock punishment (A−), and a second odor B unpaired from punishment. Such A−/B training results in a relative avoidance of A in a subsequent test of choice between the two odors. By contrast, presenting odor A upon the termination of punishment (−A/B training) leads to a relative approach toward A. Thus, the paradigm is bivalent: the balance in the choice between A and B can be tipped either way, making it possible to reveal learned avoidance or approach, respectively, within the same paradigm. Follow-up experiments have shown that memory after −A training is weaker than after A− training, requires a longer inter-stimulus-interval (i.e. a longer time interval between punishment and A), is best for mild punishment, requires more training trials (cf. Ref. [5]), and increases with punishment duration ([13]; the latter conforms to the classical proposal of Ref. [6]). Memory scores established through A− and through −A training show noncorrelating variation across inbred fly strains derived from the wild, suggesting considerable non-overlap of their genetic determinants [14]; whether the same is the case for memories after −A and A+ training remains to be tested. It was further revealed that odor-specific associative memory shortly after A− training is composed of two components, one that is dependent on Synapsin [15], an evolutionarily conserved presynaptic protein known to regulate synaptic strength, and is susceptible to cold-amnesia (cf. Ref. [5]), and one that is Synapsin-independent and resistant to cold-amnesia. Memory after −A training, by contrast, is only of the former kind, that is Synapsin-dependent and susceptible to cold-amnesia. What is known about the circuit-level localization of the engrams after A− and −A training?
The concerted efforts of the Drosophila community suggest that the engram underlying odor-specific associative short-term memory after A− training can be localized to a single layer of synapses in third-order olfactory neurons of the fly brain (reviewed in Ref. [16]). Specifically, the engram is apparently local to the presynaptic terminals of those mushroom body Kenyon cells that are activated by odor A (Figures 2, S1), which we will call A = KCs. Learned avoidance of A is thought to come about through an odor-specific depression in the excitatory, cholinergic connection of the A = KCs to approach-promoting mushroom body output neurons (MBONs) [17-23]. To the limited extent tested, this same synaptic layer also appears to harbor the engram after −A training [13,15]. Thus, A− training can evidently establish depression whereas −A training might possibly establish potentiation of these A = KC-MBON synapses (indeed, these synapses have been shown to be capable of potentiation upon associative training: [24-26]; also see Refs. [20,27-29]). But how is punishment signaled to the KCs? Answering this question has been facilitated by a breakthrough in understanding the cellular and synaptic architecture of the mushroom body in notable detail and completeness [18,30,31]. This architecture is reviewed here in simplified form.
Figure 1. Timing-dependent valence reversal. Left: The occurrence of punishment induces negative affect (Pain), while its termination induces positive affect (Relief). Likewise, the occurrence versus termination of reward feels good and bad, respectively (not shown). All four of these affective states can enter into association with stimuli paired with them. Right: Of the four predictive relations of stimuli (A) and reward (+) or punishment (−), two are notably understudied (open boxes). Learning through unpaired presentations of stimuli and reinforcement is not covered in this paper (discussion in Ref. [5]).
The KCs receive excitatory, cholinergic input from second-order neurons of multiple sensory modalities, including olfactory projection neurons, establishing a sparse representation of the flies' sensory environment. The KC axons form a parallel bundle intersected by the terminals of ascending modulatory neurons. Typically, these modulatory neurons are either octopaminergic or dopaminergic (OANs, DANs). The modulatory effects exerted on the KCs may be complex, given that multiple types of octopamine and dopamine receptor can be expressed in KCs, and given that the presence of typically more than one morphological type of synaptic vesicle suggests that further signaling molecules might be released as well. In any event, the DANs can be broadly classified as mediating either punishment or reward (a similar situation may be emerging in mammals as well: [32-35]). For a given DAN, the DAN-KC synapses are restricted to only a small region along the KC axons, non-overlapping with neighboring DANs. These regions are also respected by the dendritic branches of the MBONs. This results in a peculiar and valenced compartmental structure of the mushroom body (Figure 2).
It has been found that DANs mediating punishment information target compartments whose MBONs are approach-promoting [19]. This has led to the mentioned working hypothesis that learned avoidance after A− training comes about by a depression of the A = KC-MBON synapses, and learned approach after −A training by their potentiation. In turn, DANs mediating reward information target compartments with avoidance-promoting MBONs, suggesting that the same logic applies for A+ and +A learning. If this were so, and in line with both spike-timing-dependent plasticity (STDP) at the KC-MBON synapse [26] and an earlier theoretical approach [36], the onset and the offset of the activation of a given DAN should confer timing-dependent valence reversal. Is this indeed the case?
116 Pain and aversive motivation
Figure 2. Circuit and microcircuit for associative learning in flies. Left: Simplified working hypothesis of olfactory associative learning in flies. Odor A (orange cloud) activates a subset of KCs (orange fill). In these KCs, a coincidence can be detected with modulatory signaling from DANs that intersect the mushroom body and that convey either punishment or reward information. Such coincidence can lead to a depression of the synapses from the respective KCs to the MBONs, indicated by the irregular star. In the case of A− learning, reduced activation of an approach-mediating MBON would lead to learned avoidance because avoidance tendencies through non-depressed MBONs in other mushroom body compartments would prevail (for more detail see Figure S1). A+ learning is thought to come about in an analogous way. According to this scenario, A− and −A learning would take place by depression and potentiation, respectively, within the same compartment, and the same would be the case, in a separate compartment, for A+ and +A learning. Grey connections between ascending punishment and reward signaling imply the possibility of mutual inhibition between them, and thus of post-inhibitory rebound activation. This mutual inhibition could also be indirect, via for example cross-compartmental feedback from MBONs. It would imply as an alternative scenario that A− and +A learning as well as −A and A+ learning take place in the same compartment, respectively, and through depression in all cases. The organization of innate olfactory, punishment- and reward-related behavior largely bypasses the mushroom body. Please note that for simplicity KC-KC, KC-DAN, DAN-MBON, and MBON-MBON synapses are omitted from this figure; how they contribute to the learning processes discussed in this paper is as yet unknown. Right: Schematic of a mushroom body compartment with the chemical synapses between KCs, DANs and MBONs as determined from electron microscopy (data from Refs. [30,31]). The stippled box indicates that MBON-MBON connections are located largely outside the mushroom body. The fly image is taken from Ref. [51], copyright Elsevier.

Individual DANs can mediate memories of opposing valence, dependent on timing
In Drosophila, transgenic effectors can be expressed permitting non-invasive manipulation of neuronal activity in awake, freely behaving animals. This allows high-temporal-resolution control of activity in hemispherically single, identified DANs with known and bilaterally highly symmetrical synaptic connectivity. This has revealed that the activation of DANs called PPL1-01 can produce timing-dependent valence reversal [13,37] (synonyms for PPL1-01 are: PPL1-γ1pedc, MB-MP1, and MP). That is, similar to what has been found for A− and −A training, presenting odor A with the onset of PPL1-01 activation resulted in learned avoidance of A, whereas presenting odor A with the offset of PPL1-01 activation resulted in learned approach (Figure 3). Thus, the same dopaminergic neurons can mediate memories of opposing valence, depending on the timing of their activation (a similar result was obtained, in larval Drosophila, in the appetitive domain for the DAN called DAN-i1 [38]).
These results are consistent with a scenario whereby the above-mentioned dopaminergic and STDP-like mechanism at the KC-MBON synapse underlies timing-dependent valence reversal. They imply that a given DAN can induce memories through both A− and −A training. However, the observation that different compartments with DANs processing punishment and reward information, respectively, show alternating cycles of activation and inhibition [20] has prompted an alternative scenario [13]. What if PPL1-01 inhibits, possibly via a non-dopaminergic mechanism, a DAN mediating reward signals? Once PPL1-01 activation is terminated that reward-mediating DAN may not only be released from inhibition but may show post-inhibitory rebound activation, which is a frequently observed physiological phenomenon (see following section). Such post-inhibitory rebound activation of a rewarding DAN could produce a depression of the A = KC-MBON synapses, in 'its own' compartment. In other words, the alternative scenario is that A− and −A learning come about through different DANs, and that both do involve the depression of A = KC-MBON synapses yet in different compartments. In this case, and in contrast to the STDP-scenario detailed above, a given DAN would mediate the effects of both A+ and −A training. The proposed inhibition between oppositely-valenced DANs is unlikely to come about by direct chemical synapses, however, as no such connections have yet been revealed [30,31].
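The two scenarios can be contrasted in a deliberately simple toy model. The following Python sketch is an illustration only and is not taken from any of the cited studies: the exponential timing window, the constant `TAU_S`, the magnitudes of depression, potentiation and rebound, and all function names are hypothetical assumptions. Following the convention of Figure 3, negative inter-stimulus-intervals stand for A− training and positive ones for −A training. The sketch shows only that both scenarios predict the same sign of learned behavior while differing in which synapses change and in which direction.

```python
import math

TAU_S = 20.0  # hypothetical decay constant of the timing window (seconds)


def timing_kernel(isi_s):
    """Assumed exponential fall-off of plasticity with the odor-reinforcement interval."""
    return math.exp(-abs(isi_s) / TAU_S)


def scenario_same_compartment(isi_s, depression=1.0, potentiation=0.3):
    """STDP-like scenario: one punishment compartment, weight change of either sign.

    Returns (change at synapses onto the approach-promoting MBON,
             change at synapses onto the avoidance-promoting MBON).
    """
    k = timing_kernel(isi_s)
    if isi_s < 0:   # A- training: odor precedes punishment -> depression
        return (-depression * k, 0.0)
    if isi_s > 0:   # -A training: odor follows punishment -> potentiation
        return (potentiation * k, 0.0)
    return (0.0, 0.0)  # simplification: no change assumed at an interval of exactly zero


def scenario_rebound(isi_s, depression=1.0, rebound=0.3):
    """Rebound scenario: depression only, but in different compartments."""
    k = timing_kernel(isi_s)
    if isi_s < 0:   # A- training: depression in the punishment compartment
        return (-depression * k, 0.0)
    if isi_s > 0:   # -A training: rebound of a reward DAN depresses 'its own' compartment
        return (0.0, -rebound * k)
    return (0.0, 0.0)


def learned_valence(d_approach, d_avoid):
    """Net change in behavioral tendency: approach drive minus avoidance drive."""
    return d_approach - d_avoid


if __name__ == "__main__":
    for isi in (-15.0, 40.0):
        same = learned_valence(*scenario_same_compartment(isi))
        reb = learned_valence(*scenario_rebound(isi))
        print(f"ISI {isi:+.0f} s: same-compartment {same:+.3f}, rebound {reb:+.3f}")
```

Because both functions yield learned avoidance for negative and learned approach for positive intervals, behavioral data alone would not separate the scenarios in this toy model. What differs is the compartment and the sign of the synaptic change, which is what physiological measurements would need to address.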
Timing-dependent valence reversal Gerber et al. 117
Figure 3. Timing-dependent valence reversal in flies. Top: Summary of timing-dependent valence reversal in flies across studies. Different sets of flies received either A− training at the respectively indicated negative inter-stimulus-intervals (ISIs) or −A training (positive ISIs). The lightning bolt indicates the period of electric shock delivery. The memory scores quantify the extent to which, relative to flies receiving unpaired presentations of odor A and shock, A− training leads to learned avoidance of A (negative memory scores), whereas −A training leads to learned approach (positive memory scores). Each dot reflects the median memory score of on average N = 22 sets of n = 150 flies each (total N = 919, total n approx. 140 000). Dots sharing the same color come from the same study. The red and green areas under the curve are meant to facilitate comparison to Figure 1. The raw data and references to their original publication can be found in Table S1. Bottom: Flies expressing the blue-light-gated cation channel ChR2-XXL in the DAN called PPL1-01 were trained as in the top panel, except that instead of electric shock blue light was turned on (blue star and vertical line). This reveals timing-dependent valence reversal through PPL1-01 activation. The box plots represent the median, 25/75, and 10/90% quantiles as middle line, box boundaries, and whiskers, respectively, of on average N = 23 sets of flies with n = 150 animals each (data from Ref. [13]; Table S1). Coloring of the plots indicates significant difference from chance. The inset sketch shows how PPL1-01 innervates the mushroom body (after [18]) (the fly image is taken from Ref. [51], copyright Elsevier).

Timing-dependent valence reversal in rats and humans
A workhorse paradigm for current studies on timing-dependent valence reversal in mammals is the modulation of the evolutionarily conserved auditory startle reflex [39-42]. Rats and humans exhibit potentiated startle in the presence of a visual stimulus A previously paired with mild electric shock punishment (A−), whereas they show attenuated startle after appetitive training (A+). Intense research into the mechanisms of such A− and A+ learning has revealed a conserved role of the amygdala for A− learning and of the dopaminergic inputs to the nucleus accumbens of the striatum for A+ learning.
As startle modulation is a bivalent measure, it lent itself to showing timing-dependent valence reversal in mammals. Indeed, an attenuation of startle and thus positive valence was revealed after −A training (Figure 4), as was a double dissociation of the brain regions for −A and A− learning [43]: in rats, −A learning was impaired by acute inactivation of the ventral striatum but not of the amygdala, whereas A− learning was impaired by inactivation of the amygdala but not of the ventral striatum. Fittingly, functional imaging in humans showed that A activates the ventral striatum upon −A training but not upon A− training, whereas the amygdala was activated by A upon A− but not upon −A training. Recent data from the rat show that it is specifically dopaminergic transmission from the posterior-medial ventral tegmental area to the nucleus accumbens shell that is required for −A learning but dispensable for A− learning [44,45]. Whether dopaminergic transmission can also bring about timing-dependent valence reversal, whether corresponding dissociations can be observed in the appetitive domain (A+ versus +A), what the role of other dopaminergic projections is, and whether the same or different dopaminergic neurons mediate these effects, are questions that remain to be investigated. We note that rebound activation of dopamine neurons upon the termination of aversive stimulation is well documented [35,46-48].

Possible implications
The central basic-research question underlying the above discussion is whether different subsets of dopamine neurons and/or receptors are involved in different aspects of timing-dependent valence reversal and across valence domains. We believe that answering this question will have implications for understanding, and treating, distortions in motivated behavior in humans and its concomitant psychopathologies. For example, 'excessively strong' relief may help in understanding pathological risk-taking or non-suicidal self-cutting as behaviors seeking to bring about relief [49]. But does this indeed come about by exaggerated −A learning, which under normal conditions is a rather modest effect (Figure 4), or rather by blunted A− learning? Or both? Is it related to dopaminergic hyper-function or hypo-function, or maybe to both, in different sets of dopamine neurons and/or employing different dopamine receptors? What does this mean for the treatment of dopaminergic dysfunction? Corresponding questions arise concerning 'excessively weak' relief. That is, maybe the risk of post-traumatic stress disorder is increased when during the post-encounter phase subjects mnemonically focus on the adverse emotions experienced during trauma, rather than on the relief of having survived it [50]? Might a distorted mnemonic focus also contribute to the pathological avoidance behavior characteristic of anxiety disorders? Again, it would have opposing implications depending on whether in either case this comes about by blunted −A learning, by exaggerated A− learning, or both, and whether the affected transmitter systems are hypo-functional, hyper-functional, or, in different sets of cells or in relation to different dopamine receptors, both. In all the above-mentioned cases, the aim for treatment, and possibly the aim for prevention as well, would thus need to be to restore a proper balance between the four types of predictive learning of stimuli and reinforcement (Figure 1).
Figure 4. Timing-dependent valence reversal in rats and humans. Top: Different groups of rats were submitted either to A− training (negative ISIs) or −A training (positive ISIs). A 5-s light was used as stimulus A and combined with a mild electric shock punishment of 0.5 s duration, indicated by the lightning bolt. In two different control groups (open plots), either only stimulus A was presented, or A and punishment were presented in a random temporal relationship. One day later, the effect of stimulus A on the magnitude of the auditory startle response was measured. Plotted is the mean percent difference between startle magnitude in the presence versus in the absence of A. Startle attenuation, that is positive valence, is plotted toward the top of the Y-axis, whereas startle potentiation, that is negative valence, is plotted toward the bottom of the Y-axis. Coloring of the plots indicates significant differences from the median of the random control (stippled line) (data from Refs. [43,44]; Table S1). Other details as in Figure 3, bottom. Bottom: Summary of timing-dependent valence reversal in humans across studies. A− training (negative ISIs) and −A training (positive ISIs) were implemented in either within-subject or between-subject protocols, as detailed in Ref. [52], using 8-s-duration visual geometric shapes displayed on a computer screen as A and 0.2-s-duration mild electric shock as punishment, indicated by the lightning bolt. The magnitude of the startle responses was transformed into standard scores (z-scores). The colored dots indicate the mean results of the respective treatment condition, with startle attenuation, that is implicit positive valence, plotted toward the top of the Y-axis, whereas startle potentiation, that is implicit negative valence, is plotted toward the bottom of the Y-axis. Triangles reflect startle modulation by a within-subject control stimulus that was presented at a very long ISI (15-25 s, dependent on study; see Ref. [52]). The red and green areas under the curve are meant to facilitate comparison to Figure 1. Data are taken from Ref. [52] (Table S1), and reflect the performance of N = 214 subjects in total.

Conclusions
The available evidence suggests to us that timing-dependent valence reversal is a general principle of how punishment and probably also reward are processed. Given that avoiding punishment and obtaining reward are important goals of behavior, we believe that distortions in timing-dependent valence reversal can have a significant, and hitherto insufficiently appreciated, impact on motivated behavior and related psychopathology.
Because dopaminergic system function emerges as a determinant for timing-dependent valence reversal across species, we are optimistic that the understanding, treatment, and prevention of these pathologies will benefit from analyses that focus specifically upon dopamine system function in learning from the occurrence and the termination of punishment and reward.

Conflict of interest statement
Nothing declared.