Intrinsic rewards modulate sensorimotor adaptation

Recent studies have demonstrated that task success signals can modulate behavioral changes during sensorimotor adaptation tasks, primarily through the engagement of explicit processes. In a series of reaching experiments with human participants, we explore a potential interaction between reward-based learning and implicit adaptation, using a method in which feedback is not contingent on task performance. We varied the size of the target to compare conditions in which visual feedback indicated an invariant angular error that either hit or missed the target. Hitting the target attenuated the behavioral changes from adaptation, an effect we attribute to the generation of an intrinsic reward signal. We evaluated two models, one in which reward and adaptation systems operate in parallel, and a second in which reward acts directly on the adaptation system. The results favor the latter, consistent with evidence showing communication, and possible overlap, between neural substrates underlying reward-based learning and sensorimotor adaptation.


INTRODUCTION
Multiple learning systems contribute to successful goal-directed actions in the face of changing physiological states, body structures, and environments (Huberdeau, Krakauer, & Haith, 2015; McDougle, Ivry, & Taylor, 2016; Jordan A. [...]). Among these different learning processes, implicit sensorimotor adaptation is of primary importance for maintaining appropriate calibration of sensorimotor maps over both short and long timescales. A large body of work has focused on how sensory prediction error (SPE), the difference between predicted and actual sensory feedback, drives sensorimotor adaptation (Shadmehr, Smith, & Krakauer, 2010). In addition to sensorimotor adaptation, there is growing awareness of how reward-based learning contributes to motor control. While several recent studies have shown that rewarding successful actions alone is sufficient for the learning of perturbations (Izawa & Shadmehr, 2011; Therrien, Wolpert, & Bastian, 2016), little is known about how rewards impact implicit adaptation. Thus, a central question remains as to how learning systems tuned to SPE versus those tuned to rewards interact during motor tasks.

Despite utilizing very similar task paradigms, initial studies have led to an inconsistent picture of how reward impacts performance in sensorimotor adaptation tasks. For example, in two separate visuomotor rotation studies using similar task paradigms and reward structures, the first study reported no effect of reward on adaptation rates but an enhancement of motor memory due to rewards (Galea, Mallia, Rothwell, & Diedrichsen, 2015), while the second reported a beneficial effect of rewards specifically on adaptation rate (Nikooyan & Ahmed, 2015). In a more recent study, however, manipulation of reward attenuated overall learning (Leow, Marinovic, & Carroll, 2018).

One factor that may contribute to these inconsistencies is highlighted by recent work showing that, even in relatively simple sensorimotor adaptation tasks, overall behavior reflects a combination of explicit and implicit processes (Jordan A. Taylor & Ivry, 2011; Jordan A. Taylor, Krakauer, & Ivry, 2014). Unless the explicit component is directly assayed (Jordan A. [...]), measures of adaptation can be confounded by explicit aiming. That is, while the SPE is thought to drive adaptation (Tseng, Diedrichsen, Krakauer, Shadmehr, & Bastian, 2007), participants are often also consciously aware of the perturbation and decide to aim in order to compensate for it, thereby improving task performance. It may be that reward promotes the activation of explicit processes, which can be more flexibly implemented depending on the task demands (Bond & Taylor, 2015). A recent study provides evidence for this hypothesis (Codol, Holland, & Galea, 2017), showing that at least one of the putative effects of reward, the strengthening of motor memories (Shmuelof et al., 2012), is primarily the result of [...]

[...] net result of these two independent processes (Movement Reinforcement model), or intrinsic reward directly modulates adaptation (Adaptation Modulation model). Our results provide support for the latter, although our model-based analyses suggest there may be a mixture of both mechanisms.

In all experiments we used clamped visual feedback, in which the angular trajectory of a feedback cursor is invariant with respect to the target location and thus spatially independent of hand position (Morehead et al., 2017; Fig. 1a). The instructions emphasized that the participant's behavior would not influence the cursor trajectory: They were to ignore this stimulus and always aim directly for the target. This method allows us to isolate implicit learning from an invariant SPE, eliminating potential contributions from strategic changes that might be used to reduce task performance error.

In Experiment 1, we asked whether hitting the target under conditions in which the feedback is not contingent on behavior would modulate adaptation, based on the assumption that this would be intrinsically rewarding. We tested three groups of participants (n=16/group) with a 3.5° clamp offset for 80 cycles (8 targets per cycle). The purpose of this experiment was to examine the effects of three different relationships between the clamp and target while holding the visual error (defined as the center-to-center distance between the cursor and target) constant (Fig. 1b): Hit Target (when the terminal position of the clamped cursor is fully embedded within a 16 mm diameter target), Straddle Target (when roughly half of the cursor falls within a 9.8 mm target, with the remaining part outside the target), and Miss Target (when the cursor is fully outside a 6 mm target). As seen in Fig. 1d [...]

Experiment 2 was designed to extend the results of Experiment 1 in two ways: First, to verify that the effect of hitting a target generalized to other contexts, we changed the size of the clamp offset. We tested two groups of participants (n=16/group) with a 1.75° clamp offset. For the Hit Target group (Fig. 2a), we used the large 16 mm target, and thus, the cursor was again fully embedded. For the Straddle Target group, we used the small 6 mm diameter target, resulting in an endpoint configuration in which the cursor was approximately half within the target and half outside the target. We did not test a Miss Target condition because having the clamped cursor land fully outside the target would have necessitated an impractically small target (~1.4 mm). Moreover, the results of Experiment 1 indicate that this condition is functionally equivalent to the Straddle Target group. The second methodological change was made to better assess asymptotic adaptation. We increased the number of clamped reaches to each location to 220 (reducing the number of target locations to four to keep the experiment within a 1.5 hour session). This resulted in a nearly three-fold increase in the number of clamped reaches per location.

One alternative is that participants in the Hit Target groups had reduced accuracy demands relative to the other groups, given that they were reaching to a larger target (Soechting, 1984). If the accuracy demands were reduced for these large targets, then the motor command could be more variable, resulting in more variable sensory predictions from a forward model, and thus a weaker SPE (Körding & Wolpert, 2004; see Fig. 3). While we do not have direct measures of planning noise, a reasonable proxy can be obtained by examining movement variability during unperturbed baseline trials (data from clamped trials would be problematic given the induced change in behavior). If there is substantially more noise in the plan for the larger target, then the variability of hand angles should be higher in this group (Churchland, Afshar, & Shenoy, 2006). In addition, one may expect faster movement times (or peak velocities) and/or reaction times for reaches to the larger target, assuming a speed-accuracy tradeoff (Fitts' law; Fitts, 1992).

The experimental design employed in Experiments 1 and 2 cannot distinguish between these two hypotheses because both make similar predictions when the clamp is introduced. In the Movement Reinforcement model, the attenuated asymptote arises because movements are rewarded throughout, including during early learning, biasing future movements towards baseline. The Adaptation Modulation model makes a similar prediction, but here the effect arises because the adaptation system is directly attenuated.

However, a transfer design in which the target size changes after an initial adaptation phase affords an opportunity to contrast the two models. In Experiment 3, we tested a group of participants (n=12) with a 1.75° clamp, using the design depicted in Fig. 5a (Straddle-to-Hit group). In an initial acquisition phase (first 120 clamp cycles), the target was small, such that the clamp always straddled the target. Based on the results of Experiments 1 and 2, we expect to observe a relatively large change in hand angle at the end of this phase. The key test comes during the transfer phase (final 80 clamp cycles), in which the target size is increased such that the invariant clamp now results in a target hit. By the Movement Reinforcement model, hitting the target will produce an intrinsic reward signal, reinforcing the associated movement. Therefore, there should be no change in performance (hand angle) following transfer: The SPE remains the same, and with the introduction of a reward signal, the executed movements would now be reinforced (Fig. 5b). In contrast, the Adaptation Modulation model assumes that the introduction of the reward signal will directly attenuate the output of the adaptation system. As such, this model predicts a marked decay in hand angles following transfer, relative to the initial asymptote.

In addition to the Straddle-to-Hit group described above, we also tested a second group (n=12) in which the large target (reward) was used in the acquisition phase and the small target (no reward) in the transfer phase (Hit-to-Straddle group). Both models make the same predictions for the Hit-to-Straddle group. At the end of the acquisition phase, there should be a relatively small change in hand angle due to the presence of an intrinsic reward signal. Following transfer, the Movement Reinforcement model predicts that, with the switch to the small target, the intrinsic reward signal will now be absent, weakening the contribution of the reward-based system to the motor output. As such, there should be an increase in hand angle following transfer. The Adaptation Modulation model predicts a similar change in behavior due to the removal of the direct inhibitory effect of the reward system on adaptation following transfer. Although this group in isolation does not discriminate between the models, it does provide a second test of each model, as well as an opportunity to rule out alternative hypotheses for the behavioral effects at transfer. For example, the absence of a change at transfer might be due to reduced sensitivity to the clamp following a long initial acquisition phase. With the Hit-to-Straddle group, both models predict a marked increase in hand angle.

For our analyses, we first examined performance during the acquisition phase ([...]

To quantitatively evaluate the Adaptation Modulation model, we simulated the results of the transfer phase of Experiment 3 based on parameters estimated from the acquisition phase of both groups. We fit the data using a single rate state-space model of the following form:

$$x(n+1) = A \cdot x(n) + U(e) \qquad (1)$$

where x represents the motor output on trial n, A is a retention factor, and U represents the update/correction size (or, learning rate) as a function of the error size, e. This model is mathematically equivalent to a standard single rate state-space model (Thoroughman & Shadmehr, 2000), with the only modification being the replacement of the error sensitivity term, B, with a correction size function. Unlike standard adaptation studies where error size changes over the course of learning, however, e is a constant with clamped visual feedback, and therefore U(e) can be fit as a single parameter (for further details, see Kim et al., 2018). We refer to this model as the motor correction variant of the standard state space model.
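To make the behavior of this model concrete, the following is a minimal sketch, assuming the update takes the form x(n+1) = A·x(n) + U(e) with U(e) treated as a single constant under clamped feedback. The function names, the number of simulated steps, and the attenuated correction size are illustrative assumptions, not the fitting code used for the analyses; the values A = .96 and U = .64 are the medians of the original fits quoted later in the text.

```python
def simulate_clamp(A, U, n_steps, x0=0.0):
    """Iterate x(n+1) = A * x(n) + U for a constant (clamped) error."""
    x = [x0]
    for _ in range(n_steps):
        x.append(A * x[-1] + U)
    return x


def predicted_asymptote(A, U):
    """Fixed point of the update rule: x* = U / (1 - A), valid for 0 <= A < 1."""
    return U / (1.0 - A)


# Median values quoted in the text for the original fits.
A_MEDIAN, U_MEDIAN = 0.96, 0.64
# Hypothetical smaller correction size, chosen only to illustrate attenuation.
U_ATTENUATED = 0.32

# Acquisition: start from baseline and iterate until the state settles.
acquisition = simulate_clamp(A_MEDIAN, U_MEDIAN, n_steps=400)
# Transfer: same retention, smaller update, starting from the acquired state.
transfer = simulate_clamp(A_MEDIAN, U_ATTENUATED, n_steps=400, x0=acquisition[-1])

print(round(predicted_asymptote(A_MEDIAN, U_MEDIAN), 2))      # close to 16
print(round(predicted_asymptote(A_MEDIAN, U_ATTENUATED), 2))  # close to 8
```

Under these assumptions the asymptote is U/(1 − A), so with retention held fixed, a smaller correction size lowers both the early rate of change and the final state. A state that starts above the new fixed point decays toward it, which is the qualitative pattern the Adaptation Modulation model predicts for a switch from a straddled to a hit target.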

To estimate A and U(e), we fit the bootstrapped samples of mean behavior, using only the data from the acquisition phase. The model provided good fits of behavioral change during the acquisition phase (Fig. 5f) [...] individual data, reliable differences between groups were observed for U(e) (p = .012), but not A (p = .802). Thus, the analysis of the parameter estimates indicates that reward modulates the error size-dependent motor correction within the adaptation system, effectively reducing the size of the trial-to-trial correction.

To further test whether the effect of intrinsic reward was better explained by a reduction in learning rate rather than a change in retention, we compared models in which only U or only A was free to vary, asking how well these models fit the bootstrapped samples of the acquisition phase data. For the model in which U was a free parameter, we fixed A to its median value from the original fits (.96); for the model in which A was a free parameter, we fixed U to its median value from the original fits (.64 [...]

The estimated parameters for each group's acquisition phase data were then used to predict the transfer performance for the other group. That is, parameter estimates from the Hit-to-Straddle group were used to predict the transfer performance of the Straddle-to-Hit group. In a complementary manner, parameter estimates from the Straddle-to-Hit group were used to predict the transfer performance of the Hit-to-Straddle group. We used all 1000 sets of parameter estimates from each group to generate the mean and variance of the predicted behavior (Fig. 5f). During transfer, the model captures the qualitative change in performance for both groups, with an increase in hand angle for the Hit-to-Straddle group and a decrease in hand angle for the Straddle-to-Hit group. However, the predictions of the model slightly underestimate the observed rates of change for both groups. We return to this issue in the Discussion; for now, we note that the modeling results are consistent with the hypothesis that intrinsic reward directly modulates the adaptation system.

To evaluate this perceptual uncertainty hypothesis, we tested an additional group in Experiment 3 with a large target, but modified the display such that a bright line, aligned with the target direction, bisected the target (Fig. 5a). With this display, the feedback cursor remained fully embedded in the target, but was clearly off-center. If the attenuation associated with the large target is due to perceptual uncertainty, then the inclusion of the bisecting line should produce an adaptation effect similar to that observed with small targets. Alternatively, if perceptual uncertainty does not play a prominent role in the target size effect, then the adaptation effects would be similar to those observed with large targets.

The difference in behavioral change as a function of relative target size was observed across different clamp sizes and did not appear to be due to differences in perceptual sensitivity or motor competence. This was supported by our control analyses, our perceptual control experiment, and our finding that the Straddle group in Experiment 1 was similar to the Miss group, suggesting that the effect of target size was categorical. As such, we assume that the effect of target size on behavior arises from the generation of an intrinsic reward signal, one that is generated when the cursor lands fully within the target. In the final experiment, we explored two ways in which an intrinsic reward signal could impact performance. One hypothesis centered on the idea that reward modulates the strength of movement representations associated with task success, a variant of the idea that reward and SPE engage distinct representations and learning systems (Shmuelof et al., 2012). The other hypothesis considered a more direct modulatory impact on the adaptation process. The results showed that the differences in asymptote cannot be attributed solely to strengthening of rewarded movements. Rather, intrinsic reward directly attenuates the operation of the adaptation system.

We recognize that our interpretation of the results rests on the assumption that "hitting" the target with the cursor is intrinsically rewarding (Huang et al., 2011; Leow et al., 2018). If correct, this assumption holds despite the participants' awareness that the angular motion of the cursor is causally unrelated to their behavior. Our earlier work with clamped feedback had shown that adaptation can be driven by a task-irrelevant error signal, the SPE defined by the difference between the cursor and target. Here we see the automatic operation of an intrinsic reward signal. Of course, we do not have evidence, independent of the behavior, that hitting a target is rewarding; this might require using methods such as fMRI ([...] Thoroughman & Shadmehr, 2000). In the simplest version, these models entail two parameters, a memory term, A, representing the retention of the current state from trial to trial, and a learning rate term, B, representing how the state is updated based on the error from the current trial (the A and U(e) terms in Eq. 1, respectively). Given this framework, we can consider how reward might modulate adaptation. One possibility is that reward modulates retention. This hypothesis is consistent with the results of a recent visuomotor adaptation study comparing groups that either received only cursor feedback or cursor feedback and a monetary reward, scaled to their accuracy. The latter showed greater retention during a washout block in which the feedback was removed (Galea et al., 2015).

In contrast, the clamp method, by eliminating the contribution of strategic processes, allows us to directly examine how reward might influence estimated rates of implicit learning. Here the effect suggests that reward reduces the learning rate, as made salient by the parameter estimates from the acquisition phase of Experiment 3 (see also Leow et al., 2018). A reduction in the learning rate can be conceptualized as a gain factor attenuating the system's response to error. In terms of the standard state space model, this would translate into reducing the system's sensitivity to error; in the motor correction variant of the state space model, this would translate into reducing the amount of change induced by an error of a given size. In either conceptualization, the end result is that in the presence of an intrinsic reward signal, the error-dependent drive is reduced.

The hypothesis that reward attenuates the learning rate within the adaptation system provides a parsimonious account of the data from all three experiments. Following the introduction of the clamped feedback, a lower asymptote was observed in Experiments 1-3 when the cursor hit the target. Assuming the memory process is unaffected, the reduced error-dependent drive will result in a lower asymptote. Moreover, the rate of change in behavior, operationalized here by the early learning rate, should also be lower, a pattern evident in all three experiments, although only statistically significant in [...]

We also recognize that behavioral changes here may reflect the operation of multiple processes, and the composite effects of these processes might account for why the observed changes were more gradual than predicted. For example, intrinsic reward may not only directly modulate adaptation, but may also reinforce an executed movement (Castro, [...]

We thank Matthew Hernandez and Wendy Shwe for assistance with data collection. We are also grateful to Maurice Smith, Ryan Morehead, Guy Avraham, and Ian Greenhouse for helpful discussions regarding this work.

Competing Interests
No competing interests, financial or otherwise, are declared by the authors.

[...] order to increase the overall number of training cycles with the clamp, while keeping the experiment under 1.5 hours, and so that participants would reach a stable asymptote. Participants were instructed to accurately and rapidly "slice" through the target, without needing to stop at the target location. Visual feedback, when presented, was provided during the reach until the movement amplitude exceeded 8 cm. As described below, the feedback either matched the position of the stylus (veridical) or followed a fixed path (clamped). If the movement was not completed within 300 ms, the words "too slow" were generated by the sound system of the computer.

[...] feedback trials, the feedback followed a path that was fixed along a specific hand angle. The radial distance of the cursor from the start location was still based on the radial extent of the participant's hand during the 8 cm outbound segment, but the angular position was fixed relative to the target (i.e., independent of the angular position of the hand).
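The clamp manipulation described above can be illustrated with a short sketch. It assumes positions are expressed in Cartesian coordinates relative to the start location and that a positive offset denotes a counterclockwise clamp; the function name and this sign convention are illustrative assumptions, not a description of the experimental software.

```python
import math


def cursor_position(hand_xy, target_angle_deg, clamp_offset_deg=None):
    """Return the cursor (x, y) position relative to the start location.

    Veridical feedback (clamp_offset_deg is None): the cursor matches the hand.
    Clamped feedback: the radial distance follows the hand, but the angle is
    fixed at target angle + offset, whatever the hand's own direction.
    """
    hand_x, hand_y = hand_xy
    if clamp_offset_deg is None:
        return (hand_x, hand_y)
    radius = math.hypot(hand_x, hand_y)  # radial extent of the hand
    angle = math.radians(target_angle_deg + clamp_offset_deg)
    return (radius * math.cos(angle), radius * math.sin(angle))


# Two reaches of equal extent in different directions, target at 90 deg,
# 3.5 deg clamp: both yield the same cursor position.
reach_a = (50 * math.cos(math.radians(120)), 50 * math.sin(math.radians(120)))
reach_b = (50 * math.cos(math.radians(80)), 50 * math.sin(math.radians(80)))
print(cursor_position(reach_a, 90, 3.5))
print(cursor_position(reach_b, 90, 3.5))
```

Because only the radius depends on the hand, the angular error conveyed by the cursor stays constant across trials, which is what makes the error signal invariant.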

The primary instructions to the participant remained the same across the experimental session: Specifically, that they were to reach directly towards the visual target. Prior to the introduction of the clamped feedback trials, participants were briefed about the feedback manipulation. They were informed that the position of the cursor would now follow a fixed trajectory and that the angular position would be independent of their movement. They were explicitly instructed to ignore the cursor and continue to reach directly to the target. Participants also performed three instructed trials with the clamp perturbation on. During these practice trials, a target appeared at the 90 deg location (straight ahead), and the experimenter instructed the participant to first "reach straight to the left" (i.e., 180 deg). For the second practice trial, the participant was instructed to "reach straight to the right" (0 deg). For the last trial, the participant was instructed to "reach straight down (towards your torso)" (i.e., 270 deg). The purpose of these trials was to familiarize the participant with the exact clamp condition they were about to experience. Following these three practice trials, the experimenter confirmed with the participant they [...]

Participants (n=48, 16/group) were randomly assigned to one of three groups, each training with a 3.5° clamp but differing only in terms of the size of the target: 6, 9.8, or 16 mm diameter. These sizes were chosen so that at an 8 cm radial distance the clamped cursor would be adjacent to the target without making any contact (Miss Target group), straddling the target by being roughly half inside and half outside the target (Straddle Target group), or fully embedded within the target (Hit Target group). The Euclidean distance for this clamp offset, measured from the centers of cursor and target, was 4.9 mm.
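The geometry behind these target sizes can be checked with a short calculation. The sketch below assumes the center-to-center distance is the chord between two points at an 8 cm radius separated by the clamp angle. The cursor diameter is not stated in this section; the 3.5 mm value is an assumption, chosen because it is consistent with the remark that a Miss Target condition at 1.75° would require a target of roughly 1.4 mm. Function names are hypothetical.

```python
import math

REACH_RADIUS_MM = 80.0   # 8 cm radial distance, from the text
CURSOR_DIAM_MM = 3.5     # ASSUMPTION: not given in this section


def center_distance_mm(clamp_deg, radius_mm=REACH_RADIUS_MM):
    """Chord length between cursor and target centers at the given radius."""
    return 2.0 * radius_mm * math.sin(math.radians(clamp_deg) / 2.0)


def classify(clamp_deg, target_diam_mm, cursor_diam_mm=CURSOR_DIAM_MM):
    """Label the endpoint configuration as 'hit', 'straddle', or 'miss'."""
    d = center_distance_mm(clamp_deg)
    target_r, cursor_r = target_diam_mm / 2.0, cursor_diam_mm / 2.0
    if d + cursor_r <= target_r:   # cursor lies entirely inside the target
        return "hit"
    if d - cursor_r >= target_r:   # cursor lies entirely outside the target
        return "miss"
    return "straddle"


for clamp, diam in [(3.5, 16), (3.5, 9.8), (3.5, 6), (1.75, 16), (1.75, 6)]:
    print(clamp, diam, round(center_distance_mm(clamp), 1), classify(clamp, diam))
```

With these assumptions the distances come out near 4.9 mm for the 3.5° clamp and 2.4 mm for the 1.75° clamp, matching the values reported in the text, and each target size falls into the category named for its group.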

The session began with two baseline blocks, the first comprising 5 movement cycles (40 total reaches to 8 targets) without visual feedback and the second comprising 10 cycles with a veridical cursor displaying hand position. The experimenter then informed the participant that the visual feedback would no longer be veridical and would now be clamped at a fixed angle from the target location. Immediately following these general instructions, the experimenter continued providing instructions for the three practice trials which immediately followed (see Experimental Feedback Conditions). After the practice trials and confirming the participant's understanding of the task, the clamp block ensued for a total of 80 cycles. A short break (<1 min), as well as a reminder of the task instructions, was provided after 40 cycles (i.e., at the halfway point of this block). Immediately following the perturbation block, there were two washout blocks, first a 5-cycle block in which there was no visual feedback, followed by 10 cycles with veridical visual feedback. These blocks were preceded by instructions regarding the change in experimental condition and participants were reminded to always aim for the target and to attempt to slice through it with their hand.

Experiment 2
In Experiment 2 we assessed adaptation over an extended number of clamped visual feedback trials. The purpose of extending the perturbation block was to ensure that participants reached asymptotic levels of learning. In order to achieve a greater number of training cycles, we reduced the number of target locations within the set from 8 to 4.

Participants (n=32, 16/group) trained with a 1.75° clamp (2.4 mm distance between target and cursor centers) and were assigned to either a small (Straddle) or large (Hit) target condition. The session started with two baseline blocks, 10 cycles (40 reaches) without visual feedback and then 10 cycles [...]

Individual baseline biases for each target location were subtracted from all data. Biases were defined as the average hand angles across cycles 2-10 (Experiments 1 and 2) or 2-5 (Experiment 3) of the feedback baseline block. These same cycles were used to calculate mean RTs, MTs, and movement variability (SD). To calculate each participant's baseline RT or MT, we took the average of median values at each target location. To calculate each participant's movement variability, we took the average of the standard deviations of hand angles at each target location.
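The baseline measures described above could be computed along the following lines. This is a minimal sketch, assuming each participant's data are stored as a dictionary mapping target location to a list with one value per cycle (cycle 1 at index 0). The function names, the data layout, and the use of the sample (n − 1) standard deviation are assumptions; the text does not specify them.

```python
from statistics import mean, median, stdev


def _window(values, first_cycle, last_cycle):
    """Select cycles first_cycle..last_cycle (1-indexed, inclusive)."""
    return values[first_cycle - 1:last_cycle]


def baseline_bias(angles_by_target, first_cycle=2, last_cycle=10):
    """Mean baseline hand angle per target location."""
    return {t: mean(_window(a, first_cycle, last_cycle))
            for t, a in angles_by_target.items()}


def subtract_bias(angles_by_target, bias):
    """Remove each target's baseline bias from that target's hand angles."""
    return {t: [x - bias[t] for x in a] for t, a in angles_by_target.items()}


def movement_variability(angles_by_target, first_cycle=2, last_cycle=10):
    """Average across targets of the per-target SD of baseline hand angles."""
    return mean(stdev(_window(a, first_cycle, last_cycle))
                for a in angles_by_target.values())


def baseline_timing(times_by_target, first_cycle=2, last_cycle=10):
    """Average across targets of the per-target median RT or MT."""
    return mean(median(_window(v, first_cycle, last_cycle))
                for v in times_by_target.values())


# Toy data: two targets, ten baseline cycles each (degrees).
baseline = {0: [9, 1, 3, 1, 3, 1, 3, 1, 3, 2],
            90: [0, 2, 2, 2, 2, 2, 2, 2, 2, 2]}
bias = baseline_bias(baseline)
corrected = subtract_bias(baseline, bias)
print(bias, movement_variability(baseline))
```

For Experiment 3 the same functions would be called with `last_cycle=5`, matching the cycle range given in the text.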

In order to pool all of the data and to aid visualization, we flipped the hand angles for all participants clamped in the counterclockwise direction.

For Experiments 1 and 3, movement cycles consisted of 8 consecutive reaches (1 reach/target); for Experiment 2, we only used four targets, thus a movement cycle consisted of 4 consecutive reaches (1 reach/target). Early adaptation rate was quantified by averaging the hand angle values over cycles 3-7 of the clamp, and dividing by the number of cycles (i.e., 5) to get an estimate of the per trial rate of change in hand angle. We opted to use this measure of early adaptation rather than obtain parameter estimates from exponential fits since the latter approach gives considerable weight to the asymptotic phase of performance and therefore would be less sensitive to early differences in rate. This would be especially problematic in Experiment 2, which utilized 220 clamp cycles. Asymptotic adaptation was defined as the last 10 cycles within a clamp block. In Experiment 1, the aftereffect was quantified by using the data from the first no-feedback cycle following the last clamp cycle. We also performed a secondary analysis of early adaptation rates using cycles 2-11 (Krakauer, 2005), rather than 3-7. Results from using this alternate metric were consistent with the reported analyses (i.e., slower rates for Hit Target groups), although they yielded larger effect sizes due to the gradually increasing divergence of learning functions.
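The summary measures defined above can be sketched as follows, assuming a participant's clamp-block data have already been reduced to one mean hand angle per cycle (cycle 1 at index 0). The function names and the synthetic input are illustrative only.

```python
from statistics import mean


def flip_for_pooling(hand_angles, clamp_is_ccw):
    """Sign-flip hand angles for participants clamped counterclockwise."""
    return [-a for a in hand_angles] if clamp_is_ccw else list(hand_angles)


def early_adaptation_rate(cycle_means, first_cycle=3, last_cycle=7):
    """Mean hand angle over the early window divided by the window length."""
    window = cycle_means[first_cycle - 1:last_cycle]
    return mean(window) / len(window)


def asymptotic_adaptation(cycle_means, n_last=10):
    """Mean hand angle over the final n_last cycles of the clamp block."""
    return mean(cycle_means[-n_last:])


# Synthetic 80-cycle block in which hand angle grows by 1 deg per cycle.
cycle_means = [float(c) for c in range(1, 81)]
print(early_adaptation_rate(cycle_means))          # window 3-7
print(early_adaptation_rate(cycle_means, 2, 11))   # secondary window 2-11
print(asymptotic_adaptation(cycle_means))
```

The secondary metric reuses the same function with a different window, so the two early-rate estimates differ only in which cycles are averaged and in the divisor.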