Neuronal Representation of a Working Memory-Based Decision Strategy in the Motor and Prefrontal Cortico-Basal Ganglia Loops

Abstract While animal and human decision strategies are typically explained by model-free and model-based reinforcement learning (RL), their choice sequences often follow simple procedures based on working memory (WM) of past actions and rewards. Here, we address how working memory-based choice strategies, such as win-stay-lose-switch (WSLS), are represented in the prefrontal and motor cortico-basal ganglia loops by simultaneous recording of neuronal activities in the dorsomedial striatum (DMS), the dorsolateral striatum (DLS), the medial prefrontal cortex (mPFC), and the primary motor cortex (M1). In order to compare neuronal representations when rats employ working memory-based strategies, we developed a new task paradigm, a continuous/intermittent choice task, consisting of choice and no-choice trials. While the continuous condition (CC) consisted of only choice trials, in the intermittent condition (IC), a no-choice trial was inserted after each choice trial to disrupt working memory of the previous choice and reward. Behaviors in CC showed high proportions of win-stay and lose-switch choices, which could be regarded as “a noisy WSLS strategy.” Poisson regression of neural spikes revealed encoding specifically in CC of the previous action and reward before action choice and prospective coding of WSLS action during action execution. A striking finding was that the DLS and M1 in the motor cortico-basal ganglia loop carry substantial WM information about previous choices, rewards, and their interactions, in addition to current action coding.


Introduction
Human and animal decision-making processes can be modeled by reinforcement learning (RL) theory, in which agents update the expected reward for each choice (Sutton and Barto, 1998). However, learning can be more dynamic and hypothesis driven. Under the assumption that only one of two choices is rewarding and the other is not rewarding, win-stay-lose-shift or win-stay-lose-switch (WSLS) is an optimal strategy. WSLS can be implemented with a very high learning rate in model-free reinforcement learning (Barraclough et al., 2004; Ito and Doya, 2009; Ohta et al., 2021), entropy-based metrics (Trepka et al., 2021), or using working memory (WM; Kesner and Churchwell, 2011; Nolen-Hoeksema et al., 2014). The possible contents of WM are previous actions, previous rewards, or prospective actions. Patients with psychiatric disorders or developmental disabilities frequently show abnormal patterns of WSLS (Shurman et al., 2005; Waltz et al., 2011; Prentice et al., 2008; Schlagenhauf et al., 2014), which may be because of disorders of WM (Barch and Ceaser, 2012).
Previous studies tested how the availability of WM affected choice strategies by increasing the number of visual stimuli to remember (Collins and Frank, 2012; Collins et al., 2014), requiring an additional memory task in parallel (Otto et al., 2013a), or inducing acute stress (Otto et al., 2013b). These studies in humans, together with a study using long intertrial intervals (ITIs) in rodents (Iigaya et al., 2018), suggested that choice strategies with intact WM were close to WSLS, whereas strategies under WM disruption became similar to behavior under standard RL.
While the basal ganglia play a major role in model-free reinforcement learning (Samejima et al., 2005; Ito and Doya, 2009, 2015a; Yoshizawa et al., 2018), the neural basis of WM-based decision-making is still unclear. Here, we developed a choice task for rats, in which WM availability was manipulated by inserting a no-choice trial between choice trials. This task addressed how working memory-based choice strategies, such as WSLS, are represented in the prefrontal and motor cortico-basal ganglia loops by simultaneous recording of neuronal activities in the dorsomedial striatum (DMS) and the medial prefrontal cortex (mPFC). These structures form a corticostriatal loop related to goal-directed behaviors (Voorn et al., 2004; Yin et al., 2004, 2005a, b, 2006; Balleine et al., 2007; Balleine and O'Doherty, 2010). The dorsolateral striatum (DLS) and the primary motor cortex (M1) form a corticostriatal loop related to motor actions. While previous studies suggested a major role of the PFC in working memory, the present results suggest the contribution of the motor loop to WM-based decision-making.

Subjects
Male Long-Evans rats (n = 6; 260-310 g body weight; 16-37 weeks old at the first recording session) were housed individually under a light/dark cycle (lights on at 7 A.M., off at 7 P.M.). Experiments were performed during the light phase. Food was provided after training and recording sessions so that body weights decreased no lower than 90% of initial levels. Water was supplied ad libitum. The Okinawa Institute of Science and Technology Graduate University Animal Research Committee approved the study.

Apparatus
All training and recording procedures were conducted in a 40 × 40 × 50 cm experimental chamber placed in a sound-attenuating box (O'Hara & Co). The chamber was equipped with three nose-poke holes on one wall and a pellet dish on the opposite wall (Fig. 1A). Each nose-poke hole was equipped with an infrared (IR) sensor to detect nose entry, and the pellet dish was equipped with an infrared sensor to detect the presence of a sucrose pellet (25 mg) delivered by a pellet dispenser. The chamber top was open to allow connections between electrodes mounted on the rat's head and an amplifier. House lights, two video cameras, two IR LED lights and a speaker were placed above the chamber. A computer program written with LabVIEW (National Instruments) was used to control the speaker and the dispenser and to monitor states of the IR sensors.

Behavioral task
Animals were trained to perform a choice trial and a no-choice trial using nose-poke responses. In either trial type, each trial began with a tone presentation (start tone: 3000 Hz, 1000 ms). When the rat performed a nose-poke in the center hole for 500-1000 ms, one of two cue tones (choice tone: white noise, 1000-1500 ms; no-choice tone: 900 Hz, 1000-1500 ms) was presented (Fig. 1B).
After onset of the choice tone (choice trials), the rat was required to perform a nose-poke in either the left or right hole within 2 s after exiting the center hole. If the rat exited the center hole before the offset of the choice tone, the choice tone was stopped. When the rat nose-poked either the left or right hole, either a reward tone (500 Hz, 1000 ms) or a no-reward tone (500 Hz, 250 ms) was presented probabilistically, depending on the selected action. The reward tone was followed by delivery of a sucrose pellet (25 mg) in the food dish. If the rat did not perform a nose-poke in either the left or right hole within 2 s, the trial was ended as an error trial after presentation of an error tone (9500 Hz, 1000 ms).
After onset of the no-choice tone (no-choice trials), the rat was required to refrain from poking either the left or right hole for 2 s after exiting the center hole. The trial was then correctly finished by presentation of the no-reward tone. In a no-choice trial, the rat could not obtain any pellets. If the rat performed this trial incorrectly, that is, if it poked the left or right hole despite the no-choice tone, the trial was ended as an error trial after presentation of the error tone, and the no-choice trial was repeated in the next trial.
We designed the continuous condition (CC) that consisted only of choice trials, and the intermittent condition (IC) that had a no-choice trial inserted after every choice trial (Fig. 1C). A block is defined as a sequence of trials under the same reward probabilities of either (left, right) = (75%, 25%) or (25%, 75%). The first three blocks in each session were CC and the subsequent two blocks were IC. Reward probabilities of the first block were randomly selected from these two settings for each recording session, and were switched for every subsequent block.
The first (CC) and the third (CC) blocks were terminated when the choice frequency of the 75% reward side in the last 10 choice trials reached 80% (Fig. 2A). The second (CC), fourth (IC), and fifth (IC) blocks were ended when 10 choice trials had been conducted. In this setting, the first 20 choice trials in the second and third CC blocks, and in the fourth and fifth IC blocks, should be comparable: both start from 80% biased choice and switch reward probabilities after 10 choice trials. This set of five blocks was repeated about six times in a 1-day recording session.
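The block-termination rules above can be sketched in a few lines of Python. This is only an illustration, not the LabVIEW control program used in the study: the function name `block_finished`, its arguments, and the requirement that at least 10 choice trials have occurred before the 80% criterion is checked are assumptions made for this sketch.

```python
def block_finished(choices, better_side, block_index, window=10, criterion=0.8):
    """Return True when the current block should end (hypothetical helper).

    choices      -- list of 'L'/'R' choices made so far in this block
                    (choice trials only; no-choice trials are not counted)
    better_side  -- side with the 75% reward probability in this block
    block_index  -- position of the block (1-5) within the five-block set
    """
    if block_index in (1, 3):
        # Blocks 1 and 3 (CC): end when the better side was chosen in at
        # least 80% of the last 10 choice trials.
        recent = choices[-window:]
        if len(recent) < window:
            return False  # assumption: a full window is required
        return recent.count(better_side) / window >= criterion
    # Blocks 2 (CC), 4 and 5 (IC): end after a fixed 10 choice trials.
    return len(choices) >= window


# Example: 8 of the last 10 choices on the better side ends block 1.
example_done = block_finished(list("LLLLRLLRLL"), "L", 1)
```

Under these assumptions, a block of type 1 or 3 has a variable length, whereas blocks 2, 4, and 5 always contain exactly 10 choice trials.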

Surgery
After rats were trained to perform the CC and the IC tasks, they were anesthetized with pentobarbital sodium (50 mg/kg, i.p.) and placed in a stereotaxic frame. The skull was exposed and holes were drilled in the skull over the recording site. Four drivable electrode bundles were implanted and fixed in the DMS in the left hemisphere (1.0 mm posterior, 1.6 mm lateral from bregma, 3.7 mm ventral from the brain surface), the DLS in the right hemisphere (1.0 mm anterior, 3.5 mm lateral from bregma, 3.3 mm ventral from the brain surface), the mPFC in the left hemisphere (3.2 mm anterior, 0.7 mm lateral from bregma, 2.0 mm ventral from the brain surface), and the M1 in the right hemisphere (1.0 mm anterior, 2.6 mm lateral from bregma, 0.4 mm ventral from the brain surface) using pink dental cement with confirmed effects on the brain (Yoshizawa and Funahashi, 2020).
An electrode bundle was composed of eight Formvar-insulated, 25-μm bare diameter nichrome wires (A-M Systems) and was inserted into a stainless-steel guide cannula (0.3 mm in outer diameter; Unique Medical). Tips of the microwires were cut with sharp surgical scissors so that ~1.5 mm of each tip protruded from the cannula. Each tip was electroplated with gold to obtain an impedance of 100-200 kΩ at 1 kHz. Electrode bundles were advanced by 125 μm per recording day to acquire activity from new neurons.

Electrophysiological recordings
Recordings were made while rats performed choice tasks. Neuronal signals were passed through a head amplifier at the head stage and then fed into the main amplifier through a shielded cable. Signals passed through a band-pass filter (50-3000 Hz) to a data acquisition system (Power1401; CED), by which all waveforms that exceeded an amplitude threshold were time-stamped and saved at a sampling rate of 20 kHz. The threshold amplitude for each channel was adjusted so that action potential-like waveforms were not missed while minimizing noise. After a recording session, off-line spike sorting was performed using a template-matching algorithm and principal component analysis with Spike2 (CED) as follows: recorded waveforms were classified into several groups based on their shapes, and a template waveform for each group was computed by averaging. Groups of waveforms that generated templates that appeared to be action potentials were accepted, and others were discarded. Then, to test whether accepted waveforms were recorded from multiple neurons, principal component analysis was applied to the waveforms. Clusters in principal component space were detected by fitting a Gaussian mixture model, and each cluster was identified as signals from a single neuron. This procedure was applied to each 50-min data segment. If stable results were not obtained, data were discarded. Then, gathered spike data were refined by omitting data from neurons that satisfied at least one of the following three conditions: 1. The amplitude of waveforms was <7× the SD of background noise. 2. The firing rate calculated by perievent time histograms (PETHs; from −4.0 to 4.0 s with 100-ms time bins based on the onset of the cue tone, exit from the center hole, or entrance into the left or right hole) was <1.0 Hz for all time bins of all PETHs. 3. The estimated recording site was considered outside the target.
Furthermore, considering the possibility that the same neuron was recorded from different electrodes in the same bundle, we calculated cross-correlation histograms with 1-ms time bins for all pairs of neurons that were recorded from different electrodes in the same bundle. If the frequency at 0 ms was 10× larger than the mean frequency (from −200 to 200 ms, except the time bin at 0 ms) and their PETHs had similar shapes, either one of the pair was removed from the database.
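As a rough illustration of the zero-lag criterion described above, the following Python sketch builds a cross-correlation histogram with 1-ms bins and compares the 0-ms bin with the mean of the other bins. It is not the authors' code: the function names, the use of spike times in milliseconds, and the rounding of lags to the nearest millisecond are assumptions, and the additional requirement that PETHs have similar shapes is not covered.

```python
import bisect
import math


def cross_correlogram(times_a, times_b, max_lag_ms=200):
    """Count spike-time differences (b - a) in 1-ms bins from -max_lag to +max_lag."""
    times_b = sorted(times_b)
    counts = {lag: 0 for lag in range(-max_lag_ms, max_lag_ms + 1)}
    for t in times_a:
        # Restrict the inner loop to spikes of b that can fall inside the window.
        lo = bisect.bisect_left(times_b, t - max_lag_ms - 0.5)
        hi = bisect.bisect_right(times_b, t + max_lag_ms + 0.5)
        for s in times_b[lo:hi]:
            lag = math.floor(s - t + 0.5)  # nearest 1-ms bin
            if lag in counts:
                counts[lag] += 1
    return counts


def zero_lag_ratio(counts):
    """Ratio of the 0-ms bin to the mean of all other bins."""
    others = [c for lag, c in counts.items() if lag != 0]
    mean_other = sum(others) / len(others)
    if mean_other == 0:
        return math.inf if counts[0] > 0 else 0.0
    return counts[0] / mean_other


# Toy spike trains (ms): an identical copy versus a copy shifted by 37 ms.
train = [100.0 * i for i in range(10)]
ratio_same = zero_lag_ratio(cross_correlogram(train, train))
ratio_shifted = zero_lag_ratio(cross_correlogram(train, [t + 37.0 for t in train]))
```

In this toy case only the identical pair exceeds the 10× threshold, so only that pair would be a candidate for removal.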

Histology
After all experiments were completed, rats were anesthetized as described in the surgery section, and a 10-μA positive current was passed for 30 s through one or two recording electrodes of each bundle to mark their final recording positions. Rats were perfused with 10% formalin containing 3% potassium hexacyanoferrate (II), and brains were carefully removed so that the microwires did not cause tissue damage. Sections were cut at 60 μm on an electrofreeze microtome and stained with cresyl violet. Final positions of electrode bundles were confirmed using dots of Prussian blue. The position of each recorded neuron was estimated from the final position and the distance that the bundle of electrodes moved. If the position was outside the DMS, DLS, PL, or M1, recorded data were discarded.

Logistic regression analysis for behavioral data
We performed logistic regression analysis to examine the influence of past actions and outcomes on the next choice. In the regression model, b_i is the regression coefficient for each variable (regressor), a(t) ∈ {1: left, −1: right} is the selected action, and r(t) ∈ {1: rewarded, −1: nonrewarded} is the reward outcome. The parameter c (0 ≤ c ≤ 1) specifies the decay rate of past actions and rewards. For each setting of c, regression coefficients b_i were derived by the "fitglm" function of MATLAB, and the optimal c was selected between 0 and 1 with a line search. The optimal m was determined by comparing adjusted R^2 with c set at the optimal value. The adjusted R^2 became maximal with m = 9.

Poisson regression analysis for neuronal data
We performed Poisson regression analyses to examine what kinds of variables were encoded in neuronal spikes. The first Poisson regression analysis considered a Poisson model in which the number of spikes y at a certain phase is sampled from a Poisson distribution with the expected number of spikes at trial t, m(t):

$$\mathrm{Poi}(y \mid m(t)) = \frac{e^{-m(t)}\, m(t)^{y}}{y!}.$$

m(t) is represented by

$$m(t) = \exp\bigl(b_0 + b_1 b(t) + b_2 a(t) + b_3 r(t) + b_4 c(t) + \log(d(t))\bigr), \quad \text{(A)}$$

where b_i is the regression coefficient for each explanatory variable (regressor). b(t) is the monotonically increasing factor, namely, b(t) = t, which is inserted to capture task-event-independent monotonic increases or decreases in firing pattern. a(t) ∈ {1: contralateral, −1: ipsilateral} is the selected action, r(t) ∈ {1: rewarded, 0: nonrewarded} is reward availability, c(t) ∈ {1: CC, 0: IC} is the task condition, and d(t) is the time duration of a phase. We also analyzed the differential reward responses of neurons following ipsilateral or contralateral action choice using a(t) × r(t) ∈ {1: contralateral rewarded, −1: ipsilateral rewarded, 0: nonrewarded} as a regressor, instead of a(t) and r(t) separately. Optimal regression coefficients are determined so that the objective function, the log likelihood for all trials, is maximized:

$$L(b) = \sum_{t} \log\bigl(\mathrm{Poi}(y(t) \mid m(t))\bigr).$$

Here, b represents a set of coefficients. For this calculation, a function in MATLAB Statistics and Machine Learning Toolbox "fitglm(X, y, 'Distribution', 'poisson')" was used.

Figure 2, continued. ... 10 trials, reward probabilities were reversed. The vertical axis indicates the frequency of the action associated with the higher reward probability in the first 10 blocks. Filled circles and open circles show that the action frequency was significantly different from 0.5 (p < 0.05; Mann-Whitney U test). D, Distributions of the choice probability of 75% reward side in one session of CC (upper) and IC (lower). The optimal action probability is the frequency of selecting the action associated with the larger reward probability in one session. Medians of both distributions are significantly different from 0.5 (p < 0.01 for CC and IC; Mann-Whitney U test). E, Effects of interaction between past actions and outcomes on the subsequent action. The subsequent action was regressed by action in the previous trial and interaction of actions and outcomes in the past nine trials. **p < 0.01, *p < 0.05, Wilcoxon signed-rank test. F, Win-stay lose-switch (WSLS) indices. The horizontal axis represents a win-stay index, the frequency that rats selected the same action after the rewarded trial. The vertical axis represents a lose-switch index, the frequency that rats switched the action after the no-reward trial. WSLS indices of CC sessions are plotted with green dots, while indices of IC sessions are shown with pink dots. Vertical lines in histograms indicate the medians of win-stay or lose-switch probabilities in CC and IC. **p < 0.01, Mann-Whitney U test.
Next, we identified the minimal set of regressors in (A) necessary to predict m(t), using the Bayesian information criterion (BIC):

$$\mathrm{BIC} = -2L(b^{*}) + k \log n,$$

where b* is the optimized b that maximizes the log likelihood L, k is the number of parameters, and n is the number of trials in a session. BIC can be regarded as a fitting measure that takes into account a penalty for the number of parameters (the number of b) in the model.
Better models have smaller BICs. Because the full regression model includes five regressors (including the constant variable for b_0), we can consider 2^5 models for all combinations. We calculated the BIC for all possible models, and then we selected a set of regressors that showed the smallest BIC. Then, we tested the statistical significance of each regression coefficient in the selected model using the regular Poisson regression analysis. If p < 0.01, the corresponding variable was regarded as being coded in the firing rate. This variable selection was conducted independently for each neuron and for each time bin.
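The BIC-based variable selection described above can be sketched as follows. The analyses in this study used MATLAB's fitglm; the pure-Python version below swaps in a small iteratively reweighted least squares (IRLS) fit so that the example is self-contained. All function names, the toy data, and the reduced regressor set (constant, action, reward) are assumptions for illustration, and the subsequent significance test (p < 0.01) is not included.

```python
import itertools
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_poisson(X, y, offset, n_iter=25):
    """Poisson regression with a log link and an offset, fitted by IRLS.

    Returns (coefficients, maximized log likelihood). Collinear regressors
    are not handled (they would make the weighted normal equations singular).
    """
    k = len(X[0])
    eta = [math.log(yi + 0.1) for yi in y]  # common GLM starting point
    beta = [0.0] * k
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        z = [e - o + (yi - mi) / mi for e, o, yi, mi in zip(eta, offset, y, mu)]
        xtwx = [[sum(mi * row[i] * row[j] for row, mi in zip(X, mu))
                 for j in range(k)] for i in range(k)]
        xtwz = [sum(mi * row[i] * zi for row, mi, zi in zip(X, mu, z))
                for i in range(k)]
        beta = solve(xtwx, xtwz)
        eta = [sum(b * x for b, x in zip(beta, row)) + o
               for row, o in zip(X, offset)]
    mu = [math.exp(e) for e in eta]
    loglik = sum(yi * e - mi - math.lgamma(yi + 1)
                 for yi, e, mi in zip(y, eta, mu))
    return beta, loglik


def best_subset_by_bic(columns, y, offset):
    """Fit every subset of regressors and return (smallest BIC, subset)."""
    names, n, best = list(columns), len(y), None
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = [[columns[name][t] for name in subset] for t in range(n)]
            _, loglik = fit_poisson(X, y, offset)
            bic = -2.0 * loglik + len(subset) * math.log(n)
            if best is None or bic < best[0]:
                best = (bic, subset)
    return best


# Toy data (hypothetical): spike counts depend on the action only.
n_trials = 40
action = [1 if t % 4 < 2 else -1 for t in range(n_trials)]
reward = [1 if t % 2 == 0 else 0 for t in range(n_trials)]
spikes = [8 if a == 1 else 2 for a in action]
log_duration = [0.0] * n_trials  # log(d(t)) with d(t) = 1
columns = {"const": [1.0] * n_trials, "a": action, "r": reward}
best_bic, best_regressors = best_subset_by_bic(columns, spikes, log_duration)
```

With these toy counts, the reward regressor adds a parameter without improving the fit, so the subset containing only the constant and the action has the smallest BIC.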
Then, to find neurons coding actions, rewards, interactions between actions and rewards, or prospective action under a WSLS strategy, we used a full model with the regressors L1(t), L0(t), R1(t), and R0(t). L1(t) is a variable that takes 1 if the rat selected left and was rewarded in trial t; otherwise, it takes 0. L0(t), R1(t), and R0(t) are variables that take 1 only for L0, R1, or R0, respectively. Strictly speaking, the final variable R0(t) is redundant, because R0(t) can be represented by 1 − L1(t) − L0(t) − R1(t). Therefore, the full model has no unique solution that maximizes the likelihood. We thus considered the 2^6 − 1 combinations of coefficients that exclude the full model, and selected the model with the smallest BIC.
According to the variables included in the best model and the signs of these coefficients, we classified neurons into five types: neurons coding action in the previous choice trial, neurons coding reward in the previous choice trial, neurons coding action-reward interaction (A×R) in the previous choice trial, prospective-action-coding neurons, and noncoding neurons (Table 1).
If the activity codes action and not reward, coefficients of the model for L1 and L0, or for R1 and R0, should be necessary and should have the same signs. If the activity codes reward, but not action, coefficients of L1 and R1, or L0 and R0, should be necessary and should have the same signs. If the activity codes an interaction between action and reward (A×R), one coefficient among L1, L0, R1, and R0 should be necessary.
If activity codes a prospective action by a WSLS strategy (Pros Action), the coefficient of L1 and R0 (both predict action R in the next trial), or L0 and R1 (both predict action L in the next trial) should be necessary. Because multiple tests changed the ratio of false positive errors, thresholds indicating a significant proportion of these information-coding neurons were calculated so that the ratio of false positives becomes 0.05 (binomial tests). To compare proportions of these information-coding neurons between CC and IC, data in CC and IC were analyzed separately with this Poisson regression analysis.
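The coefficient-pattern rules described above can be written as a small lookup. This is a sketch under stated assumptions rather than the authors' procedure: the function name `classify_neuron` is hypothetical, the same-sign requirement is applied to the prospective-action pairs as well as to the action and reward pairs, and any pattern not listed (for example, two coefficients with opposite signs, or three coefficients) is treated as noncoding.

```python
# Pairs of regressors kept in the best model, mapped to the neuron type.
PAIR_TYPES = {
    frozenset({"L1", "L0"}): "Action",
    frozenset({"R1", "R0"}): "Action",
    frozenset({"L1", "R1"}): "Reward",
    frozenset({"L0", "R0"}): "Reward",
    frozenset({"L1", "R0"}): "WSLS action",  # both predict R in the next trial
    frozenset({"L0", "R1"}): "WSLS action",  # both predict L in the next trial
}


def classify_neuron(coefficients):
    """Classify one neuron from the coefficients kept in its best model.

    coefficients -- dict mapping 'L1', 'L0', 'R1', 'R0' to the fitted value;
                    regressors dropped by model selection are simply absent.
    """
    kept = {name: value for name, value in coefficients.items()
            if name in ("L1", "L0", "R1", "R0") and value != 0}
    if len(kept) == 1:
        return "AxR"  # a single action-reward conjunction is necessary
    if len(kept) == 2:
        same_sign = len({value > 0 for value in kept.values()}) == 1
        label = PAIR_TYPES.get(frozenset(kept))
        if label is not None and same_sign:
            return label
    return "Noncoding"  # assumption: unlisted patterns count as noncoding


example_types = [
    classify_neuron({"L1": 0.5, "L0": 0.3}),    # same sign for both left trials
    classify_neuron({"L1": -0.4, "R0": -0.2}),  # both predict a right choice
    classify_neuron({"R1": 0.7}),               # one conjunction only
]
```

Keeping the rules in a single table makes it easy to compare them line by line with Table 1.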

Mutual information
To elucidate when and how much information from previous A×R was coded in neuronal activity, the mutual information between neural firing and previous A×R was calculated using the method described previously (Ito and Doya, 2009, 2015a). For a T = 100 ms time window in each trial, the neuronal activity was defined as a random variable F, taking the number of spikes from 0 to f_max = T/2. X is a random variable taking x_1, x_2, x_3, or x_4, corresponding to previous A×R: left rewarded, left nonrewarded, right rewarded, or right nonrewarded, respectively. Mutual information between F and X is defined by the following:

$$I(F;X) = \sum_{f=0}^{f_{\max}} \sum_{i=1}^{4} p(f, x_i) \log_2 \frac{p(f, x_i)}{p(f)\,p(x_i)}.$$

For each neuron, mutual information (bits) was estimated (for more detail, see Ito and Doya, 2009) for every 100-ms time bin of an EASH, using all choice trials in the CC.

Table 1. Coefficients required in the best model, and their signs, for each neuron type

Type        | L1     | L0     | R1     | R0
Action      | +/−    | +/−    |        |
Action      |        |        | +/−    | +/−
Reward      | +/−    |        | +/−    |
Reward      |        | +/−    |        | +/−
A×R         | + or − |        |        |
A×R         |        | + or − |        |
A×R         |        |        | + or − |
A×R         |        |        |        | + or −
WSLS action | +/−    |        |        | +/−
WSLS action |        | +/−    | +/−    |
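For intuition, the mutual information between spike counts in one time bin and the previous A×R category can be estimated with a simple plug-in (empirical frequency) estimator, as sketched below. This is an assumption made for illustration: the estimation procedure actually used is the one described in Ito and Doya (2009), which may differ (for example, in bias correction), and the function name is hypothetical.

```python
from collections import Counter
import math


def mutual_information_bits(spike_counts, conditions):
    """Plug-in estimate of I(F; X) in bits from paired observations.

    spike_counts -- spike count in the time bin, one value per trial (F)
    conditions   -- previous action-reward category per trial (X),
                    e.g. 'L1', 'L0', 'R1', 'R0'
    """
    n = len(spike_counts)
    joint = Counter(zip(spike_counts, conditions))
    count_f = Counter(spike_counts)
    count_x = Counter(conditions)
    # p(f, x) * log2( p(f, x) / (p(f) p(x)) ), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (count_f[f] * count_x[x]))
        for (f, x), c in joint.items()
    )


# Toy data: four equally frequent categories over 20 trials.
previous_axr = ["L1", "L0", "R1", "R0"] * 5
mi_informative = mutual_information_bits([0, 1, 2, 3] * 5, previous_axr)
mi_uninformative = mutual_information_bits([1] * 20, previous_axr)
```

When the spike count identifies the category perfectly, the estimate reaches the 2-bit upper bound for four equally frequent categories; a constant spike count carries no information.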

Experimental design and statistical analyses
The presented analyses include 48,554 behavioral and neural trials (after the task learning was completed) recorded over a total of 78 sessions in six rats. The minimum and maximum number of trials per session were 436 and 1105, respectively.
We used appropriate statistical tests when applicable, i.e., paired or unpaired t test, Mann-Whitney U test, Wilcoxon signed-rank test, binomial test, and χ2 tests with or without Bonferroni's multiple comparison tests. Differences were considered statistically significant when p < 0.05. See Results for details.

Results
To investigate whether and how the availability of working memory (WM) influences the choice strategy of rats, we designed a choice task composed of choice trials and no-choice trials. In the continuous condition (CC), choice trials were repeated, and in the intermittent condition (IC), a no-choice trial was inserted between each pair of choice trials (Fig. 1). This insertion not only prolonged inter-choice-trial intervals but also increased the variety of behaviors between choices; therefore, WM was expected to be strongly disturbed. An experimental session consisted of the first three blocks in CC and the fourth and fifth blocks in IC (Fig. 2A,B). For the analysis, we used behavioral and neuronal data in 20 choice trials in the second and third blocks as a "CC sequence," and 20 choice trials in the fourth and fifth blocks as an "IC sequence" (see Materials and Methods for more detail). In one experimental day, this sequence of five blocks was repeated four to six times (usually, six times; mean = 5.9 times).

Behavioral performance
We trained six Long-Evans rats to perform the choice task (Fig. 1; see Materials and Methods) and conducted 78 sessions consisting of 461 CC sequences and 461 IC sequences. In both conditions, rats learned to choose the 75% reward side, based on their experiences with choice and reward (Fig. 2B). When the more rewarding side switched to the opposite side after ten choice trials, their choices also shifted to the opposite side.
We compared choice sequences in CC and IC in response to the change of reward probabilities (Fig. 2C). In CC, the choice probability switched to the other option immediately after the change of reward probabilities. In IC, the choice probability gradually shifted to the opposite side and reached a level significantly different from chance after an average of 5 trials from the block change.
The choice probability of the 75% reward side in each session was distributed between 0.5 and 0.75 with a median value of 0.632 in CC, and between 0.4 and 0.65 with a median of 0.538 in IC (Fig. 2D). In both conditions, the median was significantly greater than the chance level (p = 1.6e-14 in CC and p = 1.1e-7 in IC, Wilcoxon signed-rank test), confirming that choice behavior in both CC and IC adapted to reward-probability changes.
To investigate choice strategies in more detail, using logistic regression, we examined the influence of past actions and outcomes on the subsequent choice (see Materials and Methods). The interaction of the action and the outcome of the last trial affected the subsequent action more strongly in CC than in IC (median of the coefficient b_2 for a(t-1)*r(t-1): CC, 1.87; IC, 0.42; p = 3.3e-12, Wilcoxon signed-rank test; Fig. 2E). The effect of the previous action-reward relationship decayed more rapidly in CC (median of the decay constant c: CC, 0.23; IC, 0.62; p = 0.048), with no significant difference in the coefficient b_3 (CC, 0.13; IC, 0.19; p = 0.49). These results indicate that rats recognized the side with a larger reward probability and changed their action selection in both CC and IC, while insertion of no-choice trials in IC made learning slower.
Next, to test the hypothesis that the choice strategy is closer to WSLS in CC than in IC, we calculated WSLS indices, composed of the win-stay ratio P(stay | reward), the frequency that the rats chose the same action after a rewarded choice trial, and the lose-switch ratio P(switch | no-reward), the frequency that the rats switched the action after a no-reward trial. WSLS indices calculated for each session are plotted two-dimensionally, P(stay | reward) versus P(switch | no-reward) (Fig. 2F). WSLS indices of CC sessions were concentrated near the upper right corner, while those of IC sessions were widely distributed from the center to the lower right corner. These two distributions had little overlap. Ratios of win-stay P(stay | reward) and lose-switch P(switch | no-reward) were significantly larger in CC than in IC (win-stay: 0.85 vs 0.64, p = 1.2e-16, Mann-Whitney U test; lose-switch: 0.82 vs 0.51, p = 2.6e-23). Since behaviors in CC showed high P(stay | reward) and P(switch | no-reward), these could be regarded as "a noisy WSLS strategy" (cf. the regular WSLS is deterministic).
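The two WSLS indices defined above can be computed from a sequence of choices and outcomes as in the following Python sketch. The function name and the data layout (one entry per choice trial, in order) are assumptions made for illustration, and the toy sequence is not experimental data.

```python
def wsls_indices(actions, rewards):
    """Return (P(stay | reward), P(switch | no-reward)) for one choice sequence.

    actions -- list of choices per choice trial, e.g. 'L' or 'R'
    rewards -- list of outcomes per choice trial (1: rewarded, 0: not rewarded)
    """
    wins = stays_after_win = losses = switches_after_loss = 0
    # Pair each trial's action and outcome with the action on the next trial.
    for prev_action, prev_reward, next_action in zip(actions, rewards, actions[1:]):
        if prev_reward:
            wins += 1
            stays_after_win += next_action == prev_action
        else:
            losses += 1
            switches_after_loss += next_action != prev_action
    win_stay = stays_after_win / wins if wins else float("nan")
    lose_switch = switches_after_loss / losses if losses else float("nan")
    return win_stay, lose_switch


# Toy sequence of six choice trials (hypothetical).
toy_win_stay, toy_lose_switch = wsls_indices(list("LLRRLR"), [1, 0, 1, 1, 0, 0])
```

In the toy sequence, the rat stays after two of three rewarded trials and switches after both unrewarded trials, so the indices are 2/3 and 1. A deterministic WSLS strategy would give (1, 1), the upper right corner of Figure 2F.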

Neuronal activities of the prefrontal and motor cortico-basal ganglia loops
We recorded neuronal activity in DMS, DLS, mPFC, and M1 of rats performing the choice task. Each rat was implanted with four bundles of eight microwires. After all experiments were completed, locations of bundles were confirmed by Nissl staining (Fig. 3A,B). Stable recordings were made from 320 neurons in DMS, 210 neurons in DLS, 158 neurons in mPFC, and 247 neurons in M1 from six rats. We analyzed neural activities in CC and IC conditions in four phases in a trial (see Fig. 1B, 1: in the center hole before the choice tone; 2: after the choice tone before exiting the center hole; 3: during the approach to the left or right hole; 4: in the left/right hole with a reward/no reward tone).
Representative examples of spike perievent time histograms (PETHs) with intertrial time alignment (Ito and Doya, 2015b) are shown in Figure 3C-F. The DMS neuron in Figure 3C increased its activity when the rat exited from the center hole and entered the left or right hole (phase 3), and showed the largest peak after exiting the left or right hole, when the rat anticipated obtaining a pellet (black line in upper panel). PETHs for CC (green line) and IC (magenta line) differed in phase 3 and after exiting the left/right hole, showing that the activity pattern was modulated by the task condition. PETHs also showed increased activity following right-side choices, especially when the choice was rewarded (R1), showing that the activity pattern was modulated by a conjunction of choice and reward.

Figure 3, panels B-F. B, Tracks of accepted electrode bundles for all rats are indicated by rectangles. Neurons recorded from blue, red, cyan, and magenta rectangles were classified as DMS, DLS, mPFC, and M1 neurons, respectively. Each diagram represents a coronal section referenced to the bregma (Paxinos and Watson, 1998). C-F, Perievent time histograms (PETHs) of a representative DMS neuron (C), DLS neuron (D), mPFC neuron (E), and M1 neuron (F). PETHs were calculated based on timings of five task events (onset of center-hole poking, onset of the tone, offset of center-hole poking, onset of left-hole or right-hole poking, offset of left-hole or right-hole poking), and the following four task phases were defined: phase 1, the period from the start of the center-hole poking to the onset of the cue tone; phase 2, the choice tone presentation period; phase 3, the action execution period between exiting the center hole and entry into the left or right hole; phase 4, the feedback period when a reward or no-reward (continued)
The DLS neuron in Figure 3D increased its activity when the rat was approaching the center hole (phase 1). It showed higher activity for left than right choices in phases 3 and 4, which was higher in IC, showing context-dependent action coding.
The mPFC neuron in Figure 3E showed higher activity for no-reward than reward experiences after exiting the left/right hole, showing outcome coding.
PETHs of the M1 neuron (Fig. 3F) showed higher activity when entering the left hole (phase 4), which was stronger in CC, showing context-dependent action coding.
To see the distribution of peak timings of all DMS neurons, PETHs for all DMS neurons were normalized and represented by color, with neuron indices sorted on the basis of peak activity timing (Fig. 3G). Activity peaks of DMS neurons were widely distributed across different phases of a trial. DLS, mPFC, and M1 neurons (Fig. 3H-J) also had their own peak firing at different phases, with greater apparent concentrations when approaching the center hole and choice holes (phase 3) in DMS and DLS, and during choice tone presentation (phase 2) in mPFC. These analyses show that neurons in the four areas show various forms of action, outcome, and task-condition coding at different times in the trial.

Neuronal representation of experiences of the current choice trial
We next applied Poisson regression analyses of spike counts in each phase to quantify neuronal coding of actions, rewards, and task conditions. Note that action coding in phases 1 and 2 represents an action command or plan before an action is performed, and that the outcome regressor means reward prediction in phases 1-3 (Fig. 4A). Average durations of each phase did not differ significantly between CC and IC (Fig. 4B), which excludes the possibility that differences in neural activities were due simply to differences in motor behaviors.
The proportion of neurons correlated with the outcome variable was significantly larger in the DLS than in the DMS in phase 4 (Fig. 4D, DMS: 7.8%, DLS: 15%, p = 0.044). Neurons in the DLS and M1 were significantly more activated by reward than no-reward experience (DLS: p = 0.018, M1: p = 0.0045). These results indicate that DLS neurons can more strongly distinguish reward-associated sensory stimuli than those in the DMS.
For the reward-coding neurons, we further assessed whether their responses were sensitive to the laterality of action choice. We regressed the number of spikes in phase 4 using action × reward (contralateral rewarded: 1, ipsilateral rewarded: −1, nonrewarded: 0; see Materials and Methods). DLS had significantly stronger sensitivity to contralateral reward than ipsilateral reward in both CC and IC conditions (regression coefficient; CC: 1.44 ± 0.41, p = 0.0013, IC: 1.6 ± 0.44, p = 9.6e-4, paired t test compared with zero; Fig. 4E). Other regions did not show significantly different sensitivity to reward following contralateral and ipsilateral actions. Note that the order of left-right reward probabilities was randomized across sessions (Fig. 2A), so that the laterality of action-reward responses did not affect the observed features, although the recordings of the motor loop were from the right hemisphere (Fig. 3B).
In all phases, the proportion of neurons correlated with the task condition (CC/IC) was <10% (data not shown), which suggests that neuronal activities did not directly encode task condition, but does not exclude the possibility that information coding of action or outcome in the previous choice trial was modulated by task conditions.

Neuronal representation of experiences in the previous choice trial
Next, to examine how information from choice trials transferred to the subsequent choice trial with and without WM interference, we applied a Poisson regression analysis to spike data in CC and IC separately (see Materials and Methods; Table 1). In this analysis, activity of each neuron in each phase was classified exclusively into five groups: neurons coding the action of the previous choice trial, neurons coding the reward of the previous choice trial, neurons coding the interaction between action and reward (A×R) of the previous choice trial, neurons coding the action according to WSLS strategy, and noncoding neurons.

(Figure caption, continued) ... tone was presented after left-hole or right-hole poking. PETHs of all choice trials (black), and of trials in CC (green) and trials in IC (cyan; upper panel). PETHs of CC and IC choice trials in which left was selected and rewarded (L1), left was selected and not rewarded (L0), right was selected and rewarded (R1), or right was selected and not rewarded (R0; lower panel). All PETHs (50-ms bins) were smoothed with a Gaussian kernel with a 150-ms SD. G-J, Normalized activity patterns of all recorded neurons from the DMS (G), DLS (H), mPFC (I), and M1 (J). An activity pattern for each neuron was normalized so that the maximum PETH was 1 and represented by pseudo-color (values from 0 to 1 are represented from blue to red). Indexes of neurons were sorted based on the time that the normalized PETH first surpassed 0.8.
To elucidate when and how much information about previous A×R was represented in the motor and prefrontal loops, we calculated the mutual information between previous A×R and neuronal firing in 100-ms time bins around the offset of center-hole poking in the CC (Fig. 5E; Ito and Doya, 2009, 2015a). M1 neurons showed a sharp peak after exit from the center-hole. Previous A×R information was more strongly encoded in the motor loop than in the prefrontal loop for 500 ms before and after the offset of center-hole poking (before: DMS vs DLS; 0.020 ± 0.0012 vs 0.031 ± 0.0017 bits, p = 8.1e-08, mPFC vs M1; 0.016 ± 0.0010 vs 0.026 ± 0.0016 bits, p = 4.6e-06; after: DMS vs DLS; 0.033 ± 0.0021 vs 0.045 ± 0.0027 bits, p = 2.2e-04, mPFC vs M1; 0.021 ± 0.0020 vs 0.046 ± 0.0036 bits, p = 2.6e-07; mean ± SEM, unpaired t test; Fig. 5F). These time windows were equivalent to phases 2 and 3, respectively. The result of mutual information analysis is consistent with that of neuronal proportion analysis.

(Figure 5 caption, continued) Filled and hatched bars indicate proportions in CC and IC, respectively. When rats employed the WSLS strategy, DLS neurons more strongly conveyed information about the previous choice trial during action preparation than DMS neurons (yellow arrows). Neuronal activities of all areas excluding the mPFC represented WSLS action during action execution (green arrow). **p < 0.01, *p < 0.05, χ² test between DMS and DLS or mPFC and M1 in each task condition. ##p < 0.01, #p < 0.05, χ² test between CC and IC in each recording area. E, The time course of mutual information between previous A×R and neuronal firing in 100-ms bins before and after the offset of center-hole poking in the CC. Mean ± SEM.
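For readers unfamiliar with the measure, mutual information between a discrete task variable and the spike count in one time bin can be estimated as in the minimal sketch below. It uses a plain plug-in (histogram) estimator, which is an assumption: the estimator and any bias correction actually used are described in the cited methods, not here. The binary ±1 encoding of previous A×R and the spike counts are likewise made up for illustration.

```python
import math
from collections import Counter

def mutual_information_bits(labels, counts):
    """Plug-in estimate of I(label; spike count) in bits for one time bin."""
    n = len(labels)
    if n == 0 or n != len(counts):
        raise ValueError("labels and counts must be non-empty and equally long")
    joint = Counter(zip(labels, counts))
    n_label = Counter(labels)
    n_count = Counter(counts)
    mi = 0.0
    for (lab, cnt), n_xy in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) p(y)) ), written with raw counts.
        mi += (n_xy / n) * math.log2(n_xy * n / (n_label[lab] * n_count[cnt]))
    return mi

# Hypothetical trials: previous A×R coded as +1 / -1, spike counts in one 100-ms bin.
prev_axr = [1, -1, 1, -1, 1, -1, 1, -1]
spikes_dependent = [3, 0, 3, 0, 3, 0, 3, 0]    # count fully determined by the label
spikes_independent = [3, 3, 0, 0, 3, 3, 0, 0]  # count unrelated to the label

mi_dependent = mutual_information_bits(prev_axr, spikes_dependent)
mi_independent = mutual_information_bits(prev_axr, spikes_independent)
```

A binary variable can carry at most 1 bit, so the values reported above (a few hundredths of a bit per neuron) correspond to weak but reliable modulation. The plug-in estimator is biased upward when trials are few, which is one reason a real analysis would normally add a correction or a shuffle control.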
We further focused on the neural activities just before and after the reward probabilities were changed. We performed an analysis using the WSLS-related variables as in Figure 5 for phase 1 of the first or last four trials in each block (Fig. 6). In the first four trials, the DLS more strongly represented the previous reward than the DMS (DMS vs DLS; 7% vs 16%, p = 0.0053, χ² test with Bonferroni correction), similar to the result for all trials in Figure 5A. In the last four trials, the DLS more strongly encoded previous action and previous A×R than the DMS (previous action: DMS vs DLS; 0.3% vs 3%, p = 0.042, previous A×R: DMS vs DLS; 12% vs 26%, p = 0.00029), which was not observed with all trials combined. The result confirms the role of the DLS in working memory in both early and late stages of learning, with the coding in the latter related more to the next action.
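The comparison of proportions between two areas can be sketched as a χ² test on a 2 × 2 table of coding versus noncoding neurons, followed by a Bonferroni adjustment. The neuron counts and the number of comparisons below are hypothetical, and the sketch applies no continuity correction; whether the original analysis used one is not stated in this section.

```python
import math

def chi2_test_2x2(a, b, c, d):
    """Pearson chi-square test (df = 1, no continuity correction) for [[a, b], [c, d]].

    Rows are areas, columns are coding / noncoding neuron counts.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("each row and column needs at least one count")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_comparisons)

# Hypothetical counts: 20 of 100 neurons coding in one area, 40 of 100 in the other.
chi2, p = chi2_test_2x2(20, 80, 40, 60)
p_adjusted = bonferroni(p, 4)  # 4 comparisons is an assumed number
```

With small expected counts, as for the 0.3% vs 3% comparison, an exact test is often preferred over the χ² approximation.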

Discussion
Our main achievements and findings in this research are as follows.
1. We developed a choice task that manipulated WM availability for rats and showed that disturbance of WM disrupted the WSLS choice strategy (Fig. 2).
2. Poisson regression of neural spikes showed that the proportions of neurons coding the current action before and after action choice were larger in the DLS than DMS, and in M1 than the mPFC, and that neurons in DLS showed stronger reward response following contralateral action than ipsilateral action (Fig. 4).
3. Before action choice, the proportion of neurons coding the previous reward was larger in CC in DLS, mPFC, and M1, and the proportion coding the previous action was larger in CC in DLS and M1. During action execution in CC, neuronal activities of DMS, DLS, and M1 represented prospective action by the WSLS strategy (Fig. 5).
4. Throughout the trial, working memories of previous actions, rewards, and their interactions were more prevalent in the motor loop than in the prefrontal loop (DLS than the DMS and M1 than mPFC; Fig. 5).
In the present study, we showed the effect of WM availability on the choice behavior of rats, which allowed us to analyze neuronal correlates of a WM-based choice strategy in the cortico-striatal circuit (Fig. 2).
Recent studies in both humans and rodents showed that disruption of WM changed the strategy in sequential choice tasks (Collins and Frank, 2012;Worthy et al., 2012;Otto et al., 2013a, b;Collins et al., 2014;Economides et al.;Iigaya et al., 2018). Worthy et al. (2012) used a two-choice task and disrupted WM by requiring an additional memory task in parallel. They showed that human choice behavior with intact WM was better fitted by a WSLS model than an RL model, while choice behavior with WM load was better fitted by an RL model than the WSLS model. Collins and Frank (2012) and Collins et al. (2014) used a choice task in which a subject selected one action among three options for a given visual image. The WM load was controlled by varying the number of visual stimuli. They analyzed the choice strategy using a hybrid model combining an RL model and a WM model and suggested that the choice behavior without WM load can be explained by the WM model (RL model with the learning rate = 1), while the choice behavior with WM load can be explained by the RL model with a lower learning rate. Iigaya et al. (2018) studied mouse performance in a nonstationary, reward-driven, decision-making task and assessed WM availability based on spontaneous variations in intertrial intervals (ITIs). Mice showed WSLS-like choices after short ITIs, but RL-like choices after long ITIs. Optogenetic stimulation of dorsal raphe serotonin neurons boosted the learning rate only in trials after long ITIs, suggesting that serotonin neurons modulate reinforcement learning rates, and that this influence is masked by WM-based decision mechanisms.

(Figure 5 caption, continued) F, The average mutual information between previous A×R and neuronal firing during 500 ms before and after the offset of center-hole poking in the CC. Mean ± SEM; **p < 0.01, *p < 0.05, unpaired t test.

Figure 6. Neuronal representation of WSLS strategy before a change in reward contingency and after that change. Same as Figure 5A but only using behavioral and neuronal data during phase 1 of the first or last four trials in each block for analysis. **p < 0.01, *p < 0.05, χ² test between DMS and DLS or mPFC and M1 in each task condition. ##p < 0.01, #p < 0.05, χ² test between CC and IC in each recording area.
All previous studies examining WM effects on choice strategies except Iigaya et al. (2018) were performed with human subjects, and there have been no reports comparing neuronal representations of a WM-based strategy. In this study, we recorded neurons in the DMS, DLS, mPFC, and M1 during a choice task with and without WM disturbance, and found that neurons in each area had a variety of activity patterns throughout a trial, and that patterns were modulated by selected actions, reward outcomes, and task conditions (Fig. 3). Information on action command and selected action was more strongly represented in the motor loop than in the prefrontal loop (Fig. 4). These properties are similar to our previous observations in the DLS and DMS neurons (Ito and Doya, 2015a). The DLS had significantly stronger sensitivity to reward following contralateral than ipsilateral actions, whereas reward responses in other regions did not show significantly different sensitivity to the laterality of actions (Fig. 4E). The neuronal activities in two loops were recorded from different hemispheres. However, the order of left-right reward setting was randomized across sessions, so that the analysis result would not be affected by the laterality of recording sites.
What kind of information must be retained between trials for the WSLS strategy? There are at least two possibilities. One is to retain direct experiences with action and reward in the previous trial, such as L1, L0, R1, and R0 (keeping past experience). Another is to compute the next action using a WSLS rule soon after reward feedback (prospective action), that is, L for L1 or R0 and R for L0 or R1, and to retain that until the next choice (keeping future plan).
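The WSLS rule and the behavioral measure of how closely a choice sequence follows it can be written down in a few lines. This is an illustrative sketch: the function names, the "L"/"R" encoding, and the example sequence are hypothetical.

```python
def wsls_next_action(action, rewarded):
    """Win-stay-lose-switch: repeat a rewarded action, switch after no reward."""
    if action not in ("L", "R"):
        raise ValueError("action must be 'L' or 'R'")
    other = "R" if action == "L" else "L"
    return action if rewarded else other

def stay_switch_proportions(actions, rewards):
    """Return (P(stay | win), P(switch | lose)) over consecutive choice trials.

    A proportion is None when there is no trial of the corresponding type.
    """
    wins = win_stays = losses = lose_switches = 0
    for prev, nxt, rew in zip(actions, actions[1:], rewards):
        if rew:
            wins += 1
            win_stays += prev == nxt
        else:
            losses += 1
            lose_switches += prev != nxt
    p_win_stay = win_stays / wins if wins else None
    p_lose_switch = lose_switches / losses if losses else None
    return p_win_stay, p_lose_switch

# The prospective action for each previous experience (L1, L0, R1, R0).
plan = {code: wsls_next_action(code[0], code[1] == "1")
        for code in ("L1", "L0", "R1", "R0")}

# Made-up choice sequence with one deviation from WSLS (the last trial).
actions = ["L", "L", "R", "R", "R"]
rewards = [1, 0, 1, 0, 0]
p_ws, p_ls = stay_switch_proportions(actions, rewards)
```

The `plan` dictionary makes the memory argument concrete: keeping past experience means storing one of four states, whereas keeping the future plan means storing one of two.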
Indeed, before action execution in CC, coding of the previous action was seen in the DLS and M1 and coding of the previous reward was seen in the DLS, mPFC, and M1 (Fig. 5A,B). Combinatorial information of the previous action and reward was more dominant in CC in the DMS (phase 1) and the DLS, mPFC, and M1 (phase 2). A previous study in monkeys reported that prefrontal neurons modulated their activity according to the previous outcome and the conjunction of the previous choice and outcome (Barraclough et al., 2004).
Our results showed for the first time that the motor loop (DLS and M1) retains that information more strongly than the prefrontal loop (DMS and mPFC) when a WM-based strategy was observed.
Previous studies suggested that different types of uncertainty, such as the difference of the reward probabilities and the volatility of reward contingency, affect choice strategies and learning rates (Soltani and Izquierdo, 2019;Woo et al., 2022). While some studies reported higher learning rates with higher volatility (Behrens et al., 2007;Piray and Daw, 2021), others reported reduced learning rates (Donahue and Lee, 2015;Farashahi et al., 2019;Woo et al., 2022). In the present study, learning and forgetting were slower in the IC condition than in the CC condition (Fig. 2C). It is not likely, however, that this is a result of the difference in the volatilities in the two conditions because both CC and IC blocks included the same 10 choice trials.
Another possible factor that might affect the choice strategy and the involvement of the motor loop is the simple repeated choice sequence acquired in the CC compared with a more complex choice/no-choice sequence acquired in the IC condition. However, the analysis of neuronal representation soon after reward contingency change still shows higher previous action coding in the motor loop (Fig. 6). This suggests that the motor loop contributed not only to simple repetitive actions but also to adaptive choice shifts using WM.
In the present study, rats were overtrained with the task before neuronal recording. It was reported that overtraining induced a shift from goal-directed to habitual behavior (Smith and Graybiel, 2013), which is often associated with a shift in the control from the DMS to DLS (Yin et al., 2004;Ashby et al., 2010;Thorn et al., 2010;Graybiel and Grafton, 2015;Kupferschmidt et al., 2017). A DLS lesion study reported an impairment of lose-shift response in an operant task (Skelin et al., 2014). Our study showed that the motor loop, including the DLS, more strongly conveyed information about previous choices, rewards, and their interactions than the prefrontal loop, including the DMS. It is possible that the WSLS strategy using the WM of action and reward was established as a habitual behavior in the motor loop.
From the viewpoint of information processing, keeping a future plan is more efficient in that it requires less memory capacity. Memory of future action has been termed "prospective action" in previous studies (Kesner, 1989;Goto and Grace, 2008;Kesner and Churchwell, 2011). Whereas these studies suggested that the prefrontal cortex is responsible for prospective action coding, in our study, prospective WSLS coding appeared during action execution in all recorded areas, excluding the mPFC. It is still possible that the prospective action was calculated and retained in other prefrontal areas, such as the anterior cingulate cortex (ACC), located just above the mPFC. The ACC was considered a part of the prefrontal cortex in previous studies (Kesner, 1989;Goto and Grace, 2008;Kesner and Churchwell, 2011) and was thought to be involved in working memory for the motor response (Kesner and Churchwell, 2011).
In conclusion, this experiment showed that the availability of WM affects choice strategies in rats and revealed WM-related neuronal activities in DMS, DLS, mPFC, and M1. A striking finding was that DLS and M1 in the motor cortico-basal ganglia loop carry substantial WM information about previous choices, rewards, and their interactions, in addition to action coding during action execution.