Reward expectations direct learning and drive operant matching in Drosophila

Significance

Unraveling how humans and other animals learn to make adaptive decisions is a unifying aim of neuroscience, economics, and psychology. In 1961, Richard Herrnstein formulated a long-standing empirical law that quantitatively describes many decision-making paradigms across these fields. Herrnstein's matching law states that choices between options are divided in proportion to the rewards received, a strategy that equalizes the return on investment across options. Identifying mechanistic principles that could explain this universal behavior is of great theoretical interest. Here, we show that Drosophila obey Herrnstein's matching law, and we pinpoint a plasticity rule involving the computation of reward expectations that could mechanistically explain the behavior. Our study thus provides a powerful example of how fundamental biological mechanisms can drive sophisticated economic decisions.


Supplemental Materials & Methods
All the data presented in the figures and supplementary figures can be found at the following data repository: https://doi.org/10.5281/zenodo.7449214. The code used to run the Y-arena, as well as to perform the analyses and produce the figure panels, can be found in the following code repository: https://doi.org/10.5281/zenodo.7986372.

Fly strains and rearing:
Drosophila melanogaster were raised on standard cornmeal food supplemented with 0.2 mM all-trans-retinal at 25 °C (for Gr64f-Gal4 lines; see following table) or 21 °C (for other lines; see following table), with 60% relative humidity, and kept in the dark throughout. Cross progeny (2-5 days old) were sorted on a cold plate at around 4 °C, and females of the appropriate genotype were transferred to starvation vials. Starvation vials contained nutrient-free 1% agarose to prevent desiccation. Flies were starved for 28-42 hrs before being aspirated into the Y-arena for experiments. The details of all flies used for experiments in this manuscript can be found in the table below:

Apparatus design:
A detailed description of the apparatus is provided in Supplementary Information 1. The Y chamber consists of two layers of white translucent plastic. The bottom is a single continuous circular layer and serves as the floor of the Y that flies navigate. The top is a circular layer with a Y-shaped hole in the middle that serves as the walls. The length of each arm from center to tip is 5 cm, and the width of each arm is 1 cm. These two layers are placed underneath an annulus of black aluminum. A transparent glass disk is located in the center of this annulus and acts as the ceiling of the Y, allowing for video recording of experiments. This transparent disk is rotatable and contains a small hole used to load flies. The black annulus houses three clamps that lock the circular disk in place. All three layers are held together and made airtight with the help of 12 screws that connect the layers.
The Y chamber is mounted above an LED board that provides infrared illumination to monitor the fly's movements, and red light for optogenetic activation. The LED board consists of a square array of red (617 nm peak emission, Red-Orange LUXEON Rebel LED, 122 lm at 700 mA, 1.9 mW/cm²) and infrared (IR) LEDs that shine through an acrylic diffuser to illuminate flies. Fly movements were recorded at ~5 Hz from above the Y using a single USB3 camera (Flea3, model: FL3-U3-13E4M-C, with an 800 nm longpass filter).
Each arm of the Y has a corresponding odor delivery system, capable of delivering up to 5 odors (modified from 8). For our experiments, olfactometers injected air/odor streams into each arm at a flow rate of 100 ml/min. A crisp boundary between odors and air is formed at the center of the Y (Fig. S1A). Odors and concentrations used for each experiment are detailed in the behavioral experiments section of the Methods. The center of the Y contains an exhaust port connected to a vacuum, which was set at 300 ml/min using a flow meter (Dwyer, Series VF Visi-Float® acrylic flowmeter), matching the total input flow in our experiments.

Fly tracking and operation:
We wrote custom MATLAB code (MATLAB 2018b, Mathworks) to control the Y-arena and run experiments. The data collected by the USB3 camera were loaded into MATLAB in real time, and the fly's location was identified using the MATLAB Image Processing Toolbox as follows. A background image was calculated just before beginning the experiment by averaging multiple frames as the fly moved around in the Y. This background was subtracted from the frame being processed, and the resulting image was thresholded, leaving the fly as a white shape on a black background. The location of the centroid of the fly was estimated using MATLAB's bwconncomp and regionprops functions. If the fly was located in one of the reward zones, a trial was deemed complete, and reward was provided by switching on the red LEDs as defined by the reward contingencies of the task. The arena was then reset: air was pumped into the chosen arm, and odors were randomly reassigned to the two other arms (Fig. 1B). The location of the fly, along with other information such as reward presence and odor-arm assignments, was saved as a .mat file for further analysis. All analyses in Figs. 1, 2, and 3 were based on this information.
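As an illustration, a minimal MATLAB sketch of this frame-processing step is shown below; the variable names, threshold value, and largest-blob heuristic are assumptions for illustration, not the exact implementation:

```matlab
% Background subtraction and centroid detection (illustrative sketch).
bg = mean(cat(3, bgFrames{:}), 3);          % average pre-trial frames (hypothetical cell array)
fg = abs(double(frame) - bg) > 25;          % subtract background, then threshold (assumed value)
cc = bwconncomp(fg);                        % find connected components in the binary image
stats = regionprops(cc, 'Centroid', 'Area');
[~, k] = max([stats.Area]);                 % assume the largest blob is the fly
xy = stats(k).Centroid;                     % fly centroid in pixels
```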

Circular olfactory arena:
Group learning experiments in Fig. S9 were performed in a previously described circular olfactory arena (8).

Odorant information:
For all experiments in the paper, two or three of the following odorants were used to form cue-reward relationships:

Y-arena behavioral task structure and design:
The general structure of these experiments starts by inserting a fly randomly into one of the three arms of the Y via aspiration. This arm is injected with a clean airstream, and the odor choices are assigned randomly to the other two arms. This randomization ensures there is no consistent spatial relationship/component to the task. Once a fly reaches an odor arm and travels down to the choice zone, reward is delivered via a 500 ms flash of red LED light (617 nm, 1.9 mW/cm²) to activate the appropriate reward-related neurons. This constitutes one trial. The arena then resets: the arm chosen by the fly switches to clean air, the odor options are again randomly assigned to the other two arms, and the next trial commences. Trials are strung together into blocks consisting of 60 or 80 trials, depending on the paradigm.
Whenever probabilistic rewards were included in a task, reward baiting was incorporated as follows. On every trial within a block, there is a constant probability that reward will be delivered when a fly makes a particular odor choice. However, once a reward is scheduled to be delivered for a given choice, that state persists until that odor is chosen by the fly; in other words, the odor becomes 'baited' with reward until chosen. Note that this type of reward schedule means that the likelihood that an odor cue yields a reward increases over time if it goes unchosen for many trials; this design choice reflects the replenishment of resources that would occur in a natural foraging environment.
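To make the baiting scheme concrete, the following MATLAB sketch implements the persistence logic described above; the probabilities, trial count, and the pickOdor placeholder are hypothetical:

```matlab
% Illustrative baiting logic: scheduled rewards persist until collected.
p = [0.8, 0.2];                           % per-odor baiting probabilities (example)
baited = false(1, 2);                     % persistent reward state for each odor
for trial = 1:80
    baited = baited | (rand(1, 2) < p);   % arm rewards; unchosen baits persist
    chosen = pickOdor();                  % hypothetical placeholder for the fly's choice
    rewarded = baited(chosen);            % reward delivered only when the bait is collected
    baited(chosen) = false;               % consume the bait once that odor is chosen
end
```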
Our terminology designates experiments according to the reward probabilities associated with each odor. For example, 100:0 indicates that one odor was rewarded on 100% of the trials on which it was chosen, while the other option was never rewarded; this is a non-probabilistic task, as in Fig. 1. By contrast, 80:20 indicates that one odor is baited with reward with 80% probability while the other option is baited 20% of the time, and the task is entirely probabilistic.
When we evaluated whether flies learn two different cue-reward pairings (one high probability, one low probability) in Fig. 1G and Fig. S1G, we had to use three different odors: OCT, MCH, and PA. In these experiments, PA was always an unrewarded odor cue, while OCT and MCH were arbitrarily assigned high (80%) and low (40%) reward baiting probabilities. On each trial, flies were presented with a choice of either OCT versus PA or MCH versus PA. These choices were delivered in alternation for a block of 80 total unrewarded trials to assess naive odor preference. This was followed by an 80-trial block with reward baiting as above. For each fly, we arbitrarily assigned which odor (OCT or MCH) was associated with the high reward probability, to ensure balance across the dataset.
The dynamic foraging task in, e.g., Fig. 2 was adapted from monkey and mouse versions (9-12) and used a three-block structure in which reward baiting probabilities were constant within a given 80-trial block and changed between blocks. We used reward baiting probabilities of 50:50, 33:67, 20:80, and 11:89; in a subset of experiments, we delivered lower net reward at the same ratios: 25:25, 16.5:33, 10:40, and 5.5:44.5.

Circular olfactory arena behavioral task structure and design:
A schematic of the task performed in the circular arena is shown in Fig. S9A. OCT and MCH were used as odors for these experiments. Odors were presented sequentially (and separated in time) for one minute each, with one of the odors paired with reward. To mimic the relationship between odor time and reward time experienced by the fly in the Y-arena, 1 sec of reward (red light, 617 nm, 2.3 mW/cm²) was provided after every 3 seconds of odor experience. Flies were finally tested by dividing the circular arena into four quadrants, with two opposite quadrants receiving one odor and the other two quadrants receiving the other.

Quantitative analysis and behavioral modeling:
All analyses and modeling were performed using MATLAB 2020b (Mathworks). We used nonparametric statistical tests when quantifying statistical significance in all cases. The Mann-Whitney test was used when testing hypotheses with unpaired samples; the Wilcoxon signed-rank test was used with paired samples. Specific descriptions of the hypotheses being tested are provided in the results and figure legends in each case.

Analysis of fly movement and choices in the Y-arena:
The (x, y) coordinates of the fly were analyzed to calculate: i) the distance of the fly from the center of the Y; ii) when the fly entered and exited a given odor arm; and iii) the time taken per trial to enter the reward zone at the end of an odorized arm. These quantities were then used to produce the plots in Figs. 1, 2, and 5 and Figs. S1 and S2.
Distance from center was calculated by projecting the (x, y) location of the fly at time point $t$, $\mathbf{p}_t$, onto a skeleton of the Y, and this metric was used when plotting location-over-time plots (example shown in Fig. 1C). The skeleton consisted of three lines running down the middle of each arm to the center of the Y ($\mathbf{p}_0$).
Based on which arm the fly was located in, its (x, y) position was projected onto the appropriate ($i^{th}$) skeleton line using the standard equations for projecting a point onto a line,

$$d_t = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{b}\|}, \qquad \mathbf{a} = \mathbf{p}_t - \mathbf{p}_0, \quad \mathbf{b} = \mathbf{p}_i - \mathbf{p}_0,$$

where $\mathbf{p}_i$ is the (x, y) coordinate of the end of the $i^{th}$ arm. The entries/exits of a fly into/from a particular odorant or air were estimated by tracking the region in which the fly was located at every time point and comparing it to the known odor-arm identity map (stored in the experiment .mat file). A turn (reversal) was considered to have been made whenever a fly entered an odor arm and then exited without reaching the reward zone. An approach was considered to have been made whenever a fly entered an odor arm and then traveled all the way into the reward zone of that same arm without ever exiting it.
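A minimal MATLAB sketch of this projection, with hypothetical coordinates:

```matlab
% Project the fly centroid onto the skeleton line of its current arm.
p0 = [0, 0];                          % center of the Y
pArm = [5, 0];                        % end of the occupied arm (cm, example value)
pt = [2.1, 0.3];                      % fly centroid at time t (example value)
a = pt - p0;                          % vector from center to the fly
b = pArm - p0;                        % vector from center to the arm tip
d = dot(a, b) / norm(b);              % distance from center along the arm
```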
To calculate the time taken per trial, we made use of the timestamp vector that we saved along with the (x, y) vector. Time taken for the entire trial was calculated by subtracting the timestamp of the frame at which the previous trial was completed from the timestamp of the frame at which the current trial was completed. Time taken from first exit of the air arm was calculated by subtracting the timestamp of the frame at which the fly first exited the air arm after a trial began from the timestamp of the frame at which the current trial was completed.
Choices themselves were determined by identifying the arm in which the fly crossed into the reward zone and mapping that arm to its assigned odor on that trial. Once choices were determined, we could calculate two important metrics: the choice ratio, defined as the ratio between the number of choices made towards option A and the number of choices made towards option B, and the reward ratio, defined as the ratio between the number of rewards received upon choosing option A and the number of rewards received upon choosing option B. These ratios were calculated on one of two timescales: i) the ratio over an entire block of 80 trials where baiting probabilities were constant, or ii) the ratio in a ten-trial moving window over the entire 240 trials of the experiment. The undermatching index used in Fig. 4E,F was defined as the mean square error between the instantaneous choice ratio and reward ratio curves produced for each fly.
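A short MATLAB sketch of these metrics, assuming 0/1 per-trial vectors choseA/choseB and rewA/rewB marking choices of and rewards from each option (names are illustrative):

```matlab
% Block-wise and instantaneous choice/reward ratios (illustrative sketch).
choiceRatio = sum(choseA) / sum(choseB);              % whole-block choice ratio
rewardRatio = sum(rewA) / sum(rewB);                  % whole-block reward ratio
win = 10;                                             % ten-trial moving window
instChoice = movsum(choseA, win) ./ movsum(choseB, win);
instReward = movsum(rewA, win) ./ movsum(rewB, win);
undermatchingIndex = mean((instChoice - instReward).^2);   % MSE between the two curves
```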

Analysis of fly location in the circular arena:
Videos of the flies' movements in the circular arena were read into MATLAB frame by frame, and the location of each fly's centroid was identified using the MATLAB Image Processing Toolbox. Once identified, the number of flies in each quadrant was used to calculate the preference index (PI) metric on a per-frame basis. PI is defined as the difference between the number of flies in each pair of odor-matched quadrants divided by the total number of flies. Time-averaged PIs could then be calculated by taking the average of the PIs from each individual frame.
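For example, assuming an nFrames x 4 matrix quadCounts of per-quadrant fly counts, with quadrants 1 and 3 sharing one odor (an illustrative layout):

```matlab
% Per-frame preference index (illustrative sketch).
nA = sum(quadCounts(:, [1 3]), 2);    % flies in one pair of odor-matched quadrants
nB = sum(quadCounts(:, [2 4]), 2);    % flies in the other pair
PI = (nA - nB) ./ (nA + nB);          % preference index for each frame
meanPI = mean(PI);                    % time-averaged PI
```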

Logistic regression to estimate influence of past rewards and choices on behavior:
To estimate the role of choice and reward histories in determining fly choices in the dynamic foraging task, we fit the following logistic regression to each fly's choice sequence:

$$\log \frac{P_{\mathrm{OCT}}(n)}{P_{\mathrm{MCH}}(n)} = \beta_0 + \sum_{i=1}^{N} \beta^{C}_{i}\, C(n-i) + \sum_{i=1}^{N} \beta^{R}_{i}\, R(n-i),$$

where $n$ is the present trial and $i$ is the variable used to iterate over the past $N$ trials. $C(n) = 1$ if the chosen odor was OCT and $-1$ if the chosen odor was MCH. $R(n) = 1$ if the chosen OCT option produced reward, $-1$ if the chosen MCH option produced reward, and 0 otherwise. $\beta_0$ represents the weight assigned to the bias term, $\beta^{C}_{i}$ represents the weight assigned to the $i^{th}$ past choice, and $\beta^{R}_{i}$ represents the weight assigned to the $i^{th}$ past reward. We chose to look at the past $N = 15$ trials to align with previous studies (12, 13). The regression coefficients were 10-fold cross-validated, and the regression model included elastic net regularization (MATLAB function lassoglm); the weight of lasso versus ridge optimization was set to 0.1, as this value provided the best fits to behavior. These fly-specific regression coefficients could be combined with the fly's reward and choice histories to predict trial choice probability and to estimate the log-likelihood

$$\ell = \sum_{n=1}^{N_{tot}} \sum_{c} \delta_{cn} \log \langle P_{cn} \rangle$$

and the percent deviance explained, where $N_{tot}$ is the total number of trials in the data being fit, $n$ indexes trials, $c$ indexes possible options, $\langle P_{cn} \rangle$ is the probability with which the model predicts that choice $c$ occurs on trial $n$, and $\delta_{cn}$ indicates the choice that actually took place on trial $n$ ($\delta_{cn} = 1$ if choice $c$ was made on trial $n$, and 0 otherwise).
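A sketch of this fit in MATLAB, assuming choices coded +1/-1 and rewards coded +1/-1/0 as defined above (variable names are illustrative):

```matlab
% Build history regressors and fit with elastic net regularization.
N = 15;                                       % number of past trials
T = numel(choices);
X = zeros(T - N, 2 * N);
for n = (N + 1):T
    X(n - N, 1:N) = choices(n - 1:-1:n - N);          % past choices
    X(n - N, N + 1:2 * N) = rewards(n - 1:-1:n - N);  % past rewards
end
y = choices(N + 1:T) == 1;                    % 1 if OCT was chosen
[B, FitInfo] = lassoglm(X, y, 'binomial', 'Alpha', 0.1, 'CV', 10);
idx = FitInfo.IndexMinDeviance;               % lambda chosen by cross-validation
beta = [FitInfo.Intercept(idx); B(:, idx)];   % bias plus history weights
```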

Leaky integrator model:
We also developed a leaky integrator model, inspired by earlier work (9), to predict behavior in the dynamic foraging task. This model determines choices on a given trial by comparing values assigned to each option the agent has to choose between; these values are updated from the agent's choice and reward history.
The values $V(n)$ were calculated for a given trial $n$ using the following equations. If OCT is chosen by the model, values are updated according to

$$V_{\mathrm{OCT}}(n+1) = (1 - \alpha)\, V_{\mathrm{OCT}}(n) + \alpha R(n), \qquad V_{\mathrm{MCH}}(n+1) = (1 - \alpha)\, V_{\mathrm{MCH}}(n),$$

where $\alpha$ is a constant related to the learning rate. Similarly, if MCH is chosen by the model, values are updated according to

$$V_{\mathrm{MCH}}(n+1) = (1 - \alpha)\, V_{\mathrm{MCH}}(n) + \alpha R(n), \qquad V_{\mathrm{OCT}}(n+1) = (1 - \alpha)\, V_{\mathrm{OCT}}(n).$$

These values are then compared and passed through a sigmoidal nonlinearity to determine the probability of each choice,

$$P_{\mathrm{OCT}}(n) = \frac{1}{1 + e^{-\beta\,(V_{\mathrm{OCT}}(n) - V_{\mathrm{MCH}}(n))}}.$$

The probability of choosing MCH was one minus that of OCT. The probability generated by this operation is compared with a value drawn from a uniform distribution over the [0, 1] interval to determine whether the resulting choice is OCT or MCH. These predicted choices could be compared to fly behavior to compute the model's percentage deviance explained. The parameters $\alpha$ and $\beta$ are fit for each fly so as to maximize the percentage deviance explained (values of these parameters can be seen in Fig. 2G and Fig. S3A,B).
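A compact MATLAB sketch of this model; the parameter values and the constant-probability reward stand-in are hypothetical:

```matlab
% Leaky integrator of reward history driving a sigmoidal choice rule.
alpha = 0.2; beta = 5;                % learning rate and sigmoid steepness (example values)
V = [0, 0];                           % values for [OCT, MCH]
nTrials = 240;
for n = 1:nTrials
    pOCT = 1 / (1 + exp(-beta * (V(1) - V(2))));
    c = 1 + (rand > pOCT);            % 1 = OCT, 2 = MCH
    R = double(rand < 0.5);           % stand-in for the baited reward schedule
    V = (1 - alpha) * V;              % both values leak every trial
    V(c) = V(c) + alpha * R;          % chosen option integrates its reward
end
```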

Win-stay, lose-switch model:
A third model to predict behavior incorporated information only about the fly's most recent choice, unlike the logistic regression and leaky-integrator alternatives. In this "win-stay, lose-switch" model, the agent chooses randomly on the first trial. If the chosen option produces a reward, the agent picks that option again on the next trial (stays). If it does not produce a reward, the agent picks the other option on the next trial (switches). This procedure repeats to generate a sequence of choices. The accuracy of this model was calculated by tallying correctly predicted switches and stays as well as incorrectly predicted switches and stays, shown in Fig. 2D as a probability matrix.
To calculate this matrix, the model was made to predict the behavior of flies on every trial of the dynamic foraging task. The average values across the 18 flies run in the dynamic foraging task are presented in the matrix in Fig. 2D.
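The prediction step amounts to a few lines of MATLAB; here choices (1 or 2) and rewards (0/1) stand in for a fly's recorded sequences:

```matlab
% Win-stay, lose-switch predictions against recorded behavior (sketch).
T = numel(choices);
pred = zeros(T, 1);
pred(1) = randi(2);                   % random choice on the first trial
for n = 2:T
    if rewards(n - 1)
        pred(n) = choices(n - 1);     % win: predict a stay on the previous choice
    else
        pred(n) = 3 - choices(n - 1); % lose: predict a switch to the other option
    end
end
accuracy = mean(pred(2:T) == choices(2:T));
```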

Neural circuit model of dynamic foraging:
We designed a neural circuit model, inspired by work from Loewenstein and Seung (14), that was used to simulate behavior in a dynamic foraging task. Two versions of this model were used.

Replicating Loewenstein and Seung's Model:
The first version aimed to directly replicate the model used by Loewenstein and Seung (Fig. S4A).
It generated behavior on a trial-by-trial basis in the dynamic foraging task. The number of trials to simulate was input by the user prior to simulation (60, 240, or 2000 trials). The model consisted of two sensory neurons ($S_1$ and $S_2$) whose activity was drawn at the beginning of each trial from a normal distribution with mean 1 and standard deviation 0.1. These neurons synapse with weights ($W_1$ and $W_2$) onto two motor neurons ($M_1$ and $M_2$). The activities of $M_1$ and $M_2$ were compared, and the choice was driven by whichever neuron had the larger activity.
Once a choice was made, rewards were provided as determined by the reward contingencies of the task. The weights between $S$ and $M$ were updated after each choice according to

$$\Delta W_i = \eta\, \tilde{R}\, \tilde{S}_i,$$

where $\tilde{R} = R$ or $\tilde{R} = R - E(R)$ and $\tilde{S}_i = S_i$ or $\tilde{S}_i = S_i - E(S_i)$, depending on the learning rule, and $i$ iterates over odors. Note that $E(R)$ and $E(S_i)$ depended on time and were calculated in one of two ways: i) by taking the mean over the last 10 trials, or ii) by filtering the entire history with an exponential filter with a timescale of 3.5 trials. The various covariance and non-covariance rules were achieved by selecting the appropriate combinations of $\tilde{R}$ and $\tilde{S}_i$.
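A minimal simulation of this loop might look as follows; it uses the reward-covariance variant of the rule, and the learning rate, reward probabilities, and omission of baiting are simplifications for illustration:

```matlab
% Reward-covariance rule in the two-neuron circuit (illustrative sketch).
rng(1);
eta = 0.05; tau = 3.5;                % learning rate (assumed) and reward-filter timescale
W = [1; 1];                           % weights from S1, S2 onto M1, M2
p = [0.8; 0.2];                       % example reward probabilities (baiting omitted)
Rbar = 0;                             % running estimate of E(R)
for n = 1:2000
    S = 1 + 0.1 * randn(2, 1);        % sensory activity: mean 1, sd 0.1
    M = W .* S;                       % motor neuron activity
    [~, c] = max(M);                  % choice goes to the larger motor response
    R = double(rand < p(c));          % reward outcome for the chosen option
    Rbar = Rbar + (R - Rbar) / tau;   % exponential filter of reward history
    W = W + eta * (R - Rbar) .* S;    % covariance rule: dW = eta*(R - E(R))*S
end
```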

Task and mushroom body inspired version:
The second version incorporated modifications that made the model more appropriate for the task we designed for fruit flies (Fig. 3B). This model consisted of two sensory inputs that represented the activity of populations of Kenyon cells (KCs). However, this version of the model looped through odor experiences, rather than looping through trials determined by two-alternative forced choices. Therefore, the activity of the sensory neurons was drawn differently. Rather than both values being drawn from a normal distribution with mean 1 and standard deviation $\sigma = 0.1$, this was only true for the odor that was deemed to have been "experienced" by the model on a given odor experience. The activity of the other neuron was drawn from a normal distribution with mean $\mu = 0.1$ (Fig. 3, 4; Fig. S5, 6) and standard deviation $\sigma = 0.1$. Here, $\mu$ represents the similarity, or overlap, between the two inputs. This was included because the KC representations of the two different odors used in our task are thought to have some amount of overlap (15). However, we found that modulating this term did not affect the resulting matching behavior (Fig. S6D-G), and so for Fig. 5 we chose $\mu = 0$. We also explored incorporating noise covariance between the two sensory inputs (with correlation coefficient $\rho = 0.1$), but this correlation was empirically unimportant, and we usually set $\rho = 0$.
Another difference is that an odor experience could lead to either an approach (choice) or a turn away. The behavior chosen by the model on any given odor experience depended on the response of the single output neuron incorporated into this model. The activity of this output neuron ($m$) was the weighted sum of the two inputs, $m = \sum_i W_i S_i$. This was then passed through a sigmoidal nonlinearity,

$$A = \frac{1}{1 + e^{-\beta\,(m - \theta)}},$$

where $\beta = 4$, $\theta = 1$ (this value was chosen to encourage exploration at the beginning of learning), and $A$ determines the action produced by the model. When $A = 0$ the odor is always accepted, and when $A = 1$ the odor is always rejected. A random number from the interval [0, 1] was drawn and compared to $A$ to determine whether an approach/choice or a turn was made. If a turn was made, no reward was provided, and weights remained unchanged. The model then experienced a new odor and the process repeated. If a choice was made, then a reward was provided based on the choice contingencies, and weights were updated according to the rules in eqs. 10 and 11. We added the additional constraints that $\eta$ is negative (synaptic depression) and that weights have a lower bound of 0.
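The following MATLAB sketch puts these pieces together; the parameter values, the reward-expectation filter, and the baiting details are illustrative choices consistent with the description above, not the exact implementation:

```matlab
% Mushroom-body-inspired accept/reject loop (illustrative sketch).
rng(1);
eta = -0.05; beta = 4; theta = 1;     % depression rule (assumed rate); sigmoid parameters
mu = 0; sigma = 0.1;                  % input overlap and noise
p = [0.8, 0.2];                       % baiting probabilities per odor (example)
W = [1; 1]; baited = false(1, 2); Rbar = 0; tau = 3.5;
for k = 1:5000                        % loop over odor experiences
    o = randi(2);                     % odor encountered on this experience
    S = mu + sigma * randn(2, 1);     % background input for the other odor
    S(o) = 1 + sigma * randn;         % experienced odor has mean 1
    m = W' * S;                       % output (MBON-like) neuron activity
    A = 1 / (1 + exp(-beta * (m - theta)));   % probability of turning away
    if rand < A, continue; end        % turn: no reward, no plasticity
    baited = baited | (rand(1, 2) < p);       % baited rewards persist until collected
    R = double(baited(o)); baited(o) = false;
    Rbar = Rbar + (R - Rbar) / tau;   % reward expectation over choices
    W = max(W + eta * (R - Rbar) .* S, 0);    % depression rule; weights bounded at 0
end
```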
Plasticity requirements of operant matching in the mushroom body model:

Relating operant matching to the covariance of neural activity and reward
We begin by reproducing the key theoretical argument provided by Loewenstein and Seung. Consider a sequence of trials in which an animal chooses between two options and receives feedback via reward. Specifying an element in this trial sequence requires three random variables: the choice ($C$), the reward ($R$), and the underlying neural activity ($N$). Note that $N$ is very general in this argument; it can be any quantification of neural activity whose mean depends on choice. We further assume that both options are sometimes chosen, and that neural activity and reward are conditionally independent given choice. Under these assumptions, Loewenstein and Seung show that Herrnstein's operant matching law is satisfied if and only if the covariance between $N$ and $R$ vanishes over the trial sequence.
The proof begins by recalling the definition of covariance,

$$\mathrm{Cov}(N, R) = E(\delta N\, \delta R),$$

where $E$ denotes the expectation over the trial sequence (i.e., over $C$, $R$, and $N$) and $\delta X = X - E(X)$.
By the product rule of probability, $P(N, R, C) = P(C)\, P(N, R \mid C)$, so we can rewrite this expectation as

$$\mathrm{Cov}(N, R) = E_C\!\left[E_{N,R|C}(\delta N\, \delta R)\right] = E_C\!\left[E_{N|C}(\delta N)\, E_{R|C}(\delta R)\right],$$

where the subscripts on $E$ denote the probability distributions over which the expectations are computed, and we used the conditional independence assumption, $P(N, R \mid C) = P(N \mid C)\, P(R \mid C)$, in the second step. Writing out the expectation over choice explicitly, the expression for the covariance becomes

$$\mathrm{Cov}(N, R) = P(C{=}1)\, E_{N|C=1}(\delta N)\, E_{R|C=1}(\delta R) + P(C{=}2)\, E_{N|C=2}(\delta N)\, E_{R|C=2}(\delta R).$$

To simplify this expression, note that

$$E_C\!\left[E_{N|C}(\delta N)\right] = E(\delta N) = 0 \;\;\Rightarrow\;\; P(C{=}1)\, E_{N|C=1}(\delta N) = -P(C{=}2)\, E_{N|C=2}(\delta N),$$

where we again used the product rule of probability, and the notation $\Rightarrow$ means that the lefthand equation implies the righthand one. It follows that

$$\mathrm{Cov}(N, R) = P(C{=}1)\, E_{N|C=1}(\delta N)\left[E_{R|C=1}(\delta R) - E_{R|C=2}(\delta R)\right]. \tag{17}$$

From this expression, we can conclude that

$$E_{R|C=1}(R) = E_{R|C=2}(R) \;\;\Rightarrow\;\; \mathrm{Cov}(N, R) = 0.$$

The lefthand equation is the matching law, as it says that the expected reward is independent of choice, so the matching law implies a vanishing covariance between neural activity and reward. Moreover, it follows that $E_{N|C=1}(\delta N) \neq 0 \neq E_{N|C=2}(\delta N)$ from the assumption that the average neural activity has a choice dependence (i.e., $E_{N|C=1}(N) \neq E_{N|C=2}(N)$). Consequently, neither $P(C{=}1)$ nor $E_{N|C=1}(\delta N)$ is equal to zero, and

$$\mathrm{Cov}(N, R) = 0 \;\;\Rightarrow\;\; E_{R|C=1}(R) = E_{R|C=2}(R).$$

This equation says that the matching law follows from the vanishing covariance of choice-related neural activity and reward. This completes the proof.
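This result is easy to verify numerically; the following MATLAB sketch uses arbitrary example values:

```matlab
% Numerical check: covariance vanishes exactly when returns are equalized.
rng(1);
T = 1e6;
C = (rand(T, 1) < 0.7) + 1;           % choices 1 or 2, with P(C=1) = 0.7
N = 1.0 * (C == 1) + 0.2 * (C == 2) + 0.1 * randn(T, 1);   % choice-dependent activity
Rmatch = double(rand(T, 1) < 0.3);    % E(R|C=1) = E(R|C=2) = 0.3: matching holds
Rdiff = double(rand(T, 1) < (0.5 * (C == 1) + 0.1 * (C == 2)));   % unequal returns
covMatch = mean((N - mean(N)) .* (Rmatch - mean(Rmatch)));   % approximately 0
covDiff = mean((N - mean(N)) .* (Rdiff - mean(Rdiff)));      % clearly nonzero
```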

Casting the mushroom body model in the framework of Loewenstein and Seung
Our mushroom body model consists of a sequence of odor presentations and accept/reject decisions. Specifying an element in this decision sequence requires four random variables: the odor option experienced ($O$), the odor-induced neural activity in the KCs ($K$), the accept/reject decision provided by the MBON ($A$), and the reward received from this action ($R$). Reward delivery and plasticity only occur when an odor option is accepted, and we refer to the accepted odor option as the choice ($C$). Note that $C$ is undefined when the odor is rejected. Therefore, specifying an element in the choice sequence only requires three random variables: $C$, $K$, and $R$.
Setting $N = K$, this choice sequence satisfies the assumptions of Loewenstein and Seung's theory. We therefore expect flies to obey Herrnstein's operant matching law if and only if the covariance between KC activity and reward is equal to zero over the choice sequence.
It is important to recognize that the matching law is generally inconsistent with vanishing covariance between KC activity and reward over the decision sequence (rather than the choice sequence).Our assumption that plasticity only occurs following the decision to accept (i.e., the choice) is thus critical for obtaining matching behavior from covariance-based plasticity rules.

Vanishing covariance does not imply matching between more than two alternatives
The preceding analyses assumed binary choices between two options. However, Herrnstein's operant matching law can also be satisfied with more than two options, and the general form of the matching law is

$$P(C{=}i) = \frac{P(C{=}i)\, E_{R|C=i}(R)}{\sum_{j=1}^{M} P(C{=}j)\, E_{R|C=j}(R)}, \qquad i = 1, \ldots, M,$$

where $M \geq 2$ is the number of options. We can write this condition more succinctly as

$$E_{R|C=i}(R) = E(R) \quad \text{for all } i.$$

Here we show that this more general form of the matching law implies that the covariance between neural activity and reward vanishes. However, the converse is not true, as it is possible for the covariance to vanish without behavior that satisfies the matching law. The biologically important consequence of this result is that covariance-based plasticity rules may not lead to matching when the animal is deciding between more than two options.
In this more general decision-making task, we express the covariance between the neural activity and reward over the choice sequence as

$$\mathrm{Cov}(N, R) = \sum_{i=1}^{M} P(C{=}i)\, E_{N|C=i}(\delta N)\, E_{R|C=i}(\delta R),$$

where we have made the same conditional independence assumption as in the binary analysis. If the matching law is satisfied, then $E_{R|C=i}(\delta R)$ is the same for every option, so we can take it out of the sum and we find

$$\mathrm{Cov}(N, R) = E_{R|C=1}(\delta R) \sum_{i=1}^{M} P(C{=}i)\, E_{N|C=i}(\delta N) = E_{R|C=1}(\delta R)\, E(\delta N) = 0.$$

Therefore, the matching law implies that the covariance between neural activity and reward is zero.
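A concrete counterexample with three options illustrates why the converse fails; the numbers below are arbitrary:

```matlab
% With three options, Cov(N,R) can vanish even though returns are unequal.
P = [1/3, 1/3, 1/3];                  % choice probabilities
EN = [0, 1, 2];                       % E(N | C = i): choice-dependent activity
ER = [0.5, 0.1, 0.5];                 % E(R | C = i): not all equal, so no matching
dN = EN - P * EN';                    % deviations from the grand mean of N
dR = ER - P * ER';                    % deviations from the grand mean of R
covNR = P * (dN .* dR)';              % equals 0 despite unequal returns
```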

Logistic regression model for estimating learning rules:
To determine the learning rules that best predict fly behavior, we designed a logistic regression model that made use of the known relationship between MBON activity and behavior. This model predicted behavior from the inputs and synaptic weights that give rise to MBON activity, following the relationship

$$W_i(n) = W_i(0) + \sum_{\{n' < n \,:\, A(n') = 0\}} \Delta W_i(n'),$$

where $n$ indexes odor experiences, the condition $A(n') = 0$ selects all past odor experiences on which the fly chose to accept the odor, and $W_i(n)$ represents the synaptic weights associated with neurons representing odor $i$ at time $n$. The change in synaptic weights $\Delta W_i(n)$ depends on the learning rule used by the circuit, and it was here that we wanted the regression model to identify the rule that provided the best fit to the actual data. To do this, we allowed the model to use a learning rule with four different terms whose coefficients could be modified,

$$\Delta W_i(n) = a\, S_i(n)\, R(n) + b\, S_i(n) + c\, R(n) + d.$$

Here, $a$, $b$, $c$, and $d$ are the coefficients assigned to each component of the learning rule. The regression model takes the sensory stimuli and synaptic weights at a given time as inputs to predict the output action. However, when fitting this model to behavior, we have only sensory stimulus and reward information readily available. We therefore used eqs. 26 and 27 to convert synaptic weights and sensory stimuli into inputs that consisted of sensory stimuli and rewards, plus a constant input that serves as a bias term; for example, the input multiplying $a$ is $\sum_i S_i(n) \sum_{\{n' < n \,:\, A(n') = 0\}} S_i(n')\, R(n')$, and the inputs multiplying $b$, $c$, and $d$ are constructed analogously. The coefficients assigned to each of the five inputs (the bias plus $a$, $b$, $c$, and $d$) could then be used to identify the learning rule that the model predicted as the best estimate for producing the behavior that was tested. Of course, the values of these coefficients varied from fly to fly. To examine whether pairs of coefficients changed in a correlated manner across flies, we estimated the correlations between the terms using the MATLAB function corrcoef, which produces a matrix of correlation coefficients.
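As a sketch, the regressors can be assembled from the stimulus/reward history as follows, assuming an nExp x 2 stimulus matrix S, a reward vector R, and a logical vector accept marking accepted odor experiences (all names and the fitting call are illustrative):

```matlab
% Build the four learning-rule regressors and fit the accept/reject decisions.
nExp = size(S, 1);
X = zeros(nExp, 4);                   % one column per learning-rule term (a, b, c, d)
for n = 2:nExp
    past = find(accept(1:n - 1));     % accepted experiences before experience n
    X(n, 1) = S(n, :) * (S(past, :)' * R(past));   % a term: history of S_i(n') R(n')
    X(n, 2) = S(n, :) * sum(S(past, :), 1)';       % b term: history of S_i(n')
    X(n, 3) = sum(S(n, :)) * sum(R(past));         % c term: history of R(n')
    X(n, 4) = sum(S(n, :)) * numel(past);          % d term: count of past updates
end
coeffs = glmfit(X, double(accept), 'binomial');    % logistic fit: bias plus 4 terms
```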

Figure S8

Figure S9