A rational theory of the limitations of working memory and attention

The precision with which items are encoded in working memory and attention decreases with the number of encoded items. Current theories typically account for this “set size effect” by postulating a hard constraint on the allocated amount of encoding resource. While these theories have produced models that are descriptively successful, they offer no principled explanation for the very existence of set size effects: given their detrimental consequences for behavioral performance, why have these effects not been weeded out by evolutionary pressure, by allocating resources proportionally to the number of encoded items? Here, we propose a theory that is based on an ecological notion of rationality: set size effects are the result of a near-optimal trade-off between behavioral performance and the neural costs associated with stimulus encoding. We derive models for four visual working memory and attention tasks and show that they account well for data from eleven previously published experiments. Moreover, our results suggest that the total amount of resource that subjects allocate for stimulus encoding varies non-monotonically with set size, which is consistent with our rational theory of set size effects but not with previous descriptive theories. Altogether, our findings suggest that set size effects may have a rational basis and highlight the importance of considering ecological costs in theories of human cognition.


INTRODUCTION
Human cognition is strongly constrained by set size effects in working memory and attention: the precision with which these systems encode information rapidly declines with the number of items, as observed, for example, in delayed estimation, change detection, visual search, and multiple-object

Table 1; estimated precision computed in the same way as in Fig. 3A). (C) Stimulus encoding is assumed to be associated with two kinds of loss: a behavioral loss that decreases with encoding precision and a neural loss that is proportional to both set size and precision. In the delayed-estimation task, the expected behavioral error loss is independent of set size. (D) Total expected loss has a unique minimum that depends on the number of remembered items. The mean precision per item that minimizes expected total loss is referred to as the optimal mean precision (arrows) and decreases with set size. The parameter values used to produce panels C and D were λ=0.01, β=2, and τ↓0.

The key novelty of our theory is the idea that stimuli are encoded with a level of mean precision, $\bar{J}$, that minimizes a combination of behavioral loss and neural loss. Behavioral loss is induced by making an error ε, which we formalize using a mapping $L_{\text{behavioral}}(\varepsilon)$. This mapping may depend on both internal incentives (e.g., intrinsic motivation) and external ones (e.g., the reward scheme imposed by the experimenter). For the moment, we choose a power-law function, $L_{\text{behavioral}}(\varepsilon) = |\varepsilon|^{\beta}$ with β>0 as a free parameter, such that larger errors correspond with larger loss. The expected behavioral loss, denoted $\bar{L}_{\text{behavioral}}$, is obtained by averaging loss across all possible errors, weighted by the probability that each error occurs,

$$\bar{L}_{\text{behavioral}}(\bar{J}, N) = \int L_{\text{behavioral}}(\varepsilon)\, p(\varepsilon \mid \bar{J}, N)\, d\varepsilon, \qquad (1)$$

where $p(\varepsilon \mid \bar{J}, N)$ is the estimation error distribution for given mean precision and set size. In single-probe delayed-estimation tasks, the expected behavioral loss is independent of set size and subject to the law of diminishing returns (Fig. 1C, black curve).
A second kind of loss is the energetic expenditure incurred by representing a stimulus. Since this loss is primarily rooted in neural spiking activity, we refer to it as "neural loss" and use neural theory to make an estimate of the relation between encoding precision and neural loss. For many choices of spike variability, including the common one of Poisson-like variability (Ma, Beck, Latham, & Pouget, 2006), the precision (Fisher information) of a stimulus encoded in a neural population is proportional to the trial-averaged neural spiking rate (Paradiso, 1988; Seung & Sompolinsky, 1993). Moreover, it has been estimated that the energetic loss induced by each spike increases with spiking rate (Attwell & Laughlin, 2001; Lennie, 2003). When combining these two premises, the expected neural loss associated with the encoding of an item is a supralinear function of encoding precision. However, to minimize free model parameters, we assume for the moment that the function is linear (at the end of this section we present a mathematical proof that the main qualitative prediction of our theory generalizes to any supralinear function). Further assuming that stimuli are encoded independently of each other, expected neural loss is also proportional to the number of encoded items, N. We thus obtain

$$\bar{L}_{\text{neural}}(\bar{J}, N) = \alpha \bar{J} N, \qquad (2)$$

where α is a free parameter that represents the amount of neural loss incurred by a unit increase in mean precision (Fig. 1C, colored lines).

We combine the two types of expected loss into a total expected loss function (Fig. 1D)

Under the loss functions proposed above, we find that optimal $\bar{J}$ is a decreasing function of set size (Fig. 1D), which is qualitatively consistent with set size effects observed in experimental data (cf.

When formalizing the loss functions, we had to make specific assumptions about how behavioral errors map to behavioral loss and encoding precision to neural loss.
Since these assumptions cannot yet be fully empirically substantiated, it is important to verify that our theory generalizes to other choices that we could have made. To this end, we asked under what conditions our general theory, Eq. (4), predicts a set size effect (i.e., a decline of encoding precision with set size).
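
The trade-off described in this section can be illustrated with a minimal numerical sketch. It is a deliberately simplified special case and not the model fitted in this paper: errors are assumed to be Gaussian with variance 1/J (instead of following a variable-precision model on a circular domain), β is fixed to 2 so that the expected behavioral loss equals 1/J, and total loss is assumed to be the sum of this behavioral loss and a linear neural loss `lam * J * N`. The function names and the weight `lam` are illustrative assumptions.

```python
import math

def total_loss(J, N, lam):
    """Assumed total loss: squared-error behavioral loss (1/J for Gaussian
    errors with variance 1/J) plus a linear neural loss lam * J * N."""
    return 1.0 / J + lam * J * N

def optimal_precision_closed_form(N, lam):
    """Minimizer of 1/J + lam*J*N, obtained from -1/J**2 + lam*N = 0."""
    return 1.0 / math.sqrt(lam * N)

def optimal_precision_grid(N, lam, J_max=50.0, steps=50000):
    """Brute-force check of the closed form on a fine grid of J values."""
    grid = [J_max * (i + 1) / steps for i in range(steps)]
    return min(grid, key=lambda J: total_loss(J, N, lam))

if __name__ == "__main__":
    lam = 0.01  # same value as quoted for the illustration in Fig. 1C-D
    for N in (1, 2, 4, 8):
        print(N,
              round(optimal_precision_closed_form(N, lam), 3),
              round(optimal_precision_grid(N, lam), 3))
```

In this simplified case the loss-minimizing precision falls as 1/sqrt(N). The full model uses a different error distribution and a free loss exponent, so the exact shape of the decline should not be read off this sketch.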

Model fits
To evaluate whether our theory can quantitatively account for experimental data, we fit the model formulated above to 67 individual-subject data sets from a delayed-estimation benchmark set (Table 1). The maximum-likelihood fit accounts well for the raw error distributions (Fig. 2A) and the two statistics that summarize these distributions (Fig. 2B). Hence, these data are consistent with the theory that set size effects are the result of an ecologically rational trade-off between behavioral performance and neural cost. Maximum-likelihood estimates of the three model parameters (λ, τ, and β) are provided in Supplementary Table S1.

Total precision as a function of set size
One feature that sets our rational theory apart from previous theories is that it does not predict a trivial relationship between the total amount of allocated encoding resource and set size. To see this, we quantify the amount of allocated resources as the precision per item summed across all items, $\bar{J}_{\text{total}} = \bar{J}N$. In fixed-resource models, this quantity is by definition constant and in power-law models it varies monotonically with set size. By contrast, we find that in the fits to several of the delayed-estimation experiments, total precision in the rational model varies non-monotonically with set size (Fig. 3B, gray curves). To examine whether there is evidence for such non-monotonic behavior in the subject data, we use the fitted precision values from the unconstrained model as our best empirical estimates of the precision with which subjects encoded items. We find that these empirical estimates show signs of similar non-monotonic relations in some of the experiments (Fig. 3B, black circles). To quantify this statistically, we performed Bayesian paired t-tests (JASP Team, 2017) to compare the empirical $\bar{J}_{\text{total}}$ estimates at set size 3 with the estimates at set sizes 1 and 6 in the experiments that included these set sizes (DE2 and DE4-6; Table 1). These tests reveal strong evidence that total precision at set size 3 is higher than total precision at both set size 1 ($\mathrm{BF}_{+0} = 1.05 \cdot 10^{7}$) and set size 6 ($\mathrm{BF}_{+0} = 4.02 \cdot 10^{2}$). Moreover, across all six experiments, the subject-averaged set size at which $\bar{J}_{\text{total}}$ is highest in the unconstrained model is 3.52±0.18. These findings suggest that the total amount of resources that subjects allocate for stimulus encoding varies non-monotonically with set size, which is consistent with our rational model but not with previous descriptive models.
To the best of our knowledge, this non-monotonic behavior has not been reported before and may be used to further constrain models of visual working memory and attention.
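
The contrast with descriptive models can be made concrete with a small sketch. It assumes two common parameterizations: a fixed-resource model that divides a constant budget equally over items, so that mean precision is `J1 / N`, and a power-law model in which mean precision is `J1 * N ** exponent`. The names `J1` and `exponent` and the example values are illustrative assumptions, not fitted estimates.

```python
def total_precision_fixed(J1, N):
    """Fixed-resource model (assumed form): per-item precision J1 / N,
    so total precision summed over N items stays at J1."""
    return (J1 / N) * N

def total_precision_power_law(J1, exponent, N):
    """Power-law model (assumed form): per-item precision J1 * N**exponent,
    so total precision is J1 * N**(exponent + 1)."""
    return J1 * N ** exponent * N

def is_monotonic(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(b >= a for a, b in pairs) or all(b <= a for a, b in pairs)

if __name__ == "__main__":
    set_sizes = range(1, 9)
    print([round(total_precision_fixed(10.0, N), 3) for N in set_sizes])
    for exponent in (-0.5, -1.5):
        totals = [total_precision_power_law(10.0, exponent, N) for N in set_sizes]
        print(exponent, is_monotonic(totals))
```

Under these assumed forms, total precision is either flat or moves in a single direction with set size, so neither can produce a peak at an intermediate set size.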

Alternative loss functions
To evaluate the necessity of a free parameter in the behavioral loss function, $L_{\text{behavioral}}(\varepsilon)$, we also test the following three parameter-free choices: |ε|, ε², and −cos(ε). Model comparison favors the original model with AIC differences of 14.0±2.8, 24.4±4.1, and 19.5±3.5, respectively. While there may be other parameter-free functions that give better fits, we expect that a free parameter is unavoidable here, as it is likely that the error-to-loss mapping differs across experiments (due to differences in external incentives) and possibly also across subjects within an experiment (due to differences in internal incentives). We also test a two-parameter function that was proposed recently (Eq.

We next examine the generality of our theory by testing whether it can also explain set size effects in two change detection tasks (Table 1). In these experiments, the subject is sequentially presented with two sets of stimuli on each trial and reports whether there was a change at any of the stimulus locations (Fig. 4A). A change was present on half of the trials, at a random location and with a random change magnitude. The behavioral error, ε, takes only two values in this task: "correct" and "incorrect". Therefore,

optimal rule (see Supplementary Information) and that there is random variability in encoding precision. This decision rule introduces one free parameter, $p_{\text{change}}$, specifying the subject's prior belief that a change will occur. Due to the binary nature of ε in this task, the free parameter of the behavioral loss function drops out of the model, as its effect is equivalent to changing parameter λ (see Supplementary Information). The model thus has three free parameters (λ, τ, and $p_{\text{change}}$). We find that the maximum-likelihood fits account well for the data in both experiments (Fig. 4B).

So far, we have considered tasks with continuous and binary judgments.
We next consider two change localization experiments (Table 1) in which judgments are non-binary but categorical. The task is identical to change detection, except that a change is present on every trial and the observer reports the location at which the change occurred (out of 2, 4, 6, or 8 locations). We again assume variable precision and an optimal decision rule (see Supplementary Information). Although the rational model has only two free parameters (λ and τ), it accounts well for both datasets (Fig. 4C).

The final task to which we apply our theory is a visual search experiment (Mazyar et al., 2013) (Table 1). Unlike the previous three tasks, this is not a working memory task, as there was no delay period between stimulus offset and response. Set size effects in this experiment are thus likely to stem from limitations in attention rather than memory, but our theory applies without any additional assumptions. Subjects judged whether a vertical target was present among N briefly presented oriented ellipses (Fig. 4D). The distractors were drawn from a Von Mises distribution centered at vertical. The width of the distractor distribution determined the level of heterogeneity in the search display. Each subject was tested under three different levels of heterogeneity. We again assume variable precision and an optimal decision rule (see Supplementary Information). This decision rule has one free parameter, $p_{\text{present}}$, specifying the subject's prior degree of belief that a target will be present. We fit the three free parameters (λ, τ, and $p_{\text{present}}$) to the data from all three heterogeneity conditions at once and find that the model accounts well for the dependencies of the hit and false alarm rates on both set size and distractor heterogeneity (Fig. 4E).

induced by stimulus encoding.
The models that we derived from this hypothesis account well for data across a range of quite different tasks, despite having relatively few parameters. Moreover, they account for a non-monotonicity that appears to exist in the relation between set size and the total amount of resources that subjects allocate for stimulus encoding.

While the main purpose of our study was to make a conceptual advancement, by providing a principled theory for a phenomenon that has thus far been approached only descriptively,

resulting in lower task performance. However, these models offer no principled justification for the existence of interference and some require additional mechanisms to account for set size effects; for example, the model by Oberauer and colleagues requires three additional components, including a set-size dependent level of background noise, to fully account for set size effects (Oberauer & Lin, 2017). That being said, we do not deny that there may be interference effects in working memory, and adding them to the models we presented here may improve their goodness of fit.

Our approach shares both similarities and differences with the concept of bounded rationality (Simon, 1957), which states that human behavior is guided by mechanisms that provide "good enough" solutions rather than optimal ones. The main similarity is that both approaches acknowledge that human behavior is constrained by various cognitive limitations. However, an important difference is that in the theory of bounded rationality, these limitations are postulates or axioms, while our approach explains them as rational outcomes of ecological optimization processes. This suggestion that cognitive limitations are subject to optimization instead of fixed may also have implications for theories outside the field of psychology.
In the theory of "rational inattention" in behavioral economics, agents make optimal decisions under the assumption that there is a fixed limit on the total amount of attention that they can allocate to process economic data (C. A. Sims, 2003). This fixed-attention assumption is similar to the fixed-resource assumption in models of visual working memory and it could be interesting to explore the possibility that the amount of allocable attention is the outcome of a trade-off between expected economic performance and the expected cost induced by allocating attention to process economic data.

While our results show that set size effects can in principle be explained as the result of an optimization strategy, they do not necessarily imply that encoding precision is fully optimized on every trial in any given task. First, encoding precision in the brain most likely has an upper limit,

that the brain settled on works well on average, but is not tailored to provide an optimal solution in every possible situation. In that case, set size effects could be more rigid across environmental changes (e.g., in task or reward structure) than predicted by a model that incorporates every such change in a fully optimal manner.

working memory tasks is to use a cue to indicate which item is most likely going to be probed. Previous studies that used this manipulation (Bays, 2014; Klyszejko, Rahmati, & Curtis, 2014) found increased encoding precision in cued items compared to uncued items, consistent with an ideal observer strategy. It would be interesting to examine whether our model can quantitatively account for such data. Moreover, an intuitive argument suggests that our theory predicts set size effects on the cued item to become weaker as a function of cue validity.
At minimum cue validity, which is equivalent to using no cue, as in the experiments analyzed in this paper, our model predicts a decline of encoding precision with set size. At maximum validity, however, the loss-minimizing strategy is obviously to always encode the cued item with the level of precision that would be optimal for set size 1, thus entirely eliminating a set size effect. Our model makes precise quantitative predictions about this transition from strong set size effects at low cue validity to no set size effects at maximum cue validity. Moreover, the predicted set size effects are likely to differ between the cued and uncued items, which could be tested using the same experiment.

A seemingly obvious way to experimentally manipulate the neural loss function would be to vary the delay period. However, the neural mechanisms underlying working memory maintenance are still debated, which makes it difficult to derive model predictions for this manipulation.

where $\Delta L \equiv L_{\text{correct}} - L_{\text{incorrect}}$. Since ΔL and λ have interchangeable effects on optimal $\bar{J}$, we fix ΔL to 1 and fit only λ as a free parameter.

Conditions under which optimal precision declines with set size

In this section, we show that when the expected behavioral loss is independent of set size (as in single-probe delayed estimation and change detection), the rational model predicts optimal precision to decline with set size whenever the following four conditions are satisfied:
1) The expected behavioral loss is a strictly decreasing function of encoding precision, i.e., an increase in precision results in an increase in behavioral performance.
2) The expected behavioral loss is subject to a law of diminishing returns (Mankiw, 2004): the behavioral benefit obtained from a unit increase in precision decreases with precision. This law will hold when condition 1 holds and the loss function is bounded from below, which is generally the case as errors cannot be negative.
3) The expected neural loss is an increasing function of encoding precision.
4) The expected neural loss per unit of precision is a non-decreasing function of precision. On the premise that precision is proportional to spike rate (Paradiso, 1988; Seung & Sompolinsky, 1993), this condition is satisfied if loss per spike increases with spike rate, which has been found to be the case (Sterling & Laughlin, 2015).
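
A quick numerical check can illustrate (though not prove) this claim. The sketch below picks one hypothetical pair of loss functions that satisfies the four conditions: a behavioral loss 1/(1+J) and a supralinear neural loss J^1.5. It assumes that total loss is their sum, with the neural term scaled by set size N and by a weight, and locates the loss-minimizing precision on a grid. All function names and parameter values are assumptions made for illustration.

```python
def behavioral_loss(J):
    """Hypothetical behavioral loss: strictly decreasing in J, with
    diminishing returns, and bounded from below by 0 (conditions 1-2)."""
    return 1.0 / (1.0 + J)

def neural_loss(J, power=1.5):
    """Hypothetical neural loss per item: increasing in J, with loss per
    unit of precision J**(power - 1) non-decreasing for power >= 1
    (conditions 3-4)."""
    return J ** power

def optimal_precision(N, weight=0.01, J_max=30.0, steps=30000):
    """Grid search for the J that minimizes the assumed total loss
    behavioral_loss(J) + weight * N * neural_loss(J)."""
    grid = [J_max * i / steps for i in range(1, steps + 1)]
    return min(grid,
               key=lambda J: behavioral_loss(J) + weight * N * neural_loss(J))

if __name__ == "__main__":
    for N in range(1, 9):
        print(N, round(optimal_precision(N), 3))
```

For this particular choice the minimizing precision shrinks at every step from N=1 to N=8. Swapping in other functions that satisfy the four conditions is a simple way to probe the generality of the result numerically.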