A novel source of variation in IT population response magnitude predicts image memorability

Most accounts of image and object encoding in inferotemporal cortex (IT) focus on the distinct patterns of spikes that different images evoke across the IT population. By analyzing data collected from IT as monkeys performed a visual memory task, we demonstrate that variation in a complementary coding scheme, the magnitude of the population response, can largely account for how well images will be remembered. To investigate the origin of IT image memorability modulation, we probed convolutional neural network models trained to categorize objects. We found that, like the brain, different natural images evoked different magnitude responses from these networks, and in higher layers, larger magnitude responses were correlated with the images that humans and monkeys find most memorable. Together, these results suggest that variation in IT population response magnitude is a natural consequence of the optimizations required for visual processing, and that this variation has consequences for visual memory.

Impact statement: Population response magnitude predicts how well an image will be remembered, both in monkey inferotemporal cortex and in neural networks trained to categorize objects.


At higher stages of visual processing such as inferotemporal cortex (IT), representations of image and object identity are thought to be encoded as distinct patterns of spikes across the IT population, consistent with neurons that are individually "tuned" for distinct image and object properties. In a population representational space, these distinct spike patterns translate into distinct response vectors, whereas the overall magnitude of the population response is often assumed to be unimportant (but see (Chang and Tsao, 2017)), and it is typically disregarded in population-based approaches, including population decoding and representational similarity analyses (Kriegeskorte et al., 2008). Building on that understanding, investigations of cognitive processes, such as memory, appreciate the importance of equating image sets for the robustness of their underlying visual representations in an attempt to isolate the cognitive process under investigation from variation due to changes in the robustness of the sensory input. This process amounts to matching decoding performance or representational similarity between sets of images in order to control for low-level factors (e.g. contrast, luminance and spatial frequency content) and visual discriminability (Willenbockel et al., 2010). Here we demonstrate that variation in IT population response magnitude has important behavioral consequences for one higher cognitive process, how well images will be remembered: images that produce larger population responses are more memorable. Our results suggest that the lack of appreciation for this type of variation in the IT population response should be reconsidered.
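The distinction drawn above between response pattern and response magnitude can be made concrete with a small numerical sketch (simulated spike counts, not the study's data): each image's population response vector factors into a unit-length pattern and a scalar magnitude, and it is the pattern alone that decoding and representational similarity analyses typically retain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical IT population responses: rows = images, columns = units.
# (Simulated Poisson spike counts; stand-ins for recorded data.)
responses = rng.poisson(lam=5.0, size=(4, 26)).astype(float)

# Decompose each image's population response vector into a
# magnitude (vector length) and a pattern (unit-length direction).
magnitudes = np.linalg.norm(responses, axis=1)
patterns = responses / magnitudes[:, np.newaxis]

# Pattern * magnitude recovers the original response exactly;
# population decoding and RSA compare the patterns, discarding
# the magnitude component.
reconstructed = patterns * magnitudes[:, np.newaxis]
assert np.allclose(reconstructed, responses)
print(magnitudes.round(2))
```

The decomposition is lossless, which is why ignoring magnitude is a modeling choice rather than a mathematical necessity.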

To test the hypothesis presented in Figure 1, we obtained image memorability scores by passing images through a model designed to predict image memorability for humans (Khosla et al., 2015). The neural data, also reported in (Meyer and Rust, 2018), were recorded from IT as two rhesus monkeys performed a single-exposure visual memory task in which they reported whether images were novel (never before seen) or familiar (seen once previously; Figure 2a). In each experimental session, neural populations with an average size of 26 units were recorded. Although IT responses were reduced for familiar relative to novel presentations (mean proportional reduction in this spike count window = 6.2%; see also (Meyer and Rust, 2018)), the correlation between memorability and IT population response magnitude remained strong when computed for images both when they were novel (Pearson correlation: r = 0.62; p = 2 x 10^-12) and when they were familiar (Pearson correlation: r = 0.58; p = 8 x 10^-11).
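The core analysis is a correlation between a per-image population response magnitude and a per-image memorability score. A minimal sketch of that computation, with simulated data in place of the recorded spike counts (the 26-unit population size is taken from the text; the response model here is an assumption built so that memorable images evoke larger responses):

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_units = 100, 26

# Hypothetical memorability scores and simulated population responses
# in which more memorable images tend to evoke larger responses.
memorability = rng.uniform(0.4, 1.0, size=n_images)
responses = rng.poisson(lam=5.0 * memorability[:, np.newaxis],
                        size=(n_images, n_units)).astype(float)

# Population response magnitude per image (L2 norm across units).
magnitude = np.linalg.norm(responses, axis=1)

# Pearson correlation between magnitude and memorability.
r = np.corrcoef(magnitude, memorability)[0, 1]
print(f"r = {r:.2f}")
```

With real data, the same two-step recipe applies: reduce each image's population response to a single magnitude, then correlate that vector of magnitudes with the model-derived memorability scores.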

The strength of the correlation between memorability and IT response magnitude is notable given the species difference: the memorability scores were derived from a model designed to predict what humans find memorable, whereas the neural data were collected from rhesus monkeys. In contrast to the human-based scores, which reflect the estimated average performance of ~80 human individuals, our monkey behavioral data are binary (i.e. correct/incorrect for each image). As such, the monkey behavioral data cannot be used in the same way to concatenate neural data across sessions to create a pseudopopulation sufficiently large to accurately estimate IT population response magnitudes. However, our data did allow us to evaluate whether human-based memorability scores were predictive of the images that the monkeys found most memorable during the single-exposure visual memory task, and we found that this was in fact the case (Figure 2c).

While the monkeys involved in these experiments were not explicitly trained to report object identity, they presumably acquired the ability to identify objects naturally over their lifetimes. The correlations between IT population response magnitude and image memorability could thus result from optimizations for visual memory, or they could follow more simply from the optimizations that support visual processing, including object and scene identification. If a system trained to categorize objects and scenes (but not trained to report familiarity) could account for the correlations we observe between IT response magnitude variation and image memorability, this would suggest that image memorability follows from the optimizations for visual (as opposed to mnemonic) processing. To investigate the origin of memorability variation, we examined the correlate of memorability in a convolutional neural network (CNN) model trained to categorize thousands of objects and scenes but not explicitly trained to remember images or estimate memorability (Khosla et al., 2015). We found that the correlation between response magnitude and memorability also emerged in the higher layers of this network. The mechanism that we describe here is also likely to be partially, but not entirely, overlapping with other proposed mechanisms, and it is difficult to attribute the effects we describe to any single mechanism using neural data alone. The fact that variations in response magnitude that correlate with memorability emerge from static, feed-forward, and fixed networks suggests that memorability variation is unlikely to follow primarily from the types of attentional mechanisms that require top-down processing or plasticity beyond that required for wiring up a system to identify objects.
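Measuring per-layer response magnitudes in a feed-forward network reduces to taking the norm of each layer's activation vector for each image. The sketch below uses a toy random-weight ReLU network as a stand-in; unlike the trained networks probed in the study, random weights are not expected to predict memorability, so this illustrates only the measurement itself (the layer sizes and image dimensionality are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy feed-forward ReLU network standing in for a trained CNN.
layer_sizes = [256, 128, 64, 32]
weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def layer_magnitudes(image_vec):
    """Return the L2 norm of each layer's activation for one input."""
    x, mags = image_vec, []
    for w in weights:
        x = np.maximum(x @ w, 0.0)   # ReLU layer
        mags.append(np.linalg.norm(x))
    return mags

images = rng.uniform(0.0, 1.0, size=(10, 256))  # stand-ins for images
mags = np.array([layer_magnitudes(im) for im in images])

# Different "images" evoke different magnitudes at every layer:
print(mags.std(axis=0))
```

In the study's setting, one would substitute real images and a trained network, then correlate each layer's per-image magnitudes with memorability scores, layer by layer.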
As an overview, three types of data are included in this paper: 1) behavioral and neural data collected from two rhesus monkeys that were performing a single-exposure visual memory task; 2) image memorability scores obtained by passing the same images through a model designed to predict image memorability for humans (Khosla et al., 2015); and 3) responses of convolutional neural network models trained to categorize objects and scenes.

Each trial of the monkeys' task involved viewing one image for at least 400 ms and indicating whether it was novel (had never been seen before) or familiar (had been seen exactly once) with an eye movement to one of two response targets. Images were never presented more than twice (once as novel and then as familiar) during the entire training and testing period of the experiment. Trials were initiated by the monkey fixating on a red square (0.25°) at the center of a gray screen, within an invisible square window of ±1.5°, followed by a 200 ms delay before a 4° stimulus appeared. The monkeys had to maintain fixation on the stimulus for 400 ms, at which time the red square turned green (go cue) and the monkey made a saccade to the target indicating that the stimulus was novel or familiar. In monkey 1, response targets appeared at stimulus onset; in monkey 2, response targets appeared at the time of the go cue. In both cases, targets were positioned 8° above or below the stimulus. The association between the target (up vs. down) and the report (novel vs. familiar) was swapped between the two animals. The image remained on the screen until a fixation break was detected. The first image presented in each session was always a novel image. The probability of a trial containing a novel vs. familiar image quickly converged to 50% for each class. Delays between novel and familiar presentations were pseudorandomly selected from a uniform distribution, in powers of two (n-back = 1, 2, 4, 8, 16, 32 and 64 trials, corresponding to mean delays of 4.5 s, 9 s, 18 s, 36 s, 1.2 min, 2.4 min, and 4.8 min, respectively).
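The reported mean delays are consistent with a mean trial duration of about 4.5 s (an inference from the numbers above, not a value stated directly in the text); the correspondence between n-back and delay can be checked in a few lines:

```python
# Mean novel-to-familiar delay scales linearly with n-back,
# assuming a ~4.5 s mean trial duration.
mean_trial_s = 4.5
for n_back in [1, 2, 4, 8, 16, 32, 64]:
    delay_s = n_back * mean_trial_s
    print(f"n-back {n_back:>2}: {delay_s:6.1f} s = {delay_s / 60:.1f} min")
```

For example, n-back = 64 gives 64 x 4.5 s = 288 s = 4.8 min, matching the longest mean delay reported.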

The images used for both training and testing were collected via an automated procedure that downloaded images from the Internet. Images smaller than 96 × 96 pixels were not considered, and eligible images were cropped to be square and resized to 256 × 256 pixels. An algorithm removed duplicate images. The image database was randomized to prevent clustering of images according to the order in which they were downloaded. In both the training and testing phases, all images of the dataset were presented sequentially in a random order (i.e. without any consideration of their content). During the testing phase, 'novel' images were those that each monkey had never encountered in the entire history of training and testing. To determine the degree to which these results depended on images with faces and/or body parts, images were scored by two human observers who were asked to determine whether each image contained one or more faces or body parts of any kind (human, animal or character). Conflicts between the observers were resolved by scrutinizing the images. Only 19% of the images used in these experiments contained faces and/or body parts.
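The size screen, square crop, and resize steps can be sketched as follows. This is a hypothetical re-implementation, not the original pipeline: the resampling method used in the study is not specified, so nearest-neighbour resizing is assumed here for self-containment.

```python
import numpy as np

MIN_SIDE, OUT_SIDE = 96, 256  # thresholds taken from the text

def preprocess(img):
    """Reject small images, center-crop to square, resize to 256 x 256.

    img is an H x W x 3 uint8 array; returns None if the image is
    smaller than 96 pixels on either side (i.e. discarded).
    """
    h, w = img.shape[:2]
    if h < MIN_SIDE or w < MIN_SIDE:
        return None                          # too small: discard
    side = min(h, w)                         # center-crop to a square
    top, left = (h - side) // 2, (w - side) // 2
    square = img[top:top + side, left:left + side]
    idx = np.arange(OUT_SIDE) * side // OUT_SIDE  # nearest-neighbour map
    return square[idx][:, idx]

img = np.zeros((120, 200, 3), dtype=np.uint8)
print(preprocess(img).shape)                 # (256, 256, 3)
print(preprocess(np.zeros((50, 50, 3), dtype=np.uint8)))  # None
```

Duplicate removal and database shuffling would follow as separate passes over the accepted images.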

The activity of neurons in IT was recorded via a single recording chamber in each monkey. Chamber placement was guided by anatomical magnetic resonance images in both monkeys. The region of IT recorded was located on the ventral surface of the brain, over an area that