Fixing the stimulus-as-fixed-effect fallacy in task fMRI

Most functional magnetic resonance imaging (fMRI) experiments record the brain’s responses to samples of stimulus materials (e.g., faces or words). Yet the statistical modeling approaches used in fMRI research universally fail to model stimulus variability in a manner that affords population generalization, meaning that researchers’ conclusions technically apply only to the precise stimuli used in each study, and cannot be generalized to new stimuli. A direct consequence of this stimulus-as-fixed-effect fallacy is that the majority of published fMRI studies have likely overstated the strength of the statistical evidence they report. Here we develop a Bayesian mixed model (the random stimulus model; RSM) that addresses this problem, and apply it to a range of fMRI datasets. Results demonstrate considerable inflation (50-200% in most of the studied datasets) of test statistics obtained from standard “summary statistics”-based approaches relative to the corresponding RSM models. We demonstrate how RSMs can be used to improve parameter estimates, properly control false positive rates, and test novel research hypotheses about stimulus-level variability in human brain responses.


Introduction
Consider two potential titles of a hypothetical neuroimaging paper: (1) "Inferior frontal gyrus responds more strongly to the words 'chair, ' 'house,' and 'tree' than to 'run,' 'pay,' and 'speak'"; and (2) "Inferior frontal gyrus responds more strongly to nouns than to verbs". These two titles may superficially appear to describe exactly the same set of findings, since categories like Noun and Verb are necessarily comprised entirely of individual exemplars like 'chair' and 'speak'. Yet there can be little doubt that most neuroimaging researchers forced to choose between the two titles above would opt for the latter, which makes a far more interesting scientific statement. After all, we typically do not care about individual words like 'chair', except insofar as they exemplify broader populations of items that share similar properties. As the psycholinguist Edmund Coleman observed over 50 years ago, "many studies of verbal behavior have little scientific point if their conclusions have to be restricted to the specific language materials that were used in the experiment" (Coleman, 1964;cf. Clark, 1973). The same is no doubt true of the stimuli used in modern neuroimaging studies.
Choosing between hypothetical titles like those above may seem purely a matter of preference--a researcher simply decides that she cares more about the underlying population than about the individual stimuli, and can then proceed to describe her results as such. But the conceptual move from stimulus-level to populationlevel inference is not automatically justified. It must be explicitly supported by appropriate statistical inference. In studies where a sample of participants respond to a sample of stimuli, as is the case in a vast number of fMRI studies, the correct analysis that allows generalization to both participant and stimulus populations involves fitting a mixed-effects model with crossed random effects of both participants and stimuli (Baayen et al., 2008;Judd et al., 2012). This is not a hypothetical concern: in a review of 100 random task-based fMRI articles extracted from the Neurosynth database (see Supplementary File 1 for details), we found that 63/100 (95% Jeffreys interval = [53%, 72%]) used multiple stimuli in a context where generalization over stimuli was clearly indicated. Yet while virtually every fMRI study conducted over the past 15 years has modeled subject as a random factor (Penny et al., 2003), we are aware of only two published fMRI studies that have discussed the problem of unmodeled stimulus-related variance from a methodological perspective (Bedny et al., 2007;Donnet et al., 2006), and know of no primary fMRI studies that have modeled participants and stimuli as crossed random factors.
The consequences of this stimulus-as-fixed-effect fallacy, as it is called in psycholinguistics (Clark, 1973), are potentially devastating. Strictly speaking, the p-values (or other inferential statistics) reported in the entire fMRI literature to date are valid only for the exact stimuli used in each study. The conclusions cannot be generalized to a broader population of stimuli without risking inflated Type I error (cf. Donnet et al., 2006). Previous work in other domains (and replicated here) demonstrates that this inflation can be dramatic, with the Type 1 error rate frequently exceeding 50% under realistic conditions (Judd et al., 2012;Wickens & Keppel, 1983).
Here we develop a Bayesian mixed model (the random stimulus model; RSM) that directly estimates the degree of stimulus variability in fMRI data and properly adjusts the key parameter estimates to account for uncertainty due to stimulus sampling. We then apply this model to a variety of real fMRI datasets with diverse stimulus samples and experimental designs, comparing the results from the standard statistical model that ignores stimulus variability to the results from the corresponding RSM. Our findings suggest that an unknown but possibly large fraction of published fMRI findings are likely to be false positives driven by unmodeled stimulus-level variability. We demonstrate that the magnitude of the problem can be considerably ameliorated by employing large stimulus samples and/or presenting different subjects with different stimuli --in fact, in the limiting case where every participant receives a completely unique stimulus set, the standard model is the statistically appropriate model. Finally, we show how the stimulus-level parameter estimates produced by RSMs can be used to generate and test novel research hypotheses, opening up a powerful new method for studying the neural substrates of cognition.

fMRI datasets
All analyses used publicly available data obtained from one of three sources. The HCP analyses used task fMRI data from the Human Connectome Project's "100 unrelated subject" release, accessible via the online Connectome Workbench (http://www. humanconnectome.org). These data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. All HCP analyses used the preprocessed data release (https:// db.humanconnectome.org/data/projects/HCP_900). No further processing of the time series was performed prior to region-based averaging and mixed-effects modeling (see below). Methods have been previously described in detail in (Glasser et al., 2013).
The emotion regulation dataset was obtained from OpenfMRI, and is publicly available (accession number, ds000009; https://openfmri.org/dataset/ds000009/). Note that although the full dataset includes n = 24 subjects, only 11 subjects had preprocessed data available. We therefore conducted analyses with the convenience n = 11 sample, since subject sample size was irrelevant for our purposes. Experimental design and preprocessing procedures for this dataset have been previously described in Cohen (2009;http://gradworks.umi.com/34/01/3401764.html).
The IAPS dataset previously used in Chang et al. (2015) was obtained from the NeuroVault whole-brain image repository (Gorgolewski et al., 2015). Images were downloaded via the NeuroVault API from the corresponding image collection (http:// neurovault.org/collections/503/). The dataset contains 30 trial-level estimates for each of 172 participants (30 images in total). On each trial, participants passively viewed either a negative or a neutral IAPS image (15 of each). All methods have been previously described in Chang et al. (2015).

Statistical modeling
The standard model. Consider a hypothetical fMRI experiment in which participants view 20 stimuli, half belonging to one stimulus category and half to another (as in Figure 1), and we are interested in the difference in neural response between these two categories. Let Y it be the neural response of the ith subject at the tth time point in a particular voxel or region of interest (ROI), with Figure 1. Idealized data from the standard model and the random stimulus model (RSM) for a hypothetical experiment. Stimuli belonging to one of two conditions (nouns in red, verbs in blue) are presented in alternating blocks, with the same stimuli presented in the same order to each participant. Both models incorporate subject-level variability in the magnitude of the category difference. For example, Participant 2 shows a small category difference while Participant 3 shows a large category difference. In the standard model (A), neural responses are assumed to be equal in magnitude for all stimuli in a category. The RSM (B) relaxes this assumption and estimates a separate response for each stimulus. For example, the noun "desk" elicits consistently high responses for all participants, while the noun "spoon" elicits consistently low responses for all participants.
all preprocessing already carried out and with each participant's data separately standardized. The standard statistical model used to analyze such data is: where X 1 and X 2 are the regressors representing the idealized neural responses toward the two categories of stimuli; β 0 is the fixed intercept; β 1 and β 2 are the fixed effects of the two neural regressors; p 1 and p 2 are normally distributed participant effects of the neural regressors, representing stable subject-to-subject variability in the degree of neural response toward both stimulus types; and the e terms are normally distributed, observation-level errors with an AR(2) covariance structure. The regressors X 1 and X 2 are formed by summing over the idealized neural responses toward the individual stimuli comprising each stimulus category, 20 10 1 2 1 11 and .
where x ijt is the idealized neural response of the ith participant for the jth stimulus at time t. These idealized neural responses are based on convolving a stimulus presentation sequence with a hemodynamic response function (Poldrack et al., 2011).

Random stimulus model (RSM).
The standard model posits that the stimulus-level regressors are all identical in amplitude, differing only in their presentation times ( Figure 1A), a dubious assumption in most fMRI studies. In the RSM, we relax this assumption and allow the stimulus-level regressors to have distributions of amplitudes that are to be estimated from the data ( Figure 1B); these amplitudes are common over subjects, but vary randomly per stimuli. To achieve this, we add a set of terms s j x ijt to the model, where the s j are normally distributed stimulus effects, representing stable stimulus-to-stimulus variability in the strength of the neural response. The resulting model is: To exert a non-negligible impact on our results, these specification options would need to have very different impacts on the standard model and RSM (otherwise the extensions would simply lead to the test statistics from both models increasing or decreasing more or less in unison, leaving their relative differences essentially unchanged). We are aware of no a priori reasons to expect this to be the case for any of the methodological procedures employed with any frequency in the literature, and reiterate that comparably large decreases in test statistics have been repeatedly observed in other domains of psychology when including random stimulus effects (Judd et al., 2012;Wolsiefer et al., 2016).

Simulations
We conducted an extensive series of simulations in order to validate and to better understand the properties of our proposed RSM. Our first goal was to verify that the RSM could adequately recover true parameter values. Our second goal was to identify the conditions under which using a RSM produces the greatest attenuation of standardized effects compared to the standard model. In orthogonal, ANOVA-like designs where the appropriate RSM can be fit in standard mixed modeling software, it can be shown that the standardized effect for the standard model that ignores stimulus variability will be inflated by a factor of roughly E mP nS E mP S + + + + , where n is the number of participants, m is the number of stimuli, and E, P, and S are, respectively, the error variance, participant variance, and stimulus variance (the exact expression depends on the experimental design). While we cannot safely assume that the more complicated fMRI RSM will follow a similar inflation factor, this does give us several hypotheses about the qualitative conditions under which we should expect the worst inflation in fMRI data. Specifically, the degree of inflation should increase with participant sample size, decrease with stimulus sample size, and increase with stimulus variability. In Appendix 1 we describe the results of our simulations in detail. Here we summarize the basic structure of the simulations and their results.
In each run of the simulation, we generated data according to the RSM for a block-design experiment involving participants responding to stimuli nested in two stimulus categories. The test of interest in these simulated experiments is the difference in the fixed regression coefficients for the two stimulus categories (i.e., whether there is greater activation for one stimulus category than for the other). We varied three primary factors in our simulations: the participant sample size (n = 16, 32, or 64), the stimulus sample size (m = 16, 32, or 64), and the degree of random stimulus variability (zero, moderate, or high). Note that when the random stimulus effects have zero variance, the RSM is statistically equivalent to the standard model. We included this condition in order to investigate the performance of the RSM when the standard model is the correct model. For each simulated experiment, we fit four statistical models: the standard model, the RSM, the standard SPM-style "summary statistics" model, and a fourth model that we call the Fixed Stimulus Model, which we describe in Supplementary File 1 Here we focus on comparing the performance of the standard model and RSM (though, in practice, the three non-RSM models all display essentially indistinguishable behavior across all simulations).

Literature review
For our survey of the literature using task-based fMRI, we randomly selected task fMRI papers, without replacement and with uniform sampling probability, until 100 experiments were obtained (each paper can contain 1 or more experiments). Four of the papers we sampled described two experiments, one paper described three experiments, and the rest described a single experiment, so that the 100 experiments we sampled came from 94 unique papers. We coded each experiment for (a) whether the stimuli were crossed with participants or nested in participants, (b) the type of stimuli used, (c) the total number of stimuli, (d) the number of stimulus categories, and (e) whether the study was eligible to have applied a RSM to the data obtained from the study. Generally speaking, experiments were deemed eligible to have applied a RSM if (i) the experiment used more stimuli than stimulus categories (so that the individual stimulus effects are statistically identifiable), and (ii) the sampled stimuli could not be considered to fully exhaust the population of stimuli over which generalization was intended. In the handful of cases where it was not totally clear from the text whether a RSM could have been applied, we decided to err on the conservative side and deem the study ineligible. A spreadsheet with the detailed study-level results of our survey can be found at https://github.com/PsychoinformaticsLab/nipymc (Yarkoni & Westfall, 2016). We ultimately found that 63/100 of the experiments (95% Jeffreys interval = [53%, 72%]) were eligible to have applied a RSM, and thus the published test statistics for these experiments are likely inflated relative to the more appropriate RSM test statistics.

Simulations
For each contrast of interest, we define the standardized effect z = µ/σ, where µ and σ are the mean and standard deviation, respectively, of the posterior samples for the associated parameter estimate (cf. Kruschke, 2013). When the true data generating process contained zero stimulus variability, the standard model and RSM yielded very similar standardized effects for the stimulus category difference, for all participant and stimulus sample sizes. The exception was that when the stimulus sample size was small (m = 16), the standardized effects from the RSM were slightly attenuated (by about 18%) compared to the standard model standardized effects. However, this attenuation disappeared with increasing stimulus sample size, so that at m = 32 and m = 64, the standardized effects from the two models were essentially identical, as should be the case given the lack of true stimulus variability. This provides evidence that the reduced standardized effects observed in our reanalyses of real datasets (most of which were based on a sample of 100 participants) are not simply the result of the RSM always yielding lower standardized effects. Instead, the RSM tends to yield lower standardized effects when they should in fact be lower, namely, when there is random stimulus variability in the data that is ignored by the standard model.
When the true data generating process contained moderate stimulus variability, the RSM yielded consistently lower standardized effects than the standard model. This reduction was exacerbated when the stimulus sample size was small (m = 16) and attenuated somewhat when the stimulus sample size was large (m = 64). The opposite pattern held for the participant sample size: the reduction in the RSM standardized effect was largest when the participant sample size was large (n = 64) and smallest when the participant sample size was small (n = 16). These patterns are consistent with what has been observed in previous behavioral work (Judd et al., 2012). In the best case (n = 16 and m = 64) the RSM standardized effects were still reduced by an average of 17% compared to the standard model standardized effects. In the worst case (n = 64 and m = 16), the RSM standardized effects were reduced by an average of 67%. Finally, when the true data generating process contained high stimulus variability, the same qualitative patterns held, but the reduction in standardized effects was even greater, ranging from 41% in the best case to 81% in the worst case. Importantly, only the RSM correctly estimated the variability in the average condition difference across simulated datasets; the standard errors from both the standard model and other simpler (but incorrect) approximations consistently underestimated the true variance of the condition differences across simulated datasets.
Does the amygdala preferentially respond to emotional faces? To illustrate the scope and magnitude of the stimulus-as-fixedeffects fallacy in fMRI, we first focus our attention on one of the most well-established neuroimaging findings: the role of the amygdala in affective processing. The amygdala has been shown in hundreds of fMRI studies to increase activation in response to biologically salient stimuli -most notably faces -and to show a particular strong response to negative affect-provoking stimuli, such as fearful or angry face expressions (Breiter et al., 1996;Morris et al., 1996). However, the number of stimuli used in studies demonstrating this effect is often small -in many cases, fewer than 10 stimuli per experimental condition. As we note above, this is precisely the situation in which inflated statistical significance is expected.
To quantify the effects of modeling stimulus as a random factor on the amygdala response to emotionally salient stimuli, we used data (n = 111 subjects) from the Human Connectome Project (HCP; Barch et al., 2013;Van Essen et al., 2013). In the HCP Emotion Processing Task, adapted from an earlier task used by Hariri and colleagues (Hariri et al., 2002), participants view blocks of faces (20 in total) or geometric shapes (3 in total), and make an unrelated perceptual matching judgment. The face stimuli have either fearful or angry facial expressions (10 per condition). We first analyzed the data using the standard model, where subjects (but not stimuli) were modeled as random effects. Consistent with previous reports on this dataset (Barch et al., 2013) and the broader literature, when analyzing the data under the standard model, we found a robust increase in amygdala activation for face stimuli relative to shape stimuli (z = 26), and a smaller but still notable increase for angry faces relative to fearful faces (z = 3.3). These results are illustrated in Figure 2 (top).
As noted above, the analysis under the standard model fails to account for the uncertainty inherent in the fact that the effect of stimulus category is based on a small and highly variable stimulus sample. When we modeled the data using the RSM, we found that the standardized effect for the face vs. shape contrast was reduced by 89% (from z = 26 to z = 2.8) compared to the standard model, and the standardized effect for the anger vs. fear contrast was reduced by 78% (from z = 3.3 to z = 0.7). Depending on one's statistical approach and assumptions (e.g., whether and how one corrects for multiple comparisons), one could reasonably claim that there is still a non-negligible effect in the former case; however, in the latter case it is unlikely that any formal model comparison or inferential test would provide a basis for concluding that the amygdala responds differentially to angry vs. fearful faces. Most importantly for our purposes, it should be clear that the reduction in these effects is large enough to change how researchers interpret the results. Thus, simply accounting for natural variability in the sampled stimuli was sufficient to turn a seemingly robust and possibly scientifically intriguing finding into a much less remarkable result that may not merit further consideration. The impact of explicitly modeling the face stimulus as a random effect is further illustrated in Figure 3.
Whole-brain analysis reveals differential stimulus sensitivity Next, we extended the analysis to the rest of the brain, fitting the same RSM in 100 different regions-of-interest (ROIs; we used an ROI rather than voxel-wise approach, due to the computational demands of the RSM). As Figure 4 illustrates, the impact of modeling stimulus as a random factor were no less dramatic than in the amygdala for most brain regions. For the two standardized effects (i.e., face vs. shape contrast and anger vs. fear contrast), the median ratios of the standard model z-statistic over the RSM z-statistic were 3.3 and 5.96, respectively, indicating reductions of 70% and 83% when random stimulus effects were added. When thresholding brain activity at even a relatively liberal threshold of z = 3, only 2 out of 100 regions (compared to 59 out of 100 in the classical analysis) remained statistically significant, and no region showed a significant difference between angry and fearful faces (as compared to 27 regions in the classical analysis).
Intriguingly, the fusiform face area (FFA; Kanwisher et al., 1997), which showed the most robust face-related increase in the standard model (z = 31), failed to show a significant difference . Under the standard model, there is clear separation of the estimated amygdala responses toward all three stimulus categories, with anger faces evoking a somewhat stronger response than fear faces, and both face stimuli evoking a much stronger response than the simple shape stimuli. Under the RSM, the means of the regression coefficients are about the same, but they are estimated with far more uncertainty. The result is that while the face vs. shape contrast is still clearly discernible, the anger vs. fear contrast is no longer distinguishable from sampling/measurement error (Figure 2, bottom). This is represented as subject×stimulus matrices, where each row (111 in total) represents a unique subject and each column (23 in total) represents a unique stimulus. Each entry of the matrix gives a (posterior mean) regression coefficient corresponding to the model's prediction for that subject's amygdala response toward that stimulus. Notice that the standard model assumes that a subject has the same amygdala response toward all stimuli in a given category -in other words, it assumes no random stimulus variability. While this may not be an entirely unreasonable assumption for the three relatively impoverished shape stimuli (a circle and two ovals), it is a patently absurd assumption for the faces, as a cursory visual inspection of the stimuli makes clear (panel E). When we add random stimulus effects to the standard model, resulting in the RSM, we find that random stimulus variability (evident in the within-category variance of the column means) is in fact one of the chief sources of variation in the data. The images in panel E are sorted within each emotion condition by their posterior sample means (panels C and D), which are printed below the stimulus labels. (Faces images from Human Connectome Task fMRI battery, used with permission. Copyright 2012.) . Whole-brain results for five contrasts from four different datasets when modeled with either a standard approach or a random stimulus model (RSM). Each row displays results for a different dataset and/or contrast (see main text for details). Left column: scatter plot displaying the relationship between region-of-interest (ROI)-level standardized effects from the standard model (y-axis) and RSM (x-axis). Each point represents a single ROI from a 100-region whole-brain clustering. Middle and right columns: axial slices displaying ROIlevel standardized effects (z statistics, defined as in the main text) from the standard and RSM analyses, respectively. Maps are thresholded at | z | > 3.3 -comparable to using p < .001, uncorrected, in a traditional frequentist analysis -in order to illustrate the significant drop in standardized effects in most datasets when including random stimulus effects.
between faces and shapes in the RSM. This counterintuitive result can be understood by considering the large amount of stimulus-level variability in FFA responses to faces ( Figure 5). Since the face vs. shape standardized effect in the RSM depends on the ratio between the between-condition and within-condition (i.e., stimulus-level) variances, a brain region that is extremely sensitive to different stimuli of the same modality may counterintuitively fail to show a consistent difference between faces and shapes precisely because it is extremely sensitive (but differentially so) to individual faces.
Although the resulting reduction in the standardized effect may come as an unpleasant surprise, there is an important silver lining: the ability to quantify the variance in brain activity specifically related to individual stimuli provides a powerful tool for identifying brain regions sensitive to different classes of stimuli. To illustrate, Figure 5 displays the stimulus-level variability captured by the model in each brain region. Not surprisingly, stimulus-related variability was greatest in visual cortical regions; however, a number of other brain regions also showed considerable stimulus sensitivity in response to faces, including motor cortex, anterior insula (particularly for anger faces), and portions of anterior PFC. As one might intuitively expect, the variability in responses to the 3 simple geometric shapes was muted in comparison to the response to faces.

Generalization to other datasets
To ensure that the HCP Emotion Task was not an outlier, and that our conclusions apply more generally, we repeated our random stimulus analyses on several other datasets. We fit RSM models to two other HCP functional tasks -the Social Cognition Task and the Working Memory Task -as well as a non-HCP emotion processing dataset (Chang et al., 2015). These tasks differed widely in experimental design, stimulus modality (video clips, images, and audio narratives), number of stimuli (10 in the social cognition task, 96 in the WM task), and putative psychological processes. Nevertheless, when contrasting the RSM with the classical model, standardized effects for critical comparisons were reduced considerably in all datasets (the median reductions across all 100 ROIs ranged from 12% to 83% in the datasets we examined; Figure 4 and Figure 6 In general, the rank-order stability of regions was high across the two analyses in terms of the standardized effects (mean r = 0.77). Thus, the global pattern of activity across the whole brain was, at least in the tested datasets, relatively conserved in the RSM, and the drop in standardized effects largely reflected the increased variance of the fixed-effect estimates of the experimental conditions (cf. Figure 2 The critical role of stimulus sample size Why were the critical standardized effects from the RSM model so small compared to the standard analysis, despite the relatively large participant samples used in these analyses? The likely culprits here are the relatively small stimulus sets used in these experiments. As far as the RSM is concerned, the stimuli used in a study are just as important as the human subjects---both ultimately represent sources of random variation in the data that we would like to generalize over when estimating brain responses to different experimental conditions. Most researchers would question the wisdom of conducting an fMRI study that compared, say, 5 highly variable subjects in one condition to 5 highly variable subjects in another condition, yet many researchers routinely make essentially the same mistake when sampling stimuli. The problem is exacerbated in many cases, including in many of the present datasets, when stimuli are presented in exactly the same order to all participantsan approach that conflates order effects with stimulus effects, necessarily inflating the variance seemingly accounted for by the latter. The important silver lining to this otherwise grim analysis is that standardized effect inflation in the classical model is related in predictable ways to stimulus sample size (Judd et al., 2012;Wickens & Keppel, 1983). Thus, it should be possible to minimize the gap between the standard model and the RSM by increasing the number of stimuli in one's experiment. To test this prediction empirically, we used two additional datasets (Figure 6 First, we applied the RSM to the HCP language task, which included a Math condition in which participants provided forced-choice answers to auditorily presented mental arithmetic problems. In contrast to the other HCP tasks, stimuli in the Math condition are adaptively chosen from a large set of over 7,000 candidate mental arithmetic problems based on each participant's in-task performance. We consequently predicted that RSM estimates should be very close to standard model estimates for this experimental condition. This prediction was confirmed ( Figure 6B) from the two models were very similar across the brain when comparing the Math Figure 6. Whole-brain results for three datasets modeled with either a standard approach or a random stimulus model. The interpretation is the same as in Figure 4 condition to the implicit resting baseline (mean |z| = 8.47 vs. 8.12; 4% reduction). This consistency across models contrasted sharply with the large reduction observed for the Language condition (mean |z| = 4.83 vs. 2.67; 45% reduction), which presented the same 6 stimuli to nearly all subjects. As a consequence of the loss of precision in the Language condition, standardized effects for the Math vs. Language contrast also showed a considerable decline (mean |z| = 13.75 vs. 5.96; 57% reduction). Figure 6C displays estimates from the standard model and RSM for a sample region (V5/MT in visual cortex), clearly illustrating the selective increase in uncertainty in the story condition.
Second, we analyzed an unpublished emotion regulation dataset (Cohen, 2009; http://gradworks.umi.com/34/01/3401764. html), publicly available from the OpenfMRI.org repository (Poldrack & Gorgolewski, 2015), in which 11 participants passively viewed either negative or neutral pictures. Importantly, each participant viewed 60 different stimuli (from a total set of 120). Theoretically, this "partially-crossed" design should considerably reduce the discrepancy between the RSM and classical model (Westfall et al., 2014), and this is precisely what we found: standardized effects from the two models were very similar across the brain when comparing passive viewing of neutral vs. negative images ( Figure 6A: mean |z| = 2.77 vs. 2.30 for standard model vs. RSM, respectively; median ratio = 1.18). These results confirm that there is indeed a predictable and robust relationship between the stimulus sampling scheme used in an fMRI experiment and the degree of standardized effect inflation one can expect to observe when using the standard (incorrect) model.

Stimulus-level parameter estimates as a tool for exploration
While the primary reason to include random stimulus effects in one's model is to ensure that statistical inferences can be safely generalized to new stimuli, an important secondary benefit is that this approach facilitates data exploration and hypothesis generation. The inclusion of a separate parameter for each stimulus allows one to estimate the unique pattern of whole-brain activation associated with each stimulus. Inspection of these estimates may help identify novel features of the data or design that can be subsequently tested formally.
To illustrate, consider the parameter estimates displayed for individual faces in Figure 3E. Qualitatively, there appears to be a potential trend for black faces to elicit larger amygdala responses than white faces. To formally test this hypothesis, we obtained race judgments of the 20 faces from 3 lab members blind to our hypothesis and with no knowledge of the parameter estimates. When the RSM model was recomputed with an additional fixed effect coding stimulus race, the resulting posterior estimate was suggestive of a weak race effect (z = 1.55; the other parameter estimates were all virtually unchanged). Of course, this particular analysis is circular, since the hypothesis was generated and tested using the same data. The important point, however, is that even this cursory visual inspection of the stimulus-level estimates was sufficient to suggest a scientifically interesting hypothesis that could be readily tested using independent data. Indeed, a number of previous studies (Lieberman et al., 2005) have reported race-related amygdala activation patterns consistent with our conjecture (though none included random stimulus effects, and hence the existing evidence for race effects in the amygdala is itself very likely overstated).

Discussion
We have shown that the universal failure of fMRI studies to include random stimulus effects in statistical models can have a substantial, and almost invariably deleterious, effect on reported results. At root, the problem lies in a mismatch between researcher intent and statistical implementation: neuroimaging researchers intend for their statistical conclusions to generalize across populations of stimuli similar, but not identical to, the ones they tested; however, conventional statistical procedures only allow conclusions to be drawn with respect to the exact stimuli used in each study. Our literature survey suggests that the ramifications of this discrepancy between intent and praxis are likely to be very large: in a survey of 100 articles, we found that use of a RSM was clearly indicated in over 60% of cases. This is a conservative lower bound of the true extent of the problem, as many of the remaining studies could not have used a RSM due to otherwise avoidable limitations of their experimental design (e.g., using only a single stimulus per condition).
Given that the RSM standardized effects we obtained were frequently reduced by 50% or more relative to those obtained using a standard analysis, one implication of our findings is that a large fraction of the results reported in the fMRI literature are likely to be severely inflated. Moreover, when the true mean stimulus effect is zero, variation in stimuli will generally exert some non-zero influence on brain activity (as was evident in all of the datasets we tested), which in turn will inflate Type I error. In simulations, which we describe in detail in Supplementary File 1 we found that the Type 1 error rate for the standard model in the presence of unmodeled stimulus variability, using a nominal decision threshold of α = .001, ranged from about .01 to about .4, depending on the sample sizes and the level of stimulus variability. At a threshold of α = .05 (as would be common in many hypothesis-driven ROI-level tests) the Type 1 error rate was as high as .65, still under relatively conservative assumptions (e.g., a maximum participant sample size of 64 and a minimum stimulus sample size of 16).
Although its consequences are severe, the statistical argument that we've presented -based on the idea that the experimental stimuli are typically a random factor and not a fixed factor -is conceptually subtle. While the fixed vs. random distinction is not always defined consistently (Gelman & Hill, 2007, p. 245), the classical definition of a random factor from the literature on analysis of variance is that the levels of the factor that appeared in the experiment (i.e., the stimuli that were used) do not fully exhaust the theoretical population of levels that might have been used. Importantly, this definition does not imply that the stimuli were selected haphazardly. Indeed, typically experimenters take great care in selecting an appropriate stimulus set to be used in the study. Rather, to say that the stimuli are "random" is simply to say that there are, in principle, other possible stimuli that could have served the experimenter's purposes just as well as those that were in fact used. Using this definition, our literature survey suggests that a RSM is the most statistically appropriate model to use in a majority of task fMRI studies. But exceptions do exist for certain special cases. Below we list a few necessary conditions for fitting a RSM, some more conceptual and some purely statistical.
First, the stimuli in question must be inherently discrete entities, such as distinct words or photographs. An example of stimuli that, on these grounds, would not be modeled as random would be the varying doses of some drug administered to the subjects. While the doses are nominally discrete in that only a finite number of the range of possible doses are administered, these doses represent points on a well-defined continuum, and would simply be treated as a fixed predictor or covariate in the analysis. Second, and as mentioned above, a RSM is only indicated when the stimuli used in the study do not fully exhaust the theoretical population of stimuli that might have been used. An example of an experiment that does not satisfy this condition would be a study of brain responses toward singledigit numbers, in which the study employs all possible single-digit numbers. Third, there must be at least some degree of overlap in the stimulus sets that each subject receives. In what is overwhelmingly the most common case, every subject receives the same stimulus set, and a RSM can be estimated relatively easily. Less frequent, but not uncommon, are experiments where subjects receive different subsets of stimuli from a larger stimulus pool, but there is some overlap in the stimulus sets, such that at least some of the stimuli receive responses from more than one subject. An example would be the Math stimuli in the HCP Language task. A RSM would generally be appropriate here as well. In the least common case (6/100 of the experiments in our survey), every subject receives a completely unique stimulus set, with no overlap between sets, such that stimuli are strictly nested in participants. In this case the standard model would be a statistically appropriate model, and in fact subjects would need to respond multiple times to each stimulus for a RSM to be statistically identifiable at all.
We are aware of only two previous papers that have discussed the issue of random stimulus variability in fMRI (Bedny et al., 2007;Donnet et al., 2006). While conceptually similar, these two papers take different approaches than the one described here, and it is worth noting the differences. Donnet et al. (2006) describe a model with random stimulus effects that are different for every subject, so that the corresponding variance component is more akin to a subject-by-stimulus interaction variance component than to the stimulus variance components incorporated in our models. They also do not discuss inference on the fixed effects of activation magnitude. Bedny et al. (2007) do discuss inference on the fixed effects, but they focus primarily on conducting a separate subjectwise analysis (i.e., what we have called the standard model, which ignores stimulus variance) and stimulus-wise analysis (i.e., the conceptual complement of the subject-wise analysis, which includes stimulus variance but ignores participant variance). This approach is common in psycholinguistics, and it is certainly a step up from running only the standard or subject-wise model, but it is not equivalent to the full, correct model with crossed random effects of participants and stimuli. In particular, it does not succeed in maintaining the nominal Type 1 error rate (Raaijmakers, 2003;Raaijmakers et al., 1999).
What can researchers do to address the stimulus-as-fixed-effect fallacy? Broadly speaking, there are two possible strategies. The best practice approach to address the issue is to explicitly include random effects in one's model for every factor a researcher intends to generalize over. In the present study, we used MCMC sampling to fit our mixed-effects models; however, other approaches (based on maximum likelihood estimation, variational inference, etc.) are also available (Bates et al., 2015). The primary downside of an estimation-based solution is that it is computationally intensive and may be technically demanding. At present, no major fMRI analysis package supports RSMs of the kind we employ here, limiting the ability of most researchers to produce correct inferences. While we have open-sourced the NiPyMC Python package we used to fit the models reported here (http://github.com/Psych-oinformaticsLab/nipymc; Yarkoni & Westfall, 2016), this should be viewed as a provisional (and not particularly scalable) solution until more robust and widely-used packages such as FSL, SPM, or AFNI introduce support for random stimulus effects.
Alternatively, a less effective, but much simpler approach to the problem, is to use as large a stimulus sample as is practically feasible. In previous work, we have shown that the number of stimuli can impose a hard cap on statistical power in the RSM: when stimulus samples are very small, it may be impossible to obtain statistically significant estimates of the fixed effects, no matter how many thousands of subjects one samples (Westfall et al., 2014). This is evident in the present findings, where standardized effects from fMRI experiments with only a few stimuli (such as the HCP Emotion and Social Cognition tasks) showed precipitous drops in the RSM, whereas those generated by designs with many stimuli are often negligible (e.g., Figure 6 The primary (and significant) downside of a stimulus-maximizing heuristic is that it is only a heuristic--there is no guarantee that the resulting standardized effect will closely approximate the one that would have been obtained through the explicit inclusion of random stimulus effects. In particular, if the degree of random stimulus variability is large, then a huge number of stimuli may be required before the two sets of standardized effects closely converge. Nevertheless, in the absence of analysis tools capable of correctly modeling multi-stimulus designs, we strongly encourage researchers to always include as many stimuli as possible in their designs. Importantly, unlike increases in participant sample size, adding stimuli rarely incurs any additional cost. Researchers can usually easily increase the number of stimuli by either (a) eschewing repeated presentation of a few stimuli in favor of single presentation of many different stimuli, or (b) using a "partially crossed" design where each participant responds to a different subset of stimuli (Westfall et al., 2014). These approaches allow one to enjoy the statistical power benefits of a large stimulus sample without increasing data collection requirements. In the present article, Westfall and colleagues point out a major oversight in nearly all existing GLM fMRI analyses that attempt to make a generalization about stimuli -namely the failure to treat stimuli as random effects. Far from being a technical shortcoming with little practical consequence, the authors demonstrate that the failure to account for stimulus variance can markedly (and spuriously) inflate inferential statistics, with the worst inflation occurring with large numbers of participants and small numbers of unique stimuli. The authors present an analytic solution to this problem, as well as recommending a number of design precepts that can ameliorate the problem even while still using the standard fixed-effect analysis approach. This is an important issue, which is particularly timely given growing concerns in the field about reproducibility, and the case the authors make is compelling. However, the manuscript left us with a number of outstanding questions.

Source code for
We think it would be helpful to the readers to point out that most fMRI studies are not looking to make generalizations about classes of stimuli. For example, a memory researcher will typically use a fixed set of stimuli (e.g., faces or words) for study. But the stimuli are typically counter-balanced or randomly assigned across experimental conditions (e.g., studied items or non-studied items) and across subjects, in which case, modeling random effects for the stimuli is not necessary. It's only in limited cases in which generalizations about stimuli are being claimed (e.g., greater activity for nouns compared to verbs) that it becomes necessary. The authors do mention in the introduction that a "review of 100 random task-based fMRI article extracted from the Neurosynth database . . . used multiple stimuli in a context where generalizations over stimuli was clearly indicated." [the authors indicated that details about this procedure could be found in the supplementary materials, but we could not find them] However, this strikes us as not representative of published fMRI studies. In any case, we think the authors should make it clear that the concerns they raise only apply to those studies that make generalizations about classes of stimuli.
The biggest issue that was not addressed in the current manuscript is how the stimulus-as-fixed-effect issue impacts analytic approaches other than the standard mass univariate GLM. (Indeed, the authors occasionally seem to ignore the existence of other approaches, as when they claim there is a "universal failure of fMRI studies" to acknowledge stimulus variability.) For example, it would seem that many applications of representational similarity analysis (RSA; Kriegeskorte, Mur, & Bandettini, 2008 ) are not affected by this issue. Relatedly, it is not uncommon for researchers to include various parametric regressors that may vary across stimuli within the same class; does such an approach do anything to address the stimulus-as-fixed-effect issue, and conversely, is the authors' proposed RSM approach compatible with the use of parametric regressors? 1 compatible with the use of parametric regressors?
Similarly, a recent report by Abdulrahman and Henson addressed the question of variability, not just at the stimulus level, but at the trial level. Indeed, although that paper was primarily concerned with multivariate approaches, they somewhat surprisingly found that trial-level methods (namely, the "least-squares separate" and "least-squares all" methods of Mumford, Turner, Ashby, & Poldrack, 2012 ) could sometimes outperform traditional approaches even in the context of the GLM. Westfall and colleagues should explain whether (and if so, how) the stimulus-as-fixed-effect issue impacts alternative analysis approaches like RSA and the use of parametric regressors, and whether they would expect trial-level methods to play any role in ameliorating (or exacerbating) the issue.
There were also some details which were glossed over or not reported, which would be very useful to readers. Some of these might require additional work, but others should be readily available from what the authors have done already. For example, the discussion of the simulation results was very cursory. Likewise, although the authors allude to the computational demands of their RSM approach, it would be useful to quantify exactly what these demands are, for example by providing an estimate of how long some fixed number of analyses, involving some number of stimuli or number of TRs, takes on a particular system with particular specs. Among information that might require additional analyses but would be useful to researchers, the most pressing need relates to defining what "completely unique stimulus set" means-that is, is there some parametric relationship between the degree to which each participant's stimulus set is unique and how suboptimal the standard fixed-effect approach is? E.g., in a study of 30 participants, is it unique enough if each participant sees a random subset of 99% of the stimuli in some larger corpus? 75%? 50%?
The last concern has to do with being careful in defining the scope of the proposed RSM approach. While the entire paper reflects the high degree of importance the authors place on drawing valid inferences, we nonetheless felt that the section "Stimulus-level parameter estimates as a tool for explanation" should emphasize the need for caution to an even greater degree than it already does. This is a particularly exciting possible application of the approach, but it is also one that seems particularly prone to abuse, especially by researchers who are looking for an upside to counteract the "downside" of smaller effect sizes that this approach will bring. The authors are obviously statistically sophisticated enough that they leave implicit the dangers associated with this sort of exploratory approach; they should be explicit.

Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes No competing interests were disclosed.

Competing Interests:
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above. This is a very well-written and easy to follow paper, convincingly demonstrating the fallacy of not modelling stimulus variability. The paper conveys an important message, which has been completely overlooked in the neuroimaging field to date and is likely to become influential over time, especially as practical guidelines are provided and the code has been made publicly available.
We have a number of queries and suggestions detailed below:

Introduction
There appear to be two related issues in the paper: one is not modelling structured noise in the data, inflating standardised effects; this is the case when comparing the standard model which doesn't model stimulus variability to the RSM and other models. This may be wrong even in the situation where the researcher was agnostic as to whether the results apply in general to categories of stimuli or just the stimuli tested. The second issue is fixed versus random effect modelling of the stimuli which is presented as the main focus of the paper. If this is the case, it might be worth making the distinction clearer.
It would be beneficial to introduce to the reader that you will focus on a particular example in your result sections, e.g., amygdala response to emotional faces; it was slightly surprising to us when we approached the Results section.

Methods
In the simulations, is stimulus-to-stimulus variability kept constant across subjects? The ground truth of In the simulations, is stimulus-to-stimulus variability kept constant across subjects? The ground truth of this isn't known, but is this plausible? What is the effect of relaxing this, is it similar to increasing the number of stimuli tested? How does it interact with the number of participants and inflation of standardised effects? It would be helpful to justify this and consider its implications in the discussion. P.2 | Section fMRI dataset: Please list the tasks and specific contrasts you chose from the HCP dataset to be included in your analyses as it seems that you focused on a sub-selection (Figure 4 and Figure 6). While all information are mentioned in the text, it might be clearer to create a table summarizing sample size and number of stimuli for each dataset (not just HCP) or include this information in the two figures, n=Y, and m=X. P. 5| Section Simulations: You mention that you fitted four different models, which are also explained in detail in the Supplementary File you provide. However, it would be helpful to state in the main text that the standard model, the fixed stimulus model, and the random stimulus model are implemented as Bayesian models to facilitate comparison (as differences in results can be solely linked to the inclusion of no, fixed, or random stimulus effects in the model), while the SPM model is used to compare results against the main software in the field (but here the implementations differs).

Results
Some explicit details about the computational demand for model fitting would be interesting to provide.

Discussion
The empirical studies used for comparison had sample sizes N=100, N=72, and N=11. The study with the small sample size (N=11) however included many different stimuli types (M=60). In this respect, no study was included that had a sample size commonly found in the field (N=20-30) while also having a relatively small number of stimuli as also common in the field as outlined in your Introduction. This kind of study though would have been the best when making inferences about a "large fraction of the results reported in the fMRI literature". Considering that the problem of reduction in standardized effects is the highest for studies with large sample sizes, the conclusion that a "large fraction of the results reported in the fMRI literature are likely to be severely inflated" (p.11) may be too strong given the empirical studies employed for this study.
You may want to discuss that when the data contained zero stimulus variability, the RSM model yielded attenuated effects compared to the standard model (by 18%) for small stimulus sample size. While this may not be of practical relevance (as zero stimulus variability is highly unlikely), it would be important to reflect on this from a statistical point of view.
Are there other similar effects that might be affecting the FMRI literature? For example, what about stimulus presentation order? Both cognitive and neural processes likely have a large systematic temporal effect. Despite having large influence on the BOLD signal, this is normally not modelled, and so could also affect standardised effect sizes and interpretation. It may not be comparable, but it might be worth speculating on other potential problems, which could have similar effect, such as temporal order, in the discussion.
Currently, you take a statistician's point of view as the main focus of the paper. However, one potential implication of the work that could be discussed is that stimuli selection based on classic psychology theory may be not completely accurate. For example, while it could be that psychology informed stimuli selection would group stimulus A and stimulus B in the same category, they may not be represented in the brain that way; the neural system may be more sensitive to stimuli perturbations than could be the brain that way; the neural system may be more sensitive to stimuli perturbations than could be assessed using 'traditional' psychological tools (derived from questionnaire or behavioural data). Equally, different representations of stimuli in the brain could result in similar behavioural responses. This points towards the need for a neurally-informed cognitive taxonomy rather than relying purely on psychological definitions, that were developed largely blind to the underlying functional organisation of the brain.  Figure 2).

Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes No competing interests were disclosed.

Competing Interests:
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above. In this paper, the authors claim that the conclusions of several studies would be modified if a statistical models taking into account the variability of the presented stimuli had been considered.
Although the problem is interesting, I am not completely convinced by the conclusions and the statistical tools used to assess the results.
First of all, the authors argue that, due to the complexity of the new model, the authors can not use the standard numerical tools to perform the statistical inference (R or SAS) and so will prefer a Bayesian inference, making this choice quite opportunist. However, besides the fact that they use a Bayesian inference (including prior distribution), they base their conclusions on frequentist arguments (comparing test statistics). To my point of view, this is quite confusing. If a Bayesian framework is considered, then the hypothesis testings should be perform using Bayes Factor or any other tools taking into account the prior distribution.
Moreover, when proposing a new model, the first thing to do is to compare the old model and the new model for any given dataset (using statistical tools such as hypothesis testing, in a frequentist or Bayesian framework). I could not see such a test in the paper. The mixed effects model involves much more parameters for the same amount of observations. Before deriving conclusions on the new model, it should be interesting to put in competition the fixed effects model and the mixed effects models, to be sure that the use of the more complex model is supported by the data.
Besides, in a Bayesian context, the greater posterior variance observed on the regression parameters was completely expected (Figure 2) (more parameters for the same quantity of information leads to more incertitude a posteriori). However, without objective criteria (such as Bayes factor or hypothesis testing), I am not able to decide whether the difference between fear faces and anger faces is significant or not. (caption for Figure 2). It is not clearly stated in the paper how the authors were able to do so (even though I could find clues in the simulation section?). I could find clues in the simulation section?).
I am aware of the fact that Bayes factor are quantities difficult to estimate and that frequentist hypothesis testing in mixed effects models can be a tough issue. However, if the authors want to prove that the conclusions of statistical testing are modified when using mixed effects models, then they should perform the adequate and rigorous statistical testings.
No competing interests were disclosed.

Competing Interests:
I have read this submission. I believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above. We'd like to thank Dr. Donnet for taking the time to consider our paper carefully. For ease of reading we've pasted in her comments as boldface, bulleted text.
In this paper, the authors claim that the conclusions of several studies would be modified if a statistical models taking into account the variability of the presented stimuli had been considered. Although the problem is interesting, I am not completely convinced by the conclusions and the statistical tools used to assess the results. First of all, the authors argue that, due to the complexity of the new model, the authors can not use the standard numerical tools to perform the statistical inference (R or SAS) and so will prefer a Bayesian inference, making this choice quite opportunist. We regret that the choice to use a Bayesian model created confusion. We noted in the Methods section that the fMRI random stimulus model "cannot be fit using standard mixed modeling statistical packages, such as lme4 in R or SAS PROC MIXED, because these packages assume that each row of the dataset is associated with one and only one level of each random factor, i.e., a single participant and a single stimulus." Hence, we were forced to develop a custom model, and a Bayesian one developed with PyMC3 was the most convenient approach.
We note that most of the datasets we analyzed could in principle also have been fitted using a modified (and approximate) version of the standard "summary statistics" model; however, this would have required extensive additional analysis, as we would have needed to fit a separate first-level model for each individual subject (adding individual regressors for each individual stimulus and temporarily omitting existing condition regressors, which in the classical formulation are non-identifiable in the presence of the stimulus effects); and then fit a mixed model at the second level that models the individual stimulus effects as random effects. The former steps would have been extremely time-consuming relative to our strategy of fitting a single large model, and the latter step (i.e., modeling stimulus as a random effect) is currently not supported in any existing fMRI package, so we would have had to develop a custom model anyway. Moreover, as we noted in the supplement, "This [summary statistics] model is only an approximate version of the full RSM, as it ignores the varying degrees of uncertainty in the stimulus-level coefficients estimated at the first level (due in particular to factors such as differences in how often the stimuli were presented and how collinear the stimulus-level regressors were for each subject)".
We also note that we did in fact use the summary statistics approximation to obtain estimates for the Chang et al. (2015) dataset (see supplement for details). This was not by choice, and we would the Chang et al. (2015) dataset (see supplement for details). This was not by choice, and we would have preferred to fit the full RSM; however, as we explain in the supplement, this dataset was available in a form amenable only to the summary statistics approach (i.e., because pre-computed estimates for each individual trial were provided).
However, besides the fact that they use a Bayesian inference (including prior distribution), they base their conclusions on frequentist arguments (comparing test statistics). To my point of view, this is quite confusing. We do not believe that our conclusions are based on frequentist arguments. Notably, we have not computed any p-values nor refer to any z-scores. In the section "Does the amygdala preferentially respond to emotional faces?" we write: "we define a test statistic z = µ/σ, where µ and σ are the mean and standard deviation, respectively, of the posterior samples for the associated parameter estimate". The use of "test statistic" was a bit careless, as we are not "testing hypotheses"; we simply needed a convenient summary of the posterior. We admit that this choice of terminology could be confusing, so in our revision we have renamed this statistic the posterior standardized effect.
Another important point is that our argument is fundamentally qualitative and not quantitative; we focus on how uncertainty surrounding the estimate of interest (denominator of our standardized effect) increases dramatically in most cases. We present this increase in terms of standardized effect because that's what most people are familiar with, but nothing would change if we instead talked about the relative increases in the variances of the estimates.
If a Bayesian framework is considered, then the hypothesis testings should be perform using Bayes Factor or any other tools taking into account the prior distribution. Bayes Factors in this setting are not trivial, and any rate would not serve the broad audience, who would like to know: How severe is the random stimulus effect problem? Again, we could have measured this directly in terms of variance inflation, but since users are most familiar with signal-to-noise ratio, we felt that this was the best metric.
Moreover, when proposing a new model, the first thing to do is to compare the old model and the new model for any given dataset (using statistical tools such as hypothesis testing, in a frequentist or Bayesian framework). I could not see such a test in the paper. The mixed effects model involves much more parameters for the same amount of observations. Before deriving conclusions on the new model, it should be interesting to put in competition the fixed effects model and the mixed effects models, to be sure that the use of the more complex model is supported by the data. The reviewer is correct that no formal model assessment was used. This was intentional, as (crucially) we are not testing a hypothesis about the relative fit of different models. Rather, we're pointing out a fundamental mismatch between the conclusions researchers typically draw (which almost invariably involve assuming that it is safe to generalize conclusions over a population of stimuli) and the inferences that are actually licensed by the statistical model (which, in the absence of a variance component reflecting the random sampling of levels from some factor, are only valid for the specific levels used in the experiment). Our position is that it is the design of the experiment and the goals of the analyst--not the observed data--that determine which random effects should be included in the model. This position is consistent with the arguments of Barr, Levy, Scheepers, and Tily (2013).
It may be useful to draw an analogy to a random subject effect in a multi-subject model. No reasonable investigator would test for the significance (or BF-based evidence) for a subject random effect; while such tests may come up negative sometimes, we know our subjects are a and the desired inferences demand such a model.